Similarity between two data points

Started by sia4uin 6 years ago, 8 replies, latest reply 6 years ago, 985 views

Hello All,

What is the best way to find out if two data points are similar?

For example, I have a point X with 3 features, a point Y with the same 3 features, and a point Z. One way is to measure the Euclidean distance: the two points with the smallest distance between them are the most similar.

But can I use correlation between X, Y and Z to find out if they are similar?
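To make the question concrete, here is a minimal sketch that computes both measures for three points. The values of X, Y and Z are hypothetical, chosen only for illustration, and "correlation" is assumed to mean the Pearson correlation taken across the three features of each pair of points.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def pearson(a, b):
    """Pearson correlation across the features of two points.

    Assumes neither point has all features equal (nonzero variance).
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [ai - mean_a for ai in a]
    db = [bi - mean_b for bi in b]
    num = sum(p * q for p, q in zip(da, db))
    den = math.sqrt(sum(p * p for p in da) * sum(q * q for q in db))
    return num / den

# Hypothetical values, for illustration only.
X = [1.0, 2.0, 3.0]
Y = [10.0, 20.0, 30.0]  # same "shape" as X, ten times larger
Z = [1.5, 1.5, 3.0]     # close to X in space, different shape

d_xy, d_xz = euclidean(X, Y), euclidean(X, Z)
r_xy, r_xz = pearson(X, Y), pearson(X, Z)

print("distance:    X-Y", d_xy, " X-Z", d_xz)
print("correlation: X-Y", r_xy, " X-Z", r_xz)
```

With these values the two measures disagree: Euclidean distance picks Z as the point closest to X, while correlation picks Y, because correlation ignores scale and offset.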

Best Regards

Sia

Reply by T T F, June 20, 2016

Hello Sia:

Other posts are correct. There are also other considerations such as whether the vector you are observing has been properly normalized so that you are comparing apples to apples.
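As a rough sketch of why normalization matters, the snippet below compares Euclidean distances before and after rescaling each feature. The data values are hypothetical, and z-scoring each feature across the points is only one possible normalization, assumed here for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def zscore_columns(points):
    """Rescale each feature (column) to zero mean and unit standard deviation.

    Assumes every feature takes at least two distinct values (nonzero std).
    """
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(p, means, stds)] for p in points]

# Hypothetical data: feature 1 spans hundreds, features 2 and 3 span fractions.
X = [1000.0, 0.10, 0.20]
Y = [1010.0, 0.90, 0.80]
Z = [1200.0, 0.12, 0.22]

raw_xy, raw_xz = euclidean(X, Y), euclidean(X, Z)

Xn, Yn, Zn = zscore_columns([X, Y, Z])
norm_xy, norm_xz = euclidean(Xn, Yn), euclidean(Xn, Zn)

print("raw:        X-Y", raw_xy, " X-Z", raw_xz)
print("normalized: X-Y", norm_xy, " X-Z", norm_xz)
```

On the raw values Y is the nearer neighbour of X, because the first feature dominates the distance. After rescaling, Z is nearer, so the answer to "which points are similar" depends on the normalization chosen.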

You didn't define "how similar is similar" ... whether similarity is spherical, rectangular, cubic, etc. A very vague problem statement.

You also didn't state whether "similar" is "in reference to" a "standard" vector, or relative to each other. Similarity at a "distance" of one billion miles is a lot different than similarity at a "distance" of one millimeter ... so you also didn't state the magnitudes of your vectors.

For digital demodulators in radio receiving equipment, similar is defined as a percentage versus a reference vector, the reference vector being the "perfect" vector.
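One way to express "a percentage versus a reference vector" is the length of the error vector divided by the length of the reference. The sketch below assumes that definition; the post does not give a formula, and the function name and sample values are hypothetical.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(vi * vi for vi in v))

def error_percent(measured, reference):
    """Length of (measured - reference) as a percentage of the reference length.

    Assumes the reference vector is not all zeros.
    """
    error = [m - r for m, r in zip(measured, reference)]
    return 100.0 * norm(error) / norm(reference)

# Hypothetical two-component example (e.g. in-phase and quadrature parts).
reference = [3.0, 4.0]   # the "perfect" vector, length 5
measured = [3.3, 4.4]    # what was actually received

pct = error_percent(measured, reference)
print("error relative to reference:", pct, "%")
```

A smaller percentage means the measured vector is more similar to the reference, and dividing by the reference length makes the measure independent of the overall magnitude.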

Good luck ...

Reply by JOS, June 20, 2016

Hi Sia,

Note that for points normalized by their norm, minimizing Euclidean distance is equivalent to maximizing cross-correlation (aka "vector cosine"):

\begin{align}
\|x-y\|^2 &= \langle x-y,\, x-y \rangle \\
          &= \langle x,x \rangle - \langle x,y \rangle - \langle y,x \rangle + \langle y,y \rangle \\
          &= \|x\|^2 + \|y\|^2 - 2\langle x,y \rangle
\end{align}

(image courtesy of https://www.codecogs.com/latex/eqneditor.php)
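For unit-norm real vectors the last line of the expansion reduces to 2 - 2<x,y>, so the smallest distance and the largest inner product pick out the same pair. A small numerical check of that, using hypothetical vectors:

```python
import math

def dot(a, b):
    """Inner product of two real vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def unit(a):
    """Scale a nonzero vector to unit Euclidean norm."""
    n = math.sqrt(dot(a, a))
    return [ai / n for ai in a]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Hypothetical points, normalized by their norm.
x = unit([1.0, 2.0, 3.0])
y = unit([2.0, 1.0, 0.5])
z = unit([1.0, 2.5, 2.5])

# ||a - b||^2 should equal 2 - 2<a,b> for every pair of unit vectors.
for a, b in ((x, y), (x, z), (y, z)):
    assert math.isclose(sq_dist(a, b), 2.0 - 2.0 * dot(a, b), abs_tol=1e-12)

# The nearest neighbour of x by distance is also the one with the largest cosine.
nearest_by_distance = min((y, z), key=lambda p: sq_dist(x, p))
nearest_by_cosine = max((y, z), key=lambda p: dot(x, p))
print(nearest_by_distance is nearest_by_cosine)
```

The equivalence only holds after normalization; for vectors of different lengths the two rankings can differ.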

- Julius

Reply by Tim Wescott, June 20, 2016

Thanks JOS -- I intuitively knew that there was a strong relationship between correlation and Euclidean distance, but I didn't want to say anything at risk of putting my foot in my mouth, or take the time to go look it up.

It's nice when someone else can do my cross-referencing for me!

Reply by MichaelRW, June 20, 2016

Hi, Sia:

What is the data type (i.e. real-value, categorical, etc.) of each of the three features for each point?


Michael.

Reply by sia4uin, June 20, 2016

The data type is real-value.

Reply by Tim Wescott, June 20, 2016

First, I suspect that with just three dimensions, "correlation" between X, Y and Z will boil down to either the Euclidean distance, or to some other simpler measure that depends on how you define "correlation".

Second, what "similar" means for your problem really depends on what your problem is.  A problem in data transmission & reception may say that the minimum of ||X - a*Z|| over all real a is "most similar", where a problem in locating a point in space would say that the Euclidean distance is, indeed, the measure of similarity.
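The "minimum of ||X - a*Z|| over all real a" has a closed form: the least-squares scale is a = <X,Z> / <Z,Z>. The sketch below uses that formula with hypothetical vectors (W is an extra point added for comparison) to show how this measure can rank points differently from plain Euclidean distance.

```python
import math

def dot(a, b):
    """Inner product of two real vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def scaled_residual(x, z):
    """Return (a, r): the real a minimizing ||x - a*z|| and the minimum r.

    Assumes z is not the zero vector.
    """
    a = dot(x, z) / dot(z, z)
    r = math.sqrt(sum((xi - a * zi) ** 2 for xi, zi in zip(x, z)))
    return a, r

# Hypothetical vectors, for illustration only.
X = [2.0, 4.0, 6.0]
Z = [1.0, 2.0, 3.0]   # exactly X / 2: far away, but a perfect scaled match
W = [1.9, 4.1, 6.2]   # near X, but not a scaled copy of it

a_z, res_z = scaled_residual(X, Z)
a_w, res_w = scaled_residual(X, W)
plain_z, plain_w = euclidean(X, Z), euclidean(X, W)

print("scaled residual: Z", res_z, " W", res_w)
print("plain distance:  Z", plain_z, " W", plain_w)
```

Under the scaled measure Z is the better match to X, since it differs only by a gain; under plain Euclidean distance W is closer. Which of the two counts as "similar" depends on the problem.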

Sorry for the vague answer, but I really think that you need to define the term "similar" in the context of your problem before you can get a sensible answer.

Reply by sia4uin, June 20, 2016

I understand your point from a communications perspective. But the term "similar" is not defined in the question, so I have to assume or define it. I can solve it using Euclidean distance.

But I wanted to know if correlation can be applied. For example, cross-correlation is used in communications to estimate the channel impulse response.

Reply by Tim Wescott, June 20, 2016

Well, I'd like to know if correlation can be applied, too.  But you have absolutely positively not given enough information to know the answer.

Can the degree of correlation between two signals be used as a definition of "similar"?  Absolutely.  Does it mean anything at all in the context of your actual problem?  I don't know, and no one else can know without knowing more about your problem.

When you're dealing with technical problems you can't just do a keyword search and then cherry-pick things that have the same keywords associated with them.  One problem's "similar" may be another problem's "completely different" -- in fact, there are almost certainly sets of problems out there where each problem's "similar" is the other one's "completely different".

I'll make a deal with you: I will, absolutely and positively give you a yes or no answer to whether correlation means "similar" in your problem's context, if you first give me an answer, in engineering units and to a precision of at least 1%, and which applies to all contexts in which the question may be asked, to the question "how long is a string?"