Distance calculation is a common element in data science, and many machine learning algorithms are built on a distance measure. Nowadays you rarely need to calculate distances manually; there are packages and functions for that. Still, it is important to know how different distance measures work so you can choose the one that suits your project.
This article focuses on giving a solid mathematical understanding of different distance measures, so that when you use one you know exactly what is going on behind the scenes. That makes it much easier to judge which distance measure is useful for which purpose.
This will cover the following distance measures:
- Euclidean Distance
- Manhattan Distance
- Minkowski Distance
- Hamming Distance
- Cosine Similarity
Let’s dive into the details!
Euclidean Distance

Euclidean distance is very commonly used, and many of us remember it from middle school or high school: it is simply the length of the straight line between two points. If the Cartesian coordinates of the two points are known, the distance can be calculated using the Pythagorean theorem.
Here is the formula:

d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)
Take two points x = (5, 1) and y = (9, −2). Using Euclidean distance, the distance between these two points is:

d = √((9 − 5)² + (−2 − 1)²) = √(16 + 9) = √25 = 5
It is important to remember that Euclidean distance has some disadvantages as well. Euclidean distance is not scale invariant, so the distances can be skewed by the units of the features. That is why it is usually necessary to normalize the data before applying Euclidean distance.
Euclidean distance works well for 2- or 3-dimensional data, but it becomes less informative as the dimensionality grows.
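As a quick sketch (not from the original article; the function name `euclidean` is my own), the formula above can be implemented in a few lines of Python:

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean([5, 1], [9, -2]))  # 5.0
```

This matches the worked example above: √(16 + 9) = 5.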
Manhattan Distance

Manhattan distance is especially helpful for vectors that describe objects on a uniform grid, such as a city or a chessboard. A classic example is calculating the shortest route a taxicab would take between two points in a city. It is calculated as the sum of the absolute differences between the two vectors. Here is the formula:

d(x, y) = Σᵢ |xᵢ − yᵢ|
Here is an x and y:
x = [3, 6, 11, 8]
y = [0, 9, 5, 3]
The Manhattan distance between x and y:
d = |3 − 0| + |6 − 9| + |11 − 5| + |8 − 3| = 3 + 3 + 6 + 5 = 17
Manhattan distance often works better than Euclidean distance on high-dimensional data. It is less intuitive, though in practice that is rarely a problem.
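The calculation above can be sketched in Python like this (the function name `manhattan` is my own, not from the article):

```python
def manhattan(x, y):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan([3, 6, 11, 8], [0, 9, 5, 3]))  # 17
```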
Minkowski Distance

Minkowski distance is another popular distance measure. If you look at the formula, you will see that it generalizes both Euclidean and Manhattan distance. Let's discuss the formula first:

d(x, y) = (Σᵢ |xᵢ − yᵢ|ʰ)^(1/h)
Here, x and y are two p-dimensional data objects and ‘h’ is the order. The distance defined this way is also called the L-h norm.
If h = 1, it becomes the Manhattan distance, and if h = 2, it becomes the Euclidean distance.
Here is an example:
x = [9, 2, -1, 12]
y = [-4, 5, 10, 13]
When h = 1, the formula becomes the Manhattan distance formula, which is referred to as the L-1 norm:

d = |9 − (−4)| + |2 − 5| + |−1 − 10| + |12 − 13| = 13 + 3 + 11 + 1 = 28
When h = 2, the formula becomes the Euclidean distance formula, which is also referred to as the L-2 norm:

d = √(13² + 3² + 11² + 1²) = √300 ≈ 17.32
As you can see, Minkowski distance reproduces other distance measures depending on the value of h, so choose h with care. If it reduces to Euclidean distance, the disadvantages discussed earlier need to be taken into account.
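A minimal sketch of the general formula in Python (the function name `minkowski` is my own), showing that h = 1 and h = 2 recover the two cases above:

```python
def minkowski(x, y, h):
    # Generalized distance: h=1 gives Manhattan, h=2 gives Euclidean
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

x = [9, 2, -1, 12]
y = [-4, 5, 10, 13]
print(minkowski(x, y, 1))             # 28.0 (L-1 norm)
print(round(minkowski(x, y, 2), 2))   # 17.32 (L-2 norm)
```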
Hamming Distance

Hamming distance is useful for finding the distance between two binary vectors. In data science and machine learning you will often encounter one-hot encoded data, and Hamming distance is useful in those cases, among many others.
The formula is:

d(x, y) = (1/n) Σᵢ |xᵢ − yᵢ|
Here n is the length of x or y; the two vectors must have the same length. If x and y are:
x = [0, 1, 0, 1]
y = [1, 1, 0, 0]
Then the distance between x and y is:
d = (|0 − 1| + |1 − 1| + |0 − 0| + |1 − 0|) / 4 = (1 + 0 + 0 + 1) / 4 = 2/4 = 0.5
Here 4 is the length of x or y. Both have 4 elements in them.
As you can see from the example above, vectors need to be of the same length.
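The same calculation can be sketched in Python (the function name `hamming` is my own); note this is the normalized form that divides by the vector length, as in the example above:

```python
def hamming(x, y):
    # Fraction of positions where the two binary vectors differ
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

print(hamming([0, 1, 0, 1], [1, 1, 0, 0]))  # 0.5
```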
Similarity Measure for Nominal Attributes
A very similar idea can be used to find the distance between nominal features or vectors:

d(x, y) = (p − m) / p

where m is the number of matches and p is the length of x or y.
For example, x and y are some answers from Polly and Molly:
x = [‘high’, ‘A’, ‘yes’, ‘Asian’]
y = [‘high’, ‘A’, ‘yes’, ‘Latino’]
Here, there are four elements and three of them match. The distance between x and y is:
d = (4–3)/4 = 1/4 = 0.25
Now, if we need similarity instead of dissimilarity or distance, it is simply the complement of the distance. So the similarity between x and y is:

similarity = 1 − 0.25 = 0.75
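This match-counting idea is straightforward in Python (the function name `nominal_distance` is my own, not from the article):

```python
def nominal_distance(x, y):
    # Proportion of attributes that do not match
    matches = sum(a == b for a, b in zip(x, y))
    return (len(x) - matches) / len(x)

x = ['high', 'A', 'yes', 'Asian']
y = ['high', 'A', 'yes', 'Latino']
d = nominal_distance(x, y)
print(d)      # 0.25
print(1 - d)  # 0.75 (the similarity)
```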
Cosine Similarity

Cosine similarity is used especially in natural language processing. It is the cosine of the angle between two vectors and measures whether the two vectors point in roughly the same direction. What is the vector here? A document can contain thousands of words, and its vector holds the frequency of each word in that document. If we have five documents, we get five such word-frequency vectors. Here is an example with two documents whose word-frequency vectors are:

d1 = [5, 2, 1, 0, 1, 3, 0]
d2 = [4, 0, 0, 2, 2, 2, 1]
Let’s find the cosine similarity between d1 and d2
Here is the formula:

cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||)

Here, d1 · d2 means the dot product of the two vectors d1 and d2.
d1 · d2 = 5×4 + 2×0 + 1×0 + 0×2 + 1×2 + 3×2 + 0×1 = 28
||d1|| = (5² + 2² + 1² + 0² + 1² + 3² + 0²)^0.5 ≈ 6.32
||d2|| = (4² + 0² + 0² + 2² + 2² + 2² + 1²)^0.5 ≈ 5.39
cos(d1, d2) = 28 / (6.32 × 5.39) ≈ 0.82
So, this is how we get cosine similarity.
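The whole calculation can be sketched in Python (the function name `cosine_similarity` is my own, not from the article):

```python
import math

def cosine_similarity(d1, d2):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [5, 2, 1, 0, 1, 3, 0]
d2 = [4, 0, 0, 2, 2, 2, 1]
print(round(cosine_similarity(d1, d2), 2))  # 0.82
```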
There are many other distance measures out there. I tried to pick some very common ones for this article. Thank you so much for reading.
Feel free to follow me on Twitter and like my Facebook page.