Many machine learning algorithms rely on distances between data points as their input, sometimes the only input; this is especially true of clustering and ranking algorithms. Increased dimensionality also leads to data sparsity: even the data point that ranks closest to a given point can still be very far away in absolute terms.
In the paper When Is "Nearest Neighbor" Meaningful?, the authors argue that for many data distributions and distance functions, the ratio of the distances to the nearest and farthest neighbors is almost 1, i.e. the two distances are more or less the same.
Example
For any point $q$, let $D_{\min}$ be the distance between $q$ and its nearest neighbor, and $D_{\max}$ the distance between $q$ and its farthest neighbor.

In 1-D, 2-D, or even 3-D,

$$\frac{D_{\max}}{D_{\min}} \gg 1$$

But, as the number of dimensions $d$ increases, i.e. $d \to \infty$,

$$\frac{D_{\max}}{D_{\min}} \to 1$$
That is, in a $d$-dimensional space with random points, any given pair of points becomes almost equidistant as $d \to \infty$. In such cases, machine learning algorithms that rely on a distance measure, including the k-Nearest Neighbors (KNN) algorithm, tend to fail.
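This concentration of distances is easy to observe empirically. The sketch below (an illustrative simulation, not taken from the paper; the function name and parameters are my own) samples random points in a unit hypercube and computes the ratio of the farthest to the nearest distance from a random query point, for increasing dimensionality:

```python
import numpy as np

def distance_ratio(dim, n_points=1000, seed=0):
    """Return D_max / D_min from a random query point to n_points
    uniform random points in the unit hypercube [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    # Euclidean distance from the query to every sampled point
    dists = np.linalg.norm(points - query, axis=1)
    return dists.max() / dists.min()

for dim in (2, 10, 100, 1000):
    print(f"d = {dim:5d}: D_max / D_min = {distance_ratio(dim):.2f}")
```

In low dimensions the ratio is large (the nearest neighbor is genuinely close), but as `dim` grows it shrinks toward 1, illustrating why "nearest" loses its meaning.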