Many machine learning algorithms rely on distances between data points as their input, sometimes the only input; this is especially true of clustering and ranking algorithms. Increased dimensionality also leads to data sparsity: even the data point that ranks closest to a given point can still be very far away in absolute terms.
In the paper When Is "Nearest Neighbor" Meaningful?, the authors argue that for many data distributions and distance functions, the ratio of the distances to the nearest and farthest neighbors is almost 1, i.e. the two distances are more or less the same.
Example
For any point $q$, let $D_{\min}$ be the distance between $q$ and its nearest neighbor, and $D_{\max}$ the distance between $q$ and its farthest neighbor.

In 1-D, 2-D, or even 3-D,

$$\frac{D_{\max}}{D_{\min}} \gg 1$$

But, as the number of dimensions $d$ increases, i.e. $d \to \infty$,

$$\frac{D_{\max}}{D_{\min}} \to 1$$
That is, in a $d$-dimensional space with random points, any given pair of points becomes almost equidistant as $d \to \infty$. In such cases, machine learning algorithms that rely on a distance measure, including the k-Nearest Neighbors (KNN) algorithm, tend to fail.
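This concentration of distances is easy to observe empirically. The sketch below (an illustrative simulation, not taken from the paper; the function name and parameters are my own) samples random points in a unit hypercube and computes the ratio of the farthest to the nearest distance from a random query point, for increasing dimensionality:

```python
import numpy as np

def distance_ratio(dim, n_points=1000, seed=0):
    """Return D_max / D_min from a random query point to n_points
    uniform random points in the unit hypercube [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    # Euclidean distance from the query to every sampled point
    dists = np.linalg.norm(points - query, axis=1)
    return dists.max() / dists.min()

for dim in (2, 10, 100, 1000):
    print(f"d = {dim:5d}: D_max / D_min = {distance_ratio(dim):.2f}")
```

In low dimensions the ratio is large (the nearest neighbor is genuinely close), but as `dim` grows it shrinks toward 1, illustrating why "nearest" loses its meaning.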