A paired index structure for k-Nearest Neighbor Search Algorithms

Existing k-Nearest-Neighbors algorithms become overwhelmed when presented with large data. While space partitioning data structures such as k-d tree and ball-tree improve the performance in a larger data set, they suffer when the data is high dimensional. We observe that space partitioning becomes ineffective in high-dimensional spaces and explores most of the search space to find the nearest neighbors. Our strategy is to partition the data into small clusters of points that are similarly distanced from a reference point in a B+tree data structure that is easy to parallelize and in a way that narrows down the search of k-Nearest-Neighbors for a query to a few clusters. Further, we establish that any indexing applicable to the entire data set can be extended to each cluster, enabling even faster query time at the node level. We then present our algorithm and its theoretical complexities and show that our approach outperforms the naive, k-d tree, ball-tree based k-NN and MRPT index-based approximate k-NN search in high dimensional data. Finally, we conclude with experimental results that illustrate the performance improvements outlined in our approach.

Hasan Kurban
Hasan Kurban
Computer & Data Scientist

I’m a computer scientist & machine learning researcher who loves building intelligent systems to find data-driven solutions to real-world problems.