KD Trees

As all KNN queries can be considered search problems, one of the most efficient ways to reduce the overall complexity is to reorganize the dataset into a tree structure. In a binary search tree (one-dimensional data), the average computational complexity of a query is O(log M), because we can assume an almost equal number of elements in each branch (if the tree is completely unbalanced, all the elements are inserted sequentially and the resulting structure has a single branch, so the complexity becomes O(M)). In general, the real complexity is slightly higher than O(log M), but the operation is always much more efficient than a vanilla search, which requires O(M) comparisons per query (and hence O(M²) to answer a query for every sample).
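
As a minimal sketch (with hypothetical uniform random data), the following Python snippet contrasts a vanilla linear scan with a logarithmic search on sorted one-dimensional data, using the standard bisect module as a stand-in for a balanced binary tree:

import bisect
import random

M = 10000
data = sorted(random.uniform(0.0, 100.0) for _ in range(M))
query = 42.0

# Vanilla search: O(M) comparisons per query
nearest_linear = min(data, key=lambda x: abs(x - query))

# Binary search on the sorted array: O(log M) steps to locate the
# insertion point, then a constant-time check of the two neighbors
i = bisect.bisect_left(data, query)
candidates = data[max(i - 1, 0):i + 1]
nearest_log = min(candidates, key=lambda x: abs(x - query))

assert nearest_linear == nearest_log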

However, we normally work with N-dimensional data and the previous structure cannot be immediately employed. KD Trees extend the concept of a binary tree to N > 1. In this case, a split cannot be performed on a single value and a different strategy must be chosen. The easiest way to solve this problem is to cycle through the features at each level (1, 2, ..., N, then back to 1) and to repeat the process until the desired depth is reached. In the following diagram, there's an example of a KD Tree with three-dimensional points:

Example of three-dimensional KD Tree

The root is the point (5, 3, 7). The first split is performed considering the first feature, so the two children are (2, 1, 1) and (8, 4, 3). The second split operates on the second feature, and so on. The average computational complexity of a query is O(N log M), but if the distribution is very asymmetric, the probability that the tree becomes unbalanced is very high. To mitigate this issue, it's possible to split each (sub-)dataset at the median of the selected feature and to continue splitting with this criterion. In this way, the tree is guaranteed to be balanced. However, the average complexity is always proportional to the dimensionality, and this can dramatically affect the performance.
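
As a minimal sketch, the following Python code builds a KD Tree from the three points in the diagram (reproducing its structure), cycling through the features at each level and splitting at the median, and then answers a nearest-neighbor query for an arbitrary query point with the standard pruning step:

from collections import namedtuple

Node = namedtuple("Node", ["point", "axis", "left", "right"])

def build(points, depth=0):
    # Select the splitting feature by cycling through the N dimensions
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2
    return Node(points[median], axis,
                build(points[:median], depth + 1),
                build(points[median + 1:], depth + 1))

def nearest(node, query, best=None):
    # Recursive nearest-neighbor search (squared Euclidean distance)
    if node is None:
        return best
    d = sum((a - b) ** 2 for a, b in zip(node.point, query))
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    if diff ** 2 < best[0]:  # visit the far branch only when it can contain a closer point
        best = nearest(far, query, best)
    return best

tree = build([(5, 3, 7), (2, 1, 1), (8, 4, 3)])
print(nearest(tree, (7, 4, 4)))  # -> (squared distance, nearest point)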

For example, if M = 10,000 and N = 10, using the base-10 logarithm, O(N log M) = O(40), while, with N = 1,000, the complexity becomes O(4,000). Generally, KD Trees suffer from the curse of dimensionality: when N becomes large, the average complexity is about O(MN), which is still better than the vanilla algorithm, but often too expensive for real-life applications. Therefore, KD Trees are really effective only when the dimensionality is not too high. In all other cases, the probability of having an unbalanced tree and the resulting computational complexity suggest employing a different method.
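
Under the assumptions of the example above (hypothetical Gaussian data with M = 10,000 and N = 10), a short usage sketch with scikit-learn's KDTree class is:

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(1000)
X = rng.normal(size=(10000, 10))  # M = 10,000 samples, N = 10 features

tree = KDTree(X, leaf_size=40)  # build the tree once
dist, ind = tree.query(X[:1], k=5)  # 5 nearest neighbors of the first sample
print(ind)  # indices of the neighbors (the sample itself is included)

When N grows, the same query tends to visit a large fraction of the nodes, which is the degradation toward O(MN) discussed above.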
