A new compressed cover tree for k-nearest neighbour search and the stable-under-noise mergegram of a point cloud



Elkin, Yury
(2022) A new compressed cover tree for k-nearest neighbour search and the stable-under-noise mergegram of a point cloud. PhD thesis, University of Liverpool.

[img] Text
201327089_November2022.pdf - Author Accepted Manuscript

Download (2MB) | Preview

Abstract

The analysis of data sets mathematically representable as finite metric spaces plays a significant role in every scientific study. In this thesis we focus on constructing new effective algorithms in the area of computational geometry that can be effectively deployed for the study and classification of large data sets prevalent in natural science, economic analysis, medicine, environmental protection etc. The first major contribution of this thesis is a new near-linear time algorithm, that resolves the classical problem of finding $k$-nearest neighbors (KNN) to of query set $Q$ in a larger reference set $R$, where $Q$ and $R$ both belong to some metric space $X$. This project was inspired by the work of Beygelzimer, Kakade, and Langford in ICML 2006 that attempted to show that such problem is resolvable for $k=1$ having a near-linear time complexity. However, in 2015 it was pointed out that the proof of their time complexity might contain mistakes, which has been ascertained in this thesis by showing that the proposed proof does not withstand a concrete counterexample. An important application of the KNN algorithm is a KNN graph on a finite metric space $R$ whose edge set is formed by connecting every point $p \in R$ with its $k$-nearest neighbors. The KNN graph finds its application in areas of data-skeletonization, where it can serve as an initial skeleton of the data set, or in cluster analysis, where connected components of the KNN graph can represent the clusters. Another application of the the KNN algorithm is Minimum spanning tree (MST), which is an efficient way to visualize any unstructured data while knowing only distances, for example any metric graph connecting abstract data points. Although many efficient algorithms for the MST in metric spaces have been devised, there existed only one past attempt to justify a near-linear time complexity in the size of a given metric space. In 2010 March, Ram, and Gray claimed that MST of any finite metric space can be built in a parametrized near-linear time. In this work we have demonstrated, with multiple counterexamples, that the attempted proof was incorrect by showing that one of its step fails. Encouraged by the results of the work of 2010 this thesis produces a new algorithm that is based on Boruvka algorithm, which is combined with the KNN method to resolve the metric MST problem in a near-linear time complexity. In the thesis final chapter the MST algorithm is applied in the computation of a new isometry invariant mergegram of Topological Data Analysis (TDA). TDA quantifies topological shapes hidden in unorganized data such as clouds of unordered points. In the $0$-dimensional ($0$D) case, the distance-based persistence is determined by a single-linkage (SL) clustering of a finite set in a metric space. Equivalently, the $0D$ persistence captures only edge lengths of a Minimum Spanning Tree (MST). Both the SL dendrogram and the MST are unstable under perturbations of points. In this thesis, we define the new stable-under-noise mergegram which outperforms previous isometry invariants on a classification of point clouds. In conclusion, the developed fast algorithms of this thesis can cater to a vast varieties of tasks in data science and beyond. The newly proposed corrected time complexity analysis of KNN and MST not only rectifies the past issues in their theoretical justifications but also gives a way to fix analogous issues in other similar methods based on the cover tree data structure.

Item Type: Thesis (PhD)
Divisions: Faculty of Science and Engineering > School of Electrical Engineering, Electronics and Computer Science
Depositing User: Symplectic Admin
Date Deposited: 15 Dec 2022 11:54
Last Modified: 18 Jan 2023 19:41
DOI: 10.17638/03166570
Supervisors:
  • Kurlin, Vitaliy
  • Kankaanrinta, Marja
URI: https://livrepository.liverpool.ac.uk/id/eprint/3166570