Machine learning algorithms have the potential to unlock solutions to ambitious problems in a myriad of scientific and industrial fields. Clustering is an unsupervised machine learning technique in which data objects are grouped, based on similarity, into clusters. Objects within a cluster are alike in some mathematical way, and objects in different clusters are distinctly different. In this way, clustering uncovers the hidden structure of a dataset: which data objects are similar to one another and how many natural groups exist. Unfortunately, clustering techniques vary along two experimental dimensions: clustering execution time and the accuracy of the final clustering results. Furthermore, many clustering algorithms require multiple experimental evaluations that, in turn, generate prohibitive clustering execution time.
Kmeans is the primary clustering algorithm in the data mining and unsupervised machine learning spheres; a single clustering experiment partitions the data by assigning each point to the closest of K cluster centroids. The algorithm finds a solution of centroids in the dataset by iteratively calculating the D-dimensional distances between each of the K experimental cluster centers and the N data points. While the base Kmeans algorithm is parallelizable across these iterations, two conditions of the algorithm's use compound its cost: starting-seed selection and the selection of the number of centroids, K, for a dataset. Each of these conditions requires multiple experiments, E, to evaluate within each dataset. The work outlined in this thesis investigates the optimization and restructuring of the Kmeans algorithm to use multiple sub-samplings of a target dataset to generate accurate K-centroid seeds that eliminate excessive iterations of the base clustering algorithm.
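The iterative distance-then-update loop described above can be sketched as follows. This is a minimal NumPy implementation of the standard (Lloyd's) Kmeans iteration, not the thesis's sub-sampled variant; the random choice of initial seeds in the first step is exactly the stage the thesis proposes to replace with sub-sampling. The function name `kmeans` and its parameters are illustrative.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Base Kmeans (Lloyd's algorithm) on an (N, D) array of points.

    Seeds are drawn as k random data points, the naive strategy whose
    sensitivity motivates the sub-sampled seeding studied in the thesis.
    """
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # (N, k) matrix of D-dimensional distances from every point
        # to every current cluster center.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # assign each point to nearest centroid
        # Update each centroid to the mean of its assigned points;
        # keep the old centroid if a cluster went empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: further iterations cannot move the centers
        centroids = new_centroids
    return centroids, labels
```

Note that every iteration pays the full N x K x D distance cost, which is why poor seeds (and the repeated E experiments needed to tune seeds and K) dominate total runtime.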