Tk-merge: Computationally Efficient Robust Clustering Under General Assumptions

Abstract

We address clustering problems with general-shaped clusters under very weak parametric assumptions, using a two-step hybrid robust clustering algorithm based on trimmed $k$-means and hierarchical agglomeration. The algorithm has low computational complexity and effectively identifies the clusters even in the presence of data contamination. We also present natural generalizations of the approach, as well as an adaptive procedure that estimates the amount of contamination in a data-driven fashion. In our numerical simulations and real-world applications, our proposal outperforms state-of-the-art robust, model-based methods. The applications cover color quantization for image analysis, human mobility patterns based on GPS data, biomedical images of diabetic retinopathy, and functional data across weather stations.
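The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a basic Lloyd-style trimmed $k$-means (at each iteration, the fraction `alpha` of points farthest from their nearest center is excluded from the center updates), followed by single-linkage agglomeration of the resulting centers. The function names, the choice of single linkage on centers, and all parameters (`k`, `n_groups`, `alpha`, `n_iter`, `seed`) are assumptions made for illustration only.

```python
import numpy as np


def trimmed_kmeans(X, k, alpha, n_iter=50, seed=0):
    """Basic trimmed k-means sketch: returns labels (-1 = trimmed) and centers."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_keep = n - int(np.floor(alpha * n))  # number of points that are not trimmed
    centers = X[rng.choice(n, size=k, replace=False)].copy()

    def assign(centers):
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        nearest = d2.min(axis=1)
        # Keep the n_keep points closest to their nearest center.
        keep = np.zeros(n, dtype=bool)
        keep[np.argsort(nearest)[:n_keep]] = True
        return labels, keep

    for _ in range(n_iter):
        labels, keep = assign(centers)
        for j in range(k):
            members = keep & (labels == j)
            if members.any():  # leave empty clusters where they are
                centers[j] = X[members].mean(axis=0)

    labels, keep = assign(centers)
    labels[~keep] = -1
    return labels, centers


def merge_centers(centers, n_groups):
    """Single-linkage agglomeration of centers down to n_groups groups."""
    k = centers.shape[0]
    group = np.arange(k)
    d = np.sqrt(((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    n_current = k
    while n_current > n_groups:
        # Closest pair of centers that belong to different groups.
        different = group[:, None] != group[None, :]
        dm = np.where(different, d, np.inf)
        i, j = np.unravel_index(dm.argmin(), dm.shape)
        group[group == group[j]] = group[i]
        n_current -= 1
    # Relabel groups as consecutive integers starting at 0.
    _, group = np.unique(group, return_inverse=True)
    return group


def tk_merge_sketch(X, k, n_groups, alpha, seed=0):
    """Over-segment with trimmed k-means, then merge the k clusters."""
    labels, centers = trimmed_kmeans(X, k, alpha, seed=seed)
    group = merge_centers(centers, n_groups)
    # Trimmed points keep the label -1; the others inherit their merged group.
    return np.where(labels >= 0, group[labels], -1)
```

In this sketch, `k` would typically be chosen larger than the number of final groups, so that the first step over-segments the data and the agglomeration step can recover non-spherical shapes. Points labeled `-1` are those flagged as contamination.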

Publication
arXiv preprint
Luca Insolia
Postdoctoral Researcher

My primary research interests concern robust statistics and high-dimensional modeling. During my PhD, I developed statistical methodologies for analyzing sparse regression problems affected by different forms of adversarial data contamination. These methodologies encompass continuous optimization methods as well as mixed-integer programming techniques. I applied these tools to analyze biomedical data and to investigate the potential drivers of honey bee colony loss.