Martino, Alessio; Rizzi, Antonello; Frattale Mascioli, Fabio Massimo (2018). Distance matrix pre-caching and distributed computation of internal validation indices in k-medoids clustering. In 2018 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). Institute of Electrical and Electronics Engineers (IEEE). doi: 10.1109/IJCNN.2018.8489101. https://ieeexplore.ieee.org/document/8489101
Distance matrix pre-caching and distributed computation of internal validation indices in k-medoids clustering
Alessio Martino; Antonello Rizzi; Fabio Massimo Frattale Mascioli
2018
Abstract
In this paper we discuss techniques for speeding up k-medoids clustering. Specifically, we address the advantages of pre-caching the pairwise distance matrix, the heart of the k-medoids clustering algorithm, in order to speed up not only the execution of the algorithm itself but also the evaluation of the well-known Silhouette Index and Davies-Bouldin Index for cluster validation. A major disadvantage of such pre-caching is that it might not be suitable for large datasets, since the size of the matrix grows quadratically with the number of data points. To this end, a further contribution consists in proposing parallel and distributed implementations of both the Simplified Silhouette Index and the Davies-Bouldin Index for distributed k-clustering using the Apache Spark framework. Results on real-world pathway maps datasets show the robustness of such distributed implementations, also underlining their effectiveness for structured data.

| File | Type | License | Size | Format |
|---|---|---|---|---|
| Martino_Distance-Matrix_2018.pdf (archive managers only) | Publisher's version | DRM (digital rights management) not defined | 1.42 MB | Adobe PDF |
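The pre-caching idea summarized in the abstract can be illustrated with a minimal single-machine Python sketch: the pairwise distance matrix is computed once, and both the clustering loop and the validation indices then only perform lookups in it. This is not the authors' implementation. The function names, the Euclidean distance, the alternating medoid update, and the medoid-based form of the Davies-Bouldin Index are assumptions made for illustration, and nothing here uses Apache Spark.

```python
import math
import random


def pairwise_distances(points):
    """Euclidean distance matrix, computed once and reused by every later step."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = math.dist(points[i], points[j])
    return D


def assign(D, medoids):
    """Label each point with the position (0..k-1) of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: D[i][medoids[c]])
            for i in range(len(D))]


def k_medoids(D, k, max_iter=100, seed=0):
    """Alternating k-medoids (an assumed variant) that only reads the cached D."""
    medoids = random.Random(seed).sample(range(len(D)), k)
    for _ in range(max_iter):
        labels = assign(D, medoids)
        new_medoids = list(medoids)
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if members:  # an empty cluster keeps its previous medoid
                new_medoids[c] = min(
                    members, key=lambda m: sum(D[m][j] for j in members))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(D, medoids)


def silhouette(D, labels):
    """Mean Silhouette Index; assumes at least two non-empty clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(D[i][j] for j in own) / (len(own) - 1)
        b = min(sum(D[i][j] for j in other) / len(other)
                for c, other in clusters.items() if c != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def davies_bouldin(D, medoids, labels):
    """Medoid-based Davies-Bouldin Index; assumes non-empty clusters and
    medoids at distinct positions (so no division by zero)."""
    k = len(medoids)
    scatter = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        scatter.append(sum(D[i][medoids[c]] for i in members) / len(members))
    worst = [max((scatter[c] + scatter[o]) / D[medoids[c]][medoids[o]]
                 for o in range(k) if o != c)
             for c in range(k)]
    return sum(worst) / k


# Hypothetical toy data: two small groups of 2-D points.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.3, 5.1), (4.9, 5.2)]
D = pairwise_distances(points)        # distances are computed only here
medoids, labels = k_medoids(D, k=2)   # clustering performs lookups only
print(silhouette(D, labels), davies_bouldin(D, medoids, labels))
```

Both index functions take the matrix as an argument instead of the raw points, so the cost of building it is paid once no matter how many values of k or how many indices are evaluated afterwards.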



