Distance matrix pre-caching and distributed computation of internal validation indices in k-medoids clustering

Martino, Alessio; Rizzi, Antonello; Frattale Mascioli, Fabio Massimo

doi:10.1109/IJCNN.2018.8489101

In this paper we discuss techniques for potential speedups in k-medoids clustering. Specifically, we address the advantages of pre-caching the pairwise distance matrix, the heart of the k-medoids clustering algorithm, in order to speed up not only the execution of the algorithm itself, but also the evaluation of the well-known Silhouette Index and Davies-Bouldin Index for cluster validation. A major disadvantage of such pre-caching is that it might not be suitable for large datasets. To this end, a further contribution consists in proposing parallel and distributed implementations of both the Simplified Silhouette Index and the Davies-Bouldin Index for distributed k-clustering using the Apache Spark framework. Results on real-world pathway map datasets show the robustness of such distributed implementations, also underlining their effectiveness for structured data.
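The pre-caching idea summarized in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation and does not reproduce their Apache Spark code. It assumes a plain Voronoi-style k-medoids iteration and medoid-based variants of the Simplified Silhouette Index and the Davies-Bouldin Index. All function names and the toy one-dimensional data are hypothetical and chosen only for illustration. The point it shows is that the dissimilarity function is called only while building the matrix; clustering and both indices afterwards read cached entries.

```python
from itertools import combinations

def precompute_distances(points, dist):
    """Evaluate the (possibly expensive) dissimilarity once per pair."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = dist(points[i], points[j])
    return D

def assign(D, medoids):
    """Label each point with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: D[i][medoids[c]])
            for i in range(len(D))]

def k_medoids(D, medoids, max_iter=100):
    """Alternate assignment and medoid update using only cached distances."""
    medoids = list(medoids)
    for _ in range(max_iter):
        labels = assign(D, medoids)
        updated = []
        for c, old in enumerate(medoids):
            members = [i for i, l in enumerate(labels) if l == c]
            # New medoid: the member minimizing the sum of distances
            # to the other members (keep the old one if the cluster is empty).
            updated.append(min(members, key=lambda m: sum(D[m][j] for j in members))
                           if members else old)
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(D, medoids)

def simplified_silhouette(D, medoids, labels):
    """Assumed medoid-based variant: a = distance to own medoid,
    b = distance to the nearest other medoid. Requires at least 2 clusters."""
    total = 0.0
    for i, own in enumerate(labels):
        a = D[i][medoids[own]]
        b = min(D[i][m] for c, m in enumerate(medoids) if c != own)
        total += 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return total / len(labels)

def davies_bouldin(D, medoids, labels):
    """Assumed medoid-based variant: scatter = mean distance to the medoid.
    Assumes non-empty clusters and distinct medoids."""
    k = len(medoids)
    scatter = []
    for c in range(k):
        members = [i for i, l in enumerate(labels) if l == c]
        scatter.append(sum(D[i][medoids[c]] for i in members) / len(members))
    return sum(
        max((scatter[c] + scatter[o]) / D[medoids[c]][medoids[o]]
            for o in range(k) if o != c)
        for c in range(k)
    ) / k

# Toy usage: two well-separated groups on a line (illustrative data only).
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
D = precompute_distances(points, lambda x, y: abs(x - y))
medoids, labels = k_medoids(D, medoids=[0, 1])
sil = simplified_silhouette(D, medoids, labels)
dbi = davies_bouldin(D, medoids, labels)
```

The matrix holds one entry per pair of patterns, so its memory footprint grows quadratically with the dataset size. This is the limitation the abstract cites as motivation for the distributed implementations.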

Distance matrix pre-caching and distributed computation of internal validation indices in k-medoids clustering / Martino, Alessio; Rizzi, Antonello; Frattale Mascioli, Fabio Massimo. - 2018 International Joint Conference on Neural Networks (IJCNN), (2018), pp. 1-8. (IJCNN 2018 - 2018 International Joint Conference on Neural Networks, Rio de Janeiro, Brazil, 8-13 July 2018). [10.1109/IJCNN.2018.8489101].



Conference year: 2018

Keywords: data clustering; unsupervised learning; big data mining; large-scale pattern recognition; distributed computing

Appears in types: 04.1 - Contribution in conference proceedings (Paper in Proceedings)

Files in this record:

File	Size	Format	Access	Type	License
Martino_Distance-Matrix_2018.pdf	1.42 MB	Adobe PDF	Archive managers only	Publisher's version	DRM (Digital rights management) not defined


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11385/214593
