An ecology-based index for text embedding and classification

Martino, Alessio; De Santis, Enrico; Rizzi, Antonello

doi:10.1109/IJCNN48605.2020.9207299

Natural language processing and text mining applications have gained growing attention and diffusion in the computer science and machine learning communities. In this work, a new embedding scheme is proposed for solving text classification problems. The embedding scheme relies on a statistical assessment of relevant words within a corpus using a compound index originally proposed in ecology: this makes it possible to spot relevant parts of the overall text (e.g., words), on top of which the embedding is performed following a Granular Computing approach. Using statistically meaningful words not only eases the computational burden and reduces the embedding space dimensionality, but also yields a more interpretable model. Our approach is tested on both synthetic and benchmark datasets against well-known embedding techniques, with remarkable results in terms of both performance and computational complexity.
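The abstract does not name the ecological index or detail the granulation step, so the following is only a minimal sketch of the general idea under stated assumptions. It uses the Dufrêne-Legendre indicator value (IndVal), a compound ecology index that multiplies specificity by fidelity, as an assumed stand-in, treating words as species and classes as site groups. The function names (indval_scores, select_words, embed), the threshold, and the toy corpus are hypothetical and are not taken from the paper; the classifier that would consume the embedding is omitted.

```python
from collections import Counter, defaultdict


def indval_scores(docs, labels):
    """Return {(word, label): score in [0, 1]} for an IndVal-style index.

    specificity = mean count of the word in the class
                  / sum of its mean counts over all classes
    fidelity    = fraction of the class's documents containing the word
    NOTE: IndVal is an assumed stand-in for the paper's unnamed index.
    """
    classes = sorted(set(labels))
    n_docs = Counter(labels)          # number of documents per class
    total = defaultdict(Counter)      # class -> word -> summed count
    present = defaultdict(Counter)    # class -> word -> docs containing it
    for tokens, label in zip(docs, labels):
        counts = Counter(tokens)
        total[label].update(counts)           # add per-word counts
        present[label].update(counts.keys())  # +1 per distinct word
    vocab = {w for c in classes for w in total[c]}
    scores = {}
    for w in vocab:
        means = {c: total[c][w] / n_docs[c] for c in classes}
        denom = sum(means.values())   # > 0: every word occurs somewhere
        for c in classes:
            specificity = means[c] / denom
            fidelity = present[c][w] / n_docs[c]
            scores[(w, c)] = specificity * fidelity
    return scores


def select_words(scores, threshold=0.6):
    """Keep words whose score reaches the (hypothetical) threshold for any class."""
    return sorted({w for (w, _), s in scores.items() if s >= threshold})


def embed(tokens, alphabet):
    """Map a tokenized document to counts of the selected words only."""
    counts = Counter(tokens)
    return [counts[w] for w in alphabet]


# Toy corpus (made up for illustration): two classes, two documents each.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "goal"],
    ["vote", "party", "team"],
    ["party", "vote", "law"],
]
labels = ["sport", "sport", "politics", "politics"]

scores = indval_scores(docs, labels)
alphabet = select_words(scores, threshold=0.6)
vector = embed(["goal", "goal", "vote", "law"], alphabet)
```

In this sketch the embedding dimension equals the number of selected words rather than the full vocabulary size, and each coordinate corresponds to a readable word, which is one way to read the abstract's claims about lower dimensionality and interpretability.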

Martino, Alessio; De Santis, Enrico; Rizzi, Antonello. (2020). An ecology-based index for text embedding and classification. In 2020 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). Institute of Electrical and Electronics Engineers (IEEE). ISBN: 978-1-7281-6926-2. DOI: 10.1109/IJCNN48605.2020.9207299. https://ieeexplore.ieee.org/document/9207299.



Conference year: 2020
ISBN: 978-1-7281-6926-2
Keywords: embedding spaces, explainable artificial intelligence, granular computing, natural language processing, supervised learning, support vector machine, text classification
Appears in types: 04.1 - Contribution in conference proceedings (Paper in Proceedings)

Files in this product:

File	Access	Type	License	Size	Format
Martino_Copertina-indice_Ecology-based_2020.pdf	Archive managers only	Other attached material	DRM (Digital rights management) not defined	607.48 kB	Adobe PDF
Martino_Ecology-based_2020.pdf	Archive managers only	Publisher's version	DRM (Digital rights management) not defined	235.12 kB	Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11385/214505
