An ecology-based index for text embedding and classification / Martino, Alessio; De Santis, Enrico; Rizzi, Antonello. - 2020 International Joint Conference on Neural Networks (IJCNN), (2020), pp. 1-8. (IJCNN 2020 - 2020 International Joint Conference on Neural Networks, Online Event due to COVID-19 (formerly Glasgow, UK), 19-24 July 2020). [10.1109/IJCNN48605.2020.9207299].
An ecology-based index for text embedding and classification
Alessio Martino; Enrico De Santis; Antonello Rizzi
2020
Abstract
Natural language processing and text mining applications have gained growing attention and diffusion in the computer science and machine learning communities. In this work, a new embedding scheme is proposed for solving text classification problems. The embedding scheme relies on a statistical assessment of relevant words within a corpus using a compound index originally proposed in ecology: this makes it possible to spot relevant parts of the overall text (e.g., words), on top of which the embedding is performed following a Granular Computing approach. Using statistically meaningful words not only reduces the computational burden and the dimensionality of the embedding space, but also yields a more interpretable model. Our approach is tested on both synthetic and benchmark datasets against well-known embedding techniques, with remarkable results in terms of both performance and computational complexity.
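The abstract does not name the ecology index or give the details of the embedding, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes an indicator-value-style score (specificity times fidelity, a compound index of the kind used in ecology to find species characteristic of a habitat), a plain threshold in place of the statistical assessment, and an embedding that simply counts occurrences of the selected words. The names `indicator_scores`, `embed`, and the threshold value are hypothetical and are not taken from the paper.

```python
from collections import Counter

def indicator_scores(docs, labels):
    """Return {word: (best_class, score)}, with score = specificity * fidelity.

    ASSUMPTION: this indicator-value-style score stands in for the compound
    ecology index mentioned in the abstract, which is not named there.
    """
    classes = sorted(set(labels))
    n_docs = {c: labels.count(c) for c in classes}
    total = {c: Counter() for c in classes}     # summed word counts per class
    presence = {c: Counter() for c in classes}  # documents containing the word
    for tokens, c in zip(docs, labels):
        counts = Counter(tokens)
        total[c].update(counts)
        presence[c].update(counts.keys())
    vocab = set().union(*(total[c].keys() for c in classes))
    scores = {}
    for w in vocab:
        mean = {c: total[c][w] / n_docs[c] for c in classes}
        denom = sum(mean.values())  # > 0 because w occurs in some class
        per_class = {
            # specificity: share of the word's mean abundance found in class c
            # fidelity: fraction of class-c documents that contain the word
            c: (mean[c] / denom) * (presence[c][w] / n_docs[c])
            for c in classes
        }
        best = max(classes, key=per_class.get)
        scores[w] = (best, per_class[best])
    return scores

def embed(tokens, selected):
    """Map a document to occurrence counts of the selected words."""
    counts = Counter(tokens)
    return [counts[w] for w in selected]

# Toy corpus (invented for illustration only).
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "goal"],
    ["vote", "party", "team"],
    ["party", "vote", "election"],
]
labels = ["sport", "sport", "politics", "politics"]

scores = indicator_scores(docs, labels)
threshold = 0.9  # hypothetical cut-off replacing a proper statistical test
selected = sorted(w for w, (_, s) in scores.items() if s >= threshold)
vectors = [embed(d, selected) for d in docs]
```

In this toy setting only words that are both exclusive to one class and present in all of its documents survive the threshold, so the embedding space has one dimension per retained word, and each dimension can be read directly as "how often this class-indicative word occurs".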