
From Bag-of-Words to Transformers: A Comparative Study for Text Classification in Healthcare Discussions in Social Media / De Santis, Enrico; Martino, Alessio; Ronci, Francesca; Rizzi, Antonello. - In: IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE. - ISSN 2471-285X. - (In press), pp. 1-15. [10.1109/tetci.2024.3423444]

From Bag-of-Words to Transformers: A Comparative Study for Text Classification in Healthcare Discussions in Social Media

Martino, Alessio;
In press

Abstract

A notable paradigm shift in Natural Language Processing has been the introduction of Transformers, which revolutionized language modeling much as Convolutional Neural Networks did for Computer Vision. Part of the power of Transformers, alongside their other architectural innovations, lies in their integration of word embedding techniques, traditionally used to represent words in a text and to build classification systems directly. This study compares text representation techniques for classifying users who write medical-topic posts in Facebook discussion groups. Short and noisy Italian social media texts pose particular challenges for user categorization. The study employs two datasets: one for estimating the word embedding models and another comprising user discussions. The main objective is optimal user categorization through different pre-processing and embedding techniques, aiming at high generalization performance despite class imbalance. The paper has a dual purpose: to build an effective classifier, ensuring accurate information dissemination in medical discussions and combating fake news, and to explore the representational capabilities of various Large Language Models (LLMs), notably BERT, Mistral, and GPT-4; the latter is investigated through in-context learning. Finally, data visualization tools are used to evaluate the semantic embeddings with respect to the achieved performance. Focusing on classification performance, the investigation compares classic BERT and several hybrid variants (i.e., employing different training strategies and approximate Support Vector Machines in the classification layer) against LLMs and several Bag-of-Words-based embeddings (notably, one of the earliest approaches to text classification). This research offers insights into the latest developments in language modeling, advancing the field of text representation and its practical application to user classification within medical discussions.
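To make the Bag-of-Words baseline mentioned in the abstract concrete, the following is a minimal, stdlib-only sketch of classification in Bag-of-Words space: term-frequency count vectors compared with cosine similarity and a nearest-neighbour decision. It is purely illustrative — the training posts, labels, and tokenization below are hypothetical and do not reflect the paper's actual datasets, pre-processing, or classifiers.

```python
from collections import Counter
import math

def bow_vector(text):
    """Tokenize on whitespace and count term frequencies (Bag-of-Words)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Toy labelled posts (hypothetical examples, not from the paper's datasets).
train = [
    ("the vaccine dose schedule was explained by my doctor", "medical"),
    ("side effects of the therapy were mild says the physician", "medical"),
    ("great football match last night what a goal", "other"),
]

def classify(text):
    """Label a post with the class of its most similar training example."""
    v = bow_vector(text)
    return max(train, key=lambda ex: cosine(v, bow_vector(ex[0])))[1]

print(classify("my doctor discussed the vaccine therapy"))  # → medical
```

In the paper's actual pipeline, such sparse representations are contrasted with dense Transformer embeddings; this sketch only shows why Bag-of-Words counts as one of the earliest workable text representations for classification.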
In press
Embedding techniques, healthcare, natural language processing, social network analysis, text categorization, text mining, Transformers, Mistral, GPT-4, large language models
Files in this record:
From_Bag-of-Words_to_Transformers_A_Comparative_Study_for_Text_Classification_in_Healthcare_Discussions_in_Social_Media.pdf

Open Access

Type: Post-print document
License: Creative Commons
Size: 9.09 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11385/239878
Citations
  • Scopus ND
  • Web of Science ND
  • OpenAlex ND