Beyond Perplexity: A Multi-Faceted Analysis of a Novel Densely Connected Transformer

Martino, Alessio; 2026

Abstract

Background: Dense cross-layer connectivity can shorten gradient paths and promote feature reuse, potentially improving optimization under fixed training budgets. Objective: We test whether concatenation-based dense historical connectivity improves decoder-only autoregressive language modeling under controlled comparison protocols. Methods: We compare a standard Transformer decoder and a dense decoder on Penn Treebank and WikiText-2 under two fairness regimes: (i) a same-training-recipe setting with a fixed baseline and a bounded dense architectural search, and (ii) a same-parameter-budget setting in which the dense model is resized so that it does not exceed the baseline parameter count. Results: Dense connectivity does not consistently reduce test perplexity; on WikiText-2, the baseline remains better in both regimes, while gains on Penn Treebank are small and regime-dependent. Ablations within the dense family show that depth and feed-forward capacity are the most reliable drivers of perplexity improvements. Conclusions: Probes and attention diagnostics do not reveal a clear advantage for dense connectivity in our limited probe set, while Zipf–RQA analysis of long-form generations reveals systematic structural differences between baseline and dense outputs. Zipf–RQA is used here as a descriptive structural probe rather than a performance metric.
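
The same-parameter-budget regime described in the abstract can be illustrated with a rough weight count. The sketch below is not the authors' procedure: it assumes a hypothetical dense layout in which each block receives the concatenation of the embedding output and all earlier block outputs and reduces it to `d_model` with a linear projection, and it ignores biases, layer norms, and embeddings. The function names and the choice to shrink only the feed-forward width are illustrative assumptions.

```python
from typing import Optional


def block_params(d_model: int, d_ff: int) -> int:
    """Weights in one decoder block: attention (Q, K, V, output) plus a
    two-layer feed-forward network. Biases and layer norms are ignored."""
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * d_ff
    return attention + feed_forward


def baseline_params(n_layers: int, d_model: int, d_ff: int) -> int:
    """Weight count of a standard decoder stack (embeddings excluded)."""
    return n_layers * block_params(d_model, d_ff)


def dense_params(n_layers: int, d_model: int, d_ff: int) -> int:
    """Weight count of the ASSUMED dense variant: block l (0-indexed) sees
    (l + 1) * d_model concatenated features and projects them to d_model."""
    total = 0
    for layer in range(n_layers):
        fused_width = (layer + 1) * d_model
        projection = fused_width * d_model
        total += projection + block_params(d_model, d_ff)
    return total


def largest_dense_ff_within_budget(
    budget: int, n_layers: int, d_model: int
) -> Optional[int]:
    """Largest feed-forward width for which the assumed dense model does not
    exceed the budget, or None if no positive width fits."""
    fixed = dense_params(n_layers, d_model, 0)  # cost that does not depend on d_ff
    per_unit = 2 * d_model * n_layers           # cost of one extra feed-forward unit
    if fixed > budget:
        return None
    d_ff = (budget - fixed) // per_unit
    return d_ff if d_ff > 0 else None


if __name__ == "__main__":
    budget = baseline_params(n_layers=4, d_model=256, d_ff=1024)
    print("baseline weights:", budget)
    print("dense weights at same sizes:", dense_params(4, 256, 1024))
    print("dense d_ff within budget:", largest_dense_ff_within_budget(budget, 4, 256))
```

Under these assumptions the concatenation projections add weights that grow with depth, so a dense model with the same depth and width as the baseline exceeds the budget and some dimension must shrink. Which dimension the authors actually resized is not stated in the abstract.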
Keywords: Transformer; dense connectivity; decoder-only language modeling; perplexity; causal masking; parameter budget; ablation study; probing tasks; Zipf–RQA
De Santis, Enrico; Martino, Alessio; Rizzi, Antonello (2026). Beyond Perplexity: A Multi-Faceted Analysis of a Novel Densely Connected Transformer. Applied Sciences (ISSN 2076-3417), 16(6), 2721. DOI: 10.3390/app16062721.
Files in this item:
applsci-16-02721.pdf (Adobe PDF, 865.42 kB)
Access: Open Access
Type: Publisher's version
License: Creative Commons

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11385/259898