De Santis, Enrico; Martino, Alessio; Rizzi, Antonello (2026). Beyond Perplexity: A Multi-Faceted Analysis of a Novel Densely Connected Transformer. Applied Sciences (ISSN: 2076-3417), 16(6), 2721. DOI: 10.3390/app16062721.
Beyond Perplexity: A Multi-Faceted Analysis of a Novel Densely Connected Transformer
Martino, Alessio
2026
Abstract
Background: Dense cross-layer connectivity can shorten gradient paths and promote feature reuse, potentially improving optimization under fixed training budgets. Objective: We test whether concatenation-based dense historical connectivity improves decoder-only autoregressive language modeling under controlled comparison protocols. Methods: We compare a standard Transformer decoder and a dense decoder on Penn Treebank and WikiText-2 under two fairness regimes: (i) a same-training-recipe setting with a fixed baseline and a bounded dense architectural search, and (ii) a same-parameter-budget setting where the dense model is resized so that it does not exceed the baseline parameter count. Results: Dense connectivity does not consistently reduce test perplexity; on WikiText-2, the baseline remains better in both regimes, while gains on Penn Treebank are small and regime-dependent. Ablations within the dense family show that depth and feed-forward capacity are the most reliable drivers of perplexity improvements. Conclusions: Probes and attention diagnostics do not reveal a clear advantage for dense connectivity in our limited probe set, while Zipf–RQA analysis of long-form generations reveals systematic structural differences between baseline and dense outputs. Zipf–RQA is used here as a descriptive structural probe rather than a performance metric.

| File | Size | Format |
|---|---|---|
| applsci-16-02721.pdf (Open Access; Type: Publisher's version; License: Creative Commons) | 865.42 kB | Adobe PDF |



