Self-Play Q-Learners Can Provably Collude in the Iterated Prisoner’s Dilemma

Bertrand, Q.; Duque, J. A.; Calvano, Emilio; Gidel, G.

doi:10.48550/arXiv.2312.08484

A growing body of computational studies shows that simple machine learning agents converge to cooperative behaviors in social dilemmas, such as collusive price-setting in oligopoly markets, raising questions about what drives this outcome. In this work, we provide theoretical foundations for this phenomenon in the context of self-play multi-agent Q-learners in the iterated prisoner’s dilemma. We characterize broad conditions under which such agents provably learn the cooperative Pavlov (win-stay, lose-shift) policy rather than the Pareto-dominated “always defect” policy. We validate our theoretical results through additional experiments, demonstrating their robustness across a broader class of deep learning algorithms.
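To make the two policies named in the abstract concrete, here is a minimal sketch of Pavlov and "always defect" in self-play, with the previous joint action as the state. The payoff values, the discount factor, and the helper names `rollout` and `discounted_return` are illustrative assumptions, not taken from the paper, and the sketch simulates fixed policies only; it does not reproduce the Q-learning dynamics the paper analyzes.

```python
# Illustrative prisoner's dilemma payoffs (an assumption): T > R > P > S.
T, R, P, S = 5.0, 3.0, 1.0, 0.0
C, D = "C", "D"

# Payoffs (player 1, player 2) for each joint action.
PAYOFF = {(C, C): (R, R), (C, D): (S, T), (D, C): (T, S), (D, D): (P, P)}


def pavlov(state):
    """Win-stay, lose-shift: cooperate iff both players chose the same action last round."""
    own, other = state
    return C if own == other else D


def always_defect(state):
    """Defect regardless of the previous round."""
    return D


def rollout(policy, start, steps=6):
    """Both players follow `policy` (self-play); return the list of joint actions."""
    state, history = start, []
    for _ in range(steps):
        a1 = policy(state)
        a2 = policy((state[1], state[0]))  # player 2 sees the state from its own side
        state = (a1, a2)
        history.append(state)
    return history


def discounted_return(policy, start, gamma=0.9, steps=200):
    """Player 1's discounted payoff over a finite horizon (hypothetical helper)."""
    joint_actions = rollout(policy, start, steps)
    return sum(gamma**t * PAYOFF[joint][0] for t, joint in enumerate(joint_actions))
```

Under these assumed payoffs, Pavlov in self-play returns to mutual cooperation within two rounds of a one-sided defection, while "always defect" stays at mutual defection and earns each player a lower discounted return, which is the sense in which it is Pareto-dominated.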

Bertrand, Q.; Duque, J. A.; Calvano, Emilio; Gidel, G. (2025). Self-Play Q-Learners Can Provably Collude in the Iterated Prisoner’s Dilemma. In Self-Play Q-Learners Can Provably Collude in the Iterated Prisoner’s Dilemma (pp. 3952-3975). DOI: 10.48550/arXiv.2312.08484. https://arxiv.org/abs/2312.08484.



Conference year: 2025
Appears in types: 04.1 - Contribution in conference proceedings (Paper in Proceedings)

Files in this item:

There are no files associated with this item.


Use this identifier to cite or link to this document: https://hdl.handle.net/11385/261039
