Pan, Ling; Rashid, Tabish; Peng, Bei (ORCID: 0000-0003-0152-3180); Huang, Longbo; and Whiteson, Shimon
(2021)
Regularized Softmax Deep Multi-Agent Q-Learning.
In: Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021), 2021-12-06 to 2021-12-14, Online.
Full text: sr_marl.pdf (Author Accepted Manuscript, 11 MB)
Abstract
Tackling overestimation in Q-learning is an important problem that has been extensively studied in single-agent reinforcement learning, but has received comparatively little attention in the multi-agent setting. In this work, we empirically demonstrate that QMIX, a popular Q-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from more severe overestimation in practice than previously acknowledged, and that this overestimation is not mitigated by existing approaches. We rectify this with a novel regularization-based update scheme that penalizes large joint action-values that deviate from a baseline, and demonstrate its effectiveness in stabilizing learning. Furthermore, we propose to employ a softmax operator, which we efficiently approximate in a novel way in the multi-agent setting, to further reduce the potential overestimation bias. Our approach, Regularized Softmax (RES) Deep Multi-Agent Q-Learning, is general and can be applied to any Q-learning-based MARL algorithm. We demonstrate that, when applied to QMIX, RES avoids severe overestimation and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
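The abstract names two mechanisms, a baseline-regularized update and an approximated softmax operator, without giving their equations. The following is a minimal Python sketch under stated assumptions, not the authors' implementation. It assumes (1) a Boltzmann softmax target, sm_β(Q) = Σ_u e^{βQ(u)} Q(u) / Σ_u e^{βQ(u)}, computed over a restricted candidate set made of the greedy joint action plus every joint action in which exactly one agent deviates (one plausible reading of "efficiently approximate", not necessarily the paper's construction); (2) an additive, VDN-style mixer standing in for the QMIX mixing network; and (3) a loss (Q_tot − y)² + λ(Q_tot − b)², where the baseline b (for example, an observed discounted return) and λ are assumptions. The names `softmax_operator`, `candidate_joint_actions`, `q_tot`, `res_style_loss` and all default hyperparameters are hypothetical.

```python
import math
import random


def softmax_operator(values, beta):
    """Boltzmann softmax operator: sum_u w_u * Q(u), with w_u proportional to exp(beta * Q(u)).

    beta = 0 gives the plain mean; as beta grows the result approaches max(values).
    """
    m = max(values)  # shift by the max so exp() cannot overflow
    weights = [math.exp(beta * (v - m)) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def candidate_joint_actions(utilities):
    """Greedy joint action plus every joint action where exactly one agent deviates.

    Assumption of this sketch: the restricted set stands in for the full joint
    action space, shrinking it from k**n to n*(k-1)+1 entries (n agents, k actions).
    """
    greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in utilities)
    candidates = {greedy}
    for i, q in enumerate(utilities):
        for a in range(len(q)):
            joint = list(greedy)
            joint[i] = a  # only agent i changes its action
            candidates.add(tuple(joint))
    return sorted(candidates)


def q_tot(utilities, joint):
    """Hypothetical additive mixer (VDN-style), standing in for a QMIX mixing network."""
    return sum(q[a] for q, a in zip(utilities, joint))


def res_style_loss(q_taken, reward, next_utilities, baseline,
                   gamma=0.99, beta=5.0, lam=0.1):
    """TD loss with a softmax target plus a penalty pulling Q_tot toward a baseline.

    q_taken:  Q_tot(s, u) for the joint action actually taken
    baseline: e.g. an observed discounted return (an assumption of this sketch)
    """
    next_values = [q_tot(next_utilities, u)
                   for u in candidate_joint_actions(next_utilities)]
    target = reward + gamma * softmax_operator(next_values, beta)
    td_error = q_taken - target
    return td_error ** 2 + lam * (q_taken - baseline) ** 2


if __name__ == "__main__":
    random.seed(0)
    # Every true action-value is 0, so a positive average is pure overestimation bias.
    trials = [[random.gauss(0.0, 1.0) for _ in range(10)] for _ in range(2000)]
    print("mean of max:    ", sum(max(t) for t in trials) / len(trials))
    print("mean of softmax:", sum(softmax_operator(t, 1.0) for t in trials) / len(trials))
```

Because the softmax operator is a weighted average, it never exceeds the max, so the demo's softmax average sits at or below the max-based one. Setting `beta = 0` turns the target into the plain mean over candidates, and a large `beta` recovers the usual max-based target; `lam = 0` drops the baseline penalty entirely.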
| Field | Value |
|---|---|
| Item Type | Conference or Workshop Item (Unspecified) |
| Uncontrolled Keywords | multi-agent reinforcement learning, value factorization, overestimation |
| Divisions | Faculty of Science and Engineering > School of Electrical Engineering, Electronics and Computer Science |
| Depositing User | Symplectic Admin |
| Date Deposited | 27 Oct 2021 08:30 |
| Last Modified | 05 Jun 2024 00:32 |
| URI | https://livrepository.liverpool.ac.uk/id/eprint/3141742 |