Srinivasan, Sriram, Lanctot, Marc, Zambaldi, Vinicius, Perolat, Julien, Tuyls, Karl, Munos, Remi and Bowling, Michael (2018) Actor-Critic Policy Optimization in Partially Observable Multiagent Environments. Advances in Neural Information Processing Systems 31 (NIPS 2018), pp. 3422-3435.
Text: 1810.09026v3.pdf (submitted version, 1MB)
Abstract
Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play, with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state-space reductions.
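The score-function update described above, gradient ascent on return under a softmax policy, and the self-play convergence behaviour the abstract reports can be illustrated on the smallest possible case. Below is a minimal, hypothetical sketch (not code from the paper): tabular REINFORCE-style self-play on one-shot matching pennies, whose unique Nash equilibrium is the uniform policy. The payoff matrix, learning rate, and all variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical illustration (not from the paper): softmax policy gradient in
# self-play on one-shot matching pennies. The row player's payoff matrix is
# below; the column player receives its negation (zero-sum).
payoff = np.array([[ 1.0, -1.0],
                   [-1.0,  1.0]])

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

rng = np.random.default_rng(0)
logits = [np.zeros(2), np.zeros(2)]  # tabular policy parameters, one per player
avg_policy = np.zeros((2, 2))        # running sum of each player's behaviour policy
lr = 0.05                            # assumed learning rate
steps = 20_000

for _ in range(steps):
    pi = [softmax(th) for th in logits]
    a = [rng.choice(2, p=p) for p in pi]           # sample one action per player
    r = [payoff[a[0], a[1]], -payoff[a[0], a[1]]]  # zero-sum payoffs
    for i in range(2):
        # Score-function (REINFORCE) update: grad log pi(a) = onehot(a) - pi.
        grad = -pi[i]
        grad[a[i]] += 1.0
        logits[i] += lr * r[i] * grad
        avg_policy[i] += pi[i]

print("time-averaged policies:", avg_policy / steps)
# Both rows should be near the uniform Nash equilibrium (0.5, 0.5).
```

With a constant step size the instantaneous policies tend to cycle around the equilibrium rather than converge, so the sketch reports the time-averaged policies, which should land close to (0.5, 0.5); this mirrors the paper's notion of empirical convergence to approximate Nash equilibria in self-play.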
| Item Type: | Article |
|---|---|
| Additional Information: | NeurIPS 2018 |
| Uncontrolled Keywords: | cs.LG, cs.AI, cs.GT, cs.MA, stat.ML |
| Depositing User: | Symplectic Admin |
| Date Deposited: | 10 Dec 2018 15:23 |
| Last Modified: | 19 Jan 2023 01:09 |
| URI: | https://livrepository.liverpool.ac.uk/id/eprint/3029650 |