The Hanabi Challenge: A New Frontier for AI Research, Bard et al., 2019

Paper, Repo, Tags: #reinforcement-learning

See references 4 and 5 to play online.

We propose the game of Hanabi as a new challenge domain with novel problems that arise from its combination of purely cooperative gameplay with two to five players and imperfect information.

We argue that Hanabi elevates reasoning about the beliefs and intentions of other agents to the foreground.

We introduce the open-source Hanabi Learning Environment and propose an experimental framework for the research community to evaluate algorithmic advances.

Introduction

Humans routinely work in multi-agent settings by using theory of mind, which can be described as the human ability to imagine the world from another person's point of view.

Hanabi is a cooperative game of imperfect information for 2-5 players, best described as a type of team solitaire.

Players cannot see their own cards, each of which has a color and a rank. They must coordinate to reveal information to their teammates efficiently, but communication can only happen through grounded hint actions that point out all of a player's cards of a chosen rank or color.

Importantly, performing a hint action consumes the limited resource of information tokens, making it impossible to fully resolve each player’s uncertainty about the cards they hold based on this grounded information alone. Successful play involves communicating extra information implicitly through the choice of actions themselves, which are observable by all players.
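
A minimal sketch of what a grounded color hint reveals, using a hypothetical hand representation rather than the paper's or the environment's code:

```python
from typing import List, Tuple

Card = Tuple[str, int]  # (color, rank), e.g. ("R", 3); purely illustrative

def color_hint(hand: List[Card], color: str) -> List[int]:
    """Return the indices of every card in `hand` matching `color`.

    The hinted player learns only this membership information: which of
    their cards are the hinted color and, implicitly, which are not.
    """
    return [i for i, (c, _) in enumerate(hand) if c == color]

# A teammate's five-card hand, as seen by the player giving the hint.
hand = [("R", 1), ("B", 3), ("R", 4), ("W", 2), ("G", 1)]
print(color_hint(hand, "R"))  # -> [0, 2]
```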

Because Hanabi is neither (exclusively) two-player nor zero-sum, the value of an agent's policy depends critically on the policies used by its teammates.

How other players respond to a chosen signal depends on which other situations use the same signal.

Humans appear to approach Hanabi differently from most multi-agent reinforcement learning methods. Even beginners with no experience will start signalling playable cards, reasoning that their teammates’ perspective precludes them from knowing this on their own.

The combination of cooperation, imperfect information, and limited communication makes Hanabi an ideal challenge domain for learning in both the self-play and ad-hoc team settings.

2. Hanabi: the game

(see paper)

3. Hanabi: the challenge

How humans play the game suggests two different challenges.

One is to learn a fixed policy for all players through self-play. The learning process is in control of all players, and the objective is to maximise the expected utility of the resulting joint policy. This represents the case where humans can pre-coordinate their play.

The second challenge is to learn to play with a set of unknown partners, with only a few games of interaction.

We make some recommendations on how research should be carried out under each challenge.

3.1. Open source environment

We have released an open-source Hanabi RL environment, the Hanabi Learning Environment. The code provides an interface similar to OpenAI Gym (see section 3.1 of the paper for details).
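
A minimal interaction sketch is shown below; the module path and dictionary keys follow the released repository as far as we know and may differ between versions.

```python
# Sketch of a random-play loop against the Hanabi Learning Environment's
# Gym-like rl_env wrapper. Key names are assumptions based on the released
# code and are not guaranteed to match every version.
import random
from hanabi_learning_environment import rl_env

env = rl_env.make("Hanabi-Full", num_players=2)
obs = env.reset()
done = False
total_reward = 0
while not done:
    current = obs["current_player"]
    legal_moves = obs["player_observations"][current]["legal_moves"]
    action = random.choice(legal_moves)  # e.g. {'action_type': 'PLAY', 'card_index': 0}
    obs, reward, done, _ = env.step(action)
    total_reward += reward
print("episode score:", total_reward)
```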

3.2. Challenge one: self-play learning

This challenge is focused on finding a joint policy that achieves a high expected score entirely through self-play learning. The Hanabi environment is very lightweight, but developing sample-efficient algorithms is also important. We propose two regimes for the Hanabi benchmark.

  • Sample-limited regime (SL): 100M-step limit
  • Unlimited regime (UL): no restrictions on the amount of time
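
A trivial sketch of enforcing the SL budget; `run_one_episode` below is a hypothetical placeholder, not part of the released framework.

```python
# Sketch of the sample-limited (SL) regime: stop training once the agent has
# consumed 100 million environment steps in total.
import random

MAX_ENV_STEPS = 100_000_000

def run_one_episode() -> int:
    """Placeholder returning how many environment steps one game consumed."""
    return random.randint(40, 80)  # a Hanabi game lasts a few dozen moves

steps_used = 0
while steps_used < MAX_ENV_STEPS:
    steps_used += run_one_episode()
```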

3.3. Challenge two: ad-hoc teams

For this challenge, a policy which achieves a high score in self-play is of little use if it must be followed exactly by teammates.

4. Hanabi: state of the art

4.1. Learning agents

  • Actor-critic-hanabi-agent
  • Rainbow-agent
  • BAD-Agent

4.2. Rule-based approaches

4.3. Experimental results: self-play

There is a large performance gap between what is possible (as demonstrated by the hand-coded agents) and what state-of-the-art learning algorithms achieve.

We believe that BAD's tracking of all agents' beliefs leads to a marked improvement in performance for an agent learning to play.

Both the BAD and ACHA agents require vast amounts of training data to reach good performance.

Conclusion

We evaluate state-of-the-art reinforcement learning algorithms using deep neural networks and demonstrate that they are largely insufficient to even surpass current hand-coded bots when evaluated in a self-play setting.

Furthermore, we show that in the ad-hoc team setting, where agents must play with unknown teammates, such techniques fail to collaborate at all. Theory of mind appears to play an important role in how humans learn and play Hanabi.

To promote effective and consistent comparison between techniques, we provide a new open source code framework for Hanabi and propose evaluation methodology for practitioners.