
Connect4

For detailed reading, refer here

Introduction

In this study, we design the game of Connect-4, implement two reinforcement learning agents, and observe the results as they play against each other with varying parameters. In reinforcement learning, an agent finds itself in various situations, or states, and performs an action to move to the next state. A policy is the strategy for choosing which action to take in a given state, in expectation of better outcomes. After the transition, the agent receives a reward (positive or negative) in return; this feedback is how the agent learns.

Connect 4 has a special property: the game tree of possible moves grows exponentially. To handle this, we use two reinforcement learning methods, namely Monte Carlo Tree Search (MCTS) and Q-Learning with the concept of afterstates. Monte Carlo Tree Search constructs a 'game tree' and performs a number of simulations to predict the next move the player should take. Q-Learning estimates a value (the Q-value) for each state of the game and learns it by convergence after training over many episodes; the move leading to the state with the best Q-value is then selected.
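As a rough illustration of the Q-Learning-with-afterstates idea described above, the sketch below shows one tabular update step. All names (`q_table`, `alpha`, `gamma`, `update`) are illustrative assumptions and not the project's actual code.

```python
# Minimal sketch of tabular Q-learning over afterstates (illustrative only).
alpha = 0.1   # learning rate (assumed value)
gamma = 0.9   # discount factor (assumed value)
q_table = {}  # maps a hashable board state to its estimated value

def update(afterstate, reward, next_afterstates, terminal):
    """Move the value of the afterstate reached by the agent's last move
    toward the reward plus the discounted value of the best afterstate
    reachable on the agent's next turn."""
    best_next = 0.0
    if not terminal and next_afterstates:
        best_next = max(q_table.get(s, 0.0) for s in next_afterstates)
    current = q_table.get(afterstate, 0.0)
    q_table[afterstate] = current + alpha * (reward + gamma * best_next - current)
```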

Game Implementation

A smaller version (5 columns × 6 rows) of the Connect 4 game has been implemented using a class 'Board'. The class describes the current state of the board as a 2D grid. The 'state' is represented as a Board object in MCTS, and as a 2D integer matrix in Q-learning. On its turn, a 'Player' object (Table 2) evaluates all the empty positions available on the board (the actions) using the getEmptyPositions() method; the MCTS or Q-Learning algorithm decides which move to play, and the play() method is then called to execute the move and return the next state of the board. After every move, the checkWinner() method is called to check whether any player has won.
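The snippet below is a hedged sketch of how one turn could be driven with the interface described above. The method names getEmptyPositions(), play() and checkWinner() come from the text; which object owns each method, the choose_move callback, and the return conventions are assumptions for illustration.

```python
# Illustrative turn loop against the Board/Player interface described above.
def play_one_turn(board, player, choose_move):
    """Evaluate the legal moves, let the agent pick one, and apply it."""
    actions = board.getEmptyPositions()      # empty positions, i.e. legal actions
    move = choose_move(board, actions)       # MCTS or Q-learning picks the move
    next_board = board.play(player, move)    # assumed: play() returns the next state
    winner = next_board.checkWinner()        # assumed: falsy value if no winner yet
    return next_board, winner
```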

Connect4.py asks the user whether to show the output for part (a) or part (c). For part (a), MC200 is the first player and MC40 is the second player; the output is for the unsimplified Connect-4 game (i.e. 5 columns × 6 rows). For part (c), MC25 is the first player and the Q-learning algorithm is the second player; the output is for the simplified Connect-4 game (i.e. 5 columns × 4 rows). In the program, the Q-learning algorithm makes a greedy move (i.e. ε = 0) at each step. The estimates found by the Q-learning algorithm are stored in a separate file, "2019A7PS0063G_SHIKHA.dat", which contains value estimates for 81,371 states of the board. Each state is represented as a tuple of tuples with the player chip number 1, 2 or 0 (if not filled yet) at position (i, j). These values are loaded into the Q-table dictionary of the Q-learning algorithm. For both part (a) and part (c), the program plays one game between Player 1 and Player 2.
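For clarity, here is a rough sketch of loading the stored estimates and making a greedy (ε = 0) move from them. The use of pickle for the .dat file, the `grid` attribute, and the helper names are assumptions, not the project's confirmed format.

```python
# Illustrative sketch: load stored state values and pick the greedy move.
import pickle

# Assumption: the .dat file is a pickled dict mapping
# state (tuple of tuples of 0/1/2) -> estimated value.
with open("2019A7PS0063G_SHIKHA.dat", "rb") as f:
    q_table = pickle.load(f)

def greedy_move(board, actions, player):
    """Choose the action whose resulting state has the highest stored value."""
    def value_of(action):
        next_board = board.play(player, action)               # assumed next-state call
        key = tuple(tuple(row) for row in next_board.grid)     # assumed grid attribute
        return q_table.get(key, 0.0)                           # unseen states default to 0
    return max(actions, key=value_of)
```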