CS63 HW6: Learning to play tic-tac-toe

Due by noon Friday, Nov. 16

You may work with a partner on this assignment. Be sure to add both names to the top of every file. Only one of the partners needs to turn in the code. Before you begin, do an update63 to get the starting point files.


Introduction

In this assignment you will use reinforcement learning to create a program that learns to play the game tic-tac-toe by playing games against itself. We will consider X to be the maximizer and O to be the minimizer. Therefore, a win for X will result in an external reward of +1, while a win for O will result in an external reward of -1. Any other state of the game will result in an external reward of 0 (including tied games).

Specifically, you will be using Q-learning, a temporal difference algorithm, to try to learn the optimal playing strategy. Q-learning creates a table that maps states and actions to expected rewards. The goal of temporal difference learning is to make the learner's current prediction for the reward that will be received by taking the action in the current state more closely match the next prediction at the next time step.

You will need to modify the Q-learning algorithm to handle both a maximizing player and a minimizing player. The state of the game will be represented by two features: the current player and the board. When it is X's turn, you will want to choose the action with the highest expected value. When it is O's turn, you will want to choose the action with the lowest expected value. When updating the Q-values, if it is X's turn, then the expected return should be based on the lowest expected value of the next state. Similarly, when it is O's turn, then the expected return should be based on the highest expected value of the next state. This is similar to how minimax works. We assume that the opponent will always make the best possible move.


Implementation details

Optional additions
The following suggestions are NOT required, and should only be attempted once you have successfully completed tic-tac-toe.
Submit

Once you are satisfied with your programs and analysis, hand them in by typing handin63 at the unix prompt. Be sure to update all of the following files: