CS63 Lab 6: Learning to play tic-tac-toe

Due by 11:59pm Thursday, Nov. 17


In this assignment you will use reinforcement learning to create a program that learns to play the game tic-tac-toe by playing games against itself. We will consider X to be the maximizer and O to be the minimizer. Therefore, a win for X will result in an external reward of +1, while a win for O will result in an external reward of -1. Any other state of the game will result in an external reward of 0 (including tied games). We will assume that either player may go first.
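The reward scheme above can be sketched as a small function. This is only an illustration, assuming a board represented as a string of nine characters ('X', 'O', or ' '); the winner helper below is hypothetical and is not part of the provided tictactoe.py.

```python
def winner(board):
    """Hypothetical helper: board is a 9-character string of 'X', 'O',
    and ' '.  Returns 'X' or 'O' if that player has three in a row,
    and None otherwise."""
    lines = [(0,1,2), (3,4,5), (6,7,8),   # rows
             (0,3,6), (1,4,7), (2,5,8),   # columns
             (0,4,8), (2,4,6)]            # diagonals
    for a, b, c in lines:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def external_reward(board):
    """+1 if X (the maximizer) has won, -1 if O (the minimizer) has won,
    and 0 for every other state, including ties."""
    w = winner(board)
    if w == 'X':
        return 1
    if w == 'O':
        return -1
    return 0
```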

Specifically, you will be using Q-learning, a temporal difference algorithm, to try to learn the optimal playing strategy. Q-learning creates a table that maps states and actions to expected rewards. The goal of temporal difference learning is to make the learner's prediction of the reward for taking an action in the current state more closely match its prediction at the next time step. You will need to modify the Q-learning algorithm to handle both a maximizing player and a minimizing player.
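One way to picture the modified update is sketched below. This is not the required implementation: the Q-table representation, function names, and the learning-rate and discount values are all illustrative assumptions. The key idea is that when X moves next we back up the maximum successor Q-value, and when O moves next we back up the minimum.

```python
alpha, gamma = 0.1, 0.9   # illustrative learning rate and discount factor
Q = {}                    # maps (state, action) pairs to expected rewards

def q_update(state, action, reward, next_state, next_actions, x_to_move):
    """One temporal-difference backup.  next_actions lists the legal
    actions from next_state (empty if terminal); x_to_move says whether
    X, the maximizer, moves from next_state."""
    old = Q.get((state, action), 0.0)
    if not next_actions:
        # Terminal state: the target is just the external reward.
        target = reward
    else:
        successors = [Q.get((next_state, a), 0.0) for a in next_actions]
        # Maximizer backs up the max; minimizer backs up the min.
        best = max(successors) if x_to_move else min(successors)
        target = reward + gamma * best
    # Move the old estimate a fraction alpha toward the target.
    Q[(state, action)] = old + alpha * (target - old)
```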

Getting Started
  1. Run update63 to get the starting point files for this lab.
  2. I have provided you with a version of tic-tac-toe. Review the program in the file tictactoe.py. Then execute it and see how it works.
  3. Review the code in the file pickleExample.py. It demonstrates how to easily store and retrieve the state of objects in a Python program.
  4. In the file qlearn.py, implement a class for Q-learning. Be sure to read through all of the implementation details below before you begin.
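Because training may take many games, you will probably want to save the learned Q-table between runs. A minimal sketch using the pickle module follows; the filename and the dictionary contents are illustrative assumptions, not taken from pickleExample.py.

```python
import pickle

# Illustrative Q-table: maps (state, action) pairs to expected rewards.
Q = {('X        ', 4): 0.25}

# Save the table to disk.
with open('qtable.pkl', 'wb') as f:
    pickle.dump(Q, f)

# Later (e.g. at the start of another training run), restore it.
with open('qtable.pkl', 'rb') as f:
    restored = pickle.load(f)
```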

Implementation Details

Optional Extensions

The following suggestions are NOT required, and should only be attempted once you have successfully completed the tic-tac-toe learner.


Once you are satisfied with your programs and analysis, hand them in by typing handin63 at the Unix prompt. Be sure to update all of the following files: