You may work with a partner on this assignment. Be sure to add
both names to the top of every file. Only one of the partners needs
to turn in the code. Before you begin, do an update63 to get
the starting point files.
In this assignment you will use reinforcement learning to create a
program that learns to play the game tic-tac-toe by playing games
against itself. We will consider X to be the maximizer and
O to be the minimizer. Therefore, a win for X will
result in an external reward of +1, while a win for O will
result in an external reward of -1. Any other state of the game will
result in an external reward of 0 (including tied games).
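As a concrete illustration, this reward scheme can be written as a
small function; the winner-checking interface below is hypothetical,
so adapt it to whatever the provided TicTacToe class actually offers:

    # Sketch of the external reward scheme; the winner() interface is
    # hypothetical, not part of the provided TicTacToe class.
    def externalReward(game):
        if game.winner() == 'X':      # maximizer has won
            return 1
        elif game.winner() == 'O':    # minimizer has won
            return -1
        else:                         # game still in progress, or a tie
            return 0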
Specifically, you will be using Q-learning, a temporal difference
algorithm, to try to learn the optimal playing strategy. Q-learning
builds a table that maps state-action pairs to expected rewards. The
goal of temporal difference learning is to make the learner's current
prediction of the reward for taking an action in the current state
more closely match the prediction it will make at the next time step.
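For reference, the standard single-agent Q-learning update, with
learning rate alpha and discount gamma, is:

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))

where s' is the state reached after taking action a in state s. The
next paragraph explains how the max must be adapted when one of the
players is a minimizer.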
You will need to modify the Q-learning algorithm to handle both a
maximizing player and a minimizing player. The state of the game will
be represented by two features: the current player and the board.
When it is X's turn, you will want to choose the action with
the highest expected value. When it is O's turn, you will
want to choose the action with the lowest expected value. When
updating the Q-values, if it is X's turn, then the expected
return should be based on the lowest expected value of the next state.
Similarly, when it is O's turn, then the expected return
should be based on the highest expected value of the next state. This
is similar to how minimax works. We assume that the opponent will
always make the best possible move.
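As a sketch of this modified update (all of the names here, including
the nested q dictionary and the alpha and gamma parameters, are
illustrative rather than taken from the starting-point files):

    # Sketch of the minimax-style Q-update. A state is a (player, board)
    # pair, and q is the nested dictionary described below; alpha is the
    # learning rate and gamma is the discount.
    def qUpdate(q, state, action, reward, nextState, alpha, gamma):
        player = state[0]
        if nextState not in q or not q[nextState]:
            future = 0                          # terminal: no future value
        elif player == 'X':
            # after X moves it is O's turn, so O will minimize
            future = min(q[nextState].values())
        else:
            # after O moves it is X's turn, so X will maximize
            future = max(q[nextState].values())
        target = reward + gamma * future
        q[state][action] += alpha * (target - q[state][action])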
You should represent the Q-values table as a dictionary keyed on
states, where the state is a combination of the current player and the
current board. The value associated with each state should be another
dictionary keyed on actions. The value associated with each action
should be the current prediction of the expected value of taking that
action in the current state. When a new state is added to the
dictionary, you should initialize all of the action values to zero.
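For example, adding a new state might look like the following sketch;
the 9-character board string and the index-based move encoding are
illustrative, so use whatever representation your TicTacToe class
provides:

    # Sketch of adding a new state to the nested-dictionary Q-table.
    q = {}
    state = ('X', 'X O      ')        # (current player, current board)
    if state not in q:
        legalMoves = [i for i, c in enumerate(state[1]) if c == ' ']
        q[state] = {move: 0.0 for move in legalMoves}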
The method that executes Q-learning should run for a certain number of
games, rather than steps. Be sure to call the TicTacToe
method resetGame() at the end of every game. This will reset
the board and switch the starting player from the previous game. With
a simplistic exploration strategy, it generally takes at least 200,000
games to start to learn reasonable play. Even then, you may find that
the opening moves still have Q-values that are zero.
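The overall training loop might be organized as follows; only
resetGame() comes from the provided TicTacToe class, and everything
else here is a placeholder for code you will write:

    # Sketch of the outer loop that runs Q-learning for a fixed number
    # of games rather than a fixed number of steps.
    def train(self, numGames):
        for g in range(numGames):
            while not self.game.isOver():           # placeholder test
                state = (self.game.currentPlayer, self.game.board)
                action = self.chooseAction(state)    # explore/exploit
                reward, nextState = self.takeAction(action)
                self.qUpdate(state, action, reward, nextState)
            self.game.resetGame()  # clears board, swaps starting player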
Because the learning process takes some time, you should save the
state of the Q-learning object at the end of training. You should use
Python's pickle module to do this. The pickle
module implements an algorithm for serializing and de-serializing a
Python object structure. "Pickling" is the process whereby a Python
object hierarchy is converted into a byte stream, and "unpickling"
is the inverse operation, whereby a byte stream is converted back into
an object hierarchy. In other words, you can save the entire state of
an object to a file, and then later restore that object.
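For example (the filename is arbitrary, and learner stands for your
trained Q-learning object):

    import pickle

    # At the end of training, save the learner to a file
    with open('qtable.pickle', 'wb') as f:
        pickle.dump(learner, f)

    # Later, in useLearnedStrategy.py, restore it
    with open('qtable.pickle', 'rb') as f:
        learner = pickle.load(f)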
Once Q-learning is complete, you will need a way to test the learned
strategy. Add a method to the Q-learning class called
playOptimally(board, player). It takes a board and a player
and uses the Q-values table to determine the best move. It should
print out all the Q-values for a given state, so that you can verify
that the learned values make sense. Then it should return the best
move. For X, the best move will be the highest valued action.
For O, the best move will be the lowest valued action. Edit
the file useLearnedStrategy.py so that it unpickles the
saved Q-learning object and then allows a human player to play
tic-tac-toe against the Q-learning player, which chooses its moves
optimally.
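A sketch of playOptimally, assuming the nested-dictionary layout
described earlier (handling of states never visited during training
is omitted):

    # Sketch of playOptimally; assumes the (player, board) state key
    # and the nested dictionary of action values described above.
    def playOptimally(self, board, player):
        state = (player, board)
        actions = self.q[state]
        for move in sorted(actions):
            print(move, actions[move])      # Q-values for inspection
        if player == 'X':
            return max(actions, key=actions.get)  # highest-valued action
        else:
            return min(actions, key=actions.get)  # lowest-valued action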
In the file analyzeResults.txt, discuss the learned strategy.
Give details on the amount of training you did as well as the best
parameter settings for the learning rate and discount. Give specific
examples of how your Q-learner values states. Does it always take an
immediate win, if one exists? Does it always block an opponent's
possible win in the next move? Does it set itself up for guaranteed
wins, when possible? What are its shortcomings?
The following suggestions are NOT required, and should only be
attempted once you have successfully completed tic-tac-toe.
- Implement a more sophisticated exploration strategy. One possibility
  is to monitor how many times each state-action pair has been
  visited, and ensure that they are visited about equally early on in
  the learning process (see the sketch after this list).
- Apply Q-learning to a different game, such as 4 by 4 Konane.
- Rather than explicitly representing the Q-values table, use a neural
  network to learn the expected value for each state-action pair.
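One minimal version of the count-based exploration idea above; the
counts mapping from (state, action) pairs to visit counts is something
your code would maintain, and all names here are illustrative:

    # Sketch of count-based exploration: among the legal actions in a
    # state, try the one that has been chosen least often so far.
    def chooseExploratoryAction(q, counts, state):
        return min(q[state], key=lambda a: counts.get((state, a), 0))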
Once you are satisfied with your programs and analysis, hand them in by typing
handin63 at the unix prompt. Be sure to update all of the
following files:
- qlearn.py: Implement Q-learning, modified to handle a two-player
  game setting
- useLearnedStrategy.py: Implement a program to use the
learned Q-values table to play tic-tac-toe versus a human player
- analyzeResults.txt: Describe the learned strategy