In this assignment you will use reinforcement learning to create a
program that learns to play the game tic-tac-toe by playing games
against itself. We will consider X to be the maximizer and
O to be the minimizer. Therefore, a win for X will
result in an external reward of +1, while a win for O will
result in an external reward of -1. Any other state of the game will
result in an external reward of 0 (including tied games). We will
assume that either player may go first.
Specifically, you will be using Q-learning, a temporal difference
algorithm, to try to learn the optimal playing strategy. Q-learning
creates a table that maps states and actions to expected rewards. The
goal of temporal difference learning is to make the learner's current
prediction of the reward for taking an action in the current state
more closely match the prediction made at the next time step. You
will need to modify the Q-learning algorithm to
handle both a maximizing player and a minimizing player.
- Run update63 to get the starting point files for this lab.
- I have provided you with a version of tic-tac-toe. Review the
program in the file tictactoe.py. Then execute it and see
how it works.
- Review the code in the file pickleExample.py. It
demonstrates how to easily store and retrieve the state of objects in
a Python program.
- In the file qlearn.py, implement a class for Q-learning.
Be sure to read through all of the implementation details below
before you begin.
- Here is the pseudocode for the main loop of the game
version of Q-learning. Rather than creating the complete table before
beginning, this version adds rows to the table as new states are
encountered.
state = startState
games = 0
while games < maxGames
    stateKey = makeKey(state, player)
    if stateKey not in table
        add a row for stateKey with small random action values
    action = chooseAction(stateKey, table)
    nextState = execute(action)
    nextKey = makeKey(nextState, opponent(player))
    reward = reinforcement(nextState)
    if nextKey not in table
        add a row for nextKey with small random action values
    updateQvalues(stateKey, action, nextKey, reward)
    if game over
        state = startState
        games += 1
    else
        state = nextState
- Here is the pseudocode for updating the Q-values.
updateQvalues(stateKey, action, nextKey, reward)
    if game over
        expected = reward
    else if player is X
        expected = reward + (discount * lowestQvalue(nextKey))
    else
        expected = reward + (discount * highestQvalue(nextKey))
    change = learningRate * (expected - table[stateKey][action])
    table[stateKey][action] += change
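The update rule above can be sketched directly in Python. The function and parameter names below (and the default learning rate and discount) are assumptions for illustration, not part of the provided starter code:

```python
def update_q_values(table, state_key, action, next_key, reward,
                    game_over, player, learning_rate=0.1, discount=0.9):
    # Terminal states back up the raw reward; otherwise each player
    # assumes the opponent will reply optimally: X backs up the lowest
    # Q-value of the next state, while O backs up the highest.
    if game_over:
        expected = reward
    elif player == 'X':
        expected = reward + discount * min(table[next_key].values())
    else:
        expected = reward + discount * max(table[next_key].values())
    change = learning_rate * (expected - table[state_key][action])
    table[state_key][action] += change
```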
You should represent the Q-values table as a dictionary keyed on
states, where the state is a string representing a combination of the
current player and the current board. The value associated with each
state should be another dictionary keyed on actions. There are nine
possible locations to play on a tic-tac-toe board which will be
numbered 0-8. The value associated with each action should be the
current prediction of the expected value of taking that action in the
current state. When a new state is added to the dictionary, you
should initialize all of the legal action values to small random
values in the range [-0.15, 0.15].
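A minimal sketch of adding a new row to such a table, assuming a state key that concatenates the current player and the board string (the exact key encoding is up to you):

```python
import random

def add_state(table, state_key, legal_actions):
    # Hypothetical helper: create the row for a newly seen state,
    # giving every legal action (a board position 0-8) a small
    # random initial Q-value in [-0.15, 0.15].
    table[state_key] = {a: random.uniform(-0.15, 0.15)
                        for a in legal_actions}

table = {}
add_state(table, 'X| O  X    ', [0, 2, 3, 5, 6, 7, 8])
```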
In class we discussed the issue of exploration vs exploitation with
respect to how best to choose actions. When it is X's turn,
you will prefer to choose actions with higher expected value. When it
is O's turn, you will prefer to choose actions with lower
expected value. As a first step, make the greedy choice half of the
time and choose a random action the other half. This will provide you
with a simple way to choose actions initially. Keep in mind, though,
that this simple exploration strategy will lead to sub-optimal control
policies. You'll need a more sophisticated strategy to explore the
space more thoroughly.
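This simple half-greedy strategy might be sketched as follows; the function name and table layout are assumptions matching the dictionary representation described above:

```python
import random

def choose_action(table, state_key, player):
    # With probability 0.5 make the greedy choice (highest Q-value
    # for X, lowest for O); otherwise pick a legal action uniformly
    # at random to keep exploring.
    action_values = table[state_key]
    if random.random() < 0.5:
        if player == 'X':
            return max(action_values, key=action_values.get)
        return min(action_values, key=action_values.get)
    return random.choice(list(action_values))
```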
Once your Q-learning system is working, you can replace the simple
action selection mechanism described above with something more
sophisticated. Read section 13.3.5 on page 379 of Tom
Mitchell's chapter on reinforcement
learning. The formula given there looks a lot like the roulette
wheel selection of a genetic algorithm. The only difference is the
inclusion of k, which is used, as in simulated annealing, to
control the likelihood of choosing the best action. Lower values
of k allow for more randomness. We can begin with k
set to a very small value such as 0.05. Then every so many games,
say 5000, we can increase k by some fixed amount, say 0.05.
You may want to play around with these parameters to see what gives
you the best results. One issue with implementing this like the
roulette wheel is that the Q-learning table for tic-tac-toe will have
both negative and positive values. Here's some pseudocode for
handling the negative values properly:
actionDict = table[stateKey]
actions = list of keys in actionDict
values = list of values in actionDict
if player is X
    lowest = min(values)
else
    reverse the sign of each value in values list
    lowest = min(values)
if lowest < 0
    constant = abs(lowest) + 0.1
    add this constant to each value in values list to make all values positive
now perform modified roulette wheel selection using k
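One plausible way to fill in the last step of that pseudocode is to weight each (shifted, positive) value by raising it to the power k, so that a small k gives near-uniform selection and a large k becomes nearly greedy. That exponent-of-k weighting is an assumption about the "modified roulette wheel"; other interpretations are possible:

```python
import random

def roulette_select(table, state_key, player, k):
    # Negate the values for O so that the minimizer's preferred
    # (lowest) Q-values get the largest weights.
    action_dict = table[state_key]
    actions = list(action_dict.keys())
    values = list(action_dict.values())
    if player == 'O':
        values = [-v for v in values]
    lowest = min(values)
    if lowest < 0:
        constant = abs(lowest) + 0.1   # shift so all weights are positive
        values = [v + constant for v in values]
    # Sharpen or flatten the distribution with k, then spin the wheel.
    weights = [v ** k for v in values]
    return random.choices(actions, weights=weights, k=1)[0]
```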
When updating the Q-values, if it is X's turn, then the
expected return should be based on the lowest expected value of
the next state. Similarly, when it is O's turn, then the
expected return should be based on the highest expected value
of the next state. This is similar to how minimax works. We assume
that the opponent will always make the best possible move.
The method that executes Q-learning should run for a certain number of
games, rather than steps. Be sure to call the TicTacToe
method resetGame() at the end of every game. This will reset
the board and switch the starting player from the previous game.
Using the exploration strategy discussed above, it should take about
30,000 games to start to learn reasonable play.
Because the learning process takes some time, you should save the
state of the Q-learning object at the end of training. You should use
Python's pickle module to do this. The pickle
module implements an algorithm for serializing and de-serializing a
Python object structure. ``Pickling'' is the process whereby a Python
object hierarchy is converted into a byte stream, and ``unpickling''
is the inverse operation, whereby a byte stream is converted back into
an object hierarchy. In other words, you can save the entire state of
an object to a file, and then later restore that object.
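A minimal sketch of saving and restoring a learner with pickle; the class name QLearner, its `table` attribute, and the filename are assumptions about your code:

```python
import pickle

class QLearner:
    def __init__(self):
        self.table = {}

learner = QLearner()
learner.table['X|         '] = {4: 0.12}

# Pickle the whole object to a file at the end of training...
with open('qlearner.pkl', 'wb') as f:
    pickle.dump(learner, f)

# ...and unpickle it later to restore the trained learner.
with open('qlearner.pkl', 'rb') as f:
    restored = pickle.load(f)
```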
Once Q-learning is complete, you will need a way to test the learned
strategy. Add a method to the Q-learning class called
playOptimally(board, player). It takes a board and a player
and uses the Q-values table to determine the best move. It should
print out all the Q-values for a given state, so that you can verify
that the learned values make sense. Then it should return the best
move. For X, the best move will be the highest valued action.
For O, the best move will be the lowest valued action. Edit
the file useLearnedStrategy.py, so that it unpickles the
saved Q-learning object and then allows a human player to play against
the Q-learning player choosing optimal moves in a game of tic-tac-toe.
In the file analyzeResults.txt discuss the learned strategy.
Give details on the amount of training you did as well as the best
parameter settings for the learning rate and discount. Give specific
examples of how your Q-learner values states. Does it always take an
immediate win, if one exists? Does it always block an opponent's
possible win in the next move? Does it set itself up for guaranteed
wins, when possible? What are its shortcomings?
The following suggestions are NOT required, and should only be
attempted once you have successfully completed tic-tac-toe.
- Reduce the size of the Q-table for tic-tac-toe by using the symmetry
of the board.
- Rather than explicitly representing the Q-values table, use a neural
network to approximate the table.
- Apply Q-learning to a different game.
Once you are satisfied with your programs and analysis, hand them in by typing
handin63 at the unix prompt. Be sure to update all of the
following files:
- qlearn.py: Implement Q-learning to handle a game setting.
- useLearnedStrategy.py: Implement a program to use the
learned Q-values table to play tic-tac-toe versus a human player.
- analyzeResults.txt: Describe the learned strategy.