In this assignment you will use reinforcement learning to create a
program that learns to play the game tic-tac-toe by playing games
against itself. We will consider X to be the maximizer and
O to be the minimizer. Therefore, a win for X will
result in an external reward of +1, while a win for O will
result in an external reward of -1. Any other state of the game will
result in an external reward of 0 (including tied games). We will
assume that either player may go first.
Specifically, you will be using Q-learning, a temporal difference
algorithm, to try to learn the optimal playing strategy. Q-learning
creates a table that maps states and actions to expected rewards. The
goal of temporal difference learning is to make the learner's current
prediction of the reward for taking an action in the current state
more closely match the prediction made at the next time step. You
will need to modify the Q-learning algorithm to
handle both a maximizing player and a minimizing player.
- Run update63 to get the starting point files for this lab.
- I have provided you with a version of tic-tac-toe. Review the
program in the file tictactoe.py. Then execute it and see
how it works.
- Review the code in the file pickleExample.py. It
demonstrates how to easily store and retrieve the state of objects in
a Python program.
- In the file qlearn.py, implement a class for Q-learning.
Be sure to read through all of the implementation details below
before you begin.
- Here is the pseudocode for the main loop of the game
version of Q-learning. Rather than creating the complete table before
beginning, this version adds rows to the table as new states are
encountered.
state = startState
games = 0
while games < maxGames
    stateKey = makeKey(state, player)
    if stateKey not in table
        add a row for stateKey with small random action values
    action = chooseAction(stateKey, table)
    nextState = execute(action)
    nextKey = makeKey(nextState, opponent(player))
    reward = reinforcement(nextState)
    if nextKey not in table
        add a row for nextKey with small random action values
    updateQvalues(stateKey, action, nextKey, reward)
    if game over
        state = startState
        games += 1
    else
        state = nextState
- Here is the pseudocode for updating the Q-values.
updateQvalues(stateKey, action, nextKey, reward)
    if game over
        expected = reward
    else if player is X
        expected = reward + (discount * lowestQvalue(nextKey))
    else
        expected = reward + (discount * highestQvalue(nextKey))
    change = learningRate * (expected - table[stateKey][action])
    table[stateKey][action] += change
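The update rule above can be sketched directly in Python. The function and parameter names below (and the default learning rate and discount) are assumptions for illustration, not part of the provided starter code:

```python
def update_q_values(table, state_key, action, next_key, reward,
                    game_over, player, learning_rate=0.1, discount=0.9):
    # Terminal states back up the raw reward; otherwise each player
    # assumes the opponent will reply optimally: X backs up the lowest
    # Q-value of the next state, while O backs up the highest.
    if game_over:
        expected = reward
    elif player == 'X':
        expected = reward + discount * min(table[next_key].values())
    else:
        expected = reward + discount * max(table[next_key].values())
    change = learning_rate * (expected - table[state_key][action])
    table[state_key][action] += change
```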
You should represent the Q-values table as a dictionary keyed on
states, where the state is a string representing a combination of the
current player and the current board. The value associated with each
state should be another dictionary keyed on actions. There are nine
possible locations to play on a tic-tac-toe board which will be
numbered 0-8. The value associated with each action should be the
current prediction of the expected value of taking that action in the
current state. When a new state is added to the dictionary, you
should initialize all of the legal action values to small random
values in the range [-0.15, 0.15].
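A minimal sketch of adding a new row to such a table, assuming a state key that concatenates the current player and the board string (the exact key encoding is up to you):

```python
import random

def add_state(table, state_key, legal_actions):
    # Hypothetical helper: create the row for a newly seen state,
    # giving every legal action (a board position 0-8) a small
    # random initial Q-value in [-0.15, 0.15].
    table[state_key] = {a: random.uniform(-0.15, 0.15)
                        for a in legal_actions}

table = {}
add_state(table, 'X| O  X    ', [0, 2, 3, 5, 6, 7, 8])
```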
In class we discussed the issue of exploration vs exploitation with
respect to how best to choose actions. When it is X's turn,
you will prefer to choose actions with higher expected value. When it
is O's turn, you will prefer to choose actions with lower
expected value. As a first step, make the greedy choice half of the
time and choose a random action the other half. This will provide you
with a simple way to choose actions initially. Keep in mind, though,
that this simple exploration strategy will lead to sub-optimal control
policies. You'll need a more sophisticated strategy to explore the
space more thoroughly.
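This simple half-greedy strategy might be sketched as follows; the function name and table layout are assumptions matching the dictionary representation described above:

```python
import random

def choose_action(table, state_key, player):
    # With probability 0.5 make the greedy choice (highest Q-value
    # for X, lowest for O); otherwise pick a legal action uniformly
    # at random to keep exploring.
    action_values = table[state_key]
    if random.random() < 0.5:
        if player == 'X':
            return max(action_values, key=action_values.get)
        return min(action_values, key=action_values.get)
    return random.choice(list(action_values))
```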
Once your Q-learning system is working, you can replace the simple
action selection mechanism described above with something more
sophisticated. Read section 13.3.5 on page 379 of Tom
Mitchell's chapter on reinforcement
learning. The formula given there looks a lot like the roulette
wheel selection of a genetic algorithm. The only difference is the
inclusion of k, which is used, as in simulated annealing, to
control the likelihood of choosing the best action. Lower values
of k allow for more randomness. We can begin with k
set to a very small value such as 0.05. Then every so many games,
say 5000, we can increase k by some fixed amount, say 0.05.
You may want to play around with these parameters to see what gives
you the best results. One issue with implementing this like the
roulette wheel is that the Q-learning table for tic-tac-toe will have
both negative and positive values. Here's some pseudocode for
handling the negative values properly:
actionDict = table[stateKey]
actions = list of keys in actionDict
values = list of values in actionDict
if player is X
    lowest = min(values)
else
    reverse the sign of each value in values list
    lowest = min(values)
if lowest < 0
    constant = abs(lowest) + 0.1
    add this constant to each value in values list to make all values positive
now perform modified roulette wheel selection using k
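One plausible way to fill in the last step of that pseudocode is to weight each (shifted, positive) value by raising it to the power k, so that a small k gives near-uniform selection and a large k becomes nearly greedy. That exponent-of-k weighting is an assumption about the "modified roulette wheel"; other interpretations are possible:

```python
import random

def roulette_select(table, state_key, player, k):
    # Negate the values for O so that the minimizer's preferred
    # (lowest) Q-values get the largest weights.
    action_dict = table[state_key]
    actions = list(action_dict.keys())
    values = list(action_dict.values())
    if player == 'O':
        values = [-v for v in values]
    lowest = min(values)
    if lowest < 0:
        constant = abs(lowest) + 0.1   # shift so all weights are positive
        values = [v + constant for v in values]
    # Sharpen or flatten the distribution with k, then spin the wheel.
    weights = [v ** k for v in values]
    return random.choices(actions, weights=weights, k=1)[0]
```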
When updating the Q-values, if it is X's turn, then the
expected return should be based on the lowest expected value of
the next state. Similarly, when it is O's turn, then the
expected return should be based on the highest expected value
of the next state. This is similar to how minimax works. We assume
that the opponent will always make the best possible move.
The method that executes Q-learning should run for a certain number of
games, rather than steps. Be sure to call the TicTacToe
method resetGame() at the end of every game. This will reset
the board and switch the starting player from the previous game.
Using the exploration strategy discussed above, it should take about
30,000 games to start to learn reasonable play.
Because the learning process takes some time, you should save the
state of the Q-learning object at the end of training. You should use
Python's pickle module to do this. The pickle
module implements an algorithm for serializing and de-serializing a
Python object structure. ``Pickling'' is the process whereby a Python
object hierarchy is converted into a byte stream, and ``unpickling''
is the inverse operation, whereby a byte stream is converted back into
an object hierarchy. In other words, you can save the entire state of
an object to a file, and then later restore that object.
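A minimal sketch of saving and restoring a learner with pickle; the class name QLearner, its `table` attribute, and the filename are assumptions about your code:

```python
import pickle

class QLearner:
    def __init__(self):
        self.table = {}

learner = QLearner()
learner.table['X|         '] = {4: 0.12}

# Pickle the whole object to a file at the end of training...
with open('qlearner.pkl', 'wb') as f:
    pickle.dump(learner, f)

# ...and unpickle it later to restore the trained learner.
with open('qlearner.pkl', 'rb') as f:
    restored = pickle.load(f)
```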
Once Q-learning is complete, you will need a way to test the learned
strategy. Add a method to the Q-learning class called
playOptimally(board, player). It takes a board and a player
and uses the Q-values table to determine the best move. It should
print out all the Q-values for a given state, so that you can verify
that the learned values make sense. Then it should return the best
move. For X, the best move will be the highest valued action.
For O, the best move will be the lowest valued action. Edit
the file useLearnedStrategy.py, so that it unpickles the
saved Q-learning object and then allows a human player to play against
the Q-learning player choosing optimal moves in a game of tic-tac-toe.
In the file analyzeResults.txt discuss the learned strategy.
Give details on the amount of training you did as well as the best
parameter settings for the learning rate and discount. Give specific
examples of how your Q-learner values states. Does it always take an
immediate win, if one exists? Does it always block an opponent's
possible win in the next move? Does it set itself up for guaranteed
wins, when possible? What are its shortcomings?
The following suggestions are NOT required, and should only be
attempted once you have successfully completed tic-tac-toe.
- Reduce the size of the Q-table for tic-tac-toe by using the symmetry
of the board.
- Rather than explicitly representing the Q-values table, use a neural
network to approximate the table.
- Apply Q-learning to a different game.
Once you are satisfied with your programs and analysis, hand them in by typing
handin63 at the unix prompt. Be sure to update all of the
following files:
- qlearn.py: Implement Q-learning to handle a game setting.
- useLearnedStrategy.py: Implement a program to use the
learned Q-values table to play tic-tac-toe versus a human player.
- analyzeResults.txt: Describe the learned strategy.