You may work with a partner on this assignment. Be sure to add
both names to the top of every file. Only one of the partners needs
to turn in the code. Before you begin, do an update63 to get
the starting point files.
In this assignment you will use reinforcement learning to create a
program that learns to play the game tic-tac-toe by playing games
against itself. We will consider X to be the maximizer and
O to be the minimizer. Therefore, a win for X will
result in an external reward of +1, while a win for O will
result in an external reward of -1. Any other state of the game will
result in an external reward of 0 (including tied games).
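As a concrete illustration, this reward scheme can be written as a
small function; the winner-checking interface below is hypothetical,
so adapt it to whatever the provided TicTacToe class actually offers:

    # Sketch of the external reward scheme; the winner() interface is
    # hypothetical, not part of the provided TicTacToe class.
    def externalReward(game):
        if game.winner() == 'X':      # maximizer has won
            return 1
        elif game.winner() == 'O':    # minimizer has won
            return -1
        else:                         # game still in progress, or a tie
            return 0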
Specifically, you will be using Q-learning, a temporal difference
algorithm, to try to learn the optimal playing strategy. Q-learning
builds a table that maps state-action pairs to expected rewards. The
goal of temporal difference learning is to make the learner's current
prediction of the reward for taking an action in the current state
more closely match the prediction it will make at the next time step.
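For reference, the standard single-agent Q-learning update, with
learning rate alpha and discount gamma, is:

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))

where s' is the state reached after taking action a in state s. The
next paragraph explains how the max must be adapted when one of the
players is a minimizer.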
You will need to modify the Q-learning algorithm to handle both a
maximizing player and a minimizing player. The state of the game will
be represented by two features: the current player and the board.
When it is X's turn, you will want to choose the action with
the highest expected value. When it is O's turn, you will
want to choose the action with the lowest expected value. When
updating the Q-values, if it is X's turn, then the expected
return should be based on the lowest expected value of the next state.
Similarly, when it is O's turn, then the expected return
should be based on the highest expected value of the next state. This
is similar to how minimax works. We assume that the opponent will
always make the best possible move.
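As a sketch of this modified update (all of the names here, including
the nested q dictionary and the alpha and gamma parameters, are
illustrative rather than taken from the starting-point files):

    # Sketch of the minimax-style Q-update. A state is a (player, board)
    # pair, and q is the nested dictionary described below; alpha is the
    # learning rate and gamma is the discount.
    def qUpdate(q, state, action, reward, nextState, alpha, gamma):
        player = state[0]
        if nextState not in q or not q[nextState]:
            future = 0                          # terminal: no future value
        elif player == 'X':
            # after X moves it is O's turn, so O will minimize
            future = min(q[nextState].values())
        else:
            # after O moves it is X's turn, so X will maximize
            future = max(q[nextState].values())
        target = reward + gamma * future
        q[state][action] += alpha * (target - q[state][action])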
You should represent the Q-values table as a dictionary keyed on
states, where the state is a combination of the current player and the
current board. The value associated with each state should be another
dictionary keyed on actions. The value associated with each action
should be the current prediction of the expected value of taking that
action in the current state. When a new state is added to the
dictionary, you should initialize all of the action values to zero.
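For example, adding a new state might look like the following sketch;
the 9-character board string and the index-based move encoding are
illustrative, so use whatever representation your TicTacToe class
provides:

    # Sketch of adding a new state to the nested-dictionary Q-table.
    q = {}
    state = ('X', 'X O      ')        # (current player, current board)
    if state not in q:
        legalMoves = [i for i, c in enumerate(state[1]) if c == ' ']
        q[state] = {move: 0.0 for move in legalMoves}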
The method that executes Q-learning should run for a certain number of
games, rather than steps. Be sure to call the TicTacToe
method resetGame() at the end of every game. This will reset
the board and switch the starting player from the previous game. With
a simplistic exploration strategy, it generally takes at least 200,000
games to start to learn reasonable play. Even then, you may find that
the opening moves still have Q-values that are zero.
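The overall training loop might be organized as follows; only
resetGame() comes from the provided TicTacToe class, and everything
else here is a placeholder for code you will write:

    # Sketch of the outer loop that runs Q-learning for a fixed number
    # of games rather than a fixed number of steps.
    def train(self, numGames):
        for g in range(numGames):
            while not self.game.isOver():           # placeholder test
                state = (self.game.currentPlayer, self.game.board)
                action = self.chooseAction(state)    # explore/exploit
                reward, nextState = self.takeAction(action)
                self.qUpdate(state, action, reward, nextState)
            self.game.resetGame()  # clears board, swaps starting player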
Because the learning process takes some time, you should save the
state of the Q-learning object at the end of training. You should use
Python's pickle module to do this. The pickle
module implements an algorithm for serializing and de-serializing a
Python object structure. "Pickling" is the process whereby a Python
object hierarchy is converted into a byte stream, and "unpickling"
is the inverse operation, whereby a byte stream is converted back into
an object hierarchy. In other words, you can save the entire state of
an object to a file, and then later restore that object.
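For example (the filename is arbitrary, and learner stands for your
trained Q-learning object):

    import pickle

    # At the end of training, save the learner to a file
    with open('qtable.pickle', 'wb') as f:
        pickle.dump(learner, f)

    # Later, in useLearnedStrategy.py, restore it
    with open('qtable.pickle', 'rb') as f:
        learner = pickle.load(f)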
Once Q-learning is complete, you will need a way to test the learned
strategy. Add a method to the Q-learning class called
playOptimally(board, player). It takes a board and a player
and uses the Q-values table to determine the best move. It should
print out all the Q-values for a given state, so that you can verify
that the learned values make sense. Then it should return the best
move. For X, the best move will be the highest valued action.
For O, the best move will be the lowest valued action. Edit
the file useLearnedStrategy.py so that it unpickles the
saved Q-learning object and then allows a human player to play
tic-tac-toe against the Q-learning player, which chooses its moves
optimally.
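A sketch of playOptimally, assuming the nested-dictionary layout
described earlier (handling of states never visited during training
is omitted):

    # Sketch of playOptimally; assumes the (player, board) state key
    # and the nested dictionary of action values described above.
    def playOptimally(self, board, player):
        state = (player, board)
        actions = self.q[state]
        for move in sorted(actions):
            print(move, actions[move])      # Q-values for inspection
        if player == 'X':
            return max(actions, key=actions.get)  # highest-valued action
        else:
            return min(actions, key=actions.get)  # lowest-valued action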
In the file analyzeResults.txt, discuss the learned strategy.
Give details on the amount of training you did as well as the best
parameter settings for the learning rate and discount. Give specific
examples of how your Q-learner values states. Does it always take an
immediate win, if one exists? Does it always block an opponent's
possible win in the next move? Does it set itself up for guaranteed
wins, when possible? What are its shortcomings?
The following suggestions are NOT required, and should only be
attempted once you have successfully completed tic-tac-toe.
- Implement a more sophisticated exploration strategy. One possibility
  is to monitor how many times each state-action pair has been
  visited, and ensure that they are visited about equally early on in
  the learning process (see the sketch after this list).
- Apply Q-learning to a different game, such as 4 by 4 Konane.
- Rather than explicitly representing the Q-values table, use a neural
  network to learn the expected value for each state-action pair.
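One minimal version of the count-based exploration idea above; the
counts mapping from (state, action) pairs to visit counts is something
your code would maintain, and all names here are illustrative:

    # Sketch of count-based exploration: among the legal actions in a
    # state, try the one that has been chosen least often so far.
    def chooseExploratoryAction(q, counts, state):
        return min(q[state], key=lambda a: counts.get((state, a), 0))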
Once you are satisfied with your programs and analysis, hand them in by typing
handin63 at the unix prompt. Be sure to update all of the
following files:
- qlearn.py: Implement Q-learning, modified to handle a two-player
  game setting
- useLearnedStrategy.py: Implement a program to use the
learned Q-values table to play tic-tac-toe versus a human player
- analyzeResults.txt: Describe the learned strategy