Lab 7: Reinforcement Learning
Due March 28 by midnight

The Pac-Man interface and the inspiration for this lab were developed by John DeNero and Dan Klein at UC Berkeley.
(Figure: the capsuleClassic layout)

Starting point code

You are encouraged to work with a partner on this lab. Please read the following instructions carefully.

  1. First you need to run setup63 to create a git repository for the lab.

    If you want to work alone do:

    setup63-Lisa labs/07 none
    If you want to work with a partner, then one of you needs to run the following command while the other one waits until it finishes:
    setup63-Lisa labs/07 partnerUsername
    Once the script finishes, the other partner should run it on their account.

  2. For the next step, only one partner should copy over the starting point code:
    cd ~/cs63/labs/07
    cp -r /home/meeden/public/cs63/labs/07/* .

  3. Whether you are working alone or with a partner, you should now add, commit, and push these files as shown below:
    git add *
    git commit -m "lab7 start"
    git push

  4. If you are working with a partner, your partner can now pull the changes in:
    cd ~/cs63/labs/07
    git pull
You are now ready to start the lab.


In this lab, you will be exploring sequential decision problems that can be modeled as Markov Decision Processes (or MDPs). You will begin by experimenting with some simple grid worlds and then once you have gained some intuition about how these work, you will focus on Pac-Man.

You will start by testing Value Iteration agents that have access to the MDP model. Then you will look at Q-learning agents that instead must explore their environment and try to discover the underlying model.

The MDP and Pac-Man interfaces that you have copied to your directory contain many files, most of which you can ignore. The only files you will modify are analysis.txt and the agent files you will complete in the sections below.


An MDP describes an environment with observable states and stochastic actions. To experience this for yourself, run Gridworld in manual control mode, and use the arrow keys to move the agent:

python -m

You will see a two-exit environment. The blue dot is the agent. Note that each action you try has its intended outcome only 80% of the time; 10% of the time the agent slips in one of the orthogonal directions, and 10% of the time in the other.
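The noise model can be sketched in a few lines (an illustration of the 80/10/10 behavior described above, not the actual Gridworld code; the direction names are assumptions):

```python
import random

# Orthogonal directions for each intended move (illustrative only).
ORTHOGONAL = {
    'north': ('east', 'west'),
    'south': ('east', 'west'),
    'east':  ('north', 'south'),
    'west':  ('north', 'south'),
}

def noisy_action(intended, noise=0.2):
    """Return the intended direction with probability 1 - noise,
    otherwise one of the two orthogonal directions (noise/2 each)."""
    r = random.random()
    if r < 1.0 - noise:
        return intended
    orth = ORTHOGONAL[intended]
    return orth[0] if r < 1.0 - noise / 2 else orth[1]
```

With the default noise of 0.2, this reproduces the 80/10/10 split you just observed under manual control.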

You can control many aspects of the simulation. A full list of options is available by running:

 python -h

Unlike the manual control you just used, the default agent moves randomly until it stumbles upon an exit. Try the default agent in another grid world:

 python -g MazeGrid

Note: The Gridworld MDP is such that you first must enter a pre-terminal state (the double boxes shown in the GUI) and then take the special 'exit' action before the episode actually ends (in the true terminal state called TERMINAL_STATE, which is not shown in the GUI).

Look at the console output that accompanies the graphical output (or use -t for all text). You will be told about each transition the agent experiences (to turn this off, use -q). 

Positions are represented by (x,y) Cartesian coordinates and any arrays are indexed by [x][y], with 'north' being the direction of increasing y, etc.  By default, most transitions will receive a reward of zero, though you can change this with the living reward option (-r). 

An MDP can be solved using value iteration. Value iteration is an offline planner, not a reinforcement learner: it repeatedly sweeps over all states, updating their values, until the values reach an equilibrium. A ValueIterationAgent has been defined for you in the starting point code.

The ValueIterationAgent takes an MDP on construction and runs value iteration for the specified number of iterations before the constructor returns.

The following command loads the ValueIterationAgent, which will compute a policy and execute it 10 times. After value iteration is complete, press a key to start the simulation. You should find that the value of the start state (V(start)) and the empirical resulting average reward are quite close.

python -a value -i 100 -k 10
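Under the hood, value iteration repeatedly applies the Bellman backup V(s) <- max_a sum_s' T(s,a,s') * [R(s,a,s') + gamma * V(s')]. The sketch below runs it on a tiny two-state MDP invented purely for illustration (the real agent operates on the Gridworld MDP you just ran):

```python
GAMMA = 0.9

# A hand-made MDP: (state, action) -> list of (prob, next_state, reward).
# Taking 'exit' from 'b' ends the episode with reward +10.
TRANSITIONS = {
    ('a', 'right'): [(0.8, 'b', 0.0), (0.2, 'a', 0.0)],
    ('a', 'stay'):  [(1.0, 'a', 0.0)],
    ('b', 'exit'):  [(1.0, 'done', 10.0)],
}

def value_iteration(iterations=100, gamma=GAMMA):
    states = {'a', 'b', 'done'}
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        newV = dict(V)
        for s in states:
            actions = [a for (s2, a) in TRANSITIONS if s2 == s]
            if not actions:          # terminal state keeps value 0
                continue
            # Bellman backup: best expected discounted return over actions.
            newV[s] = max(
                sum(p * (r + gamma * V[ns]) for p, ns, r in TRANSITIONS[(s, a)])
                for a in actions)
        V = newV
    return V
```

Here V('b') converges to 10 and V('a') to roughly 8.78, which matches solving V(a) = 0.8 * 0.9 * 10 + 0.2 * 0.9 * V(a) by hand.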

Explore discount, noise, and living reward

The expected value of a state depends on a number of factors including how future rewards are discounted, how noisy the actions are, and how much reward is received on non-exit states. Find appropriate settings for these variables in the following scenarios and record your settings in the analysis.txt file.

  1. On BridgeGrid with the default discount of 0.9 and the default noise of 0.2, the optimal policy does not cross the bridge. Change only ONE of the discount and noise parameters so that the optimal policy causes the agent to attempt to cross the bridge.
    python -a value -i 100 -g BridgeGrid --discount 0.9 --noise 0.2

  2. Consider the DiscountGrid layout, shown below. This grid has two terminal states with positive payoff (shown in green): a close exit with payoff +1 and a distant exit with payoff +10. The bottom row of the grid consists of terminal states with negative payoff (shown in red); each state in this "cliff" region has payoff -10. The starting state is the yellow square.

    We distinguish between two types of paths: (1) paths that "risk the cliff" and travel near the bottom row of the grid; these paths are shorter but risk earning a large negative payoff, and are represented by the red arrow in the figure below. (2) Paths that "avoid the cliff" and travel along the top edge of the grid; these paths are longer but are less likely to incur huge negative payoffs, and are represented by the green arrow in the figure below.

    Give an assignment of parameter values for discount, noise, and livingReward that produces each of the following optimal policy types, or state that the policy is not possible. The default corresponds to:

    python -a value -i 100 -g DiscountGrid --discount 0.9 --noise 0.2 --livingReward 0.0
    1. Prefer the close exit (+1), risking the cliff (-10)
    2. Prefer the close exit (+1), but avoiding the cliff (-10)
    3. Prefer the distant exit (+10), risking the cliff (-10)
    4. Prefer the distant exit (+10), avoiding the cliff (-10)
    5. Avoid both exits (also avoiding the cliff)
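For intuition about how the discount trades off near and distant rewards, a quick back-of-the-envelope check helps. The sketch below uses hypothetical step counts (not the actual DiscountGrid distances) and deliberately ignores noise and living reward:

```python
def discounted_exit_value(payoff, steps, gamma):
    """Approximate value contributed by an exit `steps` moves away,
    ignoring noise and living reward (a deliberate simplification)."""
    return (gamma ** steps) * payoff

# With gamma = 0.9, a +10 exit five steps away (about 5.9) still beats
# a +1 exit one step away (0.9); with gamma = 0.3, the preference flips.
far = discounted_exit_value(10, 5, 0.9)
near = discounted_exit_value(1, 1, 0.9)
```

Experimenting with these numbers should suggest which direction to push the discount for each policy type above.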


You will now write a Q-learning agent that learns a policy for how to behave in the grid worlds you just explored.

This policy will be created from trial-and-error interactions with the environment through its update method. A stub of a Q-learner is provided for you in the class QLearningAgent.

This Q-learning agent will maintain a table of its estimated value for all possible state-action combinations that it encounters in the environment. The agent knows the possible actions it can take, but does not know the possible states it will encounter. Any time the agent visits a new state for the first time, you'll need to add this state to the table with estimated values of 0 for all possible actions.
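One way to get this default-to-zero behavior is a lazily grown nested dictionary. The class and method names below are illustrative, not the interface required by the stub:

```python
class QTable:
    """Lazily grown table of Q-values: unseen states are added on first
    visit with an estimate of 0.0 for every possible action."""

    def __init__(self, legal_actions):
        self.legal_actions = legal_actions   # actions are known in advance
        self.q = {}                          # state -> {action: value}

    def get(self, state, action):
        if state not in self.q:
            # First visit to this state: initialize all actions to zero.
            self.q[state] = {a: 0.0 for a in self.legal_actions}
        return self.q[state][action]

    def set(self, state, action, value):
        self.get(state, action)              # ensure the row exists
        self.q[state][action] = value
```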

  1. Start by implementing the following methods:

    Note: For computeActionFromQValues, you should break ties randomly for better behavior. In a particular state, actions that your agent hasn't seen before still have a Q-value, specifically a Q-value of zero, and if all of the actions that your agent has seen before have a negative Q-value, then an unseen action may be optimal.

    Important: Make sure that you only access Q values by calling getQValue in your methods. This abstraction will be useful later when we will override getQValue in approximate Q-learning to use features of state-action pairs rather than state-action pairs directly.

  2. With the Q-learning update in place, you can watch your Q-learner learn under manual control, using the keyboard:
    python -a q -k 5 -m
    The -k option controls the number of episodes your agent gets to learn. Watch how the agent learns about the state it was just in, not the one it moves to, and "leaves learning in its wake."

  3. Complete your Q-learning agent by implementing epsilon-greedy action selection in getAction, meaning it chooses a random action with probability epsilon and follows its current best q-values otherwise.

  4. Test your agent by doing:
    python -a q -k 100 
    Your final q-values should resemble those of the value iteration agent, especially along well-traveled paths. However, your average returns will be lower because of the random actions and the initial learning phase.

  5. With no additional code, you should now be able to run Q-learning on a simulated crawler robot:

    If this doesn't work, you've probably written some code too specific to the GridWorld problem and you should make it more general to all MDPs.
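Putting the pieces together, the core of a tabular Q-learner looks roughly like the sketch below. The class and method names are illustrative; the stub you are completing defines its own interface:

```python
import random

class SketchQLearner:
    """Minimal tabular Q-learner: epsilon-greedy action selection with
    random tie-breaking, and the standard one-step Q-learning update."""

    def __init__(self, actions, epsilon=0.1, alpha=0.5, gamma=0.9):
        self.actions = actions
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma
        self.q = {}                               # (state, action) -> value

    def get_q_value(self, state, action):
        return self.q.get((state, action), 0.0)   # unseen pairs default to 0

    def compute_value(self, state):
        return max(self.get_q_value(state, a) for a in self.actions)

    def compute_action(self, state):
        # Break ties randomly so unseen (zero-valued) actions can win.
        best = self.compute_value(state)
        tied = [a for a in self.actions
                if self.get_q_value(state, a) == best]
        return random.choice(tied)

    def get_action(self, state):
        # Epsilon-greedy: explore with probability epsilon.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return self.compute_action(state)

    def update(self, state, action, next_state, reward):
        target = reward + self.gamma * self.compute_value(next_state)
        old = self.get_q_value(state, action)
        self.q[(state, action)] = old + self.alpha * (target - old)
```

Note that every method reads q-values only through get_q_value, mirroring the getQValue abstraction the lab asks you to respect.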

Once Q-learning is working on grid worlds and the crawler robot, you are ready to move on to Pac-Man.


Familiarize yourself with the Pac-Man interface. Try playing a game of Pac-Man by typing the following at the command line:

In this case you are the agent that is controlling Pac-Man. Use the arrow keys to move Pac-Man around in his environment. The smaller white dots represent food. The goal of the game is to eat all of the food while avoiding the ghosts that are trying to thwart Pac-Man. If Pac-Man runs into a ghost, he dies. However, the larger white dots are special capsules that when eaten by Pac-Man make the ghosts safe to contact. The ghosts will only be safe for a short time, while they are colored white, but they eventually become dangerous again.


In your lab7 directory you'll see that you have a sub-directory called layouts. When you invoke the pacman game, by default it chooses the mediumClassic.lay layout. However, you can use command-line arguments to choose other layouts. For example:
python -l smallGrid
Try the other layouts in the directory. Notice that these layouts are simply text files that follow a convention for designating the locations of walls, food, capsules, ghosts, and Pac-Man.

Command-line arguments

Note that the game supports a number of options that can each be expressed in a long way (e.g., --layout) or a short way (e.g., -l). You can see the list of all options and their default values via:
python -h


The Pac-Man agent can execute the following five actions:


What the Pac-Man agent can perceive is based on the methods of the GameState class. Open up the file where GameState is defined and look through the methods it provides.

It is important to recognize that the game has a number of different agents (Pac-Man and the ghosts). Each agent in the game has a unique index; Pac-Man is always index 0, with ghosts starting at index 1.

Pac-Man can perceive:

In addition, given the action he chooses, Pac-Man can determine what the next state of the environment will be by using the method generatePacmanSuccessor.

Pac-Man Agents

Provided with the starting point code are examples of three basic agents that control Pac-Man:

  1. RandomAgent: Chooses actions randomly (excluding 'Stop')
  2. ReflexAgent: Chooses actions that lead to immediate food, otherwise chooses randomly
  3. StateAgent: Chooses actions that lead to immediate food; when no food is visible continues to move in the same direction as the previous step

As you can see in this file, at a minimum, a Pac-Man agent must have a getAction method that takes in a state and returns a valid action.

To test these agents do:

python -l openSearch -p RandomAgent
python -l openSearch -p ReflexAgent
python -l openSearch -p StateAgent

You will be creating a Pac-Man agent that chooses actions based on Q-learning's assessment of which action is likely to maximize long-term reward.

Q-learning on Pac-Man

Time to use Q-learning to play some Pac-Man! Pac-Man will play games in two phases. In the first phase, training, Pac-Man will begin to learn about the values of positions and actions. Because it takes a very long time to learn accurate q-values even for tiny grids, Pac-Man's training games run in quiet mode by default, with no GUI (or console) display. Once Pac-Man's training is complete, he will enter testing mode. When testing, Pac-Man's self.epsilon and self.alpha will be set to 0.0, effectively stopping Q-learning and disabling exploration, in order to allow Pac-Man to exploit his learned policy. Test games are shown in the GUI by default. Without any code changes you should be able to run Q-learning Pac-Man for very tiny grids as follows:

 python -p PacmanQAgent -x 2000 -n 2010 -l smallGrid 
Note that PacmanQAgent is already defined for you in terms of the QLearningAgent you've already written. PacmanQAgent is only different in that it has default learning parameters that are more effective for the Pac-Man problem (epsilon=0.05, alpha=0.2, gamma=0.8).

Your agent should win at least 80% of the games in the last 10 runs.

Hint: If your QLearningAgent works on the grid worlds and the crawler robot but does not seem to be learning a good policy for Pac-Man on smallGrid, it may be because your getAction and/or getPolicy methods do not properly consider unseen actions in some cases. In particular, because unseen actions by definition have a Q-value of zero, if all of the actions that have been seen have negative Q-values, an unseen action may be optimal.

Note: If you want to experiment with learning parameters, you can use the option -a, for example -a epsilon=0.1,alpha=0.3,gamma=0.7. These values will then be accessible as self.epsilon, self.gamma and self.alpha inside the agent.

While a total of 2010 games will be played, the first 2000 games will not be displayed because of the option -x 2000, which designates the first 2000 games for training (no output). Thus, you will only see Pac-Man play the last 10 of these games. The number of training games is also passed to your agent as the option numTraining.

During training, you will see output every 100 games with statistics about how Pac-Man is faring. Epsilon is positive during training, so Pac-Man will play poorly even after having learned a good policy: this is because he occasionally makes a random exploratory move into a ghost.

As a benchmark, it should take about 1,000 games before Pac-Man's rewards for a 100-episode segment become positive, reflecting that he's started winning more than losing. By the end of training, the reward should remain positive and be fairly high (between 100 and 350).

If you want to watch 10 training games to see what's going on, use the command:

 python -p PacmanQAgent -n 10 -l smallGrid -a numTraining=10

Once Pac-Man is done training, he should win very reliably in test games, since now he is exploiting his learned policy.

However, you'll find that training the same agent on the seemingly simple mediumGrid may not work well. In our implementation, Pac-Man's average training rewards remain negative throughout training. At test time, he plays badly, probably losing all of his test games. Training will also take a long time, despite its ineffectiveness.

 python -p PacmanQAgent -x 2000 -n 2010 -l mediumGrid 

Pac-Man fails to win on larger layouts because each board configuration is a separate state with separate q-values. He has no way to generalize that running into a ghost is bad for all positions. Obviously, this approach will not scale. In the next section we will see how to use an approximate version of Q-learning to improve performance on larger layouts.

Approximate Q-learning and State Abstraction

Next you will implement an approximate q-learning agent that learns weights for features of states, where many states might share the same features.

Approximate q-learning assumes the existence of a feature function f(s,a) over state-action pairs, which yields a vector [f1(s,a), ..., fi(s,a), ..., fn(s,a)] of feature values. Feature extractors are provided for you in the starting point code. The constructor of the ApproximateQAgent class includes a variable self.featureExtractor that is set from the command-line arguments; use it to access the features for a given state and action.

This will return a feature vector, which is an instance of a util.Counter object. You can see the implementation of the Counter class in the starting point code. A Counter is like a dictionary that maps keys to values, but with some additional functionality: any omitted key automatically has a value of zero, and you can multiply, add, and subtract two Counter objects. A feature vector maps feature names such as 'closest-food' to a number, here representing the distance from Pac-Man's current position to the closest food pellet.
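The real util.Counter is not reproduced here, but its key behaviors can be mimicked in a few lines (a simplified stand-in written for illustration, not the actual implementation):

```python
class MiniCounter(dict):
    """Simplified stand-in for util.Counter: missing keys read as zero,
    and two counters can be added or multiplied like vectors."""

    def __missing__(self, key):
        return 0          # omitted keys automatically have value zero

    def __add__(self, other):
        keys = set(self) | set(other)
        return MiniCounter({k: self[k] + other[k] for k in keys})

    def __mul__(self, other):
        # Dot product over shared keys, as used in approximate Q-learning.
        return sum(self[k] * other[k] for k in set(self) & set(other))

# Hypothetical feature and weight values:
features = MiniCounter({'closest-food': 0.3, 'bias': 1.0})
weights = MiniCounter({'closest-food': -2.0, 'bias': 5.0})
q_value = weights * features   # dot product (approximately 4.4 here)
```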

Implementing Approximate Q-learning

Complete the implementation of the ApproximateQAgent class as follows:

  1. In the constructor, define self.weights as a Counter.

  2. In getQValue, the approximate version of the q-value takes the following form:

    Q(s,a) = sum over i of f_i(s,a) * w_i

    where each weight w_i is associated with a particular feature f_i(s,a). Implement this as the dot product of the weights and the features.

  3. In update you will change the weights similarly to how you updated q-values:

    w_i <- w_i + alpha * correction * f_i(s,a)
    correction = (reward + gamma * max_a' Q(s',a')) - Q(s,a)

    Note that the correction term is the same as in normal Q-learning.

  4. In final, you should add the following lines to allow you to view the weight values at the end of the training:
        if self.episodesSoFar == self.numTraining:
            print("Weights at the end of training")
            print(self.weights)
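The q-value and update steps above can be sketched with plain dictionaries standing in for Counters (the feature names and values below are hypothetical):

```python
def approx_q_value(weights, features):
    """Q(s,a) = sum_i w_i * f_i(s,a): dot product of weights and features."""
    return sum(weights.get(k, 0.0) * v for k, v in features.items())

def approx_update(weights, features, reward, next_q_max,
                  alpha=0.2, gamma=0.8):
    """One approximate Q-learning update; `correction` is the same
    temporal-difference error as in tabular Q-learning."""
    correction = (reward + gamma * next_q_max) - approx_q_value(weights, features)
    for k, v in features.items():
        weights[k] = weights.get(k, 0.0) + alpha * correction * v
    return weights

# One hypothetical update from empty weights after a +10 reward:
w = approx_update({}, {'bias': 1.0, 'closest-food': 0.5},
                  reward=10.0, next_q_max=0.0)
```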

Testing Approximate Q-learning

By default, ApproximateQAgent uses the IdentityExtractor, which assigns a single feature to every (state,action) pair. With this feature extractor, your approximate q-learning agent should work identically to PacmanQAgent. You can test this with the following command:

 python -p ApproximateQAgent -x 2000 -n 2010 -l smallGrid 

Important: ApproximateQAgent is a subclass of QLearningAgent, and it therefore shares several methods like getAction. Make sure that your methods in QLearningAgent call getQValue instead of accessing q-values directly, so that when you override getQValue in your approximate agent, the new approximate q-values are used to compute actions.

Once you're confident that your approximate learner works correctly with the identity features, you can try it with the custom feature extractor called SimpleExtractor, which depends on just four features:

Note: In order to keep weights in a reasonable range, all of these features are divided by 10.0.

Your approximate q-learning agent with the SimpleExtractor should learn to win with ease on the mediumGrid:

 python -p ApproximateQAgent -a extractor=SimpleExtractor -x 50 -n 60 -l mediumGrid 

Even much larger layouts, such as mediumClassic, should be no problem for your ApproximateQAgent (warning: this may take a few minutes to train).

 python -p ApproximateQAgent -a extractor=SimpleExtractor -x 50 -n 60 -l mediumClassic

If you have no errors, your approximate q-learning agent should win almost every time with these simple features, even with only 50 training games.

Improving the feature extractor

When watching your trained approximate q-learning agent using the SimpleExtractor, you probably noticed that he actively avoids ghosts, even when they are safe to eat. This is because the given feature extractor doesn't distinguish between safe and dangerous ghosts, so ghosts always look dangerous. If you try your approximate q-learning agent on some of the hardest layouts, like contestClassic and trickyClassic, Pac-Man's failure to seek out capsules and eat safe ghosts hurts his performance.

Make a new extractor class called BetterExtractor that improves on SimpleExtractor. Remember that you can look at the methods of the GameState class for information you can use in your extractor. You might want to consider incorporating:

Note: Each ghost state maintains a variable called scaredTimer. This timer is 0 when a ghost is dangerous to Pac-Man and greater than 0 when a ghost is safe for Pac-Man to eat.
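As one illustration, a helper like the following could separate the two kinds of ghosts. It assumes only the scaredTimer attribute described in the note above, and is not part of the provided code:

```python
def split_ghosts(ghost_states):
    """Partition ghosts into dangerous and safe-to-eat groups using the
    scaredTimer convention: 0 means dangerous, > 0 means safe to eat.
    `ghost_states` is assumed to be a list of objects exposing a
    scaredTimer attribute."""
    dangerous = [g for g in ghost_states if g.scaredTimer == 0]
    safe = [g for g in ghost_states if g.scaredTimer > 0]
    return dangerous, safe
```

A BetterExtractor could then emit separate distance features for dangerous and safe ghosts, so the learned weights can treat them differently.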

Test your better extractor on the hardest Pac-Man layouts and see if it fares better than the simple one:

 python -p ApproximateQAgent -a extractor=BetterExtractor -x 50 -n 60 -l contestClassic 

Be sure to write a clear comment with your BetterExtractor explaining what features it is focusing on.

Submitting your code

To submit your code, you need to use git to add, commit, and push the files that you modified.
cd ~/cs63/labs/07
git add analysis.txt
git commit -m "final version"
git push