Lab 7: Approximate Q-Learning
Due April 7 by midnight

[Images: the cart pole problem; test reward by episode on cart pole]

Starting point code

Use Teammaker to form your team. You can log in to that site to indicate your partner preference. Once you and your partner have specified each other, a GitHub repository will be created for your team.

Introduction

In this lab we will apply approximate Q-learning to two reinforcement learning problems. We will use problems provided by OpenAI Gym. To start, we will focus on a classic control problem known as the Cart Pole. In this problem, a pole is attached to a cart, and the goal is to keep the pole as close to vertical as possible for as long as possible. The state consists of four continuous variables (the cart position, the cart velocity, the pole angle in radians, and the pole velocity). The actions are discrete: push the cart left or push it right. The reward is +1 for every step on which the pole stays within 12 degrees of vertical and the cart's position stays within 2.4 units of the starting position.
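
To see this concretely, here is a small sketch (separate from the starter code) that creates the environment and prints its state and action spaces; the environment id used here (CartPole-v1) and the exact output depend on your version of Gym:

    import gym

    # Inspect the Cart Pole state and action spaces described above.
    env = gym.make("CartPole-v1")
    print(env.observation_space)   # Box(4): cart position, cart velocity,
                                   #         pole angle, pole velocity
    print(env.action_space)        # Discrete(2): push left or push right
    env.close()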

Once you've successfully learned the Cart Pole problem, you will choose another problem from OpenAI Gym to explore, and write up a summary of your results.

Testing Open AI Gym Environments

Open the file testGym.py and review how to create an OpenAI Gym environment, run multiple episodes, execute steps given a particular action, and receive rewards.

In order to use OpenAI Gym, you'll need to activate the CS63 virtual environment. Then you can execute this file, watch the cart move, and see example state information:

    source /usr/swat/bin/CS63env
    python3 testGym.py
  

Try the other environments provided in the file.

Make sure you understand how this program works before moving on.
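
For reference, the core of a program like testGym.py usually looks something like the sketch below: random actions, one episode at a time. The environment id and number of episodes are illustrative, and older and newer versions of Gym differ slightly in what reset and step return.

    import gym

    # Sketch of a basic episode loop: create an environment, take random
    # actions until the episode ends, and accumulate the reward.
    env = gym.make("CartPole-v1")

    for episode in range(3):
        state = env.reset()
        total_reward = 0
        done = False
        while not done:
            env.render()                          # draw the cart and pole
            action = env.action_space.sample()    # pick a random action
            state, reward, done, info = env.step(action)
            total_reward += reward
        print("Episode", episode, "total reward:", total_reward)

    env.close()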

Complete the implementation of Approximate Q-Learning

Recall that in approximate Q-Learning we represent the Q-table using a neural network. The input to this network represents a state and the output from the network represents the current Q values for every possible action.
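
For reference, a network of this kind might look something like the following Keras sketch. The layer sizes, activation functions, and learning rate are illustrative only; the actual model is built for you in deepQAgent.py and may differ.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import Adam

    # Illustrative Q-network: maps a state (4 values for Cart Pole) to one
    # Q value per action (2 for Cart Pole). Sizes and learning rate are
    # placeholders; the real model is defined in deepQAgent.py.
    state_size = 4
    action_size = 2

    model = Sequential()
    model.add(Dense(24, input_dim=state_size, activation="relu"))
    model.add(Dense(24, activation="relu"))
    model.add(Dense(action_size, activation="linear"))   # one Q value per action
    model.compile(loss="mse", optimizer=Adam(0.001))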

Open the file deepQAgent.py. Much of the code has been written for you. This file contains the definition of a class called DeepQAgent that builds the neural network model of the Q-learning table and allows you to train and to use this model to choose actions. You should read carefully through the methods in this class to be sure you understand how it functions.

There are two methods that you need to write: train and test. The main program, which calls both of these methods, is written for you and is in the file cartPole.py.

Here is pseudocode for the train function:

initialize the agent's history list to be empty
loop over episodes
  state = reset the environment
  state = np.reshape(state, [1, state size])
  every 50 episodes save the agent's weights to a unique filename
  (see program comments for details on this filename)
  initialize total_reward to 0
  loop over steps
    choose an action using the epsilon-greedy policy
    take the action
    observe next_state, reward, and whether the episode is done
    update total_reward based on current reward
    reshape the next_state (similar to above)
    remember this experience
    state = next_state
    if episode is done
       break
  add total_reward to the agent's history list
  if length of agent memory > batchSize
     replay batchSize experiences for batchIteration times
  print a message that episode has ended with total_reward
save the agent's final weights after training is complete

The test function is similar in structure to the train function, but you should choose greedy actions and render the environment so you can observe the agent's behavior. You should not remember experiences, replay experiences, or save weights.
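
As a rough guide, a test function might be shaped like the sketch below. The agent attribute names used here (agent.state_size, agent.model) are assumptions; adapt them to whatever names deepQAgent.py actually provides.

    import numpy as np

    # Sketch of a possible test function. Attribute names on the agent
    # are assumptions; use the names defined in deepQAgent.py.
    def test(agent, env, episodes):
        for episode in range(episodes):
            state = env.reset()
            state = np.reshape(state, [1, agent.state_size])
            total_reward = 0
            done = False
            while not done:
                env.render()                             # watch the agent act
                q_values = agent.model.predict(state)    # predicted Q values
                action = np.argmax(q_values[0])          # greedy, no exploration
                next_state, reward, done, info = env.step(action)
                total_reward += reward
                state = np.reshape(next_state, [1, agent.state_size])
            print("Episode", episode, "total reward:", total_reward)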

Use Approximate Q-Learning to find a successful policy

Once you have completed the implementation of the DeepQAgent in the file deepQAgent.py, you can test it out on the Cart Pole problem. There are two ways to run this program: in training mode or in testing mode.

To run this program for training with 200 episodes do:

    python3 cartPole.py train 200
  
Remember that you will first need to activate the CS63env in the terminal window where you run the code.

If you experience a long delay (more than a minute) before the program starts running, or you get error messages, TensorFlow may be hung up searching for GPU devices. You can kill the program with CTRL-C and try running it like this instead:

    CUDA_VISIBLE_DEVICES="" python3 cartPole.py train 200
  

Approximate Q-learning should be able to find a successful policy in 200 episodes using the parameter settings provided in the file.

After training is complete, it is interesting to go back and look at the agent's learning progress over time. We can do this by using the weight files that were saved every 50 episodes.

To run this program in testing mode do:

    python3 cartPole.py test CartPole_episode_0.h5

This would show you how the agent behaved prior to any training, at episode 0. The next command would show you the behavior after 50 episodes of training:

    python3 cartPole.py test CartPole_episode_50.h5

The file RewardByEpisode.png contains a plot of reward over time, which also gives you a sense of the agent's progress on the task. You can view this file by doing:

    eog RewardByEpisode.png
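
If you are curious how a plot like this can be produced, here is a rough matplotlib sketch that turns a history list of per-episode rewards into such a figure. The provided code already generates RewardByEpisode.png for you, so this is only for reference.

    import matplotlib.pyplot as plt

    # Rough sketch: plot per-episode total rewards and save the figure.
    def plot_history(history, filename="RewardByEpisode.png"):
        plt.plot(range(len(history)), history)
        plt.xlabel("Episode")
        plt.ylabel("Total reward")
        plt.title("Reward by episode")
        plt.savefig(filename)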

Applying Deep Q-Learning to a new problem

Once you have been able to successfully learn the Cart Pole problem, you should choose another problem to try. Look through the OpenAI Gym documentation and pick a different environment to play with. Remember that reinforcement learning is difficult! We started with one of the easier classic problems. Don't get discouraged if your attempts at a new problem are not as successful.

You should use the testGym.py file to test out different domains and to figure out how the states, actions, and rewards are represented. But you must choose a task that has a discrete action space.
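
For example, here is one way (a sketch, not part of the starter code) to check whether a candidate environment's action space is discrete before committing to it. LunarLander-v2 is just an example environment id, and it requires the Box2D dependencies to be installed.

    import gym
    from gym import spaces

    # Check whether a candidate environment has a discrete action space.
    env = gym.make("LunarLander-v2")
    print(env.observation_space)
    print(env.action_space)
    print(isinstance(env.action_space, spaces.Discrete))   # must be True for this lab
    env.close()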

[Image: the Lunar Lander problem]

I suggest trying the Lunar Lander problem pictured above. This problem is more challenging than the Cart Pole, but you should be able to make some progress at learning it. Here, the goal is to learn how to land the vessel between the flags.

You should make copies of the files cartPole.py and deepQAgent.py, and rename them for your new task. Several aspects of the code, such as the network architecture and the training parameters, may need to change for the agent to be successful on a harder problem.

Write up a summary of what you did for this second problem in the file writeup.tex.

Submitting

You should use git to add, commit, and push any files that you modify.

Acknowledgements

The structure of the Q-learning algorithm is based on code provided at Deep Q-Learning with Keras and Gym