Discounted Reward Example for OfflineMCTSAgent

Suppose that Pac-Man is in the smallGrid world with just two pieces of food. On every time step his living reward is -1. When he eats a piece of food he gets an additional reward of 10. If he manages to eat all the food he gets a bonus reward of 500. Given this setup, here is one example reward history he could have:

self.rewards = [-1, -1, 9, -1, -1, 509]
Then his discounted reward for each step would be calculated as follows (assuming the discount is d):
discountedReward[5] = 509
discountedReward[4] = -1 + d*509
discountedReward[3] = -1 + d*(-1) + d^2*509
discountedReward[2] =  9 + d*(-1) + d^2*(-1) + d^3*509
discountedReward[1] = -1 + d*9 + d^2*(-1) + d^3*(-1) + d^4*509
discountedReward[0] = -1 + d*(-1) + d^2*9 + d^3*(-1) + d^4*(-1) + d^5*509
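The expansion above can be checked directly with a short script. This is only an illustrative sketch, not the assignment's code; the discount value d = 0.9 is an assumption chosen here because it reproduces the rounded totals given at the end of this example.

```python
# Direct computation of each step's discounted reward as a sum of all
# future rewards, each weighted by the discount raised to its distance
# from step t. Matches the expansion written out above.
rewards = [-1, -1, 9, -1, -1, 509]
d = 0.9  # assumed discount, not specified by the assignment text

def discounted_reward(t):
    # Sum rewards[k] * d^(k - t) for every step k from t to the end.
    return sum(d ** (k - t) * rewards[k] for k in range(t, len(rewards)))

print([round(discounted_reward(t)) for t in range(len(rewards))])
# -> [305, 340, 378, 410, 457, 509]
```

Note that this direct form recomputes shared work for every t; the recurrence below avoids that.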
Each of these discounted rewards can be written in terms of the previous one like this:
discountedReward[5] = 509
discountedReward[4] = -1 + d * discountedReward[5]
discountedReward[3] = -1 + d * discountedReward[4]
discountedReward[2] =  9 + d * discountedReward[3]
discountedReward[1] = -1 + d * discountedReward[2]
discountedReward[0] = -1 + d * discountedReward[1]
In general the calculation is:
discountedReward[t] = R(t) + d * discountedReward[t+1]
where R(t) is the immediate reward at step t, i.e. self.rewards[t].
Thus if you iterate over the self.rewards list in reverse order, you can easily calculate the discounted reward for each time step in a single pass. For this particular example (these numbers correspond to d = 0.9, rounded to the nearest integer), the discounted rewards would be:
discountedReward[5] = 509
discountedReward[4] = 457
discountedReward[3] = 410
discountedReward[2] = 378
discountedReward[1] = 340
discountedReward[0] = 305
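The reverse iteration described above can be sketched as follows. The rewards list comes from this example, and d = 0.9 is an assumed discount that reproduces the rounded values just listed; variable names are illustrative, not taken from the assignment code.

```python
# Single backward pass using the recurrence
#   discountedReward[t] = rewards[t] + d * discountedReward[t+1]
rewards = [-1, -1, 9, -1, -1, 509]
d = 0.9  # assumed discount

discounted = [0.0] * len(rewards)
running = 0.0  # discounted reward of the step after the current one
for t in reversed(range(len(rewards))):
    running = rewards[t] + d * running
    discounted[t] = running

print([round(x) for x in discounted])
# -> [305, 340, 378, 410, 457, 509]
```

Because each step reuses the total already computed for the following step, the whole list is filled in O(n) time instead of summing the tail of the list separately for every t.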