Discounted Reward Example for OfflineMCTSAgent

Suppose that Pac-Man is in the smallGrid world with just two pieces of food. On every time step his living reward is -1. When he eats a piece of food he gets an additional reward of 10. If he manages to eat all the food he gets a bonus reward of 500. Given this setup, here is one example reward history he could have:

self.rewards = [-1, -1, 9, -1, -1, 509]
Then his discounted reward for each step would be calculated as follows (assuming the discount is d):
discountedReward[5] = 509
discountedReward[4] = -1 + d*509
discountedReward[3] = -1 + d*(-1) + d^2*509
discountedReward[2] =  9 + d*(-1) + d^2*(-1) + d^3*509
discountedReward[1] = -1 + d*9 + d^2*(-1) + d^3*(-1) + d^4*509
discountedReward[0] = -1 + d*(-1) + d^2*9 + d^3*(-1) + d^4*(-1) + d^5*509
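The expansion above can be checked directly with a short script. This is only an illustrative sketch, not the assignment's code; the discount value d = 0.9 is an assumption chosen here because it reproduces the rounded totals given at the end of this example.

```python
# Direct computation of each step's discounted reward as a sum of all
# future rewards, each weighted by the discount raised to its distance
# from step t. Matches the expansion written out above.
rewards = [-1, -1, 9, -1, -1, 509]
d = 0.9  # assumed discount, not specified by the assignment text

def discounted_reward(t):
    # Sum rewards[k] * d^(k - t) for every step k from t to the end.
    return sum(d ** (k - t) * rewards[k] for k in range(t, len(rewards)))

print([round(discounted_reward(t)) for t in range(len(rewards))])
# -> [305, 340, 378, 410, 457, 509]
```

Note that this direct form recomputes shared work for every t; the recurrence below avoids that.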
Each of these discounted rewards can be written in terms of the previous one like this:
discountedReward[5] = 509
discountedReward[4] = -1 + d * discountedReward[5]
discountedReward[3] = -1 + d * discountedReward[4]
discountedReward[2] =  9 + d * discountedReward[3]
discountedReward[1] = -1 + d * discountedReward[2]
discountedReward[0] = -1 + d * discountedReward[1]
In general the calculation is:
discountedReward[t] = R(t) + d * discountedReward[t+1]
where R(t) is the immediate reward at step t, i.e. self.rewards[t].
Thus if you iterate over the self.rewards list in reverse order, you can easily calculate the discounted reward for each time step in a single pass. For this particular example (these numbers correspond to d = 0.9, rounded to the nearest integer), the discounted rewards would be:
discountedReward[5] = 509
discountedReward[4] = 457
discountedReward[3] = 410
discountedReward[2] = 378
discountedReward[1] = 340
discountedReward[0] = 305
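The reverse iteration described above can be sketched as follows. The rewards list comes from this example, and d = 0.9 is an assumed discount that reproduces the rounded values just listed; variable names are illustrative, not taken from the assignment code.

```python
# Single backward pass using the recurrence
#   discountedReward[t] = rewards[t] + d * discountedReward[t+1]
rewards = [-1, -1, 9, -1, -1, 509]
d = 0.9  # assumed discount

discounted = [0.0] * len(rewards)
running = 0.0  # discounted reward of the step after the current one
for t in reversed(range(len(rewards))):
    running = rewards[t] + d * running
    discounted[t] = running

print([round(x) for x in discounted])
# -> [305, 340, 378, 410, 457, 509]
```

Because each step reuses the total already computed for the following step, the whole list is filled in O(n) time instead of summing the tail of the list separately for every t.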