# Basic Policy Gradients with the Reparameterization Trick

Reinforcement learning is the training of models to make optimal decisions. One realm where we can put that decision-making prowess to the test is computer games. We can test humans vs machines on these frame-by-frame reaction tasks. When we train a good model with reinforcement learning, machines can play like a pro. At the core of many modern reinforcement learning algorithms is the policy gradient. To understand this line of algorithms, we will dive deeper into the basic policy gradient algorithm.

# OpenAI Gym

OpenAI Gym provides a toolkit for reinforcement learning. It has a collection of games that you can train algorithms to play. Each game simulation is called an environment. The algorithm that plays the game is your agent. Your agent chooses an action to take at each step. The agent passes the action to the environment, and the environment processes the action for one time step. After the step, the environment returns an observation (the state of the environment) and a reward (a score for how effective the action was). Then we find ourselves back at the beginning of the loop.
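The loop described above can be sketched without installing anything. The snippet below is a minimal illustration, not Gym itself: `ToyEnvironment` and `random_agent` are hypothetical stand-ins that only mimic the `reset()`/`step()` shape of a Gym environment, with a made-up rule that the episode ends after five steps.

```python
import random


class ToyEnvironment:
    """Hypothetical stand-in for a Gym environment (not part of Gym)."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps_taken = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.steps_taken = 0
        return self.steps_taken

    def step(self, action):
        # Process one action for one time step.
        self.steps_taken += 1
        observation = self.steps_taken
        reward = 1.0  # made-up rule: +1 for every step survived
        done = self.steps_taken >= self.max_steps
        info = {}
        return observation, reward, done, info


def random_agent(observation):
    # Hypothetical agent: ignores the observation and picks action 0 or 1.
    return random.choice([0, 1])


def run_episode(env, agent):
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent(observation)  # the agent chooses an action
        observation, reward, done, info = env.step(action)  # the environment processes it
        total_reward += reward
    return total_reward


episode_reward = run_episode(ToyEnvironment(), random_agent)
print(episode_reward)  # 5.0, one reward per step of the toy episode
```

A real agent would use the observation to pick its action; the random one here only keeps the loop itself in focus.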

OpenAI Gym standardizes reinforcement learning environments for benchmarking. It’s also an incredible tool for learning and experimenting with different algorithms.

# CartPole Environment

What’s the CartPole game? It’s a game in which you keep a pole upright while it’s attached to a cart. We can assume ideal conditions without any friction, and you are allowed to move the cart back and forth along the track. For every step in which the pole doesn’t fall, you’re rewarded +1 for your balancing skills. The goal of training the network is to balance the pole for as long as possible.

The end of the game is then defined as when the pole tilts more than 15 degrees, or when the cart runs out of space to move. Below is an example of a failed game. For convenience though, the environment doesn’t render what happens after the pole tilts beyond 15 degrees.

# Taking a Step

A step is one full cycle of interaction between your agent and the environment. Your agent (the neural network) takes the observation as input and outputs an action probability distribution. An action is then sampled from this distribution. Finally, the environment executes the action and returns a new observation and the associated reward. Note that rewards differ between game environments, but the goal is always to maximize the total reward.

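```python
import torch

# obs_tmp: the latest observation from the environment, as a NumPy array
torch_obs = torch.from_numpy(obs_tmp).float()  # match the model's float32 weights
action_prob_distribution = model(torch_obs)

# Sample an action from the probabilities the model produced
m = torch.distributions.Categorical(action_prob_distribution)
action = m.sample()
observation, reward, done, info = env.step(action.item())
```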

Figure 4: OpenAI and PyTorch code to take a step

# Judging an Action

In our problem, the agent takes a series of N actions before finding out whether the pole has fallen. How can we weigh how good or bad each of these N actions is? Intuitively, we’d discourage the last actions, which led to the failure of the game, and may actually encourage the early actions, which led to a long run. In the training step, then, we’d like to encourage the good actions and discourage the bad ones.

The Policy Gradient method encourages or discourages actions based on a value called the Advantage. The advantage is a value for each action that can be positive, negative, or zero. If the action has a positive advantage, we’ll encourage more of that action. Likewise, if the advantage is negative, we’ll try to discourage the agent from taking that action. Finally, the agent will be neutral on an action if the advantage is zero.
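One common way to write this idea down is as a loss over the N steps of an episode. The symbols here are introduced for illustration and are not part of the original text: $\pi_\theta$ is the policy (the network with weights $\theta$), $s_t$ and $a_t$ are the observation and action at step $t$, and $A_t$ is the advantage of that action.

$$
L(\theta) = -\sum_{t=0}^{N-1} A_t \, \log \pi_\theta(a_t \mid s_t)
$$

Minimizing $L$ raises the log-probability of actions with $A_t > 0$, lowers it for actions with $A_t < 0$, and leaves actions with $A_t = 0$ untouched.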

# How to Calculate the Advantage

In order to calculate the advantage, we calculate the discounted rewards that the agent has obtained during an episode. Rewards are discounted by the length of time so that rewards that can be obtained quicker are weighted more than rewards that you need to wait a long time for.

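```python
import numpy as np
import torch

# rewards: a list of the rewards collected at each time step of the episode
def calculate_discounted_rewards(rewards):
    # get_discounted_rewards returns the discounted reward for each time step
    discounted_rewards = np.asarray(get_discounted_rewards(rewards), dtype=np.float64)
    # Center around zero and scale by the standard deviation
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= np.std(discounted_rewards)
    return torch.Tensor(discounted_rewards)
```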

Figure 5: Discounted Reward Code

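Note that we center our discounted rewards around zero and normalize by the standard deviation. In CartPole, where every step earns +1, the discounted reward is largest at the start of an episode and smallest at the end, so after centering, the most recent timesteps are negative and the earliest timesteps are positive.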
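`get_discounted_rewards` is not defined in the snippet above, so one plausible version is sketched here in plain Python. The discount factor `gamma=0.99` and the helper `normalize` are assumptions for illustration, not the authors’ values or code.

```python
import math


def get_discounted_rewards(rewards, gamma=0.99):
    # Walk backwards so each step's value is its own reward plus the
    # discounted value of everything that follows it.
    discounted = [0.0] * len(rewards)
    running_total = 0.0
    for t in reversed(range(len(rewards))):
        running_total = rewards[t] + gamma * running_total
        discounted[t] = running_total
    return discounted


def normalize(values):
    # Center on zero and divide by the population standard deviation.
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


# A CartPole-style episode: +1 at each of five steps.
returns = get_discounted_rewards([1.0] * 5)
advantages = normalize(returns)
print([round(a, 2) for a in advantages])
```

The printed values start positive and end negative, which matches the pattern described above: early steps are credited for the long run, late steps are blamed for the fall.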

After we’ve finished playing an episode, we want to retrain our neural network to optimize the agent’s behavior. If a series of actions caused the CartPole to fall down, we want to reduce the probability of those actions.

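```python
# During the episode: sample an action and record its log-probability and reward
m = torch.distributions.Categorical(action_prob_distribution)
action = m.sample()
observation, reward, done, info = env.step(action.item())
log_probs.append(m.log_prob(action))
rewards.append(reward)

# After the episode ends: weight each log-probability by its normalized discounted reward
discounted_rewards = calculate_discounted_rewards(rewards)

# Negative log likelihood
loss = [-log_prob * r for log_prob, r in zip(log_probs, discounted_rewards)]
loss = torch.stack(loss).sum()  # stack also handles zero-dimensional log-probabilities

optimizer.zero_grad()
loss.backward()
optimizer.step()
```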

Figure 6: REINFORCE Method Training Code

# Reparameterization Trick

While we won’t try to completely explain the reparameterization trick in this post, we will give an overview of the concept. The REINFORCE agent essentially outputs a weight for each action, like the weights of a loaded die, and the action is chosen by rolling that die. We expect our model to learn this arbitrary distribution and to handle the probabilistic nature of the output in training.

The reparameterization trick moves that probabilistic nature outside of the model. We do this by changing the model’s output from a single value to the parameters of a probability distribution, which in our case is a Normal distribution. The new agent’s outputs, the mean and standard deviation of a Normal distribution, are then used to sample from that distribution, and the sample determines our action.
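To make the idea concrete, here is a minimal plain-Python sketch of a reparameterized draw from a Normal distribution. The function name `reparameterized_sample` is hypothetical; the point is only that the randomness comes from a fixed standard Normal, while the model’s `mu` and `std` shift and scale it.

```python
import random


def reparameterized_sample(mu, std, rng):
    # Draw noise from a standard Normal that does not depend on the model...
    eps = rng.gauss(0.0, 1.0)
    # ...then shift and scale it with the model's outputs.
    return mu + std * eps, eps


rng = random.Random(0)
sample, eps = reparameterized_sample(mu=0.5, std=2.0, rng=rng)

# Given the same noise, the sample is a deterministic function of mu and std.
assert sample == 0.5 + 2.0 * eps
```

In PyTorch, `torch.distributions.Normal` exposes this kind of draw as `rsample()`, whereas `sample()` returns a value that gradients do not flow through.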

The reparameterization trick has the model learn the parameters of a specific distribution rather than attempt to learn some arbitrary distribution. By outputting Normal distribution parameters, we’re constraining our output to a Gaussian rather than an arbitrary distribution. With this constraint, we could get a better model for our task.

One thing to note is that our outputs in both approaches are probability distributions. The reparameterization trick has been used in other settings, such as Variational Autoencoders, but may not be applicable to every use case. In the reinforcement learning case, the model outputs the parameters of a probability distribution to sample from; the sample is NOT itself the action, which still has to be determined from it in a final step.

# How To Do the Reparameterization Trick?

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReinforceAgent(nn.Module):
    def __init__(self, state_shape, action_shape):
        super().__init__()
        self.state_shape = state_shape
        self.action_shape = action_shape

        self.relu = nn.ReLU(inplace=True)
        self.linear1 = nn.Linear(state_shape, 24)
        self.linear2 = nn.Linear(24, 12)
        self.linear3 = nn.Linear(12, action_shape)

    def forward(self, state):
        x = F.relu(self.linear1(state))
        x = F.relu(self.linear2(x))
        # One probability per action
        action = F.softmax(self.linear3(x), dim=-1)
        return action


class ReparamTrickAgent(nn.Module):
    def __init__(self, state_shape, action_shape):
        super().__init__()
        self.state_shape = state_shape
        self.action_shape = action_shape

        self.relu = nn.ReLU(inplace=True)
        self.linear1 = nn.Linear(state_shape, 24)
        self.linear2 = nn.Linear(24, 12)

        # Instead of a single output per action, we have two: one for the mean
        # and one for the log variance (turned into a standard deviation in forward)
        init_w = 3e-3
        self.mu_linear = nn.Linear(12, action_shape)
        self.mu_linear.weight.data.uniform_(-init_w, init_w)
        self.mu_linear.bias.data.uniform_(-init_w, init_w)

        self.logvar_linear = nn.Linear(12, action_shape)
        self.logvar_linear.weight.data.uniform_(-init_w, init_w)
        self.logvar_linear.bias.data.uniform_(-init_w, init_w)

    def forward(self, state):
        x = F.relu(self.linear1(state))
        x = F.relu(self.linear2(x))
        mu = torch.tanh(self.mu_linear(x))  # Instead of softmax, we use a tanh activation
        logvar = torch.tanh(self.logvar_linear(x))
        std = torch.exp(0.5 * logvar)
        return mu, std
```

Figure 7: REINFORCE and Reparameterization Agent

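As we see in `ReinforceAgent`, we get a single value representing a weighted probability for each action. In `ReparamTrickAgent`, we instead compute two values for each action: a mean (`mu`) and a log variance (`logvar`), which is converted to a standard deviation (`std`) before being returned.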
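Because `logvar` passes through `tanh`, it lies between -1 and 1, which in turn bounds the standard deviation. The short calculation below is a sketch using only Python’s `math` module, with a hypothetical helper `std_from_logvar`, and shows the resulting range.

```python
import math


def std_from_logvar(logvar):
    # std = sqrt(variance) = exp(0.5 * log(variance))
    return math.exp(0.5 * logvar)


# tanh keeps logvar between -1 and 1, so std stays between roughly 0.61 and 1.65.
print(round(std_from_logvar(-1.0), 2), round(std_from_logvar(1.0), 2))
```

Predicting the log variance rather than the standard deviation directly means the result of `exp` is always positive, whatever the network outputs.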

# Reparameterization Trick Training

When using the reparameterization trick, the neural network outputs the parameters of a Normal distribution. The code below shows an example of training our vanilla Policy Gradient agent using the reparameterization trick.

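```python
# During the episode: sample one value per action from the Normal distribution
m = torch.distributions.Normal(mu, std)
action = m.sample()
log_prob = m.log_prob(action).sum(dim=-1)
action = torch.argmax(action)  # play the action with the largest sampled value
observation, reward, done, info = env.step(action.item())
log_probs.append(log_prob)
rewards.append(reward)

# After the episode ends: weight each log-probability by its normalized discounted reward
discounted_rewards = calculate_discounted_rewards(rewards)

# Negative log likelihood
loss = [-log_prob * r for log_prob, r in zip(log_probs, discounted_rewards)]
loss = torch.stack(loss).sum()  # stack also handles zero-dimensional log-probabilities

optimizer.zero_grad()
loss.backward()
optimizer.step()
```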

Figure 8: Reparameterization Method Training Code
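To see what the sampling lines compute, the same steps can be written out in plain Python. This is an illustrative sketch, not the authors’ code: `normal_log_prob` and `choose_action` are hypothetical helpers, and the `mus` and `stds` values are made up.

```python
import math
import random


def normal_log_prob(x, mu, std):
    # Log density of a Normal(mu, std) evaluated at x.
    return (
        -((x - mu) ** 2) / (2 * std ** 2)
        - math.log(std)
        - 0.5 * math.log(2 * math.pi)
    )


def choose_action(mus, stds, rng):
    # One Normal sample per action.
    samples = [rng.gauss(mu, std) for mu, std in zip(mus, stds)]
    # Sum of the per-action log-probabilities of those samples.
    log_prob = sum(
        normal_log_prob(x, mu, std) for x, mu, std in zip(samples, mus, stds)
    )
    # The action with the largest sampled value is sent to the environment.
    action = max(range(len(samples)), key=lambda i: samples[i])
    return action, log_prob


action, log_prob = choose_action(mus=[0.2, -0.1], stds=[1.0, 1.0], rng=random.Random(0))
print(action, log_prob)
```

Note that the log-probability describes the whole vector of samples, not only the entry that wins the argmax.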

# Conclusion

In this tutorial, we demonstrated how to train a reinforcement learning agent on the CartPole control problem. To make the model more robust, we applied the reparameterization trick. You can find our code on GitHub.

# Resources

We’re a team of Machine Learning Engineers exploring and researching deep learning technologies
