PPO Explained for Dummies (With Python)

Published: Mar 9, 2022

Proximal Policy Optimization (PPO) is one of the most powerful reinforcement learning algorithms, balancing stability and efficiency. This article breaks down how AI gradually improves in decision-making using trial, error, and strategic policy updates—just like learning to ride a bike!

Plain Explanation of the PPO Algorithm

PPO Explained in Simple Terms

1. Basic Idea

Imagine you’re learning to ride a bike:

You try different actions (turning, braking, accelerating).

Some actions help you stay balanced (reward).

Some actions make you fall (punishment).

Over time, you remember which actions work best in different situations.

PPO works just like this but has a special advantage: it doesn’t change too much at once when learning.

2. Key Feature: Making Gradual Changes

Let’s say you discover that leaning slightly to the right helps with balance:

A regular learner might overreact and lean too much next time.

A PPO learner makes small, gradual adjustments to avoid overcorrecting.

This is why PPO includes the word "proximal"—it ensures the new strategy stays close to the previous one.

3. PPO’s "Clipping" Technique

PPO has a built-in safety mechanism like a "guardrail" for learning:

If an action turns out much better than before, PPO raises its probability only up to a limit, so it avoids getting overly fixated on it.

If an action performs poorly, PPO lowers its probability, again only up to a limit per update.

This "guardrail" ensures stable learning and prevents wild fluctuations.

How PPO Works Step by Step

Collect Experiences – The agent interacts with the environment, trying different actions and recording the results.

Evaluate Performance – It calculates an advantage score (how much better an action is compared to average performance).

Update Strategy Gradually – Adjusts the policy without making extreme changes:

If an action is good → Slightly increase its probability.

If an action is bad → Slightly decrease its probability.

Repeat the Process – The agent continuously learns through these steps.

Why is PPO So Popular?

Easy to implement – Simpler than many other reinforcement learning algorithms.

Stable and reliable – The learning process is smooth and doesn’t crash easily.

Performs well – Achieves great results across many tasks.

Efficient use of data – Can learn from the same batch of data multiple times.

PPO is like a cautious learner that steadily improves by learning from experience. It doesn’t overreact but rather makes small, careful improvements, ensuring consistent progress toward becoming an expert.

Python example

This code implements the PPO (Proximal Policy Optimization) algorithm to train a robot to play the "CartPole" game. We can break it down into four main parts:

1. The Brain Structure (Actor-Critic Network)

This serves as the robot’s brain, consisting of two key components:

Decision-making part (Actor): Determines which action to take (push the cart left or right).

Evaluation part (Critic): Assesses how good the current state is.

Think of it like playing chess:

Actor: "Which piece should I move?"

Critic: "Is the current position favorable or not?"

probs: The action probability distribution output by the Actor (probability of pushing the cart left or right).

state_value: The current state value estimation output by the Critic (how good the current state is).
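A minimal sketch of such a two-headed brain in PyTorch is shown here. The layer sizes, the single shared hidden layer, and the class name ActorCritic are assumptions for illustration; only the two outputs, probs and state_value, come from the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActorCritic(nn.Module):
    """Shared body with two heads: actor (action probabilities) and critic (state value)."""

    def __init__(self, state_dim=4, action_dim=2, hidden_dim=64):
        super().__init__()
        self.body = nn.Linear(state_dim, hidden_dim)    # shared feature layer
        self.actor = nn.Linear(hidden_dim, action_dim)  # one logit per action
        self.critic = nn.Linear(hidden_dim, 1)          # a single value estimate

    def forward(self, state):
        x = F.relu(self.body(state))
        probs = F.softmax(self.actor(x), dim=-1)  # probabilities sum to 1
        state_value = self.critic(x)              # shape (..., 1)
        return probs, state_value


policy = ActorCritic()
state = torch.zeros(1, 4)  # a dummy CartPole-like state (batch of one)
probs, state_value = policy(state)
# Sample an action from the Actor's distribution: 0 = push left, 1 = push right
action = torch.distributions.Categorical(probs).sample()
```

CartPole states have 4 numbers and there are 2 possible actions, which is why state_dim=4 and action_dim=2 are used as defaults.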

2. Trial and Improvement (Main Loop)

The main loop consists of five steps:

Set up the environment: Initialize the CartPole game and create the brain (neural network).

Collect experience: Let the robot play a round, recording all states, actions, and rewards.

Compute rewards: Evaluate the long-term value of each action.

Improve strategy: Adjust the brain to increase the likelihood of selecting better actions.

Repeat the process: Through continuous trial and improvement, the robot gets better over time.

Like a child learning to walk: try, fall, learn from mistakes, try again, and gradually become stable.

In short, the program enables the robot to continuously experiment, remember what works and what doesn’t, and steadily refine its strategy until it masters balancing the pole.

Set up the CartPole game and create the brain (neural network).

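optim.Adam(policy.parameters(), lr=0.01) creates the Adam optimizer that updates the brain’s weights, with a learning rate of 0.01.

for episode in range(max_episodes): runs the training loop for a total of 500 episodes. At the start of each episode, env.reset() starts a new game.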

while not done: collects all data from the current game session.

First, the agent receives the current state, the brain decides on an action, and the action is executed in the game (the cart keeps moving left and right).

The step_result contains the execution outcome.
The agent collects data and then continues playing with the new state.
Repeat: Get the current state → Decide an action → Execute in the game → Collect results.
This continues until the episode ends.
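The collection loop above can be sketched in plain Python as follows. The helper names unpack_step and collect_episode, and the ToyEnv stand-in, are hypothetical and exist only for illustration; a real run would pass in the CartPole environment instead. The sketch assumes step_result is either the older 4-value format (state, reward, done, info) or the newer 5-value format (state, reward, terminated, truncated, info).

```python
import random


def unpack_step(step_result):
    """Return (next_state, reward, done) for both the 4-tuple and 5-tuple step formats."""
    if len(step_result) == 5:  # newer API: obs, reward, terminated, truncated, info
        next_state, reward, terminated, truncated, _ = step_result
        return next_state, reward, terminated or truncated
    next_state, reward, done, _ = step_result  # older API: obs, reward, done, info
    return next_state, reward, done


def collect_episode(env, choose_action):
    """Play one episode and record every state, action, and reward."""
    states, actions, rewards = [], [], []
    reset_result = env.reset()
    # Newer APIs return (obs, info) from reset(); older ones return obs only.
    state = reset_result[0] if isinstance(reset_result, tuple) else reset_result
    done = False
    while not done:
        action = choose_action(state)
        next_state, reward, done = unpack_step(env.step(action))
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state  # continue playing from the new state
    return states, actions, rewards


class ToyEnv:
    """Hypothetical stand-in environment: +1 reward per step, ends after 5 steps."""

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        return [float(self.t), 0.0, 0.0, 0.0], 1.0, self.t >= 5, {}


states, actions, rewards = collect_episode(ToyEnv(), lambda s: random.choice([0, 1]))
print(len(rewards), sum(rewards))  # 5 5.0
```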

If the agent becomes very smart, it will avoid mistakes such as:

The pole tilting too much (beyond about 12 degrees from vertical).
The cart moving out of bounds (beyond ±2.4 units from the center).
If the agent survives 200 steps, the game ends because it reaches the step limit (200 in CartPole-v0; CartPole-v1 allows 500).
A well-trained agent should consistently survive for 200 steps.

After each game, the agent computes discounted rewards and updates the model using PPO.

With the updated model, the agent plays the next episode.

This process repeats for 500 episodes.

"Learning while playing" allows the model to gradually improve, with each training step based on the latest collected data.

Example result from a single episode (while not done: loop)

The agent successfully balanced the pole for 28 steps.
Then, the pole fell, or the cart went out of bounds, ending the episode.
As training progresses, this number should increase as the agent learns to perform better.

3. Experience Review (Computing Returns)

This step helps the agent evaluate the long-term value of each action, rather than just focusing on immediate rewards.

Analogy: Learning to Ski

Immediate reward: Not falling = +1

Long-term thinking: Did my chosen path ultimately get me to the finish line?

The gamma parameter (0.99) determines how important future rewards are.

A higher gamma means the agent values long-term rewards more.

A lower gamma makes the agent focus more on short-term rewards.
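To get a feel for gamma, this small sketch prints how much weight a reward keeps when it arrives 1, 10, or 50 steps in the future. The function name discount_weight and the comparison value 0.9 are chosen only for illustration.

```python
def discount_weight(gamma, steps_ahead):
    """Weight applied to a reward that arrives `steps_ahead` steps in the future."""
    return gamma ** steps_ahead


for gamma in (0.99, 0.9):
    weights = [round(discount_weight(gamma, k), 3) for k in (1, 10, 50)]
    print(gamma, weights)
```

With gamma = 0.99 a reward 50 steps away still counts for roughly 60% of its value, while with gamma = 0.9 it counts for less than 1%.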

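The formula for Discounted Cumulative Rewards is:

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t} r_T = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$$

The recursive form simplifies to:

$$R_t = r_t + \gamma R_{t+1}, \qquad R_{T+1} = 0$$

$R_t$ → Discounted cumulative reward at time step t

$r_t$ → Immediate reward at time step t

$\gamma$ → Discount factor (between 0 and 1, in this case 0.99)

$T$ → Final time step of the episode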

Why Do We Use Discounted Rewards?

Immediate rewards may not reflect the long-term impact of an action.

Discounted rewards help evaluate each action’s contribution to future success.

The discount factor γ controls how much future rewards influence current decisions.

This is not unique to PPO, but a fundamental component in reinforcement learning, used in many algorithms.

An Example in the CartPole Game

Each step earns a reward of 1 (because the pole hasn’t fallen).

Suppose the game lasts 5 steps before ending.

Let's compute the discounted cumulative rewards:

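Step 5 (last step): $R_5 = 1$

Step 4: $R_4 = 1 + 0.99 \times 1 = 1.99$

Step 3: $R_3 = 1 + 0.99 \times 1.99 = 2.9701$

Step 2: $R_2 = 1 + 0.99 \times 2.9701 \approx 3.9404$

Step 1 (start): $R_1 = 1 + 0.99 \times 3.9404 \approx 4.9010$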

Earlier actions receive higher scores because more future rewards are still ahead of them: a good early action keeps the pole up for every step that follows.

Each action's score is retained instead of being summed (unlike total episode rewards).

In PPO and other policy gradient algorithms, these reward values guide the policy updates, pushing the agent to favor actions that maximize long-term rewards.
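The five-step CartPole example can be reproduced with a few lines of plain Python. This is a sketch of the backward recursion, not necessarily the exact function used in the full program; the name compute_returns is an assumption.

```python
def compute_returns(rewards, gamma=0.99):
    """Discounted return for every step, computed backwards from the end."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running  # R_t = r_t + gamma * R_{t+1}
        returns.insert(0, running)
    return returns


returns = compute_returns([1, 1, 1, 1, 1])
print([round(r, 4) for r in returns])  # [4.901, 3.9404, 2.9701, 1.99, 1.0]
```

Working backwards means each step reuses the result of the step after it, so the whole list is computed in a single pass.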

4. Learning Optimization (PPO Update)

This is the core learning algorithm, consisting of several key steps:

Compute the advantage → Determine how much better each action is compared to the average.

Calculate the ratio between new and old policies → Measure the difference between the updated policy and the previous one.

Apply a "guardrail" (clipping parameter 0.2) → Restrict update magnitude to prevent drastic changes.

Simultaneously learn decision-making (policy) and state evaluation (value function).

This is similar to learning to ride a bike:

You don’t drastically change your posture after one successful ride.

Instead, you make small, gradual optimizations for stability and improvement.

returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # Normalization stabilizes training by putting the returns on a standard scale (zero mean, unit standard deviation).

_, values = policy(states)  # Retrieve the state value for every step in the game

advantages = returns - values.squeeze()  # A = R - V(s): advantage = return - baseline

values.squeeze() removes the size-1 dimension from the state value tensor, so its shape matches that of returns.

Advantage = actual return - predicted state value

This measures "excess return over expectation." If the actual return is higher than the value network’s prediction, the advantage is positive, indicating the action performed better than expected. Otherwise, it suggests the action underperformed.
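Here is a small plain-Python sketch of the normalization and advantage steps. The critic predictions in values are made-up numbers for illustration, and statistics.stdev is used because, like torch.std by default, it divides by n - 1.

```python
from statistics import mean, stdev


def normalize(values, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


returns = normalize([4.90, 3.94, 2.97, 1.99, 1.00])  # discounted returns from a 5-step episode
values = [0.5, 0.5, 0.0, 0.0, -0.5]                  # hypothetical critic predictions
advantages = [r - v for r, v in zip(returns, values)]

for r, v, a in zip(returns, values, advantages):
    print(f"return={r:+.2f}  value={v:+.2f}  advantage={a:+.2f}")
```

A positive advantage means the step turned out better than the critic expected; a negative one means it turned out worse.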

probs, _ = policy(states) # Retrieve action probability distributions for all states

ratio = torch.exp(new_log_probs - old_log_probs)  # Ratio of action probabilities under the new and old policies

surrogate1 = ratio * advantages  # Unclipped surrogate objective

surrogate2 = torch.clamp(ratio, 1 - clip_param, 1 + clip_param) * advantages

torch.clamp function clamps values within a specified range:

All ratios below 0.8 are set to 0.8
All ratios above 1.2 are set to 1.2
Ratios between 0.8 and 1.2 remain unchanged
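The ratio and the clamp can be tried out on single numbers without PyTorch. In this sketch, probability_ratio and clamp are illustrative helper names, and the probabilities are made up.

```python
import math


def probability_ratio(new_prob, old_prob):
    """pi_new(a|s) / pi_old(a|s), computed through log-probabilities."""
    return math.exp(math.log(new_prob) - math.log(old_prob))


def clamp(value, low, high):
    """Pure-Python equivalent of torch.clamp for a single number."""
    return max(low, min(value, high))


clip_param = 0.2
old_prob = 0.5  # probability of the action under the old policy
for new_prob in (0.2, 0.45, 0.5, 0.55, 0.9):
    ratio = probability_ratio(new_prob, old_prob)
    clipped = clamp(ratio, 1 - clip_param, 1 + clip_param)
    print(f"new_prob={new_prob:.2f}  ratio={ratio:.2f}  clipped={clipped:.2f}")
```

A ratio of 1.0 means the policy has not changed for that action; only the first and last rows fall outside [0.8, 1.2] and get clipped.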

Significance in PPO

This clipping mechanism is a key feature of the PPO algorithm:

Prevents overly large updates: Limits the change in policy updates

Stabilizes training: Ensures the new policy doesn’t diverge too far from the old one

Encourages conservative updates: Once the ratio moves past the threshold in the direction the advantage favors, the objective stops rewarding any further change

policy_loss = -torch.min(surrogate1, surrogate2).mean()

The core innovation of PPO:

surrogate1: Unclipped objective (ratio × advantage)

surrogate2: Clipped objective (ratio limited to [0.8, 1.2], then multiplied by the advantage)

For each sample, the smaller of the two values is selected.

Selecting the smaller value prevents excessive policy updates

Taking the negative mean ensures we maximize returns (by minimizing negative returns)
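To see what taking the minimum does, here is the per-sample objective written for single numbers. This is a sketch of the standard clipped objective; the function name clipped_objective and the sample values are illustrative.

```python
def clipped_objective(ratio, advantage, clip_param=0.2):
    """Per-sample PPO objective: min(ratio * A, clip(ratio) * A)."""
    clipped_ratio = max(1 - clip_param, min(ratio, 1 + clip_param))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A = +1): the objective stops growing once the ratio passes 1.2.
print(clipped_objective(1.1, 1.0))  # 1.1
print(clipped_objective(1.5, 1.0))  # 1.2 (capped)

# Bad action (A = -1): the objective stops growing once the ratio drops below 0.8.
print(clipped_objective(0.9, -1.0))  # -0.9
print(clipped_objective(0.5, -1.0))  # -0.8 (capped)

# Bad action whose probability went UP: no cap, the full penalty applies.
print(clipped_objective(1.5, -1.0))  # -1.5
```

The last case shows why the minimum matters: the cap only removes the incentive to move further in the favored direction, while a move in the wrong direction is still fully penalized and can be corrected.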

value_loss = F.mse_loss(values.squeeze(), returns)  # Mean squared error between the predicted state value and the return

This represents the mean squared error (MSE) between the value network’s prediction and actual return.
The goal is for the value network to accurately predict future returns.
This is a common approach across many Actor-Critic algorithms, including A2C, DDPG, etc.
Typically, value networks are trained using mean squared error loss.

loss = policy_loss + 0.5 * value_loss

The 0.5 coefficient is a hyperparameter, not a fixed rule:

It balances the relative importance of policy updates and value updates
0.5 means the value loss is weighted half as much as the policy loss
Different implementations may use different coefficients, ranging from 0.1 to 1
This is a heuristic choice, often adjusted for different tasks

The original PPO paper treats this coefficient as a tunable hyperparameter rather than a fixed constant, so 0.5 is a common but not mandatory choice.

Many implementations also add an entropy bonus to encourage exploration, giving the loss function three terms.
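A three-term loss of this kind might look like the sketch below. The entropy coefficient 0.01 and the function names are assumptions for illustration; the code described in this article uses only the first two terms.

```python
import math


def entropy(probs):
    """Shannon entropy of a discrete action distribution (higher = more exploration)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def total_loss(policy_loss, value_loss, probs, value_coef=0.5, entropy_coef=0.01):
    """Policy loss + weighted value loss - weighted entropy bonus."""
    return policy_loss + value_coef * value_loss - entropy_coef * entropy(probs)


print(round(entropy([0.5, 0.5]), 3))    # 0.693: maximally uncertain between two actions
print(round(entropy([0.99, 0.01]), 3))  # 0.056: almost deterministic
```

The entropy term is subtracted because the loss is minimized: a policy that still explores (high entropy) gets a slightly lower loss.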

Is policy loss more important than value loss?

Both are crucial. Policy loss directly improves the agent’s decision-making, while value loss helps better estimate state values, leading to more accurate advantage calculations.
They complement each other, jointly enhancing learning performance.

optimizer.step()  # Performs one optimizer step to update the parameters.
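Putting the pieces of this section together, one update could be sketched as follows. TinyPolicy and the batch of states, actions, and returns are made-up stand-ins. The calls to optimizer.zero_grad() and loss.backward() are the usual PyTorch steps before optimizer.step(), and the .detach() on the values is a common choice that is assumed here rather than taken from the code above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class TinyPolicy(nn.Module):
    """Hypothetical stand-in for the article's actor-critic network."""

    def __init__(self):
        super().__init__()
        self.actor = nn.Linear(4, 2)
        self.critic = nn.Linear(4, 1)

    def forward(self, states):
        return F.softmax(self.actor(states), dim=-1), self.critic(states)


def ppo_update(policy, optimizer, states, actions, old_log_probs, returns, clip_param=0.2):
    probs, values = policy(states)
    values = values.squeeze(-1)
    advantages = returns - values.detach()  # detach: the policy loss should not train the critic
    new_log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
    ratio = torch.exp(new_log_probs - old_log_probs)
    surrogate1 = ratio * advantages
    surrogate2 = torch.clamp(ratio, 1 - clip_param, 1 + clip_param) * advantages
    policy_loss = -torch.min(surrogate1, surrogate2).mean()
    value_loss = F.mse_loss(values, returns)
    loss = policy_loss + 0.5 * value_loss
    optimizer.zero_grad()  # clear gradients left over from the previous update
    loss.backward()        # backpropagate through both heads
    optimizer.step()       # apply one Adam step
    return loss.item()


torch.manual_seed(0)
policy = TinyPolicy()
optimizer = optim.Adam(policy.parameters(), lr=0.01)
states = torch.randn(5, 4)                           # made-up batch of 5 states
actions = torch.tensor([0, 1, 1, 0, 1])              # made-up actions
returns = torch.tensor([1.2, 0.6, 0.0, -0.6, -1.2])  # already-normalized returns
with torch.no_grad():  # the old policy's log-probabilities are fixed numbers
    old_probs, _ = policy(states)
    old_log_probs = torch.log(old_probs.gather(1, actions.unsqueeze(1)).squeeze(1))
for _ in range(4):  # reuse the same batch for several passes
    loss = ppo_update(policy, optimizer, states, actions, old_log_probs, returns)
```

Because old_log_probs is computed once and kept fixed, the ratio starts at exactly 1 and drifts away from it as the same batch is reused, which is when the clipping starts to matter.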

Key Innovation: Stable Learning

surrogate1 = ratio * advantages

surrogate2 = torch.clamp(ratio, 1 - clip_param, 1 + clip_param) * advantages

policy_loss = -torch.min(surrogate1, surrogate2).mean()

This is the essence of PPO, setting up a "guardrail" to prevent excessive learning:

If an action is particularly good, it won’t become overly favored immediately.

If an action is particularly bad, its probability is reduced, but likewise only by a limited amount per update.

Just like learning to drive: you don’t make a sharp turn just because one small correction worked, but rather adjust steadily and progressively.