Policy Gradient (PG)¶
Paper Link: Download PDF
The Policy Gradient (PG) algorithm, introduced by Richard Sutton et al. in their seminal 1999 paper “Policy Gradient Methods for Reinforcement Learning with Function Approximation”, is a foundational approach in reinforcement learning for optimizing policies directly. It is particularly effective in scenarios where value-based methods like Q-learning struggle, such as high-dimensional or continuous action spaces.
Features of PG |
Values |
Description |
|---|---|---|
On-policy |
✅ |
The evaluate policy is the same as the target policy. |
Off-policy |
❌ |
The evaluate policy is different from the target policy. |
Model-free |
✅ |
No need to prepare an environment dynamics model. |
Model-based |
❌ |
Need an environment model to train the policy. |
Discrete Action |
✅ |
Deal with discrete action space. |
Continuous Action |
✅ |
Deal with continuous action space. |
Method¶
Motivation¶
The goal of PG methods is to directly optimize the policy \(\pi_{\theta}(a | s)\), parameterized by \(\theta\), by maximizing the expected cumulative reward:
Instead of approximating the value function, PG methods compute the gradient of the objective \(J(\theta)\) with respect to the policy parameters \(\theta\), and then update \(\theta\) to maximize \(J(\theta)\).
Policy Gradient¶
The core of the PG algorithm is the policy gradient theorem, which states:
where,
\(\pi_{\theta}(a|s)\): The stochastic policy that outputs the probabilities of taking action \(a\) in state \(s\).
\(Q^{\pi_{\theta}}(s, a)\): The action-value function under the current policy.
\(\nabla_{\theta}\log{\pi_{\theta}(a|s)}\): The gradient of the log-policy with respect to its parameters \(\theta\), often called the score function.
This formulation enables the policy to be updated by following the gradient of the expected reward.
Run PG in XuanCe¶
Before running PG in XuanCe, you need to prepare a conda environment and install xuance following
the installation steps.
Run Build-in Demos¶
After completing the installation, you can open a Python console and run PG directly using the following commands:
import xuance
runner = xuance.get_runner(method='pg',
env='classic_control', # Choices: claasi_control, box2d, atari.
env_id='CartPole-v1', # Choices: CartPole-v1, LunarLander-v2, ALE/Breakout-v5, etc.
is_test=False)
runner.run() # Or runner.benchmark()
Run With Self-defined Configs¶
If you want to run PG with different configurations, you can build a new .yaml file, e.g., my_config.yaml.
Then, run the PG by the following code block:
import xuance as xp
runner = xp.get_runner(method='pg',
env='classic_control', # Choices: claasi_control, box2d, atari.
env_id='CartPole-v1', # Choices: CartPole-v1, LunarLander-v2, ALE/Breakout-v5, etc.
config_path="my_config.yaml", # The path of my_config.yaml file should be correct.
is_test=False)
runner.run() # Or runner.benchmark()
To learn more about the configurations, please visit the tutorial of configs.
Run With Custom Environment¶
If you would like to run XuanCe’s PG in your own environment that was not included in XuanCe,
you need to define the new environment following the steps in
New Environment Tutorial.
Then, prepapre the configuration file
pg_myenv.yaml.
After that, you can run PG in your own environment with the following code:
import argparse
from xuance.common import get_configs
from xuance.environment import REGISTRY_ENV
from xuance.environment import make_envs
from xuance.torch.agents import PG_Agent
configs_dict = get_configs(file_dir="pg_myenv.yaml")
configs = argparse.Namespace(**configs_dict)
REGISTRY_ENV[configs.env_name] = MyNewEnv
envs = make_envs(configs) # Make parallel environments.
Agent = PG_Agent(config=configs, envs=envs) # Create a DDPG agent from XuanCe.
Agent.train(configs.running_steps // configs.parallels) # Train the model for numerous steps.
Agent.save_model("final_train_model.pth") # Save the model to model_dir.
Agent.finish() # Finish the training.
Citation¶
@article{sutton1999policy,
title={Policy gradient methods for reinforcement learning with function approximation},
author={Sutton, Richard S and McAllester, David and Singh, Satinder and Mansour, Yishay},
journal={Advances in neural information processing systems},
volume={12},
year={1999}
}