Categorical 51 DQN¶
Paper Link: https://proceedings.mlr.press/v70/bellemare17a.html.
The C51 algorithm (Categorical DQN) is a variant of DQN that takes a distributional approach to reinforcement learning. Instead of predicting a single scalar Q-value (the expected return), C51 predicts a probability distribution over a discrete set of possible returns, so the agent learns not only the expected return but also the uncertainty around it.
The following table lists some general features of the C51 algorithm:
| Features of C51 | Values | Description |
|---|---|---|
| On-policy | ❌ | The behavior policy is the same as the target policy. |
| Off-policy | ✅ | The behavior policy is different from the target policy. |
| Model-free | ✅ | Does not require a model of the environment dynamics. |
| Model-based | ❌ | Requires an environment model to train the policy. |
| Discrete Action | ✅ | Handles discrete action spaces. |
| Continuous Action | ❌ | Handles continuous action spaces. |
Method¶
Problem With DQN¶
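In regular DQN, the goal is to learn the expected return for each action in a given state:
\[Q(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t} \,\middle|\, S_0 = s, A_0 = a\right],\]
where \(\gamma \in [0, 1)\) is the discount factor and \(R_t\) is the reward received at step \(t\).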
Instead of learning a single number \(Q(s, a)\), C51 learns the entire distribution of possible returns for each action, denoted as \(Z(s, a)\).
C51 divides the possible range of returns (e.g., from -1 to 1) into a fixed number of discrete values (called atoms); the "51" in the name refers to the 51 atoms used in the original paper. For each atom, the algorithm predicts the probability that the return takes that value.
With the full distribution available, the agent has access to both the expected return and the uncertainty around it.
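To illustrate what the expectation alone can hide, the following minimal sketch compares two hypothetical return distributions defined on the same three atoms. The atom values, the probabilities, and the helper names `mean` and `variance` are assumptions made for this example only.

```python
# Two hypothetical return distributions over the same three atoms.
atoms = [-1.0, 0.0, 1.0]
safe_probs = [0.0, 1.0, 0.0]   # the return is always 0
risky_probs = [0.5, 0.0, 0.5]  # the return is -1 or +1 with equal probability


def mean(atoms, probs):
    """Expected return, i.e. the quantity a scalar Q-value represents."""
    return sum(z * p for z, p in zip(atoms, probs))


def variance(atoms, probs):
    """Spread of the return around its mean."""
    m = mean(atoms, probs)
    return sum(p * (z - m) ** 2 for z, p in zip(atoms, probs))


print(mean(atoms, safe_probs), variance(atoms, safe_probs))
print(mean(atoms, risky_probs), variance(atoms, risky_probs))
```

Both distributions have the same mean, so a scalar Q-value cannot tell them apart, while their variances differ.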
Categorical Representation¶
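To approximate \(Z(s, a)\), C51 represents it as a categorical distribution over \(N\) discrete atoms within a predefined value range \([v_{min}, v_{max}]\). The atoms are defined as:
\[z_i = v_{min} + i \Delta z, \quad \Delta z = \frac{v_{max} - v_{min}}{N - 1}, \quad i = 0, 1, \dots, N - 1.\]
Each atom \(z_i\) is associated with a probability \(p_i\), forming the categorical distribution:
\[P\big(Z(s, a) = z_i\big) = p_i(s, a), \quad \sum_{i=0}^{N-1} p_i(s, a) = 1.\]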
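As a minimal sketch of this representation, the code below builds evenly spaced atoms \(z_i = v_{min} + i \Delta z\) with \(\Delta z = (v_{max} - v_{min}) / (N - 1)\), and recovers a scalar Q-value as the probability-weighted sum of the atoms. The function names `make_atoms`, `q_value`, and `greedy_action` are hypothetical and not part of XuanCe. The values \(N = 51\), \(v_{min} = -10\), and \(v_{max} = 10\) match the Atari settings reported in the original paper, and choosing the action with the highest expected value is the rule used there.

```python
def make_atoms(v_min, v_max, n_atoms):
    """Return the evenly spaced atoms z_0..z_{N-1} and the spacing delta_z."""
    delta_z = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * delta_z for i in range(n_atoms)], delta_z


def q_value(atoms, probs):
    """Expected return under a categorical distribution: sum_i z_i * p_i."""
    return sum(z * p for z, p in zip(atoms, probs))


def greedy_action(atoms, probs_per_action):
    """Pick the action whose return distribution has the largest mean."""
    q_values = [q_value(atoms, probs) for probs in probs_per_action]
    return max(range(len(q_values)), key=q_values.__getitem__)


atoms, delta_z = make_atoms(v_min=-10.0, v_max=10.0, n_atoms=51)

# Hypothetical network outputs for two actions: one distribution per action.
uniform = [1.0 / 51] * 51          # mass spread evenly over all atoms
optimistic = [0.0] * 50 + [1.0]    # all mass on the largest atom
print(greedy_action(atoms, [uniform, optimistic]))
```

In a real agent the probabilities would come from a softmax over the network's \(N\) outputs for each action; here they are written by hand to keep the example self-contained.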
Distributional Bellman Equation¶
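The Bellman equation for the return distribution is given by:
\[Z(s, a) \overset{D}{=} R + \gamma Z(S', A'),\]
where \(\overset{D}{=}\) denotes equality in distribution.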
This equation states that the distribution of returns for the current state-action pair is determined by the immediate reward \(R\) and the discounted distribution of returns from the next state \(S'\).
In practice, the shifted and scaled target distribution \(R + \gamma Z(S', A')\) generally does not lie on the fixed atoms, so it is projected back onto the fixed set of atoms \(z_i\) to maintain consistency.
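The projection step can be sketched for a single transition as follows. This is a plain-Python illustration of the categorical projection described in the original paper, not XuanCe's implementation; the function name `project_target` and its arguments are hypothetical. Each next-state atom is shifted by the reward and scaled by the discount factor, clipped to \([v_{min}, v_{max}]\), and its probability is split between the two nearest atoms in proportion to proximity.

```python
import math


def project_target(next_probs, atoms, reward, gamma, done=False):
    """Project the distribution of reward + gamma * Z(S', A') onto the fixed atoms."""
    n_atoms = len(atoms)
    v_min, v_max = atoms[0], atoms[-1]
    delta_z = (v_max - v_min) / (n_atoms - 1)
    target = [0.0] * n_atoms
    for z_j, p_j in zip(atoms, next_probs):
        # Shift and scale the atom, then clip it to [v_min, v_max].
        tz = reward if done else reward + gamma * z_j
        tz = min(max(tz, v_min), v_max)
        # Fractional index of tz on the atom grid, clamped against rounding error.
        b = min(max((tz - v_min) / delta_z, 0.0), n_atoms - 1.0)
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:
            # tz falls exactly on an atom: it receives all of the mass.
            target[lower] += p_j
        else:
            # Split the mass between the two neighbouring atoms.
            target[lower] += p_j * (upper - b)
            target[upper] += p_j * (b - lower)
    return target


atoms = [-1.0, 0.0, 1.0]
next_probs = [0.0, 0.0, 1.0]  # all next-state mass on the atom z = 1
projected = project_target(next_probs, atoms, reward=0.0, gamma=0.5)
print(projected)
```

In this example the single occupied atom moves to \(0.5\), halfway between the atoms \(0\) and \(1\), so its mass is shared equally between them. The loop visits each atom once, so the cost grows linearly with the number of atoms.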
Algorithm¶
The full algorithm for training C51 is presented in Algorithm 1.
Note
Algorithm 1 computes the projection in time linear in \(N\), the number of atoms.
Run C51 in XuanCe¶
Before running C51 in XuanCe, you need to prepare a conda environment and install xuance following
the installation steps.
Run Built-in Demos¶
After completing the installation, you can open a Python console and run C51 directly using the following commands:
```python
import xuance

runner = xuance.get_runner(method='c51',
                           env='classic_control',  # Choices: classic_control, box2d, atari.
                           env_id='CartPole-v1',  # Choices: CartPole-v1, LunarLander-v2, ALE/Breakout-v5, etc.
                           is_test=False)
runner.run()  # Or runner.benchmark()
```
Run With Self-defined Configs¶
If you want to run C51 with different configurations, you can create a new .yaml file, e.g., my_config.yaml.
Then run C51 with the following code:
```python
import xuance as xp

runner = xp.get_runner(method='c51',
                       env='classic_control',  # Choices: classic_control, box2d, atari.
                       env_id='CartPole-v1',  # Choices: CartPole-v1, LunarLander-v2, ALE/Breakout-v5, etc.
                       config_path="my_config.yaml",  # Make sure this path points to your my_config.yaml file.
                       is_test=False)
runner.run()  # Or runner.benchmark()
```
To learn more about the configurations, please visit the tutorial of configs.
Run With Custom Environment¶
If you would like to run XuanCe's C51 in your own environment that is not included in XuanCe,
you need to define the new environment following the steps in the
New Environment Tutorial.
Then, prepare the configuration file
c51_myenv.yaml.
After that, you can run C51 in your own environment with the following code:
```python
import argparse
from xuance.common import get_configs
from xuance.environment import REGISTRY_ENV
from xuance.environment import make_envs
from xuance.torch.agents import C51_Agent

# MyNewEnv is your own environment class, defined as described in the New Environment Tutorial.
configs_dict = get_configs(file_dir="c51_myenv.yaml")
configs = argparse.Namespace(**configs_dict)
REGISTRY_ENV[configs.env_name] = MyNewEnv  # Register the custom environment.

envs = make_envs(configs)  # Make parallel environments.
agent = C51_Agent(config=configs, envs=envs)  # Create a C51 agent from XuanCe.
agent.train(configs.running_steps // configs.parallels)  # Train for the configured number of steps.
agent.save_model("final_train_model.pth")  # Save the model to model_dir.
agent.finish()  # Finish the training.
```
Citation¶
@inproceedings{bellemare2017distributional,
title={A distributional perspective on reinforcement learning},
author={Bellemare, Marc G and Dabney, Will and Munos, R{\'e}mi},
booktitle={International conference on machine learning},
pages={449--458},
year={2017},
organization={PMLR}
}