Parametrized Deep Q-Network(P-DQN)¶
Paper Link:https://arxiv.org/pdf/1810.06394
Parameterized Deep Q-Network(P-DQN) is a framework for the hybrid action space without approximation or relaxation,combining the spirits of both DQN (dealing with discrete action space) and DDPG (dealing with continuous action space) by seamlessly integrating them.
This table lists some general features about P-DQN algorithm:
Features of P-DQN |
Values |
Description |
|---|---|---|
On-policy |
❌ |
The evaluate policy is the same as the target policy. |
Off-Policy |
✅ |
The evaluate policy is different as the target policy. |
Model-free |
✅ |
No need to prepare an environment dynamics model. |
Model-based |
❌ |
Need an environment model to train the policy. |
Discrete Action |
✅ |
Deal with discrete action space. |
Continuous Action |
✅ |
Deal with continuous action space. |
Hybrid Action |
✅ |
Deal with hybrid action space. |
Discrete-Continuous Hybrid Action Space¶
The hybrid action is defiend by the following hierarchical structure. Firstly we choose a high level action \(\mathcal{k}\) from a discrete set \([K]\); upon choosing \(\mathcal{k}\), we further choose a low level parameter \(\mathcal{x_k}\in\mathcal{\mathcal{X}_k}\) which is associated with the \(k\)-th high level action.Here \(\mathcal{X}_k\) is a continuous set for all \(k\in[K]\).
Key Idea of P-DQN¶
In hybrid action space, we denote the action value function by \(Q(s,a)=Q(s,k,x_k)\) where \(s\in S\), \(k\in[K]\), and \(x_k\in\mathcal{X}_k\). Let \(k_t\) be the discrete action selected at time \(t\) and let \(x_t\) be the associated continuous parameter. Then the Bellman equation becomes:
When the function \(Q\) is fixed, for any \(s\in S\) and \(k\in[K]\), we can view \(argsup_{x_k\in\mathcal{X}_k}Q(s,k,x_k)\) as a function \(x_k^Q\): \(S→ \mathcal{X}_k\). Then Bellman equation can be rewrite as:
Similar to the deep Q-Network, we use a deep neural network \(Q(s,k,x_k,\omega)\) to approximate \(Q(s,k,x_k)\) where \(\omega\) denotes the network weights. For such a \(Q(s,k,x_k,\omega)\) we approximate \(x_k^Q\) with a deterministic policy network \(x_k(·;\theta):S→ \mathcal{X}_k\) where θ denotes the network weights of the policy network. When \(\omega\) is fixed, we want to find \(\theta\) such that:
Then similar to DQN, we could estimate \(\omega\) by minimizing the mean-squared Bellman error via gradient descent.
We use the least squares loss function for \(\omega\) like DQN, since we aim to find \(\theta\) that maximize \(Q(s,k,x_k(s;\theta);\omega)\) with \(\omega\) fixed, we use the loss function for \(\omega\) as following:
Then update the weights using stochastic gradient methods, we would minimize the loss function \(\ell^\Theta (\theta)\) when \(\omega_t\) is fixed.
Algorithm¶
The full algorithm for training P-DQN is presented in Algorithm 1:
Run P-DQN in XuanCe¶
Before running P-DQN in XuanCe, you need to prepare a conda environment and install xuance following
the installation steps.
Run With Custom Demos¶
If you would like to run XuanCe’s P-DQN in your own environment that was not included in XuanCe, you need to define the new environment following the steps in New Environment Tutorial. Then, prepapre the configuration filepdqn_myenv.yaml.
After that, you can run P-DQN in your own environment with the following code:
import argparse
from xuance.common import get_configs
from xuance.environment import REGISTER_ENV
from xuance.environment import make_envs
from xuance.torch.agents import PDQN_Agent
config_dict = get_configs(file_dir="pdqn_myenv.yaml")
configs = argparse.Namespace(**configs_dict)
REGISTRY_ENV[configs.env_name] = MyNewEnv
envs = make_envs(configs) # Make parallel environments.
Agent = PDQN_Agent(config=configs, envs=envs) # Create a PDQN agent from XuanCe.
Agent.train(configs.running_steps // configs.parallels) # Train the model for numerous steps.
Agent.save_model("final_train_model.pth") # Save the model to model_dir.
Agent.finish() # Finish the training.
Citation¶
@article{xiong2018parametrized,
title={Parametrized deep q-networks learning: Reinforcement learning with discrete-continuous hybrid action space},
author={Xiong, Jiechao and Wang, Qing and Yang, Zhuoran and Sun, Peng and Han, Lei and Zheng, Yang and Fu, Haobo and Zhang, Tong and Liu, Ji and Liu, Han},
journal={arXiv preprint arXiv:1810.06394},
year={2018}
}