Configuration Examples

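Take the DQN algorithm in the Atari environment as an example: in addition to the basic parameter configuration, the algorithm-specific parameters are stored in the file "xuance/configs/dqn/atari.yaml".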

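The Atari environment contains more than 60 different scenarios. They differ mainly in the task rather than in the environment structure, so a single default configuration file is sufficient for most cases.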

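Some environments contain scenarios that differ substantially. In the "Box2D" environment, for example, the "CarRacing-v2" scenario takes a 96×96×3 RGB image as its state input, while the "LunarLander" scenario takes an 8-dimensional vector. The DQN parameter configurations for these two scenarios are therefore stored in two separate files: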

xuance/configs/dqn/box2d/CarRacing-v2.yaml

xuance/configs/dqn/box2d/LunarLander-v2.yaml
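
The file layout described above can be sketched as a small lookup helper. The function name `resolve_config_path` and the `SHARED_CONFIG_ENVS` set are hypothetical and exist only for illustration; XuanCe's own loader may resolve paths differently.

```python
from pathlib import PurePosixPath

# Hypothetical: environments whose scenarios all share one default config file.
SHARED_CONFIG_ENVS = {"atari"}


def resolve_config_path(method: str, env: str, env_id: str,
                        root: str = "xuance/configs") -> str:
    """Return the YAML path for an algorithm/environment pair (illustrative only)."""
    base = PurePosixPath(root) / method
    if env in SHARED_CONFIG_ENVS:
        # One file covers every scenario of this environment.
        return str(base / f"{env}.yaml")
    # Otherwise each scenario has its own file inside an environment folder.
    return str(base / env / f"{env_id}.yaml")


print(resolve_config_path("dqn", "atari", "ALE/Breakout-v5"))
print(resolve_config_path("dqn", "box2d", "CarRacing-v2"))
print(resolve_config_path("dqn", "box2d", "LunarLander-v2"))
```

The three calls reproduce the three paths named in this section.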

The following content lists the preset arguments for each implementation. Each can be run by following the steps in Quick Start.

Value-based Algorithms

agent: "DQN"
env_name: "Classic Control"
env_id: "CartPole-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
policy: "Basic_Q_network"
representation: "Basic_MLP"
learner: "DQN_Learner"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 100000
batch_size: 256
learning_rate: 0.001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.01
decay_step_greedy: 100000
sync_frequency: 50
training_frequency: 1
running_steps: 200000  # 200k
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 20000
test_episode: 1
log_dir: "logs/dqn/"
model_dir: "models/dqn/"
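
To make the three exploration arguments concrete, here is a minimal sketch of an epsilon schedule. It assumes that `start_greedy` and `end_greedy` are exploration probabilities and that the decay between them is linear over `decay_step_greedy` steps; both points are assumptions, and the function name `epsilon_at` is hypothetical.

```python
def epsilon_at(step: int,
               start_greedy: float = 0.5,
               end_greedy: float = 0.01,
               decay_step_greedy: int = 100000) -> float:
    """Linearly interpolate epsilon from start_greedy to end_greedy (assumed schedule)."""
    # Fraction of the decay that has elapsed, clamped to [0, 1].
    frac = min(max(step, 0) / decay_step_greedy, 1.0)
    return start_greedy + frac * (end_greedy - start_greedy)


for step in (0, 50000, 100000, 200000):
    print(step, round(epsilon_at(step), 4))
```

Under these assumptions, epsilon stays at `end_greedy` for the second half of the 200k `running_steps`.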

agent: "DQN"
env_name: "Classic Control"
env_id: "Acrobot-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
policy: "Basic_Q_network"
representation: "Basic_MLP"
learner: "DQN_Learner"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 100000
batch_size: 256
learning_rate: 0.001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.01
decay_step_greedy: 100000
sync_frequency: 50
training_frequency: 1
running_steps: 200000  # 200k
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 20000
test_episode: 1
log_dir: "logs/dqn/"
model_dir: "models/dqn/"

agent: "DQN"
env_name: "Classic Control"
env_id: "MountainCar-v0"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
policy: "Basic_Q_network"
representation: "Basic_MLP"
learner: "DQN_Learner"
runner: "DRL"

representation_hidden_size: [256, ]
q_hidden_size: [256, ]
activation: 'leaky_relu'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 256
learning_rate: 0.1
gamma: 0.99

start_greedy: 1.0
end_greedy: 0.01
decay_step_greedy: 100000
sync_frequency: 200
training_frequency: 2
running_steps: 2000000  # 2M
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 5
log_dir: "logs/dqn/"
model_dir: "models/dqn/"
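
The interaction of `start_training`, `training_frequency`, and `sync_frequency` can be estimated with a rough count. The sketch assumes one update every `training_frequency` steps once `start_training` steps have passed, and one target-network sync every `sync_frequency` updates. How XuanCe actually counts steps (for example across parallel environments) is not modeled, and `count_updates` is a hypothetical helper.

```python
def count_updates(running_steps: int, start_training: int,
                  training_frequency: int, sync_frequency: int):
    """Estimate the number of model updates and target syncs (assumed schedule)."""
    # Steps in (start_training, running_steps] that are multiples of training_frequency.
    updates = running_steps // training_frequency - start_training // training_frequency
    # Assumption: the target network is copied once every sync_frequency updates.
    syncs = updates // sync_frequency
    return updates, syncs


# Values taken from the MountainCar-v0 configuration above.
updates, syncs = count_updates(running_steps=2000000, start_training=1000,
                               training_frequency=2, sync_frequency=200)
print(updates, syncs)
```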

agent: "DQN"
vectorize: "Dummy_Atari"
env_name: "Atari"
env_id: "ALE/Breakout-v5"
env_seed: 1  # The random seed of the environment.
obs_type: "grayscale"  # choice for Atari env: ram, rgb, grayscale
img_size: [84, 84]  # default is 210 x 160 in gym[Atari]
num_stack: 4  # frame stack trick
frame_skip: 4  # frame skip trick
noop_max: 30  # Do no-op action for a number of steps in [1, noop_max].
policy: "Basic_Q_network"
representation: "Basic_CNN"
learner: "DQN_Learner"
runner: "DRL"

# the following three arguments are for "Basic_CNN" representation.
filters: [32, 64, 64]  #  [16, 16, 32, 32]
kernels: [8, 4, 3]  # [8, 6, 4, 4]
strides: [4, 2, 1]  # [2, 2, 2, 2]

q_hidden_size: [512, ]
activation: "relu"  # The activation function of each hidden layer.

seed: 1069
parallels: 5
buffer_size: 500000
batch_size: 32  # 64
learning_rate: 0.0001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.05
decay_step_greedy: 1000000  # 1M
sync_frequency: 500
training_frequency: 1
running_steps: 50000000  # 50M
start_training: 10000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 500000
test_episode: 5
log_dir: "logs/dqn/"
model_dir: "models/dqn/"
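
The `filters`, `kernels`, and `strides` lists define three convolution layers. The sketch below computes the feature-map shape they produce on an 84×84 input, assuming unpadded convolutions. Any pooling or flattening that "Basic_CNN" applies afterwards is not modeled, and the helper names are hypothetical.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Standard output-length formula for one convolution dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_shape(img_size, filters, kernels, strides):
    """Return (channels, height, width) after the stacked convolutions."""
    height, width = img_size
    for kernel, stride in zip(kernels, strides):
        height = conv_output_size(height, kernel, stride)
        width = conv_output_size(width, kernel, stride)
    return filters[-1], height, width


print(feature_map_shape([84, 84], [32, 64, 64], [8, 4, 3], [4, 2, 1]))  # (64, 7, 7)
```

The spatial size shrinks from 84 to 20, then 9, then 7, so the last layer outputs 64 channels of 7×7 under the no-padding assumption.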

agent: "C51DQN"  # Name of agent
env_name: "Classic Control"  # Environment name.
env_id: "CartPole-v1"  # Environment ID.
env_seed: 1  # Random seed for the environment.
vectorize: "DummyVecEnv"  # Method to vectorize the environment.
learner: "C51_Learner"  # Name of the learner.
policy: "C51_Q_network"  # Name of the policy.
representation: "Basic_MLP"  # The representation.
runner: "DRL"  # Name of the runner.

representation_hidden_size: [128,]  # hidden sizes for MLP representation.
q_hidden_size: [128,]  # hidden sizes for Q networks.
activation: 'relu'  # The activation function for hidden layers.

seed: 1  # The random seed.
parallels: 10  # Number of environments to run in parallel.
buffer_size: 200000  # The size of replay buffer
batch_size: 256  # The batch size for training.
learning_rate: 0.001  # The learning rate.
gamma: 0.99  # The discount factor.
v_min: 0
v_max: 200
atom_num: 51

start_greedy: 0.5  # The start epsilon greedy.
end_greedy: 0.01  # The end epsilon greedy.
decay_step_greedy: 100000  # Number of steps for the decay of epsilon greedy.
sync_frequency: 100  # The frequency to update target networks.
training_frequency: 1  # The frequency to update the RL model.
running_steps: 200000  # The total running steps.
start_training: 1000  # The running steps before training.

use_grad_clip: False  # Whether to use gradient clipping.
grad_clip_norm: 0.5  # The norm value for gradient clipping.
use_actions_mask: False  # Whether to use an action mask when the environment provides the available actions.
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5  # The observation normalization range.
rewnorm_range: 5  # The reward normalization range.

eval_interval: 50000  # Frequency to evaluate the model.
test_episode: 1  # Number of episodes to test.
log_dir: "logs/c51/"  # The directory to store logger file.
model_dir: "models/c51/"  # The directory to store model file.
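
The arguments `v_min`, `v_max`, and `atom_num` describe the fixed support of the categorical value distribution used by C51. A minimal sketch, assuming evenly spaced atoms as in the original C51 formulation (the helper names are hypothetical):

```python
def make_atoms(v_min: float, v_max: float, atom_num: int):
    """Evenly spaced support z_0..z_{N-1} between v_min and v_max."""
    delta = (v_max - v_min) / (atom_num - 1)
    return [v_min + i * delta for i in range(atom_num)]


def expected_value(probs, atoms):
    """Q-value of one action: the mean of its categorical return distribution."""
    return sum(p * z for p, z in zip(probs, atoms))


atoms = make_atoms(0, 200, 51)
uniform = [1.0 / len(atoms)] * len(atoms)
print(atoms[:3], atoms[-1])
print(expected_value(uniform, atoms))
```

With these settings the atoms are 4 apart, so returns outside [0, 200] cannot be represented. This is worth checking against the reward scale of each environment.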

agent: "C51DQN"  # Name of agent
env_name: "Classic Control"  # Environment name.
env_id: "Acrobot-v1"  # Environment ID.
env_seed: 1  # Random seed for the environment.
vectorize: "DummyVecEnv"  # Method to vectorize the environment.
learner: "C51_Learner"  # Name of the learner.
policy: "C51_Q_network"  # Name of the policy.
representation: "Basic_MLP"  # The representation.
runner: "DRL"  # Name of the runner.

representation_hidden_size: [128,]  # hidden sizes for MLP representation.
q_hidden_size: [128,]  # hidden sizes for Q networks.
activation: 'relu'  # The activation function for hidden layers.

seed: 1  # The random seed.
parallels: 10  # Number of environments to run in parallel.
buffer_size: 200000  # The size of replay buffer
batch_size: 256  # The batch size for training.
learning_rate: 0.001  # The learning rate.
gamma: 0.99  # The discount factor.
v_min: 0
v_max: 200
atom_num: 51

start_greedy: 0.5  # The start epsilon greedy.
end_greedy: 0.01  # The end epsilon greedy.
decay_step_greedy: 100000  # Number of steps for the decay of epsilon greedy.
sync_frequency: 100  # The frequency to update target networks.
training_frequency: 1  # The frequency to update the RL model.
running_steps: 300000  # The total running steps.
start_training: 1000  # The running steps before training.

use_grad_clip: False  # Whether to use gradient clipping.
grad_clip_norm: 0.5  # The norm value for gradient clipping.
use_actions_mask: False  # Whether to use an action mask when the environment provides the available actions.
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5  # The observation normalization range.
rewnorm_range: 5  # The reward normalization range.

eval_interval: 50000  # Frequency to evaluate the model.
test_episode: 1  # Number of episodes to test.
log_dir: "logs/c51/"  # The directory to store logger file.
model_dir: "models/c51/"  # The directory to store model file.

agent: "C51DQN"  # Name of agent
env_name: "Classic Control"  # Environment name.
env_id: "MountainCar-v0"  # Environment ID.
env_seed: 1  # Random seed for the environment.
vectorize: "DummyVecEnv"  # Method to vectorize the environment.
learner: "C51_Learner"  # Name of the learner.
policy: "C51_Q_network"  # Name of the policy.
representation: "Basic_MLP"  # The representation.
runner: "DRL"  # Name of the runner.

representation_hidden_size: [128,]  # hidden sizes for MLP representation.
q_hidden_size: [128,]  # hidden sizes for Q networks.
activation: 'relu'  # The activation function for hidden layers.

seed: 1  # The random seed.
parallels: 10  # Number of environments to run in parallel.
buffer_size: 200000  # The size of replay buffer
batch_size: 256  # The batch size for training.
learning_rate: 0.001  # The learning rate.
gamma: 0.99  # The discount factor.
v_min: 0
v_max: 200
atom_num: 51

start_greedy: 0.5  # The start epsilon greedy.
end_greedy: 0.01  # The end epsilon greedy.
decay_step_greedy: 100000  # Number of steps for the decay of epsilon greedy.
sync_frequency: 100  # The frequency to update target networks.
training_frequency: 1  # The frequency to update the RL model.
running_steps: 200000  # The total running steps.
start_training: 1000  # The running steps before training.

use_grad_clip: False  # Whether to use gradient clipping.
grad_clip_norm: 0.5  # The norm value for gradient clipping.
use_actions_mask: False  # Whether to use an action mask when the environment provides the available actions.
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5  # The observation normalization range.
rewnorm_range: 5  # The reward normalization range.

eval_interval: 50000  # Frequency to evaluate the model.
test_episode: 1  # Number of episodes to test.
log_dir: "logs/c51/"  # The directory to store logger file.
model_dir: "models/c51/"  # The directory to store model file.

agent: "C51DQN"  # Name of agent
vectorize: "Dummy_Atari"  # Method to vectorize the environment.
env_name: "Atari"  # Environment name.
env_id: "ALE/Breakout-v5"  # Environment ID.
env_seed: 1  # Random seed for the environment.
obs_type: "grayscale"  # choice for Atari env: ram, rgb, grayscale
img_size: [84, 84]  # default is 210 x 160 in gym[Atari]
num_stack: 4  # frame stack trick
frame_skip: 4  # frame skip trick
noop_max: 30  # Do no-op action for a number of steps in [1, noop_max].
learner: "C51_Learner"  # Name of the learner.
policy: "C51_Q_network"  # Name of the policy.
representation: "Basic_CNN"  # The representation.
runner: "DRL"  # Name of the runner.

# the following three arguments are for "Basic_CNN" representation.
filters: [32, 64, 64]  #  [16, 16, 32, 32]
kernels: [8, 4, 3]  # [8, 6, 4, 4]
strides: [4, 2, 1]  # [2, 2, 2, 2]

q_hidden_size: [512, ]  # The hidden units for Q-network.
activation: "relu"  # The activation function of each hidden layer.

seed: 1069  # The random seed.
parallels: 5  # Number of environments to run in parallel.
buffer_size: 500000  # The size of replay buffer
batch_size: 32  # The batch size for training.
learning_rate: 0.0001  # The learning rate.
gamma: 0.99  # The discount factor.
v_min: 0
v_max: 200
atom_num: 51

start_greedy: 0.5  # The start epsilon greedy.
end_greedy: 0.05  # The end epsilon greedy.
decay_step_greedy: 10000000  # Number of steps for the decay of epsilon greedy.
sync_frequency: 500  # The frequency to update target networks.
training_frequency: 1  # The frequency to update the RL model.
running_steps: 50000000  # The total running steps.
start_training: 10000  # The running steps before training.

use_grad_clip: False  # Whether to use gradient clipping.
grad_clip_norm: 0.5  # The norm value for gradient clipping.
use_actions_mask: False  # Whether to use an action mask when the environment provides the available actions.
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5  # The observation normalization range.
rewnorm_range: 5  # The reward normalization range.

eval_interval: 500000  # Frequency to evaluate the model.
test_episode: 3  # Number of episodes to test.
log_dir: "logs/c51/"  # The directory to store logger file.
model_dir: "models/c51/"  # The directory to store model file.
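
The `num_stack` argument refers to the frame-stack trick: the agent observes the most recent frames together rather than a single image. The sketch below illustrates the idea with a fixed-length queue. The `FrameStack` class is hypothetical and is not XuanCe's wrapper; frames are represented by plain labels to keep the example self-contained.

```python
from collections import deque


class FrameStack:
    """Keep the last num_stack frames as one observation (illustrative only)."""

    def __init__(self, num_stack: int = 4):
        self.frames = deque(maxlen=num_stack)

    def reset(self, first_frame):
        # Fill the stack with copies of the first frame of the episode.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, frame):
        # The oldest frame is dropped automatically once the deque is full.
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(num_stack=4)
print(stack.reset("f0"))
print(stack.step("f1"))
```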

agent: "DDQN"
env_name: "Classic Control"
env_id: "CartPole-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
policy: "Basic_Q_network"
representation: "Basic_MLP"
learner: "DDQN_Learner"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 128
learning_rate: 0.001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.01
decay_step_greedy: 100000
sync_frequency: 100
training_frequency: 1
running_steps: 300000
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/ddqn/"
model_dir: "models/ddqn/"
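
The DDQN configuration has the same fields as the DQN one; the difference lies in how the learner forms the bootstrap target. The sketch below contrasts the two targets for a single transition, following the standard Double DQN formulation. It is not taken from XuanCe's "DDQN_Learner", and the function names are hypothetical.

```python
def dqn_target(reward, done, gamma, q_target_next):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    return reward + gamma * (1.0 - done) * max(q_target_next)


def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """Double DQN: the online network selects, the target network evaluates."""
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * (1.0 - done) * q_target_next[best_action]


q_online_next = [1.0, 2.0]   # online network prefers action 1
q_target_next = [3.0, 0.5]   # target network overestimates action 0
print(dqn_target(1.0, 0.0, 0.99, q_target_next))
print(double_dqn_target(1.0, 0.0, 0.99, q_online_next, q_target_next))
```

Decoupling selection from evaluation yields the smaller target here, which is the overestimation reduction Double DQN aims for.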

agent: "DDQN"
env_name: "Classic Control"
env_id: "Acrobot-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
policy: "Basic_Q_network"
representation: "Basic_MLP"
learner: "DDQN_Learner"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 128
learning_rate: 0.001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.01
decay_step_greedy: 100000
sync_frequency: 100
training_frequency: 1
running_steps: 300000
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/ddqn/"
model_dir: "models/ddqn/"

agent: "DDQN"
env_name: "Classic Control"
env_id: "MountainCar-v0"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
policy: "Basic_Q_network"
representation: "Basic_MLP"
learner: "DDQN_Learner"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 128
learning_rate: 0.001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.01
decay_step_greedy: 100000
sync_frequency: 100
training_frequency: 1
running_steps: 300000
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/ddqn/"
model_dir: "models/ddqn/"

agent: "DDQN"
vectorize: "Dummy_Atari"
env_name: "Atari"
env_id: "ALE/Breakout-v5"
env_seed: 1  # The random seed of the environment.
obs_type: "grayscale"  # choice for Atari env: ram, rgb, grayscale
img_size: [84, 84]  # default is 210 x 160 in gym[Atari]
num_stack: 4  # frame stack trick
frame_skip: 4  # frame skip trick
noop_max: 30  # Do no-op action for a number of steps in [1, noop_max].
policy: "Basic_Q_network"
representation: "Basic_CNN"
learner: "DDQN_Learner"
runner: "DRL"

# the following three arguments are for "Basic_CNN" representation.
filters: [32, 64, 64]  #  [16, 16, 32, 32]
kernels: [8, 4, 3]  # [8, 6, 4, 4]
strides: [4, 2, 1]  # [2, 2, 2, 2]

q_hidden_size: [512, ]
activation: "relu"  # The activation function of each hidden layer.

seed: 1069
parallels: 5
buffer_size: 500000
batch_size: 32
learning_rate: 0.0001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.05
decay_step_greedy: 10000000  # 10M
sync_frequency: 500
training_frequency: 1
running_steps: 50000000  # 50M
start_training: 10000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 500000
test_episode: 3
log_dir: "logs/ddqn/"
model_dir: "models/ddqn/"

agent: "Duel_DQN"
env_name: "Classic Control"
env_id: "CartPole-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
policy: "Duel_Q_network"
representation: "Basic_MLP"
learner: "DuelDQN_Learner"
runner: "DRL"

representation_hidden_size: [128, ]
q_hidden_size: [128, ]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 100000
batch_size: 256
learning_rate: 0.001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.01
decay_step_greedy: 200000
sync_frequency: 50
training_frequency: 1
running_steps: 500000
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/dueldqn/"
model_dir: "models/dueldqn/"
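
The "Duel_Q_network" policy refers to the dueling architecture, which splits the Q-function into a state value and per-action advantages. A minimal sketch of the usual aggregation step, assuming the mean-subtraction variant from the dueling DQN paper (whether XuanCe uses exactly this variant is an assumption):

```python
def dueling_q_values(value, advantages):
    """Combine V(s) and A(s, a) as Q = V + A - mean(A)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + adv - mean_adv for adv in advantages]


print(dueling_q_values(1.0, [1.0, 2.0, 3.0]))  # [0.0, 1.0, 2.0]
```

Subtracting the mean advantage keeps the decomposition identifiable: the Q-values average to V(s).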

agent: "Duel_DQN"
env_name: "Classic Control"
env_id: "Acrobot-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
policy: "Duel_Q_network"
representation: "Basic_MLP"
learner: "DuelDQN_Learner"
runner: "DRL"

representation_hidden_size: [128, ]
q_hidden_size: [128, ]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 100000
batch_size: 256
learning_rate: 0.001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.01
decay_step_greedy: 200000
sync_frequency: 50
training_frequency: 1
running_steps: 500000
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/dueldqn/"
model_dir: "models/dueldqn/"

agent: "Duel_DQN"
env_name: "Classic Control"
env_id: "MountainCar-v0"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
policy: "Duel_Q_network"
representation: "Basic_MLP"
learner: "DuelDQN_Learner"
runner: "DRL"

representation_hidden_size: [128, ]
q_hidden_size: [128, ]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 100000
batch_size: 256
learning_rate: 0.0001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.01
decay_step_greedy: 200000
sync_frequency: 50
training_frequency: 1
running_steps: 300000
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/dueldqn/"
model_dir: "models/dueldqn/"

agent: "Duel_DQN"
vectorize: "Dummy_Atari"
env_name: "Atari"
env_id: "ALE/Breakout-v5"
env_seed: 1  # The random seed of the environment.
obs_type: "grayscale"  # choice for Atari env: ram, rgb, grayscale
img_size: [84, 84]  # default is 210 x 160 in gym[Atari]
num_stack: 4  # frame stack trick
frame_skip: 4  # frame skip trick
noop_max: 30  # Do no-op action for a number of steps in [1, noop_max].
policy: "Duel_Q_network"
representation: "Basic_CNN"
learner: "DuelDQN_Learner"
runner: "DRL"

# the following three arguments are for "Basic_CNN" representation.
filters: [32, 64, 64]  #  [16, 16, 32, 32]
kernels: [8, 4, 3]  # [8, 6, 4, 4]
strides: [4, 2, 1]  # [2, 2, 2, 2]

q_hidden_size: [512, ]
activation: "relu"  # The activation function of each hidden layer.

seed: 1069
parallels: 5
buffer_size: 500000
batch_size: 32  # 64
learning_rate: 0.0001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.05
decay_step_greedy: 10000000  # 10M
sync_frequency: 500
training_frequency: 1
running_steps: 50000000  # 50M
start_training: 10000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 500000
test_episode: 1
log_dir: "logs/dueldqn/"
model_dir: "models/dueldqn/"

agent: "NoisyDQN"
env_name: "Classic Control"
env_id: "CartPole-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "DQN_Learner"
policy: "Noisy_Q_network"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 128
learning_rate: 0.001
gamma: 0.99

start_noise: 0.05
end_noise: 0.0
decay_step_noise: 500000
sync_frequency: 100
training_frequency: 2
running_steps: 500000
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/noisy_dqn/"
model_dir: "models/noisy_dqn/"
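
In the NoisyDQN configurations, `start_noise`, `end_noise`, and `decay_step_noise` replace the epsilon-greedy arguments. The sketch below assumes that the noise scale decays linearly and that it multiplies zero-mean Gaussian noise added to the network parameters. Both are assumptions about the mechanism, and the helper names are hypothetical.

```python
import random


def noise_scale_at(step: int, start_noise: float = 0.05,
                   end_noise: float = 0.0, decay_step_noise: int = 500000) -> float:
    """Linearly decay the noise scale (assumed schedule)."""
    frac = min(max(step, 0) / decay_step_noise, 1.0)
    return start_noise + frac * (end_noise - start_noise)


def perturb(weights, scale: float, rng: random.Random):
    """Add Gaussian noise with standard deviation `scale` to each weight."""
    return [w + scale * rng.gauss(0.0, 1.0) for w in weights]


rng = random.Random(1)
weights = [0.2, -0.4, 0.7]
for step in (0, 250000, 500000):
    scale = noise_scale_at(step)
    print(step, scale, perturb(weights, scale, rng))
```

Once the scale reaches `end_noise: 0.0`, the perturbed weights equal the original ones, so action selection becomes deterministic.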

agent: "NoisyDQN"
env_name: "Classic Control"
env_id: "Acrobot-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "DQN_Learner"
policy: "Noisy_Q_network"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 128
learning_rate: 0.001
gamma: 0.99

start_noise: 0.05
end_noise: 0.0
decay_step_noise: 500000
sync_frequency: 100
training_frequency: 2
running_steps: 500000
start_training: 1000

use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
eval_interval: 50000
test_episode: 1
log_dir: "logs/noisy_dqn/"
model_dir: "models/noisy_dqn/"

agent: "NoisyDQN"
env_name: "Classic Control"
env_id: "MountainCar-v0"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "DQN_Learner"
policy: "Noisy_Q_network"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 128
learning_rate: 0.001
gamma: 0.99

start_noise: 0.05
end_noise: 0.0
decay_step_noise: 500000
sync_frequency: 100
training_frequency: 2
running_steps: 500000
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/noisy_dqn/"
model_dir: "models/noisy_dqn/"

agent: "NoisyDQN"
vectorize: "Dummy_Atari"
env_name: "Atari"
env_id: "ALE/Breakout-v5"
env_seed: 1  # The random seed of the environment.
obs_type: "grayscale"  # choice for Atari env: ram, rgb, grayscale
img_size: [84, 84]  # default is 210 x 160 in gym[Atari]
num_stack: 4  # frame stack trick
frame_skip: 4  # frame skip trick
noop_max: 30  # Do no-op action for a number of steps in [1, noop_max].
learner: "DQN_Learner"
policy: "Noisy_Q_network"
representation: "Basic_CNN"
runner: "DRL"

# the following three arguments are for "Basic_CNN" representation.
filters: [32, 64, 64]  #  [16, 16, 32, 32]
kernels: [8, 4, 3]  # [8, 6, 4, 4]
strides: [4, 2, 1]  # [2, 2, 2, 2]

q_hidden_size: [512, ]
activation: "relu"  # The activation function of each hidden layer.

seed: 1069
parallels: 5
buffer_size: 500000
batch_size: 32  # 64
learning_rate: 0.0001
gamma: 0.99

start_noise: 0.05
end_noise: 0.0
decay_step_noise: 10000000  # 10M
sync_frequency: 500
training_frequency: 1
running_steps: 50000000  # 50M
start_training: 10000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 500000
test_episode: 1
log_dir: "logs/noisy_dqn/"
model_dir: "models/noisy_dqn/"

agent: "PerDQN"
env_name: "Classic Control"
env_id: "CartPole-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "PerDQN_Learner"
policy: "Basic_Q_network"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 128
learning_rate: 0.001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.1
decay_step_greedy: 200000
sync_frequency: 100
training_frequency: 4
running_steps: 500000
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

PER_alpha: 0.5
PER_beta0: 0.4

eval_interval: 50000
test_episode: 1
log_dir: "logs/perdqn/"
model_dir: "models/perdqn/"
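
`PER_alpha` and `PER_beta0` are the two exponents of prioritized experience replay. The sketch below follows the standard formulation: priorities are raised to `alpha` to obtain sampling probabilities, and importance-sampling weights use an exponent `beta`, which starts at `PER_beta0` and is typically annealed toward 1. The annealing and the helper names are assumptions, not XuanCe's code.

```python
def sampling_probabilities(priorities, alpha: float = 0.5):
    """P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, beta: float = 0.4):
    """w_i = (N * P(i))^(-beta), normalized by the largest weight."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_weight = max(weights)
    return [w / max_weight for w in weights]


probs = sampling_probabilities([1.0, 4.0, 9.0], alpha=0.5)
print(probs)
print(importance_weights(probs, beta=0.4))
```

With `alpha = 0.5`, a transition whose priority is nine times larger is sampled only three times as often. Frequently sampled transitions receive smaller weights to offset the sampling bias.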

agent: "PerDQN"
env_name: "Classic Control"
env_id: "Acrobot-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "PerDQN_Learner"
policy: "Basic_Q_network"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 128
learning_rate: 0.001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.1
decay_step_greedy: 200000
sync_frequency: 100
training_frequency: 4
running_steps: 500000
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

PER_alpha: 0.5
PER_beta0: 0.4

eval_interval: 50000
test_episode: 1
log_dir: "logs/perdqn/"
model_dir: "models/perdqn/"

agent: "PerDQN"
env_name: "Classic Control"
env_id: "MountainCar-v0"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "PerDQN_Learner"
policy: "Basic_Q_network"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 128
learning_rate: 0.001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.1
decay_step_greedy: 200000
sync_frequency: 100
training_frequency: 4
running_steps: 500000
start_training: 1000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

PER_alpha: 0.5
PER_beta0: 0.4

eval_interval: 50000
test_episode: 1
log_dir: "logs/perdqn/"
model_dir: "models/perdqn/"

agent: "PerDQN"
vectorize: "Dummy_Atari"
env_name: "Atari"
env_id: "ALE/Breakout-v5"
env_seed: 1  # The random seed of the environment.
obs_type: "grayscale"  # choice for Atari env: ram, rgb, grayscale
img_size: [84, 84]  # default is 210 x 160 in gym[Atari]
num_stack: 4  # frame stack trick
frame_skip: 4  # frame skip trick
noop_max: 30  # Do no-op action for a number of steps in [1, noop_max].
learner: "PerDQN_Learner"
policy: "Basic_Q_network"
representation: "Basic_CNN"
runner: "DRL"

# the following three arguments are for "Basic_CNN" representation.
filters: [32, 64, 64]  #  [16, 16, 32, 32]
kernels: [8, 4, 3]  # [8, 6, 4, 4]
strides: [4, 2, 1]  # [2, 2, 2, 2]

q_hidden_size: [512, ]
activation: "relu"  # The activation function of each hidden layer.

seed: 1069
parallels: 5
buffer_size: 500000
batch_size: 32  # 64
learning_rate: 0.0001
gamma: 0.99

start_greedy: 0.5
end_greedy: 0.05
decay_step_greedy: 10000000  # 10M
sync_frequency: 500
training_frequency: 1
running_steps: 50000000  # 50M
start_training: 10000

use_grad_clip: False  # gradient normalization
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

PER_alpha: 0.5
PER_beta0: 0.4

eval_interval: 500000
test_episode: 1
log_dir: "logs/perdqn/"
model_dir: "models/perdqn/"

agent: "QRDQN"
env_name: "Classic Control"
env_id: "CartPole-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "QRDQN_Learner"
policy: "QR_Q_network"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 256
learning_rate: 0.001
gamma: 0.99
quantile_num: 20

start_greedy: 0.25
end_greedy: 0.01
decay_step_greedy: 300000
sync_frequency: 100
training_frequency: 1
running_steps: 300000
start_training: 1000

use_grad_clip: False  # gradient normalization
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/qrdqn/"
model_dir: "models/qrdqn/"
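
`quantile_num` sets how many quantiles of the return distribution the "QR_Q_network" estimates. A minimal sketch of the quantile midpoints and the quantile Huber loss, as defined in the QR-DQN paper. The threshold `kappa = 1.0` is an assumption rather than a value taken from this configuration, and the helper names are hypothetical.

```python
def quantile_midpoints(quantile_num: int):
    """tau_hat_i = (2i + 1) / (2N) for i = 0..N-1."""
    return [(2 * i + 1) / (2 * quantile_num) for i in range(quantile_num)]


def quantile_huber(td_error: float, tau: float, kappa: float = 1.0) -> float:
    """Asymmetrically weighted Huber loss for one (target - prediction) error."""
    abs_err = abs(td_error)
    if abs_err <= kappa:
        huber = 0.5 * td_error ** 2
    else:
        huber = kappa * (abs_err - 0.5 * kappa)
    # Underestimates and overestimates are weighted by tau and (1 - tau).
    weight = abs(tau - (1.0 if td_error < 0 else 0.0))
    return weight * huber


taus = quantile_midpoints(20)
print(taus[0], taus[-1])
print(quantile_huber(0.5, 0.25), quantile_huber(-0.5, 0.25))
```

For a low quantile such as tau = 0.25, a negative error (prediction above the target) is penalized three times as heavily as a positive one. This asymmetry pushes that output toward the lower part of the distribution.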

agent: "QRDQN"
env_name: "Classic Control"
env_id: "Acrobot-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "QRDQN_Learner"
policy: "QR_Q_network"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 256
learning_rate: 0.001
gamma: 0.99
quantile_num: 20

start_greedy: 0.25
end_greedy: 0.01
decay_step_greedy: 300000
sync_frequency: 100
training_frequency: 1
running_steps: 300000
start_training: 1000

use_grad_clip: False  # gradient normalization
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/qrdqn/"
model_dir: "models/qrdqn/"

agent: "QRDQN"
env_name: "Classic Control"
env_id: "MountainCar-v0"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "QRDQN_Learner"
policy: "QR_Q_network"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
q_hidden_size: [128,]
activation: 'relu'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 256
learning_rate: 0.001
gamma: 0.99
quantile_num: 20

start_greedy: 0.25
end_greedy: 0.01
decay_step_greedy: 300000
sync_frequency: 100
training_frequency: 1
running_steps: 300000
start_training: 1000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/qrdqn/"
model_dir: "models/qrdqn/"

agent: "QRDQN"
vectorize: "Dummy_Atari"
env_name: "Atari"
env_id: "ALE/Breakout-v5"
env_seed: 1  # The random seed of the environment.
obs_type: "grayscale"  # choice for Atari env: ram, rgb, grayscale
img_size: [84, 84]  # default is 210 x 160 in gym[Atari]
num_stack: 4  # frame stack trick
frame_skip: 4  # frame skip trick
noop_max: 30  # Do no-op action for a number of steps in [1, noop_max].
learner: "QRDQN_Learner"
policy: "QR_Q_network"
representation: "Basic_CNN"
runner: "DRL"

# the following three arguments are for "Basic_CNN" representation.
filters: [32, 64, 64]  #  [16, 16, 32, 32]
kernels: [8, 4, 3]  # [8, 6, 4, 4]
strides: [4, 2, 1]  # [2, 2, 2, 2]

q_hidden_size: [512, ]
activation: "relu"  # The activation function of each hidden layer.

seed: 1069
parallels: 5
buffer_size: 500000
batch_size: 32  # 64
learning_rate: 0.0001
gamma: 0.99
quantile_num: 20

start_greedy: 0.5
end_greedy: 0.05
decay_step_greedy: 10000000  # 10M
sync_frequency: 500
training_frequency: 1
running_steps: 50000000  # 50M
start_training: 10000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 500000
test_episode: 1
log_dir: "logs/qrdqn/"
model_dir: "models/qrdqn/"
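
The `filters`, `kernels`, and `strides` lists above define three convolutional layers applied to 84 x 84 frames. The sketch below works out the resulting feature size, assuming convolutions without padding; that is an assumption about the "Basic_CNN" representation, not something stated in the configuration. The helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride):
    """Output length of one convolution without padding (assumed here)."""
    return (size - kernel) // stride + 1


def cnn_feature_size(img_size=(84, 84), filters=(32, 64, 64),
                     kernels=(8, 4, 3), strides=(4, 2, 1)):
    """Number of features after flattening the last convolutional layer."""
    height, width = img_size
    for kernel, stride in zip(kernels, strides):
        height = conv_output_size(height, kernel, stride)
        width = conv_output_size(width, kernel, stride)
    return filters[-1] * height * width


if __name__ == "__main__":
    print(cnn_feature_size())
```

Under this assumption the spatial size shrinks from 84 to 20, then 9, then 7, so the flattened output has 64 * 7 * 7 = 3136 features before the layer given by `q_hidden_size`.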

Policy-based Algorithms

agent: "PG"
env_name: "Classic Control"
env_id: "CartPole-v1"
env_seed: 1  # The random seed of the environment.
representation: "Basic_MLP"
vectorize: "DummyVecEnv"
policy: "Categorical_Actor"
learner: "PG_Learner"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
activation: 'relu'
activation_action: 'tanh'

seed: 1
parallels: 10
running_steps: 300000
horizon_size: 128  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0004

ent_coef: 0.01
gamma: 0.98
use_gae: False
gae_lambda: 0.95
use_advnorm: False

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/pg/"
model_dir: "models/pg/"
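
With `horizon_size: 128` and `parallels: 10`, the buffer above holds 128 * 10 = 1280 transitions per update. Because `use_gae` is False, a natural reading is that the learning targets are plain discounted returns. The sketch below computes them for a single trajectory with `gamma: 0.98`. The helper is hypothetical and ignores episode boundaries inside the horizon, which a real buffer has to handle.

```python
def discounted_returns(rewards, gamma=0.98):
    """Discounted return G_t = r_t + gamma * G_{t+1} for one trajectory."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


if __name__ == "__main__":
    print(discounted_returns([1.0, 1.0, 1.0]))
```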

agent: "PG"
env_name: "Classic Control"
env_id: "Acrobot-v1"
env_seed: 1  # The random seed of the environment.
representation: "Basic_MLP"
vectorize: "DummyVecEnv"
policy: "Categorical_Actor"
learner: "PG_Learner"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
activation: 'relu'
activation_action: 'tanh'

seed: 1
parallels: 10
running_steps: 300000
horizon_size: 500
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0004

ent_coef: 0.01
gamma: 0.98
use_gae: False
gae_lambda: 0.95
use_advnorm: False

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/pg/"
model_dir: "models/pg/"

agent: "PG"
env_name: "Classic Control"
env_id: "Pendulum-v1"
env_seed: 1  # The random seed of the environment.
representation: "Basic_MLP"
vectorize: "DummyVecEnv"
policy: "Gaussian_Actor"
learner: "PG_Learner"
runner: "DRL"

representation_hidden_size: [256,]
actor_hidden_size: [256,]
activation: 'leaky_relu'
activation_action: 'tanh'

seed: 1
parallels: 10
running_steps: 300000
horizon_size: 128  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0004

ent_coef: 0.01
gamma: 0.98
use_gae: False
gae_lambda: 0.95
use_advnorm: False

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/pg/"
model_dir: "models/pg/"

agent: "PG"
env_name: "Box2D"
env_id: "BipedalWalker-v3"
env_seed: 1  # The random seed of the environment.
representation: "Basic_MLP"
vectorize: "DummyVecEnv"
policy: "Gaussian_Actor"
learner: "PG_Learner"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
activation: 'relu'
activation_action: 'tanh'

seed: 1
parallels: 10
running_steps: 100000
horizon_size: 1024  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 3
n_minibatch: 1
learning_rate: 0.0004

ent_coef: 0.01
gamma: 0.98
use_gae: False
gae_lambda: 0.95
use_advnorm: False

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 10000
test_episode: 1
log_dir: "logs/pg/"
model_dir: "models/pg/"

agent: "PG"
env_name: "MuJoCo"
env_id: "Ant-v4"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
policy: "Gaussian_Actor"
representation: "Basic_MLP"
learner: "PG_Learner"
runner: "DRL"

representation_hidden_size: [256, 256]
actor_hidden_size: []
activation: "leaky_relu"
activation_action: 'tanh'

seed: 1
parallels: 16
running_steps: 1000000  # 1M
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4

ent_coef: 0.0
gamma: 0.99
use_gae: False
gae_lambda: 0.95
use_advnorm: False

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 5000
test_episode: 5
log_dir: "logs/pg/"
model_dir: "models/pg/"

agent: "PPG"
env_name: "Classic Control"
env_id: "CartPole-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_MLP"
policy: "Categorical_PPG"
learner: "PPG_Learner"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
critic_hidden_size: [128,]
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 10
running_steps: 300000
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 1
policy_nepoch: 4
value_nepoch: 8 
aux_nepoch: 8
n_minibatch: 1
learning_rate: 0.0004

ent_coef: 0.01
clip_range: 0.2
kl_beta: 1.0
gamma: 0.98
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/ppg/"
model_dir: "models/ppg/"
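
The `use_obsnorm` and `obsnorm_range` entries above enable observation normalization. The sketch below shows one common form of the trick: standardize each value with running statistics, then clip the result to [-obsnorm_range, obsnorm_range]. The function name, the `eps` term, and the fixed `mean` and `std` arguments are assumptions for illustration; how XuanCe tracks the running statistics is not shown here.

```python
def normalize_and_clip(values, mean, std, norm_range=5.0, eps=1e-8):
    """Standardize values, then clip them to [-norm_range, norm_range]."""
    normalized = []
    for value in values:
        z = (value - mean) / (std + eps)  # eps guards against division by zero
        normalized.append(max(-norm_range, min(norm_range, z)))
    return normalized


if __name__ == "__main__":
    print(normalize_and_clip([0.0, 10.0, -100.0], mean=0.0, std=1.0))
```

The same pattern would apply to rewards with `use_rewnorm` and `rewnorm_range`, under the same assumptions.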

agent: "PPG"
env_name: "Classic Control"
env_id: "Acrobot-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_MLP"
policy: "Categorical_PPG"
learner: "PPG_Learner"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
critic_hidden_size: [128,]
activation: "leaky_relu"

seed: 1
parallels: 10
running_steps: 300000
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 1
policy_nepoch: 4
value_nepoch: 8 
aux_nepoch: 8
n_minibatch: 1
learning_rate: 0.001

ent_coef: 0.01
clip_range: 0.2
kl_beta: 1.0
gamma: 0.98
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/ppg/"
model_dir: "models/ppg/"

agent: "PPG"
env_name: "Classic Control"
env_id: "Pendulum-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_MLP"
policy: "Gaussian_PPG"
learner: "PPG_Learner"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
critic_hidden_size: [128,]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 1
parallels: 10
running_steps: 300000
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 1
policy_nepoch: 4
value_nepoch: 8 
aux_nepoch: 8
n_minibatch: 1
learning_rate: 0.001

ent_coef: 0.01
clip_range: 0.2
kl_beta: 1.0
gamma: 0.98
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/ppg/"
model_dir: "models/ppg/"

agent: "PPG"
env_name: "Classic Control"
env_id: "MountainCar-v0"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_MLP"
policy: "Categorical_PPG"
learner: "PPG_Learner"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
critic_hidden_size: [128,]
activation: "leaky_relu"

seed: 1
parallels: 10
running_steps: 300000
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 1
policy_nepoch: 4
value_nepoch: 8 
aux_nepoch: 8
n_minibatch: 1
learning_rate: 0.001

ent_coef: 0.01
clip_range: 0.2
kl_beta: 1.0
gamma: 0.98
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/ppg/"
model_dir: "models/ppg/"

agent: "PPG"
env_name: "Box2D"
env_id: "BipedalWalker-v3"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_MLP"
policy: "Gaussian_PPG"
learner: "PPG_Learner"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
critic_hidden_size: [128,]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 1
parallels: 10
running_steps: 300000
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 1
policy_nepoch: 4
value_nepoch: 8 
aux_nepoch: 8
n_minibatch: 1
learning_rate: 0.001

ent_coef: 0.01
clip_range: 0.2
kl_beta: 1.0
gamma: 0.98
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/ppg/"
model_dir: "models/ppg/"

agent: "PPG"
env_name: "MuJoCo"
env_id: "InvertedPendulum-v2"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_MLP"
policy: "Gaussian_PPG"
learner: "PPG_Learner"
runner: "DRL"

representation_hidden_size: [256,]
actor_hidden_size: [256,]
critic_hidden_size: [256,]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 1
parallels: 16
running_steps: 1000000  # 1M
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_minibatch: 4
n_epochs: 1
policy_nepoch: 2
value_nepoch: 4
aux_nepoch: 8

learning_rate: 0.0007

ent_coef: 0.0
clip_range: 0.25
kl_beta: 2.0
gamma: 0.98
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 10000
test_episode: 5
log_dir: "logs/ppg/"
model_dir: "models/ppg/"

agent: "PPO"  # Choice: PPO, PPO_KL
env_name: "Classic Control"
env_id: "CartPole-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_MLP"
policy: "Categorical_AC"
learner: "PPO_Learner"  # Choice: PPO_Learner, PPOKL_Learner
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
critic_hidden_size: [128,]
activation: 'leaky_relu'

seed: 1
parallels: 10
running_steps: 120000
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 8
n_minibatch: 8
learning_rate: 0.0004

vf_coef: 0.25
ent_coef: 0.01
target_kl: 0.25  # for PPO_KL agent
kl_coef: 1.0  # for PPO_KL agent
clip_range: 0.2  # for PPO agent
gamma: 0.98
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 1000
test_episode: 5
log_dir: "logs/ppo/"
model_dir: "models/ppo/"
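
The `clip_range: 0.2` entry above is the clipping parameter of the PPO agent. The sketch below evaluates the standard PPO clipped surrogate objective for a single sample, where `ratio` is the new-to-old policy probability ratio. The function name is hypothetical; it illustrates the textbook formula rather than the XuanCe learner code.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - c, 1 + c) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # A large ratio with a positive advantage is capped at 1 + clip_range.
    print(clipped_surrogate(1.5, 1.0))
    # A small ratio with a negative advantage is capped at 1 - clip_range.
    print(clipped_surrogate(0.5, -1.0))
```

The learner maximizes the mean of this quantity over a minibatch, so a sample stops contributing gradient once its ratio moves outside [0.8, 1.2] in the direction that would increase the objective.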

agent: "PPO"  # Choice: PPO, PPO_KL
env_name: "Classic Control"
env_id: "Acrobot-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_MLP"
policy: "Categorical_AC"
learner: "PPO_Learner"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
critic_hidden_size: [128,]
activation: 'leaky_relu'

seed: 1
parallels: 10
running_steps: 300000
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 8
n_minibatch: 8
learning_rate: 0.0004

vf_coef: 0.25
ent_coef: 0.01
target_kl: 0.25  # for PPO_KL agent
kl_coef: 1.0  # for PPO_KL agent
clip_range: 0.2  # for PPO agent
gamma: 0.98
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 3
log_dir: "logs/ppo/"
model_dir: "models/ppo/"
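
The `horizon_size`, `parallels`, `n_epochs`, and `n_minibatch` entries above together determine how much data each update sees. The arithmetic below assumes that `n_minibatch` is the number of minibatches per epoch and that every epoch passes over the whole buffer once; both points are assumptions about the learner, and the helper name is hypothetical.

```python
def update_schedule(horizon_size=256, parallels=10, n_epochs=8, n_minibatch=8):
    """Return (buffer_size, minibatch_size, gradient_steps) for one update."""
    buffer_size = horizon_size * parallels       # stated in the config comment
    minibatch_size = buffer_size // n_minibatch  # assumed equal split
    gradient_steps = n_epochs * n_minibatch      # assumed one step per minibatch
    return buffer_size, minibatch_size, gradient_steps


if __name__ == "__main__":
    print(update_schedule())
```

For this preset the buffer holds 2560 transitions, which under these assumptions gives 64 gradient steps on minibatches of 320 samples each.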

agent: "PPO"  # Choice: PPO, PPO_KL
env_name: "Classic Control"
env_id: "Pendulum-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_MLP"
policy: "Gaussian_AC"
learner: "PPO_Learner"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
critic_hidden_size: [128,]
activation: 'leaky_relu'
activation_action: 'tanh'

seed: 1
parallels: 10
running_steps: 300000
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 8
n_minibatch: 8
learning_rate: 0.0004

vf_coef: 0.25
ent_coef: 0.01
target_kl: 0.25  # for PPO_KL agent
kl_coef: 1.0  # for PPO_KL agent
clip_range: 0.2  # for PPO agent
gamma: 0.98
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 3
log_dir: "logs/ppo/"
model_dir: "models/ppo/"

agent: "PPO"  # Choice: PPO, PPO_KL
env_name: "Classic Control"
env_id: "MountainCar-v0"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_MLP"
policy: "Categorical_AC"
learner: "PPO_Learner"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
critic_hidden_size: [128,]
activation: 'leaky_relu'

seed: 1
parallels: 10
running_steps: 300000
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 8
n_minibatch: 8
learning_rate: 0.0004

vf_coef: 0.25
ent_coef: 0.01
target_kl: 0.25  # for PPO_KL agent
kl_coef: 1.0  # for PPO_KL agent
clip_range: 0.2  # for PPO agent
gamma: 0.98
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 3
log_dir: "logs/ppo/"
model_dir: "models/ppo/"

agent: "PPO"  # Choice: PPO, PPO_KL
env_name: "Box2D"
env_id: "BipedalWalker-v3"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_MLP"
policy: "Gaussian_AC"
learner: "PPO_Learner"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
critic_hidden_size: [128,]
activation: 'leaky_relu'
activation_action: 'tanh'

seed: 1
parallels: 10
running_steps: 300000
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 8
n_minibatch: 8
learning_rate: 0.0004

vf_coef: 0.25
ent_coef: 0.01
target_kl: 0.25  # for PPO_KL agent
kl_coef: 1.0  # for PPO_KL agent
clip_range: 0.2  # for PPO agent
gamma: 0.98
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 3
log_dir: "logs/ppo/"
model_dir: "models/ppo/"

agent: "PPO"  # Choice: PPO, PPO_KL
vectorize: "Dummy_Atari"
env_name: "Atari"
env_id: "ALE/Breakout-v5"
env_seed: 1  # The random seed of the environment.
obs_type: "grayscale"  # choice for Atari env: ram, rgb, grayscale
img_size: [84, 84]  # default is 210 x 160 in gym[Atari]
num_stack: 4  # frame stack trick
frame_skip: 4  # frame skip trick
noop_max: 30  # Do no-op action for a number of steps in [1, noop_max].
representation: "AC_CNN_Atari"  # CNN and FC layers
policy: "Categorical_AC"
learner: "PPO_Learner"
runner: "DRL"

# Good hyperparameters for Atari games; do not change them.
filters: [32, 64, 64]
kernels: [8, 4, 3]
strides: [4, 2, 1]
fc_hidden_sizes: [512, ]  # fully connected layer hidden sizes.
actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 8
running_steps: 10000000  # 10M
horizon_size: 128  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 4
n_minibatch: 4
learning_rate: 0.00025

vf_coef: 0.25
ent_coef: 0.01
target_kl: 0.25  # for PPO_KL agent
kl_coef: 1.0  # for PPO_KL agent
clip_range: 0.2  # for PPO agent
gamma: 0.99
use_gae: True
gae_lambda: 0.95  # Lambda parameter of GAE, used to weight the n-step advantage estimates.
use_advnorm: True

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 100000
test_episode: 3
log_dir: "logs/ppo/"
model_dir: "models/ppo/"
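
The `use_gae`, `gae_lambda`, and `gamma` entries above select generalized advantage estimation (GAE). The sketch below implements the standard backward recursion for a single trajectory. The function name is hypothetical, and episode terminations inside the horizon are ignored for brevity, which a real buffer must handle.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, gae_lambda=0.95):
    """GAE: A_t = delta_t + gamma * lambda * A_{t+1}, computed backwards."""
    advantages = [0.0] * len(rewards)
    next_value = last_value  # value estimate of the state after the last step
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    print(gae_advantages([1.0, 1.0], [0.5, 0.5], last_value=0.0))
```

Setting `gae_lambda` to 0 reduces each advantage to its one-step TD error, while values close to 1 approach the full discounted return minus the value baseline.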

agent: "PPO"  # choice: PPO, PPO_KL
env_name: "MuJoCo"
env_id: "Ant-v4"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "PPO_Learner"
policy: "Gaussian_AC"  # choice: Gaussian_AC for continuous actions, Categorical_AC for discrete actions.
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [256,]
actor_hidden_size: [256,]
critic_hidden_size: [256,]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 79811
parallels: 16
running_steps: 1000000
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 16
n_minibatch: 8
learning_rate: 0.0004

vf_coef: 0.25
ent_coef: 0.0
target_kl: 0.25  # for PPO_KL agent
kl_coef: 1.0  # for PPO_KL agent
clip_range: 0.2  # for PPO agent
gamma: 0.99
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 5000
test_episode: 5
log_dir: "logs/ppo/"
model_dir: "models/ppo/"

agent: "PPO"  # choice: PPO, PPO_KL
env_name: "MetaDrive"
env_id: "metadrive"
env_seed: 1  # The random seed of the environment.
env_config:  # the configs for MetaDrive environment
  map: "C"  # see https://metadrive-simulator.readthedocs.io/en/latest/rl_environments.html#generalization-environment for choices
render: False
vectorize: "DummyVecEnv"
learner: "PPO_Learner"
policy: "Gaussian_AC"  # choice: Gaussian_AC for continuous actions, Categorical_AC for discrete actions.
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [512,]
actor_hidden_size: [512, 512]
critic_hidden_size: [512, 512]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 1
parallels: 10
running_steps: 500000
horizon_size: 128  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 4
n_minibatch: 4
learning_rate: 0.00025

vf_coef: 0.25
ent_coef: 0.0
target_kl: 0.25  # for PPO_KL agent
kl_coef: 1.0  # for PPO_KL agent
clip_range: 0.2  # for PPO agent
gamma: 0.99
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 5000
test_episode: 5
log_dir: "logs/ppo/"
model_dir: "models/ppo/"

agent: "PPO"  # choice: PPO, PPO_KL
env_name: "MiniGrid"
env_id: "MiniGrid-Empty-5x5-v0"
env_seed: 1  # The random seed of the environment.
RGBImgPartialObsWrapper: False
ImgObsWrapper: False
vectorize: "DummyVecEnv"
learner: "PPO_Learner"
policy: "Categorical_AC"  # choice: Gaussian_AC for continuous actions, Categorical_AC for discrete actions.
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [256,]
actor_hidden_size: [256,]
critic_hidden_size: [256,]
activation: "leaky_relu"

seed: 79811
parallels: 16
running_steps: 100000
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 16
n_minibatch: 8
learning_rate: 0.0001

vf_coef: 0.25
ent_coef: 0.0
target_kl: 0.25  # for PPO_KL agent
kl_coef: 1.0  # for PPO_KL agent
clip_range: 0.2  # for PPO agent
gamma: 0.99
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 1000
test_episode: 5
log_dir: "logs/ppo/"
model_dir: "models/ppo/"

agent: "PPO"  # choice: PPO, PPO_KL
env_name: "Drones"
env_id: "HoverAviary"  # choices: ['CtrlAviary', 'HoverAviary', 'VelocityAviary']
env_seed: 1  # The random seed of the environment.
obs_type: 'kin'
act_type: 'one_d_rpm'
num_drones: 1
record: False
render: False
sleep: 0.01
obstacles: True
max_episode_steps: 2000
vectorize: "DummyVecEnv"
learner: "PPO_Learner"
policy: "Gaussian_AC"  # choice: Gaussian_AC for continuous actions, Categorical_AC for discrete actions.
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [512,]
actor_hidden_size: [512,]
critic_hidden_size: [512,]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 79811
parallels: 10
running_steps: 1000000
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 16
n_minibatch: 8
learning_rate: 0.0004

vf_coef: 0.25
ent_coef: 0.0
target_kl: 0.25  # for PPO_KL agent
kl_coef: 1.0  # for PPO_KL agent
clip_range: 0.2  # for PPO agent
gamma: 0.99
use_gae: True
gae_lambda: 0.95
use_advnorm: True

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 5000
test_episode: 5
log_dir: "logs/ppo/"
model_dir: "models/ppo/"
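
With `use_grad_clip: True`, the `grad_clip_norm: 0.5` entry above bounds the size of each gradient update. The sketch below shows clipping by global norm on a flat list of gradient values: if the L2 norm exceeds the limit, every component is scaled by the same factor. The helper is a hypothetical stand-in for what a deep learning framework's norm-clipping utility does, not the XuanCe code path.

```python
import math


def clip_by_global_norm(grads, max_norm=0.5):
    """Rescale grads so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the limit, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]


if __name__ == "__main__":
    print(clip_by_global_norm([3.0, 4.0]))
    print(clip_by_global_norm([0.1, 0.2]))
```

Scaling all components together preserves the gradient direction, which is the main difference from clipping each value independently.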

agent: "A2C"  # The learning algorithm.
env_name: "Classic Control"  # Environment name.
env_id: "CartPole-v1"  # Environment id
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"  # Method to vectorize the environment.
learner: "A2C_Learner"  # Name of learner.
policy: "Categorical_AC"  # Name of policy.
representation: "Basic_MLP"  # Name of representation.
runner: "DRL"  # Runner.

representation_hidden_size: [128,]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [128,]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128,]  # A list of hidden units for each layer of critic network.
activation: 'leaky_relu'  # The activation function of each hidden layer.

seed: 1  # Random seed.
parallels: 10  # Number of environments to run in parallel.
running_steps: 300000  # The total running steps.
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 8  # Number of minibatches.
learning_rate: 0.0004  # The learning rate.

vf_coef: 0.25  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.98  # Discount factor.
use_gae: True  # Use GAE trick.
gae_lambda: 0.95  # The GAE lambda.
use_advnorm: True  # Whether to use advantage normalization.

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5  # The max norm of the gradient.
use_actions_mask: False  # Whether to use actions mask for unavailable actions.
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5  # The observation normalization range.
rewnorm_range: 5  # The reward normalization range.

eval_interval: 50000  # The interval between two consecutive evaluations.
test_episode: 3  # The number of episodes to run in each evaluation.

log_dir: "logs/a2c/"  # The directory for log files.
model_dir: "models/a2c/"  # The directory for model files.
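
The `vf_coef` and `ent_coef` entries above weight the critic and entropy terms of the actor-critic objective. The sketch below combines the three terms in the usual A2C form, where the entropy bonus is subtracted so that higher entropy lowers the loss. The function name is hypothetical, and the exact sign conventions of the XuanCe learner are an assumption.

```python
def a2c_total_loss(policy_loss, critic_loss, entropy, vf_coef=0.25, ent_coef=0.01):
    """Total loss = policy loss + vf_coef * critic loss - ent_coef * entropy."""
    return policy_loss + vf_coef * critic_loss - ent_coef * entropy


if __name__ == "__main__":
    print(a2c_total_loss(policy_loss=1.0, critic_loss=2.0, entropy=0.5))
```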

agent: "A2C"  # The learning algorithm.
env_name: "Classic Control"  # Environment name.
env_id: "Acrobot-v1"  # Environment id
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"  # Method to vectorize the environment.
learner: "A2C_Learner"  # Name of learner.
policy: "Categorical_AC"  # Name of policy.
representation: "Basic_MLP"  # Name of representation.
runner: "DRL"  # Runner.

representation_hidden_size: [128,]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [128,]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128,]  # A list of hidden units for each layer of critic network.
activation: 'leaky_relu'  # The activation function of each hidden layer.

seed: 1  # Random seed.
parallels: 10  # Number of environments to run in parallel.
running_steps: 300000  # The total running steps.
horizon_size: 256  # the horizon size for an environment, buffer_size = horizon_size * parallels.
n_epochs: 8  # Number of epochs to update the model.
n_minibatch: 8  # Number of minibatches.
learning_rate: 0.0004  # The learning rate.

vf_coef: 0.25  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.98  # Discount factor.
use_gae: True  # Use GAE trick.
gae_lambda: 0.95  # The GAE lambda.
use_advnorm: True  # Whether to use advantage normalization.

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clipping for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5  # The max norm of the gradient.
use_actions_mask: False  # Whether to use actions mask for unavailable actions.
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5  # The observation normalization range.
rewnorm_range: 5  # The reward normalization range.

eval_interval: 50000  # The interval between two consecutive evaluations.
test_episode: 3  # The number of episodes to run in each evaluation.

log_dir: "logs/a2c/"  # The directory for log files.
model_dir: "models/a2c/"  # The directory for model files.

agent: "A2C"  # The learning algorithm.
env_name: "Classic Control"  # Environment name.
env_id: "Pendulum-v1"  # Environment id
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"  # Method to vectorize the environment.
learner: "A2C_Learner"  # Name of learner.
policy: "Gaussian_AC"  # Name of policy.
representation: "Basic_MLP"  # Name of representation.
runner: "DRL"  # Runner.

representation_hidden_size: [128,]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [128,]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128,]  # A list of hidden units for each layer of critic network.
activation: 'leaky_relu'  # The activation function of each hidden layer.
activation_action: 'tanh'  # The activation function for the output of actor.

seed: 1  # Random seeds.
parallels: 10  # Number of environments to run in parallel.
running_steps: 1000000  # The total running steps.
horizon_size: 64  # The horizon size for an environment; buffer_size = horizon_size * parallels.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate: 0.0004  # The learning rate.

vf_coef: 0.25  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.98  # Discount factor.
use_gae: True  # Use GAE trick.
gae_lambda: 0.95  # The GAE lambda.
use_advnorm: True  # Whether to use advantage normalization.

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5  # The max norm of the gradient.
use_actions_mask: False  # Whether to use actions mask for unavailable actions.
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5  # The observation normalization range.
rewnorm_range: 5  # The reward normalization range.

eval_interval: 50000  # The number of training steps between two evaluations.
test_episode: 3  # The number of episodes to run in each evaluation.

log_dir: "logs/a2c/"  # The directory to store logger file.
model_dir: "models/a2c/"  # The directory to store model file.
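
The `use_gae`, `gamma`, and `gae_lambda` entries above configure generalized advantage estimation (GAE). The following is a minimal sketch of the standard GAE recursion using this configuration's values as defaults. It is written in plain Python for illustration and is not XuanCe's implementation; `compute_gae` and its argument names are hypothetical.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.98, gae_lambda=0.95):
    """Standard GAE recursion over one trajectory (illustrative sketch)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets are the advantages plus the value baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


advantages, returns = compute_gae(
    rewards=[1.0, 1.0], values=[0.0, 0.0], last_value=0.0, dones=[False, True]
)
print(advantages, returns)
```

Setting `gae_lambda` closer to 1 in this recursion gives more weight to long-horizon returns, while values closer to 0 rely more on the critic's one-step estimate.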

agent: "A2C"  # The learning algorithms.
env_name: "Classic Control"  # Environment name.
env_id: "MountainCar-v0"  # Environment id
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"  # Method to vectorize the environment.
learner: "A2C_Learner"  # Name of learner.
policy: "Categorical_AC"  # Name of policy.
representation: "Basic_MLP"  # Name of representation.
runner: "DRL"  # Runner.

representation_hidden_size: [256,]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [256,]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [256,]  # A list of hidden units for each layer of critic network.
activation: 'leaky_relu'  # The activation function of each hidden layer.

seed: 1  # Random seeds.
parallels: 10  # Number of environments to run in parallel.
running_steps: 300000  # The total running steps.
horizon_size: 128  # The horizon size for an environment; buffer_size = horizon_size * parallels.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate: 0.0004  # The learning rate.

vf_coef: 0.25  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.98  # Discount factor.
use_gae: True  # Use GAE trick.
gae_lambda: 0.95  # The GAE lambda.
use_advnorm: True  # Whether to use advantage normalization.

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5  # The max norm of the gradient.
use_actions_mask: False  # Whether to use actions mask for unavailable actions.
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5  # The observation normalization range.
rewnorm_range: 5  # The reward normalization range.

eval_interval: 50000  # The number of training steps between two evaluations.
test_episode: 3  # The number of episodes to run in each evaluation.

log_dir: "logs/a2c/"  # The directory to store logger file.
model_dir: "models/a2c/"  # The directory to store model file.

agent: "A2C"  # The learning algorithms.
env_name: "Box2D"  # Environment name.
env_id: "BipedalWalker-v3"  # Environment id
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"  # Method to vectorize the environment.
learner: "A2C_Learner"  # Name of learner.
policy: "Gaussian_AC"  # Name of policy.
representation: "Basic_MLP"  # Name of representation.
runner: "DRL"  # Runner.

representation_hidden_size: [64,]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [64,]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [64,]  # A list of hidden units for each layer of critic network.
activation: 'leaky_relu'  # The activation function of each hidden layer.
activation_action: 'tanh'  # The activation function for the output of actor.

seed: 1  # Random seeds.
parallels: 10  # Number of environments to run in parallel.
running_steps: 1000000  # The total running steps.
horizon_size: 128  # The horizon size for an environment; buffer_size = horizon_size * parallels.
n_epochs: 5  # Number of epochs to update the model.
n_minibatch: 8  # Number of minibatches.
learning_rate: 0.0004  # The learning rate.

vf_coef: 0.25  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.98  # Discount factor.
use_gae: True  # Use GAE trick.
gae_lambda: 0.95  # The GAE lambda.
use_advnorm: True  # Whether to use advantage normalization.

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5  # The max norm of the gradient.
use_actions_mask: False  # Whether to use actions mask for unavailable actions.
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5  # The observation normalization range.
rewnorm_range: 5  # The reward normalization range.

eval_interval: 50000  # The number of training steps between two evaluations.
test_episode: 5  # The number of episodes to run in each evaluation.

log_dir: "logs/a2c/"  # The directory to store logger file.
model_dir: "models/a2c/"  # The directory to store model file.

agent: "A2C"  # The learning algorithms.
env_name: "MuJoCo"  # Environment name.
env_id: "Ant-v4"  # Environment id
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"  # Method to vectorize the environment.
learner: "A2C_Learner"  # Name of learner.
policy: "Gaussian_AC"  # Name of policy.
representation: "Basic_MLP"  # Name of representation.
runner: "DRL"  # Runner.

representation_hidden_size: [256,]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [256,]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [256,]  # A list of hidden units for each layer of critic network.
activation: 'leaky_relu'  # The activation function of each hidden layer.
activation_action: 'tanh'  # The activation function for the output of actor.

seed: 6782  # Random seeds.
parallels: 16  # Number of environments to run in parallel.
running_steps: 1000000  # The total running steps.
horizon_size: 16  # The horizon size for an environment; buffer_size = horizon_size * parallels.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate: 0.0007  # The learning rate.

vf_coef: 0.25  # Coefficient factor for critic loss.
ent_coef: 0.0  # Coefficient factor for entropy loss.
gamma: 0.99  # Discount factor.
use_gae: True  # Use GAE trick.
gae_lambda: 0.95  # The GAE lambda.
use_advnorm: True  # Whether to use advantage normalization.

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5  # The max norm of the gradient.
use_actions_mask: False  # Whether to use actions mask for unavailable actions.
use_obsnorm: True  # Whether to use observation normalization trick.
use_rewnorm: True  # Whether to use reward normalization trick.
obsnorm_range: 5  # The observation normalization range.
rewnorm_range: 5  # The reward normalization range.

eval_interval: 5000  # The number of training steps between two evaluations.
test_episode: 5  # The number of episodes to run in each evaluation.

log_dir: "logs/a2c/"  # The directory to store logger file.
model_dir: "models/a2c/"  # The directory to store model file.

agent: "A2C"  # The learning algorithms.
vectorize: "Dummy_Atari"  # Method to vectorize the environment.
env_name: "Atari"  # Environment name.
env_id: "ALE/Breakout-v5"  # Environment id
env_seed: 1  # The random seed of the environment.
obs_type: "grayscale"  # choice for Atari env: ram, rgb, grayscale
img_size: [84, 84]  # default is 210 x 160 in gym[Atari]
num_stack: 4  # frame stack trick
frame_skip: 4  # frame skip trick
noop_max: 30  # Do no-op action for a number of steps in [1, noop_max].
learner: "A2C_Learner"  # Name of learner.
policy: "Categorical_AC"  # Name of policy.
representation: "Basic_CNN"  # Name of representation.
runner: "DRL"  # Runner.

# the following three arguments are for "Basic_CNN" representation.
filters: [32, 32, 64, 64]
kernels: [8, 4, 4, 4]
strides: [4, 2, 2, 2]
actor_hidden_size: [128, 128]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128, 128]  # A list of hidden units for each layer of critic network.
activation: 'leaky_relu'  # The activation function of each hidden layer.

seed: 1  # Random seeds.
parallels: 5  # Number of environments to run in parallel.
running_steps: 10000000  # The total running steps.
horizon_size: 256  # The horizon size for an environment; buffer_size = horizon_size * parallels.
n_epochs: 4  # Number of epochs to update the model.
n_minibatch: 8  # Number of minibatches.
learning_rate: 0.0007  # The learning rate.

vf_coef: 0.25  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.99  # Discount factor.
use_gae: True  # Use GAE trick.
gae_lambda: 0.95  # The GAE lambda.
use_advnorm: True  # Whether to use advantage normalization.

use_grad_clip: True  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5  # The max norm of the gradient.
use_actions_mask: False  # Whether to use actions mask for unavailable actions.
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5  # The observation normalization range.
rewnorm_range: 5  # The reward normalization range.

eval_interval: 100000  # The number of training steps between two evaluations.
test_episode: 3  # The number of episodes to run in each evaluation.

log_dir: "logs/a2c/"  # The directory to store logger file.
model_dir: "models/a2c/"  # The directory to store model file.

agent: "SAC"
env_name: "Classic Control"
env_id: "CartPole-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "SACDIS_Learner"
policy: "Categorical_SAC"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
critic_hidden_size: [128,]
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 256
learning_rate_actor: 0.001
learning_rate_critic: 0.001
gamma: 0.99
alpha: 0.2
use_automatic_entropy_tuning: False
tau: 0.005

training_frequency: 2
running_steps: 500000
start_training: 2000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 5
log_dir: "logs/sac/"
model_dir: "models/sac/"
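
In off-policy actor-critic methods, `tau` is conventionally the Polyak averaging coefficient for target networks. The sketch below assumes `tau: 0.005` is used that way here and shows one soft update on plain Python lists standing in for network parameters. `soft_update` is a hypothetical helper, not a XuanCe function.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


target = [0.0, 1.0]   # stand-in for target-network weights
online = [1.0, 1.0]   # stand-in for online-network weights
target = soft_update(target, online)
print(target)
```

With a small `tau` the target network trails the online network slowly, which is the usual reason for choosing a value such as 0.005.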

agent: "SAC"
env_name: "Classic Control"
env_id: "Acrobot-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "SACDIS_Learner"
policy: "Categorical_SAC"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
critic_hidden_size: [128,]
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 256
learning_rate_actor: 0.001
learning_rate_critic: 0.001
gamma: 0.99
alpha: 0.2
use_automatic_entropy_tuning: True
tau: 0.005

training_frequency: 2
running_steps: 500000
start_training: 2000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 5
log_dir: "logs/sac/"
model_dir: "models/sac/"

agent: "SAC"
env_name: "Classic Control"
env_id: "Pendulum-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "SAC_Learner"
policy: "Gaussian_SAC"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [256,]
actor_hidden_size: [256,]
critic_hidden_size: [256,]
activation: "relu"  # The activation function of each hidden layer.
activation_action: 'tanh'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 256
learning_rate_actor: 0.001
learning_rate_critic: 0.001
gamma: 0.98
alpha: 0.2
use_automatic_entropy_tuning: True
tau: 0.005

training_frequency: 1
running_steps: 300000
start_training: 1000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 5
log_dir: "logs/sac/"
model_dir: "models/sac/"

agent: "SAC"
env_name: "Classic Control"
env_id: "MountainCar-v0"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "SACDIS_Learner"
policy: "Categorical_SAC"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
actor_hidden_size: [128,]
critic_hidden_size: [128,]
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 256
learning_rate_actor: 0.001
learning_rate_critic: 0.01
gamma: 0.98
alpha: 0.2
use_automatic_entropy_tuning: True
tau: 0.005

training_frequency: 2
running_steps: 500000
start_training: 2000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 5
log_dir: "logs/sac/"
model_dir: "models/sac/"

agent: "SAC"
env_name: "Box2D"
env_id: "BipedalWalker-v3"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "SAC_Learner"
policy: "Gaussian_SAC"
representation: "Basic_Identical"
runner: "DRL"

representation_hidden_size:  # Ignored when the representation is Basic_Identical.
actor_hidden_size: [256, 256]
critic_hidden_size: [256, 256]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 1
parallels: 10  # number of environments
buffer_size: 200000
batch_size: 256
learning_rate_actor: 0.001
learning_rate_critic: 0.001
gamma: 0.99
alpha: 0.2
use_automatic_entropy_tuning: True
tau: 0.005

training_frequency: 1
running_steps: 5000000
start_training: 2000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 500000
test_episode: 5
log_dir: "logs/sac/"
model_dir: "models/sac/"

agent: "SAC"
vectorize: "Dummy_Atari"
env_name: "Atari"
env_id: "ALE/Breakout-v5"
env_seed: 1  # The random seed of the environment.
obs_type: "grayscale"  # choice for Atari env: ram, rgb, grayscale
img_size: [84, 84]  # default is 210 x 160 in gym[Atari]
num_stack: 4  # frame stack trick
frame_skip: 4  # frame skip trick
noop_max: 30  # Do no-op action for a number of steps in [1, noop_max].
representation: "Basic_CNN"
policy: "Categorical_SAC"
learner: "SACDIS_Learner"
runner: "DRL"

filters: [32, 32, 64, 64]
kernels: [8, 4, 4, 4]
strides: [4, 2, 2, 2]
actor_hidden_size: [128, 128]
critic_hidden_size: [128, 128]
activation: "leaky_relu"

seed: 1069
parallels: 5
buffer_size: 500000
batch_size: 32  # 64
learning_rate_actor: 0.001
learning_rate_critic: 0.001
gamma: 0.99
alpha: 0.01
use_automatic_entropy_tuning: False
tau: 0.005

training_frequency: 1
running_steps: 50000000  # 50M
start_training: 10000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 500000
test_episode: 1
log_dir: "logs/sac/"
model_dir: "models/sac/"

agent: "SAC"
env_name: "MuJoCo"
env_id: "Ant-v4"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
learner: "SAC_Learner"
policy: "Gaussian_SAC"
representation: "Basic_Identical"
runner: "DRL"

representation_hidden_size:  # Ignored when the representation is Basic_Identical.
actor_hidden_size: [256, 256]
critic_hidden_size: [256, 256]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 1
parallels: 4  # number of environments
buffer_size: 1000000
batch_size: 256
learning_rate_actor: 0.001
learning_rate_critic: 0.001
gamma: 0.99
alpha: 0.2
use_automatic_entropy_tuning: True
tau: 0.005

training_frequency: 1
running_steps: 1000000
start_training: 10000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 10000
test_episode: 5
log_dir: "logs/sac/"
model_dir: "models/sac/"

agent: "SAC"
env_name: "MetaDrive"
env_id: "metadrive"
env_seed: 1  # The random seed of the environment.
env_config:  # the configs for MetaDrive environment
  map: "C"  # see https://metadrive-simulator.readthedocs.io/en/latest/rl_environments.html#generalization-environment for choices
render: False
vectorize: "DummyVecEnv"
learner: "SAC_Learner"
policy: "Gaussian_SAC"
representation: "Basic_Identical"
runner: "DRL"

representation_hidden_size:  # Ignored when the representation is Basic_Identical.
actor_hidden_size: [512, 512]
critic_hidden_size: [512, 512]
activation: "relu"  # The activation function of each hidden layer.
activation_action: 'tanh'

seed: 1
parallels: 4
buffer_size: 1000000
batch_size: 256
learning_rate_actor: 0.0003
learning_rate_critic: 0.0003
gamma: 0.99
alpha: 0.2
use_automatic_entropy_tuning: True
tau: 0.005

training_frequency: 1
running_steps: 1000000
start_training: 10000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 10000
test_episode: 5
log_dir: "logs/sac/"
model_dir: "models/sac/"

agent: "SAC"
env_name: "Drones"
env_id: "HoverAviary"
env_seed: 1  # The random seed of the environment.
obs_type: 'kin'
act_type: 'one_d_rpm'
num_drones: 1
record: False
obstacles: True
max_episode_steps: 2000
render: False
sleep: 0.01
vectorize: "DummyVecEnv"
learner: "SAC_Learner"
policy: "Gaussian_SAC"
representation: "Basic_Identical"
runner: "DRL"

representation_hidden_size:  # Ignored when the representation is Basic_Identical.
actor_hidden_size: [512, 512]
critic_hidden_size: [512, 512]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 1
parallels: 10
buffer_size: 1000000
batch_size: 256
learning_rate_actor: 0.0003
learning_rate_critic: 0.0003
gamma: 0.99
alpha: 0.2
use_automatic_entropy_tuning: True
tau: 0.005

training_frequency: 1
running_steps: 1000000
start_training: 10000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 10000
test_episode: 5
log_dir: "logs/sac/"
model_dir: "models/sac/"

agent: "DDPG"
env_name: "Classic Control"
env_id: "Pendulum-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
policy: "DDPG_Policy"
representation: "Basic_MLP"
learner: "DDPG_Learner"
runner: "DRL"

representation_hidden_size: [256,]
actor_hidden_size: [256,]
critic_hidden_size: [256,]
activation: "relu"  # The activation function of each hidden layer.
activation_action: 'tanh'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 256
learning_rate_actor: 0.001
learning_rate_critic: 0.001
gamma: 0.98
tau: 0.005

start_noise: 0.1
end_noise: 0.1
training_frequency: 1
running_steps: 500000
start_training: 1000

use_grad_clip: False  # Whether to clip gradients.
grad_clip_norm: 0.5
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 3
log_dir: "logs/ddpg/"
model_dir: "models/ddpg/"
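
The `start_noise` and `end_noise` entries above control the scale of exploration noise added to the deterministic action. The sketch below assumes zero-mean Gaussian noise whose standard deviation equals the noise scale, with the result clipped to [-1, 1] because `activation_action` is `tanh`. Both the noise distribution and the clipping range are assumptions made for illustration, and `noisy_action` is a hypothetical helper.

```python
import random


def noisy_action(action, noise_scale, rng, low=-1.0, high=1.0):
    """Add Gaussian exploration noise to each action dimension, then clip."""
    return [min(max(a + rng.gauss(0.0, noise_scale), low), high) for a in action]


rng = random.Random(1)  # fixed seed so the sketch is reproducible
explored = noisy_action([0.95, -0.2], noise_scale=0.1, rng=rng)
print(explored)
```

Because `start_noise` and `end_noise` are both 0.1 in this configuration, the noise scale stays constant for the whole run.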

agent: "DDPG"
env_name: "Box2D"
env_id: "BipedalWalker-v3"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_Identical"
policy: "DDPG_Policy"
learner: "DDPG_Learner"
runner: "DRL"

representation_hidden_size:  # Ignored when the representation is Basic_Identical.
actor_hidden_size: [256, 256]
critic_hidden_size: [256, 256]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 1
parallels: 10  # number of environments
buffer_size: 200000
batch_size: 256
learning_rate_actor: 0.001
learning_rate_critic: 0.001
gamma: 0.99
tau: 0.005

start_noise: 0.5
end_noise: 0.1
training_frequency: 1
running_steps: 2000000
start_training: 1000

use_grad_clip: False  # Whether to clip gradients.
grad_clip_norm: 0.5
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 5
log_dir: "logs/ddpg/"
model_dir: "models/ddpg/"

agent: "DDPG"
env_name: "MuJoCo"
env_id: "Ant-v4"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
policy: "DDPG_Policy"
representation: "Basic_Identical"
learner: "DDPG_Learner"
runner: "DRL"

representation_hidden_size:  # If you choose Basic_Identical representation, then ignore this value
actor_hidden_size: [400, 300]
critic_hidden_size: [400, 300]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 19089
parallels: 4  # number of environments
buffer_size: 200000  # replay buffer size
batch_size: 100
learning_rate_actor: 0.001
learning_rate_critic: 0.001
gamma: 0.99
tau: 0.005

start_noise: 0.5
end_noise: 0.1
training_frequency: 1
running_steps: 1000000  # 1M
start_training: 10000

use_grad_clip: False  # Whether to clip gradients.
grad_clip_norm: 0.5
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 5000
test_episode: 5

log_dir: "logs/ddpg/"
model_dir: "models/ddpg/"

agent: "DDPG"
env_name: "Drones"
env_id: "HoverAviary"
env_seed: 1  # The random seed of the environment.
obs_type: 'kin'
act_type: 'one_d_rpm'
num_drones: 1
record: False
obstacles: True
max_episode_steps: 2000
render: False
sleep: 0.01
vectorize: "DummyVecEnv"
policy: "DDPG_Policy"
representation: "Basic_Identical"
learner: "DDPG_Learner"
runner: "DRL"

actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 1
parallels: 10
buffer_size: 1000000  # replay buffer size
batch_size: 1024
learning_rate_actor: 0.001
learning_rate_critic: 0.001
gamma: 0.99
tau: 0.005

start_noise: 0.1
end_noise: 0.1
training_frequency: 1
running_steps: 10000000  # total running steps (10M)
start_training: 2000

use_grad_clip: False  # Whether to clip gradients.
grad_clip_norm: 0.5
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 100000
test_episode: 3
log_dir: "logs/ddpg/"
model_dir: "models/ddpg/"

agent: "TD3"
env_name: "Classic Control"
env_id: "Pendulum-v1"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_Identical"
policy: "TD3_Policy"
learner: "TD3_Learner"
runner: "DRL"

representation_hidden_size: [64]
actor_hidden_size: [256, ]
critic_hidden_size: [256, ]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 256
learning_rate_actor: 0.0005
learning_rate_critic: 0.001
gamma: 0.98
tau: 0.005
actor_update_delay: 3

start_noise: 0.25
end_noise: 0.05
training_frequency: 2
running_steps: 500000
start_training: 2000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 1
log_dir: "logs/td3/"
model_dir: "models/td3/"
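
TD3 updates the actor less often than the critics, and `actor_update_delay: 3` above sets that ratio. The sketch below counts updates under the assumption that the actor is updated once every `actor_update_delay` critic updates, as in the original TD3 formulation. The exact counter logic in XuanCe may differ, and `count_updates` is a hypothetical helper.

```python
def count_updates(n_train_steps, actor_update_delay=3):
    """Count critic and actor updates under a delayed-policy-update rule."""
    critic_updates = 0
    actor_updates = 0
    for step in range(1, n_train_steps + 1):
        critic_updates += 1  # critics are updated on every training step
        if step % actor_update_delay == 0:
            actor_updates += 1  # actor is updated only every few steps
    return critic_updates, actor_updates


critic_updates, actor_updates = count_updates(n_train_steps=9)
print(critic_updates, actor_updates)
```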

agent: "TD3"
env_name: "Box2D"
env_id: "BipedalWalker-v3"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_Identical"
policy: "TD3_Policy"
learner: "TD3_Learner"
runner: "DRL"

representation_hidden_size:  # Ignored when the representation is Basic_Identical.
actor_hidden_size: [256, 256]
critic_hidden_size: [256, 256]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 1
parallels: 10
buffer_size: 200000
batch_size: 256
learning_rate_actor: 0.0005
learning_rate_critic: 0.001
gamma: 0.99
tau: 0.005
actor_update_delay: 3

start_noise: 0.25
end_noise: 0.05
training_frequency: 2
running_steps: 2000000
start_training: 2000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 50000
test_episode: 5
log_dir: "logs/td3/"
model_dir: "models/td3/"

agent: "TD3"
env_name: "MuJoCo"
env_id: "Ant-v4"
env_seed: 1  # The random seed of the environment.
vectorize: "DummyVecEnv"
representation: "Basic_Identical"
policy: "TD3_Policy"
learner: "TD3_Learner"
runner: "DRL"

representation_hidden_size:  # If you choose Basic_Identical representation, then ignore this value
actor_hidden_size: [400, 300]
critic_hidden_size: [400, 300]
activation: "leaky_relu"
activation_action: 'tanh'

seed: 6782
parallels: 4  # number of environments
buffer_size: 200000
batch_size: 256
learning_rate_actor: 0.001
actor_update_delay: 2
learning_rate_critic: 0.001
gamma: 0.99
tau: 0.005

start_noise: 0.1
end_noise: 0.1
training_frequency: 1
running_steps: 1000000
start_training: 25000

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

eval_interval: 5000
test_episode: 5
log_dir: "logs/td3/"
model_dir: "models/td3/"

agent: "PDQN"
env_name: "Platform"
env_id: "Platform-v0"
env_seed: 1  # The random seed of the environment.
max_episode_steps: 200
vectorize: "NOREQUIRED"
render: False
learner: "PDQN_Learner"
policy: "PDQN_Policy"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
conactor_hidden_size: [128,]
qnetwork_hidden_size: [128, ]
activation: "relu"  # The activation function of each hidden layer.
activation_action: 'tanh'

buffer_size: 20000
batch_size: 128
learning_rate: 0.001
gamma: 0.99
tau: 0.005

start_greedy: 0.5
end_greedy: 0.05
decay_step_greedy: 1000000  # 1M
start_noise: 0.1
end_noise: 0.1
training_frequency: 1
running_steps: 30000
start_training: 1000

eval_interval: 1000
test_episode: 5

use_grad_clip: False  # Whether to clip gradients.
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

log_dir: "logs/pdqn/"
model_dir: "models/pdqn/"
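
The `start_greedy`, `end_greedy`, and `decay_step_greedy` entries above describe an exploration schedule. The sketch below assumes they are the initial and final exploration rates and that the rate decays linearly with the step count; neither assumption is stated in the file. `linear_epsilon` is a hypothetical helper.

```python
def linear_epsilon(step, start_greedy=0.5, end_greedy=0.05, decay_step_greedy=1_000_000):
    """Linearly interpolate the exploration rate, holding it at end_greedy afterwards."""
    fraction = min(step / decay_step_greedy, 1.0)
    return start_greedy + fraction * (end_greedy - start_greedy)


# running_steps is 30000 in this configuration, far fewer than decay_step_greedy.
eps_at_end_of_run = linear_epsilon(step=30_000)
print(eps_at_end_of_run)
```

If the decay really is linear, this run covers only 3% of the decay window, so the exploration rate would still be close to `start_greedy` when training stops.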

agent: "MPDQN"
env_name: "Platform"
env_id: "Platform-v0"
env_seed: 1  # The random seed of the environment.
max_episode_steps: 200
vectorize: "NOREQUIRED"
render: False
learner: "MPDQN_Learner"
policy: "MPDQN_Policy"
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
conactor_hidden_size: [128,]
qnetwork_hidden_size: [128, ]
activation: "relu"  # The activation function of each hidden layer.
activation_action: 'tanh'

buffer_size: 20000
batch_size: 128
learning_rate: 0.001
gamma: 0.99
tau: 0.005

start_greedy: 0.5
end_greedy: 0.05
decay_step_greedy: 1000000  # 1M
start_noise: 0.1
end_noise: 0.1
training_frequency: 1
running_steps: 30000
start_training: 1000

eval_interval: 1000
test_episode: 5

use_grad_clip: False  # Whether to clip gradients.
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

log_dir: "logs/mpdqn/"
model_dir: "models/mpdqn/"

agent: "SPDQN"
env_name: "Platform"
env_id: "Platform-v0"
env_seed: 1  # The random seed of the environment.
max_episode_steps: 200
vectorize: "NOREQUIRED"
learner: "SPDQN_Learner"
policy: "SPDQN_Policy"
render: False
representation: "Basic_MLP"
runner: "DRL"

representation_hidden_size: [128,]
conactor_hidden_size: [128,]
qnetwork_hidden_size: [128, ]
activation: "relu"  # The activation function of each hidden layer.
activation_action: 'tanh'

buffer_size: 20000
batch_size: 128
learning_rate: 0.001
gamma: 0.99
tau: 0.005

start_noise: 0.1
end_noise: 0.1
training_frequency: 1
running_steps: 30000
start_training: 1000

eval_interval: 1000
test_episode: 5

use_grad_clip: False  # Whether to clip gradients.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
grad_clip_norm: 0.5
use_actions_mask: False
use_obsnorm: False  # Whether to use observation normalization trick.
use_rewnorm: False  # Whether to use reward normalization trick.
obsnorm_range: 5
rewnorm_range: 5

log_dir: "logs/spdqn/"
model_dir: "models/spdqn/"
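
Both presets above set tau: 0.005. A coefficient with this name is typically used for a soft (Polyak) update of target-network parameters; that this is how XuanCe applies it is an assumption here. The sketch works on plain Python lists, and `soft_update` is a hypothetical helper.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Return new target parameters: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_params, online_params)]


# Toy parameters: the target moves 0.5% of the way toward the online values.
target = [0.0, 1.0]
online = [1.0, 1.0]
target = soft_update(target, online)
```

A small tau keeps the target network slowly varying, which is the usual motivation for this kind of update.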

MARL Algorithms

agent: "IQL"  # The name of the MARL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: False  # Continuous action space or not.
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_MLP"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"  # Runner

use_rnn: False  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate: 0.001
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 2500000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
training_frequency: 25
sync_frequency: 100

use_grad_clip: False
grad_clip_norm: 0.5
use_parameter_sharing: True
use_actions_mask: False

eval_interval: 100000
test_episode: 5
log_dir: "logs/iql/"
model_dir: "models/iql/"
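
The preset above sets double_q: True together with gamma: 0.99. In double Q-learning, the online network selects the greedy next action and the target network evaluates it. The sketch below illustrates that target for a single transition of one agent using plain lists; `double_q_target` and its arguments are hypothetical and do not correspond to XuanCe's API.

```python
def double_q_target(reward, done, q_next_online, q_next_target, gamma=0.99):
    """Double Q-learning target for a single transition.

    q_next_online / q_next_target hold the per-action Q-values of the next
    observation from the online and target networks, respectively.
    """
    # The online network selects the greedy next action ...
    best_action = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    # ... and the target network evaluates that action (no bootstrap if done).
    bootstrap = 0.0 if done else q_next_target[best_action]
    return reward + gamma * bootstrap
```

With double_q set to False, the usual alternative is to take the maximum over the target network's own Q-values instead.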

agent: "IQL"  # The name of the MARL algorithm.
env_name: "RoboticWarehouse"
env_id: "rware-tiny-2ag-v1"
env_seed: 1  # The random seed of the environment.
max_episode_steps: 100
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_MLP"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"  # Runner

use_rnn: False  # Whether to use recurrent neural networks.
rnn:
representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate: 0.001
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 5000000  # 5M
training_frequency: 1
sync_frequency: 100

use_grad_clip: False
grad_clip_norm: 0.5
use_parameter_sharing: True
use_actions_mask: True

eval_interval: 100000
test_episode: 5
log_dir: "logs/iql/"
model_dir: "models/iql/"
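
The preset above enables use_actions_mask. One common way to apply such a mask when acting greedily is to ignore the Q-values of unavailable actions. The sketch below assumes that interpretation; `masked_greedy_action` is a hypothetical helper, not a XuanCe function.

```python
import math


def masked_greedy_action(q_values, action_mask):
    """Pick the highest-valued action among those whose mask entry is truthy."""
    if not any(action_mask):
        raise ValueError("at least one action must be available")
    # Unavailable actions get -inf so they can never win the argmax.
    masked = [q if available else -math.inf
              for q, available in zip(q_values, action_mask)]
    return max(range(len(masked)), key=masked.__getitem__)


# Action 1 has the largest Q-value but is masked out, so action 2 is chosen.
action = masked_greedy_action([0.2, 0.9, 0.5], [1, 0, 1])
```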

agent: "IQL"  # The name of the MARL algorithm.
env_name: "MAgent2"
env_id: "adversarial_pursuit_v4"
env_seed: 1  # The random seed of the environment.
minimap_mode: False
max_cycles: 500
extra_features: False
map_size: 45
render_mode: "rgb_array"
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_MLP"
vectorize: "Dummy_MAgent"
runner: "RunnerMAgent"

# recurrent settings for Basic_RNN representation
use_rnn: False  # Whether to use recurrent neural networks.
rnn:
representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 10
buffer_size: 20000
batch_size: 256
learning_rate: 0.001
gamma: 0.95  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 2500000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
training_frequency: 1
sync_frequency: 100

use_grad_clip: False
grad_clip_norm: 0.5
use_parameter_sharing: True
use_actions_mask: True

eval_interval: 100000
test_episode: 5
log_dir: "logs/iql/"
model_dir: "models/iql/"

agent: "IQL"  # The name of the MARL algorithm.
global_state: False
# environment settings
env_name: "Football"
scenario: "academy_3_vs_1_with_keeper"
env_seed: 1  # The random seed of the environment.
use_stacked_frames: False  # Whether to use stacked_frames
num_agent: 3
num_adversary: 0
obs_type: "simple115v2"  # representation used to build the observation, choices: ["simple115v2", "extracted", "pixels_gray", "pixels"]
rewards_type: "scoring,checkpoints"  # comma separated list of rewards to be added
smm_width: 96  # width of super minimap
smm_height: 72  # height of super minimap
fps: 15  # Frames per second.
max_episode_steps: 1000
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_RNN"
vectorize: "Subproc_Football"
runner: "RunnerFootball"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [128, ]
recurrent_hidden_size: 128
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [128, ]
q_hidden_size: [128, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 50
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 25000000  # 25M
training_frequency: 60
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 250000
test_episode: 50
log_dir: "logs/iql/"
model_dir: "models/iql/"
videos_dir: "./videos/iql/"
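
The Football preset above documents two string-valued settings in its comments: obs_type must be one of four listed choices, and rewards_type is a comma-separated list. The sketch below checks both before a run is launched; `check_football_settings` is a hypothetical helper and the dict stands in for a loaded config.

```python
# Choices copied from the obs_type comment in the preset above.
OBS_TYPE_CHOICES = ("simple115v2", "extracted", "pixels_gray", "pixels")


def check_football_settings(config):
    """Validate obs_type and split rewards_type into a list of reward names."""
    if config["obs_type"] not in OBS_TYPE_CHOICES:
        raise ValueError(f"unknown obs_type: {config['obs_type']!r}")
    return [name.strip() for name in config["rewards_type"].split(",") if name.strip()]


rewards = check_football_settings(
    {"obs_type": "simple115v2", "rewards_type": "scoring,checkpoints"}
)
```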

agent: "IQL"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "2m_vs_1z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/iql/"
model_dir: "models/iql/"

agent: "IQL"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "3m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/iql/"
model_dir: "models/iql/"

agent: "IQL"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/iql/"
model_dir: "models/iql/"

agent: "IQL"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "1c3s5z"  # Environment ID
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 2000000  # 2M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 20000
test_episode: 16

log_dir: "logs/iql/"
model_dir: "models/iql/"

agent: "IQL"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "2s3z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 2000000  # 2M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 20000
test_episode: 16

log_dir: "logs/iql/"
model_dir: "models/iql/"

agent: "IQL"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "25m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 5000000  # 5M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 50000
test_episode: 16

log_dir: "logs/iql/"
model_dir: "models/iql/"

agent: "IQL"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "5m_vs_6m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/iql/"
model_dir: "models/iql/"

agent: "IQL"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m_vs_9m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/iql/"
model_dir: "models/iql/"

agent: "IQL"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "MMM2"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/iql/"
model_dir: "models/iql/"

agent: "IQL"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "corridor"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IQL_Learner"
policy: "Basic_Q_network_marl"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/iql/"
model_dir: "models/iql/"
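
The ten StarCraft2 presets for IQL above are identical except for env_id, running_steps, and eval_interval. The sketch below collects those values as listed and computes the ratio running_steps / eval_interval, which is 100 for every map. The dictionary and function names are illustrative only.

```python
# (running_steps, eval_interval) per env_id, copied from the IQL StarCraft2 presets above.
SC2_IQL_BUDGETS = {
    "2m_vs_1z": (1_000_000, 10_000),
    "3m": (1_000_000, 10_000),
    "8m": (1_000_000, 10_000),
    "1c3s5z": (2_000_000, 20_000),
    "2s3z": (2_000_000, 20_000),
    "25m": (5_000_000, 50_000),
    "5m_vs_6m": (10_000_000, 100_000),
    "8m_vs_9m": (10_000_000, 100_000),
    "MMM2": (10_000_000, 100_000),
    "corridor": (10_000_000, 100_000),
}


def num_eval_intervals(running_steps, eval_interval):
    """Number of whole evaluation intervals that fit into the training budget."""
    return running_steps // eval_interval


if __name__ == "__main__":
    for env_id, (running_steps, eval_interval) in SC2_IQL_BUDGETS.items():
        print(env_id, num_eval_intervals(running_steps, eval_interval))
```

When adding a preset for another map, keeping the same ratio gives learning curves with a comparable number of evaluation points.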

agent: "VDN"  # The name of the MARL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: False  # Continuous action space or not.
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_MLP"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"  # Runner

use_rnn: False  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate: 0.001
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 2500000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
training_frequency: 25
sync_frequency: 100

use_grad_clip: False
grad_clip_norm: 0.5
use_parameter_sharing: True
use_actions_mask: False

eval_interval: 100000
test_episode: 5
log_dir: "logs/vdn/"
model_dir: "models/vdn/"
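
VDN (value decomposition networks) factorizes the joint action value as the sum of the per-agent values of the chosen actions, which is presumably the role of the Mixing_Q_network policy in the preset above. The sketch below shows only that mixing step; `vdn_mix` is a hypothetical helper, not XuanCe's implementation.

```python
def vdn_mix(agent_q_values):
    """VDN mixing: the joint Q-value is the plain sum of the agents' chosen-action Q-values."""
    return sum(agent_q_values)


# Three hypothetical agents with chosen-action Q-values 0.5, 1.25 and -0.25.
q_tot = vdn_mix([0.5, 1.25, -0.25])
```

Because the sum is monotonic in each agent's value, each agent can act greedily on its own Q-values and still maximize the joint value.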

agent: "VDN"  # The name of the MARL algorithm.
env_name: "RoboticWarehouse"
env_id: "rware-tiny-2ag-v1"
env_seed: 1  # The random seed of the environment.
max_episode_steps: 100
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_MLP"
vectorize: "Dummy_RoboticWarehouse"
runner: "MARL"  # Runner

use_rnn: False  # Whether to use recurrent neural networks.
rnn:
representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate: 0.001
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 2500000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
training_frequency: 1
sync_frequency: 100

use_grad_clip: False
grad_clip_norm: 0.5
use_parameter_sharing: True
use_actions_mask: True

eval_interval: 100000
test_episode: 5
log_dir: "logs/vdn/"
model_dir: "models/vdn/"

agent: "VDN"  # The name of the MARL algorithm.
global_state: False
# environment settings
env_name: "Football"
scenario: "academy_3_vs_1_with_keeper"
env_seed: 1  # The random seed of the environment.
use_stacked_frames: False  # Whether to use stacked_frames
num_agent: 3
num_adversary: 0
obs_type: "simple115v2"  # representation used to build the observation, choices: ["simple115v2", "extracted", "pixels_gray", "pixels"]
rewards_type: "scoring,checkpoints"  # comma separated list of rewards to be added
smm_width: 96  # width of super minimap
smm_height: 72  # height of super minimap
fps: 15  # Frames per second.
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Dummy_Football"
runner: "RunnerFootball"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [128, ]
recurrent_hidden_size: 128
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [128, ]
q_hidden_size: [128, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 50
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 25000000  # 25M
training_frequency: 1
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5
use_parameter_sharing: True
use_actions_mask: True

eval_interval: 250000
test_episode: 50
log_dir: "logs/vdn/"
model_dir: "models/vdn/"
videos_dir: "./videos/vdn/"

agent: "VDN"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "2m_vs_1z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/vdn/"
model_dir: "models/vdn/"

agent: "VDN"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "3m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/vdn/"
model_dir: "models/vdn/"

agent: "VDN"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/vdn/"
model_dir: "models/vdn/"

agent: "VDN"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "1c3s5z"  # Environment ID
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 2000000  # 2M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 20000
test_episode: 16

log_dir: "logs/vdn/"
model_dir: "models/vdn/"

agent: "VDN"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "2s3z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 2000000  # 2M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 20000
test_episode: 16

log_dir: "logs/vdn/"
model_dir: "models/vdn/"

agent: "VDN"  # The name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "25m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 5000000  # 5M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 50000
test_episode: 16

log_dir: "logs/vdn/"
model_dir: "models/vdn/"

agent: "VDN"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "5m_vs_6m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/vdn/"
model_dir: "models/vdn/"

agent: "VDN"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m_vs_9m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/vdn/"
model_dir: "models/vdn/"

agent: "VDN"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "MMM2"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/vdn/"
model_dir: "models/vdn/"

agent: "VDN"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "corridor"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/vdn/"
model_dir: "models/vdn/"

agent: "VDN"  # the learning algorithms_marl
env_name: "NewEnv_MAS"
env_id: "scenarios_1"
env_seed: 1  # The random seed of the environment.
max_episode_steps: 200
render: False
sleep: 0.01
continuous_action: False  # Continuous action space or not.
learner: "VDN_Learner"
policy: "Mixing_Q_network"
representation: "Basic_MLP"
vectorize: "Dummy_NewEnv_MAS"
runner: "MARL"  # Runner

use_rnn: False  # Whether to use recurrent neural networks.
rnn:
representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate: 0.001
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 2500000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
training_frequency: 1
sync_frequency: 100

use_grad_clip: False
grad_clip_norm: 0.5
use_parameter_sharing: True
use_actions_mask: False

eval_interval: 100000
test_episode: 5
log_dir: "logs/vdn/"
model_dir: "models/vdn/"
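
The VDN presets above all pair `learner: "VDN_Learner"` with `policy: "Mixing_Q_network"`. As a rough illustration of the idea behind value decomposition networks, the sketch below sums per-agent Q-values into a joint value. It is written in plain Python under the assumption that the mixer is a simple sum; the helper names `vdn_total_q` and `greedy_joint_action` are hypothetical and are not XuanCe functions.

```python
def vdn_total_q(agent_q_values, actions):
    """Sum each agent's Q-value for its chosen action (VDN-style decomposition)."""
    return sum(q[a] for q, a in zip(agent_q_values, actions))


def greedy_joint_action(agent_q_values):
    """Pick each agent's argmax action.

    Because the joint value is a plain sum, maximizing every agent's own
    Q-value also maximizes the total.
    """
    return [max(range(len(q)), key=q.__getitem__) for q in agent_q_values]


if __name__ == "__main__":
    # Three agents with two discrete actions each (made-up numbers).
    agent_q = [[0.1, 0.7], [0.4, 0.2], [0.0, 0.3]]
    joint = greedy_joint_action(agent_q)
    print(joint, vdn_total_q(agent_q, joint))
```

With `use_actions_mask: True`, as in the StarCraft2 presets, the argmax would additionally need to skip unavailable actions; that step is omitted here.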

agent: "QMIX"  # the learning algorithms_marl
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: False  # Continuous action space or not.
learner: "QMIX_Learner"
policy: "Mixing_Q_network"
representation: "Basic_MLP"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"  # Runner

use_rnn: False  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: False
use_actions_mask: False

hidden_dim_mixing_net: 128  # hidden units of mixing network
hidden_dim_hyper_net: 128  # hidden units of hyper network

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate: 0.001
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 2500000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
training_frequency: 25
sync_frequency: 100

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5
log_dir: "logs/qmix/"
model_dir: "models/qmix/"
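
The keys `hidden_dim_mixing_net` and `hidden_dim_hyper_net` in the QMIX preset above size the mixing network and the hypernetworks that generate its weights. The sketch below illustrates only the monotonic mixing step, in plain Python, assuming the usual QMIX construction in which the mixing weights are made non-negative with an absolute value. Random numbers stand in for the hypernetwork outputs, and all function names are hypothetical rather than XuanCe APIs.

```python
import math
import random


def elu(x):
    """Exponential linear unit, a monotonically increasing activation."""
    return x if x > 0 else math.exp(x) - 1.0


def qmix_mix(agent_qs, w1, b1, w2, b2):
    """Mix per-agent Q-values into a joint value with non-negative weights.

    w1: n_agents x hidden raw weights, b1: hidden biases,
    w2: hidden raw weights, b2: scalar bias.
    Taking abs() of the weights keeps the joint value non-decreasing
    in every agent's Q-value.
    """
    hidden = [
        elu(sum(abs(w1[i][h]) * q for i, q in enumerate(agent_qs)) + b1[h])
        for h in range(len(b1))
    ]
    return sum(abs(w2[h]) * hidden[h] for h in range(len(hidden))) + b2


def random_mixer_params(n_agents, hidden_dim, rng):
    """Stand-in for hypernetwork outputs (assumption: in QMIX these depend on the global state)."""
    w1 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(n_agents)]
    b1 = [rng.uniform(-1, 1) for _ in range(hidden_dim)]
    w2 = [rng.uniform(-1, 1) for _ in range(hidden_dim)]
    b2 = rng.uniform(-1, 1)
    return w1, b1, w2, b2


if __name__ == "__main__":
    rng = random.Random(0)
    params = random_mixer_params(n_agents=3, hidden_dim=128, rng=rng)
    low = qmix_mix([0.1, -0.4, 0.3], *params)
    high = qmix_mix([0.1, 0.6, 0.3], *params)
    print(high >= low)  # raising one agent's Q-value never lowers the joint value
```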

agent: "QMIX"  # the learning algorithms_marl
env_name: "RoboticWarehouse"
env_id: "rware-tiny-2ag-v1"
env_seed: 1  # The random seed of the environment.
max_episode_steps: 100
learner: "QMIX_Learner"
policy: "Mixing_Q_network"
representation: "Basic_MLP"
vectorize: "Dummy_RoboticWarehouse"
runner: "MARL"  # Runner

use_rnn: False  # Whether to use recurrent neural networks.
rnn:
representation_hidden_size: [64, ]
q_hidden_size: [128, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.

hidden_dim_mixing_net: 64  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate: 0.001
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 2500000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
training_frequency: 1
sync_frequency: 100

use_grad_clip: False
grad_clip_norm: 0.5
use_parameter_sharing: True
use_actions_mask: False

eval_interval: 100000
test_episode: 5
log_dir: "logs/qmix/"
model_dir: "models/qmix/"

agent: "QMIX"  # the learning algorithms_marl
env_name: "Football"
scenario: "academy_3_vs_1_with_keeper"
env_seed: 1  # The random seed of the environment.
use_stacked_frames: False  # Whether to use stacked_frames
num_agent: 3
num_adversary: 0
obs_type: "simple115v2"  # representation used to build the observation, choices: ["simple115v2", "extracted", "pixels_gray", "pixels"]
rewards_type: "scoring,checkpoints"  # comma separated list of rewards to be added
smm_width: 96  # width of super minimap
smm_height: 72  # height of super minimap
fps: 15  # Frames per second.
learner: "QMIX_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_Football"
runner: "RunnerFootball"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [128, ]
recurrent_hidden_size: 128
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [128, ]
q_hidden_size: [128, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.

hidden_dim_mixing_net: 128  # hidden units of mixing network
hidden_dim_hyper_net: 128  # hidden units of hyper network

seed: 1
parallels: 50
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 25000000  # 25M
training_frequency: 1
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5
use_parameter_sharing: True
use_actions_mask: False

eval_interval: 250000
test_episode: 50
log_dir: "logs/qmix/"
model_dir: "models/qmix/"
videos_dir: "./videos/qmix/"
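
The budget keys in these presets (`running_steps`, `parallels`, `eval_interval`) can be related with simple arithmetic. The sketch below assumes that `running_steps` counts environment steps summed over all parallel environments and that an evaluation is triggered every `eval_interval` of those steps; how the runner actually counts steps may differ. The helper name `training_budget` is hypothetical.

```python
def training_budget(running_steps, parallels, eval_interval):
    """Return (steps per parallel environment, number of evaluations).

    Assumptions: running_steps is the total across all parallel
    environments, and evaluation happens every eval_interval total steps.
    """
    steps_per_env = running_steps // parallels
    n_evaluations = running_steps // eval_interval
    return steps_per_env, n_evaluations


if __name__ == "__main__":
    # Values taken from the Football preset above.
    print(training_budget(25_000_000, 50, 250_000))
    # Values taken from the StarCraft2 "2m_vs_1z" QMIX preset.
    print(training_budget(1_000_000, 8, 10_000))
```

Under these assumptions, both presets schedule the same number of evaluations even though their training budgets differ by a factor of 25.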

agent: "QMIX"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "2m_vs_1z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QMIX_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 32  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/qmix/"
model_dir: "models/qmix/"

agent: "QMIX"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "3m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QMIX_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 32  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/qmix/"
model_dir: "models/qmix/"

agent: "QMIX"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QMIX_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 32  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 500000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/qmix/"
model_dir: "models/qmix/"

agent: "QMIX"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "1c3s5z"  # Environment ID
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QMIX_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 32  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 2000000  # 2M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 20000
test_episode: 16

log_dir: "logs/qmix/"
model_dir: "models/qmix/"

agent: "QMIX"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "2s3z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QMIX_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 32  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 2000000  # 2M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 20000
test_episode: 16

log_dir: "logs/qmix/"
model_dir: "models/qmix/"

agent: "QMIX"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "25m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QMIX_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 32  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 5000000  # 5M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 50000
test_episode: 16

log_dir: "logs/qmix/"
model_dir: "models/qmix/"

agent: "QMIX"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "5m_vs_6m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QMIX_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 32  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 5000000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/qmix/"
model_dir: "models/qmix/"

agent: "QMIX"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m_vs_9m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QMIX_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 32  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 5000000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/qmix/"
model_dir: "models/qmix/"

agent: "QMIX"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "MMM2"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QMIX_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 32  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/qmix/"
model_dir: "models/qmix/"

agent: "QMIX"  # the learning algorithms_marl
env_name: "StarCraft2"  # Name of the environment.
env_id: "corridor"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QMIX_Learner"
policy: "Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 32  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/qmix/"
model_dir: "models/qmix/"

agent: "OWQMIX"  # choice: CWQMIX, OWQMIX
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: False  # Continuous action space or not.
learner: "WQMIX_Learner"
policy: "Weighted_Mixing_Q_network"
representation: "Basic_MLP"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"  # Runner

use_rnn: False  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [128, ]  # for Basic_MLP representation
q_hidden_size: [128, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
alpha: 0.1

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
hidden_dim_ff_mix_net: 256  # hidden units of the feed-forward mixing network

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate: 0.001
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 5000000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
training_frequency: 25
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5
use_parameter_sharing: True
use_actions_mask: False


eval_interval: 100000
test_episode: 5
log_dir: "logs/wqmix/"
model_dir: "models/wqmix/"
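
The `alpha` key in the OWQMIX preset above is the down-weighting factor used by Weighted QMIX. The sketch below illustrates the optimistic weighting rule as described in the Weighted QMIX paper: a sample keeps full weight when the current joint value underestimates its target and is scaled by `alpha` otherwise. It is plain Python with made-up numbers, and the function names are hypothetical; it is not the code of `WQMIX_Learner`.

```python
def ow_weights(q_tot, targets, alpha=0.1):
    """Optimistic weighting: 1.0 where Q_tot is below the target, alpha elsewhere."""
    return [1.0 if q < y else alpha for q, y in zip(q_tot, targets)]


def weighted_td_loss(q_tot, targets, alpha=0.1):
    """Mean of the weighted squared TD errors over a batch."""
    weights = ow_weights(q_tot, targets, alpha)
    errors = [(q - y) ** 2 for q, y in zip(q_tot, targets)]
    return sum(w * e for w, e in zip(weights, errors)) / len(errors)


if __name__ == "__main__":
    q_tot = [1.0, 2.0]    # current joint value estimates (illustrative)
    targets = [2.0, 1.0]  # bootstrapped targets (illustrative)
    print(ow_weights(q_tot, targets))
    print(weighted_td_loss(q_tot, targets))
```

The centrally weighted variant (`CWQMIX`) uses a different condition for assigning full weight and is not shown here.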

agent: "OWQMIX"  # choice: CWQMIX, OWQMIX
env_name: "StarCraft2"  # Name of the environment.
env_id: "2m_vs_1z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "WQMIX_Learner"
policy: "Weighted_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
alpha: 0.1
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
hidden_dim_ff_mix_net: 256  # hidden units of the feed-forward mixing network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/wqmix/"
model_dir: "models/wqmix/"

agent: "OWQMIX"  # choice: CWQMIX, OWQMIX
env_name: "StarCraft2"  # Name of the environment.
env_id: "3m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "WQMIX_Learner"
policy: "Weighted_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
alpha: 0.1
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
hidden_dim_ff_mix_net: 256  # hidden units of the feed-forward mixing network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/wqmix/"
model_dir: "models/wqmix/"

agent: "OWQMIX"  # choice: CWQMIX, OWQMIX
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "WQMIX_Learner"
policy: "Weighted_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
alpha: 0.1
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
hidden_dim_ff_mix_net: 256  # hidden units of the feed-forward mixing network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/wqmix/"
model_dir: "models/wqmix/"

agent: "OWQMIX"  # choice: CWQMIX, OWQMIX
env_name: "StarCraft2"  # Name of the environment.
env_id: "1c3s5z"  # Environment ID
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "WQMIX_Learner"
policy: "Weighted_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
alpha: 0.1
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
hidden_dim_ff_mix_net: 256  # hidden units of the feed-forward mixing network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 2000000  # 2M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 20000
test_episode: 16

log_dir: "logs/wqmix/"
model_dir: "models/wqmix/"

agent: "OWQMIX"  # choice: CWQMIX, OWQMIX
env_name: "StarCraft2"  # Name of the environment.
env_id: "2s3z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "WQMIX_Learner"
policy: "Weighted_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
alpha: 0.1
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
hidden_dim_ff_mix_net: 256  # hidden units of the feed-forward mixing network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 2000000  # 2M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 20000
test_episode: 16

log_dir: "logs/wqmix/"
model_dir: "models/wqmix/"

agent: "OWQMIX"  # choice: CWQMIX, OWQMIX
env_name: "StarCraft2"  # Name of the environment.
env_id: "25m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "WQMIX_Learner"
policy: "Weighted_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
alpha: 0.1
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
hidden_dim_ff_mix_net: 256  # hidden units of the feed-forward mixing network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 5000000  # 5M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 50000
test_episode: 16

log_dir: "logs/wqmix/"
model_dir: "models/wqmix/"

agent: "OWQMIX"  # choice: CWQMIX, OWQMIX
env_name: "StarCraft2"  # Name of the environment.
env_id: "5m_vs_6m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "WQMIX_Learner"
policy: "Weighted_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
alpha: 0.1
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
hidden_dim_ff_mix_net: 256  # hidden units of the feed-forward mixing network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/wqmix/"
model_dir: "models/wqmix/"

agent: "OWQMIX"  # choice: CWQMIX, OWQMIX
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m_vs_9m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "WQMIX_Learner"
policy: "Weighted_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
alpha: 0.1
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
hidden_dim_ff_mix_net: 256  # hidden units of the feed-forward mixing network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/wqmix/"
model_dir: "models/wqmix/"

agent: "OWQMIX"  # choice: CWQMIX, OWQMIX
env_name: "StarCraft2"  # Name of the environment.
env_id: "3m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "WQMIX_Learner"
policy: "Weighted_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
alpha: 0.1
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
hidden_dim_ff_mix_net: 256  # hidden units of the feed-forward mixing network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/wqmix/"
model_dir: "models/wqmix/"

agent: "OWQMIX"  # choice: CWQMIX, OWQMIX
env_name: "StarCraft2"  # Name of the environment.
env_id: "corridor"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "WQMIX_Learner"
policy: "Weighted_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
alpha: 0.1
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
hidden_dim_ff_mix_net: 256  # hidden units of the feed-forward mixing network

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/wqmix/"
model_dir: "models/wqmix/"

agent: "DCG"  # Options: DCG, DCG_S
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"  # The environment ID.
env_seed: 1  # The random seed of the environment.
continuous_action: False  # Continuous action space or not.
learner: "DCG_Learner"  # Name of learner
policy: "DCG_Policy"  # Name of policy
representation: "Basic_MLP"  # The representation.
vectorize: "DummyVecMultiAgentEnv"  # Method to vectorize the environments.
runner: "MARL"  # Runner

use_rnn: False  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]  # The fully connected layer for Basic_RNN representation.
recurrent_hidden_size: 32  # The size of hidden layers of recurrent networks.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [32, ]  # A list of hidden units for each layer of Basic_MLP representation networks.
q_hidden_size: [128, ]  # the units for each hidden layer
hidden_utility_dim: 256  # hidden units of the utility function
hidden_payoff_dim: 256  # hidden units of the payoff function
bias_net: "Basic_MLP"  # The choice of bias network.
hidden_bias_dim: [256, ]  # hidden units of the bias network with global states as input
activation: "relu"  # The activation function of each hidden layer.

low_rank_payoff: False  # low-rank approximation of payoff function
payoff_rank: 5  # the rank K in the paper
graph_type: "FULL"  # specific type of the coordination graph
n_msg_iterations: 1  # number of iterations for message passing during belief propagation
msg_normalized: True  # Message normalization during greedy action selection (Kok and Vlassis, 2006)

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate: 0.001
gamma: 0.95  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 2500000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
training_frequency: 25
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5
use_parameter_sharing: True
use_actions_mask: False

eval_interval: 100000
test_episode: 5
log_dir: "logs/dcg/"
model_dir: "models/dcg/"
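
The DCG keys above (`graph_type`, `hidden_utility_dim`, `hidden_payoff_dim`) refer to a coordination graph whose nodes are agents and whose edges carry pairwise payoffs. The sketch below assumes that `graph_type: "FULL"` means every pair of agents is connected, and it combines utilities and payoffs as averages over nodes and edges, following the DCG paper. All function and argument names here are hypothetical and are not taken from the library.

```python
from itertools import combinations


def full_graph_edges(n_agents):
    """Edges of a fully connected, undirected coordination graph."""
    return list(combinations(range(n_agents), 2))


def dcg_joint_value(utilities, payoffs, actions):
    """Joint value = mean of per-agent utilities + mean of per-edge payoffs.

    utilities[i][a_i]      : utility of agent i for its action a_i
    payoffs[(i, j)][a_i][a_j] : payoff on edge (i, j) for the action pair
    """
    n_agents = len(utilities)
    utility_term = sum(utilities[i][actions[i]] for i in range(n_agents)) / n_agents
    payoff_term = sum(
        table[actions[i]][actions[j]] for (i, j), table in payoffs.items()
    ) / len(payoffs)
    return utility_term + payoff_term


if __name__ == "__main__":
    print(full_graph_edges(3))
    utilities = [[1.0, 0.0], [0.0, 2.0]]            # two agents, two actions each
    payoffs = {(0, 1): [[0.5, 0.0], [0.0, 1.0]]}    # one edge in a two-agent graph
    print(dcg_joint_value(utilities, payoffs, actions=(0, 1)))
```

A full graph over n agents has n(n-1)/2 edges, so the number of payoff evaluations grows quadratically with the team size. This is one reason the `low_rank_payoff` and `payoff_rank` options exist.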

agent: "DCG"  # Options: DCG, DCG_S
env_name: "StarCraft2"  # Name of the environment.
env_id: "2m_vs_1z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "DCG_Learner"  # Name of learner
policy: "DCG_Policy"  # Name of policy
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
hidden_utility_dim: 64  # hidden units of the utility function
hidden_payoff_dim: 64  # hidden units of the payoff function
bias_net: "Basic_MLP"
hidden_bias_dim: [64, ]  # hidden units of the bias network with global states as input
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

low_rank_payoff: False  # low-rank approximation of payoff function
payoff_rank: 5  # the rank K in the paper
graph_type: "FULL"  # specific type of the coordination graph
n_msg_iterations: 8  # number of iterations for message passing during belief propagation
msg_normalized: True  # Message normalization during greedy action selection (Kok and Vlassis, 2006)

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/dcg/"
model_dir: "models/dcg/"
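
The budget keys above (`running_steps`, `eval_interval`, `test_episode`) together determine how much evaluation a run performs. The sketch below reads them from a short excerpt and computes the totals, assuming one evaluation round every `eval_interval` steps. The parser `parse_flat_config` is a hypothetical stdlib-only helper for flat `key: value` lines. A real loader would use a YAML library, and this helper does not handle lists, nesting, or `#` inside quoted strings.

```python
def parse_flat_config(text):
    """Parse flat `key: value  # comment` lines into a dict of strings."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip().strip('"')
    return config


SAMPLE = """
running_steps: 1000000  # 1M
eval_interval: 10000
test_episode: 16
log_dir: "logs/dcg/"
"""

cfg = parse_flat_config(SAMPLE)
# Assumption: evaluation is triggered once per eval_interval training steps.
n_evaluations = int(cfg["running_steps"]) // int(cfg["eval_interval"])
total_test_episodes = n_evaluations * int(cfg["test_episode"])

if __name__ == "__main__":
    print(n_evaluations, total_test_episodes)
```

In the StarCraft2 presets shown here, `eval_interval` scales with `running_steps`, so each run yields the same number of evaluation points regardless of its length.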

agent: "DCG"  # Options: DCG, DCG_S
env_name: "StarCraft2"  # Name of the environment.
env_id: "3m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "DCG_Learner"  # Name of learner
policy: "DCG_Policy"  # Name of policy
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
hidden_utility_dim: 64  # hidden units of the utility function
hidden_payoff_dim: 64  # hidden units of the payoff function
bias_net: "Basic_MLP"
hidden_bias_dim: [64, ]  # hidden units of the bias network with global states as input
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

low_rank_payoff: False  # low-rank approximation of payoff function
payoff_rank: 5  # the rank K in the paper
graph_type: "FULL"  # specific type of the coordination graph
n_msg_iterations: 8  # number of iterations for message passing during belief propagation
msg_normalized: True  # Message normalization during greedy action selection (Kok and Vlassis, 2006)

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/dcg/"
model_dir: "models/dcg/"

agent: "DCG"  # Options: DCG, DCG_S
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "DCG_Learner"  # Name of learner
policy: "DCG_Policy"  # Name of policy
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
hidden_utility_dim: 64  # hidden units of the utility function
hidden_payoff_dim: 64  # hidden units of the payoff function
bias_net: "Basic_MLP"
hidden_bias_dim: [64, ]  # hidden units of the bias network with global states as input
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

low_rank_payoff: False  # low-rank approximation of payoff function
payoff_rank: 5  # the rank K in the paper
graph_type: "FULL"  # specific type of the coordination graph
n_msg_iterations: 8  # number of iterations for message passing during belief propagation
msg_normalized: True  # Message normalization during greedy action selection (Kok and Vlassis, 2006)

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/dcg/"
model_dir: "models/dcg/"

agent: "DCG"  # Options: DCG, DCG_S
env_name: "StarCraft2"  # Name of the environment.
env_id: "1c3s5z"  # Environment ID
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "DCG_Learner"  # Name of learner
policy: "DCG_Policy"  # Name of policy
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
hidden_utility_dim: 64  # hidden units of the utility function
hidden_payoff_dim: 64  # hidden units of the payoff function
bias_net: "Basic_MLP"
hidden_bias_dim: [64, ]  # hidden units of the bias network with global states as input
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

low_rank_payoff: False  # low-rank approximation of payoff function
payoff_rank: 5  # the rank K in the paper
graph_type: "FULL"  # specific type of the coordination graph
n_msg_iterations: 8  # number of iterations for message passing during belief propagation
msg_normalized: True  # Message normalization during greedy action selection (Kok and Vlassis, 2006)

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 2000000  # 2M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 20000
test_episode: 16

log_dir: "logs/dcg/"
model_dir: "models/dcg/"

agent: "DCG"  # Options: DCG, DCG_S
env_name: "StarCraft2"  # Name of the environment.
env_id: "2s3z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "DCG_Learner"  # Name of learner
policy: "DCG_Policy"  # Name of policy
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
hidden_utility_dim: 64  # hidden units of the utility function
hidden_payoff_dim: 64  # hidden units of the payoff function
bias_net: "Basic_MLP"
hidden_bias_dim: [64, ]  # hidden units of the bias network with global states as input
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

low_rank_payoff: False  # low-rank approximation of payoff function
payoff_rank: 5  # the rank K in the paper
graph_type: "FULL"  # specific type of the coordination graph
n_msg_iterations: 8  # number of iterations for message passing during belief propagation
msg_normalized: True  # Message normalization during greedy action selection (Kok and Vlassis, 2006)

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 2000000  # 2M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 20000
test_episode: 16

log_dir: "logs/dcg/"
model_dir: "models/dcg/"

agent: "DCG"  # Options: DCG, DCG_S
env_name: "StarCraft2"  # Name of the environment.
env_id: "25m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "DCG_Learner"  # Name of learner
policy: "DCG_Policy"  # Name of policy
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
hidden_utility_dim: 64  # hidden units of the utility function
hidden_payoff_dim: 64  # hidden units of the payoff function
bias_net: "Basic_MLP"
hidden_bias_dim: [64, ]  # hidden units of the bias network with global states as input
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

low_rank_payoff: False  # low-rank approximation of payoff function
payoff_rank: 5  # the rank K in the paper
graph_type: "FULL"  # specific type of the coordination graph
n_msg_iterations: 8  # number of iterations for message passing during belief propagation
msg_normalized: True  # Message normalization during greedy action selection (Kok and Vlassis, 2006)

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 5000000  # 5M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 50000
test_episode: 16

log_dir: "logs/dcg/"
model_dir: "models/dcg/"

agent: "DCG"  # Options: DCG, DCG_S
env_name: "StarCraft2"  # Name of the environment.
env_id: "5m_vs_6m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "DCG_Learner"  # Name of learner
policy: "DCG_Policy"  # Name of policy
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
hidden_utility_dim: 64  # hidden units of the utility function
hidden_payoff_dim: 64  # hidden units of the payoff function
bias_net: "Basic_MLP"
hidden_bias_dim: [64, ]  # hidden units of the bias network with global states as input
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

low_rank_payoff: False  # low-rank approximation of payoff function
payoff_rank: 5  # the rank K in the paper
graph_type: "FULL"  # specific type of the coordination graph
n_msg_iterations: 8  # number of iterations for message passing during belief propagation
msg_normalized: True  # Message normalization during greedy action selection (Kok and Vlassis, 2006)

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/dcg/"
model_dir: "models/dcg/"

agent: "DCG"  # Options: DCG, DCG_S
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m_vs_9m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "DCG_Learner"  # Name of learner
policy: "DCG_Policy"  # Name of policy
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
hidden_utility_dim: 64  # hidden units of the utility function
hidden_payoff_dim: 64  # hidden units of the payoff function
bias_net: "Basic_MLP"
hidden_bias_dim: [64, ]  # hidden units of the bias network with global states as input
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

low_rank_payoff: False  # low-rank approximation of payoff function
payoff_rank: 5  # the rank K in the paper
graph_type: "FULL"  # specific type of the coordination graph
n_msg_iterations: 8  # number of iterations for message passing during belief propagation
msg_normalized: True  # Message normalization during greedy action selection (Kok and Vlassis, 2006)

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/dcg/"
model_dir: "models/dcg/"

agent: "DCG"  # Options: DCG, DCG_S
env_name: "StarCraft2"  # Name of the environment.
env_id: "MMM2"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "DCG_Learner"  # Name of learner
policy: "DCG_Policy"  # Name of policy
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
hidden_utility_dim: 64  # hidden units of the utility function
hidden_payoff_dim: 64  # hidden units of the payoff function
bias_net: "Basic_MLP"
hidden_bias_dim: [64, ]  # hidden units of the bias network with global states as input
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

low_rank_payoff: False  # low-rank approximation of payoff function
payoff_rank: 5  # the rank K in the paper
graph_type: "FULL"  # specific type of the coordination graph
n_msg_iterations: 8  # number of iterations for message passing during belief propagation
msg_normalized: True  # Message normalization during greedy action selection (Kok and Vlassis, 2006)

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/dcg/"
model_dir: "models/dcg/"

agent: "DCG"  # Options: DCG, DCG_S
env_name: "StarCraft2"  # Name of the environment.
env_id: "corridor"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "DCG_Learner"  # Name of learner
policy: "DCG_Policy"  # Name of policy
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
hidden_utility_dim: 64  # hidden units of the utility function
hidden_payoff_dim: 64  # hidden units of the payoff function
bias_net: "Basic_MLP"
hidden_bias_dim: [64, ]  # hidden units of the bias network with global states as input
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

low_rank_payoff: False  # low-rank approximation of payoff function
payoff_rank: 5  # the rank K in the paper
graph_type: "FULL"  # specific type of the coordination graph
n_msg_iterations: 8  # number of iterations for message passing during belief propagation
msg_normalized: True  # Message normalization during greedy action selection (Kok and Vlassis, 2006)

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/dcg/"
model_dir: "models/dcg/"

agent: "QTRAN_base"  # Options: QTRAN_base, QTRAN_alt
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: False  # Continuous action space or not.
learner: "QTRAN_Learner"
policy: "Qtran_Mixing_Q_network"
representation: "Basic_MLP"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"  # Runner

use_rnn: False  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: False
use_actions_mask: False

hidden_dim_mixing_net: 64  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
qtran_net_hidden_dim: 64
lambda_opt: 1.0
lambda_nopt: 1.0

seed: 1
parallels: 16
buffer_size: 1000000
batch_size: 32
learning_rate: 0.0005
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.1
decay_step_greedy: 2500000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
training_frequency: 25
sync_frequency: 10000

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5
log_dir: "logs/qtran/"
model_dir: "models/qtran/"
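
The keys lambda_opt and lambda_nopt weight the two auxiliary terms of the QTRAN objective. In the QTRAN paper the total loss is written as L_td + lambda_opt * L_opt + lambda_nopt * L_nopt. The sketch below shows only this weighting step; the function name and the loss values are placeholders, and it is not meant to reproduce how QTRAN_Learner computes the individual terms.

```python
def qtran_total_loss(loss_td: float,
                     loss_opt: float,
                     loss_nopt: float,
                     lambda_opt: float = 1.0,
                     lambda_nopt: float = 1.0) -> float:
    """Combine the three QTRAN loss terms using the weights from the config."""
    return loss_td + lambda_opt * loss_opt + lambda_nopt * loss_nopt


if __name__ == "__main__":
    # Placeholder loss values, chosen only to show the effect of the weights.
    print(qtran_total_loss(0.5, 0.2, 0.1))
    print(qtran_total_loss(0.5, 0.2, 0.1, lambda_opt=2.0, lambda_nopt=0.0))
```

With both weights at 1.0, as in the preset, the three terms contribute equally to the gradient.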

agent: "QTRAN_base"  # Options: QTRAN_base, QTRAN_alt
env_name: "StarCraft2"  # Name of the environment.
env_id: "2m_vs_1z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QTRAN_Learner"
policy: "Qtran_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 64  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
qtran_net_hidden_dim: 64
lambda_opt: 1.0
lambda_nopt: 1.0

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/qtran/"
model_dir: "models/qtran/"

agent: "QTRAN_base"  # Options: QTRAN_base, QTRAN_alt
env_name: "StarCraft2"  # Name of the environment.
env_id: "3m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QTRAN_Learner"
policy: "Qtran_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 64  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
qtran_net_hidden_dim: 64
lambda_opt: 1.0
lambda_nopt: 1.0

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/qtran/"
model_dir: "models/qtran/"

agent: "QTRAN_base"  # Options: QTRAN_base, QTRAN_alt
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QTRAN_Learner"
policy: "Qtran_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 64  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
qtran_net_hidden_dim: 64
lambda_opt: 1.0
lambda_nopt: 1.0

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 500000
start_training: 1000  # start training after n steps
running_steps: 1000000  # 1M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 16

log_dir: "logs/qtran/"
model_dir: "models/qtran/"

agent: "QTRAN_base"  # Options: QTRAN_base, QTRAN_alt
env_name: "StarCraft2"  # Name of the environment.
env_id: "1c3s5z"  # Environment ID
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QTRAN_Learner"
policy: "Qtran_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 64  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
qtran_net_hidden_dim: 64
lambda_opt: 1.0
lambda_nopt: 1.0

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 2000000  # 2M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 20000
test_episode: 16

log_dir: "logs/qtran/"
model_dir: "models/qtran/"

agent: "QTRAN_base"  # Options: QTRAN_base, QTRAN_alt
env_name: "StarCraft2"  # Name of the environment.
env_id: "2s3z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QTRAN_Learner"
policy: "Qtran_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 64  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
qtran_net_hidden_dim: 64
lambda_opt: 1.0
lambda_nopt: 1.0

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 50000
start_training: 1000  # start training after n steps
running_steps: 2000000  # 2M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 20000
test_episode: 16

log_dir: "logs/qtran/"
model_dir: "models/qtran/"

agent: "QTRAN_base"  # Options: QTRAN_base, QTRAN_alt
env_name: "StarCraft2"  # Name of the environment.
env_id: "25m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QTRAN_Learner"
policy: "Qtran_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 64  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
qtran_net_hidden_dim: 64
lambda_opt: 1.0
lambda_nopt: 1.0

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 5000000  # 5M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 50000
test_episode: 16

log_dir: "logs/qtran/"
model_dir: "models/qtran/"

agent: "QTRAN_base"  # Options: QTRAN_base, QTRAN_alt
env_name: "StarCraft2"  # Name of the environment.
env_id: "5m_vs_6m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QTRAN_Learner"
policy: "Qtran_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 64  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
qtran_net_hidden_dim: 64
lambda_opt: 1.0
lambda_nopt: 1.0

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 5000000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/qtran/"
model_dir: "models/qtran/"

agent: "QTRAN_base"  # Options: QTRAN_base, QTRAN_alt
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m_vs_9m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QTRAN_Learner"
policy: "Qtran_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 64  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
qtran_net_hidden_dim: 64
lambda_opt: 1.0
lambda_nopt: 1.0

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 5000000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/qtran/"
model_dir: "models/qtran/"

agent: "QTRAN_base"  # Options: QTRAN_base, QTRAN_alt
env_name: "StarCraft2"  # Name of the environment.
env_id: "MMM2"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QTRAN_Learner"
policy: "Qtran_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 64  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
qtran_net_hidden_dim: 64
lambda_opt: 1.0
lambda_nopt: 1.0

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/qtran/"
model_dir: "models/qtran/"

agent: "QTRAN_base"  # Options: QTRAN_base, QTRAN_alt
env_name: "StarCraft2"  # Name of the environment.
env_id: "corridor"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "QTRAN_Learner"
policy: "Qtran_Mixing_Q_network"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
q_hidden_size: [64, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True
use_actions_mask: True

hidden_dim_mixing_net: 64  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network
qtran_net_hidden_dim: 64
lambda_opt: 1.0
lambda_nopt: 1.0

seed: 1
parallels: 8
buffer_size: 5000
batch_size: 32
learning_rate: 0.0007
gamma: 0.99  # discount factor
double_q: True  # use double q learning

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 1000000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
n_epochs: 8  # The number of training epochs after interaction.
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 16

log_dir: "logs/qtran/"
model_dir: "models/qtran/"

agent: "IAC"
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: False  # Continuous action space or not.
learner: "IAC_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_MLP"
vectorize: "SubprocVecMultiAgentEnv"
runner: "MARL"  # Runner

# recurrent settings for Basic_RNN representation
use_rnn: False  # Whether to use recurrent neural networks.
rnn: "GRU"  # The type of recurrent layer.
fc_hidden_sizes: [64, 64, 64]  # The hidden size of feed forward layer in RNN representation.
recurrent_hidden_size: 64  # The hidden size of the recurrent layer.
N_recurrent_layers: 1  # The number of recurrent layers.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

representation_hidden_size: [64, ]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [64, ]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
activation_action: "sigmoid"  # The activation function for the last layer of the actor.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: False  # Whether to use an actions mask for unavailable actions.

seed: 1  # Random seed.
parallels: 16  # The number of environments to run in parallel.
buffer_size: 32  # Number of the transitions (use_rnn is False), or the episodes (use_rnn is True) in replay buffer.
n_epochs: 1  # Number of epochs to train.
n_minibatch: 1  # Number of minibatches to sample and train on; batch_size = buffer_size // n_minibatch.
learning_rate: 0.0005  # Learning rate.
weight_decay: 0  # Weight decay (L2 penalty) coefficient of the optimizer.

vf_coef: 0.1  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.99  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_global_state: True  # Whether to use global state to replace merged observations.
use_value_clip: False  # Limit the value range.
value_clip_range: 0.2  # The value clip range.
use_value_norm: False  # Use running mean and std to normalize rewards.
use_huber_loss: False  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0  # The threshold at which to change between delta-scaled L1 and L2 loss. (For huber loss).
use_advnorm: False  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.8  # The GAE lambda.
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 10000000  # The total running steps.
eval_interval: 100000  # The number of training steps between two consecutive evaluations.
test_episode: 5  # The number of episodes to run in each evaluation.

log_dir: "logs/iac/"
model_dir: "models/iac/"
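
The keys use_gae, gamma, and gae_lambda in the configuration above control generalized advantage estimation (GAE). The following is a minimal sketch of the standard GAE recursion in plain Python, using the preset values gamma = 0.99 and gae_lambda = 0.8 as defaults. The function name compute_gae and its argument layout are assumptions made for illustration; the library's buffer may organize the computation differently.

```python
from typing import List, Sequence, Tuple


def compute_gae(rewards: Sequence[float],
                values: Sequence[float],
                last_value: float,
                dones: Sequence[bool],
                gamma: float = 0.99,
                gae_lambda: float = 0.8) -> Tuple[List[float], List[float]]:
    """Return (advantages, returns) for one trajectory using the GAE recursion."""
    advantages = [0.0] * len(rewards)
    next_value = last_value  # value estimate of the state after the last step
    running = 0.0            # discounted sum of future TD errors
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * gae_lambda * not_done * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    adv, ret = compute_gae([1.0, 1.0], [0.0, 0.0], 0.0, [False, True])
    print(adv, ret)
```

A smaller gae_lambda, such as the 0.8 used here, leans more on the critic's estimates and reduces variance at the cost of more bias; the StarCraft2 presets use 0.95.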

agent: "IAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "2m_vs_1z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IAC_Learner"
policy: "Categorical_MAAC_Policy_Share"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to use an actions mask for unavailable actions.

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use global state to replace joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.2
use_value_norm: True  # use running mean and std to normalize rewards.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 1000000
eval_interval: 10000
test_episode: 16

log_dir: "logs/iac/"
model_dir: "models/iac/"
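
The configuration above enables use_huber_loss with huber_delta = 10.0, the threshold at which the loss switches from quadratic to linear. The sketch below writes out the standard Huber loss for a single error term in plain Python. The function name huber_loss is hypothetical, and the library presumably relies on its deep learning backend rather than code like this.

```python
def huber_loss(error: float, delta: float = 10.0) -> float:
    """Standard Huber loss: quadratic for |error| <= delta, linear beyond it."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2
    # Linear branch, shifted so both branches agree at |error| == delta.
    return delta * (abs_error - 0.5 * delta)


if __name__ == "__main__":
    # Small errors behave like MSE; large errors grow only linearly.
    for err in (2.0, 10.0, 20.0):
        print(err, huber_loss(err))
```

With a threshold as large as 10.0, most value errors fall in the quadratic region, so the loss acts like MSE except for rare, very large errors.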

agent: "IAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "3m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IAC_Learner"
policy: "Categorical_MAAC_Policy_Share"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to use an actions mask for unavailable actions.

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.0
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use global state to replace joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.2
use_value_norm: False  # use running mean and std to normalize rewards.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: False  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 1000000
eval_interval: 10000
test_episode: 16

log_dir: "logs/iac/"
model_dir: "models/iac/"
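
The keys use_grad_clip and grad_clip_norm in the configuration above bound the size of each gradient update. As an illustration, the sketch below clips a flat list of gradient values by their global L2 norm, with the preset grad_clip_norm = 10.0 as the default. The function name clip_by_global_norm is hypothetical, and a real training loop would use the clipping utility of its deep learning backend instead.

```python
import math
from typing import List, Sequence, Tuple


def clip_by_global_norm(grads: Sequence[float],
                        max_norm: float = 10.0) -> Tuple[List[float], float]:
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


if __name__ == "__main__":
    clipped, norm = clip_by_global_norm([30.0, 40.0])
    print(norm, clipped)
```

Clipping by norm keeps the direction of the gradient and only shortens it, which is the behaviour that clip_type: 1 selects for the MindSpore backend according to the comment in the configuration.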

agent: "IAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IAC_Learner"
policy: "Categorical_MAAC_Policy_Share"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"
on_policy: True

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to use an actions mask for unavailable actions.

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use global state to replace joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.2
use_value_norm: True  # use running mean and std to normalize rewards.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 1000000  # 1M
eval_interval: 5000
test_episode: 16

log_dir: "logs/iac/"
model_dir: "models/iac/"
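
The configuration above enables use_value_clip with value_clip_range = 0.2. One common way to apply such a range is the PPO-style clipped value loss, in which the new value prediction is kept within value_clip_range of the old one and the larger of the clipped and unclipped squared errors is used. Whether the IAC learner applies exactly this form is an assumption; the sketch below only illustrates the idea for a single scalar, and the function name is hypothetical.

```python
def clipped_value_loss(value_new: float,
                       value_old: float,
                       target: float,
                       value_clip_range: float = 0.2) -> float:
    """PPO-style clipped value loss for one sample (assumed form)."""
    # Limit how far the new prediction may move away from the old one.
    diff = max(-value_clip_range, min(value_clip_range, value_new - value_old))
    value_clipped = value_old + diff
    loss_unclipped = (value_new - target) ** 2
    loss_clipped = (value_clipped - target) ** 2
    # Taking the maximum gives a pessimistic estimate of the value error.
    return max(loss_unclipped, loss_clipped)


if __name__ == "__main__":
    print(clipped_value_loss(value_new=2.0, value_old=1.0, target=2.0))
```

In this form, a prediction that jumps far from the previous estimate is still penalized even when it happens to match the target, which limits the size of each critic update.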

agent: "IAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "1c3s5z"  # Environment ID
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IAC_Learner"
policy: "Categorical_MAAC_Policy_Share"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to use an actions mask for unavailable actions.

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_value_clip: True  # Whether to limit the value range (value clipping).
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 2000000
eval_interval: 20000
test_episode: 16

log_dir: "logs/iac/"
model_dir: "models/iac/"
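
The `use_value_clip`, `value_clip_range`, `use_huber_loss`, and `huber_delta` entries above shape the critic loss. A minimal sketch follows, assuming the common PPO-style interpretation in which the new value prediction may move at most `value_clip_range` away from the old one and the larger of the clipped and unclipped losses is used. The helper names are hypothetical, and the library's learner may differ in detail.

```python
def huber_loss(error: float, delta: float = 10.0) -> float:
    """Quadratic near zero, linear beyond `delta` (cf. `huber_delta`)."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error * error
    return delta * (abs_err - 0.5 * delta)


def clipped_value_loss(
    value_new: float,
    value_old: float,
    target: float,
    clip_range: float = 0.2,  # cf. `value_clip_range`
    delta: float = 10.0,      # cf. `huber_delta`
) -> float:
    """Pessimistic (max) of the unclipped and clipped Huber losses."""
    diff = value_new - value_old
    clipped_diff = max(-clip_range, min(clip_range, diff))
    value_clipped = value_old + clipped_diff
    loss_unclipped = huber_loss(target - value_new, delta)
    loss_clipped = huber_loss(target - value_clipped, delta)
    return max(loss_unclipped, loss_clipped)
```

Under this reading, a smaller `value_clip_range` (for example 0.05 instead of 0.2) makes each critic update more conservative.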

agent: "IAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "2s3z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IAC_Learner"
policy: "Categorical_MAAC_Policy_Share"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_value_clip: True  # Whether to limit the value range (value clipping).
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 2000000
eval_interval: 20000
test_episode: 16

log_dir: "logs/iac/"
model_dir: "models/iac/"
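
The `use_advnorm` flag above enables advantage normalization. As a rough illustration, the sketch below standardizes a batch of advantages to zero mean and unit scale. It assumes the population standard deviation and a small `eps` for numerical stability; both choices, and the function name, are assumptions rather than the library's confirmed behavior.

```python
import math
from typing import List, Sequence


def normalize_advantages(advantages: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Shift to zero mean and scale by the (population) standard deviation."""
    n = len(advantages)
    mean = sum(advantages) / n
    variance = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(variance)
    # `eps` avoids division by zero when all advantages are equal.
    return [(a - mean) / (std + eps) for a in advantages]
```

Normalizing in this way keeps the scale of the policy gradient roughly constant across batches, which is why it is usually paired with a fixed learning rate.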

agent: "IAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "25m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IAC_Learner"
policy: "Categorical_MAAC_Policy_Share"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"
on_policy: True

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_value_clip: True  # Whether to limit the value range (value clipping).
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 5000000  # 5M
eval_interval: 25000
test_episode: 16

log_dir: "logs/iac/"
model_dir: "models/iac/"

agent: "IAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "5m_vs_6m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IAC_Learner"
policy: "Categorical_MAAC_Policy_Share"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"
on_policy: True

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_value_clip: True  # Whether to limit the value range (value clipping).
value_clip_range: 0.05
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 10000000  # 10M
eval_interval: 50000
test_episode: 16

log_dir: "logs/iac/"
model_dir: "models/iac/"

agent: "IAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m_vs_9m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IAC_Learner"
policy: "Categorical_MAAC_Policy_Share"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"
on_policy: True

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0
use_value_clip: True  # Whether to limit the value range (value clipping).
value_clip_range: 0.05
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 10000000  # 10M
eval_interval: 50000
test_episode: 16

log_dir: "logs/iac/"
model_dir: "models/iac/"

agent: "IAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "MMM2"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IAC_Learner"
policy: "Categorical_MAAC_Policy_Share"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"
on_policy: True

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 1.0

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 2
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_value_clip: True  # Whether to limit the value range (value clipping).
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 10000000  # 10M
eval_interval: 50000
test_episode: 16

log_dir: "logs/iac/"
model_dir: "models/iac/"

agent: "IAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "corridor"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IAC_Learner"
policy: "Categorical_MAAC_Policy_Share"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"
on_policy: True

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_value_clip: True  # Whether to limit the value range (value clipping).
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 10000000  # 10M
eval_interval: 50000
test_episode: 16

log_dir: "logs/iac/"
model_dir: "models/iac/"

agent: "COMA"  # The learning algorithms.
env_name: "mpe"  # Environment name.
env_id: "simple_spread_v3"  # Environment map.
env_seed: 1  # The random seed of the environment.
continuous_action: False  # Continuous action space or not.
learner: "COMA_Learner"  # Name of learner.
policy: "Categorical_COMA_Policy"  # Name of policy.
representation: "Basic_MLP"  # Name of representation.
representation_critic: "Basic_MLP"  # Name of representation for critic.
vectorize: "SubprocVecMultiAgentEnv"  # Method to vectorize the environment.
runner: "MARL"  # Runner.

# recurrent settings for Basic_RNN representation
use_rnn: False  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
fc_hidden_sizes: [64, ]  # The fully connected layer for Basic_RNN representation.
recurrent_hidden_size: 64  # The size of hidden layers of recurrent networks.
N_recurrent_layers: 1  # Number of recurrent layers.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

representation_hidden_size: [64, ]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [128, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128, ]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: False  # Whether to use actions mask for unavailable actions.

seed: 1  # Random seeds.
parallels: 16  # Number of environments to run in parallel.
buffer_size: 3200  # Total buffer size.
n_epochs: 10  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate_actor: 0.0007  # Learning rate of actor.
learning_rate_critic: 0.0007  # Learning rate of critic.
weight_decay: 0  # The weight decay coefficient for the optimizer.

start_greedy: 0.5  # The start value of greedy epsilon.
end_greedy: 0.01  # The end value of greedy epsilon.
decay_step_greedy: 2500000  # The steps to decay the greedy epsilon.
sync_frequency: 200  # The frequency to synchronize target networks.

vf_coef: 0.1  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.99  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.8  # The GAE lambda.
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 10000000  # The total running steps.
eval_interval: 100000  # The number of training steps between two evaluations.
test_episode: 5  # The episodes to test in each test period.

log_dir: "logs/coma/"  # The directory to store logger file.
model_dir: "models/coma/"  # The directory to store model file.
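
The `start_greedy`, `end_greedy`, and `decay_step_greedy` entries above describe an exploration schedule for the greedy epsilon. The sketch below assumes a simple linear interpolation over `decay_step_greedy` steps; the function name `epsilon_at` is hypothetical, and the library may count steps differently (for example per parallel environment).

```python
def epsilon_at(
    step: int,
    start_greedy: float = 0.5,           # cf. `start_greedy`
    end_greedy: float = 0.01,            # cf. `end_greedy`
    decay_step_greedy: int = 2_500_000,  # cf. `decay_step_greedy`
) -> float:
    """Linearly anneal epsilon from start to end, then hold it constant."""
    fraction = min(max(step, 0) / decay_step_greedy, 1.0)
    return start_greedy + fraction * (end_greedy - start_greedy)
```

Under this assumption, epsilon is 0.5 at step 0, about 0.255 halfway through the decay window, and stays at 0.01 from step 2,500,000 onward.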

agent: "COMA"  # The learning algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "2m_vs_1z"  # Environment map.
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "COMA_Learner"  # Name of learner.
policy: "Categorical_COMA_Policy"  # Name of policy.
representation: "Basic_RNN"  # Name of representation.
vectorize: "Subproc_StarCraft2"  # Method to vectorize the environment.
runner: "RunnerStarCraft2"  # Runner.

use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]  # The fully connected layer for Basic_RNN representation.
recurrent_hidden_size: 64  # The size of hidden layers of recurrent networks.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128, 128]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

seed: 1  # Random seed.
parallels: 8  # Number of environments to run in parallel.
buffer_size: 8  # Total buffer size.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate_actor: 0.0007  # Learning rate of actor.
learning_rate_critic: 0.0007  # Learning rate of critic.
weight_decay: 0  # The weight decay coefficient for the optimizer.

start_greedy: 0.5  # The start value of greedy epsilon.
end_greedy: 0.01  # The end value of greedy epsilon.
decay_step_greedy: 2500000  # The steps to decay the greedy epsilon.
sync_frequency: 200  # The frequency to synchronize target networks.

vf_coef: 0.1  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.95  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_advnorm: False  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.8  # The GAE lambda.
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 1000000  # The total running steps.
eval_interval: 10000  # The number of training steps between two evaluations.
test_episode: 16  # The episodes to test in each test period.

log_dir: "logs/coma/"  # The directory to store logger file.
model_dir: "models/coma/"  # The directory to store model file.
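
The presets above leave `use_linear_lr_decay` disabled, so the learning rate stays fixed. For readers who enable it, the sketch below shows what a linear schedule ending at `end_factor_lr_decay` times the base rate would compute. It assumes the decay spans `running_steps` and starts from `learning_rate_actor`; the function name `linear_lr` and those two assumptions are illustrative only.

```python
def linear_lr(
    step: int,
    total_steps: int,
    base_lr: float = 0.0007,   # cf. `learning_rate_actor`
    end_factor: float = 0.5,   # cf. `end_factor_lr_decay`
) -> float:
    """Scale the base rate linearly from 1.0 down to `end_factor`."""
    progress = min(max(step, 0), total_steps) / total_steps
    factor = 1.0 + (end_factor - 1.0) * progress
    return base_lr * factor
```

With these defaults the rate would fall from 7e-4 at the start to 3.5e-4 at the final step and remain there afterwards.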

agent: "COMA"  # The learning algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "3m"  # Environment map.
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "COMA_Learner"  # Name of learner.
policy: "Categorical_COMA_Policy"  # Name of policy.
representation: "Basic_RNN"  # Name of representation.
vectorize: "Subproc_StarCraft2"  # Method to vectorize the environment.
runner: "RunnerStarCraft2"  # Runner.

use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]  # The fully connected layer for Basic_RNN representation.
recurrent_hidden_size: 64  # The size of hidden layers of recurrent networks.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128, 128]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

seed: 1  # Random seed.
parallels: 8  # Number of environments to run in parallel.
buffer_size: 8  # Total buffer size.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate_actor: 0.0007  # Learning rate of actor.
learning_rate_critic: 0.0007  # Learning rate of critic.
weight_decay: 0  # The weight decay coefficient for the optimizer.

start_greedy: 0.5  # The start value of greedy epsilon.
end_greedy: 0.01  # The end value of greedy epsilon.
decay_step_greedy: 100000  # The steps to decay the greedy epsilon.
sync_frequency: 200  # The frequency to synchronize target networks.

vf_coef: 0.1  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.95  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_advnorm: False  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.8  # The GAE lambda.
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 1000000  # The total running steps.
eval_interval: 10000  # The number of training steps between two evaluations.
test_episode: 16  # The episodes to test in each test period.

log_dir: "logs/coma/"  # The directory to store logger file.
model_dir: "models/coma/"  # The directory to store model file.
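
The `use_grad_clip` and `grad_clip_norm` entries above bound the size of each gradient update. The following is a minimal sketch of clipping by global L2 norm on a flat list of gradient values. It is a plain-Python illustration of the general technique, with a hypothetical function name, and not the framework call that the library actually uses.

```python
import math
from typing import List, Sequence, Tuple


def clip_by_global_norm(
    grads: Sequence[float],
    max_norm: float = 10.0,  # cf. `grad_clip_norm`
) -> Tuple[List[float], float]:
    """Rescale `grads` so their L2 norm is at most `max_norm`.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the bound: leave the gradients untouched.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm
```

Rescaling preserves the direction of the update and only shortens its length, which is the main difference from clipping each element by value.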

agent: "COMA"  # The learning algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m"  # Environment map.
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "COMA_Learner"  # Name of learner.
policy: "Categorical_COMA_Policy"  # Name of policy.
representation: "Basic_RNN"  # Name of representation.
vectorize: "Subproc_StarCraft2"  # Method to vectorize the environment.
runner: "RunnerStarCraft2"  # Runner.

use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]  # The fully connected layer for Basic_RNN representation.
recurrent_hidden_size: 64  # The size of hidden layers of recurrent networks.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128, 128]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

seed: 1  # Random seed.
parallels: 8  # Number of environments to run in parallel.
buffer_size: 8  # Total buffer size.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate_actor: 0.0007  # Learning rate of actor.
learning_rate_critic: 0.0007  # Learning rate of critic.
weight_decay: 0  # The weight decay coefficient for the optimizer.

start_greedy: 0.5  # The start value of greedy epsilon.
end_greedy: 0.01  # The end value of greedy epsilon.
decay_step_greedy: 2500000  # The steps to decay the greedy epsilon.
sync_frequency: 200  # The frequency to synchronize target networks.

vf_coef: 0.1  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.95  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_advnorm: False  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.8  # The GAE lambda.
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 1000000  # The total running steps.
eval_interval: 10000  # The number of training steps between two evaluations.
test_episode: 16  # The episodes to test in each test period.

log_dir: "logs/coma/"  # The directory to store logger file.
model_dir: "models/coma/"  # The directory to store model file.

agent: "COMA"  # The learning algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "1c3s5z"  # Environment map.
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "COMA_Learner"  # Name of learner.
policy: "Categorical_COMA_Policy"  # Name of policy.
representation: "Basic_RNN"  # Name of representation.
vectorize: "Subproc_StarCraft2"  # Method to vectorize the environment.
runner: "RunnerStarCraft2"  # Runner.

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]  # The fully connected layer for Basic_RNN representation.
recurrent_hidden_size: 64  # The size of hidden layers of recurrent networks.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128, 128]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

seed: 1  # Random seed.
parallels: 8  # Number of environments to run in parallel.
buffer_size: 8  # Total buffer size.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate_actor: 0.0007  # Learning rate of actor.
learning_rate_critic: 0.0007  # Learning rate of critic.
weight_decay: 0  # The weight decay coefficient for the optimizer.

start_greedy: 0.5  # The start value of greedy epsilon.
end_greedy: 0.01  # The end value of greedy epsilon.
decay_step_greedy: 2500000  # The steps to decay the greedy epsilon.
sync_frequency: 200  # The frequency to synchronize target networks.

vf_coef: 0.1  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.95  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_advnorm: False  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.8  # The GAE lambda.
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 2000000  # The total running steps.
eval_interval: 20000  # The number of training steps between two evaluations.
test_episode: 16  # The episodes to test in each test period.

log_dir: "logs/coma/"  # The directory to store logger file.
model_dir: "models/coma/"  # The directory to store model file.

agent: "COMA"  # The learning algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "2s3z"  # Environment map.
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "COMA_Learner"  # Name of learner.
policy: "Categorical_COMA_Policy"  # Name of policy.
representation: "Basic_RNN"  # Name of representation.
vectorize: "Subproc_StarCraft2"  # Method to vectorize the environment.
runner: "RunnerStarCraft2"  # Runner.

use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]  # The fully connected layer for Basic_RNN representation.
recurrent_hidden_size: 64  # The size of hidden layers of recurrent networks.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128, 128]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

seed: 1  # Random seed.
parallels: 8  # Number of environments to run in parallel.
buffer_size: 8  # Total buffer size.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate_actor: 0.0007  # Learning rate of actor.
learning_rate_critic: 0.0007  # Learning rate of critic.
weight_decay: 0  # The weight decay coefficient for the optimizer.

start_greedy: 0.5  # The start value of greedy epsilon.
end_greedy: 0.01  # The end value of greedy epsilon.
decay_step_greedy: 2500000  # The steps to decay the greedy epsilon.
sync_frequency: 200  # The frequency to synchronize target networks.

vf_coef: 0.1  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.95  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_advnorm: False  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.8  # The GAE lambda.
use_grad_clip: True  # Whether to use gradient clipping.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 2000000  # The total running steps.
eval_interval: 20000  # The number of training steps between two evaluations.
test_episode: 16  # The episodes to test in each test period.

log_dir: "logs/coma/"  # The directory to store logger file.
model_dir: "models/coma/"  # The directory to store model file.

agent: "COMA"  # The learning algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "25m"  # Environment map.
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "COMA_Learner"  # Name of learner.
policy: "Categorical_COMA_Policy"  # Name of policy.
representation: "Basic_RNN"  # Name of representation.
vectorize: "Subproc_StarCraft2"  # Method to vectorize the environment.
runner: "RunnerStarCraft2"  # Runner.

use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]  # The fully connected layer for Basic_RNN representation.
recurrent_hidden_size: 64  # The size of hidden layers of recurrent networks.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128, 128]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

seed: 1  # Random seed.
parallels: 8  # Number of environments to run in parallel.
buffer_size: 8  # Total buffer size.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate_actor: 0.0007  # Learning rate of actor.
learning_rate_critic: 0.0007  # Learning rate of critic.
weight_decay: 0  # The weight decay coefficient.

start_greedy: 0.5  # The start value of greedy epsilon.
end_greedy: 0.01  # The end value of greedy epsilon.
decay_step_greedy: 1000000
sync_frequency: 200  # The frequency to synchronize target networks.

vf_coef: 0.1  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.95  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_advnorm: False  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.8  # The GAE lambda.
use_grad_clip: True  # Whether to clip gradients.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 5000000  # The total running steps.
eval_interval: 50000  # The number of steps between two consecutive evaluations.
test_episode: 16  # The number of episodes to run in each evaluation.

log_dir: "logs/coma/"  # The directory to store logger file.
model_dir: "models/coma/"  # The directory to store model file.
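
The `use_gae`, `gae_lambda`, and `gamma` entries above control generalized advantage estimation. The sketch below shows the standard GAE recursion in plain Python with the values from this configuration as defaults; the function name `compute_gae` and its argument layout are illustrative assumptions, not the library's API.

```python
def compute_gae(rewards, values, last_value, dones,
                gamma=0.95, gae_lambda=0.8):
    """Compute GAE advantages and returns for one trajectory.

    rewards, values, dones are lists of equal length; last_value is the
    value estimate of the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk the trajectory backwards, accumulating discounted TD errors.
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    adv, ret = compute_gae([1.0, 1.0], [0.0, 0.0], 0.0, [False, True])
    print(adv, ret)
```

A smaller `gae_lambda` (0.8 here) weights short-horizon TD errors more heavily, which lowers variance at the cost of more bias.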

agent: "COMA"  # Name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "5m_vs_6m"  # Environment map.
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "COMA_Learner"  # Name of learner.
policy: "Categorical_COMA_Policy"  # Name of policy.
representation: "Basic_RNN"  # Name of representation.
vectorize: "Subproc_StarCraft2"  # Method to vectorize the environment.
runner: "RunnerStarCraft2"  # Runner.

use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]  # The fully connected layer for Basic_RNN representation.
recurrent_hidden_size: 64  # The size of hidden layers of recurrent networks.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128, 128]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

seed: 1  # Random seed.
parallels: 8  # Number of environments to run in parallel.
buffer_size: 8  # Total buffer size.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate_actor: 0.0007  # Learning rate of actor.
learning_rate_critic: 0.0007  # Learning rate of critic.
weight_decay: 0  # The weight decay coefficient.

start_greedy: 0.5  # The start value of greedy epsilon.
end_greedy: 0.01  # The end value of greedy epsilon.
decay_step_greedy: 2000000  # The steps to decay the greedy epsilon.
sync_frequency: 200  # The frequency to synchronize target networks.

vf_coef: 0.1  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.95  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_advnorm: False  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.8  # The GAE lambda.
use_grad_clip: True  # Whether to clip gradients.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 10000000  # The total running steps.
eval_interval: 100000  # The number of steps between two consecutive evaluations.
test_episode: 16  # The number of episodes to run in each evaluation.

log_dir: "logs/coma/"  # The directory to store logger file.
model_dir: "models/coma/"  # The directory to store model file.

agent: "COMA"  # Name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m_vs_9m"  # Environment map.
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "COMA_Learner"  # Name of learner.
policy: "Categorical_COMA_Policy"  # Name of policy.
representation: "Basic_RNN"  # Name of representation.
vectorize: "Subproc_StarCraft2"  # Method to vectorize the environment.
runner: "RunnerStarCraft2"  # Runner.

use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]  # The fully connected layer for Basic_RNN representation.
recurrent_hidden_size: 64  # The size of hidden layers of recurrent networks.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128, 128]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

seed: 1  # Random seed.
parallels: 8  # Number of environments to run in parallel.
buffer_size: 8  # Total buffer size.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate_actor: 0.0007  # Learning rate of actor.
learning_rate_critic: 0.0007  # Learning rate of critic.
weight_decay: 0  # The weight decay coefficient.

start_greedy: 0.5  # The start value of greedy epsilon.
end_greedy: 0.01  # The end value of greedy epsilon.
decay_step_greedy: 2500000  # The steps to decay the greedy epsilon.
sync_frequency: 200  # The frequency to synchronize target networks.

vf_coef: 0.1  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.95  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_advnorm: False  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.8  # The GAE lambda.
use_grad_clip: True  # Whether to clip gradients.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 10000000  # The total running steps.
eval_interval: 100000  # The number of steps between two consecutive evaluations.
test_episode: 16  # The number of episodes to run in each evaluation.

log_dir: "logs/coma/"  # The directory to store logger file.
model_dir: "models/coma/"  # The directory to store model file.

agent: "COMA"  # Name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "MMM2"  # Environment map.
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "COMA_Learner"  # Name of learner.
policy: "Categorical_COMA_Policy"  # Name of policy.
representation: "Basic_RNN"  # Name of representation.
vectorize: "Subproc_StarCraft2"  # Method to vectorize the environment.
runner: "RunnerStarCraft2"  # Runner.

use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]  # The fully connected layer for Basic_RNN representation.
recurrent_hidden_size: 64  # The size of hidden layers of recurrent networks.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128, 128]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

seed: 1  # Random seed.
parallels: 8  # Number of environments to run in parallel.
buffer_size: 8  # Total buffer size.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate_actor: 0.0007  # Learning rate of actor.
learning_rate_critic: 0.0007  # Learning rate of critic.
weight_decay: 0  # The weight decay coefficient.

start_greedy: 0.5  # The start value of greedy epsilon.
end_greedy: 0.01  # The end value of greedy epsilon.
decay_step_greedy: 1000000
sync_frequency: 200  # The frequency to synchronize target networks.

vf_coef: 0.1  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.95  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_advnorm: False  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.8  # The GAE lambda.
use_grad_clip: True  # Whether to clip gradients.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 10000000  # The total running steps.
eval_interval: 100000  # The number of steps between two consecutive evaluations.
test_episode: 16  # The number of episodes to run in each evaluation.

log_dir: "logs/coma/"  # The directory to store logger file.
model_dir: "models/coma/"  # The directory to store model file.

agent: "COMA"  # Name of the MARL algorithm.
env_name: "StarCraft2"  # Name of the environment.
env_id: "corridor"  # Environment map.
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "COMA_Learner"  # Name of learner.
policy: "Categorical_COMA_Policy"  # Name of policy.
representation: "Basic_RNN"  # Name of representation.
vectorize: "Subproc_StarCraft2"  # Method to vectorize the environment.
runner: "RunnerStarCraft2"  # Runner.

use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]  # The fully connected layer for Basic_RNN representation.
recurrent_hidden_size: 64  # The size of hidden layers of recurrent networks.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [128, 128]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

seed: 1  # Random seed.
parallels: 8  # Number of environments to run in parallel.
buffer_size: 8  # Total buffer size.
n_epochs: 1  # Number of epochs to update the model.
n_minibatch: 1  # Number of minibatches.
learning_rate_actor: 0.0007  # Learning rate of actor.
learning_rate_critic: 0.0007  # Learning rate of critic.
weight_decay: 0  # The weight decay coefficient.

start_greedy: 0.5  # The start value of greedy epsilon.
end_greedy: 0.01  # The end value of greedy epsilon.
decay_step_greedy: 1000000
sync_frequency: 200  # The frequency to synchronize target networks.

vf_coef: 0.1  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.99  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_advnorm: False  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.8  # The GAE lambda.
use_grad_clip: True  # Whether to clip gradients.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 10000000  # The total running steps.
eval_interval: 100000  # The number of steps between two consecutive evaluations.
test_episode: 16  # The number of episodes to run in each evaluation.

log_dir: "logs/coma/"  # The directory to store logger file.
model_dir: "models/coma/"  # The directory to store model file.

agent: "VDAC"
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: False  # Continuous action space or not.
learner: "VDAC_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_MLP"
vectorize: "SubprocVecMultiAgentEnv"
runner: "MARL"  # Runner

# recurrent settings for Basic_RNN representation
use_rnn: False  # Whether to use recurrent neural networks.
rnn: "GRU"  # The type of recurrent layer.
fc_hidden_sizes: [64, 64, 64]  # The hidden size of feed forward layer in RNN representation.
recurrent_hidden_size: 64  # The hidden size of the recurrent layer.
N_recurrent_layers: 1  # The number of recurrent layers.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

representation_hidden_size: [64, ]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [64, ]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
activation_action: "sigmoid"  # The activation function for the last layer of the actor.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

mixer: "VDN"  # choices: VDN (sum), QMIX (monotonic)
hidden_dim_mixing_net: 32  # hidden units of mixing network (when mixer is QMIX)
hidden_dim_hyper_net: 32  # hidden units of hyper network (when mixer is QMIX)

seed: 1  # Random seed.
parallels: 16  # The number of environments to run in parallel.
buffer_size: 32  # Number of the transitions (use_rnn is False), or the episodes (use_rnn is True) in replay buffer.
n_epochs: 1  # Number of epochs to train.
n_minibatch: 1  # Number of minibatches to sample and train on; batch_size = buffer_size // n_minibatch.
learning_rate: 0.0005  # Learning rate.
weight_decay: 0  # The weight decay coefficient.

vf_coef: 0.1  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
gamma: 0.99  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_global_state: True  # Whether to use the global state in place of merged observations.
use_value_clip: False  # Limit the value range.
value_clip_range: 0.2  # The value clip range.
use_value_norm: False  # Use running mean and std to normalize rewards.
use_huber_loss: False  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0  # The threshold at which to change between delta-scaled L1 and L2 loss (for huber loss).
use_advnorm: False  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.8  # The GAE lambda.
use_grad_clip: True  # Whether to clip gradients.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 10000000  # The total running steps.
eval_interval: 100000  # The number of steps between two consecutive evaluations.
test_episode: 5  # The number of episodes to run in each evaluation.

log_dir: "logs/vdac/"
model_dir: "models/vdac/"
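
The `n_minibatch` comment above states that `batch_size = buffer_size // n_minibatch`. The sketch below illustrates that relation by splitting buffer indices into minibatches; the helper `minibatch_indices` and the shuffling step are assumptions made for illustration, not a description of how the library samples.

```python
import random


def minibatch_indices(buffer_size, n_minibatch, seed=1):
    """Split shuffled buffer indices into n_minibatch equal-sized batches."""
    batch_size = buffer_size // n_minibatch
    indices = list(range(buffer_size))
    # Assumption: indices are shuffled once before being split.
    random.Random(seed).shuffle(indices)
    return [indices[i * batch_size:(i + 1) * batch_size]
            for i in range(n_minibatch)]


if __name__ == "__main__":
    # With buffer_size: 32 and n_minibatch: 1, one batch holds all 32 samples.
    print([len(b) for b in minibatch_indices(32, 1)])
    print([len(b) for b in minibatch_indices(32, 4)])
```

With the preset values (`buffer_size: 32`, `n_minibatch: 1`), each update would therefore use the whole buffer as a single batch.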

agent: "VDAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "2m_vs_1z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDAC_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

mixer: "QMIX"  # choices: VDN (sum), QMIX (monotonic)
hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_value_clip: True  # Limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Use running mean and std to normalize rewards.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Use advantage normalization.
use_gae: True  # Use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients.
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 1000000
eval_interval: 10000
test_episode: 16

log_dir: "logs/vdac/"
model_dir: "models/vdac/"
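
The `use_huber_loss` and `huber_delta` entries above select a Huber loss for the critic. As a minimal sketch, the standard per-element Huber loss with `huber_delta: 10.0` can be written as follows; the function name `huber_loss` is illustrative, and the library may rely on its deep learning backend's built-in implementation instead.

```python
def huber_loss(error, delta=10.0):
    """Standard Huber loss for a single error term.

    Quadratic (L2) for |error| <= delta, delta-scaled linear (L1) beyond it.
    """
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2
    return delta * (abs_error - 0.5 * delta)


if __name__ == "__main__":
    for err in (2.0, 10.0, 20.0, -20.0):
        print(err, huber_loss(err))
```

With a threshold as large as 10.0, the loss behaves like MSE for most errors and only limits the influence of very large value errors.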

agent: "VDAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "3m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDAC_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64,]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

mixer: "QMIX"  # choices: Independent, VDN (sum), QMIX (monotonic)
hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.0
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_value_clip: True  # Limit the value range.
value_clip_range: 0.2
use_value_norm: False  # Use running mean and std to normalize rewards.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: False  # Use advantage normalization.
use_gae: True  # Use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients.
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 1000000
eval_interval: 10000
test_episode: 16

log_dir: "logs/vdac/"
model_dir: "models/vdac/"
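
The `use_grad_clip` and `grad_clip_norm` entries above bound the gradient norm before each update. The sketch below shows norm-based clipping on a flat list of gradient values, assuming a global L2 norm; the helper `clip_by_global_norm` is hypothetical, and real training code would typically call the backend's own clipping utility.

```python
import math


def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale grads so that their global L2 norm does not exceed max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the limit: leave the gradients unchanged.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


if __name__ == "__main__":
    print(clip_by_global_norm([3.0, 4.0]))    # norm 5, unchanged
    print(clip_by_global_norm([30.0, 40.0]))  # norm 50, rescaled to norm 10
```

Clipping by norm keeps the gradient direction and only shrinks its magnitude, unlike clipping by value, which limits each component independently.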

agent: "VDAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDAC_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"
on_policy: True

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

mixer: "QMIX"  # choices: VDN (sum), QMIX (monotonic)
hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_value_clip: True  # Limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Use running mean and std to normalize rewards.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Use advantage normalization.
use_gae: True  # Use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients.
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 1000000  # 1M
eval_interval: 5000
test_episode: 16

log_dir: "logs/vdac/"
model_dir: "models/vdac/"

agent: "VDAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "1c3s5z"  # Environment ID
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDAC_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

mixer: "QMIX"  # choices: VDN (sum), QMIX (monotonic)
hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_value_clip: True  # Limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Use running mean and std to normalize rewards.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Use advantage normalization.
use_gae: True  # Use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients.
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 2000000
eval_interval: 20000
test_episode: 16

log_dir: "logs/vdac/"
model_dir: "models/vdac/"

agent: "VDAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "2s3z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDAC_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

mixer: "QMIX"  # choices: VDN (sum), QMIX (monotonic)
hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_value_clip: True  # Limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Use running mean and std to normalize rewards.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Use advantage normalization.
use_gae: True  # Use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients.
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 2000000
eval_interval: 20000
test_episode: 16

log_dir: "logs/vdac/"
model_dir: "models/vdac/"

agent: "VDAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "25m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDAC_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"
on_policy: True

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

mixer: "QMIX"  # choices: VDN (sum), QMIX (monotonic)
hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_value_clip: True  # Limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Use running mean and std to normalize rewards.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Use advantage normalization.
use_gae: True  # Use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients.
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 5000000  # 5M
eval_interval: 25000
test_episode: 16

log_dir: "logs/vdac/"
model_dir: "models/vdac/"

agent: "VDAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "5m_vs_6m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDAC_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"
on_policy: True

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to use parameter sharing for all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

mixer: "QMIX"  # choices: VDN (sum), QMIX (monotonic)
hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state in place of joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.05
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 10000000  # 10M
eval_interval: 50000
test_episode: 16

log_dir: "logs/vdac/"
model_dir: "models/vdac/"
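
The `use_huber_loss` and `huber_delta` settings select between a Huber loss and an MSE loss for the critic. As a rough illustration (a scalar sketch of the standard Huber definition, not the library's code; `huber_loss` is a hypothetical helper name), the loss is quadratic for small errors and linear beyond `huber_delta`:

```python
def huber_loss(error, delta=10.0):
    """Standard Huber loss for a single prediction error."""
    abs_err = abs(error)
    if abs_err <= delta:
        # Quadratic (L2-like) region for small errors.
        return 0.5 * error * error
    # Linear (delta-scaled L1) region for large errors.
    return delta * (abs_err - 0.5 * delta)
```

With `huber_delta: 10.0`, an error of 20 costs 150 instead of the 200 that the squared-error branch would give, so large value errors produce bounded gradients.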

agent: "VDAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m_vs_9m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDAC_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"
on_policy: True

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

mixer: "QMIX"  # choices: VDN (sum), QMIX (monotonic)
hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state in place of joint observations
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0
use_value_clip: True  # limit the value range
value_clip_range: 0.05
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 10000000  # 10M
eval_interval: 50000
test_episode: 16

log_dir: "logs/vdac/"
model_dir: "models/vdac/"
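
The `use_grad_clip` and `grad_clip_norm` settings describe clipping gradients by their norm. The sketch below shows the general idea on a flat list of gradient values; it is an illustration only, the helper name `clip_by_global_norm` is hypothetical, and how each backend applies the clipping inside XuanCe is not shown here.

```python
import math


def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale grads so their L2 norm does not exceed max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the limit: leave the gradients unchanged.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm
```

Clipping by norm keeps the direction of the update and only shortens it, which is different from clipping each element to a fixed range (the `clip_type: 0` option mentioned in the comments).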

agent: "VDAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "MMM2"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDAC_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"
on_policy: True

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 1.0

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

mixer: "QMIX"  # choices: VDN (sum), QMIX (monotonic)
hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 2
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state in place of joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.2
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 10000000  # 10M
eval_interval: 50000
test_episode: 16

log_dir: "logs/vdac/"
model_dir: "models/vdac/"
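
The `mixer` option in the VDAC presets chooses between a sum (VDN) and a monotonic mixer (QMIX). The sketch below contrasts the two ideas in a deliberately simplified form: the monotonic variant uses a single layer with fixed weights, whereas QMIX generates its mixing weights with hypernetworks conditioned on the global state (sized by `hidden_dim_hyper_net` and `hidden_dim_mixing_net`). Both function names are hypothetical.

```python
def vdn_mix(agent_values):
    """VDN: the joint value is the plain sum of the per-agent values."""
    return sum(agent_values)


def monotonic_mix(agent_values, weights, bias=0.0):
    """Single-layer monotonic mixing.

    Weights pass through abs(), so the output never decreases when any
    single agent's value increases.
    """
    return sum(abs(w) * v for w, v in zip(weights, agent_values)) + bias
```

The non-negative weights are what make the mixer monotonic in every agent's value.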

agent: "VDAC"
env_name: "StarCraft2"  # Name of the environment.
env_id: "corridor"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "VDAC_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"
on_policy: True

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

mixer: "QMIX"  # choices: VDN (sum), QMIX (monotonic)
hidden_dim_mixing_net: 32  # hidden units of mixing network
hidden_dim_hyper_net: 64  # hidden units of hyper network

seed: 1
parallels: 8
buffer_size: 8
n_epochs: 1
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state in place of joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.2
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()

running_steps: 10000000  # 10M
eval_interval: 50000
test_episode: 16

log_dir: "logs/vdac/"
model_dir: "models/vdac/"

agent: "IPPO"  # The agent name.
env_name: "mpe"  # The environment name.
env_id: "simple_spread_v3"  # The environment id.
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Whether to use continuous control.
learner: "IPPO_Learner"  # The learner name.
policy: "Gaussian_MAAC_Policy"  # The policy name.
representation: "Basic_MLP"  # The representation name.
vectorize: "SubprocVecMultiAgentEnv"  # The method used to vectorize the environment so that copies can run in parallel.
runner: "MARL"  # The runner.

# recurrent settings for Basic_RNN representation.
use_rnn: False  # Whether to use a recurrent neural network as the representation. (The representation should then be "Basic_RNN".)
rnn: "GRU"  # The type of recurrent layer.
fc_hidden_sizes: [64, 64, 64]  # The hidden sizes of the feed-forward layers in the RNN representation.
recurrent_hidden_size: 64  # The hidden size of the recurrent layer.
N_recurrent_layers: 1  # The number of recurrent layers.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

representation_hidden_size: [64, ]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [64, ]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
activation_action: "sigmoid"  # The activation function for the last layer of the actor.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

seed: 1  # Random seed.
parallels: 16  # The number of environments to run in parallel.
buffer_size: 3200  # Number of the transitions (use_rnn is False), or the episodes (use_rnn is True) in replay buffer.
n_epochs: 10  # Number of epochs to train.
n_minibatch: 1  # Number of minibatches to sample and train on. batch_size = buffer_size // n_minibatch.
learning_rate: 0.0007  # Learning rate.
weight_decay: 0  # The weight decay (L2 penalty) coefficient of the optimizer.

vf_coef: 0.5  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
target_kl: 0.25  # For MAPPO_KL learner.
clip_range: 0.2  # The clip range for the probability ratio in the PPO objective.
gamma: 0.99  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_global_state: False  # Whether to use the global state in place of merged observations.
use_value_clip: True  # Limit the value range.
value_clip_range: 0.2  # The value clip range.
use_value_norm: True  # Use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0  # The threshold at which to change between delta-scaled L1 and L2 loss. (For huber loss).
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.95  # The GAE lambda.
use_grad_clip: True  # Gradient clipping.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 10000000  # The total running steps.
eval_interval: 100000  # The interval between two consecutive evaluations.
test_episode: 5  # The episodes to test in each test period.

log_dir: "logs/ippo/"
model_dir: "models/ippo/"
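
The comment on `n_minibatch` states that `batch_size = buffer_size // n_minibatch`. The short sketch below works through that arithmetic for two of the IPPO presets. The update count `n_epochs * n_minibatch` is an assumption (each epoch is taken to visit every minibatch once), and `updates_per_buffer` is a hypothetical helper, not part of XuanCe.

```python
def updates_per_buffer(buffer_size, n_minibatch, n_epochs):
    """Return (batch_size, n_updates) for one filled on-policy buffer."""
    # From the config comment: batch_size = buffer_size // n_minibatch.
    batch_size = buffer_size // n_minibatch
    # Assumption: every epoch performs one update per minibatch.
    n_updates = n_epochs * n_minibatch
    return batch_size, n_updates


# simple_spread_v3 preset: buffer_size=3200, n_minibatch=1, n_epochs=10
mpe = updates_per_buffer(3200, 1, 10)
# Football preset: buffer_size=400, n_minibatch=2, n_epochs=15
football = updates_per_buffer(400, 2, 15)
```

Under these assumptions the MPE preset trains on the whole buffer of 3200 transitions ten times, while the Football preset splits its buffer into two batches of 200.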

agent: "IPPO"  # The agent name.
# environment settings
env_name: "Football"
scenario: "academy_3_vs_1_with_keeper"
env_seed: 1  # The random seed of the environment.
use_stacked_frames: False  # Whether to use stacked_frames
num_agent: 3
num_adversary: 0
obs_type: "simple115v2"  # representation used to build the observation, choices: ["simple115v2", "extracted", "pixels_gray", "pixels"]
rewards_type: "scoring,checkpoints"  # comma separated list of rewards to be added
smm_width: 96  # width of super minimap
smm_height: 72  # height of super minimap
fps: 15  # Frames per second.
learner: "IPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_Football"
runner: "RunnerFootball"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

seed: 1
parallels: 50
buffer_size: 400
n_epochs: 15
n_minibatch: 2
learning_rate: 5.0e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state to calculate values
use_value_clip: True  # limit the value range
value_clip_range: 0.2
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0

running_steps: 25000000
eval_interval: 200000
test_episode: 50

log_dir: "logs/ippo/"
model_dir: "models/ippo/"
videos_dir: "./videos/ippo/"
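
The `use_linear_lr_decay` and `end_factor_lr_decay` settings describe a linear learning-rate schedule. The sketch below assumes the schedule starts at the configured `learning_rate` and reaches `learning_rate * end_factor_lr_decay` at the end of training; both the start factor of 1.0 and the decay horizon are assumptions, and `linear_lr` is a hypothetical helper.

```python
def linear_lr(step, total_steps, base_lr=5.0e-4, end_factor=0.5):
    """Linearly interpolate the learning rate from base_lr to base_lr * end_factor."""
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / total_steps, 0.0), 1.0)
    factor = 1.0 + (end_factor - 1.0) * frac
    return base_lr * factor
```

With the Football values (`learning_rate: 5.0e-4`, `end_factor_lr_decay: 0.5`), this schedule would finish at 2.5e-4 if `use_linear_lr_decay` were set to True.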

agent: "IPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "2m_vs_1z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 15
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state in place of merged observations
use_value_clip: True  # limit the value range
value_clip_range: 0.2
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0

running_steps: 1000000
eval_interval: 10000
test_episode: 16

log_dir: "logs/ippo/"
model_dir: "models/ippo/"
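
The `clip_range` setting bounds the probability ratio in the PPO-style objective used by the IPPO presets. A scalar sketch of the standard clipped surrogate follows; it illustrates the textbook formula rather than the library's learner code, and `clipped_surrogate` is a hypothetical name.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO clipped surrogate objective for a single sample (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s).
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)
```

A smaller `clip_range`, such as the 0.05 used for the harder StarCraft2 maps, keeps each policy update closer to the data-collecting policy.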

agent: "IPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "3m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 15
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state in place of joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.2
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0

running_steps: 1000000
eval_interval: 10000
test_episode: 16

log_dir: "logs/ippo/"
model_dir: "models/ippo/"

agent: "IPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 15
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state in place of joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.2
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0

running_steps: 1000000  # 1M
eval_interval: 10000
test_episode: 16

log_dir: "logs/ippo/"
model_dir: "models/ippo/"

agent: "IPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "1c3s5z"  # Environment ID
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 15
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state in place of joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.2
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0

running_steps: 2000000
eval_interval: 20000
test_episode: 16

log_dir: "logs/ippo/"
model_dir: "models/ippo/"

agent: "IPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "2s3z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 15
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state in place of joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.2
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0

running_steps: 2000000
eval_interval: 20000
test_episode: 16

log_dir: "logs/ippo/"
model_dir: "models/ippo/"

agent: "IPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "25m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 10
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state in place of joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.2
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0

running_steps: 5000000  # 5M
eval_interval: 50000
test_episode: 16

log_dir: "logs/ippo/"
model_dir: "models/ippo/"

agent: "IPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "5m_vs_6m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 10
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.05
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state in place of joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.05
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0

running_steps: 10000000  # 10M
eval_interval: 100000
test_episode: 16

log_dir: "logs/ippo/"
model_dir: "models/ippo/"
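
The comment on `use_value_clip` ("limit the value range") is terse. One common reading, borrowed from PPO implementations and assumed here rather than confirmed for XuanCe, is that the new value prediction may move at most `value_clip_range` away from the prediction made at rollout time, and the larger of the clipped and unclipped errors is used. The helper name `clipped_value_loss` is hypothetical, and a squared error stands in for whichever loss is configured.

```python
def clipped_value_loss(value, old_value, target, clip_range=0.05):
    """PPO-style clipped value loss for a single sample (squared-error form)."""
    # Limit how far the new prediction may move from the old one.
    step = min(max(value - old_value, -clip_range), clip_range)
    clipped_value = old_value + step
    # Use the pessimistic (larger) of the two errors.
    return max((value - target) ** 2, (clipped_value - target) ** 2)
```

Under this reading, the 0.05 range in this preset permits a smaller change in the critic's predictions per update than the 0.2 range used on the easier maps.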

agent: "IPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m_vs_9m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 15
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.05
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state in place of joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.05
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0

running_steps: 10000000  # 10M
eval_interval: 100000
test_episode: 16

log_dir: "logs/ippo/"
model_dir: "models/ippo/"

agent: "IPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "MMM2"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 1.0

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 5
n_minibatch: 2
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # whether to use linear learning rate decay
end_factor_lr_decay: 0.5
use_global_state: False  # whether to use the global state in place of joint observations
use_value_clip: True  # limit the value range
value_clip_range: 0.2
use_value_norm: True  # use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # use advantage normalization.
use_gae: True  # use GAE trick to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # gradient clipping
grad_clip_norm: 10.0

running_steps: 10000000  # 10M
eval_interval: 100000
test_episode: 16

log_dir: "logs/ippo/"
model_dir: "models/ippo/"

agent: "IPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "corridor"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "IPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask out unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 5
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of joint observations.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 10000000  # 10M
eval_interval: 100000
test_episode: 16

log_dir: "logs/ippo/"
model_dir: "models/ippo/"
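
The StarCraft2 presets above set use_actions_mask to True so that unavailable actions are never sampled. As a minimal sketch of the idea (not XuanCe's actual implementation; the function name masked_softmax and the use of negative infinity as the masking value are assumptions made for illustration), a categorical policy can zero out the probability of masked actions like this:

```python
import math


def masked_softmax(logits, available):
    """Softmax over logits with unavailable actions forced to zero probability."""
    if not any(available):
        raise ValueError("at least one action must be available")
    # Unavailable actions get a logit of -inf, so exp() maps them to exactly 0.0.
    masked = [x if ok else -math.inf for x, ok in zip(logits, available)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three actions; the second action is unavailable.
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
print(probs)
```

The remaining probability mass is renormalized over the available actions only.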

agent: "MAPPO"  # The agent name.
env_name: "mpe"  # The environment name.
env_id: "simple_spread_v3"  # The environment id.
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Whether to use continuous control.
learner: "MAPPO_Learner"
policy: "Gaussian_MAAC_Policy"  # The policy name.
representation: "Basic_MLP"  # The representation name.
vectorize: "SubprocVecMultiAgentEnv"  # The method used to vectorize the environments so that they can run in parallel.
runner: "MARL"  # The runner.

# recurrent settings for Basic_RNN representation.
use_rnn: False  # Whether to use a recurrent neural network as the representation. (The representation should be "Basic_RNN".)
rnn: "GRU"  # The type of recurrent layer.
fc_hidden_sizes: [64, 64, 64]  # The hidden sizes of the feed-forward layers in the RNN representation.
recurrent_hidden_size: 64  # The hidden size of the recurrent layer.
N_recurrent_layers: 1  # The number of recurrent layers.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01

representation_hidden_size: [64, ]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [64, ]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
activation_action: "sigmoid"  # The activation function for the last layer of the actor.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: False  # Whether to mask out unavailable actions.

seed: 1  # Random seed.
parallels: 16  # The number of environments to run in parallel.
buffer_size: 400  # Number of transitions (use_rnn is False) or episodes (use_rnn is True) in the replay buffer.
n_epochs: 1  # Number of epochs to train.
n_minibatch: 1  # Number of minibatches to sample and train on. batch_size = buffer_size // n_minibatch.
learning_rate: 0.0007  # Learning rate.
weight_decay: 0  # Weight decay coefficient for the optimizer.

vf_coef: 0.5  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
target_kl: 0.25  # For MAPPO_KL learner.
clip_range: 0.2  # Ratio clip range, for MAPPO learner.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().
gamma: 0.95  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for the learning rate scheduler.
use_global_state: False  # Whether to use the global state in place of merged observations.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2  # The value clip range.
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0  # The threshold at which to change between delta-scaled L1 and L2 loss (for Huber loss).
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE.
gae_lambda: 0.95  # The GAE lambda.
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0  # The max norm of the gradient.

running_steps: 10000000  # The total running steps.
eval_interval: 100000  # The number of steps between two consecutive evaluations.
test_episode: 5  # The episodes to test in each test period.

log_dir: "logs/mappo/"
model_dir: "models/mappo/"
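
The n_minibatch comment in the preset above states that batch_size = buffer_size // n_minibatch. The short sketch below works through that arithmetic for two of the presets on this page. The helper name minibatch_plan is hypothetical, and the update count assumes a PPO-style loop that visits every minibatch once per epoch, which the presets do not state explicitly.

```python
def minibatch_plan(buffer_size: int, n_minibatch: int, n_epochs: int) -> dict:
    """Derive the per-update batch size and update count from config values."""
    if n_minibatch <= 0 or buffer_size < n_minibatch:
        raise ValueError("n_minibatch must be in the range [1, buffer_size]")
    batch_size = buffer_size // n_minibatch  # formula quoted in the config comment
    # Assumption: each epoch iterates over all minibatches exactly once.
    return {"batch_size": batch_size, "updates_per_rollout": n_epochs * n_minibatch}


# simple_spread_v3 preset: buffer_size=400, n_minibatch=1, n_epochs=1
spread_plan = minibatch_plan(buffer_size=400, n_minibatch=1, n_epochs=1)
# A StarCraft2 preset with buffer_size=128, n_minibatch=2, n_epochs=5
sc2_plan = minibatch_plan(buffer_size=128, n_minibatch=2, n_epochs=5)
print(spread_plan, sc2_plan)
```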

agent: "MAPPO"
env_name: "mpe"  # Name of the environment.
env_id: "simple_push_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "MAPPO_Learner"
policy: "Gaussian_MAAC_Policy"
representation: "Basic_MLP"
vectorize: "SubprocVecMultiAgentEnv"
runner: "RunnerCompetition"

# recurrent settings for Basic_RNN representation
use_rnn: False  # Whether to use a recurrent neural network as the representation. (The representation should be "Basic_RNN".)
rnn: "GRU"  # The type of recurrent layer.
fc_hidden_sizes: [64, 64, 64]  # The hidden sizes of the feed-forward layers in the RNN representation.
recurrent_hidden_size: 64  # The hidden size of the recurrent layer.
N_recurrent_layers: 1  # The number of recurrent layers.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01

representation_hidden_size: [64, ]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [64, ]
critic_hidden_size: [256, ]
activation: "relu"  # The activation function of each hidden layer.
activation_action: "sigmoid"  # The activation function for the last layer of the actor.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: False  # Whether to mask out unavailable actions.

seed: 1
parallels: 128
buffer_size: 3200
n_epochs: 10
n_minibatch: 1
learning_rate: 0.0007
weight_decay: 0

vf_coef: 0.5
ent_coef: 0.01
target_kl: 0.25  # for MAPPO_KL learner
clip_range: 0.2  # ratio clip range, for MAPPO learner
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.95  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of merged observations.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True
use_gae: True
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 10000000
eval_interval: 100000
test_episode: 5

log_dir: "logs/mappo/"
model_dir: "models/mappo/"
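
The presets set use_huber_loss to True with huber_delta at 10.0. As an illustration of what the delta threshold does, here is a minimal scalar sketch of the standard Huber loss definition (quadratic within delta, linear beyond it). It is not taken from XuanCe's learner code, and the function name huber_loss is an assumption.

```python
def huber_loss(error: float, delta: float = 10.0) -> float:
    """Standard Huber loss for a single prediction error."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error * error  # L2 region for small errors
    return delta * (abs_error - 0.5 * delta)  # delta-scaled L1 region


small = huber_loss(2.0)   # inside the quadratic region
large = huber_loss(20.0)  # beyond delta, grows linearly
print(small, large)
```

With delta at 10.0, an error of 20 costs 150 instead of the 200 that half the squared error would give, so large value errors pull less strongly on the critic.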

agent: "MAPPO"
env_name: "mpe"  # Name of the environment.
env_id: "simple_adversary_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "MAPPO_Learner"
policy: "Gaussian_MAAC_Policy"
representation: "Basic_MLP"
vectorize: "SubprocVecMultiAgentEnv"
runner: "RunnerCompetition"

# recurrent settings for Basic_RNN representation
use_rnn: False  # Whether to use a recurrent neural network as the representation. (The representation should be "Basic_RNN".)
rnn: "GRU"  # The type of recurrent layer.
fc_hidden_sizes: [64, 64, 64]  # The hidden sizes of the feed-forward layers in the RNN representation.
recurrent_hidden_size: 64  # The hidden size of the recurrent layer.
N_recurrent_layers: 1  # The number of recurrent layers.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01

representation_hidden_size: [64, ]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [64, ]
critic_hidden_size: [256, ]
activation: "relu"  # The activation function of each hidden layer.
activation_action: "sigmoid"  # The activation function for the last layer of the actor.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: False  # Whether to mask out unavailable actions.

seed: 1
parallels: 128
buffer_size: 3200
n_epochs: 10
n_minibatch: 1
learning_rate: 0.0007
weight_decay: 0

vf_coef: 0.5
ent_coef: 0.01
target_kl: 0.25  # for MAPPO_KL learner
clip_range: 0.2  # ratio clip range, for MAPPO learner
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.95  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of merged observations.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True
use_gae: True
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 10000000
eval_interval: 100000
test_episode: 5

log_dir: "logs/mappo/"
model_dir: "models/mappo/"
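
Although use_linear_lr_decay is False in these presets, end_factor_lr_decay at 0.5 describes where the learning rate would end up if the decay were enabled. The sketch below assumes a schedule that moves linearly from a factor of 1.0 to end_factor over the whole run. The function name linear_lr and the exact step counting are assumptions, not XuanCe's scheduler.

```python
def linear_lr(step: int, total_steps: int,
              base_lr: float = 0.0007, end_factor: float = 0.5) -> float:
    """Linearly scale base_lr from factor 1.0 down to end_factor."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    factor = 1.0 + (end_factor - 1.0) * progress
    return base_lr * factor


lr_start = linear_lr(0, 10_000_000)
lr_middle = linear_lr(5_000_000, 10_000_000)
lr_end = linear_lr(10_000_000, 10_000_000)
print(lr_start, lr_middle, lr_end)
```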

agent: "MAPPO"  # The name of the MARL algorithm.
# environment settings
env_name: "Football"
scenario: "1_vs_1_easy"
env_seed: 1  # The random seed of the environment.
use_stacked_frames: False  # Whether to use stacked_frames
num_agent: 1
num_adversary: 0
obs_type: "simple115v2"  # representation used to build the observation, choices: ["simple115v2", "extracted", "pixels_gray", "pixels"]
rewards_type: "scoring,checkpoints"  # comma separated list of rewards to be added
smm_width: 96  # width of super minimap
smm_height: 72  # height of super minimap
fps: 15  # Frames per second.
max_episode_steps: 1000
learner: "MAPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_Football"
runner: "RunnerFootball"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: False  # Whether to mask out unavailable actions.

seed: 1
parallels: 50
buffer_size: 100
n_epochs: 10
n_minibatch: 1
learning_rate: 5.0e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: True  # Whether to use the global state to calculate values.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 25000000
eval_interval: 200000
test_episode: 50

log_dir: "logs/mappo/"
model_dir: "models/mappo/"
videos_dir: "./videos/mappo/"

agent: "MAPPO"  # The name of the MARL algorithm.
# environment settings
env_name: "Football"
scenario: "academy_3_vs_1_with_keeper"
env_seed: 1  # The random seed of the environment.
use_stacked_frames: False  # Whether to use stacked_frames
num_agent: 3
num_adversary: 0
obs_type: "simple115v2"  # representation used to build the observation, choices: ["simple115v2", "extracted", "pixels_gray", "pixels"]
rewards_type: "scoring,checkpoints"  # comma separated list of rewards to be added
smm_width: 96  # width of super minimap
smm_height: 72  # height of super minimap
fps: 15  # Frames per second.
max_episode_steps: 1000
learner: "MAPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_Football"
runner: "RunnerFootball"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: False  # Whether to mask out unavailable actions.

seed: 1
parallels: 50
buffer_size: 400
n_epochs: 15
n_minibatch: 2
learning_rate: 5.0e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: True  # Whether to use the global state to calculate values.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 25000000
eval_interval: 200000
test_episode: 50

log_dir: "logs/mappo/"
model_dir: "models/mappo/"
videos_dir: "./videos/mappo/"
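
The running_steps and eval_interval values together determine how often a preset is evaluated. The sketch below does that division for two presets on this page. It assumes one evaluation every eval_interval steps over the whole run; the helper name evaluation_count is hypothetical, and how a given runner counts steps across parallel environments is not specified here.

```python
def evaluation_count(running_steps: int, eval_interval: int) -> int:
    """Number of evaluation periods implied by the two config values."""
    if eval_interval <= 0:
        raise ValueError("eval_interval must be positive")
    return running_steps // eval_interval


# Football presets: running_steps=25000000, eval_interval=200000
football_evals = evaluation_count(25_000_000, 200_000)
# simple_spread_v3 preset: running_steps=10000000, eval_interval=100000
spread_evals = evaluation_count(10_000_000, 100_000)
print(football_evals, spread_evals)
```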

agent: "MAPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "2m_vs_1z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "MAPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask out unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 15
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state in place of merged observations.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 1000000
eval_interval: 10000
test_episode: 16

log_dir: "logs/mappo/"
model_dir: "models/mappo/"
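
The presets enable use_gae with gamma at 0.99 and gae_lambda at 0.95. For readers unfamiliar with the technique, the following is a minimal single-trajectory sketch of generalized advantage estimation. It is a textbook formulation written for illustration, not XuanCe's buffer code, and the function name and argument layout are assumptions.

```python
def gae_advantages(rewards, values, last_value, dones,
                   gamma: float = 0.99, gae_lambda: float = 0.95):
    """Compute GAE advantages and returns for one trajectory."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * gae_lambda * not_done * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Hypothetical two-step episode with zero value estimates.
adv, ret = gae_advantages(rewards=[1.0, 1.0], values=[0.0, 0.0],
                          last_value=0.0, dones=[False, True])
print(adv, ret)
```

A gae_lambda of 1.0 reduces to discounted Monte Carlo advantages, and 0.0 to one-step TD errors; 0.95 sits between the two.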

agent: "MAPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "3m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "MAPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask out unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 15
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state to calculate values.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 1000000
eval_interval: 10000
test_episode: 16

log_dir: "logs/mappo/"
model_dir: "models/mappo/"
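
With use_grad_clip set to True, grad_clip_norm at 10.0 bounds the size of each gradient update. The sketch below shows clipping by global norm on a flat list of gradient values. It follows the common formulation where gradients are rescaled only if their norm exceeds the limit; the function name and the small epsilon are assumptions rather than XuanCe code.

```python
import math


def clip_by_global_norm(grads, max_norm: float = 10.0, eps: float = 1e-6):
    """Rescale gradients so that their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale up
    return [g * scale for g in grads]


clipped = clip_by_global_norm([30.0, 40.0])  # norm 50 is scaled to about 10
unchanged = clip_by_global_norm([3.0, 4.0])  # norm 5 is below the limit
print(clipped, unchanged)
```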

agent: "MAPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "MAPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask out unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 15
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state to calculate values.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 1000000  # 1M
eval_interval: 10000
test_episode: 16

log_dir: "logs/mappo/"
model_dir: "models/mappo/"
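
The use_value_clip and value_clip_range settings limit how far the critic's prediction may move from its previous estimate in one update. The sketch below shows the usual PPO-style clipped value loss for a single sample. It uses squared error for simplicity (the presets enable Huber loss instead), and the function name and argument names are assumptions made for illustration.

```python
def clipped_value_loss(value: float, old_value: float, target: float,
                       clip_range: float = 0.2) -> float:
    """Pessimistic (max) of the unclipped and clipped squared value errors."""
    # Keep the new prediction within clip_range of the old prediction.
    delta = max(-clip_range, min(value - old_value, clip_range))
    clipped_value = old_value + delta
    return max((value - target) ** 2, (clipped_value - target) ** 2)


# The new prediction hits the target, but it moved 1.0 away from the old one.
loss = clipped_value_loss(value=1.0, old_value=0.0, target=1.0)
print(loss)
```

Taking the maximum keeps the loss from rewarding a prediction that jumped further than clip_range in a single update.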

agent: "MAPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "1c3s5z"  # Environment ID
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "MAPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask out unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 15
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state to calculate values.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 2000000
eval_interval: 20000
test_episode: 16

log_dir: "logs/mappo/"
model_dir: "models/mappo/"

agent: "MAPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "2s3z"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "MAPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask out unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 15
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state to calculate values.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 2000000
eval_interval: 20000
test_episode: 16

log_dir: "logs/mappo/"
model_dir: "models/mappo/"

agent: "MAPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "25m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "MAPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask out unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 10
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state to calculate values.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 5000000  # 5M
eval_interval: 50000
test_episode: 16

log_dir: "logs/mappo/"
model_dir: "models/mappo/"

agent: "MAPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "5m_vs_6m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "MAPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask out unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 10
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.05
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state to calculate values.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.05
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 10000000  # 10M
eval_interval: 100000
test_episode: 16

log_dir: "logs/mappo/"
model_dir: "models/mappo/"
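
The 5m_vs_6m preset above lowers clip_range from 0.2 to 0.05, which tightens how far the policy's probability ratio may move per update. The sketch below shows the standard PPO clipped surrogate objective for a single sample so the effect of the two values can be compared. It is a textbook formulation, not XuanCe's learner code, and the function name is an assumption.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      clip_range: float = 0.2) -> float:
    """PPO clipped surrogate objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # The pessimistic minimum removes the incentive to leave the clip region.
    return min(ratio * advantage, clipped_ratio * advantage)


# Same probability ratio and advantage under the two clip ranges used on this page.
wide = clipped_surrogate(ratio=1.5, advantage=1.0, clip_range=0.2)
narrow = clipped_surrogate(ratio=1.5, advantage=1.0, clip_range=0.05)
print(wide, narrow)
```

With the narrower range, the objective stops rewarding ratio increases sooner, which makes each policy update more conservative.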

agent: "MAPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "8m_vs_9m"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "MAPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64,]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask out unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 15
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.05
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state to calculate values.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.05
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use Huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 10000000  # 10M
eval_interval: 100000
test_episode: 16

log_dir: "logs/mappo/"
model_dir: "models/mappo/"

agent: "MAPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "MMM2"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "MAPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 1.0

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 5
n_minibatch: 2
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state to calculate values.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 10000000  # 10M
eval_interval: 100000
test_episode: 16

log_dir: "logs/mappo/"
model_dir: "models/mappo/"
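
To make the role of `clip_range` concrete, the sketch below evaluates the standard PPO clipped surrogate for a single sample. It follows the textbook form of the objective. The function name `clipped_surrogate` is hypothetical, and XuanCe's learner may organise the computation differently.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """PPO clipped objective for one sample (a quantity to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the advantage estimate.
    """
    # Restrict the probability ratio to [1 - clip_range, 1 + clip_range].
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# Illustrative call: a ratio of 1.5 with a positive advantage is capped at 1 + clip_range.
capped = clipped_surrogate(1.5, 1.0, clip_range=0.2)
```

A smaller `clip_range`, such as the 0.05 used in the preceding configuration, keeps each policy update closer to the policy that collected the data.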

agent: "MAPPO"
env_name: "StarCraft2"  # Name of the environment.
env_id: "corridor"
env_seed: 1  # The random seed of the environment.
fps: 15  # Frames per second.
learner: "MAPPO_Learner"
policy: "Categorical_MAAC_Policy"
representation: "Basic_RNN"
vectorize: "Subproc_StarCraft2"
runner: "RunnerStarCraft2"

# recurrent settings for Basic_RNN representation
use_rnn: True  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, 64, 64]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"
initialize: "orthogonal"
gain: 0.01

actor_hidden_size: []
critic_hidden_size: []
activation: "relu"  # The activation function of each hidden layer.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: True  # Whether to mask unavailable actions.

seed: 1
parallels: 8
buffer_size: 128
n_epochs: 5
n_minibatch: 1
learning_rate: 0.0007  # 7e-4
weight_decay: 0

vf_coef: 1.0
ent_coef: 0.01
target_kl: 0.25
clip_range: 0.2
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5
use_global_state: False  # Whether to use the global state to calculate values.
use_value_clip: True  # Whether to limit the value range.
value_clip_range: 0.2
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Whether to use GAE to calculate returns.
gae_lambda: 0.95
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0

running_steps: 10000000  # 10M
eval_interval: 100000
test_episode: 16

log_dir: "logs/mappo/"
model_dir: "models/mappo/"
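
The `use_huber_loss` and `huber_delta` settings select the loss applied to the value error. The sketch below shows the usual definition of the Huber loss for a scalar error, with the threshold defaulting to the `huber_delta: 10.0` used above. The function name `huber_loss` is hypothetical, and the exact scaling used by XuanCe's learner is an assumption here.

```python
def huber_loss(error: float, delta: float = 10.0) -> float:
    """Standard Huber loss: quadratic for small errors, linear for large ones."""
    abs_error = abs(error)
    if abs_error <= delta:
        # Same as half the squared error inside the threshold.
        return 0.5 * error * error
    # Linear growth beyond the threshold, continuous at abs_error == delta.
    return delta * (abs_error - 0.5 * delta)


# Illustrative calls on either side of the threshold.
small_error_loss = huber_loss(2.0)
large_error_loss = huber_loss(20.0)
```

With a threshold as large as 10.0, the loss behaves like a squared error for most value errors and only damps very large outliers.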

agent: "ISAC"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "ISAC_Learner"
policy: "Gaussian_ISAC_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"  # Runner

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks
alpha: 0.01
use_automatic_entropy_tuning: True

start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: False
grad_clip_norm: 10

eval_interval: 100000
test_episode: 5

log_dir: "logs/isac/"
model_dir: "models/isac/"
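
The `tau` entry controls how quickly the target networks track the online networks. The sketch below shows the usual soft (Polyak) update on plain lists of weights. The function name `soft_update` and the list-based parameters are illustrative assumptions, not the library's API.

```python
from typing import List


def soft_update(target: List[float], online: List[float], tau: float = 0.001) -> List[float]:
    """Return new target weights: (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


# Illustrative call: with tau = 0.001 the target moves 0.1% of the way per update.
new_target = soft_update([0.0, 0.0], [1.0, -1.0], tau=0.001)
```

With `tau: 0.001` the target weights change very slowly, which tends to stabilise the bootstrapped critic targets at the cost of slower propagation of new estimates.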

agent: "ISAC"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_push_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "ISAC_Learner"
policy: "Gaussian_ISAC_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "RunnerCompetition"

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks
alpha: 0.01
use_automatic_entropy_tuning: True

start_noise: 1.0
end_noise: 0.01
sigma: 0.1  # random noise for continuous actions
start_training: 1000  # start training after n steps
running_steps: 10000000
train_per_step: False  # True: train model per step; False: train model per episode.
training_frequency: 1

use_grad_clip: True
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5
log_dir: "logs/isac/"
model_dir: "models/isac/"

agent: "ISAC"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_adversary_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "ISAC_Learner"
policy: "Gaussian_ISAC_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "RunnerPettingzoo"

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks
alpha: 0.01
use_automatic_entropy_tuning: True

start_noise: 1.0
end_noise: 0.01
sigma: 0.1  # random noise for continuous actions
start_training: 1000  # start training after n steps
running_steps: 10000000
train_per_step: False  # True: train model per step; False: train model per episode.
training_frequency: 1

use_grad_clip: True
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5
log_dir: "logs/isac/"
model_dir: "models/isac/"

agent: "MASAC"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "MASAC_Learner"
policy: "Gaussian_MASAC_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"  # Runner

# recurrent settings for Basic_RNN representation.
use_rnn: False  # Whether to use a recurrent neural network as the representation (the representation should be "Basic_RNN").
rnn: "GRU"  # The type of recurrent layer.
fc_hidden_sizes: [64, 64, 64]  # The hidden sizes of the feed-forward layers in the RNN representation.
recurrent_hidden_size: 64  # The hidden size of the recurrent layer.
N_recurrent_layers: 1  # The number of recurrent layers.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks
alpha: 0.01
use_automatic_entropy_tuning: True

start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: False
grad_clip_norm: 10

eval_interval: 100000
test_episode: 5

log_dir: "logs/masac/"
model_dir: "models/masac/"

agent: "MASAC"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_push_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "MASAC_Learner"
policy: "Gaussian_MASAC_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "RunnerCompetition"

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks
alpha: 0.01
use_automatic_entropy_tuning: True

start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: True
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5

log_dir: "logs/masac/"
model_dir: "models/masac/"

agent: "MASAC"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_adversary_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "MASAC_Learner"
policy: "Gaussian_MASAC_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "RunnerCompetition"

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks
alpha: 0.01
use_automatic_entropy_tuning: True

start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: True
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5

log_dir: "logs/masac/"
model_dir: "models/masac/"

agent: "IDDPG"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "IDDPG_Learner"
policy: "Independent_DDPG_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"  # Runner

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks

start_noise: 1.0
end_noise: 0.01
sigma: 0.1  # random noise for continuous actions
start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5

log_dir: "logs/iddpg/"
model_dir: "models/iddpg/"
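
The `start_noise` and `end_noise` entries describe an exploration-noise scale that shrinks during training. The sketch below assumes a linear decay spread over `running_steps`. Both the schedule shape and the function name `noise_scale` are assumptions made for illustration, so the agent source should be consulted for the actual rule.

```python
def noise_scale(step: int, start_noise: float = 1.0, end_noise: float = 0.01,
                decay_steps: int = 10_000_000) -> float:
    """Linearly interpolate the noise scale from start_noise to end_noise.

    decay_steps is assumed to equal running_steps; this is not confirmed by the config.
    """
    # Clamp the step so the scale stays at end_noise after the decay finishes.
    fraction = min(max(step, 0), decay_steps) / decay_steps
    return start_noise + fraction * (end_noise - start_noise)


# Illustrative call: the scale halfway through training under the assumed schedule.
halfway = noise_scale(5_000_000)
```

Configurations that set `start_noise` and `end_noise` to the same value, such as the Drones examples, keep the noise scale constant under any such schedule.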

agent: "IDDPG"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_push_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "IDDPG_Learner"
policy: "Independent_DDPG_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "RunnerCompetition"

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks

start_noise: 1.0
end_noise: 0.01
sigma: 0.1  # random noise for continuous actions
start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: True
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5

log_dir: "logs/iddpg/"
model_dir: "models/iddpg/"

agent: "IDDPG"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_adversary_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "IDDPG_Learner"
policy: "Independent_DDPG_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "RunnerCompetition"

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks

start_noise: 1.0
end_noise: 0.01
sigma: 0.1  # random noise for continuous actions
start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: True
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5

log_dir: "logs/iddpg/"
model_dir: "models/iddpg/"

agent: "IDDPG"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_reference_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "IDDPG_Learner"
policy: "Independent_DDPG_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "RunnerCompetition"

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks

start_noise: 1.0
end_noise: 0.01
sigma: 0.1  # random noise for continuous actions
start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: True
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5

log_dir: "logs/iddpg/"
model_dir: "models/iddpg/"

agent: "IDDPG"  # The multi-agent RL algorithm.
env_name: "Drones"
env_id: "MultiHoverAviary"
env_seed: 1  # The random seed of the environment.
obs_type: 'kin'
act_type: 'vel'
num_drones: 3
record: False
obstacles: True
max_episode_steps: 2000
render: False
sleep: 0.01
learner: "IDDPG_Learner"
policy: "Independent_DDPG_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"  # Runner

actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'tanh'
use_parameter_sharing: True

seed: 1
parallels: 10
buffer_size: 1000000
batch_size: 1024
learning_rate_actor: 0.001  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.99  # discount factor
tau: 0.005  # soft update for target networks

start_noise: 0.1
end_noise: 0.1
sigma: 0.1
start_training: 2000  # start training after n steps
running_steps: 10000000
train_per_step: True  # True: train model per step; False: train model per episode.
training_frequency: 1

use_grad_clip: True
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5
log_dir: "logs/iddpg/"
model_dir: "models/iddpg/"

agent: "MADDPG"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "MADDPG_Learner"
policy: "MADDPG_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"  # Runner

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks

start_noise: 1.0
end_noise: 0.01
sigma: 0.1  # random noise for continuous actions
start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5

log_dir: "logs/maddpg/"
model_dir: "models/maddpg/"

agent: "MADDPG"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_push_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "MADDPG_Learner"
policy: "MADDPG_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "RunnerCompetition"

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks

start_noise: 1.0
end_noise: 0.01
sigma: 0.1  # random noise for continuous actions
start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5

log_dir: "logs/maddpg/"
model_dir: "models/maddpg/"

agent: "MADDPG"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_adversary_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "MADDPG_Learner"
policy: "MADDPG_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "RunnerCompetition"

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, ]
critic_hidden_size: [64, ]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks

start_noise: 1.0
end_noise: 0.01
sigma: 0.1
start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: True
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5

log_dir: "logs/maddpg/"
model_dir: "models/maddpg/"

agent: "MADDPG"  # The multi-agent RL algorithm.
env_name: "Drones"
env_id: "MultiHoverAviary"
env_seed: 1  # The random seed of the environment.
obs_type: 'kin'
act_type: 'vel'
num_drones: 3
record: False
obstacles: True
max_episode_steps: 1000
render: False
sleep: 0.01
learner: "MADDPG_Learner"
policy: "MADDPG_Policy"
representation: "Basic_Identical"
vectorize: "Dummy_Drone_MAS"
runner: "MARL"  # Runner
on_policy: False

actor_hidden_size: [256, 256]
critic_hidden_size: [256, 256]
activation: 'leaky_relu'
activation_action: 'tanh'

seed: 1
parallels: 10
buffer_size: 1000000
batch_size: 1024
learning_rate_actor: 0.001  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.99  # discount factor
tau: 0.005  # soft update for target networks

start_noise: 0.1
end_noise: 0.1
sigma: 0.1
start_training: 2000  # start training after n steps
running_steps: 10000000
train_per_step: True  # True: train model per step; False: train model per episode.
training_frequency: 1

use_grad_clip: True
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5
log_dir: "logs/maddpg/"
model_dir: "models/maddpg/"

agent: "MADDPG"  # The multi-agent RL algorithm.
env_name: "NewEnv_MAS"
env_id: "scenarios_0"
env_seed: 1  # The random seed of the environment.
max_episode_steps: 200
render: False
sleep: 0.01
continuous_action: True  # Continuous action space or not.
learner: "MADDPG_Learner"
policy: "MADDPG_Policy"
representation: "Basic_Identical"
vectorize: "Dummy_NewEnv_MAS"
runner: "MARL"  # Runner
on_policy: False

actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'tanh'

seed: 1
parallels: 16
buffer_size: 1000000
batch_size: 1024
learning_rate_actor: 0.001  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.99  # discount factor
tau: 0.005  # soft update for target networks

start_noise: 0.1
end_noise: 0.1
sigma: 0.1
start_training: 2000  # start training after n steps
running_steps: 1000000
train_per_step: True  # True: train model per step; False: train model per episode.
training_frequency: 1

use_grad_clip: True
grad_clip_norm: 0.5

eval_interval: 10000
test_episode: 5
log_dir: "logs/maddpg/"
model_dir: "models/maddpg/"

agent: "MATD3"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "MATD3_Learner"
policy: "MATD3_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"  # Runner

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks
actor_update_delay: 2

start_noise: 1.0
end_noise: 0.01
sigma: 0.1  # random noise for continuous actions
start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: False
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5

log_dir: "logs/matd3/"
model_dir: "models/matd3/"
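
The `actor_update_delay` entry reflects the delayed policy update used by TD3-style methods: critics are trained on every iteration, while the actor is trained less often. The sketch below only counts updates to show the resulting ratio. The function name `update_schedule` and the convention of counting iterations from 1 are assumptions.

```python
from typing import Tuple


def update_schedule(n_iterations: int, actor_update_delay: int = 2) -> Tuple[int, int]:
    """Return (critic_updates, actor_updates) after n_iterations training iterations."""
    critic_updates = 0
    actor_updates = 0
    for iteration in range(1, n_iterations + 1):
        critic_updates += 1  # critics are updated on every iteration
        if iteration % actor_update_delay == 0:
            actor_updates += 1  # the actor is updated once every actor_update_delay iterations
    return critic_updates, actor_updates


# Illustrative call: ten training iterations with the default delay of 2.
critic_count, actor_count = update_schedule(10)
```

With `actor_update_delay: 2`, the actor receives half as many gradient steps as the critics.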

agent: "MATD3"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_push_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "MATD3_Learner"
policy: "MATD3_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "RunnerCompetition"

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks
actor_update_delay: 2

start_noise: 1.0
end_noise: 0.01
sigma: 0.1
start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: True
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5

log_dir: "logs/matd3/"
model_dir: "models/matd3/"

agent: "MATD3"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_adversary_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: True  # Continuous action space or not.
learner: "MATD3_Learner"
policy: "MATD3_Policy"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "RunnerCompetition"

representation_hidden_size: []  # the units for each hidden layer
actor_hidden_size: [64, 64]
critic_hidden_size: [64, 64]
activation: 'leaky_relu'
activation_action: 'sigmoid'
use_parameter_sharing: True
use_actions_mask: False

seed: 1
parallels: 16
buffer_size: 100000
batch_size: 256
learning_rate_actor: 0.01  # learning rate for actor
learning_rate_critic: 0.001  # learning rate for critic
gamma: 0.95  # discount factor
tau: 0.001  # soft update for target networks
actor_update_delay: 2

start_noise: 1.0
end_noise: 0.01
sigma: 0.1
start_training: 1000  # start training after n steps
running_steps: 10000000
training_frequency: 25

use_grad_clip: True
grad_clip_norm: 0.5

eval_interval: 100000
test_episode: 5

log_dir: "logs/matd3/"
model_dir: "models/matd3/"

agent: "MFQ"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: False  # Continuous action space or not.
learner: "MFQ_Learner"
policy: "MF_Q_network"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"

use_rnn: False  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [64, ]
action_embedding_hidden_size: [64, 32]
q_hidden_size: [128, 64]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 16
buffer_size: 200000
batch_size: 256
learning_rate: 0.001
gamma: 0.95  # discount factor
policy_type: "greedy"  # choice of policy: Boltzmann policy or greedy policy. (Default is 'greedy')
temperature: 0.1  # Softmax temperature; sets the exploration rate of the Boltzmann policy.

start_greedy: 1.0
end_greedy: 0.05
decay_step_greedy: 2500000
start_training: 1000  # start training after n steps
running_steps: 10000000  # 10M
training_frequency: 25
sync_frequency: 100

use_grad_clip: False
grad_clip_norm: 0.5
use_parameter_sharing: True
use_actions_mask: False

eval_interval: 100000
test_episode: 5
log_dir: "logs/mfq/"
model_dir: "models/mfq/"
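
When `policy_type` selects the Boltzmann policy, `temperature` scales the Q-values before the softmax. The sketch below computes the resulting action probabilities for one agent. The function name `boltzmann_probs` is hypothetical, and the snippet is independent of the library's policy classes.

```python
import math
from typing import List


def boltzmann_probs(q_values: List[float], temperature: float = 0.1) -> List[float]:
    """Softmax over Q-values divided by the temperature."""
    scaled = [q / temperature for q in q_values]
    max_scaled = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative call with made-up Q-values for three actions.
probs = boltzmann_probs([1.0, 0.5, 0.0], temperature=0.1)
```

Lower temperatures concentrate probability on the highest-valued action, while higher temperatures move the distribution toward uniform.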

agent: "MFQ"  # The multi-agent RL algorithm.
env_name: "MAgent2"
env_id: "adversarial_pursuit_v4"
env_seed: 1  # The random seed of the environment.
minimap_mode: False
max_cycles: 500
extra_features: False
map_size: 45
render_mode: "rgb_array"
learner: "MFQ_Learner"
policy: "MF_Q_network"
representation: "Basic_Identical"
vectorize: "DummyVecMultiAgentEnv"
runner: "MARL"

use_rnn: False  # Whether to use recurrent neural networks.
rnn: "GRU"  # Choice of recurrent networks: GRU or LSTM.
N_recurrent_layers: 1  # Number of recurrent layers.
fc_hidden_sizes: [64, ]
recurrent_hidden_size: 64
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.

representation_hidden_size: [512, ]
q_hidden_size: [512, ]  # the units for each hidden layer
activation: "relu"  # The activation function of each hidden layer.

seed: 1
parallels: 10
buffer_size: 2000
batch_size: 256
learning_rate: 0.001
gamma: 0.95  # discount factor
policy_type: "greedy"  # choice of policy: Boltzmann policy or greedy policy. (Default is 'greedy')
temperature: 0.1  # Softmax temperature for the Boltzmann policy.

start_greedy: 1.0
end_greedy: 0.95
decay_step_greedy: 5000
start_training: 1000  # start training after n steps
running_steps: 1000000
training_frequency: 1
sync_frequency: 200

use_grad_clip: False
grad_clip_norm: 0.5
use_parameter_sharing: True
use_actions_mask: False

eval_interval: 10000
test_episode: 5
log_dir: "logs/mfq/"
model_dir: "models/mfq/"

agent: "MFAC"  # The multi-agent RL algorithm.
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
continuous_action: False  # Continuous action space or not.
learner: "MFAC_Learner"
policy: "Categorical_MFAC_Policy"
representation: "Basic_MLP"
vectorize: "SubprocVecMultiAgentEnv"
runner: "MARL"

use_rnn: False  # Whether to use recurrent neural networks.
rnn: "GRU"  # The type of recurrent layer.
fc_hidden_sizes: [64, 64, 64]  # The hidden sizes of the feed-forward layers in the RNN representation.
recurrent_hidden_size: 64  # The hidden size of the recurrent layer.
N_recurrent_layers: 1  # The number of recurrent layers.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

representation_hidden_size: [64, ]  # A list of hidden units for each layer of Basic_MLP representation networks.
action_embedding_hidden_size: [32, ]
actor_hidden_size: [64, ]
critic_hidden_size: [64, ]
activation: 'relu'
activation_action: 'sigmoid'
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.
temperature: 0.1  # Softmax temperature; sets the exploration rate of the Boltzmann policy.

seed: 1
parallels: 16
buffer_size: 3200
n_epochs: 10
n_minibatch: 1
learning_rate: 0.0007  # learning rate
weight_decay: 0

vf_coef: 0.5
ent_coef: 0.01
clip_range: 0.2  # The clip range for ratio.
gamma: 0.99  # discount factor

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_global_state: False  # Whether to use the global state in place of merged observations.
use_value_clip: True  # Limit the value range.
value_clip_range: 0.2  # The value clip range.
use_value_norm: True  # Whether to normalize value targets with a running mean and std.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0  # The threshold at which to change between delta-scaled L1 and L2 loss. (For huber loss).
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.95  # The GAE lambda.
use_grad_clip: True  # Whether to clip gradients by norm.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 10000000  # The total running steps.
eval_interval: 100000  # The number of training steps between two evaluations.
test_episode: 5  # The episodes to test in each test period.

log_dir: "logs/mfac/"
model_dir: "models/mfac/"

agent: "RANDOM"
learner: "RANDOM"
env_name: "mpe"  # Name of the environment.
env_id: "simple_spread_v3"
env_seed: 1  # The random seed of the environment.
runner: "MARL"  # Runner

model_dir: ""
log_dir: ""

agent: "RANDOM"
env_name: "mpe"  # Name of the environment.
env_id: "simple_push_v3"
env_seed: 1  # The random seed of the environment.
learner: "RANDOM"
runner: "RunnerCompetition"

model_dir: ""
log_dir: ""

agent: "RANDOM"
env_name: "mpe"  # Name of the environment.
env_id: "simple_adversary_v3"
env_seed: 1  # The random seed of the environment.
learner: "RANDOM"
runner: "RunnerCompetition"

model_dir: ""
log_dir: ""