Custom Environments: Multi-Agent

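In XuanCe, users can also create and run their own customized multi-agent environments. Beyond the built-in environments, you can build complex interactive scenarios that involve multiple agents.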

Step 1: Create a New Multi-Agent Environment Class

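First, prepare an original environment, that is, a partially observable Markov decision process (POMDP). Then define a new environment class that inherits from XuanCe's base class RawMultiAgentEnv.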

Here is an example:

import numpy as np
from gymnasium.spaces import Box
from xuance.environment import RawMultiAgentEnv

class MyNewMultiAgentEnv(RawMultiAgentEnv):
    def __init__(self, env_config):
        super(MyNewMultiAgentEnv, self).__init__()
        self.env_id = env_config.env_id
        self.num_agents = 3
        self.agents = [f"agent_{i}" for i in range(self.num_agents)]
        self.state_space = Box(-np.inf, np.inf, shape=[8, ])
        self.observation_space = {agent: Box(-np.inf, np.inf, shape=[8, ]) for agent in self.agents}
        self.action_space = {agent: Box(-np.inf, np.inf, shape=[2, ]) for agent in self.agents}
        self.max_episode_steps = 25
        self._current_step = 0

    def get_env_info(self):
        return {'state_space': self.state_space,
                'observation_space': self.observation_space,
                'action_space': self.action_space,
                'agents': self.agents,
                'num_agents': self.num_agents,
                'max_episode_steps': self.max_episode_steps}

    def avail_actions(self):
        return None

    def agent_mask(self):
        """Returns boolean mask variables indicating which agents are currently alive."""
        return {agent: True for agent in self.agents}

    def state(self):
        """Returns the global state of the environment."""
        return self.state_space.sample()

    def reset(self):
        observation = {agent: self.observation_space[agent].sample() for agent in self.agents}
        info = {}
        self._current_step = 0
        return observation, info

    def step(self, action_dict):
        self._current_step += 1
        observation = {agent: self.observation_space[agent].sample() for agent in self.agents}
        rewards = {agent: np.random.random() for agent in self.agents}
        terminated = {agent: False for agent in self.agents}
        truncated = self._current_step >= self.max_episode_steps  # Truncate once the step limit is reached.
        info = {}
        return observation, rewards, terminated, truncated, info

    def render(self, *args, **kwargs):
        return np.ones([64, 64, 3])  # Placeholder RGB frame: (height, width, channels).

    def close(self):
        return
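
The interface above is dictionary-based: reset returns per-agent observations plus an info dict, and step returns per-agent observations, rewards, and termination flags together with a single truncation flag. The following is a minimal sketch of a rollout loop that exercises that contract. To keep it runnable without XuanCe installed, it uses a hypothetical stand-in class, StubMultiAgentEnv, and hypothetical helpers rollout and random_policy; none of these are part of XuanCe.

```python
import random


class StubMultiAgentEnv:
    """Hypothetical stand-in that mimics the dict-based interface shown above."""

    def __init__(self, num_agents=3, max_episode_steps=25):
        self.agents = [f"agent_{i}" for i in range(num_agents)]
        self.max_episode_steps = max_episode_steps
        self._current_step = 0

    def reset(self):
        self._current_step = 0
        observation = {agent: [0.0] * 8 for agent in self.agents}
        return observation, {}

    def step(self, action_dict):
        self._current_step += 1
        observation = {agent: [random.random() for _ in range(8)] for agent in self.agents}
        rewards = {agent: random.random() for agent in self.agents}
        terminated = {agent: False for agent in self.agents}
        truncated = self._current_step >= self.max_episode_steps
        return observation, rewards, terminated, truncated, {}


def rollout(env, policy):
    """Run one episode and return (number of steps, per-agent return)."""
    observation, _ = env.reset()
    returns = {agent: 0.0 for agent in env.agents}
    steps = 0
    while True:
        # One action per agent, keyed by agent name.
        actions = {agent: policy(observation[agent]) for agent in env.agents}
        observation, rewards, terminated, truncated, _ = env.step(actions)
        steps += 1
        for agent in env.agents:
            returns[agent] += rewards[agent]
        # Stop on the step limit or when every agent has terminated.
        if truncated or all(terminated.values()):
            return steps, returns


def random_policy(obs):
    """Sample a 2-dimensional continuous action, ignoring the observation."""
    return [random.uniform(-1.0, 1.0) for _ in range(2)]


steps, returns = rollout(StubMultiAgentEnv(), random_policy)
print(steps, sorted(returns))
```

The same loop should apply to MyNewMultiAgentEnv, provided it is constructed with a config object that has an env_id attribute; a quick rollout like this is a cheap way to catch shape or key mismatches before training.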

Step 2: Create the Configuration File and Read It

Next, create a YAML configuration file by following Step 1 in "Further Usage".

Here is an example configuration for the IPPO algorithm, saved as ippo_new_configs.yaml.

dl_toolbox: "torch"  # The deep learning toolbox. Choices: "torch", "mindspore", "tensorlayer"
project_name: "XuanCe_Benchmark"
logger: "tensorboard"  # Choices: tensorboard, wandb.
wandb_user_name: "your_user_name"
render: True
render_mode: 'rgb_array' # Choices: 'human', 'rgb_array'.
fps: 15
test_mode: False
device: "cpu"  # The computing device. PyTorch: "cpu", "cuda:0"; TensorFlow: "cpu"/"CPU", "gpu"/"GPU"; MindSpore: "CPU", "GPU", "Ascend", "Davinci".
distributed_training: False  # Whether to use multi-GPU for distributed training.
master_port: '12355'  # The master port for the current experiment when using distributed training.

agent: "IPPO"
env_name: "MyNewMultiAgentEnv"
env_id: "new_env_id"
env_seed: 1
continuous_action: True  # Continuous action space or not.
learner: "IPPO_Learner"  # The learner name.
policy: "Gaussian_MAAC_Policy"
representation: "Basic_MLP"
vectorize: "DummyVecMultiAgentEnv"

# recurrent settings for Basic_RNN representation.
use_rnn: False  # Whether to use a recurrent neural network as the representation. (The representation should be "Basic_RNN".)
rnn: "GRU"  # The type of recurrent layer.
fc_hidden_sizes: [64, 64, 64]  # The hidden size of feed forward layer in RNN representation.
recurrent_hidden_size: 64  # The hidden size of the recurrent layer.
N_recurrent_layers: 1  # The number of recurrent layers.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01  # Gain value for network initialization.

# settings for the Basic_MLP representation and the actor-critic networks.
representation_hidden_size: [64, ]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [64, ]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
activation_action: "sigmoid"  # The activation function for the last layer of the actor.
use_parameter_sharing: True  # Whether to share parameters across all agents' policies.
use_actions_mask: False  # Whether to mask unavailable actions.

seed: 1  # Random seed.
parallels: 16  # The number of environments to run in parallel.
buffer_size: 3200  # Number of the transitions (use_rnn is False), or the episodes (use_rnn is True) in replay buffer.
n_epochs: 10  # Number of epochs to train.
n_minibatch: 1 # Number of minibatch to sample and train.  batch_size = buffer_size // n_minibatch.
learning_rate: 0.0007  # Learning rate.
weight_decay: 0  # The weight decay (L2 regularization) coefficient for the optimizer.

vf_coef: 0.5  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
target_kl: 0.25  # For MAPPO_KL learner.
clip_range: 0.2  # The clip range for ratio in MAPPO_Clip learner.
gamma: 0.99  # Discount factor.

# tricks
use_linear_lr_decay: False  # Whether to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_global_state: False  # Whether to use the global state in place of merged observations.
use_value_clip: True  # Limit the value range.
value_clip_range: 0.2  # The value clip range.
use_value_norm: True  # Use running mean and std to normalize value targets.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0  # The threshold at which to change between delta-scaled L1 and L2 loss. (For huber loss).
use_advnorm: True  # Whether to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.95  # The GAE lambda.
use_grad_clip: True  # Whether to clip gradients.
grad_clip_norm: 10.0  # The max norm of the gradient.
clip_type: 1  # Gradient clip for MindSpore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().

running_steps: 100000  # The total running steps.
eval_interval: 10000  # The number of training steps between two evaluations.
test_episode: 5  # The episodes to test in each test period.

log_dir: "./logs/ippo/"
model_dir: "./models/ippo/"

Then, read the configuration file:

import argparse
from xuance.common import get_configs
configs_dict = get_configs(file_dir="ippo_new_configs.yaml")
configs = argparse.Namespace(**configs_dict)
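
Wrapping the loaded dictionary in argparse.Namespace turns every YAML key into an attribute, which is why later steps can write configs.env_name or configs.parallels. The sketch below illustrates this with a hypothetical, hand-written subset of the configuration instead of calling get_configs, so it runs without XuanCe or the YAML file; it also works through the batch-size formula given in the n_minibatch comment.

```python
import argparse

# Hypothetical subset of the dictionary loaded from ippo_new_configs.yaml.
configs_dict = {
    "agent": "IPPO",
    "env_name": "MyNewMultiAgentEnv",
    "env_id": "new_env_id",
    "parallels": 16,
    "buffer_size": 3200,
    "n_minibatch": 1,
}

# Unpack the dictionary so that each key becomes an attribute.
configs = argparse.Namespace(**configs_dict)
print(configs.env_name, configs.parallels)

# The formula from the config comment: batch_size = buffer_size // n_minibatch.
batch_size = configs.buffer_size // configs.n_minibatch
print(batch_size)

# vars() recovers the plain dictionary, which is handy for logging.
print(vars(configs) == configs_dict)
```

A misspelled key in the YAML file surfaces later as an AttributeError on the Namespace, so printing vars(configs) once after loading is a simple way to confirm what was actually read.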

Step 3: Add the Environment to the Registry

After defining the new environment class, register it by adding it to REGISTRY_MULTI_AGENT_ENV.

from xuance.environment import REGISTRY_MULTI_AGENT_ENV
REGISTRY_MULTI_AGENT_ENV[configs.env_name] = MyNewMultiAgentEnv
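
The registration line above follows a common registry pattern: a dictionary maps a name from the configuration to a class, and a factory later looks the class up by that name. The sketch below illustrates the general idea only. REGISTRY, DemoEnv, and make_env are hypothetical names introduced here, and the code is not a description of how XuanCe's make_envs is implemented.

```python
from types import SimpleNamespace

# Hypothetical registry that plays the role of REGISTRY_MULTI_AGENT_ENV.
REGISTRY = {}


class DemoEnv:
    """Hypothetical environment class that only stores its id."""

    def __init__(self, env_config):
        self.env_id = env_config.env_id


def make_env(config):
    """Hypothetical factory: look up the class by config.env_name and build it."""
    try:
        env_cls = REGISTRY[config.env_name]
    except KeyError:
        raise KeyError(f"Environment '{config.env_name}' is not registered.") from None
    return env_cls(config)


config = SimpleNamespace(env_name="DemoEnv", env_id="demo_id")
REGISTRY[config.env_name] = DemoEnv  # Register under the name used in the config.

env = make_env(config)
print(type(env).__name__, env.env_id)
```

The practical consequence is that the key used for registration must match env_name in the configuration file exactly, and registration has to happen before the environments are created.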

Step 4: Create Your Environment and Run It in XuanCe

You can now create your environment and run it directly with the algorithms provided by XuanCe.

Here is an example using the IPPO algorithm:

from xuance.environment import make_envs
from xuance.torch.agents import IPPO_Agents

envs = make_envs(configs)  # Make parallel environments.
Agent = IPPO_Agents(config=configs, envs=envs)  # Create an IPPO agent from XuanCe.
Agent.train(configs.running_steps // configs.parallels)  # Train the model for numerous steps.
Agent.save_model("final_train_model.pth")  # Save the model to model_dir.
Agent.finish()  # Finish the training.
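
The train call above passes running_steps // parallels instead of running_steps. A plausible reading, stated here as an assumption and not taken from the XuanCe source, is that each training step advances all parallel environments at once, so dividing by parallels keeps the total number of collected transitions close to running_steps. The sketch below works through that arithmetic with a hypothetical helper, vectorized_steps.

```python
def vectorized_steps(running_steps, parallels):
    """Hypothetical helper: split a transition budget across parallel environments.

    Returns the number of vectorized steps and the transitions left unused
    because of integer division.
    """
    steps, leftover = divmod(running_steps, parallels)
    return steps, leftover


# Values from the configuration file above.
steps, leftover = vectorized_steps(100000, 16)
print(steps, leftover)

# Assumption: one vectorized step yields one transition per environment.
print(steps * 16)

# A parallels value that does not divide the budget leaves a small remainder.
print(vectorized_steps(100000, 12))
```

With the values in the example configuration the division is exact, so no part of the step budget is dropped.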

Full Code

The complete code for the steps above is available at: https://github.com/agi-brain/xuance/blob/master/examples/new_environments/ippo_new_env.py