memory_tools_marl

class xuance.common.memory_tools_marl.BaseBuffer(agent_keys: List[str], state_space: Dict[str, gymnasium.spaces.Space] | None = None, observation_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, action_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, num_envs: int = 1, buffer_size: int = 1)[source]

Bases: ABC

Basic buffer for MARL algorithms.

Parameters:

agent_keys (List[str]) – Keys that identify each agent.
state_space (Dict[str, Space]) – Global state space (of type Discrete or Box).
observation_space (Dict[str, Dict[str, Space]]) – Observation space for one agent.
action_space (Dict[str, Dict[str, Space]]) – Action space for one agent.
num_envs (int) – Number of parallel environments.
buffer_size (int) – Buffer size of total experience data.

abstract clear(*args)[source]

abstract finish_path(*args, **kwargs)[source]

property full

abstract sample(*args)[source]

abstract store(*args, **kwargs)[source]

class xuance.common.memory_tools_marl.IC3Net_OnPolicyBuffer_RNN(agent_keys: List[str], state_space: Dict[str, gymnasium.spaces.Space] | None = None, obs_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, act_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, n_envs: int = 1, buffer_size: int = 1, max_episode_steps: int = 1, use_gae: bool | None = False, use_advnorm: bool | None = False, gamma: float | None = None, gae_lam: float | None = None, **kwargs)[source]

Bases: MARL_OnPolicyBuffer_RNN

clear()[source]

Clears all buffer data in the on-policy replay buffer.

This method resets all stored observations, actions, rewards, values, and other related fields to zero.

Parameters: None
Returns: None

clear_episodes()[source]

class xuance.common.memory_tools_marl.MARL_OffPolicyBuffer(agent_keys: List[str], state_space: Dict[str, gymnasium.spaces.Space] | None = None, obs_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, act_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, n_envs: int = 1, buffer_size: int = 1, batch_size: int = 1, **kwargs)[source]

Bases: BaseBuffer

Replay buffer for off-policy MARL algorithms.

Parameters:

agent_keys (List[str]) – Keys that identify each agent.
state_space (Dict[str, Space]) – Global state space (of type Discrete or Box).
obs_space (Dict[str, Dict[str, Space]]) – Observation space for one agent (agents in the same group are assumed to share one observation space).
act_space (Dict[str, Dict[str, Space]]) – Action space for one agent (agents in the same group are assumed to share one action space).
n_envs (int) – Number of parallel environments.
buffer_size (int) – Buffer size of total experience data.
batch_size (int) – Batch size of transition data for a sample.
**kwargs – Other arguments.

Examples

>>> import numpy as np
>>> from gymnasium.spaces import Box
>>> from xuance.common.memory_tools_marl import MARL_OffPolicyBuffer
>>> state_space = None
>>> obs_space = {'agent_0': Box(-np.inf, np.inf, (18,), np.float32),
...              'agent_1': Box(-np.inf, np.inf, (18,), np.float32),
...              'agent_2': Box(-np.inf, np.inf, (18,), np.float32)}
>>> act_space = {'agent_0': Box(0.0, 1.0, (5,), np.float32),
...              'agent_1': Box(0.0, 1.0, (5,), np.float32),
...              'agent_2': Box(0.0, 1.0, (5,), np.float32)}
>>> n_envs = 50
>>> buffer_size = 10000
>>> batch_size = 256
>>> agent_keys = ['agent_0', 'agent_1', 'agent_2']
>>> memory = MARL_OffPolicyBuffer(agent_keys=agent_keys, state_space=state_space, obs_space=obs_space,
...                               act_space=act_space, n_envs=n_envs, buffer_size=buffer_size,
...                               batch_size=batch_size)

clear()[source]

Clears the memory data in the replay buffer.

Examples

The following example shows the data shape:

# (n_envs=50, buffer_size=10000, agent_keys=['agent_0', 'agent_1', 'agent_2'])
self.data = {
    'obs': {
        'agent_0': shape=[50, 200, 18],
        'agent_1': shape=[50, 200, 18],
        'agent_2': shape=[50, 200, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[50, 200, 5],
        'agent_1': shape=[50, 200, 5],
        'agent_2': shape=[50, 200, 5],
    },  # dim_act: 5
    ...
}
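The shapes above can be reproduced with a short NumPy sketch. It assumes the per-environment length is `buffer_size // n_envs` (10000 // 50 = 200, matching the example); `allocate_buffer` is a hypothetical helper written for illustration and is not part of xuance.

```python
import numpy as np

def allocate_buffer(agent_keys, dim_obs, dim_act, n_envs, buffer_size):
    """Allocate zero-filled per-agent arrays (hypothetical helper, not xuance API)."""
    n_size = buffer_size // n_envs  # assumed number of steps stored per environment
    return {
        'obs': {k: np.zeros((n_envs, n_size, dim_obs), dtype=np.float32) for k in agent_keys},
        'actions': {k: np.zeros((n_envs, n_size, dim_act), dtype=np.float32) for k in agent_keys},
    }

data = allocate_buffer(['agent_0', 'agent_1', 'agent_2'], 18, 5, n_envs=50, buffer_size=10000)
print(data['obs']['agent_0'].shape)      # (50, 200, 18)
print(data['actions']['agent_0'].shape)  # (50, 200, 5)
```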

finish_path(*args, **kwargs)[source]

sample(batch_size=None)[source]

Samples a batch of data from the replay buffer.

Parameters: batch_size (int) – The size of the batch data to be sampled.
Returns: The sampled data.
Return type: samples_dict (dict)
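One plausible way to sample from arrays laid out as `[n_envs, size, dim]` is to draw random (environment, step) pairs uniformly. The sketch below is only an illustration under that assumption; `sample_transitions` and its arguments are hypothetical names, and the library's own sampling logic may differ.

```python
import numpy as np

def sample_transitions(data, n_envs, size, batch_size, rng):
    """Draw batch_size random (env, step) pairs (hypothetical helper, not xuance API)."""
    env_choices = rng.integers(0, n_envs, size=batch_size)
    step_choices = rng.integers(0, size, size=batch_size)
    return {
        field: {agent: arr[env_choices, step_choices] for agent, arr in per_agent.items()}
        for field, per_agent in data.items()
    }

rng = np.random.default_rng(0)
data = {'obs': {'agent_0': rng.normal(size=(50, 200, 18)).astype(np.float32)},
        'actions': {'agent_0': rng.random(size=(50, 200, 5)).astype(np.float32)}}
batch = sample_transitions(data, n_envs=50, size=200, batch_size=256, rng=rng)
print(batch['obs']['agent_0'].shape)      # (256, 18)
print(batch['actions']['agent_0'].shape)  # (256, 5)
```

Using the same index pair for every field keeps each sampled observation aligned with its action.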

store(**step_data)[source]

Stores a step of data into the replay buffer.
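As a rough illustration of step-wise storage, the sketch below writes each step at a moving pointer and wraps around once the per-environment capacity is reached. `TinyRingBuffer`, `ptr`, `size`, and `n_size` are hypothetical names chosen for this example; this is an assumption about how such a buffer can work, not the xuance implementation.

```python
import numpy as np

class TinyRingBuffer:
    """Hypothetical stand-in for a step-wise replay buffer; not the xuance class."""

    def __init__(self, agent_keys, dim_obs, n_envs, n_size):
        self.n_size = n_size          # steps kept per environment
        self.ptr, self.size = 0, 0    # write position and number of valid steps
        self.obs = {k: np.zeros((n_envs, n_size, dim_obs), dtype=np.float32)
                    for k in agent_keys}

    @property
    def full(self):
        return self.size >= self.n_size

    def store(self, **step_data):
        # Write one step for all environments, then advance the pointer circularly.
        for k, value in step_data['obs'].items():
            self.obs[k][:, self.ptr] = value
        self.ptr = (self.ptr + 1) % self.n_size
        self.size = min(self.size + 1, self.n_size)

buf = TinyRingBuffer(['agent_0'], dim_obs=18, n_envs=2, n_size=3)
for step in range(4):
    buf.store(obs={'agent_0': np.full((2, 18), step, dtype=np.float32)})
print(buf.full, buf.ptr)  # True 1
```

After four writes into a capacity of three, the oldest slot (index 0) has been overwritten by the fourth step.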

class xuance.common.memory_tools_marl.MARL_OffPolicyBuffer_RNN(agent_keys: List[str], state_space: Dict[str, gymnasium.spaces.Space] | None = None, obs_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, act_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, n_envs: int = 1, buffer_size: int = 1, batch_size: int = 1, max_episode_steps: int = 1, **kwargs)[source]

Bases: MARL_OffPolicyBuffer

Replay buffer for off-policy MARL algorithms with DRQN trick.

Parameters:

agent_keys (List[str]) – Keys that identify each agent.
state_space (Dict[str, Space]) – Global state space (of type Discrete or Box).
obs_space (Dict[str, Dict[str, Space]]) – Observation space for one agent (agents in the same group are assumed to share one observation space).
act_space (Dict[str, Dict[str, Space]]) – Action space for one agent (agents in the same group are assumed to share one action space).
n_envs (int) – Number of parallel environments.
buffer_size (int) – Buffer size of total experience data.
batch_size (int) – Batch size of episodes for a sample.
max_episode_steps (int) – The sequence length of each episode data.
**kwargs – Other arguments.

Examples

>>> import numpy as np
>>> from gymnasium.spaces import Box
>>> from xuance.common.memory_tools_marl import MARL_OffPolicyBuffer_RNN
>>> state_space = None
>>> obs_space = {'agent_0': Box(-np.inf, np.inf, (18,), np.float32),
...              'agent_1': Box(-np.inf, np.inf, (18,), np.float32),
...              'agent_2': Box(-np.inf, np.inf, (18,), np.float32)}
>>> act_space = {'agent_0': Box(0.0, 1.0, (5,), np.float32),
...              'agent_1': Box(0.0, 1.0, (5,), np.float32),
...              'agent_2': Box(0.0, 1.0, (5,), np.float32)}
>>> n_envs = 50
>>> buffer_size = 10000
>>> batch_size = 256
>>> agent_keys = ['agent_0', 'agent_1', 'agent_2']
>>> max_episode_steps = 60
>>> memory = MARL_OffPolicyBuffer_RNN(agent_keys=agent_keys, state_space=state_space,
...                                   obs_space=obs_space, act_space=act_space,
...                                   n_envs=n_envs, buffer_size=buffer_size, batch_size=batch_size,
...                                   max_episode_steps=max_episode_steps)

clear()[source]

Clears the memory data in the replay buffer.

Examples

The following example shows the data shape (buffer_size=10000, max_eps_len=60, agent_keys=['agent_0', 'agent_1', 'agent_2']):

self.data = {
    'obs': {
        'agent_0': shape=[10000, 61, 18],
        'agent_1': shape=[10000, 61, 18],
        'agent_2': shape=[10000, 61, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[10000, 60, 5],
        'agent_1': shape=[10000, 60, 5],
        'agent_2': shape=[10000, 60, 5],
    },  # dim_act: 5
    ...
    'filled': shape=[10000, 60],  # Step mask values. True means current step is not terminated.
}
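To make the episode layout concrete, the sketch below allocates arrays with the same trailing dimensions and uses the `filled` mask to average a per-step quantity over valid steps only. The observation array has `max_eps_len + 1` steps, which presumably leaves room for the terminal observation. The episode lengths and the `per_step_error` array are made-up values for illustration.

```python
import numpy as np

max_eps_len, dim_obs, dim_act = 60, 18, 5
n_episodes = 4  # small stand-in for buffer_size=10000

obs = np.zeros((n_episodes, max_eps_len + 1, dim_obs), dtype=np.float32)
actions = np.zeros((n_episodes, max_eps_len, dim_act), dtype=np.float32)
filled = np.zeros((n_episodes, max_eps_len), dtype=bool)

# Pretend episode 0 lasted 25 steps and episode 1 lasted the full 60 steps.
episode_lengths = [25, 60, 0, 0]
for i, length in enumerate(episode_lengths):
    filled[i, :length] = True

# Average a per-step quantity (e.g. a TD error) over valid steps only.
per_step_error = np.ones((n_episodes, max_eps_len), dtype=np.float32)
masked_mean = (per_step_error * filled).sum() / filled.sum()
print(obs.shape, actions.shape, filled.shape)
print(int(filled.sum()), float(masked_mean))  # 85 1.0
```

Masking this way prevents padded steps after an early termination from contributing to a loss.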

clear_episodes()[source]

Clears an episode of data for multiple environments in the replay buffer.

Examples

The following example shows the data shape (n_envs=16, max_eps_len=60, agent_keys=['agent_0', 'agent_1', 'agent_2']):

self.data = {
    'obs': {
        'agent_0': shape=[16, 61, 18],
        'agent_1': shape=[16, 61, 18],
        'agent_2': shape=[16, 61, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[16, 60, 5],
        'agent_1': shape=[16, 60, 5],
        'agent_2': shape=[16, 60, 5],
    },  # dim_act: 5
    ...
    'filled': shape=[16, 60],  # Step mask values. True means current step is not terminated.
}

finish_path(i_env, **terminal_data)[source]

Handles the terminal states, including storing the terminal observations, avail_actions, and other terminal data.

Parameters:

i_env (int) – The index of the environment.
terminal_data (dict) – The terminal states.

sample(batch_size=None)[source]

Samples a batch of data for model training.

Parameters: batch_size (int) – The size of the data batch; defaults to self.batch_size (recommended).
Returns: A dict of sampled data.
Return type: samples_dict (dict)

store(**step_data)[source]

Stores a step of data for each environment.

Parameters: step_data (dict) – A dict of step data to be stored into self.episode_data.

store_episodes(i_env)[source]

Stores the episode data of the i-th environment into self.data.

Parameters: i_env (int) – The index of the environment.

class xuance.common.memory_tools_marl.MARL_OnPolicyBuffer(agent_keys: List[str], state_space: Dict[str, gymnasium.spaces.Space] | None = None, obs_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, act_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, n_envs: int = 1, buffer_size: int = 1, use_gae: bool | None = False, use_advnorm: bool | None = False, gamma: float | None = None, gae_lam: float | None = None, **kwargs)[source]

Bases: BaseBuffer

Replay buffer for on-policy MARL algorithms.

Parameters:

agent_keys (List[str]) – Keys that identify each agent.
state_space (Dict[str, Space]) – Global state space (of type Discrete or Box).
obs_space (Dict[str, Dict[str, Space]]) – Observation space for one agent (agents in the same group are assumed to share one observation space).
act_space (Dict[str, Dict[str, Space]]) – Action space for one agent (agents in the same group are assumed to share one action space).
n_envs (int) – Number of parallel environments.
buffer_size (int) – Buffer size of total experience data.
use_gae (bool) – Whether to use the GAE (generalized advantage estimation) trick.
use_advnorm (bool) – Whether to use the advantage normalization trick.
gamma (float) – Discount factor.
gae_lam (float) – The GAE lambda parameter.
**kwargs – Other arguments.

Examples

>>> import numpy as np
>>> from gymnasium.spaces import Box
>>> from xuance.common.memory_tools_marl import MARL_OnPolicyBuffer
>>> state_space = None
>>> obs_space = {'agent_0': Box(-np.inf, np.inf, (18,), np.float32),
...              'agent_1': Box(-np.inf, np.inf, (18,), np.float32),
...              'agent_2': Box(-np.inf, np.inf, (18,), np.float32)}
>>> act_space = {'agent_0': Box(0.0, 1.0, (5,), np.float32),
...              'agent_1': Box(0.0, 1.0, (5,), np.float32),
...              'agent_2': Box(0.0, 1.0, (5,), np.float32)}
>>> n_envs = 16
>>> buffer_size = 1600
>>> agent_keys = ['agent_0', 'agent_1', 'agent_2']
>>> memory = MARL_OnPolicyBuffer(agent_keys=agent_keys, state_space=state_space, obs_space=obs_space,
...                              act_space=act_space, n_envs=n_envs, buffer_size=buffer_size,
...                              use_gae=False, use_advnorm=False, gamma=0.99, gae_lam=0.95)

clear()[source]

Clears the memory data in the replay buffer.

Examples

The following example shows the data shape (n_envs=16, buffer_size=1600, agent_keys=['agent_0', 'agent_1', 'agent_2']):

self.data = {
    'obs': {
        'agent_0': shape=[16, 100, 18],
        'agent_1': shape=[16, 100, 18],
        'agent_2': shape=[16, 100, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[16, 100, 5],
        'agent_1': shape=[16, 100, 5],
        'agent_2': shape=[16, 100, 5],
    },  # dim_act: 5
    ...
}

finish_path(i_env: int | None = None, value_next: dict | None = None, value_normalizer=None)[source]

Calculates and stores the returns and advantages when an episode is finished.

Parameters:

i_env (int) – The index of the environment.
value_next (dict) – The critic values of the terminal state.
value_normalizer – The value normalizer method; defaults to None.
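For orientation, here is a minimal sketch of the standard GAE recursion for a single agent and a single trajectory. It is a textbook formulation under stated assumptions, not the library's code: `compute_gae` and its arguments are hypothetical names, and the actual method may additionally handle value normalization and the `use_gae=False` case.

```python
import numpy as np

def compute_gae(rewards, values, value_next, terminals, gamma=0.99, gae_lam=0.95):
    """Return (returns, advantages) for one trajectory using standard GAE."""
    n = len(rewards)
    advantages = np.zeros(n, dtype=np.float64)
    values_ext = np.append(values, value_next)  # bootstrap value after the last step
    last_gae = 0.0
    for t in reversed(range(n)):
        not_done = 1.0 - terminals[t]
        # One-step TD error, cut off at terminal steps.
        delta = rewards[t] + gamma * values_ext[t + 1] * not_done - values_ext[t]
        last_gae = delta + gamma * gae_lam * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values_ext[:-1]
    return returns, advantages

rewards = np.array([1.0, 1.0, 1.0])
values = np.zeros(3)
terminals = np.zeros(3)
returns, advantages = compute_gae(rewards, values, 0.0, terminals, gamma=1.0, gae_lam=1.0)
print(returns)  # [3. 2. 1.]
```

With `gamma=1.0`, `gae_lam=1.0`, and zero value estimates, the returns reduce to the undiscounted reward-to-go, which makes the result easy to check by hand.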

sample(indexes: ndarray | None = None)[source]

Samples a batch of data from the replay buffer.

Parameters: indexes (ndarray) – The indexes of the data in the buffer that will be sampled.
Returns: The sampled data.
Return type: samples_dict (dict)
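A common on-policy pattern is to shuffle all stored indexes once per epoch and pass one slice at a time as `indexes`. The sketch below illustrates the idea on plain NumPy arrays, assuming the `[n_envs, n_size, dim]` data is flattened over its first two axes before indexing; `sample_by_indexes` is a hypothetical helper, not the library method.

```python
import numpy as np

def sample_by_indexes(data, indexes):
    """Gather rows of the flattened (env, step) axis (hypothetical helper)."""
    out = {}
    for field, per_agent in data.items():
        out[field] = {}
        for agent, arr in per_agent.items():
            flat = arr.reshape(-1, arr.shape[-1])  # [n_envs * n_size, dim]
            out[field][agent] = flat[indexes]
    return out

n_envs, n_size = 16, 100
data = {'obs': {'agent_0': np.zeros((n_envs, n_size, 18), dtype=np.float32)},
        'actions': {'agent_0': np.zeros((n_envs, n_size, 5), dtype=np.float32)}}

rng = np.random.default_rng(0)
shuffled = rng.permutation(n_envs * n_size)
minibatches = np.array_split(shuffled, 4)  # four minibatches of 400 indexes each
batch = sample_by_indexes(data, minibatches[0])
print(batch['obs']['agent_0'].shape)  # (400, 18)
```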

store(**step_data)[source]

Stores a step of data into the replay buffer.

class xuance.common.memory_tools_marl.MARL_OnPolicyBuffer_RNN(agent_keys: List[str], state_space: Dict[str, gymnasium.spaces.Space] | None = None, obs_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, act_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, n_envs: int = 1, buffer_size: int = 1, max_episode_steps: int = 1, use_gae: bool | None = False, use_advnorm: bool | None = False, gamma: float | None = None, gae_lam: float | None = None, **kwargs)[source]

Bases: MARL_OnPolicyBuffer

Replay buffer for on-policy MARL algorithms with DRQN trick.

Parameters:

agent_keys (List[str]) – Keys that identify each agent.
state_space (Dict[str, Space]) – Global state space (of type Discrete or Box).
obs_space (Dict[str, Dict[str, Space]]) – Observation space for one agent (agents in the same group are assumed to share one observation space).
act_space (Dict[str, Dict[str, Space]]) – Action space for one agent (agents in the same group are assumed to share one action space).
n_envs (int) – Number of parallel environments.
buffer_size (int) – Buffer size of total experience data.
max_episode_steps (int) – The sequence length of each episode data.
use_gae (bool) – Whether to use the GAE (generalized advantage estimation) trick.
use_advnorm (bool) – Whether to use the advantage normalization trick.
gamma (float) – Discount factor.
gae_lam (float) – The GAE lambda parameter.
**kwargs – Other arguments.

Examples

>>> import numpy as np
>>> from gymnasium.spaces import Box
>>> from xuance.common.memory_tools_marl import MARL_OnPolicyBuffer_RNN
>>> state_space = None
>>> obs_space = {'agent_0': Box(-np.inf, np.inf, (18,), np.float32),
...              'agent_1': Box(-np.inf, np.inf, (18,), np.float32),
...              'agent_2': Box(-np.inf, np.inf, (18,), np.float32)}
>>> act_space = {'agent_0': Box(0.0, 1.0, (5,), np.float32),
...              'agent_1': Box(0.0, 1.0, (5,), np.float32),
...              'agent_2': Box(0.0, 1.0, (5,), np.float32)}
>>> n_envs = 16
>>> buffer_size = 1600
>>> agent_keys = ['agent_0', 'agent_1', 'agent_2']
>>> max_episode_steps = 100
>>> memory = MARL_OnPolicyBuffer_RNN(agent_keys=agent_keys, state_space=state_space, obs_space=obs_space,
...                                  act_space=act_space, n_envs=n_envs, buffer_size=buffer_size,
...                                  max_episode_steps=max_episode_steps,
...                                  use_gae=False, use_advnorm=False, gamma=0.99, gae_lam=0.95)

clear()[source]

Clears all buffer data in the on-policy replay buffer.

This method resets all stored observations, actions, rewards, values, and other related fields to zero.

Parameters: None
Returns: None

clear_episodes()[source]

finish_path(i_env: int | None = None, i_step: int | None = None, value_next: dict | None = None, value_normalizer: dict | None = None)[source]

Calculates and stores the returns and advantages when an episode is finished.

Parameters:

i_env (int) – The index of the environment.
i_step (int) – The index of the step for the current environment.
value_next (Optional[dict]) – The critic values of the terminal state.
value_normalizer (Optional[dict]) – The value normalizer method; defaults to None.

property full

sample(indexes: ndarray | None = None)[source]

Samples a batch of data from the replay buffer.

Parameters: indexes (ndarray) – The indexes of the data in the buffer that will be sampled.
Returns: The sampled data.
Return type: samples_dict (dict)

store(**step_data)[source]

Stores a step of data for each environment.

Parameters: step_data (dict) – A dict of step data to be stored into self.episode_data.

store_episodes(i_env)[source]

Stores the episode data of the i-th environment into self.data.

Parameters: i_env (int) – The index of the environment.

class xuance.common.memory_tools_marl.MeanField_OffPolicyBuffer(agent_keys: List[str], state_space: Dict[str, gymnasium.spaces.Space] | None = None, obs_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, act_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, n_envs: int = 1, buffer_size: int = 1, batch_size: int = 1, **kwargs)[source]

Bases: MARL_OffPolicyBuffer

Replay buffer for off-policy Mean-Field MARL algorithms (Mean-Field Q-Learning).

Parameters:

agent_keys (List[str]) – Keys that identify each agent.
state_space (Dict[str, Space]) – Global state space (of type Discrete or Box).
obs_space (Dict[str, Dict[str, Space]]) – Observation space for one agent (agents in the same group are assumed to share one observation space).
act_space (Dict[str, Dict[str, Space]]) – Action space for one agent (agents in the same group are assumed to share one action space).
n_envs (int) – Number of parallel environments.
buffer_size (int) – Buffer size of total experience data.
batch_size (int) – Batch size of transition data for a sample.
**kwargs – Other arguments.

Examples

>>> import numpy as np
>>> from gymnasium.spaces import Box
>>> from xuance.common.memory_tools_marl import MeanField_OffPolicyBuffer
>>> state_space = None
>>> obs_space = {'agent_0': Box(-np.inf, np.inf, (18,), np.float32),
...              'agent_1': Box(-np.inf, np.inf, (18,), np.float32),
...              'agent_2': Box(-np.inf, np.inf, (18,), np.float32)}
>>> act_space = {'agent_0': Box(0.0, 1.0, (5,), np.float32),
...              'agent_1': Box(0.0, 1.0, (5,), np.float32),
...              'agent_2': Box(0.0, 1.0, (5,), np.float32)}
>>> n_envs = 50
>>> buffer_size = 10000
>>> batch_size = 256
>>> agent_keys = ['agent_0', 'agent_1', 'agent_2']
>>> memory = MeanField_OffPolicyBuffer(agent_keys=agent_keys, state_space=state_space, obs_space=obs_space,
...                                    act_space=act_space, n_envs=n_envs, buffer_size=buffer_size,
...                                    batch_size=batch_size)

clear()[source]

Clears the memory data in the replay buffer.

Examples

The following example shows the data shape:

# (n_envs=50, buffer_size=10000, agent_keys=['agent_0', 'agent_1', 'agent_2'])
self.data = {
    'obs': {
        'agent_0': shape=[50, 200, 18],
        'agent_1': shape=[50, 200, 18],
        'agent_2': shape=[50, 200, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[50, 200, 5],
        'agent_1': shape=[50, 200, 5],
        'agent_2': shape=[50, 200, 5],
    },  # dim_act: 5
    ...
}

class xuance.common.memory_tools_marl.MeanField_OffPolicyBuffer_RNN(*args, **kwargs)[source]

Bases: MARL_OffPolicyBuffer_RNN

clear()[source]

Clears all buffer data in the replay buffer.

This method resets all stored observations, actions, rewards, values, and other related fields to zero.

Parameters: None
Returns: None

clear_episodes()[source]

Clears an episode of data for multiple environments in the replay buffer.

Examples

The following example shows the data shape (n_envs=16, max_eps_len=60, agent_keys=['agent_0', 'agent_1', 'agent_2']):

self.data = {
    'obs': {
        'agent_0': shape=[16, 61, 18],
        'agent_1': shape=[16, 61, 18],
        'agent_2': shape=[16, 61, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[16, 60, 5],
        'agent_1': shape=[16, 60, 5],
        'agent_2': shape=[16, 60, 5],
    },  # dim_act: 5
    ...
    'filled': shape=[16, 60],  # Step mask values. True means current step is not terminated.
}

finish_path(i_env, **terminal_data)[source]

Handles the terminal states, including storing the terminal observations, avail_actions, and other terminal data.

Parameters:

i_env (int) – The index of the environment.
terminal_data (dict) – The terminal states.

class xuance.common.memory_tools_marl.MeanField_OnPolicyBuffer(*args, **kwargs)[source]

Bases: MARL_OnPolicyBuffer

Replay buffer for Mean Field Actor-Critic algorithm.

clear()[source]

Clears the memory data in the replay buffer.

Examples

The following example shows the data shape (n_envs=16, buffer_size=1600, agent_keys=['agent_0', 'agent_1', 'agent_2']):

self.data = {
    'obs': {
        'agent_0': shape=[16, 100, 18],
        'agent_1': shape=[16, 100, 18],
        'agent_2': shape=[16, 100, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[16, 100, 5],
        'agent_1': shape=[16, 100, 5],
        'agent_2': shape=[16, 100, 5],
    },  # dim_act: 5
    ...
}

class xuance.common.memory_tools_marl.MeanField_OnPolicyBuffer_RNN(*args, **kwargs)[source]

Bases: MARL_OnPolicyBuffer_RNN

clear()[source]

Clears all buffer data in the on-policy replay buffer.

This method resets all stored observations, actions, rewards, values, and other related fields to zero.

Parameters: None
Returns: None

clear_episodes()[source]