memory_tools_marl

class xuance.common.memory_tools_marl.BaseBuffer(*args)[source]

Bases: ABC

Basic buffer for MARL algorithms.

abstract clear(*args)[source]
abstract finish_path(*args, **kwargs)[source]
property full
abstract sample(*args)[source]
abstract store(*args, **kwargs)[source]
class xuance.common.memory_tools_marl.IC3Net_OnPolicyBuffer_RNN(agent_keys: List[str], state_space: Dict[str, gymnasium.spaces.Space] | None = None, obs_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, act_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, n_envs: int = 1, buffer_size: int = 1, max_episode_steps: int = 1, use_gae: bool | None = False, use_advnorm: bool | None = False, gamma: float | None = None, gae_lam: float | None = None, **kwargs)[source]

Bases: MARL_OnPolicyBuffer_RNN

clear()[source]

Clear all buffer data in the on-policy replay buffer.

This method resets all stored observations, actions, rewards, values, and other related fields to zero.

Parameters:

None

Returns:

None

clear_episodes()[source]
class xuance.common.memory_tools_marl.MARL_OffPolicyBuffer(agent_keys: List[str], state_space: Dict[str, gymnasium.spaces.Space] | None = None, obs_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, act_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, n_envs: int = 1, buffer_size: int = 1, batch_size: int = 1, **kwargs)[source]

Bases: BaseBuffer

Replay buffer for off-policy MARL algorithms.

Parameters:
  • agent_keys (List[str]) – Keys that identify each agent.

  • state_space (Dict[str, Space]) – Global state space (Discrete or Box).

  • obs_space (Dict[str, Dict[str, Space]]) – Observation space of each agent (agents in the same group are assumed to share one observation space).

  • act_space (Dict[str, Dict[str, Space]]) – Action space of each agent (agents in the same group are assumed to share one action space).

  • n_envs (int) – Number of parallel environments.

  • buffer_size (int) – Buffer size of total experience data.

  • batch_size (int) – Batch size of transition data for a sample.

  • **kwargs – Other arguments.

Examples

>>> state_space=None
>>> obs_space={'agent_0': Box(-inf, inf, (18,), float32),
...            'agent_1': Box(-inf, inf, (18,), float32),
...            'agent_2': Box(-inf, inf, (18,), float32)}
>>> act_space={'agent_0': Box(0.0, 1.0, (5,), float32),
...            'agent_1': Box(0.0, 1.0, (5,), float32),
...            'agent_2': Box(0.0, 1.0, (5,), float32)}
>>> n_envs=50
>>> buffer_size=10000
>>> batch_size=256
>>> agent_keys=['agent_0', 'agent_1', 'agent_2']
>>> memory = MARL_OffPolicyBuffer(agent_keys=agent_keys, state_space=state_space, obs_space=obs_space,
...                               act_space=act_space, n_envs=n_envs, buffer_size=buffer_size,
...                               batch_size=batch_size)
clear()[source]

Clears the memory data in the replay buffer.

Examples

An example of the data shapes:

# (n_env=50, buffer_size=10000, agent_keys=['agent_0', 'agent_1', 'agent_2'])
self.data = {
    'obs': {
        'agent_0': shape=[50, 200, 18],
        'agent_1': shape=[50, 200, 18],
        'agent_2': shape=[50, 200, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[50, 200, 5],
        'agent_1': shape=[50, 200, 5],
        'agent_2': shape=[50, 200, 5],
    },  # dim_act: 5
    ...
}
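
The shapes above suggest that each environment gets buffer_size // n_envs slots (10000 // 50 = 200). The following is a minimal NumPy sketch of that layout, not the library's own code: the helper name allocate_buffer and its obs_dim/act_dim arguments are hypothetical, and the even split across environments is an assumption read off the example shapes.

```python
import numpy as np

def allocate_buffer(agent_keys, obs_dim, act_dim, n_envs, buffer_size):
    """Hypothetical helper: zero-filled per-agent arrays, one row of slots per environment."""
    n_size = buffer_size // n_envs  # assumption: slots are split evenly across environments
    return {
        "obs": {k: np.zeros((n_envs, n_size, obs_dim), dtype=np.float32) for k in agent_keys},
        "actions": {k: np.zeros((n_envs, n_size, act_dim), dtype=np.float32) for k in agent_keys},
    }

data = allocate_buffer(["agent_0", "agent_1", "agent_2"], obs_dim=18, act_dim=5,
                       n_envs=50, buffer_size=10000)
print(data["obs"]["agent_0"].shape)      # (50, 200, 18)
print(data["actions"]["agent_2"].shape)  # (50, 200, 5)
```

Calling the helper again returns fresh zero arrays, which is one simple way to picture what clearing the buffer does.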
finish_path(*args, **kwargs)[source]
sample(batch_size=None)[source]

Samples a batch of data from the replay buffer.

Parameters:

batch_size (int) – The size of the batch data to be sampled.

Returns:

The sampled data.

Return type:

samples_dict (dict)
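
As an illustration of what sampling a batch of transitions can look like, the sketch below draws random (environment, step) index pairs and gathers them from every per-agent array. It is an assumption-based example, not the actual implementation of sample(): the function sample_transitions, the n_filled argument, and the uniform sampling scheme are all hypothetical.

```python
import numpy as np

def sample_transitions(data, n_envs, n_filled, batch_size, rng):
    """Hypothetical uniform sampler over a dict of {field: {agent: array[n_envs, n_size, dim]}}."""
    env_idx = rng.integers(0, n_envs, size=batch_size)     # which environment each sample comes from
    step_idx = rng.integers(0, n_filled, size=batch_size)  # which stored step within that environment
    return {
        field: {agent: arr[env_idx, step_idx] for agent, arr in per_agent.items()}
        for field, per_agent in data.items()
    }

rng = np.random.default_rng(0)
data = {
    "obs": {"agent_0": np.zeros((50, 200, 18), dtype=np.float32)},
    "actions": {"agent_0": np.zeros((50, 200, 5), dtype=np.float32)},
}
batch = sample_transitions(data, n_envs=50, n_filled=200, batch_size=256, rng=rng)
print(batch["obs"]["agent_0"].shape)      # (256, 18)
print(batch["actions"]["agent_0"].shape)  # (256, 5)
```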

store(**step_data)[source]

Stores a step of data into the replay buffer.

class xuance.common.memory_tools_marl.MARL_OffPolicyBuffer_RNN(agent_keys: List[str], state_space: Dict[str, gymnasium.spaces.Space] | None = None, obs_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, act_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, n_envs: int = 1, buffer_size: int = 1, batch_size: int = 1, max_episode_steps: int = 1, **kwargs)[source]

Bases: MARL_OffPolicyBuffer

Replay buffer for off-policy MARL algorithms with DRQN trick.

Parameters:
  • agent_keys (List[str]) – Keys that identify each agent.

  • state_space (Dict[str, Space]) – Global state space (Discrete or Box).

  • obs_space (Dict[str, Dict[str, Space]]) – Observation space of each agent (agents in the same group are assumed to share one observation space).

  • act_space (Dict[str, Dict[str, Space]]) – Action space of each agent (agents in the same group are assumed to share one action space).

  • n_envs (int) – Number of parallel environments.

  • buffer_size (int) – Buffer size of total experience data.

  • batch_size (int) – Batch size of episodes for a sample.

  • max_episode_steps (int) – The sequence length of each episode data.

  • **kwargs – Other arguments.

Examples

>>> state_space=None
>>> obs_space={'agent_0': Box(-inf, inf, (18,), float32),
...            'agent_1': Box(-inf, inf, (18,), float32),
...            'agent_2': Box(-inf, inf, (18,), float32)}
>>> act_space={'agent_0': Box(0.0, 1.0, (5,), float32),
...            'agent_1': Box(0.0, 1.0, (5,), float32),
...            'agent_2': Box(0.0, 1.0, (5,), float32)}
>>> n_envs=50
>>> buffer_size=10000
>>> batch_size=256
>>> agent_keys=['agent_0', 'agent_1', 'agent_2']
>>> max_episode_steps=60
>>> memory = MARL_OffPolicyBuffer_RNN(agent_keys=agent_keys, state_space=state_space,
...                                   obs_space=obs_space, act_space=act_space,
...                                   n_envs=n_envs, buffer_size=buffer_size, batch_size=batch_size,
...                                   max_episode_steps=max_episode_steps)
clear()[source]

Clears the memory data in the replay buffer.

Examples

An example of the data shapes (buffer_size=10000, max_eps_len=60, agent_keys=['agent_0', 'agent_1', 'agent_2']):

self.data = {
    'obs': {
        'agent_0': shape=[10000, 61, 18],
        'agent_1': shape=[10000, 61, 18],
        'agent_2': shape=[10000, 61, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[10000, 60, 5],
        'agent_1': shape=[10000, 60, 5],
        'agent_2': shape=[10000, 60, 5],
    },  # dim_act: 5
    ...
    'filled': shape=[10000, 60],  # Step mask. True means the current step is not terminated.
}
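
Because episodes are padded to a fixed length, a step mask such as 'filled' is typically used to keep padded steps out of a loss. The sketch below shows one common way to do that with NumPy; the function masked_mean is hypothetical and is not part of the module documented here.

```python
import numpy as np

def masked_mean(per_step_loss, filled):
    """Hypothetical helper: average a [batch, steps] loss over the steps marked True in `filled`."""
    mask = filled.astype(np.float32)
    # Guard against an all-False mask to avoid division by zero.
    return float((per_step_loss * mask).sum() / max(mask.sum(), 1.0))

loss = np.array([[1.0, 2.0, 100.0]], dtype=np.float32)  # last step is padding
filled = np.array([[True, True, False]])
print(masked_mean(loss, filled))  # 1.5
```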
clear_episodes()[source]

Clears an episode of data for multiple environments in the replay buffer.

Examples

An example of the data shapes (n_envs=16, max_eps_len=60, agent_keys=['agent_0', 'agent_1', 'agent_2']):

self.data = {
    'obs': {
        'agent_0': shape=[16, 61, 18],
        'agent_1': shape=[16, 61, 18],
        'agent_2': shape=[16, 61, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[16, 60, 5],
        'agent_1': shape=[16, 60, 5],
        'agent_2': shape=[16, 60, 5],
    },  # dim_act: 5
    ...
    'filled': shape=[16, 60],  # Step mask. True means the current step is not terminated.
}
finish_path(i_env, **terminal_data)[source]

Handles the terminal states, including storing the terminal observations, avail_actions, and other terminal data.

Parameters:
  • i_env (int) – The index of the environment.

  • terminal_data (dict) – The terminal states.

sample(batch_size=None)[source]

Samples a batch of data for model training.

Parameters:

batch_size (int) – The size of the data batch; defaults to self.batch_size (recommended).

Returns:

A dict of sampled data.

Return type:

samples_dict (dict)

store(**step_data)[source]

Stores a step of data for each environment.

Parameters:

step_data (dict) – A dict of step data to be stored into self.episode_data.

store_episodes(i_env)[source]

Stores the episode data of the i-th environment into self.data.

Parameters:

i_env (int) – The index of the environment.
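
To make the idea of moving a finished episode into the main buffer concrete, here is a small ring-buffer sketch. It is only an illustration under stated assumptions: the class EpisodeRing, its ptr/size attributes, and the overwrite-oldest policy are hypothetical and may differ from how store_episodes actually manages self.data.

```python
import numpy as np

class EpisodeRing:
    """Hypothetical episode-level ring buffer: one row per stored episode."""

    def __init__(self, buffer_size, max_steps, obs_dim):
        self.buffer_size = buffer_size
        # Observations keep one extra step for the terminal observation.
        self.obs = np.zeros((buffer_size, max_steps + 1, obs_dim), dtype=np.float32)
        self.filled = np.zeros((buffer_size, max_steps), dtype=bool)
        self.ptr = 0   # next row to write
        self.size = 0  # number of valid episodes

    def store_episode(self, episode_obs, episode_filled):
        self.obs[self.ptr] = episode_obs
        self.filled[self.ptr] = episode_filled
        self.ptr = (self.ptr + 1) % self.buffer_size  # wrap around, overwriting the oldest row
        self.size = min(self.size + 1, self.buffer_size)

ring = EpisodeRing(buffer_size=2, max_steps=60, obs_dim=18)
for value in (1.0, 2.0, 3.0):
    ring.store_episode(np.full((61, 18), value, dtype=np.float32), np.ones(60, dtype=bool))
print(ring.size, ring.ptr)  # 2 1
```

After the third call the first row has been overwritten, so ring.obs[0] holds the episode filled with 3.0 and ring.obs[1] the one filled with 2.0.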

class xuance.common.memory_tools_marl.MARL_OnPolicyBuffer(agent_keys: List[str], state_space: Dict[str, gymnasium.spaces.Space] | None = None, obs_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, act_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, n_envs: int = 1, buffer_size: int = 1, use_gae: bool | None = False, use_advnorm: bool | None = False, gamma: float | None = None, gae_lam: float | None = None, **kwargs)[source]

Bases: BaseBuffer

Replay buffer for on-policy MARL algorithms.

Parameters:
  • agent_keys (List[str]) – Keys that identify each agent.

  • state_space (Dict[str, Space]) – Global state space (Discrete or Box).

  • obs_space (Dict[str, Dict[str, Space]]) – Observation space of each agent (agents in the same group are assumed to share one observation space).

  • act_space (Dict[str, Dict[str, Space]]) – Action space of each agent (agents in the same group are assumed to share one action space).

  • n_envs (int) – Number of parallel environments.

  • buffer_size (int) – Buffer size of total experience data.

  • use_gae (bool) – Whether to use generalized advantage estimation (GAE).

  • use_advnorm (bool) – Whether to normalize the advantages.

  • gamma (float) – Discount factor.

  • gae_lam (float) – The lambda parameter of GAE.

  • **kwargs – Other arguments.

Examples

>>> state_space=None
>>> obs_space={'agent_0': Box(-inf, inf, (18,), float32),
...            'agent_1': Box(-inf, inf, (18,), float32),
...            'agent_2': Box(-inf, inf, (18,), float32)}
>>> act_space={'agent_0': Box(0.0, 1.0, (5,), float32),
...            'agent_1': Box(0.0, 1.0, (5,), float32),
...            'agent_2': Box(0.0, 1.0, (5,), float32)}
>>> n_envs=16
>>> buffer_size=1600
>>> agent_keys=['agent_0', 'agent_1', 'agent_2']
>>> memory = MARL_OnPolicyBuffer(agent_keys=agent_keys, state_space=state_space, obs_space=obs_space,
...                              act_space=act_space, n_envs=n_envs, buffer_size=buffer_size,
...                              use_gae=False, use_advnorm=False, gamma=0.99, gae_lam=0.95)
clear()[source]

Clears the memory data in the replay buffer.

Examples

An example of the data shapes (n_env=16, buffer_size=1600, agent_keys=['agent_0', 'agent_1', 'agent_2']):

self.data = {
    'obs': {
        'agent_0': shape=[16, 100, 18],
        'agent_1': shape=[16, 100, 18],
        'agent_2': shape=[16, 100, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[16, 100, 5],
        'agent_1': shape=[16, 100, 5],
        'agent_2': shape=[16, 100, 5],
    },  # dim_act: 5
    ...
}
finish_path(i_env: int | None = None, value_next: dict | None = None, value_normalizer=None)[source]

Calculates and stores the returns and advantages when an episode is finished.

Parameters:
  • i_env (int) – The index of the environment.

  • value_next (dict) – The critic values of the terminal state.

  • value_normalizer – The value normalizer method, default is None.
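
For readers unfamiliar with how returns and advantages are derived from gamma, gae_lam, and the bootstrap value, the following is a sketch of the standard GAE recursion for a single trajectory. It illustrates the general technique only; the function gae_returns is hypothetical, and the library's finish_path may differ (for example in per-agent handling, value normalization, or the use_gae=False branch).

```python
import numpy as np

def gae_returns(rewards, values, value_next, gamma=0.99, gae_lam=0.95):
    """Standard GAE for one trajectory (hypothetical helper, not the library's code)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values_ext = np.append(np.asarray(values, dtype=np.float64), value_next)
    advantages = np.zeros_like(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        last_adv = delta + gamma * gae_lam * last_adv
        advantages[t] = last_adv
    returns = advantages + values_ext[:-1]
    return returns, advantages

returns, advantages = gae_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], value_next=0.0,
                                  gamma=1.0, gae_lam=1.0)
print(advantages)  # [3. 2. 1.]
```

With gamma = gae_lam = 1 and zero value estimates, the advantages reduce to the undiscounted reward-to-go, which makes the recursion easy to check by hand.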

sample(indexes: ndarray | None = None)[source]

Samples a batch of data from the replay buffer.

Parameters:

indexes (ndarray) – The indexes of the data in the buffer that will be sampled.

Returns:

The sampled data.

Return type:

samples_dict (dict)
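
One plausible way to use an array of indexes with a buffer shaped [n_envs, n_size, dim] is to flatten the first two axes and gather rows. The sketch below shows that idea with NumPy; the function sample_by_indexes and the flattening convention are assumptions for illustration, not the documented behaviour of sample().

```python
import numpy as np

def sample_by_indexes(data, indexes):
    """Hypothetical gather: flatten [n_envs, n_size, dim] to [n_envs * n_size, dim], then index."""
    return {key: arr.reshape(-1, arr.shape[-1])[indexes] for key, arr in data.items()}

rng = np.random.default_rng(0)
obs = {"agent_0": np.arange(16 * 100 * 18, dtype=np.float32).reshape(16, 100, 18)}
indexes = rng.permutation(16 * 100)[:64]  # one mini-batch of 64 distinct transitions
batch = sample_by_indexes(obs, indexes)
print(batch["agent_0"].shape)  # (64, 18)
```

Splitting a single permutation into consecutive chunks like this is a common way to visit every stored transition exactly once per training epoch.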

store(**step_data)[source]

Stores a step of data into the replay buffer.

class xuance.common.memory_tools_marl.MARL_OnPolicyBuffer_RNN(agent_keys: List[str], state_space: Dict[str, gymnasium.spaces.Space] | None = None, obs_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, act_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, n_envs: int = 1, buffer_size: int = 1, max_episode_steps: int = 1, use_gae: bool | None = False, use_advnorm: bool | None = False, gamma: float | None = None, gae_lam: float | None = None, **kwargs)[source]

Bases: MARL_OnPolicyBuffer

Replay buffer for on-policy MARL algorithms with DRQN trick.

Parameters:
  • agent_keys (List[str]) – Keys that identify each agent.

  • state_space (Dict[str, Space]) – Global state space (Discrete or Box).

  • obs_space (Dict[str, Dict[str, Space]]) – Observation space of each agent (agents in the same group are assumed to share one observation space).

  • act_space (Dict[str, Dict[str, Space]]) – Action space of each agent (agents in the same group are assumed to share one action space).

  • n_envs (int) – Number of parallel environments.

  • buffer_size (int) – Buffer size of total experience data.

  • max_episode_steps (int) – The sequence length of each episode data.

  • use_gae (bool) – Whether to use generalized advantage estimation (GAE).

  • use_advnorm (bool) – Whether to normalize the advantages.

  • gamma (float) – Discount factor.

  • gae_lam (float) – The lambda parameter of GAE.

  • **kwargs – Other arguments.

Examples

>>> state_space=None
>>> obs_space={'agent_0': Box(-inf, inf, (18,), float32),
...            'agent_1': Box(-inf, inf, (18,), float32),
...            'agent_2': Box(-inf, inf, (18,), float32)}
>>> act_space={'agent_0': Box(0.0, 1.0, (5,), float32),
...            'agent_1': Box(0.0, 1.0, (5,), float32),
...            'agent_2': Box(0.0, 1.0, (5,), float32)}
>>> n_envs=16
>>> buffer_size=1600
>>> agent_keys=['agent_0', 'agent_1', 'agent_2']
>>> max_episode_steps=100
>>> memory = MARL_OnPolicyBuffer_RNN(agent_keys=agent_keys, state_space=state_space, obs_space=obs_space,
...                                  act_space=act_space, n_envs=n_envs, buffer_size=buffer_size,
...                                  max_episode_steps=max_episode_steps,
...                                  use_gae=False, use_advnorm=False, gamma=0.99, gae_lam=0.95)
clear()[source]

Clear all buffer data in the on-policy replay buffer.

This method resets all stored observations, actions, rewards, values, and other related fields to zero.

Parameters:

None

Returns:

None

clear_episodes()[source]
finish_path(i_env: int | None = None, i_step: int | None = None, value_next: dict | None = None, value_normalizer: dict | None = None)[source]

Calculates and stores the returns and advantages when an episode is finished.

Parameters:
  • i_env (int) – The index of the environment.

  • i_step (int) – The index of the step for the current environment.

  • value_next (Optional[dict]) – The critic values of the terminal state.

  • value_normalizer (Optional[dict]) – The value normalizer method, default is None.

property full
sample(indexes: ndarray | None = None)[source]

Samples a batch of data from the replay buffer.

Parameters:

indexes (ndarray) – The indexes of the data in the buffer that will be sampled.

Returns:

The sampled data.

Return type:

samples_dict (dict)

store(**step_data)[source]

Stores a step of data for each environment.

Parameters:

step_data (dict) – A dict of step data to be stored into self.episode_data.

store_episodes(i_env)[source]

Stores the episode data of the i-th environment into self.data.

Parameters:

i_env (int) – The index of the environment.

class xuance.common.memory_tools_marl.MeanField_OffPolicyBuffer(agent_keys: List[str], state_space: Dict[str, gymnasium.spaces.Space] | None = None, obs_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, act_space: Dict[str, Dict[str, gymnasium.spaces.Space]] | None = None, n_envs: int = 1, buffer_size: int = 1, batch_size: int = 1, **kwargs)[source]

Bases: MARL_OffPolicyBuffer

Replay buffer for off-policy Mean-Field MARL algorithms (Mean-Field Q-Learning).

Parameters:
  • agent_keys (List[str]) – Keys that identify each agent.

  • state_space (Dict[str, Space]) – Global state space (Discrete or Box).

  • obs_space (Dict[str, Dict[str, Space]]) – Observation space of each agent (agents in the same group are assumed to share one observation space).

  • act_space (Dict[str, Dict[str, Space]]) – Action space of each agent (agents in the same group are assumed to share one action space).

  • n_envs (int) – Number of parallel environments.

  • buffer_size (int) – Buffer size of total experience data.

  • batch_size (int) – Batch size of transition data for a sample.

  • **kwargs – Other arguments.

Examples

>>> state_space=None
>>> obs_space={'agent_0': Box(-inf, inf, (18,), float32),
...            'agent_1': Box(-inf, inf, (18,), float32),
...            'agent_2': Box(-inf, inf, (18,), float32)}
>>> act_space={'agent_0': Box(0.0, 1.0, (5,), float32),
...            'agent_1': Box(0.0, 1.0, (5,), float32),
...            'agent_2': Box(0.0, 1.0, (5,), float32)}
>>> n_envs=50
>>> buffer_size=10000
>>> batch_size=256
>>> agent_keys=['agent_0', 'agent_1', 'agent_2']
>>> memory = MeanField_OffPolicyBuffer(agent_keys=agent_keys, state_space=state_space, obs_space=obs_space,
...                                    act_space=act_space, n_envs=n_envs, buffer_size=buffer_size,
...                                    batch_size=batch_size)
clear()[source]

Clears the memory data in the replay buffer.

Examples

An example of the data shapes:

# (n_env=50, buffer_size=10000, agent_keys=['agent_0', 'agent_1', 'agent_2'])
self.data = {
    'obs': {
        'agent_0': shape=[50, 200, 18],
        'agent_1': shape=[50, 200, 18],
        'agent_2': shape=[50, 200, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[50, 200, 5],
        'agent_1': shape=[50, 200, 5],
        'agent_2': shape=[50, 200, 5],
    },  # dim_act: 5
    ...
}
class xuance.common.memory_tools_marl.MeanField_OffPolicyBuffer_RNN(*args, **kwargs)[source]

Bases: MARL_OffPolicyBuffer_RNN

clear()[source]

Clear all buffer data in the replay buffer.

This method resets all stored observations, actions, rewards, values, and other related fields to zero.

Parameters:

None

Returns:

None

clear_episodes()[source]

Clears an episode of data for multiple environments in the replay buffer.

Examples

An example of the data shapes (n_envs=16, max_eps_len=60, agent_keys=['agent_0', 'agent_1', 'agent_2']):

self.data = {
    'obs': {
        'agent_0': shape=[16, 61, 18],
        'agent_1': shape=[16, 61, 18],
        'agent_2': shape=[16, 61, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[16, 60, 5],
        'agent_1': shape=[16, 60, 5],
        'agent_2': shape=[16, 60, 5],
    },  # dim_act: 5
    ...
    'filled': shape=[16, 60],  # Step mask. True means the current step is not terminated.
}
finish_path(i_env, **terminal_data)[source]

Handles the terminal states, including storing the terminal observations, avail_actions, and other terminal data.

Parameters:
  • i_env (int) – The index of the environment.

  • terminal_data (dict) – The terminal states.

class xuance.common.memory_tools_marl.MeanField_OnPolicyBuffer(*args, **kwargs)[source]

Bases: MARL_OnPolicyBuffer

Replay buffer for Mean Field Actor-Critic algorithm.

clear()[source]

Clears the memory data in the replay buffer.

Examples

An example of the data shapes (n_env=16, buffer_size=1600, agent_keys=['agent_0', 'agent_1', 'agent_2']):

self.data = {
    'obs': {
        'agent_0': shape=[16, 100, 18],
        'agent_1': shape=[16, 100, 18],
        'agent_2': shape=[16, 100, 18],
    },  # dim_obs: 18
    'actions': {
        'agent_0': shape=[16, 100, 5],
        'agent_1': shape=[16, 100, 5],
        'agent_2': shape=[16, 100, 5],
    },  # dim_act: 5
    ...
}
class xuance.common.memory_tools_marl.MeanField_OnPolicyBuffer_RNN(*args, **kwargs)[source]

Bases: MARL_OnPolicyBuffer_RNN

clear()[source]

Clear all buffer data in the on-policy replay buffer.

This method resets all stored observations, actions, rewards, values, and other related fields to zero.

Parameters:

None

Returns:

None

clear_episodes()[source]