HER

Hindsight Experience Replay (HER)

HER is an algorithm that works with off-policy methods (DQN, SAC, TD3 and DDPG, for example). HER exploits the fact that even if a desired goal was not achieved, other goals may have been achieved during a rollout. It creates “virtual” transitions by relabeling transitions (changing the desired goal) from past episodes.
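
To make the relabeling idea concrete, here is a minimal conceptual sketch (not the actual SB3 internals): a stored transition is copied, its desired goal is replaced by a goal that was achieved later in the episode, and the reward is recomputed with env.compute_reward(). The transition layout below is an assumption for illustration only.

import copy

def relabel_transition(transition, new_goal, env):
    # `transition` is assumed to be a dict with "obs", "next_obs", "action" and "reward",
    # where the observations are dicts containing "achieved_goal" and "desired_goal".
    virtual = copy.deepcopy(transition)
    virtual["obs"]["desired_goal"] = new_goal
    virtual["next_obs"]["desired_goal"] = new_goal
    # Recompute the reward with respect to the substituted goal
    virtual["reward"] = env.compute_reward(virtual["next_obs"]["achieved_goal"], new_goal, {})
    return virtual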

Warning

HER requires the environment to inherit from gym.GoalEnv.
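
As an illustration, a custom goal environment could look like the following minimal sketch. The class, spaces and threshold are made up for this example, and the old gym API used by this version is assumed; the important parts are the Dict observation space with "observation", "achieved_goal" and "desired_goal" keys and the compute_reward() method.

import numpy as np
import gym
from gym import spaces

class MyGoalEnv(gym.GoalEnv):
    """Toy environment: move a point towards a randomly sampled goal."""

    def __init__(self):
        super().__init__()
        goal_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        # HER expects a Dict observation space with these three keys
        self.observation_space = spaces.Dict({
            "observation": goal_space,
            "achieved_goal": goal_space,
            "desired_goal": goal_space,
        })
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)

    def reset(self):
        self.goal = self.observation_space.spaces["desired_goal"].sample()
        self.state = np.zeros(3, dtype=np.float32)
        return self._get_obs()

    def step(self, action):
        self.state = np.clip(self.state + 0.1 * action, -1.0, 1.0).astype(np.float32)
        obs = self._get_obs()
        reward = float(self.compute_reward(obs["achieved_goal"], obs["desired_goal"], {}))
        done = reward == 0.0
        return obs, reward, done, {}

    def compute_reward(self, achieved_goal, desired_goal, info):
        # Sparse reward: 0 when close enough to the goal, -1 otherwise
        distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
        return -(distance > 0.05).astype(np.float32)

    def _get_obs(self):
        return {
            "observation": self.state.copy(),
            "achieved_goal": self.state.copy(),
            "desired_goal": self.goal.copy(),
        }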

Warning

For performance reasons, the maximum number of steps per episode must be specified. In most cases, it will be inferred if you specify max_episode_steps when registering the environment or if you use a gym.wrappers.TimeLimit (and env.spec is not None). Otherwise, you can directly pass max_episode_length to the model constructor.
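
For instance, building on the MyGoalEnv sketch above, the limit can be provided in either of the two documented ways (the environment id and the value 50 are arbitrary):

import gym
from stable_baselines3 import HER, SAC

# Option 1: register the environment with max_episode_steps so the limit can be inferred
gym.envs.registration.register(
    id="MyGoalEnv-v0",
    entry_point=MyGoalEnv,  # a string path like "my_module:MyGoalEnv" also works
    max_episode_steps=50,
)
env = gym.make("MyGoalEnv-v0")
model = HER('MlpPolicy', env, SAC)

# Option 2: pass the limit explicitly to the model constructor
model = HER('MlpPolicy', MyGoalEnv(), SAC, max_episode_length=50)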

Warning

HER supports the VecNormalize wrapper, but only when online_sampling=True.

Warning

Because it needs access to env.compute_reward(), HER must be loaded with the env. If you just want to use the trained policy without instantiating the environment, we recommend saving the policy only.
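
A hedged sketch of that workflow (not taken from the example below): model.policy.save() and the policy's load() classmethod are assumed here, and the policy class must match the model class you trained with (DQNPolicy matches the DQN example).

# `model` is a trained HER instance, e.g. from the example below.
# Save only the policy weights: they can be reloaded without the environment.
model.policy.save("her_policy.pkl")

# Later, load the policy alone for prediction (no env.compute_reward() needed)
from stable_baselines3.dqn.policies import DQNPolicy

policy = DQNPolicy.load("her_policy.pkl")
# Note: the standalone policy expects the flattened observation (observation + desired goal),
# see ObsDictWrapper.convert_dict() further down this page.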

Can I use?

Please refer to the documentation of the model you use (DQN, QR-DQN, SAC, TQC, TD3, or DDPG) for that section.

Example

from stable_baselines3 import HER, DDPG, DQN, SAC, TD3
from stable_baselines3.her.goal_selection_strategy import GoalSelectionStrategy
from stable_baselines3.common.bit_flipping_env import BitFlippingEnv

model_class = DQN  # works also with SAC, DDPG and TD3
N_BITS = 15

env = BitFlippingEnv(n_bits=N_BITS, continuous=model_class in [DDPG, SAC, TD3], max_steps=N_BITS)

# Available strategies (cf paper): future, final, episode
goal_selection_strategy = 'future' # equivalent to GoalSelectionStrategy.FUTURE

# If True the HER transitions will get sampled online
online_sampling = True
# Time limit for the episodes
max_episode_length = N_BITS

# Initialize the model
model = HER('MlpPolicy', env, model_class, n_sampled_goal=4,
            goal_selection_strategy=goal_selection_strategy,
            online_sampling=online_sampling, verbose=1,
            max_episode_length=max_episode_length)
# Train the model
model.learn(1000)

model.save("./her_bit_env")
# Because it needs access to `env.compute_reward()`
# HER must be loaded with the env
model = HER.load('./her_bit_env', env=env)

obs = env.reset()
for _ in range(100):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, _ = env.step(action)

    if done:
        obs = env.reset()

Results

This implementation was tested on the parking env using 3 seeds.

The complete learning curves are available in the associated PR #120.

How to replicate the results?

Clone the rl-zoo repo:

git clone https://github.com/DLR-RM/rl-baselines3-zoo
cd rl-baselines3-zoo/

Run the benchmark:

python train.py --algo her --env parking-v0 --eval-episodes 10 --eval-freq 10000

Plot the results:

python scripts/all_plots.py -a her -e parking-v0 -f logs/ --no-million

Parameters

class stable_baselines3.her.HER(policy, env, model_class, n_sampled_goal=4, goal_selection_strategy='future', online_sampling=False, max_episode_length=None, *args, **kwargs)[source]

Hindsight Experience Replay (HER) Paper: https://arxiv.org/abs/1707.01495

Warning

For performance reasons, the maximum number of steps per episode must be specified. In most cases, it will be inferred if you specify max_episode_steps when registering the environment or if you use a gym.wrappers.TimeLimit (and env.spec is not None). Otherwise, you can directly pass max_episode_length to the model constructor.

For additional arguments specific to the off-policy algorithm, please have a look at the corresponding documentation.

Parameters
  • policy (Union[str, Type[BasePolicy]]) – The policy model to use.

  • env (Union[Env, VecEnv, str]) – The environment to learn from (if registered in Gym, can be str)

  • model_class (Type[OffPolicyAlgorithm]) – Off-policy model that will be used with hindsight experience replay (SAC, TD3, DDPG, DQN)

  • n_sampled_goal (int) – Number of sampled goals for replay. (offline sampling)

  • goal_selection_strategy (Union[GoalSelectionStrategy, str]) – Strategy for sampling goals for replay. One of [‘episode’, ‘final’, ‘future’, ‘random’]

  • online_sampling (bool) – Sample HER transitions online.

  • learning_rate – Learning rate for the optimizer; it can be a function of the current progress remaining (from 1 to 0)

  • max_episode_length (Optional[int]) – The maximum length of an episode. If not specified, it will be automatically inferred if the environment uses a gym.wrappers.TimeLimit wrapper.
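
For instance, a constructor call with offline sampling, reusing env, N_BITS and DQN from the example above (the values are illustrative):

model = HER(
    'MlpPolicy',
    env,
    DQN,
    n_sampled_goal=4,
    goal_selection_strategy='future',
    online_sampling=False,       # HER transitions are created offline and stored in the buffer
    max_episode_length=N_BITS,
    verbose=1,
)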

collect_rollouts(env, callback, train_freq, action_noise=None, learning_starts=0, log_interval=None)[source]

Collect experiences and store them into a ReplayBuffer.

Parameters
  • env (VecEnv) – The training environment

  • callback (BaseCallback) – Callback that will be called at each step (and at the beginning and end of the rollout)

  • train_freq (TrainFreq) – How much experience to collect by doing rollouts of current policy. Either TrainFreq(<n>, TrainFrequencyUnit.STEP) or TrainFreq(<n>, TrainFrequencyUnit.EPISODE) with <n> being an integer greater than 0.

  • action_noise (Optional[ActionNoise]) – Action noise that will be used for exploration. Required for deterministic policy (e.g. TD3). This can also be used in addition to the stochastic policy for SAC.

  • learning_starts (int) – Number of steps before learning for the warm-up phase.

  • log_interval (Optional[int]) – Log data every log_interval episodes

Return type

RolloutReturn

learn(total_timesteps, callback=None, log_interval=4, eval_env=None, eval_freq=- 1, n_eval_episodes=5, tb_log_name='HER', eval_log_path=None, reset_num_timesteps=True)[source]

Return a trained model.

Parameters
  • total_timesteps (int) – The total number of samples (env steps) to train on

  • callback (Union[None, Callable, List[BaseCallback], BaseCallback]) – callback(s) called at every step with state of the algorithm.

  • log_interval (int) – The number of timesteps before logging.

  • tb_log_name (str) – the name of the run for TensorBoard logging

  • eval_env (Union[Env, VecEnv, None]) – Environment that will be used to evaluate the agent

  • eval_freq (int) – Evaluate the agent every eval_freq timesteps (this may vary a little)

  • n_eval_episodes (int) – Number of episodes to evaluate the agent

  • eval_log_path (Optional[str]) – Path to a folder where the evaluations will be saved

  • reset_num_timesteps (bool) – whether or not to reset the current timestep number (used in logging)

Return type

BaseAlgorithm

Returns

the trained model
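
A usage sketch, continuing the BitFlippingEnv example above; all values are illustrative and eval_log_path is a hypothetical folder (make sure the evaluation environment is set up the same way as the training one).

eval_env = BitFlippingEnv(n_bits=N_BITS, continuous=model_class in [DDPG, SAC, TD3], max_steps=N_BITS)
model.learn(
    total_timesteps=10_000,
    eval_env=eval_env,
    eval_freq=1_000,            # evaluate every 1000 environment steps
    n_eval_episodes=5,
    eval_log_path="./her_eval_logs/",
)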

classmethod load(path, env=None, device='auto', custom_objects=None, **kwargs)[source]

Load the model from a zip-file

Parameters
  • path (Union[str, Path, BufferedIOBase]) – path to the file (or a file-like) where to load the agent from

  • env (Union[Env, VecEnv, None]) – The new environment to run the loaded model on (can be None if you only need prediction from a trained model); it has priority over any saved environment

  • device (Union[device, str]) – Device on which the code should run.

  • custom_objects (Optional[Dict[str, Any]]) – Dictionary of objects to replace upon loading. If a variable is present in this dictionary as a key, it will not be deserialized and the corresponding item will be used instead. Similar to custom_objects in keras.models.load_model. Useful when you have an object in the file that cannot be deserialized.

  • kwargs – extra arguments to change the model when loading

Return type

BaseAlgorithm

load_replay_buffer(path, truncate_last_trajectory=True)[source]

Load a replay buffer from a pickle file and set environment for replay buffer (only online sampling).

Parameters
  • path (Union[str, Path, BufferedIOBase]) – Path to the pickled replay buffer.

  • truncate_last_trajectory (bool) – Only for online sampling. If set to True we assume that the last trajectory in the replay buffer was finished. If it is set to False we assume that we continue the same trajectory (same episode).

Return type

None
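
A usage sketch, continuing the example above. save_replay_buffer() is assumed to be available here as on the other off-policy algorithms; check your version.

# Save the model and its replay buffer separately
model.save("./her_bit_env")
model.save_replay_buffer("./her_replay_buffer.pkl")

# Later: reload the model (with its env), then restore the buffer.
# truncate_last_trajectory=True assumes the last stored trajectory was complete.
model = HER.load("./her_bit_env", env=env)
model.load_replay_buffer("./her_replay_buffer.pkl", truncate_last_trajectory=True)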

predict(observation, state=None, mask=None, deterministic=False)[source]

Get the model’s action(s) from an observation

Parameters
  • observation (ndarray) – the input observation

  • state (Optional[ndarray]) – The last states (can be None, used in recurrent policies)

  • mask (Optional[ndarray]) – The last masks (can be None, used in recurrent policies)

  • deterministic (bool) – Whether or not to return deterministic actions.

Return type

Tuple[ndarray, Optional[ndarray]]

Returns

the model’s action and the next state (used in recurrent policies)

save(path, exclude=None, include=None)[source]

Save all the attributes of the object and the model parameters in a zip-file.

Parameters
  • path (Union[str, Path, BufferedIOBase]) – path to the file where the rl agent should be saved

  • exclude (Optional[Iterable[str]]) – name of parameters that should be excluded in addition to the default one

  • include (Optional[Iterable[str]]) – name of parameters that might be excluded but should be included anyway

Return type

None

Goal Selection Strategies

class stable_baselines3.her.GoalSelectionStrategy(value)[source]

The strategies for selecting new goals when creating artificial transitions.
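
The strategy can be passed to HER either as an enum member or as the corresponding lowercase string, for example:

from stable_baselines3.her.goal_selection_strategy import GoalSelectionStrategy

# 'final' is the string shorthand for GoalSelectionStrategy.FINAL
goal_selection_strategy = GoalSelectionStrategy.FINAL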

Obs Dict Wrapper

class stable_baselines3.her.ObsDictWrapper(venv)[source]

Wrapper for a VecEnv which overrides the observation space for Hindsight Experience Replay to support dict observations.

Parameters

venv – The vectorized environment to wrap.

close()

Clean up the environment’s resources.

Return type

None

static convert_dict(observation_dict, observation_key='observation', goal_key='desired_goal')[source]

Concatenate observation and (desired) goal of observation dict.

Parameters
  • observation_dict (Dict[str, ndarray]) – Dictionary with observation.

  • observation_key (str) – Key of observation in dictionary.

  • goal_key (str) – Key of (desired) goal in dictionary.

Return type

ndarray

Returns

Concatenated observation.
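
A usage sketch (the observation values are made up for illustration):

import numpy as np
from stable_baselines3.common.vec_env.obs_dict_wrapper import ObsDictWrapper

obs_dict = {
    "observation": np.array([0.1, 0.2, 0.3]),
    "achieved_goal": np.array([0.0, 0.0, 1.0]),
    "desired_goal": np.array([1.0, 0.0, 0.0]),
}
# Concatenates the "observation" and "desired_goal" entries into one flat array
flat_obs = ObsDictWrapper.convert_dict(obs_dict)
print(flat_obs.shape)  # (6,)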

env_is_wrapped(wrapper_class, indices=None)

Check if environments are wrapped with a given wrapper.

Parameters
  • wrapper_class (Type[Wrapper]) – The wrapper class to look for

  • indices (Union[None, int, Iterable[int]]) – Indices of envs to check

Return type

List[bool]

Returns

True if the env is wrapped, False otherwise, for each env queried.

env_method(method_name, *method_args, indices=None, **method_kwargs)

Call instance methods of vectorized environments.

Parameters
  • method_name (str) – The name of the environment method to invoke.

  • indices (Union[None, int, Iterable[int]]) – Indices of envs whose method to call

  • method_args – Any positional arguments to provide in the call

  • method_kwargs – Any keyword arguments to provide in the call

Return type

List[Any]

Returns

List of items returned by the environment’s method call

get_attr(attr_name, indices=None)

Return attribute from vectorized environment.

Parameters
  • attr_name (str) – The name of the attribute whose value to return

  • indices (Union[None, int, Iterable[int]]) – Indices of envs to get attribute from

Return type

List[Any]

Returns

List of values of ‘attr_name’ in all environments

get_images()

Return RGB images from each environment

Return type

Sequence[ndarray]

getattr_depth_check(name, already_found)

See base class.

Return type

str

Returns

name of module whose attribute is being shadowed, if any.

getattr_recursive(name)

Recursively check wrappers to find attribute.

Parameters

name (str) – name of attribute to look for

Return type

Any

Returns

attribute

render(mode='human')

Gym environment rendering

Parameters

mode (str) – the rendering type

Return type

Optional[ndarray]

reset()[source]

Reset all the environments and return an array of observations, or a tuple of observation arrays.

If step_async is still doing work, that work will be cancelled and step_wait() should not be called until step_async() is invoked again.

Returns

observation

seed(seed=None)

Sets the random seeds for all environments, based on a given seed. Each individual environment will still get its own seed, by incrementing the given seed.

Parameters

seed (Optional[int]) – The random seed. May be None for completely random seeding.

Return type

List[Optional[int]]

Returns

Returns a list containing the seeds for each individual env. Note that all list elements may be None, if the env does not return anything when being seeded.

set_attr(attr_name, value, indices=None)

Set attribute inside vectorized environments.

Parameters
  • attr_name (str) – The name of attribute to assign new value

  • value (Any) – Value to assign to attr_name

  • indices (Union[None, int, Iterable[int]]) – Indices of envs to assign value

Return type

None

step(actions)

Step the environments with the given actions

Parameters

actions (ndarray) – the action

Return type

Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, …]], ndarray, ndarray, List[Dict]]

Returns

observation, reward, done, information

step_async(actions)

Tell all the environments to start taking a step with the given actions. Call step_wait() to get the results of the step.

You should not call this if a step_async run is already pending.

Return type

None

step_wait()[source]

Wait for the step taken with step_async().

Returns

observation, reward, done, information

HER Replay Buffer

class stable_baselines3.her.HerReplayBuffer(env, buffer_size, max_episode_length, goal_selection_strategy, observation_space, action_space, device='cpu', n_envs=1, her_ratio=0.8)[source]

Replay buffer for sampling HER (Hindsight Experience Replay) transitions. In the online sampling case, these new transitions will not be saved in the replay buffer and will only be created at sampling time.

Parameters
  • env (ObsDictWrapper) – The training environment

  • buffer_size (int) – The size of the buffer measured in transitions.

  • max_episode_length (int) – The maximum length of an episode (time horizon)

  • goal_selection_strategy (GoalSelectionStrategy) – Strategy for sampling goals for replay. One of [‘episode’, ‘final’, ‘future’]

  • observation_space (Space) – Observation space

  • action_space (Space) – Action space

  • device (Union[device, str]) – PyTorch device

  • n_envs (int) – Number of parallel environments

  • her_ratio (float) – The ratio of HER transitions to regular transitions, in percent (between 0 and 1, for online sampling). The default her_ratio=0.8 corresponds to 4 virtual transitions for each real transition (4 / (4 + 1) = 0.8)
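
The arithmetic behind that default, as a quick check:

# With n_sampled_goal=4 virtual goals per real transition:
n_sampled_goal = 4
her_ratio = n_sampled_goal / (n_sampled_goal + 1)
print(her_ratio)  # 0.8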

add(obs, next_obs, action, reward, done, infos)[source]

Add elements to the buffer.

Return type

None

extend(*args, **kwargs)

Add a new batch of transitions to the buffer

Return type

None

reset()[source]

Reset the buffer.

Return type

None

sample(batch_size, env)[source]

Sampling function for the online sampling of HER transitions; it replaces the “regular” replay buffer sample() method in the train() function.

Parameters
  • batch_size (int) – Number of elements to sample

  • env (Optional[VecNormalize]) – Associated gym VecEnv to normalize the observations/rewards when sampling

Return type

Union[ReplayBufferSamples, Tuple[ndarray, …]]

Returns

Samples.

sample_goals(episode_indices, her_indices, transitions_indices)[source]

Sample goals based on goal_selection_strategy. This is a vectorized (fast) version.

Parameters
  • episode_indices (ndarray) – Episode indices to use.

  • her_indices (ndarray) – HER indices.

  • transitions_indices (ndarray) – Transition indices to use.

Return type

ndarray

Returns

Return sampled goals.

sample_offline(n_sampled_goal=None)[source]

Sampling function for the offline sampling of HER transitions; in that case, only one episode is used and the transitions are added to the regular replay buffer.

Parameters

n_sampled_goal (Optional[int]) – Number of sampled goals for replay

Return type

Union[ReplayBufferSamples, Tuple[ndarray, …]]

Returns

At most (n_sampled_goal * episode_length) HER transitions.

set_env(env)[source]

Sets the environment.

Parameters

env (ObsDictWrapper) – The environment to set.

Return type

None

size()[source]
Return type

int

Returns

The current number of transitions in the buffer.

store_episode()[source]

Increment episode counter and reset transition pointer.

Return type

None

static swap_and_flatten(arr)

Swap and then flatten axes 0 (buffer_size) and 1 (n_envs) to convert shape from [n_steps, n_envs, …] (where … is the shape of the features) to [n_steps * n_envs, …] (which maintains the order)

Parameters

arr (ndarray) –

Return type

ndarray
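
The equivalent shape transformation in plain NumPy (the shapes below are illustrative):

import numpy as np

arr = np.zeros((5, 2, 3))                   # [n_steps=5, n_envs=2, features=3]
out = arr.swapaxes(0, 1).reshape(5 * 2, 3)  # [n_steps * n_envs, features]
print(out.shape)  # (10, 3)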

to_torch(array, copy=True)

Convert a numpy array to a PyTorch tensor. Note: it copies the data by default

Parameters
  • array (ndarray) –

  • copy (bool) – Whether or not to copy the data (may be useful to avoid changing things by reference)

Return type

Tensor

Returns