HER

Hindsight Experience Replay (HER)

HER is an algorithm that works with off-policy methods (DQN, SAC, TD3 and DDPG, for example). HER exploits the fact that even if a desired goal was not achieved, other goals may have been achieved during a rollout. It creates “virtual” transitions by relabeling transitions (changing the desired goal) from past episodes.
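
To make the relabeling idea concrete, here is a minimal conceptual sketch (not the actual SB3 internals): a stored transition is copied, its desired goal is replaced by a goal that was achieved later in the episode, and the reward is recomputed with env.compute_reward(). The transition layout below is an assumption for illustration only.

import copy

def relabel_transition(transition, new_goal, env):
    # `transition` is assumed to be a dict with "obs", "next_obs", "action" and "reward",
    # where the observations are dicts containing "achieved_goal" and "desired_goal".
    virtual = copy.deepcopy(transition)
    virtual["obs"]["desired_goal"] = new_goal
    virtual["next_obs"]["desired_goal"] = new_goal
    # Recompute the reward with respect to the substituted goal
    virtual["reward"] = env.compute_reward(virtual["next_obs"]["achieved_goal"], new_goal, {})
    return virtual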

Warning

HER requires the environment to inherit from gym.GoalEnv.
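
As an illustration, a custom goal environment could look like the following minimal sketch. The class, spaces and threshold are made up for this example, and the old gym API used by this version is assumed; the important parts are the Dict observation space with "observation", "achieved_goal" and "desired_goal" keys and the compute_reward() method.

import numpy as np
import gym
from gym import spaces

class MyGoalEnv(gym.GoalEnv):
    """Toy environment: move a point towards a randomly sampled goal."""

    def __init__(self):
        super().__init__()
        goal_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        # HER expects a Dict observation space with these three keys
        self.observation_space = spaces.Dict({
            "observation": goal_space,
            "achieved_goal": goal_space,
            "desired_goal": goal_space,
        })
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)

    def reset(self):
        self.goal = self.observation_space.spaces["desired_goal"].sample()
        self.state = np.zeros(3, dtype=np.float32)
        return self._get_obs()

    def step(self, action):
        self.state = np.clip(self.state + 0.1 * action, -1.0, 1.0).astype(np.float32)
        obs = self._get_obs()
        reward = float(self.compute_reward(obs["achieved_goal"], obs["desired_goal"], {}))
        done = reward == 0.0
        return obs, reward, done, {}

    def compute_reward(self, achieved_goal, desired_goal, info):
        # Sparse reward: 0 when close enough to the goal, -1 otherwise
        distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
        return -(distance > 0.05).astype(np.float32)

    def _get_obs(self):
        return {
            "observation": self.state.copy(),
            "achieved_goal": self.state.copy(),
            "desired_goal": self.goal.copy(),
        }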

Warning

For performance reasons, the maximum number of steps per episode must be specified. In most cases, it will be inferred if you specify max_episode_steps when registering the environment or if you use a gym.wrappers.TimeLimit (and env.spec is not None). Otherwise, you can directly pass max_episode_length to the model constructor.
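
For instance, building on the MyGoalEnv sketch above, the limit can be provided in either of the two documented ways (the environment id and the value 50 are arbitrary):

import gym
from stable_baselines3 import HER, SAC

# Option 1: register the environment with max_episode_steps so the limit can be inferred
gym.envs.registration.register(
    id="MyGoalEnv-v0",
    entry_point=MyGoalEnv,  # a string path like "my_module:MyGoalEnv" also works
    max_episode_steps=50,
)
env = gym.make("MyGoalEnv-v0")
model = HER('MlpPolicy', env, SAC)

# Option 2: pass the limit explicitly to the model constructor
model = HER('MlpPolicy', MyGoalEnv(), SAC, max_episode_length=50)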

Warning

HER supports the VecNormalize wrapper, but only when online_sampling=True.

Warning

Because it needs access to env.compute_reward(), HER must be loaded with the env. If you just want to use the trained policy without instantiating the environment, we recommend saving the policy only.
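
A hedged sketch of that workflow (not taken from the example below): model.policy.save() and the policy's load() classmethod are assumed here, and the policy class must match the model class you trained with (DQNPolicy matches the DQN example).

# `model` is a trained HER instance, e.g. from the example below.
# Save only the policy weights: they can be reloaded without the environment.
model.policy.save("her_policy.pkl")

# Later, load the policy alone for prediction (no env.compute_reward() needed)
from stable_baselines3.dqn.policies import DQNPolicy

policy = DQNPolicy.load("her_policy.pkl")
# Note: the standalone policy expects the flattened observation (observation + desired goal),
# see ObsDictWrapper.convert_dict() further down this page.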

Can I use?

Please refer to the documentation of the model you use (DQN, QR-DQN, SAC, TQC, TD3, or DDPG) for that section.

Example

from stable_baselines3 import HER, DDPG, DQN, SAC, TD3
from stable_baselines3.her.goal_selection_strategy import GoalSelectionStrategy
from stable_baselines3.common.bit_flipping_env import BitFlippingEnv

model_class = DQN  # works also with SAC, DDPG and TD3
N_BITS = 15

env = BitFlippingEnv(n_bits=N_BITS, continuous=model_class in [DDPG, SAC, TD3], max_steps=N_BITS)

# Available strategies (cf paper): future, final, episode
goal_selection_strategy = 'future' # equivalent to GoalSelectionStrategy.FUTURE

# If True the HER transitions will get sampled online
online_sampling = True
# Time limit for the episodes
max_episode_length = N_BITS

# Initialize the model
model = HER('MlpPolicy', env, model_class, n_sampled_goal=4,
            goal_selection_strategy=goal_selection_strategy,
            online_sampling=online_sampling, verbose=1,
            max_episode_length=max_episode_length)
# Train the model
model.learn(1000)

model.save("./her_bit_env")
# Because it needs access to `env.compute_reward()`
# HER must be loaded with the env
model = HER.load('./her_bit_env', env=env)

obs = env.reset()
for _ in range(100):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, _ = env.step(action)

    if done:
        obs = env.reset()

Results

This implementation was tested on the parking env using 3 seeds.

The complete learning curves are available in the associated PR #120.

How to replicate the results?

Clone the rl-zoo repo:

git clone https://github.com/DLR-RM/rl-baselines3-zoo
cd rl-baselines3-zoo/

Run the benchmark:

python train.py --algo her --env parking-v0 --eval-episodes 10 --eval-freq 10000

Plot the results:

python scripts/all_plots.py -a her -e parking-v0 -f logs/ --no-million

Parameters

class stable_baselines3.her.HER(policy, env, model_class, n_sampled_goal=4, goal_selection_strategy='future', online_sampling=False, max_episode_length=None, *args, **kwargs)[source]

Hindsight Experience Replay (HER) Paper: https://arxiv.org/abs/1707.01495

Warning

For performance reasons, the maximum number of steps per episode must be specified. In most cases, it will be inferred if you specify max_episode_steps when registering the environment or if you use a gym.wrappers.TimeLimit (and env.spec is not None). Otherwise, you can directly pass max_episode_length to the model constructor.

For additional arguments specific to the off-policy algorithm, please have a look at the corresponding documentation.

Parameters
  • policy (Union[str, Type[BasePolicy]]) – The policy model to use.

  • env (Union[Env, VecEnv, str]) – The environment to learn from (if registered in Gym, can be str)

  • model_class (Type[OffPolicyAlgorithm]) – Off-policy model that will be used with hindsight experience replay (SAC, TD3, DDPG, DQN)

  • n_sampled_goal (int) – Number of sampled goals for replay. (offline sampling)

  • goal_selection_strategy (Union[GoalSelectionStrategy, str]) – Strategy for sampling goals for replay. One of [‘episode’, ‘final’, ‘future’, ‘random’]

  • online_sampling (bool) – Sample HER transitions online.

  • learning_rate – Learning rate for the optimizer; it can be a function of the current progress remaining (from 1 to 0)

  • max_episode_length (Optional[int]) – The maximum length of an episode. If not specified, it will be automatically inferred if the environment uses a gym.wrappers.TimeLimit wrapper.
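
For instance, a constructor call with offline sampling, reusing env, N_BITS and DQN from the example above (the values are illustrative):

model = HER(
    'MlpPolicy',
    env,
    DQN,
    n_sampled_goal=4,
    goal_selection_strategy='future',
    online_sampling=False,       # HER transitions are created offline and stored in the buffer
    max_episode_length=N_BITS,
    verbose=1,
)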

collect_rollouts(env, callback, train_freq, action_noise=None, learning_starts=0, log_interval=None)[source]

Collect experiences and store them into a ReplayBuffer.

Parameters
  • env (VecEnv) – The training environment

  • callback (BaseCallback) – Callback that will be called at each step (and at the beginning and end of the rollout)

  • train_freq (TrainFreq) – How much experience to collect by doing rollouts of current policy. Either TrainFreq(<n>, TrainFrequencyUnit.STEP) or TrainFreq(<n>, TrainFrequencyUnit.EPISODE) with <n> being an integer greater than 0.

  • action_noise (Optional[ActionNoise]) – Action noise that will be used for exploration. Required for deterministic policy (e.g. TD3). This can also be used in addition to the stochastic policy for SAC.

  • learning_starts (int) – Number of steps before learning for the warm-up phase.

  • log_interval (Optional[int]) – Log data every log_interval episodes

Return type

RolloutReturn

learn(total_timesteps, callback=None, log_interval=4, eval_env=None, eval_freq=- 1, n_eval_episodes=5, tb_log_name='HER', eval_log_path=None, reset_num_timesteps=True)[source]

Return a trained model.

Parameters
  • total_timesteps (int) – The total number of samples (env steps) to train on

  • callback (Union[None, Callable, List[BaseCallback], BaseCallback]) – callback(s) called at every step with state of the algorithm.

  • log_interval (int) – The number of timesteps before logging.

  • tb_log_name (str) – the name of the run for TensorBoard logging

  • eval_env (Union[Env, VecEnv, None]) – Environment that will be used to evaluate the agent

  • eval_freq (int) – Evaluate the agent every eval_freq timesteps (this may vary a little)

  • n_eval_episodes (int) – Number of episodes to evaluate the agent

  • eval_log_path (Optional[str]) – Path to a folder where the evaluations will be saved

  • reset_num_timesteps (bool) – whether or not to reset the current timestep number (used in logging)

Return type

BaseAlgorithm

Returns

the trained model
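
A usage sketch, continuing the BitFlippingEnv example above; all values are illustrative and eval_log_path is a hypothetical folder (make sure the evaluation environment is set up the same way as the training one).

eval_env = BitFlippingEnv(n_bits=N_BITS, continuous=model_class in [DDPG, SAC, TD3], max_steps=N_BITS)
model.learn(
    total_timesteps=10_000,
    eval_env=eval_env,
    eval_freq=1_000,            # evaluate every 1000 environment steps
    n_eval_episodes=5,
    eval_log_path="./her_eval_logs/",
)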

classmethod load(path, env=None, device='auto', custom_objects=None, **kwargs)[source]

Load the model from a zip-file

Parameters
  • path (Union[str, Path, BufferedIOBase]) – path to the file (or a file-like) where to load the agent from

  • env (Union[Env, VecEnv, None]) – The new environment to run the loaded model on (can be None if you only need prediction from a trained model); it has priority over any saved environment

  • device (Union[device, str]) – Device on which the code should run.

  • custom_objects (Optional[Dict[str, Any]]) – Dictionary of objects to replace upon loading. If a variable is present in this dictionary as a key, it will not be deserialized and the corresponding item will be used instead. Similar to custom_objects in keras.models.load_model. Useful when you have an object in the file that cannot be deserialized.

  • kwargs – extra arguments to change the model when loading

Return type

BaseAlgorithm

load_replay_buffer(path, truncate_last_trajectory=True)[source]

Load a replay buffer from a pickle file and set environment for replay buffer (only online sampling).

Parameters
  • path (Union[str, Path, BufferedIOBase]) – Path to the pickled replay buffer.

  • truncate_last_trajectory (bool) – Only for online sampling. If set to True we assume that the last trajectory in the replay buffer was finished. If it is set to False we assume that we continue the same trajectory (same episode).

Return type

None
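
A usage sketch, continuing the example above. save_replay_buffer() is assumed to be available here as on the other off-policy algorithms; check your version.

# Save the model and its replay buffer separately
model.save("./her_bit_env")
model.save_replay_buffer("./her_replay_buffer.pkl")

# Later: reload the model (with its env), then restore the buffer.
# truncate_last_trajectory=True assumes the last stored trajectory was complete.
model = HER.load("./her_bit_env", env=env)
model.load_replay_buffer("./her_replay_buffer.pkl", truncate_last_trajectory=True)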

predict(observation, state=None, mask=None, deterministic=False)[source]

Get the model’s action(s) from an observation

Parameters
  • observation (ndarray) – the input observation

  • state (Optional[ndarray]) – The last states (can be None, used in recurrent policies)

  • mask (Optional[ndarray]) – The last masks (can be None, used in recurrent policies)

  • deterministic (bool) – Whether or not to return deterministic actions.

Return type

Tuple[ndarray, Optional[ndarray]]

Returns

the model’s action and the next state (used in recurrent policies)

save(path, exclude=None, include=None)[source]

Save all the attributes of the object and the model parameters in a zip-file.

Parameters
  • path (Union[str, Path, BufferedIOBase]) – path to the file where the rl agent should be saved

  • exclude (Optional[Iterable[str]]) – name of parameters that should be excluded in addition to the default one

  • include (Optional[Iterable[str]]) – name of parameters that might be excluded but should be included anyway

Return type

None

Goal Selection Strategies

class stable_baselines3.her.GoalSelectionStrategy(value)[source]

The strategies for selecting new goals when creating artificial transitions.
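
The strategy can be passed to HER either as an enum member or as the corresponding lowercase string, for example:

from stable_baselines3.her.goal_selection_strategy import GoalSelectionStrategy

# 'final' is the string shorthand for GoalSelectionStrategy.FINAL
goal_selection_strategy = GoalSelectionStrategy.FINAL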

Obs Dict Wrapper

class stable_baselines3.her.ObsDictWrapper(venv)[source]

Wrapper for a VecEnv which overrides the observation space for Hindsight Experience Replay to support dict observations.

Parameters

venv – The vectorized environment to wrap.

close()

Clean up the environment’s resources.

Return type

None

static convert_dict(observation_dict, observation_key='observation', goal_key='desired_goal')[source]

Concatenate observation and (desired) goal of observation dict.

Parameters
  • observation_dict (Dict[str, ndarray]) – Dictionary with observation.

  • observation_key (str) – Key of observation in dictionary.

  • goal_key (str) – Key of (desired) goal in dictionary.

Return type

ndarray

Returns

Concatenated observation.
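
A usage sketch (the observation values are made up for illustration):

import numpy as np
from stable_baselines3.common.vec_env.obs_dict_wrapper import ObsDictWrapper

obs_dict = {
    "observation": np.array([0.1, 0.2, 0.3]),
    "achieved_goal": np.array([0.0, 0.0, 1.0]),
    "desired_goal": np.array([1.0, 0.0, 0.0]),
}
# Concatenates the "observation" and "desired_goal" entries into one flat array
flat_obs = ObsDictWrapper.convert_dict(obs_dict)
print(flat_obs.shape)  # (6,)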

env_is_wrapped(wrapper_class, indices=None)

Check if environments are wrapped with a given wrapper.

Parameters
  • wrapper_class (Type[Wrapper]) – The wrapper class to look for

  • indices (Union[None, int, Iterable[int]]) – Indices of envs to check

Return type

List[bool]

Returns

True if the env is wrapped, False otherwise, for each env queried.

env_method(method_name, *method_args, indices=None, **method_kwargs)

Call instance methods of vectorized environments.

Parameters
  • method_name (str) – The name of the environment method to invoke.

  • indices (Union[None, int, Iterable[int]]) – Indices of envs whose method to call

  • method_args – Any positional arguments to provide in the call

  • method_kwargs – Any keyword arguments to provide in the call

Return type

List[Any]

Returns

List of items returned by the environment’s method call

get_attr(attr_name, indices=None)

Return attribute from vectorized environment.

Parameters
  • attr_name (str) – The name of the attribute whose value to return

  • indices (Union[None, int, Iterable[int]]) – Indices of envs to get attribute from

Return type

List[Any]

Returns

List of values of ‘attr_name’ in all environments

get_images()

Return RGB images from each environment

Return type

Sequence[ndarray]

getattr_depth_check(name, already_found)

See base class.

Return type

str

Returns

name of module whose attribute is being shadowed, if any.

getattr_recursive(name)

Recursively check wrappers to find attribute.

Parameters

name (str) – name of attribute to look for

Return type

Any

Returns

attribute

render(mode='human')

Gym environment rendering

Parameters

mode (str) – the rendering type

Return type

Optional[ndarray]

reset()[source]

Reset all the environments and return an array of observations, or a tuple of observation arrays.

If step_async is still doing work, that work will be cancelled and step_wait() should not be called until step_async() is invoked again.

Returns

observation

seed(seed=None)

Sets the random seeds for all environments, based on a given seed. Each individual environment will still get its own seed, by incrementing the given seed.

Parameters

seed (Optional[int]) – The random seed. May be None for completely random seeding.

Return type

List[Optional[int]]

Returns

Returns a list containing the seeds for each individual env. Note that all list elements may be None, if the env does not return anything when being seeded.

set_attr(attr_name, value, indices=None)

Set attribute inside vectorized environments.

Parameters
  • attr_name (str) – The name of attribute to assign new value

  • value (Any) – Value to assign to attr_name

  • indices (Union[None, int, Iterable[int]]) – Indices of envs to assign value

Return type

None

step(actions)

Step the environments with the given actions

Parameters

actions (ndarray) – the action

Return type

Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, …]], ndarray, ndarray, List[Dict]]

Returns

observation, reward, done, information

step_async(actions)

Tell all the environments to start taking a step with the given actions. Call step_wait() to get the results of the step.

You should not call this if a step_async run is already pending.

Return type

None

step_wait()[source]

Wait for the step taken with step_async().

Returns

observation, reward, done, information

HER Replay Buffer

class stable_baselines3.her.HerReplayBuffer(env, buffer_size, max_episode_length, goal_selection_strategy, observation_space, action_space, device='cpu', n_envs=1, her_ratio=0.8)[source]

Replay buffer for sampling HER (Hindsight Experience Replay) transitions. In the online sampling case, these new transitions will not be saved in the replay buffer and will only be created at sampling time.

Parameters
  • env (ObsDictWrapper) – The training environment

  • buffer_size (int) – The size of the buffer measured in transitions.

  • max_episode_length (int) – The maximum length of an episode (time horizon)

  • goal_selection_strategy (GoalSelectionStrategy) – Strategy for sampling goals for replay. One of [‘episode’, ‘final’, ‘future’]

  • observation_space (Space) – Observation space

  • action_space (Space) – Action space

  • device (Union[device, str]) – PyTorch device

  • n_envs (int) – Number of parallel environments

  • her_ratio (float) – The ratio of HER transitions to regular transitions, in percent (between 0 and 1, for online sampling). The default her_ratio=0.8 corresponds to 4 virtual transitions for each real transition (4 / (4 + 1) = 0.8)
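
The arithmetic behind that default, as a quick check:

# With n_sampled_goal=4 virtual goals per real transition:
n_sampled_goal = 4
her_ratio = n_sampled_goal / (n_sampled_goal + 1)
print(her_ratio)  # 0.8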

add(obs, next_obs, action, reward, done, infos)[source]

Add elements to the buffer.

Return type

None

extend(*args, **kwargs)

Add a new batch of transitions to the buffer

Return type

None

reset()[source]

Reset the buffer.

Return type

None

sample(batch_size, env)[source]

Sampling function for the online sampling of HER transitions; it replaces the “regular” replay buffer sample() method in the train() function.

Parameters
  • batch_size (int) – Number of elements to sample

  • env (Optional[VecNormalize]) – Associated gym VecEnv to normalize the observations/rewards when sampling

Return type

Union[ReplayBufferSamples, Tuple[ndarray, …]]

Returns

Samples.

sample_goals(episode_indices, her_indices, transitions_indices)[source]

Sample goals based on goal_selection_strategy. This is a vectorized (fast) version.

Parameters
  • episode_indices (ndarray) – Episode indices to use.

  • her_indices (ndarray) – HER indices.

  • transitions_indices (ndarray) – Transition indices to use.

Return type

ndarray

Returns

Return sampled goals.

sample_offline(n_sampled_goal=None)[source]

Sampling function for the offline sampling of HER transitions; in that case, only one episode is used and the transitions are added to the regular replay buffer.

Parameters

n_sampled_goal (Optional[int]) – Number of sampled goals for replay

Return type

Union[ReplayBufferSamples, Tuple[ndarray, …]]

Returns

At most (n_sampled_goal * episode_length) HER transitions.

set_env(env)[source]

Sets the environment.

Parameters

env (ObsDictWrapper) – The environment to set.

Return type

None

size()[source]
Return type

int

Returns

The current number of transitions in the buffer.

store_episode()[source]

Increment episode counter and reset transition pointer.

Return type

None

static swap_and_flatten(arr)

Swap and then flatten axes 0 (buffer_size) and 1 (n_envs) to convert shape from [n_steps, n_envs, …] (where … is the shape of the features) to [n_steps * n_envs, …] (which maintains the order)

Parameters

arr (ndarray) –

Return type

ndarray
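
The equivalent shape transformation in plain NumPy (the shapes below are illustrative):

import numpy as np

arr = np.zeros((5, 2, 3))                   # [n_steps=5, n_envs=2, features=3]
out = arr.swapaxes(0, 1).reshape(5 * 2, 3)  # [n_steps * n_envs, features]
print(out.shape)  # (10, 3)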

to_torch(array, copy=True)

Convert a numpy array to a PyTorch tensor. Note: it copies the data by default

Parameters
  • array (ndarray) –

  • copy (bool) – Whether or not to copy the data (may be useful to avoid changing things by reference)

Return type

Tensor

Returns