Hindsight Experience Replay (HER)¶
HER is an algorithm that works with off-policy methods (e.g. DQN, SAC, TD3 and DDPG). HER exploits the fact that, even if a desired goal was not achieved, other goals may have been achieved during a rollout. It creates “virtual” transitions by relabeling transitions from past episodes, i.e. by replacing the desired goal.
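To make the relabeling idea concrete, here is a minimal, hypothetical sketch of the 'final' and 'future' strategies (the helper name relabel_episode and the episode layout are illustrative, not the actual SB3 implementation):

import numpy as np

def relabel_episode(episode, compute_reward, strategy="future"):
    """Create virtual HER transitions for one episode.

    episode is assumed to be a list of (obs, action, next_obs) tuples,
    where obs/next_obs are dicts with an "achieved_goal" key, and
    compute_reward matches the gym.GoalEnv signature.
    """
    virtual_transitions = []
    for i, (obs, action, next_obs) in enumerate(episode):
        if strategy == "final":
            # substitute the goal achieved at the very end of the episode
            new_goal = episode[-1][2]["achieved_goal"]
        else:
            # "future": substitute a goal achieved later in the same episode
            future_idx = np.random.randint(i, len(episode))
            new_goal = episode[future_idx][2]["achieved_goal"]
        # the reward must be recomputed for the substituted goal
        reward = compute_reward(next_obs["achieved_goal"], new_goal, {})
        virtual_transitions.append((obs, action, next_obs, new_goal, reward))
    return virtual_transitions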
Warning
HER requires the environment to inherit from gym.GoalEnv.
Warning
For performance reasons, the maximum number of steps per episode must be specified. In most cases, it will be inferred if you specify max_episode_steps when registering the environment, or if you use a gym.wrappers.TimeLimit wrapper (and env.spec is not None). Otherwise, you can directly pass max_episode_length to the model constructor.
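For example, either of the following standard gym mechanisms lets the limit be inferred (MyGoalEnv and my_module are placeholders for your own environment):

from gym.envs.registration import register
from gym.wrappers import TimeLimit

# Option 1: declare the limit when registering the environment
register(id="MyGoalEnv-v0", entry_point="my_module:MyGoalEnv", max_episode_steps=50)

# Option 2: wrap an already created environment instance
env = TimeLimit(MyGoalEnv(), max_episode_steps=50)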
Warning
HER supports the VecNormalize wrapper, but only when online_sampling=True.
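A minimal sketch of that combination (make_env is a placeholder factory returning your goal environment):

from stable_baselines3 import HER, SAC
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# normalization wrapper around a vectorized goal env
env = VecNormalize(DummyVecEnv([make_env]))
# online_sampling=True is required when using VecNormalize
model = HER('MlpPolicy', env, SAC, online_sampling=True, max_episode_length=100)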
Warning
Because it needs access to env.compute_reward(), HER must be loaded with the env. If you just want to use the trained policy without instantiating the environment, we recommend saving the policy only.
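A sketch of saving and reloading only the policy, assuming the generic policy save()/load() helpers and a SAC-based model (the file name is illustrative):

# save only the policy weights; no environment is needed to reload them
model.policy.save("her_sac_policy.pkl")

# later: restore the policy alone for inference (obs comes from your env)
from stable_baselines3.sac.policies import MlpPolicy
policy = MlpPolicy.load("her_sac_policy.pkl")
action, _ = policy.predict(obs, deterministic=True)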
Notes¶
Original paper: https://arxiv.org/abs/1707.01495
OpenAI paper: Plappert et al. (2018)
OpenAI blog post: https://openai.com/blog/ingredients-for-robotics-research/
Can I use?¶
Please refer to the documentation of the underlying model (DQN, SAC, TD3 or DDPG) for this section.
Example¶
from stable_baselines3 import HER, DDPG, DQN, SAC, TD3
from stable_baselines3.her.goal_selection_strategy import GoalSelectionStrategy
from stable_baselines3.common.bit_flipping_env import BitFlippingEnv
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.vec_env.obs_dict_wrapper import ObsDictWrapper
model_class = DQN  # works also with SAC, DDPG and TD3
N_BITS = 15

env = BitFlippingEnv(n_bits=N_BITS, continuous=model_class in [DDPG, SAC, TD3], max_steps=N_BITS)

# Available strategies (cf paper): future, final, episode
goal_selection_strategy = 'future'  # equivalent to GoalSelectionStrategy.FUTURE

# If True the HER transitions will get sampled online
online_sampling = True
# Time limit for the episodes
max_episode_length = N_BITS

# Initialize the model
model = HER('MlpPolicy', env, model_class, n_sampled_goal=4,
            goal_selection_strategy=goal_selection_strategy,
            online_sampling=online_sampling, verbose=1,
            max_episode_length=max_episode_length)

# Train the model
model.learn(1000)
model.save("./her_bit_env")

# Because it needs access to `env.compute_reward()`
# HER must be loaded with the env
model = HER.load('./her_bit_env', env=env)

obs = env.reset()
for _ in range(100):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, _ = env.step(action)

    if done:
        obs = env.reset()
Results¶
This implementation was tested on the parking env using 3 seeds.
The complete learning curves are available in the associated PR #120.
How to replicate the results?¶
Clone the rl-zoo repo:
git clone https://github.com/DLR-RM/rl-baselines3-zoo
cd rl-baselines3-zoo/
Run the benchmark:
python train.py --algo her --env parking-v0 --eval-episodes 10 --eval-freq 10000
Plot the results:
python scripts/all_plots.py -a her -e parking-v0 -f logs/ --no-million
Parameters¶
- class stable_baselines3.her.HER(policy, env, model_class, n_sampled_goal=4, goal_selection_strategy='future', online_sampling=False, max_episode_length=None, *args, **kwargs)
Hindsight Experience Replay (HER)
Paper: https://arxiv.org/abs/1707.01495
Warning
For performance reasons, the maximum number of steps per episode must be specified. In most cases, it will be inferred if you specify max_episode_steps when registering the environment, or if you use a gym.wrappers.TimeLimit wrapper (and env.spec is not None). Otherwise, you can directly pass max_episode_length to the model constructor.
For additional arguments specific to the underlying off-policy algorithm, please have a look at the corresponding documentation.
- Parameters
  - policy (Union[str, Type[BasePolicy]]) – The policy model to use.
  - env (Union[Env, VecEnv, str]) – The environment to learn from (if registered in Gym, can be str)
  - model_class (Type[OffPolicyAlgorithm]) – Off-policy model which will be used with hindsight experience replay. (SAC, TD3, DDPG, DQN)
  - n_sampled_goal (int) – Number of sampled goals for replay. (offline sampling)
  - goal_selection_strategy (Union[GoalSelectionStrategy, str]) – Strategy for sampling goals for replay. One of ['episode', 'final', 'future', 'random']
  - online_sampling (bool) – Sample HER transitions online.
  - learning_rate – Learning rate for the optimizer; it can be a function of the current progress remaining (from 1 to 0)
  - max_episode_length (Optional[int]) – The maximum length of an episode. If not specified, it will be automatically inferred if the environment uses a gym.wrappers.TimeLimit wrapper.
- collect_rollouts(env, callback, train_freq, action_noise=None, learning_starts=0, log_interval=None)
Collect experiences and store them into a ReplayBuffer.
- Parameters
  - env (VecEnv) – The training environment
  - callback (BaseCallback) – Callback that will be called at each step (and at the beginning and end of the rollout)
  - train_freq (TrainFreq) – How much experience to collect by doing rollouts of the current policy. Either TrainFreq(<n>, TrainFrequencyUnit.STEP) or TrainFreq(<n>, TrainFrequencyUnit.EPISODE), with <n> being an integer greater than 0.
  - action_noise (Optional[ActionNoise]) – Action noise that will be used for exploration. Required for deterministic policies (e.g. TD3). This can also be used in addition to the stochastic policy for SAC.
  - learning_starts (int) – Number of steps before learning for the warm-up phase.
  - log_interval (Optional[int]) – Log data every log_interval episodes
- Return type
  RolloutReturn
- learn(total_timesteps, callback=None, log_interval=4, eval_env=None, eval_freq=-1, n_eval_episodes=5, tb_log_name='HER', eval_log_path=None, reset_num_timesteps=True)
Return a trained model.
- Parameters
  - total_timesteps (int) – The total number of samples (env steps) to train on
  - callback (Union[None, Callable, List[BaseCallback], BaseCallback]) – Callback(s) called at every step with the state of the algorithm.
  - log_interval (int) – The number of timesteps before logging.
  - tb_log_name (str) – The name of the run for TensorBoard logging
  - eval_env (Union[Env, VecEnv, None]) – Environment that will be used to evaluate the agent
  - eval_freq (int) – Evaluate the agent every eval_freq timesteps (this may vary a little)
  - n_eval_episodes (int) – Number of episodes to evaluate the agent
  - eval_log_path (Optional[str]) – Path to a folder where the evaluations will be saved
  - reset_num_timesteps (bool) – Whether or not to reset the current timestep number (used in logging)
- Return type
  BaseAlgorithm
- Returns
  the trained model
- classmethod load(path, env=None, device='auto', **kwargs)
Load the model from a zip-file.
- Parameters
  - path (Union[str, Path, BufferedIOBase]) – path to the file (or a file-like) where to load the agent from
  - env (Union[Env, VecEnv, None]) – the new environment to run the loaded model on (can be None if you only need prediction from a trained model); has priority over any saved environment
  - device (Union[device, str]) – Device on which the code should run.
  - kwargs – extra arguments to change the model when loading
- Return type
  BaseAlgorithm
- load_replay_buffer(path, truncate_last_trajectory=True)
Load a replay buffer from a pickle file and set the environment for the replay buffer (online sampling only).
- Parameters
  - path (Union[str, Path, BufferedIOBase]) – Path to the pickled replay buffer.
  - truncate_last_trajectory (bool) – Only for online sampling. If set to True, we assume that the last trajectory in the replay buffer was finished. If set to False, we assume that we continue the same trajectory (same episode).
- Return type
  None
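For instance, assuming the matching save_replay_buffer() helper from the off-policy base class is available on the model (the path is illustrative):

# persist the current buffer to disk
model.save_replay_buffer("her_replay_buffer.pkl")

# restore it into a freshly created model; truncate_last_trajectory=True
# assumes the last stored trajectory was finished
model.load_replay_buffer("her_replay_buffer.pkl", truncate_last_trajectory=True)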
- predict(observation, state=None, mask=None, deterministic=False)
Get the model's action(s) from an observation.
- Parameters
  - observation (ndarray) – the input observation
  - state (Optional[ndarray]) – The last states (can be None, used in recurrent policies)
  - mask (Optional[ndarray]) – The last masks (can be None, used in recurrent policies)
  - deterministic (bool) – Whether or not to return deterministic actions.
- Return type
  Tuple[ndarray, Optional[ndarray]]
- Returns
  the model's action and the next state (used in recurrent policies)
- save(path, exclude=None, include=None)
Save all the attributes of the object and the model parameters in a zip-file.
- Parameters
  - path (Union[str, Path, BufferedIOBase]) – path to the file where the RL agent should be saved
  - exclude (Optional[Iterable[str]]) – name of parameters that should be excluded in addition to the default ones
  - include (Optional[Iterable[str]]) – name of parameters that might be excluded but should be included anyway
- Return type
  None
Goal Selection Strategies¶
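The strategies are exposed as the GoalSelectionStrategy enum; as in the example above, the lowercase string names can be used interchangeably (env is assumed to be a goal environment created beforehand):

from stable_baselines3 import HER, DQN
from stable_baselines3.her.goal_selection_strategy import GoalSelectionStrategy

# these two constructions are equivalent
model = HER('MlpPolicy', env, DQN, goal_selection_strategy=GoalSelectionStrategy.FUTURE)
model = HER('MlpPolicy', env, DQN, goal_selection_strategy='future')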
Obs Dict Wrapper¶
- class stable_baselines3.her.ObsDictWrapper(venv)
Wrapper for a VecEnv which overrides the observation space for Hindsight Experience Replay to support dict observations.
- Parameters
  - venv – The vectorized environment to wrap.
- close()
Clean up the environment's resources.
- Return type
  None
- static convert_dict(observation_dict, observation_key='observation', goal_key='desired_goal')
Concatenate the observation and the (desired) goal of an observation dict.
- Parameters
  - observation_dict (Dict[str, ndarray]) – Dictionary with observation.
  - observation_key (str) – Key of observation in dictionary.
  - goal_key (str) – Key of (desired) goal in dictionary.
- Return type
  ndarray
- Returns
  Concatenated observation.
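A small usage sketch of convert_dict with the default keys (the values are illustrative):

import numpy as np
from stable_baselines3.common.vec_env.obs_dict_wrapper import ObsDictWrapper

obs_dict = {
    "observation": np.array([0.1, 0.2]),
    "achieved_goal": np.array([0.0, 0.0]),
    "desired_goal": np.array([1.0, 1.0]),
}
# concatenates "observation" and "desired_goal": array([0.1, 0.2, 1.0, 1.0])
flat_obs = ObsDictWrapper.convert_dict(obs_dict)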
- env_is_wrapped(wrapper_class, indices=None)
Check if environments are wrapped with a given wrapper.
- Parameters
  - wrapper_class – The class of the wrapper to check for.
  - indices (Union[None, int, Iterable[int]]) – Indices of envs to check
- Return type
  List[bool]
- Returns
  True if the env is wrapped, False otherwise, for each env queried.
- env_method(method_name, *method_args, indices=None, **method_kwargs)
Call instance methods of vectorized environments.
- Parameters
  - method_name (str) – The name of the environment method to invoke.
  - indices (Union[None, int, Iterable[int]]) – Indices of envs whose method to call
  - method_args – Any positional arguments to provide in the call
  - method_kwargs – Any keyword arguments to provide in the call
- Return type
  List[Any]
- Returns
  List of items returned by the environment's method call
- get_attr(attr_name, indices=None)
Return attribute from vectorized environment.
- Parameters
  - attr_name (str) – The name of the attribute whose value to return
  - indices (Union[None, int, Iterable[int]]) – Indices of envs to get attribute from
- Return type
  List[Any]
- Returns
  List of values of 'attr_name' in all environments
- get_images()
Return RGB images from each environment.
- Return type
  Sequence[ndarray]
- getattr_depth_check(name, already_found)
See base class.
- Return type
  str
- Returns
  name of the module whose attribute is being shadowed, if any.
- getattr_recursive(name)
Recursively check wrappers to find an attribute.
- Parameters
  - name (str) – name of attribute to look for
- Return type
  Any
- Returns
  attribute
- render(mode='human')
Gym environment rendering.
- Parameters
  - mode (str) – the rendering type
- Return type
  Optional[ndarray]
- reset()
Reset all the environments and return an array of observations, or a tuple of observation arrays.
If step_async is still doing work, that work will be cancelled and step_wait() should not be called until step_async() is invoked again.
- Returns
  observation
- seed(seed=None)
Sets the random seeds for all environments, based on a given seed. Each individual environment will still get its own seed, by incrementing the given seed.
- Parameters
  - seed (Optional[int]) – The random seed. May be None for completely random seeding.
- Return type
  List[Optional[int]]
- Returns
  A list containing the seeds for each individual env. Note that all list elements may be None, if the env does not return anything when being seeded.
- set_attr(attr_name, value, indices=None)
Set attribute inside vectorized environments.
- Parameters
  - attr_name (str) – The name of the attribute to assign a new value to
  - value (Any) – Value to assign to attr_name
  - indices (Union[None, int, Iterable[int]]) – Indices of envs to assign the value to
- Return type
  None
- step(actions)
Step the environments with the given actions.
- Parameters
  - actions (ndarray) – the action
- Return type
  Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]], ndarray, ndarray, List[Dict]]
- Returns
  observation, reward, done, information
- step_async(actions)
Tell all the environments to start taking a step with the given actions. Call step_wait() to get the results of the step.
You should not call this if a step_async run is already pending.
- Return type
  None
HER Replay Buffer¶
- class stable_baselines3.her.HerReplayBuffer(env, buffer_size, max_episode_length, goal_selection_strategy, observation_space, action_space, device='cpu', n_envs=1, her_ratio=0.8)
Replay buffer for sampling HER (Hindsight Experience Replay) transitions. In the online sampling case, these new transitions will not be saved in the replay buffer and will only be created at sampling time.
- Parameters
  - env (ObsDictWrapper) – The training environment
  - buffer_size (int) – The size of the buffer measured in transitions.
  - max_episode_length (int) – The maximum length of an episode (time horizon)
  - goal_selection_strategy (GoalSelectionStrategy) – Strategy for sampling goals for replay. One of ['episode', 'final', 'future']
  - observation_space (Space) – Observation space
  - action_space (Space) – Action space
  - device (Union[device, str]) – PyTorch device
  - n_envs (int) – Number of parallel environments
  - her_ratio (float) – The ratio of HER transitions to regular transitions, as a fraction between 0 and 1 (for online sampling). The default value her_ratio=0.8 corresponds to 4 virtual transitions for one real transition (4 / (4 + 1) = 0.8).
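The relation between her_ratio and n_sampled_goal can be sanity-checked in one line:

n_sampled_goal = 4
her_ratio = n_sampled_goal / (n_sampled_goal + 1)  # 4 virtual per 1 real transition -> 0.8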
- add(obs, next_obs, action, reward, done, infos)
Add elements to the buffer.
- Return type
  None
- extend(*args, **kwargs)
Add a new batch of transitions to the buffer.
- Return type
  None
- sample(batch_size, env)
Sample function for online sampling of HER transitions; this replaces the “regular” replay buffer sample() method in the train() function.
- Parameters
  - batch_size (int) – Number of elements to sample
  - env (Optional[VecNormalize]) – Associated gym VecEnv to normalize the observations/rewards when sampling
- Return type
  Union[ReplayBufferSamples, Tuple[ndarray, ...]]
- Returns
  Samples.
- sample_goals(episode_indices, her_indices, transitions_indices)
Sample goals based on goal_selection_strategy. This is a vectorized (fast) version.
- Parameters
  - episode_indices (ndarray) – Episode indices to use.
  - her_indices (ndarray) – HER indices.
  - transitions_indices (ndarray) – Transition indices to use.
- Return type
  ndarray
- Returns
  Sampled goals.
- sample_offline(n_sampled_goal=None)
Sample function for offline sampling of HER transitions; in this case, only one episode is used and the transitions are added to the regular replay buffer.
- Parameters
  - n_sampled_goal (Optional[int]) – Number of sampled goals for replay
- Return type
  Union[ReplayBufferSamples, Tuple[ndarray, ...]]
- Returns
  at most (n_sampled_goal * episode_length) HER transitions.
- static swap_and_flatten(arr)
Swap and then flatten axes 0 (buffer_size) and 1 (n_envs) to convert shape from [n_steps, n_envs, ...] (where ... is the shape of the features) to [n_steps * n_envs, ...] (which maintains the order).
- Parameters
  - arr (ndarray) –
- Return type
  ndarray
- to_torch(array, copy=True)
Convert a numpy array to a PyTorch tensor. Note: it copies the data by default.
- Parameters
  - array (ndarray) –
  - copy (bool) – Whether or not to copy the data (may be useful to avoid changing things by reference)
- Return type
  Tensor