RL Algorithms
This table displays the RL algorithms that are implemented in the Stable Baselines3 project, along with some useful characteristics: support for discrete/continuous actions, multiprocessing.
| Name | Box | Discrete | MultiDiscrete | MultiBinary | Multi Processing |
|---|---|---|---|---|---|
| ARS [1] | ✔️ | ✔️ | ❌ | ❌ | ✔️ |
| A2C | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| CrossQ [1] | ✔️ | ❌ | ❌ | ❌ | ✔️ |
| DDPG | ✔️ | ❌ | ❌ | ❌ | ✔️ |
| DQN | ❌ | ✔️ | ❌ | ❌ | ✔️ |
| HER | ✔️ | ✔️ | ❌ | ❌ | ✔️ |
| PPO | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| QR-DQN [1] | ❌ | ✔️ | ❌ | ❌ | ✔️ |
| RecurrentPPO [1] | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| SAC | ✔️ | ❌ | ❌ | ❌ | ✔️ |
| TD3 | ✔️ | ❌ | ❌ | ❌ | ✔️ |
| TQC [1] | ✔️ | ❌ | ❌ | ❌ | ✔️ |
| TRPO [1] | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| Maskable PPO [1] | ❌ | ✔️ | ✔️ | ✔️ | ✔️ |
Note
Tuple observation spaces are not supported by any algorithm;
however, single-level Dict spaces are (cf. Examples).
Actions gym.spaces:
- Box: An N-dimensional box that contains every point in the action space.
- Discrete: A list of possible actions, where at each timestep only one of the actions can be used.
- MultiDiscrete: A list of possible actions, where at each timestep only one action of each discrete set can be used.
- MultiBinary: A list of possible actions, where at each timestep any of the actions can be used in any combination.
Note
More algorithms (like QR-DQN or TQC) are implemented in our contrib repo and in our SBX (SB3 + Jax) repo (DroQ, CrossQ, SimBa, …).
Note
Some logging values (like ep_rew_mean, ep_len_mean) are only available when using a Monitor wrapper.
See Issue #339 for more info.
Note
When using off-policy algorithms, Time Limits (aka timeouts) are handled
properly (cf. issue #284).
You can revert to SB3 < 2.1.0 behavior by passing handle_timeout_termination=False
via the replay_buffer_kwargs argument.
Reproducibility
Completely reproducible results are not guaranteed across PyTorch releases or different platforms. Furthermore, results need not be reproducible between CPU and GPU executions, even when using identical seeds.
To make computations deterministic on your specific problem on one specific platform,
you need to pass a seed argument when creating the model.
If you pass an environment to the model using set_env(), then you also need to seed the environment first.
Credit: part of the Reproducibility section comes from PyTorch Documentation
Training exceeds total_timesteps
When you train an agent using SB3, you pass a total_timesteps parameter to the learn() method which defines the training budget for the agent (how many interactions with the environment are allowed).
For example:
```python
from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1").learn(total_timesteps=1_000)
```
Because of the way the algorithms work, total_timesteps is a lower bound (see issue #1150).
In the example above, PPO will effectively collect n_steps * n_envs = 2048 * 1 steps despite total_timesteps=1_000.
In more detail:
- PPO/A2C and derivatives collect n_steps * n_envs steps of experience before performing an update, so if you want to have exactly total_timesteps, you will need to adjust those values
- SAC/DQN/TD3 and other off-policy algorithms collect train_freq * n_envs steps before doing an update (when train_freq is in steps and not episodes), so if you want to have exactly total_timesteps you have to adjust these values (train_freq=4 by default for DQN)
- ARS and other population-based algorithms evaluate the policy for n_episodes with n_envs, so unless the number of steps per episode is fixed, it is not possible to exactly achieve total_timesteps
- when using multiple envs, each call to env.step() corresponds to n_envs timesteps, so it is no longer possible to use the EvalCallback at an exact timestep
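The arithmetic behind the first two cases can be sketched in plain Python. The helper below is hypothetical (it is not part of SB3) and assumes that training stops at the first rollout boundary at or after total_timesteps, and that no callback ends training early.

```python
import math


def effective_timesteps(total_timesteps: int, steps_per_rollout: int, n_envs: int = 1) -> int:
    """Hypothetical helper: smallest multiple of steps_per_rollout * n_envs
    that is greater than or equal to total_timesteps."""
    steps_between_updates = steps_per_rollout * n_envs
    n_rollouts = math.ceil(total_timesteps / steps_between_updates)
    return n_rollouts * steps_between_updates


# PPO with n_steps=2048 and a single environment
print(effective_timesteps(1_000, steps_per_rollout=2048))  # 2048

# DQN with train_freq=4 (in steps) and a single environment
print(effective_timesteps(1_000, steps_per_rollout=4))  # 1000

# Same train_freq with 3 environments: 12 steps between updates
print(effective_timesteps(1_000, steps_per_rollout=4, n_envs=3))  # 1008
```

Choosing total_timesteps as a multiple of n_steps * n_envs (or train_freq * n_envs) makes the two numbers coincide under these assumptions.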