Changelog¶
Release 1.4.0 (2022-01-18)¶
TRPO, ARS and multi env training for off-policy algorithms
Breaking Changes:¶
Dropped Python 3.6 support (as announced in the previous release)
Renamed the mask argument of the predict() method to episode_start (used with RNN policies only)
Local variables action, done and reward were renamed to their plural form for off-policy algorithms (actions, dones, rewards); this may affect custom callbacks.
Removed the episode_reward field from the RolloutReturn() type
Warning
An update to the HER algorithm is planned to support multi-env training and remove the max episode length constraint (see PR #704).
This will be a backward-incompatible change (models trained with a previous version of HER won’t work with the new version).
New Features:¶
Added norm_obs_keys param for VecNormalize wrapper to configure which observation keys to normalize (@kachayev)
Added experimental support to train off-policy algorithms with multiple envs (note: HerReplayBuffer currently not supported)
Handle timeout termination properly for on-policy algorithms (when using TimeLimit)
Added skip option to VecTransposeImage to skip transforming the channel order when the heuristic is wrong
Added copy() and combine() methods to RunningMeanStd
SB3-Contrib¶
Added Trust Region Policy Optimization (TRPO) (@cyprienc)
Added Augmented Random Search (ARS) (@sgillen)
Coming soon: PPO LSTM, see https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/pull/53
Bug Fixes:¶
Fixed a bug where set_env() with VecNormalize would result in an error with off-policy algorithms (thanks @cleversonahum)
FPS calculation is now performed based on number of steps performed during last learn call, even when reset_num_timesteps is set to False (@kachayev)
Fixed evaluation script for recurrent policies (experimental feature in SB3 contrib)
Fixed a bug where the observation would be incorrectly detected as non-vectorized instead of throwing an error
The env checker now properly checks and warns about potential issues for continuous action spaces when the boundaries are too small or when the dtype is not float32
Fixed a bug in VecFrameStack with channel first image envs, where the terminal observation would be wrongly created.
Deprecations:¶
Others:¶
Added a warning in the env checker when not using np.float32 for continuous actions
Improved test coverage and error message when checking shape of observation
Added newline="\n" when opening CSV monitor files so that each line ends with \r\n instead of \r\r\n on Windows while Linux environments are not affected (@hsuehch)
Fixed device argument inconsistency (@qgallouedec)
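The line-ending change for CSV monitor files noted in this list can be reproduced on any platform. The following is a minimal sketch, not the Monitor implementation: it wraps an in-memory byte buffer in io.TextIOWrapper and passes newline="\r\n" to imitate the newline translation that Windows text mode applies by default. The helper name write_rows and the column names are hypothetical.

```python
import csv
import io


def write_rows(newline):
    """Write one CSV row through a text wrapper and return the raw bytes.

    Hypothetical helper, for illustration only.
    """
    raw = io.BytesIO()
    # newline="\r\n" imitates Windows text mode: every "\n" written becomes "\r\n".
    # newline="\n" disables that translation.
    text = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    # csv.writer terminates each row with "\r\n" by default.
    writer = csv.writer(text)
    writer.writerow(["r", "l", "t"])
    text.flush()
    data = raw.getvalue()
    text.detach()  # release the wrapper without closing the buffer
    return data


windows_like = write_rows("\r\n")  # translation doubles the carriage return
fixed = write_rows("\n")           # row terminator is written unchanged
print(windows_like)
print(fixed)
```

With the translation disabled, the "\r\n" row terminator emitted by the csv module reaches the file as-is, which is why Linux output is unaffected by the change.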
Documentation:¶
Add drivergym to projects page (@theDebugger811)
Add highway-env to projects page (@eleurent)
Add tactile-gym to projects page (@ac-93)
Fix indentation in the RL tips page (@cove9988)
Update GAE computation docstring
Add documentation on exporting to TFLite/Coral
Added JMLR paper and updated citation
Added link to RL Tips and Tricks video
Updated BaseAlgorithm.load docstring (@Demetrio92)
Added a note on load behavior in the examples (@Demetrio92)
Updated SB3 Contrib doc
Fixed A2C and migration guide guidance on how to set epsilon with RMSpropTFLike (@thomasgubler)
Fixed custom policy documentation (@IperGiove)
Added doc on Weights & Biases integration
Release 1.3.0 (2021-10-23)¶
Bug fixes and improvements for the user
Warning
This version will be the last one supporting Python 3.6 (end of life in Dec 2021). We highly recommend that you upgrade to Python >= 3.7.
Breaking Changes:¶
sde_net_arch argument in policies is deprecated and will be removed in a future version.
_get_latent (ActorCriticPolicy) was removed
All logging keys now use underscores instead of spaces (@timokau). Concretely this changes:
time/total timesteps to time/total_timesteps for off-policy algorithms and the eval callback (on-policy algorithms such as PPO and A2C already used the underscored version),
rollout/exploration rate to rollout/exploration_rate and
rollout/success rate to rollout/success_rate.
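For custom callbacks or plotting scripts that still look up the old key names, the renaming rule listed above amounts to replacing spaces with underscores. A minimal sketch (the helper migrate_key is hypothetical and not part of SB3):

```python
def migrate_key(old_key: str) -> str:
    """Convert a pre-1.3.0 logging key to its underscored form (hypothetical helper)."""
    return old_key.replace(" ", "_")


old_keys = ["time/total timesteps", "rollout/exploration rate", "rollout/success rate"]
new_keys = [migrate_key(key) for key in old_keys]
print(new_keys)
```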
New Features:¶
Added methods get_distribution and predict_values for ActorCriticPolicy for A2C/PPO/TRPO (@cyprienc)
Added methods forward_actor and forward_critic for MlpExtractor
Added sb3.get_system_info() helper function to gather version information relevant to SB3 (e.g., Python and PyTorch version)
Saved models now store system information where agent was trained, and load functions have print_system_info parameter to help debugging load issues
Bug Fixes:¶
Fixed dtype of observations for SimpleMultiObsEnv
Allow VecNormalize to wrap discrete-observation environments to normalize reward when observation normalization is disabled
Fixed a bug where DQN would throw an error when using Discrete observation and stochastic actions
Fixed a bug where sub-classed observation spaces could not be used
Added force_reset argument to load() and set_env() in order to be able to call learn(reset_num_timesteps=False) with a new environment
Deprecations:¶
Others:¶
Cap gym max version to 0.19 to avoid issues with atari-py and other breaking changes
Improved error message when using dict observation with the wrong policy
Improved error message when using EvalCallback with two envs not wrapped the same way.
Added additional infos about supported python version for PyPi in setup.py
Documentation:¶
Add Rocket League Gym to list of supported projects (@AechPro)
Added gym-electric-motor to project page (@wkirgsn)
Added policy-distillation-baselines to project page (@CUN-bjy)
Added ONNX export instructions (@batu)
Update read the doc env (fixed docutils issue)
Fix PPO environment name (@IljaAvadiev)
Fix custom env doc and add env registration example
Update algorithms from SB3 Contrib
Use underscores for numeric literals in examples to improve clarity
Release 1.2.0 (2021-09-03)¶
Hotfix for VecNormalize, training/eval mode support
Breaking Changes:¶
SB3 now requires PyTorch >= 1.8.1
VecNormalize ret attribute was renamed to returns
New Features:¶
Bug Fixes:¶
Hotfix for VecNormalize where the observation filter was not updated at reset (thanks @vwxyzjn)
Fixed model predictions when using batch normalization and dropout layers by calling train() and eval() (@davidblom603)
Fixed model training for DQN, TD3 and SAC so that their target nets always remain in evaluation mode (@ayeright)
Passing gradient_steps=0 to an off-policy algorithm will result in no gradient steps being taken (vs as many gradient steps as steps done in the environment during the rollout in previous versions)
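The gradient_steps change in this list can be illustrated with a small sketch. It is not the library's code: the function name resolve_gradient_steps is hypothetical, and the sketch assumes that a negative value (such as -1) is how "as many gradient steps as environment steps" is requested after this change.

```python
def resolve_gradient_steps(gradient_steps: int, env_steps_in_rollout: int) -> int:
    """Return how many gradient updates to run after a rollout (hypothetical helper)."""
    if gradient_steps >= 0:
        # Zero is now taken literally: no update at all.
        return gradient_steps
    # Assumption: a negative value means "match the number of environment steps".
    return env_steps_in_rollout


print(resolve_gradient_steps(0, env_steps_in_rollout=4))
print(resolve_gradient_steps(-1, env_steps_in_rollout=4))
print(resolve_gradient_steps(2, env_steps_in_rollout=4))
```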
Deprecations:¶
Others:¶
Enabled Python 3.9 in GitHub CI
Fixed type annotations
Refactored predict() by moving the preprocessing to obs_to_tensor() method
Documentation:¶
Updated multiprocessing example
Added example of VecEnvWrapper
Added a note about logging to tensorboard more often
Added warning about simplicity of examples and link to RL zoo (@MihaiAnca13)
Release 1.1.0 (2021-07-01)¶
Dict observation support, timeout handling and refactored HER buffer
Breaking Changes:¶
All custom environments (e.g. the BitFlippingEnv or IdentityEnv) were moved to stable_baselines3.common.envs folder
Refactored HER which is now the HerReplayBuffer class that can be passed to any off-policy algorithm
Handle timeout termination properly for off-policy algorithms (when using TimeLimit)
Renamed _last_dones and dones to _last_episode_starts and episode_starts in RolloutBuffer.
Removed ObsDictWrapper as Dict observation spaces are now supported
her_kwargs = dict(n_sampled_goal=2, goal_selection_strategy="future", online_sampling=True)
# SB3 < 1.1.0
# model = HER("MlpPolicy", env, model_class=SAC, **her_kwargs)
# SB3 >= 1.1.0:
model = SAC("MultiInputPolicy", env, replay_buffer_class=HerReplayBuffer, replay_buffer_kwargs=her_kwargs)
Updated the KL Divergence estimator in the PPO algorithm to be positive definite and have lower variance (@09tangriro)
Updated the KL Divergence check in the PPO algorithm to be before the gradient update step rather than after end of epoch (@09tangriro)
Removed parameter channels_last from is_image_space as it can be inferred.
The logger object is now an attribute model.logger that can be set by the user using model.set_logger()
Changed the signature of logger.configure and utils.configure_logger, they now return a Logger object
Removed Logger.CURRENT and Logger.DEFAULT
Moved warn(), debug(), log(), info(), dump() methods to the Logger class.
learn() now throws an import error when the user tries to log to tensorboard but the package is not installed
New Features:¶
Added support for single-level Dict observation space (@JadenTravnik)
Added DictRolloutBuffer and DictReplayBuffer to support dictionary observations (@JadenTravnik)
Added StackedObservations and StackedDictObservations that are used within VecFrameStack
Added simple 4x4 room Dict test environments
HerReplayBuffer now supports VecNormalize when online_sampling=False
Added VecMonitor and VecExtractDictObs wrappers to handle gym3-style vectorized environments (@vwxyzjn)
Ignored the terminal observation if it is not provided by the environment, as with gym3-style vectorized environments. (@vwxyzjn)
Added policy_base as input to the OnPolicyAlgorithm for more flexibility (@09tangriro)
Added support for image observation when using HER
Added replay_buffer_class and replay_buffer_kwargs arguments to off-policy algorithms
Added kl_divergence helper for Distribution classes (@09tangriro)
Added support for vector environments with num_envs > 1 (@benblack769)
Added wrapper_kwargs argument to make_vec_env (@amy12xx)
Bug Fixes:¶
Fixed potential issue when calling off-policy algorithms with default arguments multiple times (the size of the replay buffer would be the same)
Fixed loading of ent_coef for SAC and TQC, it was not optimized anymore (thanks @Atlis)
Fixed saving of A2C and PPO policy when using gSDE (thanks @liusida)
Fixed a bug where no output would be shown even if verbose>=1 after passing verbose=0 once
Fixed observation buffers dtype in DictReplayBuffer (@c-rizz)
Fixed EvalCallback tensorboard logs being logged with the incorrect timestep. They are now written with the timestep at which they were recorded. (@skandermoalla)
Deprecations:¶
Others:¶
Added flake8-bugbear to tests dependencies to find likely bugs
Updated env_checker to reflect support of dict observation spaces
Added Code of Conduct
Added tests for GAE and lambda return computation
Updated distribution entropy test (thanks @09tangriro)
Added sanity check batch_size > 1 in PPO to avoid NaN in advantage normalization
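The batch_size > 1 sanity check in this list guards against a degenerate standard deviation. The sketch below is a plain-Python illustration, not PPO's code. It assumes the normalization divides by a Bessel-corrected (n - 1) standard deviation, which is undefined for a single sample; the function name and the eps default are hypothetical.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Standardize advantages with a sample (n - 1) standard deviation (illustrative)."""
    n = len(advantages)
    mean = sum(advantages) / n
    if n < 2:
        # Assumption: with one sample the (n - 1) divisor is zero, so the
        # standard deviation is undefined and is represented as NaN here.
        std = math.nan
    else:
        std = math.sqrt(sum((a - mean) ** 2 for a in advantages) / (n - 1))
    return [(a - mean) / (std + eps) for a in advantages]


single = normalize_advantages([1.0])    # NaN propagates into the result
pair = normalize_advantages([1.0, 3.0])  # well defined for two or more samples
print(single, pair)
```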
Documentation:¶
Added gym pybullet drones project (@JacopoPan)
Added link to SuperSuit in projects (@justinkterry)
Fixed DQN example (thanks @ltbd78)
Clarified channel-first/channel-last recommendation
Update sphinx environment installation instructions (@tom-doerr)
Clarified pip installation in Zsh (@tom-doerr)
Clarified return computation for on-policy algorithms (TD(lambda) estimate was used)
Added example for using ProcgenEnv
Added note about advanced custom policy example for off-policy algorithms
Fixed DQN unicode checkmarks
Updated migration guide (@juancroldan)
Pinned docutils==0.16 to avoid issue with rtd theme
Clarified callback save_freq definition
Added doc on how to pass a custom logger
Remove recurrent policies from A2C docs (@bstee615)
Release 1.0 (2021-03-15)¶
First Major Version
Breaking Changes:¶
Removed stable_baselines3.common.cmd_util (already deprecated), please use env_util instead
New Features:¶
Added support for custom_objects when loading models
Bug Fixes:¶
Fixed a bug with DQN predict method when using deterministic=False with image space
Documentation:¶
Fixed examples
Added new project using SB3: rl_reach (@PierreExeter)
Added note about slow-down when switching to PyTorch
Add a note on continual learning and resetting environment
Others:¶
Updated RL-Zoo to reflect the fact that it is more than a collection of trained agents
Added images to illustrate the training loop and custom policies (created with https://excalidraw.com/)
Updated the custom policy section
Pre-Release 0.11.1 (2021-02-27)¶
Bug Fixes:¶
Fixed a bug where train_freq was not properly converted when loading a saved model
Pre-Release 0.11.0 (2021-02-27)¶
Breaking Changes:¶
evaluate_policy now returns rewards/episode lengths from a Monitor wrapper if one is present; this allows returning the unnormalized reward in the case of Atari games for instance.
Renamed common.vec_env.is_wrapped to common.vec_env.is_vecenv_wrapped to avoid confusion with the new is_wrapped() helper
Renamed _get_data() to _get_constructor_parameters() for policies (this affects independent saving/loading of policies)
replay_buffer in collect_rollout is no longer optional
Removed n_episodes_rollout and merged it with train_freq, which now accepts a tuple (frequency, unit):
# SB3 < 0.11.0
# model = SAC("MlpPolicy", env, n_episodes_rollout=1, train_freq=-1)
# SB3 >= 0.11.0:
model = SAC("MlpPolicy", env, train_freq=(1, "episode"))
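As a rough illustration of the (frequency, unit) form, the sketch below normalizes a train_freq value into a tuple. It is not the library's implementation: the function name normalize_train_freq is hypothetical, and it assumes that a bare integer counts environment steps and that the accepted units are "step" and "episode".

```python
from typing import Tuple, Union

VALID_UNITS = ("step", "episode")


def normalize_train_freq(train_freq: Union[int, Tuple[int, str]]) -> Tuple[int, str]:
    """Return train_freq as a (frequency, unit) tuple (hypothetical helper)."""
    if isinstance(train_freq, int):
        # Assumption: a bare integer is a number of environment steps.
        return (train_freq, "step")
    frequency, unit = train_freq
    if unit not in VALID_UNITS:
        raise ValueError(f"unit must be one of {VALID_UNITS}, got {unit!r}")
    return (int(frequency), unit)


print(normalize_train_freq(4))
print(normalize_train_freq((1, "episode")))
```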
New Features:¶
Add support for VecFrameStack to stack on first or last observation dimension, along with automatic check for image spaces.
VecFrameStack now has a channels_order argument to tell if observations should be stacked on the first or last observation dimension (originally always stacked on last).
Added common.env_util.is_wrapped and common.env_util.unwrap_wrapper functions for checking/unwrapping an environment for specific wrapper.
Added env_is_wrapped() method for VecEnv to check if its environments are wrapped with given Gym wrappers.
Added monitor_kwargs parameter to make_vec_env and make_atari_env
Wrap the environments automatically with a Monitor wrapper when possible.
EvalCallback now logs the success rate when available (is_success must be present in the info dict)
Added new wrappers to log images and matplotlib figures to tensorboard. (@zampanteymedio)
Add support for text records to Logger. (@lorenz-h)
Bug Fixes:¶
Fixed bug where code added VecTranspose on channel-first image environments (thanks @qxcv)
Fixed DQN predict method when using single gym.Env with deterministic=False
Fixed bug that the arguments order of explained_variance() in ppo.py and a2c.py is not correct (@thisray)
Fixed bug where full HerReplayBuffer leads to an index error. (@megan-klaiber)
Fixed bug where replay buffer could not be saved if it was too big (> 4 Gb) for python<3.8 (thanks @hn2)
Added informative PPO construction error in edge-case scenario where n_steps * n_envs = 1 (size of rollout buffer), which otherwise causes downstream breaking errors in training (@decodyng)
Fixed discrete observation space support when using multiple envs with A2C/PPO (thanks @ardabbour)
Fixed a bug for TD3 delayed update (the update was off-by-one and not delayed when train_freq=1)
Fixed numpy warning (replaced np.bool with bool)
Fixed a bug where VecNormalize was not normalizing the terminal observation
Fixed a bug where VecTranspose was not transposing the terminal observation
Fixed a bug where the terminal observation stored in the replay buffer was not the right one for off-policy algorithms
Fixed a bug where action_noise was not used when using HER (thanks @ShangqunYu)
Deprecations:¶
Others:¶
Add more issue templates
Add signatures to callable type annotations (@ernestum)
Improve error message in NatureCNN
Added checks for supported action spaces to improve clarity of error messages for the user
Renamed variables in the train() method of SAC, TD3 and DQN to match SB3-Contrib.
Updated docker base image to Ubuntu 18.04
Set tensorboard min version to 2.2.0 (earlier versions are apparently not working with PyTorch)
Added warning for PPO when n_steps * n_envs is not a multiple of batch_size (last mini-batch truncated) (@decodyng)
Removed some warnings in the tests
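The PPO warning in this list concerns a rollout buffer of n_steps * n_envs samples that does not split evenly into mini-batches. A minimal sketch of the arithmetic (the function minibatch_sizes is hypothetical and only illustrates the sizes involved, not how PPO samples its batches):

```python
def minibatch_sizes(n_steps: int, n_envs: int, batch_size: int) -> list:
    """List the mini-batch sizes obtained from one pass over the rollout buffer."""
    buffer_size = n_steps * n_envs
    full_batches, remainder = divmod(buffer_size, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        # The last mini-batch is truncated when the buffer size is not a multiple.
        sizes.append(remainder)
    return sizes


print(minibatch_sizes(n_steps=8, n_envs=2, batch_size=4))
print(minibatch_sizes(n_steps=5, n_envs=2, batch_size=4))
```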
Documentation:¶
Updated algorithm table
Minor docstring improvements regarding rollout (@stheid)
Fix migration doc for A2C (epsilon parameter)
Fix clip_range docstring
Fix duplicated parameter in EvalCallback docstring (thanks @tfederico)
Added example of learning rate schedule
Added SUMO-RL as example project (@LucasAlegre)
Fix docstring of classes in atari_wrappers.py which were inside the constructor (@LucasAlegre)
Added SB3-Contrib page
Fix bug in the example code of DQN (@AptX395)
Add example on how to access the tensorboard summary writer directly. (@lorenz-h)
Updated migration guide
Updated custom policy doc (separate policy architecture recommended)
Added a note about OpenCV headless version
Corrected typo on documentation (@mschweizer)
Provide the environment when loading the model in the examples (@lorepieri8)
Pre-Release 0.10.0 (2020-10-28)¶
HER with online and offline sampling, bug fixes for features extraction
Breaking Changes:¶
Warning: Renamed common.cmd_util to common.env_util for clarity (affects make_vec_env and make_atari_env functions)
New Features:¶
Allow custom actor/critic network architectures using net_arch=dict(qf=[400, 300], pi=[64, 64]) for off-policy algorithms (SAC, TD3, DDPG)
Added Hindsight Experience Replay HER. (@megan-klaiber)
VecNormalize now supports gym.spaces.Dict observation spaces
Support logging videos to Tensorboard (@SwamyDev)
Added share_features_extractor argument to SAC and TD3 policies
Bug Fixes:¶
Fix GAE computation for on-policy algorithms (off-by one for the last value) (thanks @Wovchena)
Fixed potential issue when loading a different environment
Fix ignoring the exclude parameter when recording logs using json, csv or log as logging format (@SwamyDev)
Make make_vec_env support the env_kwargs argument when using an env ID str (@ManifoldFR)
Fix model creation initializing CUDA even when device="cpu" is provided
Fix check_env not checking if the env has a Dict action space before calling _check_nan (@wmmc88)
Update the check for spaces unsupported by Stable Baselines 3 to include checks on the action space (@wmmc88)
Fixed feature extractor bug for target network where the same net was shared instead of being separate. This bug affects SAC, DDPG and TD3 when using CnnPolicy (or custom feature extractor)
Fixed a bug when passing an environment when loading a saved model with a CnnPolicy, the passed env was not wrapped properly (the bug was introduced when implementing HER so it should not be present in previous versions)
Deprecations:¶
Others:¶
Improved typing coverage
Improved error messages for unsupported spaces
Added .vscode to the gitignore
Documentation:¶
Added first draft of migration guide
Added intro to imitation library (@shwang)
Enabled doc for CnnPolicies
Added advanced saving and loading example
Added base doc for exporting models
Added example for getting and setting model parameters
Pre-Release 0.9.0 (2020-10-03)¶
Bug fixes, get/set parameters and improved docs
Breaking Changes:¶
Removed device keyword argument of policies; use policy.to(device) instead. (@qxcv)
Rename BaseClass.get_torch_variables -> BaseClass._get_torch_save_params and BaseClass.excluded_save_params -> BaseClass._excluded_save_params
Renamed saved items tensors to pytorch_variables for clarity
make_atari_env, make_vec_env and set_random_seed must be imported with (and not directly from stable_baselines3.common):
from stable_baselines3.common.cmd_util import make_atari_env, make_vec_env
from stable_baselines3.common.utils import set_random_seed
New Features:¶
Added unwrap_vec_wrapper() to common.vec_env to extract VecEnvWrapper if needed
Added StopTrainingOnMaxEpisodes to callback collection (@xicocaio)
Added device keyword argument to BaseAlgorithm.load() (@liorcohen5)
Callbacks have access to rollout collection locals as in SB2. (@PartiallyTyped)
Added get_parameters and set_parameters for accessing/setting parameters of the agent
Added actor/critic loss logging for TD3. (@mloo3)
Bug Fixes:¶
Added unwrap_vec_wrapper() to common.vec_env to extract VecEnvWrapper if needed
Fixed a bug where the environment was reset twice when using evaluate_policy
Fix logging of clip_fraction in PPO (@diditforlulz273)
Fixed a bug where cuda support was wrongly checked when passing the GPU index, e.g., device="cuda:0" (@liorcohen5)
Fixed a bug when the random seed was not properly set on cuda when passing the GPU index
Deprecations:¶
Others:¶
Improve typing coverage of the VecEnv
Fix type annotation of make_vec_env (@ManifoldFR)
Removed AlreadySteppingError and NotSteppingError that were not used
Fixed typos in SAC and TD3
Reorganized functions for clarity in BaseClass (save/load functions close to each other, private functions at top)
Clarified docstrings on what is saved and loaded to/from files
Simplified save_to_zip_file function by removing duplicate code
Store library version along with the saved models
DQN loss is now logged
Documentation:¶
Added StopTrainingOnMaxEpisodes details and example (@xicocaio)
Updated custom policy section (added custom feature extractor example)
Re-enable sphinx_autodoc_typehints
Updated doc style for type hints and remove duplicated type hints
Pre-Release 0.8.0 (2020-08-03)¶
DQN, DDPG, bug fixes and performance matching for Atari games
Breaking Changes:¶
AtariWrapper and other Atari wrappers were updated to match SB2 ones
save_replay_buffer now receives as argument the file path instead of the folder path (@tirafesi)
Refactored Critic class for TD3 and SAC, it is now called ContinuousCritic and has an additional parameter n_critics
SAC and TD3 now accept an arbitrary number of critics (e.g. policy_kwargs=dict(n_critics=3)) instead of only 2 previously
New Features:¶
Added DQN Algorithm (@Artemis-Skade)
Buffer dtype is now set according to action and observation spaces for ReplayBuffer
Added warning when allocation of a buffer may exceed the available memory of the system when psutil is available
Saving models now automatically creates the necessary folders and raises appropriate warnings (@PartiallyTyped)
Refactored opening paths for saving and loading to use strings, pathlib or io.BufferedIOBase (@PartiallyTyped)
Added DDPG algorithm as a special case of TD3.
Introduced BaseModel abstract parent for BasePolicy, which critics inherit from.
Bug Fixes:¶
Fixed a bug in the close() method of SubprocVecEnv, causing wrappers further down in the wrapper stack to not be closed. (@NeoExtended)
Fix target for updating q values in SAC: the entropy term was not conditioned by terminal states
Use cloudpickle.load instead of pickle.load in CloudpickleWrapper. (@shwang)
Fixed a bug with orthogonal initialization when bias=False in custom policy (@rk37)
Fixed approximate entropy calculation in PPO and A2C. (@andyshih12)
Fixed DQN target network sharing feature extractor with the main network.
Fixed storing correct dones in on-policy algorithm rollout collection. (@andyshih12)
Fixed number of filters in final convolutional layer in NatureCNN to match original implementation.
Deprecations:¶
Others:¶
Refactored off-policy algorithm to share the same .learn() method
Split the collect_rollout() method for off-policy algorithms
Added _on_step() for off-policy base class
Optimized replay buffer size by removing the need of next_observations numpy array
Optimized polyak updates (1.5-1.95 speedup) through inplace operations (@PartiallyTyped)
Switch to black codestyle and added make format, make check-codestyle and commit-checks
Ignored errors from newer pytype version
Added a check when using gSDE
Removed codacy dependency from Dockerfile
Added common.sb2_compat.RMSpropTFLike optimizer, which corresponds closer to the implementation of RMSprop from Tensorflow.
Documentation:¶
Updated notebook links
Fixed a typo in the section of Enjoy a Trained Agent, in RL Baselines3 Zoo README. (@blurLake)
Added Unity reacher to the projects page (@koulakis)
Added PyBullet colab notebook
Fixed typo in PPO example code (@joeljosephjin)
Fixed typo in custom policy doc (@RaphaelWag)
Pre-Release 0.7.0 (2020-06-10)¶
Hotfix for PPO/A2C + gSDE, internal refactoring and bug fixes
Breaking Changes:¶
render() method of VecEnvs now only accepts one argument: mode
Created new file common/torch_layers.py, similar to SB refactoring
Contains all PyTorch network layer definitions and feature extractors: MlpExtractor, create_mlp, NatureCNN
Renamed BaseRLModel to BaseAlgorithm (along with offpolicy and onpolicy variants)
Moved on-policy and off-policy base algorithms to common/on_policy_algorithm.py and common/off_policy_algorithm.py, respectively.
Moved PPOPolicy to ActorCriticPolicy in common/policies.py
Moved PPO (algorithm class) into OnPolicyAlgorithm (common/on_policy_algorithm.py), to be shared with A2C
Moved following functions from BaseAlgorithm:
_load_from_file to load_from_zip_file (save_util.py)
_save_to_file_zip to save_to_zip_file (save_util.py)
safe_mean to safe_mean (utils.py)
check_env to check_for_correct_spaces (utils.py. Renamed to avoid confusion with environment checker tools)
Moved static function _is_vectorized_observation from common/policies.py to common/utils.py under name is_vectorized_observation.
Removed {save,load}_running_average functions of VecNormalize in favor of load/save.
Removed use_gae parameter from RolloutBuffer.compute_returns_and_advantage.
New Features:¶
Bug Fixes:¶
Fixed render() method for VecEnvs
Fixed seed() method for SubprocVecEnv
Fixed loading on GPU for testing when using gSDE and deterministic=False
Fixed register_policy to allow re-registering same policy for same sub-class (i.e. assign same value to same key).
Fixed a bug where the gradient was passed when using gSDE with PPO/A2C, this does not affect SAC
Deprecations:¶
Others:¶
Re-enable unsafe fork start method in the tests (was causing a deadlock with tensorflow)
Added a test for seeding SubprocVecEnv and rendering
Fixed reference in NatureCNN (pointed to older version with different network architecture)
Fixed comments saying “CxWxH” instead of “CxHxW” (same style as in torch docs / commonly used)
Added bit further comments on register/getting policies (“MlpPolicy”, “CnnPolicy”).
Renamed progress (value from 1 in start of training to 0 in end) to progress_remaining.
Added policies.py files for A2C/PPO, which define MlpPolicy/CnnPolicy (renamed ActorCriticPolicies).
Added some missing tests for VecNormalize, VecCheckNan and PPO.
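The progress_remaining value mentioned in this list runs from 1 at the start of training to 0 at the end, which makes it a natural input for schedules. A minimal sketch of a linear learning rate schedule built on it (the function name linear_schedule and the initial value are illustrative assumptions):

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Build a schedule that decays linearly from initial_value to 0."""

    def schedule(progress_remaining: float) -> float:
        # progress_remaining goes from 1 (start of training) to 0 (end of training).
        return progress_remaining * initial_value

    return schedule


lr = linear_schedule(0.001)
print(lr(1.0), lr(0.5), lr(0.0))
```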
Documentation:¶
Added a paragraph on “MlpPolicy”/”CnnPolicy” and policy naming scheme under “Developer Guide”
Fixed second-level listing in changelog
Pre-Release 0.6.0 (2020-06-01)¶
Tensorboard support, refactored logger
Breaking Changes:¶
Remove State-Dependent Exploration (SDE) support for TD3
Methods were renamed in the logger:
logkv -> record, writekvs -> write, writeseq -> write_sequence,
logkvs -> record_dict, dumpkvs -> dump,
getkvs -> get_log_dict, logkv_mean -> record_mean,
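When porting code that still uses the old logger method names, the renames listed above can be kept as a lookup table. A minimal sketch (the names LOGGER_RENAMES and new_logger_name are hypothetical and not part of the library):

```python
# Old logger method name -> new logger method name, as listed in this release.
LOGGER_RENAMES = {
    "logkv": "record",
    "writekvs": "write",
    "writeseq": "write_sequence",
    "logkvs": "record_dict",
    "dumpkvs": "dump",
    "getkvs": "get_log_dict",
    "logkv_mean": "record_mean",
}


def new_logger_name(old_name: str) -> str:
    """Return the renamed method, or the name unchanged if it was not renamed."""
    return LOGGER_RENAMES.get(old_name, old_name)


print(new_logger_name("logkv"))
print(new_logger_name("dumpkvs"))
```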
New Features:¶
Added env checker (Sync with Stable Baselines)
Added env checker (Sync with Stable Baselines)
Added VecCheckNan and VecVideoRecorder (Sync with Stable Baselines)
Added determinism tests
Added cmd_util and atari_wrappers
Added support for MultiDiscrete and MultiBinary observation spaces (@rolandgvc)
Added MultiCategorical and Bernoulli distributions for PPO/A2C (@rolandgvc)
Added support for logging to tensorboard (@rolandgvc)
Added VectorizedActionNoise for continuous vectorized environments (@PartiallyTyped)
Log evaluation in the EvalCallback using the logger
Bug Fixes:¶
Fixed a bug that prevented model trained on cpu to be loaded on gpu
Fixed version number that had a new line included
Fixed weird seg fault in docker image due to FakeImageEnv by reducing screen size
Fixed sde_sample_freq that was not taken into account for SAC
Pass logger module to BaseCallback otherwise they cannot write in the one used by the algorithms
Deprecations:¶
Others:¶
Renamed to Stable-Baselines3
Added Dockerfile
Sync VecEnvs with Stable-Baselines
Update requirement: gym>=0.17
Added .readthedoc.yml file
Added flake8 and make lint command
Added Github workflow
Added warning when passing both train_freq and n_episodes_rollout to Off-Policy Algorithms
Documentation:¶
Added most documentation (adapted from Stable-Baselines)
Added link to CONTRIBUTING.md in the README (@kinalmehta)
Added gSDE project and update docstrings accordingly
Fix TD3 example code block
Pre-Release 0.5.0 (2020-05-05)¶
CnnPolicy support for image observations, complete saving/loading for policies
Breaking Changes:¶
Previous loading of policy weights is broken and replaced by the new saving/loading for policy
New Features:¶
Added optimizer_class and optimizer_kwargs to policy_kwargs in order to easily customize optimizers
Complete independent save/load for policies
Add CnnPolicy and VecTransposeImage to support images as input
Bug Fixes:¶
Fixed reset_num_timesteps behavior, so env.reset() is not called if reset_num_timesteps=True
Fixed squashed_output that was not passed to policy constructor for SAC and TD3 (would result in scaled actions for unscaled action spaces)
Deprecations:¶
Others:¶
Cleanup rollout return
Added get_device util to manage PyTorch devices
Added type hints to logger + use f-strings
Documentation:¶
Pre-Release 0.4.0 (2020-02-14)¶
Proper pre-processing, independent save/load for policies
Breaking Changes:¶
Removed CEMRL
Models saved with previous versions cannot be loaded (because of the changes to preprocessing)
New Features:¶
Add support for Discrete observation spaces
Add saving/loading for policy weights, so the policy can be used without the model
Bug Fixes:¶
Fix type hint for activation functions
Deprecations:¶
Others:¶
Refactor handling of observation and action spaces
Refactored features extraction to have proper preprocessing
Refactored action distributions
Pre-Release 0.3.0 (2020-02-14)¶
Bug fixes, sync with Stable-Baselines, code cleanup
Breaking Changes:¶
Removed default seed
Bump dependencies (PyTorch and Gym)
predict() now returns a tuple to match Stable-Baselines behavior
New Features:¶
Better logging for SAC and PPO
Bug Fixes:¶
Synced callbacks with Stable-Baselines
Fixed colors in results_plotter
Fix entropy computation (now summed over action dim)
Others:¶
SAC with SDE now samples only one matrix
Added clip_mean parameter to SAC policy
Buffers now return NamedTuple
More typing
Add test for expln
Renamed learning_rate to lr_schedule
Add version.txt
Add more tests for distribution
Documentation:¶
Deactivated sphinx_autodoc_typehints extension
Pre-Release 0.2.0 (2020-02-14)¶
Python 3.6+ required, type checking, callbacks, doc build
Breaking Changes:¶
Python 2 support was dropped, Stable Baselines3 now requires Python 3.6 or above
Return type of evaluation.evaluate_policy() has been changed
Refactored the replay buffer to avoid transformation between PyTorch and NumPy
Created OffPolicyRLModel base class
Remove deprecated JSON format for Monitor
New Features:¶
Add seed() method to VecEnv class
Add support for Callback (cf https://github.com/hill-a/stable-baselines/pull/644)
Add methods for saving and loading replay buffer
Add extend() method to the buffers
Add get_vec_normalize_env() to BaseRLModel to retrieve VecNormalize wrapper when it exists
Add results_plotter from Stable Baselines
Improve predict() method to handle different types of observations (single, vectorized, …)
Bug Fixes:¶
Fix loading on CPU of models that were trained on GPU
Fix reset_num_timesteps that was not used
Fix entropy computation for squashed Gaussian (approximate it now)
Fix seeding when using multiple environments (different seed per env)
Others:¶
Add type check
Converted all format string to f-strings
Add test for OrnsteinUhlenbeckActionNoise
Add type aliases in common.type_aliases
Documentation:¶
fix documentation build
Pre-Release 0.1.0 (2020-01-20)¶
First Release: base algorithms and state-dependent exploration
New Features:¶
Initial release of A2C, CEM-RL, PPO, SAC and TD3, working only with Box input space
State-Dependent Exploration (SDE) for A2C, PPO, SAC and TD3
Maintainers¶
Stable-Baselines3 is currently maintained by Antonin Raffin (aka @araffin), Ashley Hill (aka @hill-a), Maximilian Ernestus (aka @ernestum), Adam Gleave (@AdamGleave) and Anssi Kanervisto (aka @Miffyli).
Contributors:¶
In random order…
Thanks to the maintainers of V2: @hill-a @enerijunior @AdamGleave @Miffyli
And all the contributors: @bjmuld @iambenzo @iandanforth @r7vme @brendenpetersen @huvar @abhiskk @JohannesAck @EliasHasle @mrakgr @Bleyddyn @antoine-galataud @junhyeokahn @AdamGleave @keshaviyengar @tperol @XMaster96 @kantneel @Pastafarianist @GerardMaggiolino @PatrickWalter214 @yutingsz @sc420 @Aaahh @billtubbs @Miffyli @dwiel @miguelrass @qxcv @jaberkow @eavelardev @ruifeng96150 @pedrohbtp @srivatsankrishnan @evilsocket @MarvineGothic @jdossgollin @stheid @SyllogismRXS @rusu24edward @jbulow @Antymon @seheevic @justinkterry @edbeeching @flodorner @KuKuXia @NeoExtended @PartiallyTyped @mmcenta @richardwu @kinalmehta @rolandgvc @tkelestemur @mloo3 @tirafesi @blurLake @koulakis @joeljosephjin @shwang @rk37 @andyshih12 @RaphaelWag @xicocaio @diditforlulz273 @liorcohen5 @ManifoldFR @mloo3 @SwamyDev @wmmc88 @megan-klaiber @thisray @tfederico @hn2 @LucasAlegre @AptX395 @zampanteymedio @JadenTravnik @decodyng @ardabbour @lorenz-h @mschweizer @lorepieri8 @vwxyzjn @ShangqunYu @PierreExeter @JacopoPan @ltbd78 @tom-doerr @Atlis @liusida @09tangriro @amy12xx @juancroldan @benblack769 @bstee615 @c-rizz @skandermoalla @MihaiAnca13 @davidblom603 @ayeright @cyprienc @wkirgsn @AechPro @CUN-bjy @batu @IljaAvadiev @timokau @kachayev @cleversonahum @eleurent @ac-93 @cove9988 @theDebugger811 @hsuehch @Demetrio92 @thomasgubler @IperGiove