Changelog¶

Pre-Release 0.8.0 (2020-08-03)¶

DQN, DDPG, bug fixes and performance matching for Atari games

Breaking Changes:¶

AtariWrapper and other Atari wrappers were updated to match SB2 ones
save_replay_buffer now receives as argument the file path instead of the folder path (@tirafesi)
Refactored Critic class for TD3 and SAC, it is now called ContinuousCritic and has an additional parameter n_critics
SAC and TD3 now accept an arbitrary number of critics (e.g. policy_kwargs=dict(n_critics=3)) instead of only 2 previously

New Features:¶

Added DQN Algorithm (@Artemis-Skade)
Buffer dtype is now set according to action and observation spaces for ReplayBuffer
Added warning when allocation of a buffer may exceed the available memory of the system when psutil is available
Saving models now automatically creates the necessary folders and raises appropriate warnings (@PartiallyTyped)
Refactored opening paths for saving and loading to use strings, pathlib or io.BufferedIOBase (@PartiallyTyped)
Added DDPG algorithm as a special case of TD3.
Introduced BaseModel abstract parent for BasePolicy, which critics inherit from.
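
The path handling described above (strings, pathlib or io.BufferedIOBase, with missing folders created automatically) can be sketched with functools.singledispatch. This is a minimal illustration under stated assumptions, not the library's implementation; the helper name open_for_writing is hypothetical.

```python
import io
import pathlib
from functools import singledispatch


@singledispatch
def open_for_writing(path):
    """Hypothetical helper: return a writable binary file object for `path`."""
    raise TypeError(f"Unsupported path type: {type(path)}")


@open_for_writing.register(io.BufferedIOBase)
def _(path):
    # Already a binary stream: check it is writable and use it as-is.
    if not path.writable():
        raise ValueError("Expected a writable stream")
    return path


@open_for_writing.register(str)
def _(path):
    # Strings are converted to pathlib paths and handled by the branch below.
    return open_for_writing(pathlib.Path(path))


@open_for_writing.register(pathlib.Path)
def _(path):
    # Create missing parent folders before opening the file.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path.open("wb")
```

Dispatching on the argument type keeps each case small and makes it straightforward to support another path-like type later.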

Bug Fixes:¶

Fixed a bug in the close() method of SubprocVecEnv, causing wrappers further down in the wrapper stack to not be closed. (@NeoExtended)
Fix target for updating q values in SAC: the entropy term was not conditioned by terminal states
Use cloudpickle.load instead of pickle.load in CloudpickleWrapper. (@shwang)
Fixed a bug with orthogonal initialization when bias=False in custom policy (@rk37)
Fixed approximate entropy calculation in PPO and A2C. (@andyshih12)
Fixed DQN target network sharing feature extractor with the main network.
Fixed storing correct dones in on-policy algorithm rollout collection. (@andyshih12)
Fixed number of filters in final convolutional layer in NatureCNN to match original implementation.
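
To illustrate the SAC target fix listed above, here is a minimal sketch of a soft Q-learning target for a single transition, in which the entropy term is masked on terminal states together with the next-state value. The function name, argument names and default values are assumptions made for illustration; taking the minimum over the critics is the usual clipped double-Q choice, not something this changelog states.

```python
def sac_target(reward, done, next_q_values, next_log_prob, gamma=0.99, ent_coef=0.2):
    """Soft Q-learning target for one transition (illustrative values only)."""
    # Use the most pessimistic critic estimate for the next state (assumption).
    next_q = min(next_q_values)
    # Add the entropy bonus to the next-state value...
    soft_next_value = next_q - ent_coef * next_log_prob
    # ...and mask the whole soft value, entropy included, on terminal states.
    return reward + (1.0 - done) * gamma * soft_next_value
```

With done=1.0 the target reduces to the reward alone, so no entropy bonus leaks past the end of an episode.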

Deprecations:¶

Others:¶

Refactored off-policy algorithm to share the same .learn() method
Split the collect_rollout() method for off-policy algorithms
Added _on_step() for off-policy base class
Optimized replay buffer size by removing the need of next_observations numpy array
Optimized polyak updates (1.5-1.95x speedup) through inplace operations (@PartiallyTyped)
Switch to black codestyle and added make format, make check-codestyle and commit-checks
Ignored errors from newer pytype version
Added a check when using gSDE
Removed codacy dependency from Dockerfile
Added common.sb2_compat.RMSpropTFLike optimizer, which corresponds closer to the implementation of RMSprop from Tensorflow.
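
The polyak update mentioned above moves each target parameter a fraction tau of the way towards the corresponding online parameter. The sketch below shows the update rule on plain Python lists so that it runs without dependencies; the function name and the list representation are assumptions, and a real implementation would apply the same rule to tensors with in-place operations.

```python
def polyak_update(params, target_params, tau):
    """Move each target value towards the online value, in place."""
    for i, (param, target) in enumerate(zip(params, target_params)):
        # target <- (1 - tau) * target + tau * param
        target_params[i] = (1.0 - tau) * target + tau * param
```

Updating target_params in place avoids allocating a new container at every training step, which is the idea behind the optimization.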

Documentation:¶

Updated notebook links
Fixed a typo in the section of Enjoy a Trained Agent, in RL Baselines3 Zoo README. (@blurLake)
Added Unity reacher to the projects page (@koulakis)
Added PyBullet colab notebook
Fixed typo in PPO example code (@joeljosephjin)
Fixed typo in custom policy doc (@RaphaelWag)

Pre-Release 0.7.0 (2020-06-10)¶

Hotfix for PPO/A2C + gSDE, internal refactoring and bug fixes

Breaking Changes:¶

render() method of VecEnvs now only accepts one argument: mode
Created new file common/torch_layers.py, similar to SB refactoring
- Contains all PyTorch network layer definitions and feature extractors: MlpExtractor, create_mlp, NatureCNN
Renamed BaseRLModel to BaseAlgorithm (along with offpolicy and onpolicy variants)
Moved on-policy and off-policy base algorithms to common/on_policy_algorithm.py and common/off_policy_algorithm.py, respectively.
Moved PPOPolicy to ActorCriticPolicy in common/policies.py
Moved PPO (algorithm class) into OnPolicyAlgorithm (common/on_policy_algorithm.py), to be shared with A2C
Moved following functions from BaseAlgorithm:
- _load_from_file to load_from_zip_file (save_util.py)
- _save_to_file_zip to save_to_zip_file (save_util.py)
- safe_mean to safe_mean (utils.py)
- check_env to check_for_correct_spaces (utils.py, renamed to avoid confusion with environment checker tools)
Moved static function _is_vectorized_observation from common/policies.py to common/utils.py under name is_vectorized_observation.
Removed {save,load}_running_average functions of VecNormalize in favor of load/save.
Removed use_gae parameter from RolloutBuffer.compute_returns_and_advantage.

New Features:¶

Bug Fixes:¶

Fixed render() method for VecEnvs
Fixed seed() method for SubprocVecEnv
Fixed loading on GPU for testing when using gSDE and deterministic=False
Fixed register_policy to allow re-registering same policy for same sub-class (i.e. assign same value to same key).
Fixed a bug where the gradient was passed when using gSDE with PPO/A2C; this does not affect SAC

Deprecations:¶

Others:¶

Re-enable unsafe fork start method in the tests (was causing a deadlock with tensorflow)
Added a test for seeding SubprocVecEnv and rendering
Fixed reference in NatureCNN (pointed to older version with different network architecture)
Fixed comments saying “CxWxH” instead of “CxHxW” (same style as in torch docs / commonly used)
Added a few more comments on registering/getting policies (“MlpPolicy”, “CnnPolicy”).
Renamed progress (value from 1 in start of training to 0 in end) to progress_remaining.
Added policies.py files for A2C/PPO, which define MlpPolicy/CnnPolicy (renamed ActorCriticPolicies).
Added some missing tests for VecNormalize, VecCheckNan and PPO.
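
Since progress_remaining goes from 1 at the start of training to 0 at the end, a schedule can be written as a function of that value. The following is a minimal sketch of a linear decay; the name linear_schedule and the idea of passing the resulting callable as a learning rate are assumptions made for illustration.

```python
def linear_schedule(initial_value):
    """Return a function mapping progress_remaining (1 -> 0) to a value."""

    def schedule(progress_remaining):
        # At the start (1.0) this yields initial_value, at the end (0.0) zero.
        return progress_remaining * initial_value

    return schedule
```

For example, linear_schedule(3e-4)(0.5) gives half of the initial value, halfway through training.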

Documentation:¶

Added a paragraph on “MlpPolicy”/“CnnPolicy” and policy naming scheme under “Developer Guide”
Fixed second-level listing in changelog

Pre-Release 0.6.0 (2020-06-01)¶

Tensorboard support, refactored logger

Breaking Changes:¶

Methods were renamed in the logger:
- logkv -> record, writekvs -> write, writeseq -> write_sequence,
- logkvs -> record_dict, dumpkvs -> dump,
- getkvs -> get_log_dict, logkv_mean -> record_mean
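
When migrating older code, the renames listed above can be captured in a small lookup table. This is only a sketch: the names LOGGER_RENAMES and new_logger_name are hypothetical, and the mapping contains nothing beyond the renames listed in this changelog.

```python
# Old logger method name -> new logger method name, as listed above.
LOGGER_RENAMES = {
    "logkv": "record",
    "writekvs": "write",
    "writeseq": "write_sequence",
    "logkvs": "record_dict",
    "dumpkvs": "dump",
    "getkvs": "get_log_dict",
    "logkv_mean": "record_mean",
}


def new_logger_name(old_name):
    """Return the new method name, or the name unchanged if it was not renamed."""
    return LOGGER_RENAMES.get(old_name, old_name)
```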

New Features:¶

Added env checker (Sync with Stable Baselines)
Added VecCheckNan and VecVideoRecorder (Sync with Stable Baselines)
Added determinism tests
Added cmd_util and atari_wrappers
Added support for MultiDiscrete and MultiBinary observation spaces (@rolandgvc)
Added MultiCategorical and Bernoulli distributions for PPO/A2C (@rolandgvc)
Added support for logging to tensorboard (@rolandgvc)
Added VectorizedActionNoise for continuous vectorized environments (@PartiallyTyped)
Log evaluation in the EvalCallback using the logger

Bug Fixes:¶

Fixed a bug that prevented a model trained on CPU from being loaded on GPU
Fixed version number that had a new line included
Fixed weird seg fault in docker image due to FakeImageEnv by reducing screen size
Fixed sde_sample_freq that was not taken into account for SAC
Pass logger module to BaseCallback, otherwise callbacks cannot write to the logger used by the algorithms

Deprecations:¶

Others:¶

Renamed to Stable-Baselines3
Added Dockerfile
Sync VecEnvs with Stable-Baselines
Update requirement: gym>=0.17
Added .readthedocs.yml file
Added flake8 and make lint command
Added Github workflow
Added warning when passing both train_freq and n_episodes_rollout to Off-Policy Algorithms

Documentation:¶

Added most documentation (adapted from Stable-Baselines)
Added link to CONTRIBUTING.md in the README (@kinalmehta)
Added gSDE project and updated docstrings accordingly
Fix TD3 example code block

Pre-Release 0.5.0 (2020-05-05)¶

CnnPolicy support for image observations, complete saving/loading for policies

Breaking Changes:¶

Previous loading of policy weights is broken and replaced by the new saving/loading for policy

New Features:¶

Added optimizer_class and optimizer_kwargs to policy_kwargs in order to easily customize optimizers
Complete independent save/load for policies
Add CnnPolicy and VecTransposeImage to support images as input

Bug Fixes:¶

Fixed reset_num_timesteps behavior, so env.reset() is not called if reset_num_timesteps=True
Fixed squashed_output that was not passed to policy constructor for SAC and TD3 (would result in scaled actions for unscaled action spaces)

Deprecations:¶

Others:¶

Cleanup rollout return
Added get_device util to manage PyTorch devices
Added type hints to logger + use f-strings

Documentation:¶

Pre-Release 0.4.0 (2020-02-14)¶

Proper pre-processing, independent save/load for policies

Breaking Changes:¶

Removed CEMRL
Models saved with previous versions cannot be loaded (because of the pre-processing)

New Features:¶

Add support for Discrete observation spaces
Add saving/loading for policy weights, so the policy can be used without the model

Bug Fixes:¶

Fix type hint for activation functions

Deprecations:¶

Others:¶

Refactor handling of observation and action spaces
Refactored features extraction to have proper preprocessing
Refactored action distributions

Pre-Release 0.3.0 (2020-02-14)¶

Bug fixes, sync with Stable-Baselines, code cleanup

Breaking Changes:¶

Removed default seed
Bump dependencies (PyTorch and Gym)
predict() now returns a tuple to match Stable-Baselines behavior

New Features:¶

Better logging for SAC and PPO

Bug Fixes:¶

Synced callbacks with Stable-Baselines
Fixed colors in results_plotter
Fix entropy computation (now summed over action dim)
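
To make the entropy fix above concrete, here is a minimal sketch that assumes a diagonal Gaussian action distribution (an assumption, since the changelog does not name the distribution). Each action dimension contributes 0.5 + 0.5 * log(2 * pi) + log_std to the entropy, and the contributions are summed over the action dimensions. The function name is hypothetical.

```python
import math


def diag_gaussian_entropy(log_std):
    """Entropy of a diagonal Gaussian, summed over the action dimensions."""
    # Per-dimension entropy of a 1D Gaussian with standard deviation exp(ls).
    per_dim = [0.5 + 0.5 * math.log(2.0 * math.pi) + ls for ls in log_std]
    # One scalar per action: sum over the action dimensions.
    return sum(per_dim)
```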

Others:¶

SAC with SDE now samples only one matrix
Added clip_mean parameter to SAC policy
Buffers now return NamedTuple
More typing
Add test for expln
Renamed learning_rate to lr_schedule
Add version.txt
Add more tests for distribution

Documentation:¶

Deactivated sphinx_autodoc_typehints extension

Pre-Release 0.2.0 (2020-02-14)¶

Python 3.6+ required, type checking, callbacks, doc build

Breaking Changes:¶

Python 2 support was dropped, Stable Baselines3 now requires Python 3.6 or above
Return type of evaluation.evaluate_policy() has been changed
Refactored the replay buffer to avoid transformation between PyTorch and NumPy
Created OffPolicyRLModel base class
Remove deprecated JSON format for Monitor

New Features:¶

Add seed() method to VecEnv class
Add support for Callback (cf https://github.com/hill-a/stable-baselines/pull/644)
Add methods for saving and loading replay buffer
Add extend() method to the buffers
Add get_vec_normalize_env() to BaseRLModel to retrieve VecNormalize wrapper when it exists
Add results_plotter from Stable Baselines
Improve predict() method to handle different types of observations (single, vectorized, …)

Bug Fixes:¶

Fix loading on CPU of models that were trained on GPU
Fix reset_num_timesteps that was not used
Fix entropy computation for squashed Gaussian (approximate it now)
Fix seeding when using multiple environments (different seed per env)
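
One simple way to give each environment a different seed, as in the seeding fix above, is to offset a base seed by the environment index. This scheme and the function name per_env_seeds are assumptions made for illustration; the changelog only states that each environment gets a different seed.

```python
def per_env_seeds(seed, n_envs):
    """Derive one distinct seed per environment from a single base seed."""
    # Offsetting by the index keeps runs reproducible while avoiding
    # identical random streams across environments.
    return [seed + idx for idx in range(n_envs)]
```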

Others:¶

Add type check
Converted all format strings to f-strings
Add test for OrnsteinUhlenbeckActionNoise
Add type aliases in common.type_aliases

Documentation:¶

Fix documentation build

Pre-Release 0.1.0 (2020-01-20)¶

First Release: base algorithms and state-dependent exploration

New Features:¶

Initial release of A2C, CEM-RL, PPO, SAC and TD3, working only with Box input space
State-Dependent Exploration (SDE) for A2C, PPO, SAC and TD3

Maintainers¶

Stable-Baselines3 is currently maintained by Antonin Raffin (aka @araffin), Ashley Hill (aka @hill-a), Maximilian Ernestus (aka @erniejunior), Adam Gleave (aka @AdamGleave) and Anssi Kanervisto (aka @Miffyli).

Contributors:¶

In random order…

Thanks to the maintainers of V2: @hill-a @erniejunior @AdamGleave @Miffyli

And all the contributors: @bjmuld @iambenzo @iandanforth @r7vme @brendenpetersen @huvar @abhiskk @JohannesAck @EliasHasle @mrakgr @Bleyddyn @antoine-galataud @junhyeokahn @AdamGleave @keshaviyengar @tperol @XMaster96 @kantneel @Pastafarianist @GerardMaggiolino @PatrickWalter214 @yutingsz @sc420 @Aaahh @billtubbs @Miffyli @dwiel @miguelrass @qxcv @jaberkow @eavelardev @ruifeng96150 @pedrohbtp @srivatsankrishnan @evilsocket @MarvineGothic @jdossgollin @SyllogismRXS @rusu24edward @jbulow @Antymon @seheevic @justinkterry @edbeeching @flodorner @KuKuXia @NeoExtended @PartiallyTyped @mmcenta @richardwu @kinalmehta @rolandgvc @tkelestemur @mloo3 @tirafesi @blurLake @koulakis @joeljosephjin @shwang @rk37 @andyshih12 @RaphaelWag