DQN
Available Policies
MlpPolicy | alias of stable_baselines3.dqn.policies.DQNPolicy |
CnnPolicy | Policy class for DQN when using images as input. |
Notes
Original paper: https://arxiv.org/abs/1312.5602
Further reference: https://www.nature.com/articles/nature14236
Note
This implementation provides only vanilla Deep Q-Learning and has no extensions such as Double-DQN, Dueling-DQN and Prioritized Experience Replay.
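As a rough illustration of what "vanilla" means here, the sketch below computes the standard Q-learning regression target, y = r + gamma * max over a' of Q_target(s', a'), for a small batch. It uses plain Python lists rather than tensors, and the function name `td_targets` is hypothetical, not part of the library. Double-DQN would instead pick the next action with the online network, which this sketch deliberately does not do.

```python
def td_targets(rewards, dones, next_q_values, gamma=0.99):
    """Vanilla DQN targets: r + gamma * max_a' Q_target(s', a').

    next_q_values holds, for each transition, the target network's
    Q-values for every action in the next state. Terminal transitions
    do not bootstrap.
    """
    targets = []
    for reward, done, q_next in zip(rewards, dones, next_q_values):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(reward + bootstrap)
    return targets


targets = td_targets(
    rewards=[1.0, 0.0],
    dones=[False, True],
    next_q_values=[[0.5, 2.0], [3.0, 1.0]],
    gamma=0.5,
)
print(targets)  # [2.0, 0.0]
```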
Can I use?
Recurrent policies: ❌
Multi processing: ❌
Gym spaces:
Space | Action | Observation |
---|---|---|
Discrete | ✔ | ✔ |
Box | ❌ | ✔ |
MultiDiscrete | ❌ | ✔ |
MultiBinary | ❌ | ✔ |
Example
import gym

from stable_baselines3 import DQN
from stable_baselines3.dqn import MlpPolicy

# DQN only supports Discrete action spaces (see the table above),
# so use CartPole here; Pendulum has a Box action space.
env = gym.make('CartPole-v0')

model = DQN(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000, log_interval=4)
model.save("dqn_cartpole")

del model  # remove to demonstrate saving and loading

model = DQN.load("dqn_cartpole")

# Run the loaded policy greedily, resetting whenever an episode ends
obs = env.reset()
while True:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()
Parameters
-
class
stable_baselines3.dqn.
DQN
(policy: Union[str, Type[stable_baselines3.dqn.policies.DQNPolicy]], env: Union[gym.core.Env, stable_baselines3.common.vec_env.base_vec_env.VecEnv, str], learning_rate: Union[float, Callable] = 0.0001, buffer_size: int = 1000000, learning_starts: int = 50000, batch_size: Optional[int] = 32, tau: float = 1.0, gamma: float = 0.99, train_freq: int = 4, gradient_steps: int = 1, n_episodes_rollout: int = - 1, optimize_memory_usage: bool = False, target_update_interval: int = 10000, exploration_fraction: float = 0.1, exploration_initial_eps: float = 1.0, exploration_final_eps: float = 0.05, max_grad_norm: float = 10, tensorboard_log: Optional[str] = None, create_eval_env: bool = False, policy_kwargs: Optional[Dict[str, Any]] = None, verbose: int = 0, seed: Optional[int] = None, device: Union[torch.device, str] = 'auto', _init_setup_model: bool = True)[source]¶ Deep Q-Network (DQN)
Paper: https://arxiv.org/abs/1312.5602, https://www.nature.com/articles/nature14236. Default hyperparameters are taken from the Nature paper, except for the optimizer and learning rate, which were taken from Stable Baselines defaults.
- Parameters
policy – (DQNPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, …)
env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
learning_rate – (float or callable) The learning rate, it can be a function of the current progress (from 1 to 0)
buffer_size – (int) size of the replay buffer
learning_starts – (int) how many steps of the model to collect transitions for before learning starts
batch_size – (int) Minibatch size for each gradient update
tau – (float) the soft update coefficient (“Polyak update”, between 0 and 1); the default of 1 gives a hard update
gamma – (float) the discount factor
train_freq – (int) Update the model every train_freq steps. Set to -1 to disable.
gradient_steps – (int) How many gradient steps to do after each rollout (see train_freq and n_episodes_rollout). Set to -1 to do as many gradient steps as steps done in the environment during the rollout.
n_episodes_rollout – (int) Update the model every n_episodes_rollout episodes. Note that this cannot be used at the same time as train_freq. Set to -1 to disable.
optimize_memory_usage – (bool) Enable a memory efficient variant of the replay buffer at a cost of more complexity. See https://github.com/DLR-RM/stable-baselines3/issues/37#issuecomment-637501195
target_update_interval – (int) update the target network every target_update_interval environment steps.
exploration_fraction – (float) fraction of entire training period over which the exploration rate is reduced
exploration_initial_eps – (float) initial value of random action probability
exploration_final_eps – (float) final value of random action probability
max_grad_norm – (float) The maximum value for the gradient clipping
tensorboard_log – (str) the log location for tensorboard (if None, no logging)
create_eval_env – (bool) Whether to create a second environment that will be used for evaluating the agent periodically. (Only available when passing string for the environment)
policy_kwargs – (dict) additional arguments to be passed to the policy on creation
verbose – (int) the verbosity level: 0 no output, 1 info, 2 debug
seed – (int) Seed for the pseudo random generators
device – (str or th.device) Device (cpu, cuda, …) on which the code should be run. If set to auto, the code will be run on the GPU if possible.
_init_setup_model – (bool) Whether or not to build the network at the creation of the instance
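The interaction between tau and target_update_interval can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper names `polyak_update` and `should_update_target` are hypothetical, weights are plain lists instead of tensors, and the exact step on which the library triggers the update is assumed rather than confirmed. With tau = 1 the blend reduces to copying the online weights, which is the hard update the default describes.

```python
def polyak_update(online, target, tau):
    """Blend online weights into target weights.

    new_target = tau * online + (1 - tau) * target
    tau = 1.0 copies the online weights outright (hard update).
    """
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


def should_update_target(num_timesteps, target_update_interval):
    """Assumed trigger: update every target_update_interval env steps."""
    return num_timesteps % target_update_interval == 0


online = [1.0, 2.0]
target = [0.0, 0.0]

soft = polyak_update(online, target, tau=0.5)
hard = polyak_update(online, target, tau=1.0)
print(soft)  # [0.5, 1.0]
print(hard)  # [1.0, 2.0]
```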
-
collect_rollouts
(env: stable_baselines3.common.vec_env.base_vec_env.VecEnv, callback: stable_baselines3.common.callbacks.BaseCallback, n_episodes: int = 1, n_steps: int = - 1, action_noise: Optional[stable_baselines3.common.noise.ActionNoise] = None, learning_starts: int = 0, replay_buffer: Optional[stable_baselines3.common.buffers.ReplayBuffer] = None, log_interval: Optional[int] = None) → stable_baselines3.common.type_aliases.RolloutReturn¶ Collect experiences and store them into a ReplayBuffer.
- Parameters
env – (VecEnv) The training environment
callback – (BaseCallback) Callback that will be called at each step (and at the beginning and end of the rollout)
n_episodes – (int) Number of episodes to use to collect rollout data. You can also specify n_steps instead.
n_steps – (int) Number of steps to use to collect rollout data. You can also specify n_episodes instead.
action_noise – (Optional[ActionNoise]) Action noise that will be used for exploration. Required for deterministic policy (e.g. TD3). This can also be used in addition to the stochastic policy for SAC.
learning_starts – (int) Number of steps before learning for the warm-up phase.
replay_buffer – (ReplayBuffer)
log_interval – (int) Log data every log_interval episodes
- Returns
(RolloutReturn)
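To make "store them into a ReplayBuffer" concrete, here is a minimal sketch of a fixed-size buffer that overwrites its oldest transitions and samples uniformly. The class `TinyReplayBuffer` is hypothetical and far simpler than the library's ReplayBuffer (no arrays, no vectorized environments, no memory optimization); it only illustrates the idea behind buffer_size and batch_size.

```python
import random
from collections import deque


class TinyReplayBuffer:
    """Illustrative FIFO replay buffer with uniform sampling."""

    def __init__(self, buffer_size, seed=None):
        # deque with maxlen drops the oldest item once the buffer is full
        self._storage = deque(maxlen=buffer_size)
        self._rng = random.Random(seed)

    def add(self, obs, action, reward, next_obs, done):
        self._storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling with replacement
        return [self._rng.choice(self._storage) for _ in range(batch_size)]

    def __len__(self):
        return len(self._storage)


buffer = TinyReplayBuffer(buffer_size=3, seed=0)
for step in range(5):
    buffer.add(obs=step, action=0, reward=1.0, next_obs=step + 1, done=False)

batch = buffer.sample(batch_size=2)
print(len(buffer))  # 3
```

Only the three most recent transitions (observations 2, 3 and 4) remain after five insertions, so every sampled transition comes from that set.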
-
excluded_save_params
() → List[str][source]¶ Returns the names of the parameters that should be excluded by default when saving the model.
- Returns
(List[str]) List of parameters that should be excluded from save
-
get_env
() → Optional[stable_baselines3.common.vec_env.base_vec_env.VecEnv]¶ Returns the current environment (can be None if not defined).
- Returns
(Optional[VecEnv]) The current environment
-
get_vec_normalize_env
() → Optional[stable_baselines3.common.vec_env.vec_normalize.VecNormalize] Return the VecNormalize wrapper of the training env if it exists.
- Returns
(Optional[VecNormalize]) The VecNormalize env.
-
learn
(total_timesteps: int, callback: Union[None, Callable, List[stable_baselines3.common.callbacks.BaseCallback], stable_baselines3.common.callbacks.BaseCallback] = None, log_interval: int = 4, eval_env: Optional[Union[gym.core.Env, stable_baselines3.common.vec_env.base_vec_env.VecEnv]] = None, eval_freq: int = - 1, n_eval_episodes: int = 5, tb_log_name: str = 'DQN', eval_log_path: Optional[str] = None, reset_num_timesteps: bool = True) → stable_baselines3.common.off_policy_algorithm.OffPolicyAlgorithm[source]¶ Return a trained model.
- Parameters
total_timesteps – (int) The total number of samples (env steps) to train on
callback – (MaybeCallback) callback(s) called at every step with state of the algorithm.
log_interval – (int) The number of episodes before logging.
tb_log_name – (str) the name of the run for TensorBoard logging
eval_env – (gym.Env) Environment that will be used to evaluate the agent
eval_freq – (int) Evaluate the agent every eval_freq timesteps (this may vary a little)
n_eval_episodes – (int) Number of episodes used to evaluate the agent
eval_log_path – (Optional[str]) Path to a folder where the evaluations will be saved
reset_num_timesteps – (bool) whether or not to reset the current timestep number (used in logging)
- Returns
(BaseAlgorithm) the trained model
-
classmethod
load
(load_path: str, env: Optional[Union[gym.core.Env, stable_baselines3.common.vec_env.base_vec_env.VecEnv]] = None, **kwargs) → stable_baselines3.common.base_class.BaseAlgorithm¶ Load the model from a zip-file
- Parameters
load_path – the location of the saved data
env – the new environment to run the loaded model on (can be None if you only need prediction from a trained model); it has priority over any saved environment
kwargs – extra arguments to change the model when loading
-
load_replay_buffer
(path: Union[str, pathlib.Path, io.BufferedIOBase]) → None¶ Load a replay buffer from a pickle file.
- Parameters
path – (Union[str, pathlib.Path, io.BufferedIOBase]) Path to the pickled replay buffer.
-
predict
(observation: numpy.ndarray, state: Optional[numpy.ndarray] = None, mask: Optional[numpy.ndarray] = None, deterministic: bool = False) → Tuple[numpy.ndarray, Optional[numpy.ndarray]][source]¶ Overrides the base_class predict function to include epsilon-greedy exploration.
- Parameters
observation – (np.ndarray) the input observation
state – (Optional[np.ndarray]) The last states (can be None, used in recurrent policies)
mask – (Optional[np.ndarray]) The last masks (can be None, used in recurrent policies)
deterministic – (bool) Whether or not to return deterministic actions.
- Returns
(Tuple[np.ndarray, Optional[np.ndarray]]) the model’s action and the next state (used in recurrent policies)
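The epsilon-greedy behaviour mentioned above, together with the exploration_fraction, exploration_initial_eps and exploration_final_eps parameters of the class, can be sketched in plain Python. Both helper names below are hypothetical, and the linear decay is an assumption about how the schedule is shaped; the sketch also assumes that deterministic=True bypasses the random branch entirely.

```python
import random


def linear_epsilon(progress_remaining, initial_eps=1.0, final_eps=0.05,
                   exploration_fraction=0.1):
    """Linearly anneal epsilon over the first exploration_fraction of training.

    progress_remaining goes from 1 (start of training) to 0 (end).
    """
    elapsed = 1.0 - progress_remaining
    if elapsed >= exploration_fraction:
        return final_eps
    return initial_eps + elapsed * (final_eps - initial_eps) / exploration_fraction


def epsilon_greedy(q_values, epsilon, rng, deterministic=False):
    """Pick a random action with probability epsilon, else the greedy one."""
    if not deterministic and rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]

greedy_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng, deterministic=True)
print(greedy_action)  # 1
```

With epsilon at 1.0 every non-deterministic call would be random, yet the deterministic call still returns the highest-valued action, which is why evaluation loops typically pass deterministic=True.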
-
save
(path: Union[str, pathlib.Path, io.BufferedIOBase], exclude: Optional[Iterable[str]] = None, include: Optional[Iterable[str]] = None) → None¶ Save all the attributes of the object and the model parameters in a zip-file.
- Parameters
path – (Union[str, pathlib.Path, io.BufferedIOBase]) path to the file where the RL agent should be saved
exclude – name of parameters that should be excluded in addition to the default one
include – name of parameters that might be excluded but should be included anyway
-
save_replay_buffer
(path: Union[str, pathlib.Path, io.BufferedIOBase]) → None¶ Save the replay buffer as a pickle file.
- Parameters
path – (Union[str, pathlib.Path, io.BufferedIOBase]) Path to the file where the replay buffer should be saved. If path is a str or pathlib.Path, the path is automatically created if necessary.
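Because the buffer is saved as a pickle and the path may be an io.BufferedIOBase, a file-like object can stand in for a file on disk. The snippet below is a standard-library illustration of that round trip, not the library's code path: the `transitions` list is a made-up stand-in for a real replay buffer.

```python
import io
import pickle

# Hypothetical stand-in for buffer contents:
# (obs, action, reward, next_obs, done)
transitions = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 2, True)]

# io.BytesIO is a BufferedIOBase, so it can play the role of the file
stream = io.BytesIO()
pickle.dump(transitions, stream)

# Rewind before reading the pickled bytes back
stream.seek(0)
restored = pickle.load(stream)
print(restored == transitions)  # True
```

As with any pickle, only load replay buffers from sources you trust, since unpickling can execute arbitrary code.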
-
set_env
(env: Union[gym.core.Env, stable_baselines3.common.vec_env.base_vec_env.VecEnv]) → None Checks the validity of the environment and, if it is coherent, sets it as the current environment. Furthermore, wraps any non-vectorized env into a vectorized one. Checked parameters: observation_space, action_space.
- Parameters
env – The environment for learning a policy
-
set_random_seed
(seed: Optional[int] = None) → None¶ Set the seed of the pseudo-random generators (python, numpy, pytorch, gym, action_space)
- Parameters
seed – (int)
DQN Policies
-
stable_baselines3.dqn.
MlpPolicy
¶ alias of
stable_baselines3.dqn.policies.DQNPolicy
-
class
stable_baselines3.dqn.
CnnPolicy
(observation_space: gym.spaces.space.Space, action_space: gym.spaces.space.Space, lr_schedule: Callable, net_arch: Optional[List[int]] = None, device: Union[torch.device, str] = 'auto', activation_fn: Type[torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, features_extractor_class: Type[stable_baselines3.common.torch_layers.BaseFeaturesExtractor] = <class 'stable_baselines3.common.torch_layers.NatureCNN'>, features_extractor_kwargs: Optional[Dict[str, Any]] = None, normalize_images: bool = True, optimizer_class: Type[torch.optim.optimizer.Optimizer] = <class 'torch.optim.adam.Adam'>, optimizer_kwargs: Optional[Dict[str, Any]] = None)[source]¶ Policy class for DQN when using images as input.
- Parameters
observation_space – (gym.spaces.Space) Observation space
action_space – (gym.spaces.Space) Action space
lr_schedule – (callable) Learning rate schedule (could be constant)
net_arch – (Optional[List[int]]) The specification of the policy and value networks.
device – (str or th.device) Device on which the code should run.
activation_fn – (Type[nn.Module]) Activation function
features_extractor_class – (Type[BaseFeaturesExtractor]) Features extractor to use.
normalize_images – (bool) Whether to normalize images or not, dividing by 255.0 (True by default)
optimizer_class – (Type[th.optim.Optimizer]) The optimizer to use, th.optim.Adam by default
optimizer_kwargs – (Optional[Dict[str, Any]]) Additional keyword arguments, excluding the learning rate, to pass to the optimizer
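A learning rate schedule of the kind accepted by learning_rate (a function of the current progress, from 1 to 0) might look like the sketch below. The factory name `linear_schedule` is hypothetical, and it is an assumption that lr_schedule on the policy receives the same remaining-progress value; a constant schedule would simply ignore its argument.

```python
def linear_schedule(initial_value):
    """Return a callable mapping remaining progress (1 -> 0) to a learning rate."""

    def schedule(progress_remaining):
        # Full learning rate at the start, decaying linearly to 0 at the end
        return progress_remaining * initial_value

    return schedule


lr_schedule = linear_schedule(1e-4)
print(lr_schedule(1.0))  # 0.0001
print(lr_schedule(0.0))  # 0.0
```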