Probability Distributions¶
Probability distributions used for the different action spaces:
- CategoricalDistribution -> Discrete
- DiagGaussianDistribution -> Box (continuous actions)
- StateDependentNoiseDistribution -> Box (continuous actions) when use_sde=True
The policy networks output parameters for the distributions (named flat
in the methods).
Actions are then sampled from those distributions.
For instance, in the case of discrete actions, the policy network outputs the probability of taking each action. The CategoricalDistribution allows sampling from it, computing the entropy and the log probability (log_prob), and backpropagating the gradient.
In the case of continuous actions, a Gaussian distribution is used. The policy network outputs the mean and (log) standard deviation of the distribution (assumed to be a DiagGaussianDistribution).
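To make this workflow concrete, here is a minimal sketch for the discrete case; the latent dimension, number of actions, and random latent batch are made up for illustration:

```python
import torch as th

from stable_baselines3.common.distributions import CategoricalDistribution

latent_dim, n_actions = 64, 4
dist = CategoricalDistribution(action_dim=n_actions)

# Layer mapping the policy latent vector to the action logits (the "flat" parameters)
action_net = dist.proba_distribution_net(latent_dim=latent_dim)

latent_pi = th.randn(8, latent_dim)      # batch of latent vectors from the policy network
action_logits = action_net(latent_pi)    # distribution parameters

dist.proba_distribution(action_logits)   # set the distribution parameters
actions = dist.get_actions()             # stochastic sample; deterministic=True would return the mode
log_prob = dist.log_prob(actions)        # log probability of the sampled actions
entropy = dist.entropy()                 # Shannon entropy of the distribution
```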
Probability distributions.
-
class stable_baselines3.common.distributions.BernoulliDistribution(action_dims: int)[source]¶
Bernoulli distribution for MultiBinary action spaces.
- Parameters
action_dims – (int) Number of binary actions
-
actions_from_params
(action_logits: torch.Tensor, deterministic: bool = False) → torch.Tensor[source]¶ Returns samples from the probability distribution given its parameters.
- Returns
(th.Tensor) actions
-
entropy
() → torch.Tensor[source]¶ Returns Shannon’s entropy of the probability distribution
- Returns
(Optional[th.Tensor]) the entropy, or None if no analytical form is known
-
log_prob
(actions: torch.Tensor) → torch.Tensor[source]¶ Returns the log likelihood
- Parameters
actions – (th.Tensor) the taken action
- Returns
(th.Tensor) The log likelihood of the distribution
-
log_prob_from_params
(action_logits: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Returns samples and the associated log probabilities from the probability distribution given its parameters.
- Returns
(th.Tuple[th.Tensor, th.Tensor]) actions and log prob
-
mode
() → torch.Tensor[source]¶ Returns the most likely action (deterministic output) from the probability distribution
- Returns
(th.Tensor) the most likely (deterministic) action
-
proba_distribution
(action_logits: torch.Tensor) → stable_baselines3.common.distributions.BernoulliDistribution[source]¶ Set parameters of the distribution.
- Returns
(Distribution) self
-
proba_distribution_net
(latent_dim: int) → torch.nn.modules.module.Module[source]¶ Create the layer that represents the distribution: it will be the logits of the Bernoulli distribution.
- Parameters
latent_dim – (int) Dimension of the last layer of the policy network (before the action layer)
- Returns
(nn.Linear)
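A short usage sketch, assuming a MultiBinary(3) action space and an arbitrary 64-dimensional latent vector:

```python
import torch as th

from stable_baselines3.common.distributions import BernoulliDistribution

dist = BernoulliDistribution(action_dims=3)
action_net = dist.proba_distribution_net(latent_dim=64)  # linear layer producing the 3 logits

action_logits = action_net(th.randn(5, 64))
actions = dist.actions_from_params(action_logits)  # one 0/1 value per binary action
log_prob = dist.log_prob(actions)
```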
-
class stable_baselines3.common.distributions.CategoricalDistribution(action_dim: int)[source]¶
Categorical distribution for discrete actions.
- Parameters
action_dim – (int) Number of discrete actions
-
actions_from_params
(action_logits: torch.Tensor, deterministic: bool = False) → torch.Tensor[source]¶ Returns samples from the probability distribution given its parameters.
- Returns
(th.Tensor) actions
-
entropy
() → torch.Tensor[source]¶ Returns Shannon’s entropy of the probability distribution
- Returns
(Optional[th.Tensor]) the entropy, or None if no analytical form is known
-
log_prob
(actions: torch.Tensor) → torch.Tensor[source]¶ Returns the log likelihood
- Parameters
actions – (th.Tensor) the taken action
- Returns
(th.Tensor) The log likelihood of the distribution
-
log_prob_from_params
(action_logits: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Returns samples and the associated log probabilities from the probability distribution given its parameters.
- Returns
(th.Tuple[th.Tensor, th.Tensor]) actions and log prob
-
mode
() → torch.Tensor[source]¶ Returns the most likely action (deterministic output) from the probability distribution
- Returns
(th.Tensor) the most likely (deterministic) action
-
proba_distribution
(action_logits: torch.Tensor) → stable_baselines3.common.distributions.CategoricalDistribution[source]¶ Set parameters of the distribution.
- Returns
(Distribution) self
-
proba_distribution_net
(latent_dim: int) → torch.nn.modules.module.Module[source]¶ Create the layer that represents the distribution: it will be the logits of the Categorical distribution. You can then get probabilities using a softmax.
- Parameters
latent_dim – (int) Dimension of the last layer of the policy network (before the action layer)
- Returns
(nn.Linear)
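A brief sketch contrasting stochastic sampling and the deterministic mode for a 3-action Discrete space (the logits are arbitrary):

```python
import torch as th

from stable_baselines3.common.distributions import CategoricalDistribution

dist = CategoricalDistribution(action_dim=3)
action_logits = th.tensor([[2.0, 0.5, -1.0]])

stochastic = dist.actions_from_params(action_logits)                  # sampled action
greedy = dist.actions_from_params(action_logits, deterministic=True)  # most likely action (the mode)
probs = th.softmax(action_logits, dim=-1)                             # probabilities via softmax, as noted above
```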
-
class stable_baselines3.common.distributions.DiagGaussianDistribution(action_dim: int)[source]¶
Gaussian distribution with diagonal covariance matrix, for continuous actions.
- Parameters
action_dim – (int) Dimension of the action space.
-
actions_from_params
(mean_actions: torch.Tensor, log_std: torch.Tensor, deterministic: bool = False) → torch.Tensor[source]¶ Returns samples from the probability distribution given its parameters.
- Returns
(th.Tensor) actions
-
entropy
() → torch.Tensor[source]¶ Returns Shannon’s entropy of the probability distribution
- Returns
(Optional[th.Tensor]) the entropy, or None if no analytical form is known
-
log_prob
(actions: torch.Tensor) → torch.Tensor[source]¶ Get the log probabilities of actions according to the distribution. Note that you must first call the
proba_distribution()
method.
- Parameters
actions – (th.Tensor)
- Returns
(th.Tensor)
-
log_prob_from_params
(mean_actions: torch.Tensor, log_std: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute the log probability of taking an action given the distribution parameters.
- Parameters
mean_actions – (th.Tensor)
log_std – (th.Tensor)
- Returns
(Tuple[th.Tensor, th.Tensor])
-
mode
() → torch.Tensor[source]¶ Returns the most likely action (deterministic output) from the probability distribution
- Returns
(th.Tensor) the most likely (deterministic) action
-
proba_distribution
(mean_actions: torch.Tensor, log_std: torch.Tensor) → stable_baselines3.common.distributions.DiagGaussianDistribution[source]¶ Create the distribution given its parameters (mean, std)
- Parameters
mean_actions – (th.Tensor)
log_std – (th.Tensor)
- Returns
(DiagGaussianDistribution)
-
proba_distribution_net
(latent_dim: int, log_std_init: float = 0.0) → Tuple[torch.nn.modules.module.Module, torch.nn.parameter.Parameter][source]¶ Create the layers and parameter that represent the distribution: one output will be the mean of the Gaussian, the other parameter will be the standard deviation (log std in fact to allow negative values)
- Parameters
latent_dim – (int) Dimension of the last layer of the policy (before the action layer)
log_std_init – (float) Initial value for the log standard deviation
- Returns
(nn.Linear, nn.Parameter)
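A minimal sketch for a 2-dimensional Box action space; the latent dimension and batch size are illustrative:

```python
import torch as th

from stable_baselines3.common.distributions import DiagGaussianDistribution

dist = DiagGaussianDistribution(action_dim=2)
# One linear layer for the mean, plus a learnable log std parameter
mean_net, log_std = dist.proba_distribution_net(latent_dim=64, log_std_init=0.0)

latent_pi = th.randn(8, 64)
mean_actions = mean_net(latent_pi)

dist.proba_distribution(mean_actions, log_std)
actions = dist.get_actions()        # stochastic sample
log_prob = dist.log_prob(actions)   # summed over action dimensions, shape (8,)

# Equivalently, sample and get the log probability in one call:
actions, log_prob = dist.log_prob_from_params(mean_actions, log_std)
```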
-
class stable_baselines3.common.distributions.Distribution[source]¶
Abstract base class for distributions.
-
abstract
actions_from_params
(*args, **kwargs) → torch.Tensor[source]¶ Returns samples from the probability distribution given its parameters.
- Returns
(th.Tensor) actions
-
abstract
entropy
() → Optional[torch.Tensor][source]¶ Returns Shannon’s entropy of the probability distribution
- Returns
(Optional[th.Tensor]) the entropy, or None if no analytical form is known
-
get_actions
(deterministic: bool = False) → torch.Tensor[source]¶ Return actions according to the probability distribution.
- Parameters
deterministic – (bool)
- Returns
(th.Tensor)
-
abstract
log_prob
(x: torch.Tensor) → torch.Tensor[source]¶ Returns the log likelihood
- Parameters
x – (th.Tensor) the taken action
- Returns
(th.Tensor) The log likelihood of the distribution
-
abstract
log_prob_from_params
(*args, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]¶ Returns samples and the associated log probabilities from the probability distribution given its parameters.
- Returns
(th.Tuple[th.Tensor, th.Tensor]) actions and log prob
-
abstract
mode
() → torch.Tensor[source]¶ Returns the most likely action (deterministic output) from the probability distribution
- Returns
(th.Tensor) the most likely (deterministic) action
-
abstract
proba_distribution
(*args, **kwargs) → stable_baselines3.common.distributions.Distribution[source]¶ Set parameters of the distribution.
- Returns
(Distribution) self
-
class stable_baselines3.common.distributions.MultiCategoricalDistribution(action_dims: List[int])[source]¶
MultiCategorical distribution for multi discrete actions.
- Parameters
action_dims – (List[int]) List of sizes of discrete action spaces
-
actions_from_params
(action_logits: torch.Tensor, deterministic: bool = False) → torch.Tensor[source]¶ Returns samples from the probability distribution given its parameters.
- Returns
(th.Tensor) actions
-
entropy
() → torch.Tensor[source]¶ Returns Shannon’s entropy of the probability distribution
- Returns
(Optional[th.Tensor]) the entropy, or None if no analytical form is known
-
log_prob
(actions: torch.Tensor) → torch.Tensor[source]¶ Returns the log likelihood
- Parameters
actions – (th.Tensor) the taken action
- Returns
(th.Tensor) The log likelihood of the distribution
-
log_prob_from_params
(action_logits: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Returns samples and the associated log probabilities from the probability distribution given its parameters.
- Returns
(th.Tuple[th.Tensor, th.Tensor]) actions and log prob
-
mode
() → torch.Tensor[source]¶ Returns the most likely action (deterministic output) from the probability distribution
- Returns
(th.Tensor) the most likely (deterministic) action
-
proba_distribution
(action_logits: torch.Tensor) → stable_baselines3.common.distributions.MultiCategoricalDistribution[source]¶ Set parameters of the distribution.
- Returns
(Distribution) self
-
proba_distribution_net
(latent_dim: int) → torch.nn.modules.module.Module[source]¶ Create the layer that represents the distribution: it will be the logits (flattened) of the MultiCategorical distribution. You can then get probabilities using a softmax on each sub-space.
- Parameters
latent_dim – (int) Dimension of the last layer of the policy network (before the action layer)
- Returns
(nn.Linear)
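A usage sketch, assuming a MultiDiscrete([3, 2]) action space (two sub-actions with 3 and 2 choices):

```python
import torch as th

from stable_baselines3.common.distributions import MultiCategoricalDistribution

dist = MultiCategoricalDistribution(action_dims=[3, 2])
action_net = dist.proba_distribution_net(latent_dim=64)  # outputs 3 + 2 = 5 flattened logits

action_logits = action_net(th.randn(4, 64))
actions = dist.actions_from_params(action_logits)  # shape (4, 2): one index per sub-space
log_prob = dist.log_prob(actions)                  # sum of the per-sub-space log probabilities
```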
-
class stable_baselines3.common.distributions.SquashedDiagGaussianDistribution(action_dim: int, epsilon: float = 1e-06)[source]¶
Gaussian distribution with diagonal covariance matrix, followed by a squashing function (tanh) to ensure bounds.
- Parameters
action_dim – (int) Dimension of the action space.
epsilon – (float) small value to avoid NaN due to numerical imprecision.
-
entropy
() → Optional[torch.Tensor][source]¶ Returns Shannon’s entropy of the probability distribution
- Returns
(Optional[th.Tensor]) the entropy, or None if no analytical form is known
-
log_prob
(actions: torch.Tensor, gaussian_actions: Optional[torch.Tensor] = None) → torch.Tensor[source]¶ Get the log probabilities of actions according to the distribution. Note that you must first call the
proba_distribution()
method.
- Parameters
actions – (th.Tensor)
- Returns
(th.Tensor)
-
log_prob_from_params
(mean_actions: torch.Tensor, log_std: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute the log probability of taking an action given the distribution parameters.
- Parameters
mean_actions – (th.Tensor)
log_std – (th.Tensor)
- Returns
(Tuple[th.Tensor, th.Tensor])
-
mode
() → torch.Tensor[source]¶ Returns the most likely action (deterministic output) from the probability distribution
- Returns
(th.Tensor) the most likely (deterministic) action
-
proba_distribution
(mean_actions: torch.Tensor, log_std: torch.Tensor) → stable_baselines3.common.distributions.SquashedDiagGaussianDistribution[source]¶ Create the distribution given its parameters (mean, std)
- Parameters
mean_actions – (th.Tensor)
log_std – (th.Tensor)
- Returns
(SquashedDiagGaussianDistribution)
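A small sketch showing that sampled actions stay within [-1, 1]; the mean and log std tensors are constructed directly here for brevity (in a policy they would come from the network):

```python
import torch as th

from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution

dist = SquashedDiagGaussianDistribution(action_dim=2)

mean_actions = th.zeros(8, 2)  # would normally be produced by the policy network
log_std = th.zeros(2)          # would normally be a learned parameter

actions, log_prob = dist.log_prob_from_params(mean_actions, log_std)
assert actions.abs().max() <= 1.0  # tanh squashing keeps the actions within bounds
```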
-
class stable_baselines3.common.distributions.StateDependentNoiseDistribution(action_dim: int, full_std: bool = True, use_expln: bool = False, squash_output: bool = False, learn_features: bool = False, epsilon: float = 1e-06)[source]¶
Distribution class for using generalized State Dependent Exploration (gSDE). Paper: https://arxiv.org/abs/2005.05719
It is used to create the noise exploration matrix and compute the log probability of an action with that noise.
- Parameters
action_dim – (int) Dimension of the action space.
full_std – (bool) Whether to use (n_features x n_actions) parameters for the std instead of only (n_features,)
use_expln – (bool) Use expln() instead of exp() to ensure a positive standard deviation (cf. paper). It keeps the variance above zero and prevents it from growing too fast. In practice, exp() is usually enough.
squash_output – (bool) Whether to squash the output using a tanh function; this ensures bounds are satisfied.
learn_features – (bool) Whether to learn features for gSDE. This enables gradients to be backpropagated through the features latent_sde in the code.
epsilon – (float) small value to avoid NaN due to numerical imprecision.
-
actions_from_params
(mean_actions: torch.Tensor, log_std: torch.Tensor, latent_sde: torch.Tensor, deterministic: bool = False) → torch.Tensor[source]¶ Returns samples from the probability distribution given its parameters.
- Returns
(th.Tensor) actions
-
entropy
() → Optional[torch.Tensor][source]¶ Returns Shannon’s entropy of the probability distribution
- Returns
(Optional[th.Tensor]) the entropy, or None if no analytical form is known
-
get_std
(log_std: torch.Tensor) → torch.Tensor[source]¶ Get the standard deviation from the learned parameter (log of it by default). This ensures that the std is positive.
- Parameters
log_std – (th.Tensor)
- Returns
(th.Tensor)
-
log_prob
(actions: torch.Tensor) → torch.Tensor[source]¶ Returns the log likelihood
- Parameters
actions – (th.Tensor) the taken action
- Returns
(th.Tensor) The log likelihood of the distribution
-
log_prob_from_params
(mean_actions: torch.Tensor, log_std: torch.Tensor, latent_sde: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Returns samples and the associated log probabilities from the probability distribution given its parameters.
- Returns
(th.Tuple[th.Tensor, th.Tensor]) actions and log prob
-
mode
() → torch.Tensor[source]¶ Returns the most likely action (deterministic output) from the probability distribution
- Returns
(th.Tensor) the most likely (deterministic) action
-
proba_distribution
(mean_actions: torch.Tensor, log_std: torch.Tensor, latent_sde: torch.Tensor) → stable_baselines3.common.distributions.StateDependentNoiseDistribution[source]¶ Create the distribution given its parameters (mean, std)
- Parameters
mean_actions – (th.Tensor)
log_std – (th.Tensor)
latent_sde – (th.Tensor)
- Returns
(StateDependentNoiseDistribution)
-
proba_distribution_net
(latent_dim: int, log_std_init: float = -2.0, latent_sde_dim: Optional[int] = None) → Tuple[torch.nn.modules.module.Module, torch.nn.parameter.Parameter][source]¶ Create the layers and parameter that represent the distribution: one output will be the deterministic action, the other parameter will be the standard deviation of the distribution that controls the weights of the noise matrix.
- Parameters
latent_dim – (int) Dimension of the last layer of the policy (before the action layer)
log_std_init – (float) Initial value for the log standard deviation
latent_sde_dim – (Optional[int]) Dimension of the last layer of the feature extractor for gSDE. By default, it is shared with the policy network.
- Returns
(nn.Linear, nn.Parameter)
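A hedged sketch of the gSDE workflow using only the methods documented above; the shapes are illustrative and the exploration noise depends on the latent features (latent_sde):

```python
import torch as th

from stable_baselines3.common.distributions import StateDependentNoiseDistribution

dist = StateDependentNoiseDistribution(action_dim=2)
mean_net, log_std = dist.proba_distribution_net(latent_dim=64, log_std_init=-2.0)

latent = th.randn(8, 64)      # features shared by the mean and the noise (latent_sde)
mean_actions = mean_net(latent)

actions = dist.actions_from_params(mean_actions, log_std, latent)
log_prob = dist.log_prob(actions)
std = dist.get_std(log_std)   # positive standard deviation recovered from log_std
```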
-
class stable_baselines3.common.distributions.TanhBijector(epsilon: float = 1e-06)[source]¶
Bijective transformation of a probability distribution using a squashing function (tanh). TODO: use Pyro instead (https://pyro.ai/)
- Parameters
epsilon – (float) small value to avoid NaN due to numerical imprecision.
-
static
atanh
(x: torch.Tensor) → torch.Tensor[source]¶ Inverse of tanh: 0.5 * torch.log((1 + x) / (1 - x)).
Taken from Pyro: https://github.com/pyro-ppl/pyro
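A quick check of the documented inverse: atanh(tanh(x)) recovers x for inputs away from the bounds:

```python
import torch as th

from stable_baselines3.common.distributions import TanhBijector

x = th.tensor([-1.0, 0.0, 0.5])
recovered = TanhBijector.atanh(th.tanh(x))
assert th.allclose(recovered, x, atol=1e-5)
```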
-
stable_baselines3.common.distributions.make_proba_distribution(action_space: gym.spaces.space.Space, use_sde: bool = False, dist_kwargs: Optional[Dict[str, Any]] = None) → stable_baselines3.common.distributions.Distribution[source]¶
Return an instance of Distribution for the correct type of action space.
- Parameters
action_space – (gym.spaces.Space) the input action space
use_sde – (bool) Force the use of StateDependentNoiseDistribution instead of DiagGaussianDistribution
dist_kwargs – (Optional[Dict[str, Any]]) Keyword arguments to pass to the probability distribution
- Returns
(Distribution) the appropriate Distribution object
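A sketch of how the helper maps action spaces to distributions (per the table at the top of this page):

```python
from gym import spaces

from stable_baselines3.common.distributions import make_proba_distribution

dist_discrete = make_proba_distribution(spaces.Discrete(4))                 # CategoricalDistribution
dist_box = make_proba_distribution(spaces.Box(low=-1, high=1, shape=(2,)))  # DiagGaussianDistribution
dist_sde = make_proba_distribution(spaces.Box(low=-1, high=1, shape=(2,)),
                                   use_sde=True)                            # StateDependentNoiseDistribution
```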
-
stable_baselines3.common.distributions.sum_independent_dims(tensor: torch.Tensor) → torch.Tensor[source]¶
Continuous actions are usually considered to be independent, so we can sum the components of the log_prob or the entropy.
- Parameters
tensor – (th.Tensor) shape: (n_batch, n_actions) or (n_batch,)
- Returns
(th.Tensor) shape: (n_batch,)
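A tiny example of summing per-dimension log probabilities (shape (n_batch, n_actions)) into one value per batch element:

```python
import torch as th

from stable_baselines3.common.distributions import sum_independent_dims

per_dim_log_prob = th.randn(8, 3)  # (n_batch, n_actions)
joint_log_prob = sum_independent_dims(per_dim_log_prob)
assert joint_log_prob.shape == (8,)  # one log probability per batch element
```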