# Probability Distributions

Probability distributions used for the different action spaces:

• `CategoricalDistribution` -> Discrete

• `DiagGaussianDistribution` -> Box (continuous actions)

• `MultiCategoricalDistribution` -> MultiDiscrete

• `BernoulliDistribution` -> MultiBinary

• `StateDependentNoiseDistribution` -> Box (continuous actions) when `use_sde=True`

The policy networks output parameters for the distributions (named `flat` in the methods). Actions are then sampled from those distributions.

For instance, in the case of discrete actions, the policy network outputs the probability of taking each action. The `CategoricalDistribution` allows sampling from it, computing the entropy and the log probability (`log_prob`), and backpropagating the gradient.
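
A minimal sketch of that flow (the layer sizes and inputs below are hypothetical, chosen only for illustration):

```python
import torch as th

from stable_baselines3.common.distributions import CategoricalDistribution

latent_dim, n_actions, batch_size = 64, 4, 2  # hypothetical sizes

dist = CategoricalDistribution(action_dim=n_actions)
# The action net maps policy features to the distribution parameters (logits)
action_net = dist.proba_distribution_net(latent_dim=latent_dim)

latent = th.randn(batch_size, latent_dim)  # stand-in for policy network features
dist.proba_distribution(action_net(latent))

actions = dist.sample()            # stochastic actions, shape (batch_size,)
log_prob = dist.log_prob(actions)  # differentiable, so gradients can flow back
entropy = dist.entropy()
```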

In the case of continuous actions, a Gaussian distribution is used. The policy network outputs the mean and the (log) standard deviation of the distribution (assumed to be a `DiagGaussianDistribution`).
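
The continuous case looks similar; here is a sketch with hypothetical sizes. `proba_distribution_net()` returns both the mean network and the state-independent log std parameter:

```python
import torch as th

from stable_baselines3.common.distributions import DiagGaussianDistribution

latent_dim, action_dim, batch_size = 64, 2, 3  # hypothetical sizes

dist = DiagGaussianDistribution(action_dim=action_dim)
# mean_net outputs the mean; log_std is a learned nn.Parameter (state-independent)
mean_net, log_std = dist.proba_distribution_net(latent_dim=latent_dim, log_std_init=0.0)

mean_actions = mean_net(th.randn(batch_size, latent_dim))
dist.proba_distribution(mean_actions, log_std)

actions = dist.sample()            # shape (batch_size, action_dim)
log_prob = dist.log_prob(actions)  # shape (batch_size,), summed over action dims
```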


class stable_baselines3.common.distributions.BernoulliDistribution(action_dims)[source]

Bernoulli distribution for MultiBinary action spaces.

Parameters:

action_dims (int) – Number of binary actions

actions_from_params(action_logits, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

Parameters:
• action_logits (Tensor) –

• deterministic (bool) –

Return type:

Tensor

entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor

log_prob(actions)[source]

Returns the log likelihood

Parameters:

actions (Tensor) – the taken actions

Returns:

The log likelihood of the actions under the distribution

Return type:

Tensor

log_prob_from_params(action_logits)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

Parameters:

action_logits (Tensor) –

Return type:

Tuple[Tensor, Tensor]

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

proba_distribution(action_logits)[source]

Set parameters of the distribution.

Returns:

self

Parameters:

action_logits (Tensor) –

Return type:

SelfBernoulliDistribution

proba_distribution_net(latent_dim)[source]

Create the layer that represents the distribution: it will be the logits of the Bernoulli distribution.

Parameters:

latent_dim (int) – Dimension of the last layer of the policy network (before the action layer)

Return type:

Module

sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor
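
For illustration, a short usage sketch (sizes are hypothetical): each output logit parameterizes one independent binary sub-action.

```python
import torch as th

from stable_baselines3.common.distributions import BernoulliDistribution

latent_dim, n_binary, batch_size = 32, 3, 2  # hypothetical sizes

dist = BernoulliDistribution(action_dims=n_binary)
action_net = dist.proba_distribution_net(latent_dim=latent_dim)  # Linear: 32 -> 3 logits

logits = action_net(th.randn(batch_size, latent_dim))
actions, log_prob = dist.log_prob_from_params(logits)
# actions: (batch_size, 3) tensor of 0./1. values, one per binary sub-action
```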

class stable_baselines3.common.distributions.CategoricalDistribution(action_dim)[source]

Categorical distribution for discrete actions.

Parameters:

action_dim (int) – Number of discrete actions

actions_from_params(action_logits, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

Parameters:
• action_logits (Tensor) –

• deterministic (bool) –

Return type:

Tensor

entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor

log_prob(actions)[source]

Returns the log likelihood

Parameters:

actions (Tensor) – the taken actions

Returns:

The log likelihood of the actions under the distribution

Return type:

Tensor

log_prob_from_params(action_logits)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

Parameters:

action_logits (Tensor) –

Return type:

Tuple[Tensor, Tensor]

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

proba_distribution(action_logits)[source]

Set parameters of the distribution.

Returns:

self

Parameters:

action_logits (Tensor) –

Return type:

SelfCategoricalDistribution

proba_distribution_net(latent_dim)[source]

Create the layer that represents the distribution: it will be the logits of the Categorical distribution. You can then get probabilities using a softmax.

Parameters:

latent_dim (int) – Dimension of the last layer of the policy network (before the action layer)

Return type:

Module

sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor

class stable_baselines3.common.distributions.DiagGaussianDistribution(action_dim)[source]

Gaussian distribution with diagonal covariance matrix, for continuous actions.

Parameters:

action_dim (int) – Dimension of the action space.

actions_from_params(mean_actions, log_std, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

• deterministic (bool) –

Return type:

Tensor

entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor | None

log_prob(actions)[source]

Get the log probabilities of actions according to the distribution. Note that you must first call the `proba_distribution()` method.

Parameters:

actions (Tensor) –

Returns:

the log probability of the actions

Return type:

Tensor

log_prob_from_params(mean_actions, log_std)[source]

Compute the log probability of taking an action given the distribution parameters.

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

Returns:

actions and the associated log probabilities

Return type:

Tuple[Tensor, Tensor]

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

proba_distribution(mean_actions, log_std)[source]

Create the distribution given its parameters (mean, std)

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

Returns:

self

Return type:

SelfDiagGaussianDistribution

proba_distribution_net(latent_dim, log_std_init=0.0)[source]

Create the layer and parameter that represent the distribution: the layer outputs the mean of the Gaussian, and the parameter holds the standard deviation (stored as log std to allow negative values).

Parameters:
• latent_dim (int) – Dimension of the last layer of the policy (before the action layer)

• log_std_init (float) – Initial value for the log standard deviation

Return type:

Tuple[Module, Parameter]

sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor

class stable_baselines3.common.distributions.Distribution[source]

Abstract base class for distributions.

abstract actions_from_params(*args, **kwargs)[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

Return type:

Tensor

abstract entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor | None

get_actions(deterministic=False)[source]

Return actions according to the probability distribution.

Parameters:

deterministic (bool) –

Return type:

Tensor
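
In other words, `get_actions()` dispatches to `mode()` when `deterministic=True` and to `sample()` otherwise. A small sketch with a concrete subclass (the logits are arbitrary):

```python
import torch as th

from stable_baselines3.common.distributions import CategoricalDistribution

dist = CategoricalDistribution(action_dim=3)
dist.proba_distribution(th.tensor([[2.0, 0.1, 0.1]]))

stochastic = dist.get_actions(deterministic=False)  # same as dist.sample()
greedy = dist.get_actions(deterministic=True)       # same as dist.mode()
assert greedy.item() == 0  # index of the largest logit
```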

abstract log_prob(x)[source]

Returns the log likelihood

Parameters:

x (Tensor) – the taken action

Returns:

The log likelihood of the actions under the distribution

Return type:

Tensor

abstract log_prob_from_params(*args, **kwargs)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

Return type:

Tuple[Tensor, Tensor]

abstract mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

abstract proba_distribution(*args, **kwargs)[source]

Set parameters of the distribution.

Returns:

self

Return type:

SelfDistribution

abstract proba_distribution_net(*args, **kwargs)[source]

Create the layers and parameters that represent the distribution.

Subclasses must define this, but the arguments and return type vary between concrete classes.

Return type:

Module | Tuple[Module, Parameter]

abstract sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor

class stable_baselines3.common.distributions.MultiCategoricalDistribution(action_dims)[source]

MultiCategorical distribution for multi-discrete actions.

Parameters:

action_dims (List[int]) – List of sizes of discrete action spaces

actions_from_params(action_logits, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

Parameters:
• action_logits (Tensor) –

• deterministic (bool) –

Return type:

Tensor

entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor

log_prob(actions)[source]

Returns the log likelihood

Parameters:

actions (Tensor) – the taken actions

Returns:

The log likelihood of the actions under the distribution

Return type:

Tensor

log_prob_from_params(action_logits)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

Parameters:

action_logits (Tensor) –

Return type:

Tuple[Tensor, Tensor]

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

proba_distribution(action_logits)[source]

Set parameters of the distribution.

Returns:

self

Parameters:

action_logits (Tensor) –

Return type:

SelfMultiCategoricalDistribution

proba_distribution_net(latent_dim)[source]

Create the layer that represents the distribution: it will be the logits (flattened) of the MultiCategorical distribution. You can then get probabilities using a softmax on each sub-space.

Parameters:

latent_dim (int) – Dimension of the last layer of the policy network (before the action layer)

Return type:

Module

sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor
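
A sketch of how the flattened logits map to the sub-spaces (sizes are hypothetical):

```python
import torch as th

from stable_baselines3.common.distributions import MultiCategoricalDistribution

latent_dim, batch_size = 32, 2  # hypothetical sizes
action_dims = [3, 2]            # two sub-spaces with 3 and 2 choices

dist = MultiCategoricalDistribution(action_dims=action_dims)
# A single linear layer outputs the flattened logits: 3 + 2 = 5 values
action_net = dist.proba_distribution_net(latent_dim=latent_dim)

flat_logits = action_net(th.randn(batch_size, latent_dim))  # shape (2, 5)
dist.proba_distribution(flat_logits)
actions = dist.sample()  # shape (batch_size, 2): one index per sub-space
```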

class stable_baselines3.common.distributions.SquashedDiagGaussianDistribution(action_dim, epsilon=1e-06)[source]

Gaussian distribution with diagonal covariance matrix, followed by a squashing function (tanh) to ensure bounds.

Parameters:
• action_dim (int) – Dimension of the action space.

• epsilon (float) – small value to avoid NaN due to numerical imprecision.

entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor | None

log_prob(actions, gaussian_actions=None)[source]

Get the log probabilities of actions according to the distribution. Note that you must first call the `proba_distribution()` method.

Parameters:
• actions (Tensor) –

• gaussian_actions (Tensor | None) –

Returns:

the log probability of the actions

Return type:

Tensor

log_prob_from_params(mean_actions, log_std)[source]

Compute the log probability of taking an action given the distribution parameters.

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

Returns:

actions and the associated log probabilities

Return type:

Tuple[Tensor, Tensor]

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

proba_distribution(mean_actions, log_std)[source]

Create the distribution given its parameters (mean, std)

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

Returns:

self

Return type:

SelfSquashedDiagGaussianDistribution

sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor
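
A sketch showing the effect of the squashing (hypothetical sizes); the tanh guarantees that sampled actions stay in (-1, 1), and `log_prob()` includes the corresponding change-of-variable correction:

```python
import torch as th

from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution

latent_dim, action_dim = 16, 2  # hypothetical sizes

dist = SquashedDiagGaussianDistribution(action_dim=action_dim)
mean_net, log_std = dist.proba_distribution_net(latent_dim=latent_dim)

mean_actions = mean_net(th.randn(5, latent_dim))
actions, log_prob = dist.log_prob_from_params(mean_actions, log_std)
assert actions.abs().max() < 1.0  # tanh keeps actions strictly inside (-1, 1)
```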

class stable_baselines3.common.distributions.StateDependentNoiseDistribution(action_dim, full_std=True, use_expln=False, squash_output=False, learn_features=False, epsilon=1e-06)[source]

Distribution class for using generalized State Dependent Exploration (gSDE). Paper: https://arxiv.org/abs/2005.05719

It is used to create the noise exploration matrix and compute the log probability of an action with that noise.

Parameters:
• action_dim (int) – Dimension of the action space.

• full_std (bool) – Whether to use (n_features x n_actions) parameters for the std instead of only (n_features,)

• use_expln (bool) – Use the `expln()` function instead of `exp()` to ensure a positive standard deviation (cf. paper). It keeps the variance above zero and prevents it from growing too fast. In practice, `exp()` is usually enough.

• squash_output (bool) – Whether to squash the output using a tanh function; this ensures that the bounds are satisfied.

• learn_features (bool) – Whether to learn features for gSDE. If so, gradients are backpropagated through the features `latent_sde` in the code.

• epsilon (float) – small value to avoid NaN due to numerical imprecision.

actions_from_params(mean_actions, log_std, latent_sde, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

• latent_sde (Tensor) –

• deterministic (bool) –

Return type:

Tensor

entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor | None

get_std(log_std)[source]

Get the standard deviation from the learned parameter (log of it by default). This ensures that the std is positive.

Parameters:

log_std (Tensor) –

Returns:

the standard deviation

Return type:

Tensor

log_prob(actions)[source]

Returns the log likelihood

Parameters:

actions (Tensor) – the taken actions

Returns:

The log likelihood of the actions under the distribution

Return type:

Tensor

log_prob_from_params(mean_actions, log_std, latent_sde)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

• latent_sde (Tensor) –

Return type:

Tuple[Tensor, Tensor]

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

proba_distribution(mean_actions, log_std, latent_sde)[source]

Create the distribution given its parameters (mean, std)

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

• latent_sde (Tensor) –

Returns:

self

Return type:

SelfStateDependentNoiseDistribution

proba_distribution_net(latent_dim, log_std_init=-2.0, latent_sde_dim=None)[source]

Create the layer and parameter that represent the distribution: the layer outputs the deterministic action, and the parameter holds the standard deviation of the distribution that controls the weights of the noise matrix.

Parameters:
• latent_dim (int) – Dimension of the last layer of the policy (before the action layer)

• log_std_init (float) – Initial value for the log standard deviation

• latent_sde_dim (int | None) – Dimension of the last layer of the features extractor for gSDE. By default, it is shared with the policy network.

Return type:

Tuple[Module, Parameter]

sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor

sample_weights(log_std, batch_size=1)[source]

Sample weights for the noise exploration matrix, using a centered Gaussian distribution.

Parameters:
• log_std (Tensor) –

• batch_size (int) –

Return type:

None
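
A usage sketch (hypothetical sizes). Note that an exploration matrix is sampled when the net is created, and is only refreshed when `sample_weights()` is called again, e.g. at the start of a rollout:

```python
import torch as th

from stable_baselines3.common.distributions import StateDependentNoiseDistribution

latent_dim, action_dim, batch_size = 16, 2, 4  # hypothetical sizes

dist = StateDependentNoiseDistribution(action_dim=action_dim)
# Also samples an initial exploration (noise) matrix internally
mean_net, log_std = dist.proba_distribution_net(latent_dim=latent_dim)

latent = th.randn(batch_size, latent_dim)
# The noise depends on the features (latent_sde), hence "state-dependent"
dist.proba_distribution(mean_net(latent), log_std, latent_sde=latent)
actions = dist.sample()

dist.sample_weights(log_std, batch_size=1)  # refresh the noise matrix
```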

class stable_baselines3.common.distributions.TanhBijector(epsilon=1e-06)[source]

Bijective transformation of a probability distribution using a squashing function (tanh)

Parameters:

epsilon (float) – small value to avoid NaN due to numerical imprecision.

static atanh(x)[source]

Inverse of tanh. Taken from Pyro: https://github.com/pyro-ppl/pyro

Computes `0.5 * torch.log((1 + x) / (1 - x))`.

Parameters:

x (Tensor) –

Return type:

Tensor

static inverse(y)[source]

Inverse of tanh; the input is clipped away from ±1 to avoid numerical imprecision.

Parameters:

y (Tensor) –

Return type:

Tensor
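
A round-trip sketch; the clipping in `inverse()` avoids infinities at the boundary:

```python
import torch as th

from stable_baselines3.common.distributions import TanhBijector

x = th.tensor([-2.0, 0.0, 2.0])
y = th.tanh(x)
x_back = TanhBijector.inverse(y)  # numerically safe atanh
assert th.allclose(x, x_back, atol=1e-4)
```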

stable_baselines3.common.distributions.kl_divergence(dist_true, dist_pred)[source]

Wrapper for the PyTorch implementation of the full-form KL divergence.

Parameters:
• dist_true (Distribution) – the p distribution

• dist_pred (Distribution) – the q distribution

Returns:

KL(dist_true||dist_pred)

Return type:

Tensor
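
For example, with two categorical distributions over the same three actions (the logits below are arbitrary):

```python
import torch as th

from stable_baselines3.common.distributions import (
    CategoricalDistribution,
    kl_divergence,
)

# proba_distribution() returns self, so construction can be chained
dist_true = CategoricalDistribution(action_dim=3).proba_distribution(
    th.tensor([[1.0, 0.0, 0.0]])
)
dist_pred = CategoricalDistribution(action_dim=3).proba_distribution(
    th.tensor([[0.0, 1.0, 0.0]])
)
kl = kl_divergence(dist_true, dist_pred)  # KL(dist_true || dist_pred), shape (1,)
```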

stable_baselines3.common.distributions.make_proba_distribution(action_space, use_sde=False, dist_kwargs=None)[source]

Return an instance of Distribution for the correct type of action space

Parameters:
• action_space (Space) – the input action space

• use_sde (bool) – Force the use of StateDependentNoiseDistribution instead of DiagGaussianDistribution

• dist_kwargs (Dict[str, Any] | None) – Keyword arguments to pass to the probability distribution

Returns:

the appropriate Distribution object

Return type:

Distribution
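
For instance (assuming a Gymnasium-based version of Stable-Baselines3; older versions use `gym` spaces instead):

```python
import numpy as np
from gymnasium import spaces

from stable_baselines3.common.distributions import (
    CategoricalDistribution,
    DiagGaussianDistribution,
    make_proba_distribution,
)

assert isinstance(make_proba_distribution(spaces.Discrete(4)), CategoricalDistribution)

box = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
assert isinstance(make_proba_distribution(box), DiagGaussianDistribution)
# With use_sde=True, a Box space yields a StateDependentNoiseDistribution instead
```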

stable_baselines3.common.distributions.sum_independent_dims(tensor)[source]

Continuous actions are usually considered to be independent, so we can sum components of the `log_prob` or the entropy.

Parameters:

tensor (Tensor) – shape: (n_batch, n_actions) or (n_batch,)

Returns:

shape: (n_batch,) for (n_batch, n_actions) input, scalar for (n_batch,) input

Return type:

Tensor
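
A quick illustration with arbitrary per-dimension log probabilities:

```python
import torch as th

from stable_baselines3.common.distributions import sum_independent_dims

# Per-dimension log probs for a batch of 4 actions with 2 independent components
log_prob = th.randn(4, 2)
joint = sum_independent_dims(log_prob)  # sums over the action dimensions
assert joint.shape == (4,)
```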