# Probability Distributions

Probability distributions used for the different action spaces:

• `CategoricalDistribution` -> Discrete

• `DiagGaussianDistribution` -> Box (continuous actions)

• `MultiCategoricalDistribution` -> MultiDiscrete

• `BernoulliDistribution` -> MultiBinary

• `StateDependentNoiseDistribution` -> Box (continuous actions) when `use_sde=True`

The policy networks output parameters for the distributions (named `flat` in the methods). Actions are then sampled from those distributions.

For instance, in the case of discrete actions, the policy network outputs the probability of taking each action. The `CategoricalDistribution` then allows sampling from it, computing the entropy and the log probability (`log_prob`), and backpropagating the gradient.

In the case of continuous actions, a Gaussian distribution is used. The policy network outputs the mean and the (log) standard deviation of the distribution (assumed to be a `DiagGaussianDistribution`).


class stable_baselines3.common.distributions.BernoulliDistribution(action_dims)[source]

Bernoulli distribution for MultiBinary action spaces.

Parameters:

action_dims (`int`) – Number of binary actions

actions_from_params(action_logits, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Return type:

`Tensor`

Returns:

actions

entropy()[source]

Returns Shannon’s entropy of the probability distribution

Return type:

`Tensor`

Returns:

the entropy, or None if no analytical form is known

log_prob(actions)[source]

Returns the log likelihood

Parameters:

actions (`Tensor`) – the taken actions

Return type:

`Tensor`

Returns:

The log likelihood of the distribution

log_prob_from_params(action_logits)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Return type:

`Tuple`[`Tensor`, `Tensor`]

Returns:

actions and log prob

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Return type:

`Tensor`

Returns:

the deterministic action

proba_distribution(action_logits)[source]

Set parameters of the distribution.

Return type:

`TypeVar`(`SelfBernoulliDistribution`, bound=`BernoulliDistribution`)

Returns:

self

proba_distribution_net(latent_dim)[source]

Create the layer that represents the distribution: it will be the logits of the Bernoulli distribution.

Parameters:

latent_dim (`int`) – Dimension of the last layer of the policy network (before the action layer)

Return type:

`Module`

Returns:

the layer that outputs the action logits

sample()[source]

Returns a sample from the probability distribution

Return type:

`Tensor`

Returns:

the stochastic action

class stable_baselines3.common.distributions.CategoricalDistribution(action_dim)[source]

Categorical distribution for discrete actions.

Parameters:

action_dim (`int`) – Number of discrete actions

actions_from_params(action_logits, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Return type:

`Tensor`

Returns:

actions

entropy()[source]

Returns Shannon’s entropy of the probability distribution

Return type:

`Tensor`

Returns:

the entropy, or None if no analytical form is known

log_prob(actions)[source]

Returns the log likelihood

Parameters:

actions (`Tensor`) – the taken actions

Return type:

`Tensor`

Returns:

The log likelihood of the distribution

log_prob_from_params(action_logits)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Return type:

`Tuple`[`Tensor`, `Tensor`]

Returns:

actions and log prob

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Return type:

`Tensor`

Returns:

the deterministic action

proba_distribution(action_logits)[source]

Set parameters of the distribution.

Return type:

`TypeVar`(`SelfCategoricalDistribution`, bound=`CategoricalDistribution`)

Returns:

self

proba_distribution_net(latent_dim)[source]

Create the layer that represents the distribution: it will be the logits of the Categorical distribution. You can then get probabilities using a softmax.

Parameters:

latent_dim (`int`) – Dimension of the last layer of the policy network (before the action layer)

Return type:

`Module`

Returns:

the layer that outputs the action logits

sample()[source]

Returns a sample from the probability distribution

Return type:

`Tensor`

Returns:

the stochastic action

class stable_baselines3.common.distributions.DiagGaussianDistribution(action_dim)[source]

Gaussian distribution with diagonal covariance matrix, for continuous actions.

Parameters:

action_dim (`int`) – Dimension of the action space.

actions_from_params(mean_actions, log_std, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Return type:

`Tensor`

Returns:

actions

entropy()[source]

Returns Shannon’s entropy of the probability distribution

Return type:

`Tensor`

Returns:

the entropy, or None if no analytical form is known

log_prob(actions)[source]

Get the log probabilities of actions according to the distribution. Note that you must first call the `proba_distribution()` method.

Parameters:

actions (`Tensor`) –

Return type:

`Tensor`

Returns:

the log probabilities of the given actions

log_prob_from_params(mean_actions, log_std)[source]

Sample actions and compute their associated log probabilities, given the distribution parameters.

Parameters:
• mean_actions (`Tensor`) –

• log_std (`Tensor`) –

Return type:

`Tuple`[`Tensor`, `Tensor`]

Returns:

actions and log prob

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Return type:

`Tensor`

Returns:

the deterministic action

proba_distribution(mean_actions, log_std)[source]

Create the distribution given its parameters (mean, std)

Parameters:
• mean_actions (`Tensor`) –

• log_std (`Tensor`) –

Return type:

`TypeVar`(`SelfDiagGaussianDistribution`, bound=`DiagGaussianDistribution`)

Returns:

self

proba_distribution_net(latent_dim, log_std_init=0.0)[source]

Create the layers and parameter that represent the distribution: one output will be the mean of the Gaussian; the other parameter will be the standard deviation (in fact the log std, so that the raw parameter may take negative values)

Parameters:
• latent_dim (`int`) – Dimension of the last layer of the policy (before the action layer)

• log_std_init (`float`) – Initial value for the log standard deviation

Return type:

`Tuple`[`Module`, `Parameter`]

Returns:

the action mean layer and the log std parameter

sample()[source]

Returns a sample from the probability distribution

Return type:

`Tensor`

Returns:

the stochastic action

class stable_baselines3.common.distributions.Distribution[source]

Abstract base class for distributions.

abstract actions_from_params(*args, **kwargs)[source]

Returns samples from the probability distribution given its parameters.

Return type:

`Tensor`

Returns:

actions

abstract entropy()[source]

Returns Shannon’s entropy of the probability distribution

Return type:

`Optional`[`Tensor`]

Returns:

the entropy, or None if no analytical form is known

get_actions(deterministic=False)[source]

Return actions according to the probability distribution.

Parameters:

deterministic (`bool`) – Whether to return the mode of the distribution instead of a sample

Return type:

`Tensor`

Returns:

the actions, sampled from the distribution (or its mode if `deterministic=True`)

abstract log_prob(x)[source]

Returns the log likelihood

Parameters:

x (`Tensor`) – the taken action

Return type:

`Tensor`

Returns:

The log likelihood of the distribution

abstract log_prob_from_params(*args, **kwargs)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Return type:

`Tuple`[`Tensor`, `Tensor`]

Returns:

actions and log prob

abstract mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Return type:

`Tensor`

Returns:

the deterministic action

abstract proba_distribution(*args, **kwargs)[source]

Set parameters of the distribution.

Return type:

`TypeVar`(`SelfDistribution`, bound=`Distribution`)

Returns:

self

abstract proba_distribution_net(*args, **kwargs)[source]

Create the layers and parameters that represent the distribution.

Subclasses must define this, but the arguments and return type vary between concrete classes.

Return type:

`Union`[`Module`, `Tuple`[`Module`, `Parameter`]]

abstract sample()[source]

Returns a sample from the probability distribution

Return type:

`Tensor`

Returns:

the stochastic action

class stable_baselines3.common.distributions.MultiCategoricalDistribution(action_dims)[source]

MultiCategorical distribution for MultiDiscrete action spaces.

Parameters:

action_dims (`List`[`int`]) – List of sizes of discrete action spaces

actions_from_params(action_logits, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Return type:

`Tensor`

Returns:

actions

entropy()[source]

Returns Shannon’s entropy of the probability distribution

Return type:

`Tensor`

Returns:

the entropy, or None if no analytical form is known

log_prob(actions)[source]

Returns the log likelihood

Parameters:

actions (`Tensor`) – the taken actions

Return type:

`Tensor`

Returns:

The log likelihood of the distribution

log_prob_from_params(action_logits)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Return type:

`Tuple`[`Tensor`, `Tensor`]

Returns:

actions and log prob

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Return type:

`Tensor`

Returns:

the deterministic action

proba_distribution(action_logits)[source]

Set parameters of the distribution.

Return type:

`TypeVar`(`SelfMultiCategoricalDistribution`, bound=`MultiCategoricalDistribution`)

Returns:

self

proba_distribution_net(latent_dim)[source]

Create the layer that represents the distribution: it will be the logits (flattened) of the MultiCategorical distribution. You can then get probabilities using a softmax on each sub-space.

Parameters:

latent_dim (`int`) – Dimension of the last layer of the policy network (before the action layer)

Return type:

`Module`

Returns:

the layer that outputs the flattened logits

sample()[source]

Returns a sample from the probability distribution

Return type:

`Tensor`

Returns:

the stochastic action

class stable_baselines3.common.distributions.SquashedDiagGaussianDistribution(action_dim, epsilon=1e-06)[source]

Gaussian distribution with diagonal covariance matrix, followed by a squashing function (tanh) to ensure bounds.

Parameters:
• action_dim (`int`) – Dimension of the action space.

• epsilon (`float`) – small value to avoid NaN due to numerical imprecision.

entropy()[source]

Returns Shannon’s entropy of the probability distribution

Return type:

`Optional`[`Tensor`]

Returns:

the entropy, or None if no analytical form is known

log_prob(actions, gaussian_actions=None)[source]

Get the log probabilities of actions according to the distribution. Note that you must first call the `proba_distribution()` method.

Parameters:

• actions (`Tensor`) – the squashed (tanh) actions

• gaussian_actions (`Optional`[`Tensor`]) – the corresponding pre-squash Gaussian actions, if already available (avoids recomputing the inverse tanh)

Return type:

`Tensor`

Returns:

the log probabilities of the given actions

log_prob_from_params(mean_actions, log_std)[source]

Sample actions and compute their associated log probabilities, given the distribution parameters.

Parameters:
• mean_actions (`Tensor`) –

• log_std (`Tensor`) –

Return type:

`Tuple`[`Tensor`, `Tensor`]

Returns:

actions and log prob

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Return type:

`Tensor`

Returns:

the deterministic action

proba_distribution(mean_actions, log_std)[source]

Create the distribution given its parameters (mean, std)

Parameters:
• mean_actions (`Tensor`) –

• log_std (`Tensor`) –

Return type:

`TypeVar`(`SelfSquashedDiagGaussianDistribution`, bound=`SquashedDiagGaussianDistribution`)

Returns:

self

sample()[source]

Returns a sample from the probability distribution

Return type:

`Tensor`

Returns:

the stochastic action

class stable_baselines3.common.distributions.StateDependentNoiseDistribution(action_dim, full_std=True, use_expln=False, squash_output=False, learn_features=False, epsilon=1e-06)[source]

Distribution class for using generalized State Dependent Exploration (gSDE). Paper: https://arxiv.org/abs/2005.05719

It is used to create the noise exploration matrix and compute the log probability of an action with that noise.

Parameters:
• action_dim (`int`) – Dimension of the action space.

• full_std (`bool`) – Whether to use (n_features x n_actions) parameters for the std instead of only (n_features,)

• use_expln (`bool`) – Use the `expln()` function instead of `exp()` to ensure a positive standard deviation (cf. paper). It keeps the variance above zero and prevents it from growing too fast. In practice, `exp()` is usually enough.

• squash_output (`bool`) – Whether to squash the output using a tanh function; this ensures the bounds are satisfied.

• learn_features (`bool`) – Whether to learn features for gSDE or not. This will enable gradients to be backpropagated through the features `latent_sde` in the code.

• epsilon (`float`) – small value to avoid NaN due to numerical imprecision.

actions_from_params(mean_actions, log_std, latent_sde, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Return type:

`Tensor`

Returns:

actions

entropy()[source]

Returns Shannon’s entropy of the probability distribution

Return type:

`Optional`[`Tensor`]

Returns:

the entropy, or None if no analytical form is known

get_std(log_std)[source]

Get the standard deviation from the learned parameter (log of it by default). This ensures that the std is positive.

Parameters:

log_std (`Tensor`) –

Return type:

`Tensor`

Returns:

the standard deviation

log_prob(actions)[source]

Returns the log likelihood

Parameters:

actions (`Tensor`) – the taken actions

Return type:

`Tensor`

Returns:

The log likelihood of the distribution

log_prob_from_params(mean_actions, log_std, latent_sde)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Return type:

`Tuple`[`Tensor`, `Tensor`]

Returns:

actions and log prob

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Return type:

`Tensor`

Returns:

the deterministic action

proba_distribution(mean_actions, log_std, latent_sde)[source]

Create the distribution given its parameters (mean, std)

Parameters:
• mean_actions (`Tensor`) –

• log_std (`Tensor`) –

• latent_sde (`Tensor`) –

Return type:

`TypeVar`(`SelfStateDependentNoiseDistribution`, bound=`StateDependentNoiseDistribution`)

Returns:

self

proba_distribution_net(latent_dim, log_std_init=-2.0, latent_sde_dim=None)[source]

Create the layers and parameter that represent the distribution: one output will be the deterministic action; the other parameter will be the standard deviation of the distribution that controls the weights of the noise matrix.

Parameters:
• latent_dim (`int`) – Dimension of the last layer of the policy (before the action layer)

• log_std_init (`float`) – Initial value for the log standard deviation

• latent_sde_dim (`Optional`[`int`]) – Dimension of the last layer of the features extractor for gSDE. By default, it is shared with the policy network.

Return type:

`Tuple`[`Module`, `Parameter`]

Returns:

the deterministic action layer and the log std parameter

sample()[source]

Returns a sample from the probability distribution

Return type:

`Tensor`

Returns:

the stochastic action

sample_weights(log_std, batch_size=1)[source]

Sample weights for the noise exploration matrix, using a centered Gaussian distribution.

Parameters:
• log_std (`Tensor`) –

• batch_size (`int`) –

Return type:

`None`

class stable_baselines3.common.distributions.TanhBijector(epsilon=1e-06)[source]

Bijective transformation of a probability distribution using a squashing function (tanh)

Parameters:

epsilon (`float`) – small value to avoid NaN due to numerical imprecision.

static atanh(x)[source]

Inverse of tanh: `0.5 * torch.log((1 + x) / (1 - x))`

Taken from Pyro: https://github.com/pyro-ppl/pyro

Return type:

`Tensor`

static inverse(y)[source]

Inverse tanh.

Parameters:

y (`Tensor`) –

Return type:

`Tensor`

Returns:

the inverse of tanh, applied elementwise to `y`

stable_baselines3.common.distributions.kl_divergence(dist_true, dist_pred)[source]

Wrapper for the PyTorch implementation of the full form KL Divergence

Parameters:
• dist_true (`Distribution`) – the p distribution

• dist_pred (`Distribution`) – the q distribution

Return type:

`Tensor`

Returns:

KL(dist_true||dist_pred)

stable_baselines3.common.distributions.make_proba_distribution(action_space, use_sde=False, dist_kwargs=None)[source]

Return an instance of Distribution for the correct type of action space

Parameters:
• action_space (`Space`) – the input action space

• use_sde (`bool`) – Force the use of StateDependentNoiseDistribution instead of DiagGaussianDistribution

• dist_kwargs (`Optional`[`Dict`[`str`, `Any`]]) – Keyword arguments to pass to the probability distribution

Return type:

`Distribution`

Returns:

the appropriate Distribution object

stable_baselines3.common.distributions.sum_independent_dims(tensor)[source]

Continuous actions are usually considered to be independent, so we can sum components of the `log_prob` or the entropy.

Parameters:

tensor (`Tensor`) – shape: (n_batch, n_actions) or (n_batch,)

Return type:

`Tensor`

Returns:

shape: (n_batch,)