# Probability Distributions

Probability distributions used for the different action spaces:

• `CategoricalDistribution` -> Discrete

• `DiagGaussianDistribution` -> Box (continuous actions)

• `MultiCategoricalDistribution` -> MultiDiscrete

• `BernoulliDistribution` -> MultiBinary

• `StateDependentNoiseDistribution` -> Box (continuous actions) when `use_sde=True`

The policy networks output parameters for the distributions (named `flat` in the methods). Actions are then sampled from those distributions.

For instance, in the case of discrete actions, the policy network outputs the probability of taking each action. The `CategoricalDistribution` allows sampling from it, computing the entropy and the log probability (`log_prob`), and backpropagating the gradient.
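
A minimal sketch of that flow (the layer sizes and inputs below are hypothetical, chosen only for illustration):

```python
import torch as th

from stable_baselines3.common.distributions import CategoricalDistribution

latent_dim, n_actions, batch_size = 64, 4, 2  # hypothetical sizes

dist = CategoricalDistribution(action_dim=n_actions)
# The action net maps policy features to the distribution parameters (logits)
action_net = dist.proba_distribution_net(latent_dim=latent_dim)

latent = th.randn(batch_size, latent_dim)  # stand-in for policy network features
dist.proba_distribution(action_net(latent))

actions = dist.sample()            # stochastic actions, shape (batch_size,)
log_prob = dist.log_prob(actions)  # differentiable, so gradients can flow back
entropy = dist.entropy()
```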

In the case of continuous actions, a Gaussian distribution is used. The policy network outputs the mean and the (log) standard deviation of the distribution (assumed to be a `DiagGaussianDistribution`).
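
The continuous case looks similar; here is a sketch with hypothetical sizes. `proba_distribution_net()` returns both the mean network and the state-independent log std parameter:

```python
import torch as th

from stable_baselines3.common.distributions import DiagGaussianDistribution

latent_dim, action_dim, batch_size = 64, 2, 3  # hypothetical sizes

dist = DiagGaussianDistribution(action_dim=action_dim)
# mean_net outputs the mean; log_std is a learned nn.Parameter (state-independent)
mean_net, log_std = dist.proba_distribution_net(latent_dim=latent_dim, log_std_init=0.0)

mean_actions = mean_net(th.randn(batch_size, latent_dim))
dist.proba_distribution(mean_actions, log_std)

actions = dist.sample()            # shape (batch_size, action_dim)
log_prob = dist.log_prob(actions)  # shape (batch_size,), summed over action dims
```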


class stable_baselines3.common.distributions.BernoulliDistribution(action_dims)[source]

Bernoulli distribution for MultiBinary action spaces.

Parameters:

action_dims (int) – Number of binary actions

actions_from_params(action_logits, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

Parameters:
• action_logits (Tensor) –

• deterministic (bool) –

Return type:

Tensor

entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor

log_prob(actions)[source]

Returns the log likelihood

Parameters:

actions (Tensor) – the taken actions

Returns:

The log likelihood of the actions under the distribution

Return type:

Tensor

log_prob_from_params(action_logits)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

Parameters:

action_logits (Tensor) –

Return type:

Tuple[Tensor, Tensor]

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

proba_distribution(action_logits)[source]

Set parameters of the distribution.

Returns:

self

Parameters:

action_logits (Tensor) –

Return type:

SelfBernoulliDistribution

proba_distribution_net(latent_dim)[source]

Create the layer that represents the distribution: it will be the logits of the Bernoulli distribution.

Parameters:

latent_dim (int) – Dimension of the last layer of the policy network (before the action layer)

Return type:

Module

sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor
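
For illustration, a short usage sketch (sizes are hypothetical): each output logit parameterizes one independent binary sub-action.

```python
import torch as th

from stable_baselines3.common.distributions import BernoulliDistribution

latent_dim, n_binary, batch_size = 32, 3, 2  # hypothetical sizes

dist = BernoulliDistribution(action_dims=n_binary)
action_net = dist.proba_distribution_net(latent_dim=latent_dim)  # Linear: 32 -> 3 logits

logits = action_net(th.randn(batch_size, latent_dim))
actions, log_prob = dist.log_prob_from_params(logits)
# actions: (batch_size, 3) tensor of 0./1. values, one per binary sub-action
```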

class stable_baselines3.common.distributions.CategoricalDistribution(action_dim)[source]

Categorical distribution for discrete actions.

Parameters:

action_dim (int) – Number of discrete actions

actions_from_params(action_logits, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

Parameters:
• action_logits (Tensor) –

• deterministic (bool) –

Return type:

Tensor

entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor

log_prob(actions)[source]

Returns the log likelihood

Parameters:

actions (Tensor) – the taken actions

Returns:

The log likelihood of the actions under the distribution

Return type:

Tensor

log_prob_from_params(action_logits)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

Parameters:

action_logits (Tensor) –

Return type:

Tuple[Tensor, Tensor]

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

proba_distribution(action_logits)[source]

Set parameters of the distribution.

Returns:

self

Parameters:

action_logits (Tensor) –

Return type:

SelfCategoricalDistribution

proba_distribution_net(latent_dim)[source]

Create the layer that represents the distribution: it will be the logits of the Categorical distribution. You can then get probabilities using a softmax.

Parameters:

latent_dim (int) – Dimension of the last layer of the policy network (before the action layer)

Return type:

Module

sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor

class stable_baselines3.common.distributions.DiagGaussianDistribution(action_dim)[source]

Gaussian distribution with diagonal covariance matrix, for continuous actions.

Parameters:

action_dim (int) – Dimension of the action space.

actions_from_params(mean_actions, log_std, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

• deterministic (bool) –

Return type:

Tensor

entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor | None

log_prob(actions)[source]

Get the log probabilities of actions according to the distribution. Note that you must first call the `proba_distribution()` method.

Parameters:

actions (Tensor) –

Returns:

the log probability of the actions

Return type:

Tensor

log_prob_from_params(mean_actions, log_std)[source]

Compute the log probability of taking an action given the distribution parameters.

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

Returns:

actions and the associated log probabilities

Return type:

Tuple[Tensor, Tensor]

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

proba_distribution(mean_actions, log_std)[source]

Create the distribution given its parameters (mean, std)

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

Returns:

self

Return type:

SelfDiagGaussianDistribution

proba_distribution_net(latent_dim, log_std_init=0.0)[source]

Create the layer and parameter that represent the distribution: the layer outputs the mean of the Gaussian, and the parameter holds the standard deviation (stored as log std to allow negative values).

Parameters:
• latent_dim (int) – Dimension of the last layer of the policy (before the action layer)

• log_std_init (float) – Initial value for the log standard deviation

Return type:

Tuple[Module, Parameter]

sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor

class stable_baselines3.common.distributions.Distribution[source]

Abstract base class for distributions.

abstract actions_from_params(*args, **kwargs)[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

Return type:

Tensor

abstract entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor | None

get_actions(deterministic=False)[source]

Return actions according to the probability distribution.

Parameters:

deterministic (bool) –

Return type:

Tensor
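
In other words, `get_actions()` dispatches to `mode()` when `deterministic=True` and to `sample()` otherwise. A small sketch with a concrete subclass (the logits are arbitrary):

```python
import torch as th

from stable_baselines3.common.distributions import CategoricalDistribution

dist = CategoricalDistribution(action_dim=3)
dist.proba_distribution(th.tensor([[2.0, 0.1, 0.1]]))

stochastic = dist.get_actions(deterministic=False)  # same as dist.sample()
greedy = dist.get_actions(deterministic=True)       # same as dist.mode()
assert greedy.item() == 0  # index of the largest logit
```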

abstract log_prob(x)[source]

Returns the log likelihood

Parameters:

x (Tensor) – the taken action

Returns:

The log likelihood of the actions under the distribution

Return type:

Tensor

abstract log_prob_from_params(*args, **kwargs)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

Return type:

Tuple[Tensor, Tensor]

abstract mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

abstract proba_distribution(*args, **kwargs)[source]

Set parameters of the distribution.

Returns:

self

Return type:

SelfDistribution

abstract proba_distribution_net(*args, **kwargs)[source]

Create the layers and parameters that represent the distribution.

Subclasses must define this, but the arguments and return type vary between concrete classes.

Return type:

Module | Tuple[Module, Parameter]

abstract sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor

class stable_baselines3.common.distributions.MultiCategoricalDistribution(action_dims)[source]

MultiCategorical distribution for multi-discrete actions.

Parameters:

action_dims (List[int]) – List of sizes of discrete action spaces

actions_from_params(action_logits, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

Parameters:
• action_logits (Tensor) –

• deterministic (bool) –

Return type:

Tensor

entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor

log_prob(actions)[source]

Returns the log likelihood

Parameters:

actions (Tensor) – the taken actions

Returns:

The log likelihood of the actions under the distribution

Return type:

Tensor

log_prob_from_params(action_logits)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

Parameters:

action_logits (Tensor) –

Return type:

Tuple[Tensor, Tensor]

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

proba_distribution(action_logits)[source]

Set parameters of the distribution.

Returns:

self

Parameters:

action_logits (Tensor) –

Return type:

SelfMultiCategoricalDistribution

proba_distribution_net(latent_dim)[source]

Create the layer that represents the distribution: it will be the logits (flattened) of the MultiCategorical distribution. You can then get probabilities using a softmax on each sub-space.

Parameters:

latent_dim (int) – Dimension of the last layer of the policy network (before the action layer)

Return type:

Module

sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor
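
A sketch of how the flattened logits map to the sub-spaces (sizes are hypothetical):

```python
import torch as th

from stable_baselines3.common.distributions import MultiCategoricalDistribution

latent_dim, batch_size = 32, 2  # hypothetical sizes
action_dims = [3, 2]            # two sub-spaces with 3 and 2 choices

dist = MultiCategoricalDistribution(action_dims=action_dims)
# A single linear layer outputs the flattened logits: 3 + 2 = 5 values
action_net = dist.proba_distribution_net(latent_dim=latent_dim)

flat_logits = action_net(th.randn(batch_size, latent_dim))  # shape (2, 5)
dist.proba_distribution(flat_logits)
actions = dist.sample()  # shape (batch_size, 2): one index per sub-space
```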

class stable_baselines3.common.distributions.SquashedDiagGaussianDistribution(action_dim, epsilon=1e-06)[source]

Gaussian distribution with diagonal covariance matrix, followed by a squashing function (tanh) to ensure bounds.

Parameters:
• action_dim (int) – Dimension of the action space.

• epsilon (float) – small value to avoid NaN due to numerical imprecision.

entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor | None

log_prob(actions, gaussian_actions=None)[source]

Get the log probabilities of actions according to the distribution. Note that you must first call the `proba_distribution()` method.

Parameters:
• actions (Tensor) –

• gaussian_actions (Tensor | None) –

Returns:

the log probability of the actions

Return type:

Tensor

log_prob_from_params(mean_actions, log_std)[source]

Compute the log probability of taking an action given the distribution parameters.

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

Returns:

actions and the associated log probabilities

Return type:

Tuple[Tensor, Tensor]

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

proba_distribution(mean_actions, log_std)[source]

Create the distribution given its parameters (mean, std)

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

Returns:

self

Return type:

SelfSquashedDiagGaussianDistribution

sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor
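
A sketch showing the effect of the squashing (hypothetical sizes); the tanh guarantees that sampled actions stay in (-1, 1), and `log_prob()` includes the corresponding change-of-variable correction:

```python
import torch as th

from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution

latent_dim, action_dim = 16, 2  # hypothetical sizes

dist = SquashedDiagGaussianDistribution(action_dim=action_dim)
mean_net, log_std = dist.proba_distribution_net(latent_dim=latent_dim)

mean_actions = mean_net(th.randn(5, latent_dim))
actions, log_prob = dist.log_prob_from_params(mean_actions, log_std)
assert actions.abs().max() < 1.0  # tanh keeps actions strictly inside (-1, 1)
```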

class stable_baselines3.common.distributions.StateDependentNoiseDistribution(action_dim, full_std=True, use_expln=False, squash_output=False, learn_features=False, epsilon=1e-06)[source]

Distribution class for using generalized State Dependent Exploration (gSDE). Paper: https://arxiv.org/abs/2005.05719

It is used to create the noise exploration matrix and compute the log probability of an action with that noise.

Parameters:
• action_dim (int) – Dimension of the action space.

• full_std (bool) – Whether to use (n_features x n_actions) parameters for the std instead of only (n_features,)

• use_expln (bool) – Use the `expln()` function instead of `exp()` to ensure a positive standard deviation (cf. paper). It keeps the variance above zero and prevents it from growing too fast. In practice, `exp()` is usually enough.

• squash_output (bool) – Whether to squash the output using a tanh function; this ensures that the bounds are satisfied.

• learn_features (bool) – Whether to learn features for gSDE. If so, gradients are backpropagated through the features `latent_sde` in the code.

• epsilon (float) – small value to avoid NaN due to numerical imprecision.

actions_from_params(mean_actions, log_std, latent_sde, deterministic=False)[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

• latent_sde (Tensor) –

• deterministic (bool) –

Return type:

Tensor

entropy()[source]

Returns Shannon’s entropy of the probability distribution.

Returns:

the entropy, or None if no analytical form is known

Return type:

Tensor | None

get_std(log_std)[source]

Get the standard deviation from the learned parameter (log of it by default). This ensures that the std is positive.

Parameters:

log_std (Tensor) –

Returns:

the standard deviation

Return type:

Tensor

log_prob(actions)[source]

Returns the log likelihood

Parameters:

actions (Tensor) – the taken actions

Returns:

The log likelihood of the actions under the distribution

Return type:

Tensor

log_prob_from_params(mean_actions, log_std, latent_sde)[source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

• latent_sde (Tensor) –

Return type:

Tuple[Tensor, Tensor]

mode()[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely action

Return type:

Tensor

proba_distribution(mean_actions, log_std, latent_sde)[source]

Create the distribution given its parameters (mean, std)

Parameters:
• mean_actions (Tensor) –

• log_std (Tensor) –

• latent_sde (Tensor) –

Returns:

self

Return type:

SelfStateDependentNoiseDistribution

proba_distribution_net(latent_dim, log_std_init=-2.0, latent_sde_dim=None)[source]

Create the layer and parameter that represent the distribution: the layer outputs the deterministic action, and the parameter holds the standard deviation of the distribution that controls the weights of the noise matrix.

Parameters:
• latent_dim (int) – Dimension of the last layer of the policy (before the action layer)

• log_std_init (float) – Initial value for the log standard deviation

• latent_sde_dim (int | None) – Dimension of the last layer of the features extractor for gSDE. By default, it is shared with the policy network.

Return type:

Tuple[Module, Parameter]

sample()[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

Return type:

Tensor

sample_weights(log_std, batch_size=1)[source]

Sample weights for the noise exploration matrix, using a centered Gaussian distribution.

Parameters:
• log_std (Tensor) –

• batch_size (int) –

Return type:

None
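
A usage sketch (hypothetical sizes). Note that an exploration matrix is sampled when the net is created, and is only refreshed when `sample_weights()` is called again, e.g. at the start of a rollout:

```python
import torch as th

from stable_baselines3.common.distributions import StateDependentNoiseDistribution

latent_dim, action_dim, batch_size = 16, 2, 4  # hypothetical sizes

dist = StateDependentNoiseDistribution(action_dim=action_dim)
# Also samples an initial exploration (noise) matrix internally
mean_net, log_std = dist.proba_distribution_net(latent_dim=latent_dim)

latent = th.randn(batch_size, latent_dim)
# The noise depends on the features (latent_sde), hence "state-dependent"
dist.proba_distribution(mean_net(latent), log_std, latent_sde=latent)
actions = dist.sample()

dist.sample_weights(log_std, batch_size=1)  # refresh the noise matrix
```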

class stable_baselines3.common.distributions.TanhBijector(epsilon=1e-06)[source]

Bijective transformation of a probability distribution using a squashing function (tanh)

Parameters:

epsilon (float) – small value to avoid NaN due to numerical imprecision.

static atanh(x)[source]

Inverse of tanh. Taken from Pyro: https://github.com/pyro-ppl/pyro

Computes `0.5 * torch.log((1 + x) / (1 - x))`.

Parameters:

x (Tensor) –

Return type:

Tensor

static inverse(y)[source]

Inverse of tanh; the input is clipped away from ±1 to avoid numerical imprecision.

Parameters:

y (Tensor) –

Return type:

Tensor
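
A round-trip sketch; the clipping in `inverse()` avoids infinities at the boundary:

```python
import torch as th

from stable_baselines3.common.distributions import TanhBijector

x = th.tensor([-2.0, 0.0, 2.0])
y = th.tanh(x)
x_back = TanhBijector.inverse(y)  # numerically safe atanh
assert th.allclose(x, x_back, atol=1e-4)
```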

stable_baselines3.common.distributions.kl_divergence(dist_true, dist_pred)[source]

Wrapper for the PyTorch implementation of the full-form KL divergence.

Parameters:
• dist_true (Distribution) – the p distribution

• dist_pred (Distribution) – the q distribution

Returns:

KL(dist_true||dist_pred)

Return type:

Tensor
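
For example, with two categorical distributions over the same three actions (the logits below are arbitrary):

```python
import torch as th

from stable_baselines3.common.distributions import (
    CategoricalDistribution,
    kl_divergence,
)

# proba_distribution() returns self, so construction can be chained
dist_true = CategoricalDistribution(action_dim=3).proba_distribution(
    th.tensor([[1.0, 0.0, 0.0]])
)
dist_pred = CategoricalDistribution(action_dim=3).proba_distribution(
    th.tensor([[0.0, 1.0, 0.0]])
)
kl = kl_divergence(dist_true, dist_pred)  # KL(dist_true || dist_pred), shape (1,)
```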

stable_baselines3.common.distributions.make_proba_distribution(action_space, use_sde=False, dist_kwargs=None)[source]

Return an instance of Distribution for the correct type of action space

Parameters:
• action_space (Space) – the input action space

• use_sde (bool) – Force the use of StateDependentNoiseDistribution instead of DiagGaussianDistribution

• dist_kwargs (Dict[str, Any] | None) – Keyword arguments to pass to the probability distribution

Returns:

the appropriate Distribution object

Return type:

Distribution
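
For instance (assuming a Gymnasium-based version of Stable-Baselines3; older versions use `gym` spaces instead):

```python
import numpy as np
from gymnasium import spaces

from stable_baselines3.common.distributions import (
    CategoricalDistribution,
    DiagGaussianDistribution,
    make_proba_distribution,
)

assert isinstance(make_proba_distribution(spaces.Discrete(4)), CategoricalDistribution)

box = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
assert isinstance(make_proba_distribution(box), DiagGaussianDistribution)
# With use_sde=True, a Box space yields a StateDependentNoiseDistribution instead
```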

stable_baselines3.common.distributions.sum_independent_dims(tensor)[source]

Continuous actions are usually considered to be independent, so we can sum components of the `log_prob` or the entropy.

Parameters:

tensor (Tensor) – shape: (n_batch, n_actions) or (n_batch,)

Returns:

shape: (n_batch,) for (n_batch, n_actions) input, scalar for (n_batch,) input

Return type:

Tensor
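
A quick illustration with arbitrary per-dimension log probabilities:

```python
import torch as th

from stable_baselines3.common.distributions import sum_independent_dims

# Per-dimension log probs for a batch of 4 actions with 2 independent components
log_prob = th.randn(4, 2)
joint = sum_independent_dims(log_prob)  # sums over the action dimensions
assert joint.shape == (4,)
```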