Probability Distributions¶
Probability distributions used for the different action spaces:
- CategoricalDistribution -> Discrete
- DiagGaussianDistribution -> Box (continuous actions)
- StateDependentNoiseDistribution -> Box (continuous actions) when use_sde=True
The policy networks output parameters for the distributions (named flat in the methods).
Actions are then sampled from those distributions.
For instance, in the case of discrete actions, the policy network outputs the probability of taking each action. The CategoricalDistribution allows sampling from it, computing the entropy and the log probability (log_prob), and backpropagating the gradient.
In the case of continuous actions, a Gaussian distribution is used. The policy network outputs the mean and (log) std of the distribution (assumed to be a DiagGaussianDistribution).
- class stable_baselines3.common.distributions.BernoulliDistribution(action_dims)[source]¶
Bernoulli distribution for MultiBinary action spaces.
- Parameters:
action_dims (int) – Number of binary actions
- actions_from_params(action_logits, deterministic=False)[source]¶
Returns samples from the probability distribution given its parameters.
- Return type:
Tensor
- Returns:
actions
- entropy()[source]¶
Returns Shannon’s entropy of the probability distribution
- Return type:
Tensor
- Returns:
the entropy, or None if no analytical form is known
- log_prob(actions)[source]¶
Returns the log likelihood
- Parameters:
actions (Tensor) – the taken actions
- Return type:
Tensor
- Returns:
The log likelihood of the distribution
- log_prob_from_params(action_logits)[source]¶
Returns samples and the associated log probabilities from the probability distribution given its parameters.
- Return type:
Tuple[Tensor, Tensor]
- Returns:
actions and log prob
- mode()[source]¶
Returns the most likely action (deterministic output) from the probability distribution
- Return type:
Tensor
- Returns:
the most likely action
- proba_distribution(action_logits)[source]¶
Set parameters of the distribution.
- Return type:
TypeVar(SelfBernoulliDistribution, bound=BernoulliDistribution)
- Returns:
self
- class stable_baselines3.common.distributions.CategoricalDistribution(action_dim)[source]¶
Categorical distribution for discrete actions.
- Parameters:
action_dim (int) – Number of discrete actions
- actions_from_params(action_logits, deterministic=False)[source]¶
Returns samples from the probability distribution given its parameters.
- Return type:
Tensor
- Returns:
actions
- entropy()[source]¶
Returns Shannon’s entropy of the probability distribution
- Return type:
Tensor
- Returns:
the entropy, or None if no analytical form is known
- log_prob(actions)[source]¶
Returns the log likelihood
- Parameters:
actions (Tensor) – the taken actions
- Return type:
Tensor
- Returns:
The log likelihood of the distribution
- log_prob_from_params(action_logits)[source]¶
Returns samples and the associated log probabilities from the probability distribution given its parameters.
- Return type:
Tuple[Tensor, Tensor]
- Returns:
actions and log prob
- mode()[source]¶
Returns the most likely action (deterministic output) from the probability distribution
- Return type:
Tensor
- Returns:
the most likely action
- proba_distribution(action_logits)[source]¶
Set parameters of the distribution.
- Return type:
TypeVar(SelfCategoricalDistribution, bound=CategoricalDistribution)
- Returns:
self
- proba_distribution_net(latent_dim)[source]¶
Create the layer that represents the distribution: it will be the logits of the Categorical distribution. You can then get probabilities using a softmax.
- Parameters:
latent_dim (int) – Dimension of the last layer of the policy network (before the action layer)
- Return type:
Module
- Returns:
- class stable_baselines3.common.distributions.DiagGaussianDistribution(action_dim)[source]¶
Gaussian distribution with diagonal covariance matrix, for continuous actions.
- Parameters:
action_dim (int) – Dimension of the action space.
- actions_from_params(mean_actions, log_std, deterministic=False)[source]¶
Returns samples from the probability distribution given its parameters.
- Return type:
Tensor
- Returns:
actions
- entropy()[source]¶
Returns Shannon’s entropy of the probability distribution
- Return type:
Tensor
- Returns:
the entropy, or None if no analytical form is known
- log_prob(actions)[source]¶
Get the log probabilities of actions according to the distribution. Note that you must first call the proba_distribution() method.
- Parameters:
actions (Tensor) –
- Return type:
Tensor
- Returns:
- log_prob_from_params(mean_actions, log_std)[source]¶
Compute the log probability of taking an action given the distribution parameters.
- Parameters:
mean_actions (Tensor) –
log_std (Tensor) –
- Return type:
Tuple[Tensor, Tensor]
- Returns:
- mode()[source]¶
Returns the most likely action (deterministic output) from the probability distribution
- Return type:
Tensor
- Returns:
the most likely action
- proba_distribution(mean_actions, log_std)[source]¶
Create the distribution given its parameters (mean, std)
- Parameters:
mean_actions (Tensor) –
log_std (Tensor) –
- Return type:
TypeVar(SelfDiagGaussianDistribution, bound=DiagGaussianDistribution)
- Returns:
- proba_distribution_net(latent_dim, log_std_init=0.0)[source]¶
Create the layers and parameter that represent the distribution: one output will be the mean of the Gaussian, the other parameter will be the standard deviation (log std in fact to allow negative values)
- Parameters:
latent_dim (int) – Dimension of the last layer of the policy (before the action layer)
log_std_init (float) – Initial value for the log standard deviation
- Return type:
Tuple[Module, Parameter]
- Returns:
- class stable_baselines3.common.distributions.Distribution[source]¶
Abstract base class for distributions.
- abstract actions_from_params(*args, **kwargs)[source]¶
Returns samples from the probability distribution given its parameters.
- Return type:
Tensor
- Returns:
actions
- abstract entropy()[source]¶
Returns Shannon’s entropy of the probability distribution
- Return type:
Optional[Tensor]
- Returns:
the entropy, or None if no analytical form is known
- get_actions(deterministic=False)[source]¶
Return actions according to the probability distribution.
- Parameters:
deterministic (bool) –
- Return type:
Tensor
- Returns:
- abstract log_prob(x)[source]¶
Returns the log likelihood
- Parameters:
x (Tensor) – the taken action
- Return type:
Tensor
- Returns:
The log likelihood of the distribution
- abstract log_prob_from_params(*args, **kwargs)[source]¶
Returns samples and the associated log probabilities from the probability distribution given its parameters.
- Return type:
Tuple[Tensor, Tensor]
- Returns:
actions and log prob
- abstract mode()[source]¶
Returns the most likely action (deterministic output) from the probability distribution
- Return type:
Tensor
- Returns:
the most likely action
- abstract proba_distribution(*args, **kwargs)[source]¶
Set parameters of the distribution.
- Return type:
TypeVar(SelfDistribution, bound=Distribution)
- Returns:
self
- class stable_baselines3.common.distributions.MultiCategoricalDistribution(action_dims)[source]¶
MultiCategorical distribution for multi discrete actions.
- Parameters:
action_dims (List[int]) – List of sizes of discrete action spaces
- actions_from_params(action_logits, deterministic=False)[source]¶
Returns samples from the probability distribution given its parameters.
- Return type:
Tensor
- Returns:
actions
- entropy()[source]¶
Returns Shannon’s entropy of the probability distribution
- Return type:
Tensor
- Returns:
the entropy, or None if no analytical form is known
- log_prob(actions)[source]¶
Returns the log likelihood
- Parameters:
actions (Tensor) – the taken actions
- Return type:
Tensor
- Returns:
The log likelihood of the distribution
- log_prob_from_params(action_logits)[source]¶
Returns samples and the associated log probabilities from the probability distribution given its parameters.
- Return type:
Tuple[Tensor, Tensor]
- Returns:
actions and log prob
- mode()[source]¶
Returns the most likely action (deterministic output) from the probability distribution
- Return type:
Tensor
- Returns:
the most likely action
- proba_distribution(action_logits)[source]¶
Set parameters of the distribution.
- Return type:
TypeVar(SelfMultiCategoricalDistribution, bound=MultiCategoricalDistribution)
- Returns:
self
- proba_distribution_net(latent_dim)[source]¶
Create the layer that represents the distribution: it will be the logits (flattened) of the MultiCategorical distribution. You can then get probabilities using a softmax on each sub-space.
- Parameters:
latent_dim (int) – Dimension of the last layer of the policy network (before the action layer)
- Return type:
Module
- Returns:
- class stable_baselines3.common.distributions.SquashedDiagGaussianDistribution(action_dim, epsilon=1e-06)[source]¶
Gaussian distribution with diagonal covariance matrix, followed by a squashing function (tanh) to ensure bounds.
- Parameters:
action_dim (int) – Dimension of the action space.
epsilon (float) – small value to avoid NaN due to numerical imprecision.
- entropy()[source]¶
Returns Shannon’s entropy of the probability distribution
- Return type:
Optional[Tensor]
- Returns:
the entropy, or None if no analytical form is known
- log_prob(actions, gaussian_actions=None)[source]¶
Get the log probabilities of actions according to the distribution. Note that you must first call the proba_distribution() method.
- Parameters:
actions (Tensor) –
- Return type:
Tensor
- Returns:
- log_prob_from_params(mean_actions, log_std)[source]¶
Compute the log probability of taking an action given the distribution parameters.
- Parameters:
mean_actions (Tensor) –
log_std (Tensor) –
- Return type:
Tuple[Tensor, Tensor]
- Returns:
- mode()[source]¶
Returns the most likely action (deterministic output) from the probability distribution
- Return type:
Tensor
- Returns:
the most likely action
- class stable_baselines3.common.distributions.StateDependentNoiseDistribution(action_dim, full_std=True, use_expln=False, squash_output=False, learn_features=False, epsilon=1e-06)[source]¶
Distribution class for using generalized State Dependent Exploration (gSDE). Paper: https://arxiv.org/abs/2005.05719
It is used to create the noise exploration matrix and compute the log probability of an action with that noise.
- Parameters:
action_dim (int) – Dimension of the action space.
full_std (bool) – Whether to use (n_features x n_actions) parameters for the std instead of only (n_features,).
use_expln (bool) – Use expln() instead of exp() to ensure a positive standard deviation (cf. paper). It keeps the variance above zero and prevents it from growing too fast. In practice, exp() is usually enough.
squash_output (bool) – Whether to squash the output using a tanh function; this ensures bounds are satisfied.
learn_features (bool) – Whether to learn features for gSDE or not. This will enable gradients to be backpropagated through the features latent_sde in the code.
epsilon (float) – small value to avoid NaN due to numerical imprecision.
- actions_from_params(mean_actions, log_std, latent_sde, deterministic=False)[source]¶
Returns samples from the probability distribution given its parameters.
- Return type:
Tensor
- Returns:
actions
- entropy()[source]¶
Returns Shannon’s entropy of the probability distribution
- Return type:
Optional[Tensor]
- Returns:
the entropy, or None if no analytical form is known
- get_std(log_std)[source]¶
Get the standard deviation from the learned parameter (log of it by default). This ensures that the std is positive.
- Parameters:
log_std (Tensor) –
- Return type:
Tensor
- Returns:
- log_prob(actions)[source]¶
Returns the log likelihood
- Parameters:
actions (Tensor) – the taken actions
- Return type:
Tensor
- Returns:
The log likelihood of the distribution
- log_prob_from_params(mean_actions, log_std, latent_sde)[source]¶
Returns samples and the associated log probabilities from the probability distribution given its parameters.
- Return type:
Tuple[Tensor, Tensor]
- Returns:
actions and log prob
- mode()[source]¶
Returns the most likely action (deterministic output) from the probability distribution
- Return type:
Tensor
- Returns:
the most likely action
- proba_distribution(mean_actions, log_std, latent_sde)[source]¶
Create the distribution given its parameters (mean, std)
- Parameters:
mean_actions (Tensor) –
log_std (Tensor) –
latent_sde (Tensor) –
- Return type:
TypeVar(SelfStateDependentNoiseDistribution, bound=StateDependentNoiseDistribution)
- Returns:
- proba_distribution_net(latent_dim, log_std_init=-2.0, latent_sde_dim=None)[source]¶
Create the layers and parameter that represent the distribution: one output will be the deterministic action, the other parameter will be the standard deviation of the distribution that controls the weights of the noise matrix.
- Parameters:
latent_dim (int) – Dimension of the last layer of the policy (before the action layer)
log_std_init (float) – Initial value for the log standard deviation
latent_sde_dim (Optional[int]) – Dimension of the last layer of the features extractor for gSDE. By default, it is shared with the policy network.
- Return type:
Tuple[Module, Parameter]
- Returns:
- class stable_baselines3.common.distributions.TanhBijector(epsilon=1e-06)[source]¶
Bijective transformation of a probability distribution using a squashing function (tanh)
- Parameters:
epsilon (float) – small value to avoid NaN due to numerical imprecision.
- static atanh(x)[source]¶
Inverse of tanh: 0.5 * torch.log((1 + x) / (1 - x)). Taken from Pyro: https://github.com/pyro-ppl/pyro
- Return type:
Tensor
- stable_baselines3.common.distributions.kl_divergence(dist_true, dist_pred)[source]¶
Wrapper for the PyTorch implementation of the full form KL Divergence
- Parameters:
dist_true (Distribution) – the p distribution
dist_pred (Distribution) – the q distribution
- Return type:
Tensor
- Returns:
KL(dist_true||dist_pred)
- stable_baselines3.common.distributions.make_proba_distribution(action_space, use_sde=False, dist_kwargs=None)[source]¶
Return an instance of Distribution for the correct type of action space
- Parameters:
action_space (Space) – the input action space
use_sde (bool) – Force the use of StateDependentNoiseDistribution instead of DiagGaussianDistribution
dist_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to the probability distribution
- Return type:
Distribution
- Returns:
the appropriate Distribution object
- stable_baselines3.common.distributions.sum_independent_dims(tensor)[source]¶
Continuous actions are usually considered to be independent, so we can sum the components of the log_prob or the entropy.
- Parameters:
tensor (Tensor) – shape: (n_batch, n_actions) or (n_batch,)
- Return type:
Tensor
- Returns:
shape: (n_batch,)