Vectorized Environments¶

Vectorized Environments are a method for stacking multiple independent environments into a single environment. Instead of training an RL agent on 1 environment per step, it allows us to train it on n environments per step. Because of this, actions passed to the environment are now a vector (of dimension n). It is the same for observations, rewards and end of episode signals (dones). In the case of non-array observation spaces such as Dict or Tuple, where different sub-spaces may have different shapes, the sub-observations are vectors (of dimension n).
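As a rough illustration of the idea (a toy sketch, not the library's implementation), the snippet below steps n independent environments with one batched call. `CounterEnv` and `step_batch` are hypothetical names invented for this example.

```python
from typing import List, Tuple


class CounterEnv:
    """Hypothetical toy environment: the state is a counter, the episode ends at 3."""

    def __init__(self) -> None:
        self.state = 0

    def step(self, action: int) -> Tuple[int, float, bool]:
        self.state += action
        done = self.state >= 3
        return self.state, float(action), done


def step_batch(envs: List[CounterEnv], actions: List[int]):
    """Apply actions[i] to envs[i] and return per-environment vectors."""
    assert len(envs) == len(actions), "one action per environment"
    results = [env.step(action) for env, action in zip(envs, actions)]
    observations = [r[0] for r in results]
    rewards = [r[1] for r in results]
    dones = [r[2] for r in results]
    return observations, rewards, dones


envs = [CounterEnv() for _ in range(4)]  # n = 4
obs, rewards, dones = step_batch(envs, [1, 0, 3, 2])
print(obs, rewards, dones)
```

Each returned value has one entry per environment, which is the property the rest of this page relies on.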

Name	`Box`	`Discrete`	`Dict`	`Tuple`	Multi Processing
DummyVecEnv	✔️	✔️	✔️	✔️	❌️
SubprocVecEnv	✔️	✔️	✔️	✔️	✔️

Note

Vectorized environments are required when using wrappers for frame-stacking or normalization.

Note

When using vectorized environments, the environments are automatically reset at the end of each episode. Thus, the observation returned for the i-th environment when done[i] is true will in fact be the first observation of the next episode, not the last observation of the episode that has just terminated. You can access the “real” final observation of the terminated episode—that is, the one that accompanied the done event provided by the underlying environment—using the terminal_observation keys in the info dicts returned by the VecEnv.
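The automatic reset described in this note can be sketched as follows. This is a simplified illustration under stated assumptions, not the library code: `UpCounterEnv` and `step_with_auto_reset` are hypothetical names, and only the terminal_observation key comes from the text above.

```python
from typing import Any, Dict, List, Tuple


class UpCounterEnv:
    """Hypothetical toy env: the observation counts steps; the episode ends at `length`."""

    def __init__(self, length: int) -> None:
        self.length = length
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: int) -> Tuple[int, float, bool, Dict[str, Any]]:
        self.t += 1
        return self.t, 1.0, self.t >= self.length, {}


def step_with_auto_reset(envs: List[UpCounterEnv], actions: List[int]):
    """Step every env; when one finishes, stash its last observation and reset it."""
    obs_batch, reward_batch, done_batch, info_batch = [], [], [], []
    for env, action in zip(envs, actions):
        obs, reward, done, info = env.step(action)
        if done:
            info["terminal_observation"] = obs  # true final observation
            obs = env.reset()  # first observation of the next episode
        obs_batch.append(obs)
        reward_batch.append(reward)
        done_batch.append(done)
        info_batch.append(info)
    return obs_batch, reward_batch, done_batch, info_batch


envs = [UpCounterEnv(length=1), UpCounterEnv(length=3)]
obs, rewards, dones, infos = step_with_auto_reset(envs, [0, 0])
print(obs, dones, infos)
```

Here the first environment finishes on its first step, so its returned observation is already the reset observation, while the final observation is only available through the info dict.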

Warning

When defining a custom VecEnv (for instance, using gym3 ProcgenEnv), you should provide terminal_observation keys in the info dicts returned by the VecEnv (cf. note above).

Warning

When using SubprocVecEnv, users must wrap the code in an if __name__ == "__main__": block if using the forkserver or spawn start method (spawn is the default on Windows). On Linux, the default start method is fork, which is not thread safe and can create deadlocks.

For more information, see Python’s multiprocessing guidelines.

Vectorized Environments Wrappers¶

If you want to alter or augment a VecEnv without redefining it completely (e.g. stack multiple frames, monitor the VecEnv, normalize the observation, …), you can use VecEnvWrapper for that. They are the vectorized equivalents (i.e., they act on multiple environments at the same time) of gym.Wrapper.

You can find below an example for extracting one key from the observation:

import gym
import numpy as np

from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.vec_env.base_vec_env import VecEnv, VecEnvStepReturn, VecEnvWrapper


class VecExtractDictObs(VecEnvWrapper):
    """
    A vectorized wrapper for filtering a specific key from dictionary observations.
    Similar to Gym's FilterObservation wrapper:
        https://github.com/openai/gym/blob/master/gym/wrappers/filter_observation.py

    :param venv: The vectorized environment
    :param key: The key of the dictionary observation
    """

    def __init__(self, venv: VecEnv, key: str):
        self.key = key
        super().__init__(venv=venv, observation_space=venv.observation_space.spaces[self.key])

    def reset(self) -> np.ndarray:
        obs = self.venv.reset()
        return obs[self.key]

    def step_async(self, actions: np.ndarray) -> None:
        self.venv.step_async(actions)

    def step_wait(self) -> VecEnvStepReturn:
        obs, reward, done, info = self.venv.step_wait()
        return obs[self.key], reward, done, info

env = DummyVecEnv([lambda: gym.make("FetchReach-v1")])
# Wrap the VecEnv
env = VecExtractDictObs(env, key="observation")

VecEnv¶

class stable_baselines3.common.vec_env.VecEnv(num_envs, observation_space, action_space)[source]¶

An abstract asynchronous, vectorized environment.

Parameters:

num_envs (int) – the number of environments
observation_space (Space) – the observation space
action_space (Space) – the action space

abstract close()[source]¶

Clean up the environment’s resources.

Return type:: None

abstract env_is_wrapped(wrapper_class, indices=None)[source]¶

Check if environments are wrapped with a given wrapper.

Parameters:

wrapper_class – The wrapper class to check for
indices (Union[None, int, Iterable[int]]) – Indices of envs to check

Return type:

List[bool]

Returns:

True if the env is wrapped, False otherwise, for each env queried.

abstract env_method(method_name, *method_args, indices=None, **method_kwargs)[source]¶

Call instance methods of vectorized environments.

Parameters:

method_name (str) – The name of the environment method to invoke.
indices (Union[None, int, Iterable[int]]) – Indices of envs whose method to call
method_args – Any positional arguments to provide in the call
method_kwargs – Any keyword arguments to provide in the call

Return type:

List[Any]

Returns:

List of items returned by the environment’s method call

abstract get_attr(attr_name, indices=None)[source]¶

Return attribute from vectorized environment.

Parameters:

attr_name (str) – The name of the attribute whose value to return
indices (Union[None, int, Iterable[int]]) – Indices of envs to get attribute from

Return type:

List[Any]

Returns:

List of values of ‘attr_name’ in all environments
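The indices argument shared by env_method, get_attr and set_attr accepts None, a single int, or an iterable of ints. A minimal sketch of how such an argument could be resolved is shown below. It assumes that None means "all environments", and the helper names (`resolve_indices`, `Robot`) are hypothetical and not part of the library.

```python
from typing import Any, Iterable, List, Union

Indices = Union[None, int, Iterable[int]]


def resolve_indices(indices: Indices, num_envs: int) -> List[int]:
    """None -> every environment, int -> that one environment, iterable -> as given."""
    if indices is None:
        return list(range(num_envs))
    if isinstance(indices, int):
        return [indices]
    return list(indices)


class Robot:
    """Hypothetical stand-in for an environment with an attribute and a method."""

    def __init__(self, speed: int) -> None:
        self.speed = speed

    def boost(self, factor: int) -> int:
        return self.speed * factor


def get_attr(envs: List[Any], attr_name: str, indices: Indices = None) -> List[Any]:
    return [getattr(envs[i], attr_name) for i in resolve_indices(indices, len(envs))]


def env_method(envs, method_name, *method_args, indices: Indices = None, **method_kwargs):
    targets = resolve_indices(indices, len(envs))
    return [getattr(envs[i], method_name)(*method_args, **method_kwargs) for i in targets]


envs = [Robot(1), Robot(2), Robot(3)]
print(get_attr(envs, "speed"))
print(get_attr(envs, "speed", indices=1))
print(env_method(envs, "boost", 10, indices=[0, 2]))
```

In every case the result is a list with one item per queried environment, matching the List[Any] return type documented above.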

get_images()[source]¶

Return RGB images from each environment

Return type:: Sequence[ndarray]

getattr_depth_check(name, already_found)[source]¶

Check if an attribute reference is being hidden in a recursive call to __getattr__

Parameters:

name (str) – name of attribute to check for
already_found (bool) – whether this attribute has already been found in a wrapper

Return type:

Optional[str]

Returns:

name of module whose attribute is being shadowed, if any.

render(mode='human')[source]¶

Gym environment rendering

Parameters:: mode (str) – the rendering type
Return type:: Optional[ndarray]

abstract reset()[source]¶

Reset all the environments and return an array of observations, or a tuple of observation arrays.

If step_async is still doing work, that work will be cancelled and step_wait() should not be called until step_async() is invoked again.

Return type:: Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]]
Returns:: observation

abstract seed(seed=None)[source]¶

Sets the random seeds for all environments, based on a given seed. Each individual environment will still get its own seed, by incrementing the given seed.

Parameters:: seed (Optional[int]) – The random seed. May be None for completely random seeding.
Return type:: List[Optional[int]]
Returns:: Returns a list containing the seeds for each individual env. Note that all list elements may be None, if the env does not return anything when being seeded.
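The seeding scheme described above (one base seed, incremented per environment) can be sketched like this. The function name `per_env_seeds` is hypothetical, and drawing a random base seed when seed is None is an assumption made for the example, not a statement about the library's behavior.

```python
import random
from typing import List, Optional


def per_env_seeds(seed: Optional[int], num_envs: int) -> List[int]:
    """Derive one seed per environment by incrementing a base seed."""
    if seed is None:
        # Assumption for this sketch: pick a random base seed.
        seed = random.randrange(2**32)
    return [seed + idx for idx in range(num_envs)]


print(per_env_seeds(42, 4))
```

Giving each environment a distinct seed keeps the copies from producing identical trajectories while the run as a whole stays reproducible from a single number.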

abstract set_attr(attr_name, value, indices=None)[source]¶

Set attribute inside vectorized environments.

Parameters:

attr_name (str) – The name of attribute to assign new value
value (Any) – Value to assign to attr_name
indices (Union[None, int, Iterable[int]]) – Indices of envs to assign value

Return type:

None

step(actions)[source]¶

Step the environments with the given action

Parameters:: actions (ndarray) – the action
Return type:: Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]], ndarray, ndarray, List[Dict]]
Returns:: observation, reward, done, information

abstract step_async(actions)[source]¶

Tell all the environments to start taking a step with the given actions. Call step_wait() to get the results of the step.

You should not call this if a step_async run is already pending.

Return type:: None

abstract step_wait()[source]¶

Wait for the step taken with step_async().

Return type:: Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]], ndarray, ndarray, List[Dict]]
Returns:: observation, reward, done, information
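The relationship between step, step_async and step_wait can be sketched with a toy class. `ToySyncVecEnv` is a hypothetical name; the sketch assumes that step simply calls step_async followed by step_wait, and it raises an error when the calls are made out of order to make the "no pending step" rule explicit.

```python
from typing import Any, Dict, List, Optional, Tuple


class ToySyncVecEnv:
    """Hypothetical minimal class showing the async/wait split."""

    def __init__(self, num_envs: int) -> None:
        self.num_envs = num_envs
        self.states = [0] * num_envs
        self._pending: Optional[List[int]] = None

    def step_async(self, actions: List[int]) -> None:
        if self._pending is not None:
            raise RuntimeError("step_async called while a step is already pending")
        self._pending = list(actions)

    def step_wait(self) -> Tuple[List[int], List[float], List[bool], List[Dict[str, Any]]]:
        if self._pending is None:
            raise RuntimeError("step_wait called without a pending step_async")
        actions, self._pending = self._pending, None
        self.states = [s + a for s, a in zip(self.states, actions)]
        rewards = [float(a) for a in actions]
        dones = [False] * self.num_envs
        infos: List[Dict[str, Any]] = [{} for _ in range(self.num_envs)]
        return list(self.states), rewards, dones, infos

    def step(self, actions: List[int]):
        # Convenience wrapper: start the step, then wait for its result.
        self.step_async(actions)
        return self.step_wait()


env = ToySyncVecEnv(num_envs=2)
obs, rewards, dones, infos = env.step([1, 2])
print(obs, rewards, dones, infos)
```

Splitting the call in two lets an implementation that runs environments in other processes send all actions first and collect results afterwards.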

DummyVecEnv¶

class stable_baselines3.common.vec_env.DummyVecEnv(env_fns)[source]¶

Creates a simple vectorized wrapper for multiple environments, calling each environment in sequence on the current Python process. This is useful for computationally simple environments such as CartPole-v1, where the overhead of multiprocessing or multithreading outweighs the environment computation time. It can also be used with RL methods that require a vectorized environment when you want to train on a single environment.

Parameters:: env_fns (List[Callable[[], Env]]) – a list of functions that return environments to vectorize

close()[source]¶

Clean up the environment’s resources.

Return type:: None

env_is_wrapped(wrapper_class, indices=None)[source]¶

Check if worker environments are wrapped with a given wrapper

Return type:: List[bool]

env_method(method_name, *method_args, indices=None, **method_kwargs)[source]¶

Call instance methods of vectorized environments.

Return type:: List[Any]

get_attr(attr_name, indices=None)[source]¶

Return attribute from vectorized environment (see base class).

Return type:: List[Any]

get_images()[source]¶

Return RGB images from each environment

Return type:: Sequence[ndarray]

render(mode='human')[source]¶

Gym environment rendering. If there are multiple environments then they are tiled together in one image via BaseVecEnv.render(). Otherwise (if self.num_envs == 1), we pass the render call directly to the underlying environment.

Therefore, some arguments such as mode will have values that are valid only when num_envs == 1.

Parameters:: mode (str) – The rendering type.
Return type:: Optional[ndarray]

reset()[source]¶

Reset all the environments and return an array of observations, or a tuple of observation arrays.

If step_async is still doing work, that work will be cancelled and step_wait() should not be called until step_async() is invoked again.

Return type:: Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]]
Returns:: observation

seed(seed=None)[source]¶

Sets the random seeds for all environments, based on a given seed. Each individual environment will still get its own seed, by incrementing the given seed.

Parameters:: seed (Optional[int]) – The random seed. May be None for completely random seeding.
Return type:: List[Optional[int]]
Returns:: Returns a list containing the seeds for each individual env. Note that all list elements may be None, if the env does not return anything when being seeded.

set_attr(attr_name, value, indices=None)[source]¶

Set attribute inside vectorized environments (see base class).

Return type:: None

step_async(actions)[source]¶

Tell all the environments to start taking a step with the given actions. Call step_wait() to get the results of the step.

You should not call this if a step_async run is already pending.

Return type:: None

step_wait()[source]¶

Wait for the step taken with step_async().

Return type:: Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]], ndarray, ndarray, List[Dict]]
Returns:: observation, reward, done, information

SubprocVecEnv¶

class stable_baselines3.common.vec_env.SubprocVecEnv(env_fns, start_method=None)[source]¶

Creates a multiprocess vectorized wrapper for multiple environments, distributing each environment to its own process, allowing significant speed up when the environment is computationally complex.

For performance reasons, if your environment is not IO bound, the number of environments should not exceed the number of logical cores on your CPU.
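One way to apply this rule of thumb is to cap the requested number of environments at the number of logical cores reported by the standard library. The helper name `suggested_num_envs` is hypothetical, and the cap is only a starting point for environments that are not IO bound.

```python
import os


def suggested_num_envs(requested: int) -> int:
    """Cap the number of environments at the number of logical CPU cores."""
    logical_cores = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, logical_cores))


print(suggested_num_envs(64))
```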

Warning

Only ‘forkserver’ and ‘spawn’ start methods are thread-safe, which is important when TensorFlow sessions or other non thread-safe libraries are used in the parent (see issue #217). However, compared to ‘fork’ they incur a small start-up cost and have restrictions on global variables. With those methods, users must wrap the code in an if __name__ == "__main__": block. For more information, see the multiprocessing documentation.

Parameters:

env_fns (List[Callable[[], Env]]) – Environments to run in subprocesses
start_method (Optional[str]) – method used to start the subprocesses. Must be one of the methods returned by multiprocessing.get_all_start_methods(). Defaults to ‘forkserver’ on available platforms, and ‘spawn’ otherwise.

close()[source]¶

Clean up the environment’s resources.

Return type:: None

env_is_wrapped(wrapper_class, indices=None)[source]¶

Check if worker environments are wrapped with a given wrapper

Return type:: List[bool]

env_method(method_name, *method_args, indices=None, **method_kwargs)[source]¶

Call instance methods of vectorized environments.

Return type:: List[Any]

get_attr(attr_name, indices=None)[source]¶

Return attribute from vectorized environment (see base class).

Return type:: List[Any]

get_images()[source]¶

Return RGB images from each environment

Return type:: Sequence[ndarray]

reset()[source]¶

Reset all the environments and return an array of observations, or a tuple of observation arrays.

If step_async is still doing work, that work will be cancelled and step_wait() should not be called until step_async() is invoked again.

Return type:: Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]]
Returns:: observation

seed(seed=None)[source]¶

Sets the random seeds for all environments, based on a given seed. Each individual environment will still get its own seed, by incrementing the given seed.

Parameters:: seed (Optional[int]) – The random seed. May be None for completely random seeding.
Return type:: List[Optional[int]]
Returns:: Returns a list containing the seeds for each individual env. Note that all list elements may be None, if the env does not return anything when being seeded.

set_attr(attr_name, value, indices=None)[source]¶

Set attribute inside vectorized environments (see base class).

Return type:: None

step_async(actions)[source]¶

Tell all the environments to start taking a step with the given actions. Call step_wait() to get the results of the step.

You should not call this if a step_async run is already pending.

Return type:: None

step_wait()[source]¶

Wait for the step taken with step_async().

Return type:: Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]], ndarray, ndarray, List[Dict]]
Returns:: observation, reward, done, information

Wrappers¶

VecFrameStack¶

class stable_baselines3.common.vec_env.VecFrameStack(venv, n_stack, channels_order=None)[source]¶

Frame stacking wrapper for vectorized environment. Designed for image observations.

Uses the StackedObservations class, or StackedDictObservations, depending on the observation space.

Parameters:

venv (VecEnv) – the vectorized environment to wrap
n_stack (int) – Number of frames to stack
channels_order (Union[str, Dict[str, str], None]) – If “first”, stack on first image dimension. If “last”, stack on last dimension. If None, automatically detect channel to stack over in case of image observation or default to “last” (default). Alternatively channels_order can be a dictionary which can be used with environments with Dict observation spaces
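To make the stacking behaviour concrete, here is a simplified pure-Python sketch that stacks along the last dimension. It is not the library's implementation: `ToyFrameStack` is a hypothetical name, frames are flat lists instead of image arrays, and clearing the history with zeros when an episode ends is an assumption of this sketch.

```python
from typing import List

Frame = List[float]


class ToyFrameStack:
    """Keep the last `n_stack` frames of each environment, newest frame last."""

    def __init__(self, num_envs: int, n_stack: int, frame_size: int) -> None:
        self.n_stack = n_stack
        self.frame_size = frame_size
        self.stacks = [[0.0] * (n_stack * frame_size) for _ in range(num_envs)]

    def _empty_history(self) -> List[float]:
        return [0.0] * ((self.n_stack - 1) * self.frame_size)

    def reset(self, observations: List[Frame]) -> List[List[float]]:
        self.stacks = [self._empty_history() + list(frame) for frame in observations]
        return [list(stack) for stack in self.stacks]

    def update(self, observations: List[Frame], dones: List[bool]) -> List[List[float]]:
        for i, (frame, done) in enumerate(zip(observations, dones)):
            if done:
                # The frame belongs to a new episode: forget the old history.
                history = self._empty_history()
            else:
                # Drop the oldest frame and keep the rest.
                history = self.stacks[i][self.frame_size:]
            self.stacks[i] = history + list(frame)
        return [list(stack) for stack in self.stacks]


stacker = ToyFrameStack(num_envs=2, n_stack=3, frame_size=1)
first = stacker.reset([[1.0], [10.0]])
second = stacker.update([[2.0], [20.0]], dones=[False, False])
third = stacker.update([[3.0], [5.0]], dones=[False, True])
print(first, second, third)
```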

close()[source]¶

Clean up the environment’s resources.

Return type:: None

reset()[source]¶

Reset all environments

Return type:: Union[ndarray, Dict[str, ndarray]]

step_wait()[source]¶

Wait for the step taken with step_async().

Return type:: Tuple[Union[ndarray, Dict[str, ndarray]], ndarray, ndarray, List[Dict[str, Any]]]
Returns:: observation, reward, done, information

StackedObservations¶

class stable_baselines3.common.vec_env.stacked_observations.StackedObservations(num_envs, n_stack, observation_space, channels_order=None)[source]¶

Frame stacking wrapper for data.

Dimension to stack over is either first (channels-first) or last (channels-last), which is detected automatically using common.preprocessing.is_image_space_channels_first if observation is an image space.

Parameters:

num_envs (int) – number of environments
n_stack (int) – Number of frames to stack
observation_space (Space) – Environment observation space.
channels_order (Optional[str]) – If “first”, stack on first image dimension. If “last”, stack on last dimension. If None, automatically detect channel to stack over in case of image observation or default to “last” (default).

static compute_stacking(num_envs, n_stack, observation_space, channels_order=None)[source]¶

Calculates the parameters in order to stack observations

Parameters:

num_envs (int) – Number of environments in the stack
n_stack (int) – The number of observations to stack
observation_space (Box) – The observation space
channels_order (Optional[str]) – The order of the channels

Return type:

Tuple[bool, int, ndarray, int]

Returns:

tuple of channels_first, stack_dimension, stackedobs, repeat_axis

reset(observation)[source]¶

Resets the stackedobs, adds the reset observation to the stack, and returns the stack

Parameters:: observation (ndarray) – Reset observation
Return type:: ndarray
Returns:: The stacked reset observation

stack_observation_space(observation_space)[source]¶

Given an observation space, returns a new observation space with stacked observations

Return type:: Box
Returns:: New observation space with stacked dimensions

update(observations, dones, infos)[source]¶

Adds the observations to the stack and uses the dones to update the infos.

Parameters:

observations (ndarray) – numpy array of observations
dones (ndarray) – numpy array of done flags
infos (List[Dict[str, Any]]) – list of info dicts

Return type:

Tuple[ndarray, List[Dict[str, Any]]]

Returns:

tuple of the stacked observations and the updated infos

StackedDictObservations¶

class stable_baselines3.common.vec_env.stacked_observations.StackedDictObservations(num_envs, n_stack, observation_space, channels_order=None)[source]¶

Frame stacking wrapper for dictionary data.

Dimension to stack over is either first (channels-first) or last (channels-last), which is detected automatically using common.preprocessing.is_image_space_channels_first if observation is an image space.

Parameters:

num_envs (int) – number of environments
n_stack (int) – Number of frames to stack
channels_order (Union[str, Dict[str, str], None]) – If “first”, stack on first image dimension. If “last”, stack on last dimension. If None, automatically detect channel to stack over in case of image observation or default to “last” (default).

reset(observation)[source]¶

Resets the stacked observations, adds the reset observation to the stack, and returns the stack

Parameters:: observation (Dict[str, ndarray]) – Reset observation
Return type:: Dict[str, ndarray]
Returns:: Stacked reset observations

stack_observation_space(observation_space)[source]¶

Returns the stacked version of a Dict observation space

Parameters:: observation_space (Dict) – Dict observation space to stack
Return type:: Dict
Returns:: stacked observation space

update(observations, dones, infos)[source]¶

Adds the observations to the stack and uses the dones to update the infos.

Parameters:

observations (Dict[str, ndarray]) – Dict of numpy arrays of observations
dones (ndarray) – numpy array of dones
infos (List[Dict[str, Any]]) – list of info dicts

Return type:

Tuple[Dict[str, ndarray], List[Dict[str, Any]]]

Returns:

tuple of the stacked observations and the updated infos

VecNormalize¶

class stable_baselines3.common.vec_env.VecNormalize(venv, training=True, norm_obs=True, norm_reward=True, clip_obs=10.0, clip_reward=10.0, gamma=0.99, epsilon=1e-08, norm_obs_keys=None)[source]¶

A moving average, normalizing wrapper for vectorized environments. It supports saving and loading the moving average statistics.

Parameters:

venv (VecEnv) – the vectorized environment to wrap
training (bool) – Whether to update or not the moving average
norm_obs (bool) – Whether to normalize observation or not (default: True)
norm_reward (bool) – Whether to normalize rewards or not (default: True)
clip_obs (float) – Max absolute value for observation
clip_reward (float) – Max absolute value for discounted reward
gamma (float) – discount factor
epsilon (float) – To avoid division by zero
norm_obs_keys (Optional[List[str]]) – Which keys from observation dict to normalize. If not specified, all keys will be normalized.
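The role of clip_obs and epsilon can be illustrated with a scalar running-statistics sketch. This is a simplified illustration under stated assumptions, not the library code: `RunningStat` and `normalize` are hypothetical names, and the formula (subtract the running mean, divide by the square root of the running variance plus epsilon, then clip) is the usual way such a wrapper is written.

```python
import math


class RunningStat:
    """Running mean and (population) variance of a stream of scalars (Welford's method)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self) -> float:
        return self.m2 / self.count if self.count > 0 else 1.0


def normalize(value: float, stat: RunningStat, clip: float = 10.0, epsilon: float = 1e-8) -> float:
    """Standardize `value` with the running statistics, then clip to [-clip, clip]."""
    z = (value - stat.mean) / math.sqrt(stat.var + epsilon)
    return max(-clip, min(clip, z))


stat = RunningStat()
for x in (1.0, 2.0, 3.0):
    stat.update(x)  # corresponds to training=True: statistics are updated

print(normalize(2.0, stat), normalize(1000.0, stat))
```

With training=False the statistics would be left untouched and only the `normalize` step applied, which is why the statistics need to be saved alongside a trained model.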

get_original_obs()[source]¶

Returns an unnormalized version of the observations from the most recent step or reset.

Return type:: Union[ndarray, Dict[str, ndarray]]

get_original_reward()[source]¶

Returns an unnormalized version of the rewards from the most recent step.

Return type:: ndarray

static load(load_path, venv)[source]¶

Loads a saved VecNormalize object.

Parameters:

load_path (str) – the path to load from.
venv (VecEnv) – the VecEnv to wrap.

Return type:

VecNormalize

normalize_obs(obs)[source]¶

Normalize observations using this VecNormalize’s observations statistics. Calling this method does not update statistics.

Return type:: Union[ndarray, Dict[str, ndarray]]

normalize_reward(reward)[source]¶

Normalize rewards using this VecNormalize’s rewards statistics. Calling this method does not update statistics.

Return type:: ndarray

reset()[source]¶

Reset all environments

Return type:: Union[ndarray, Dict[str, ndarray]]
Returns:: first observation of the episode

save(save_path)[source]¶

Save current VecNormalize object with all running statistics and settings (e.g. clip_obs)

Parameters:: save_path (str) – The path to save to
Return type:: None

set_venv(venv)[source]¶

Sets the vector environment to wrap to venv.

Also sets attributes derived from this, such as num_envs.

Parameters:: venv (VecEnv) –
Return type:: None

step_wait()[source]¶

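Apply a sequence of actions to a sequence of environments: actions -> (observations, rewards, dones, infos)

where dones is a boolean vector indicating, for each environment, whether an episode has just ended (in which case the returned observation is the first one of a new episode).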

Return type:: Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]], ndarray, ndarray, List[Dict]]

VecVideoRecorder¶

class stable_baselines3.common.vec_env.VecVideoRecorder(venv, video_folder, record_video_trigger, video_length=200, name_prefix='rl-video')[source]¶

Wraps a VecEnv or VecEnvWrapper object to record rendered images as an mp4 video. It requires ffmpeg or avconv to be installed on the machine.

Parameters:

venv (VecEnv) –
video_folder (str) – Where to save videos
record_video_trigger (Callable[[int], bool]) – Function that defines when to start recording. The function takes the current step number and returns whether recording should start.
video_length (int) – Length of recorded videos
name_prefix (str) – Prefix to the video name
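A record_video_trigger is simply a function from the current step number to a boolean. The small factory below is one possible way to build such a function; the name `every_n_steps` is hypothetical and not part of the library.

```python
from typing import Callable


def every_n_steps(n: int) -> Callable[[int], bool]:
    """Return a trigger that fires on step 0 and then every `n` steps."""

    def trigger(step: int) -> bool:
        return step % n == 0

    return trigger


trigger = every_n_steps(1000)
print(trigger(0), trigger(999), trigger(2000))
```

A function built this way matches the Callable[[int], bool] type listed above, so it could be passed as the record_video_trigger argument.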

close()[source]¶

Clean up the environment’s resources.

Return type:: None

reset()[source]¶

Reset all the environments and return an array of observations, or a tuple of observation arrays.

If step_async is still doing work, that work will be cancelled and step_wait() should not be called until step_async() is invoked again.

Return type:: Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]]
Returns:: observation

step_wait()[source]¶

Wait for the step taken with step_async().

Return type:: Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]], ndarray, ndarray, List[Dict]]
Returns:: observation, reward, done, information

VecCheckNan¶

class stable_baselines3.common.vec_env.VecCheckNan(venv, raise_exception=False, warn_once=True, check_inf=True)[source]¶

NaN and inf checking wrapper for vectorized environments. It raises a warning by default, allowing you to find out where the NaN or inf originated.

Parameters:

venv (VecEnv) – the vectorized environment to wrap
raise_exception (bool) – Whether or not to raise a ValueError, instead of a UserWarning
warn_once (bool) – Whether or not to only warn once.
check_inf (bool) – Whether or not to check for +inf or -inf as well
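The kind of check controlled by raise_exception and check_inf can be sketched in pure Python. `check_values` is a hypothetical helper written for this example; it is not the wrapper's actual code and it ignores the warn_once option.

```python
import math
import warnings
from typing import List, Sequence


def check_values(
    name: str,
    values: Sequence[float],
    raise_exception: bool = False,
    check_inf: bool = True,
) -> List[str]:
    """Report NaN (and optionally inf) entries, naming where they were found."""
    problems = []
    for i, value in enumerate(values):
        if math.isnan(value):
            problems.append(f"NaN in {name}[{i}]")
        elif check_inf and math.isinf(value):
            problems.append(f"inf in {name}[{i}]")
    if problems:
        message = "; ".join(problems)
        if raise_exception:
            raise ValueError(message)
        warnings.warn(message, UserWarning)
    return problems


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    found = check_values("observations", [0.5, float("nan"), float("inf")])

print(found, len(caught))
```

Naming the offending array and index in the message is what makes it possible to trace an invalid value back to the actions, observations or rewards that produced it.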

reset()[source]¶

Reset all the environments and return an array of observations, or a tuple of observation arrays.

If step_async is still doing work, that work will be cancelled and step_wait() should not be called until step_async() is invoked again.

Return type:: Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]]
Returns:: observation

step_async(actions)[source]¶

Tell all the environments to start taking a step with the given actions. Call step_wait() to get the results of the step.

You should not call this if a step_async run is already pending.

Return type:: None

step_wait()[source]¶

Wait for the step taken with step_async().

Return type:: Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]], ndarray, ndarray, List[Dict]]
Returns:: observation, reward, done, information

VecTransposeImage¶

class stable_baselines3.common.vec_env.VecTransposeImage(venv, skip=False)[source]¶

Re-order channels, from HxWxC to CxHxW. It is required for PyTorch convolution layers.

Parameters:

venv (VecEnv) –
skip (bool) – Skip this wrapper if needed, as we rely on a heuristic to decide whether to apply it, which may result in unwanted behavior (see GH issue #671).
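The HxWxC to CxHxW re-ordering itself is a plain axis permutation. The sketch below shows it on nested lists so that it runs without any dependency; `hwc_to_chw` is a hypothetical helper, and with NumPy arrays the same result is normally obtained with a transpose of the axes.

```python
from typing import List

Image = List[List[List[int]]]


def hwc_to_chw(image: Image) -> Image:
    """Re-order a height x width x channel image into channel x height x width."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[row][col][ch] for col in range(width)] for row in range(height)]
        for ch in range(channels)
    ]


# A 2x2 image with 3 channels per pixel.
image = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
chw = hwc_to_chw(image)
print(chw)
```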

close()[source]¶

Clean up the environment’s resources.

Return type:: None

reset()[source]¶

Reset all environments

Return type:: Union[ndarray, Dict]

step_wait()[source]¶

Wait for the step taken with step_async().

Return type:: Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]], ndarray, ndarray, List[Dict]]
Returns:: observation, reward, done, information

static transpose_image(image)[source]¶

Transpose an image or batch of images (re-order channels).

Parameters:: image (ndarray) –
Return type:: ndarray

transpose_observations(observations)[source]¶

Transpose (if needed) and return new observations.

Parameters:: observations (Union[ndarray, Dict]) –
Return type:: Union[ndarray, Dict]
Returns:: Transposed observations

static transpose_space(observation_space, key='')[source]¶

Transpose an observation space (re-order channels).

Parameters:

observation_space (Box) –
key (str) – In case of dictionary space, the key of the observation space.

Return type:

Box

VecMonitor¶

class stable_baselines3.common.vec_env.VecMonitor(venv, filename=None, info_keywords=())[source]¶

A vectorized monitor wrapper for vectorized Gym environments. It is used to record the episode reward, length, time and other data.

Some environments like openai/procgen or gym3 directly initialize the vectorized environments, without giving us a chance to use the Monitor wrapper. So this class simply does the job of the Monitor wrapper on a vectorized level.

Parameters:

venv (VecEnv) – The vectorized environment
filename (Optional[str]) – the location to save a log file, can be None for no log
info_keywords (Tuple[str, ...]) – extra information to log, from the information return of env.step()
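Monitoring at the vectorized level amounts to keeping one running return and one length counter per environment. The sketch below illustrates that bookkeeping under stated assumptions: `ToyEpisodeTracker` is a hypothetical name, and storing the summary under an "episode" key with fields "r" (reward), "l" (length) and "t" (elapsed time) is an assumption modelled on the usual Monitor convention, not something this page specifies.

```python
import time
from typing import Any, Dict, List


class ToyEpisodeTracker:
    """Accumulate per-environment episode return and length."""

    def __init__(self, num_envs: int) -> None:
        self.returns = [0.0] * num_envs
        self.lengths = [0] * num_envs
        self.t_start = time.time()

    def update(
        self, rewards: List[float], dones: List[bool], infos: List[Dict[str, Any]]
    ) -> List[Dict[str, Any]]:
        for i, (reward, done) in enumerate(zip(rewards, dones)):
            self.returns[i] += reward
            self.lengths[i] += 1
            if done:
                # Episode finished: publish the summary and start counting afresh.
                infos[i]["episode"] = {
                    "r": self.returns[i],
                    "l": self.lengths[i],
                    "t": round(time.time() - self.t_start, 6),
                }
                self.returns[i] = 0.0
                self.lengths[i] = 0
        return infos


tracker = ToyEpisodeTracker(num_envs=2)
tracker.update([1.0, 2.0], [False, False], [{}, {}])
infos = tracker.update([1.0, 3.0], [False, True], [{}, {}])
print(infos)
```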

close()[source]¶

Clean up the environment’s resources.

Return type:: None

reset()[source]¶

Reset all the environments and return an array of observations, or a tuple of observation arrays.

If step_async is still doing work, that work will be cancelled and step_wait() should not be called until step_async() is invoked again.

Return type:: Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]]
Returns:: observation

step_wait()[source]¶

Wait for the step taken with step_async().

Return type:: Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]], ndarray, ndarray, List[Dict]]
Returns:: observation, reward, done, information

VecExtractDictObs¶

class stable_baselines3.common.vec_env.VecExtractDictObs(venv, key)[source]¶

A vectorized wrapper for extracting dictionary observations.

Parameters:

venv (VecEnv) – The vectorized environment
key (str) – The key of the dictionary observation

reset()[source]¶

Reset all the environments and return an array of observations, or a tuple of observation arrays.

If step_async is still doing work, that work will be cancelled and step_wait() should not be called until step_async() is invoked again.

Return type:: ndarray
Returns:: observation

step_wait()[source]¶

Wait for the step taken with step_async().

Return type:: Tuple[Union[ndarray, Dict[str, ndarray], Tuple[ndarray, ...]], ndarray, ndarray, List[Dict]]
Returns:: observation, reward, done, information