Custom Environments

These environments are meant for testing purposes.

BitFlippingEnv

class stable_baselines3.common.envs.BitFlippingEnv(n_bits=10, continuous=False, max_steps=None, discrete_obs_space=False, image_obs_space=False, channel_first=True)[source]

Simple bit flipping env, useful to test HER. The goal is to flip all the bits to get a vector of ones. In the continuous variant, if the ith action component has a value > 0, then the ith bit will be flipped.
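For illustration, a minimal HER training sketch on this env (assuming the HerReplayBuffer API of recent stable-baselines3 versions; exact imports and keyword arguments vary between versions):

    from stable_baselines3 import DQN, HerReplayBuffer
    from stable_baselines3.common.envs import BitFlippingEnv

    env = BitFlippingEnv(n_bits=10, continuous=False)
    model = DQN(
        "MultiInputPolicy",
        env,
        replay_buffer_class=HerReplayBuffer,
        replay_buffer_kwargs=dict(n_sampled_goal=4, goal_selection_strategy="future"),
        verbose=1,
    )
    model.learn(total_timesteps=10_000)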

Parameters:
  • n_bits (int) – Number of bits to flip

  • continuous (bool) – Whether to use the continuous actions version or not, by default, it uses the discrete one

  • max_steps (Optional[int]) – Max number of steps, by default, equal to n_bits

  • discrete_obs_space (bool) – Whether to use the discrete observation version or not, by default, it uses the MultiBinary one

  • image_obs_space (bool) – Whether to use an image observation instead of the MultiBinary one.

  • channel_first (bool) – Whether to use channel-first or channel-last image observations.

close()[source]

Override close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

Return type:

None

convert_if_needed(state)[source]

Convert to discrete space if needed.

Parameters:

state (ndarray) – The bit vector to convert.

Return type:

Union[int, ndarray]

Returns:

The converted state: a discrete integer or an image if the corresponding observation space is used, otherwise the bit vector unchanged.
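A rough sketch of the discrete conversion (not necessarily the exact implementation): interpret the bit vector as a binary number.

    import numpy as np

    def bits_to_int(state: np.ndarray) -> int:
        # e.g. [1, 0, 1] -> 1*1 + 0*2 + 1*4 = 5
        return int(sum(bit * 2**i for i, bit in enumerate(state)))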

convert_to_bit_vector(state, batch_size)[source]

Convert to bit vector if needed.

Parameters:
  • state (Union[int, ndarray]) – The state to convert back to a bit vector.

  • batch_size (int) – The batch size.

Return type:

ndarray

Returns:

The state converted to a bit vector.
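Conversely, a hypothetical sketch of unpacking a discrete state back into its bit vector:

    import numpy as np

    def int_to_bits(state: int, n_bits: int) -> np.ndarray:
        # Shift and mask to recover bit i, e.g. 5 -> [1, 0, 1] for n_bits=3
        return np.array([(state >> i) & 1 for i in range(n_bits)], dtype=np.int8)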

render(mode='human')[source]

Renders the environment.

The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:

  • human: render to the current display or terminal and return nothing. Usually for human consumption.

  • rgb_array: Return an numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.

  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).

Note:

Make sure that your class's metadata 'render.modes' key includes the list of supported modes. It's recommended to call super() in implementations to use the functionality of this method.

Return type:

Optional[ndarray]

Args:

mode (str): the mode to render with

Example:

    class MyEnv(Env):
        metadata = {'render.modes': ['human', 'rgb_array']}

        def render(self, mode='human'):
            if mode == 'rgb_array':
                return np.array(...)  # return RGB frame suitable for video
            elif mode == 'human':
                ...  # pop up a window and render
            else:
                super(MyEnv, self).render(mode=mode)  # just raise an exception

reset()[source]

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Return type:

Dict[str, Union[ndarray, int]]

Returns:

observation (object): the initial observation.
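For example (the key names follow the goal-env convention used by HER; they are assumed here rather than stated above):

    env = BitFlippingEnv(n_bits=4)
    obs = env.reset()
    # Dict observation: current bits, currently achieved goal, desired goal (all ones)
    print(obs["observation"], obs["achieved_goal"], obs["desired_goal"])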

seed(seed)[source]

Sets the seed for this env’s random number generator(s).

Return type:

None

Note:

Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.

Returns:

list<bigint>: Returns the list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.

step(action)[source]

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Return type:

Tuple[Union[Tuple, Dict[str, Any], ndarray, int], float, bool, Dict]

Args:

action (object): an action provided by the agent

Returns:

  • observation (object): agent’s observation of the current environment

  • reward (float): amount of reward returned after previous action

  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
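A minimal interaction loop illustrating this contract (random actions, for illustration only):

    env = BitFlippingEnv(n_bits=4)
    obs = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # random policy, for illustration
        obs, reward, done, info = env.step(action)
    env.close()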

SimpleMultiObsEnv

class stable_baselines3.common.envs.SimpleMultiObsEnv(num_col=4, num_row=4, random_start=True, discrete_actions=True, channel_last=True)[source]

Base class for GridWorld-based MultiObs environments; by default, a 4x4 grid world:

 ____________
| 0  1  2   3|
| 4|¯5¯¯6¯| 7|
| 8|_9_10_|11|
|12 13  14 15|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯

The agent starts at state 0; states 5, 6, 9, and 10 are blocked; the goal is state 15. The actions are [left, down, right, up].

A simple linear-state env of 16 states (one per cell of the 4x4 grid), encoded with a vector and an image observation: each column is represented by a random vector and each row by a random image, both sampled once at creation time.
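A minimal usage sketch with a multi-input policy (PPO's "MultiInputPolicy" handles the dict observation space):

    from stable_baselines3 import PPO
    from stable_baselines3.common.envs import SimpleMultiObsEnv

    env = SimpleMultiObsEnv(num_col=4, num_row=4, random_start=True)
    model = PPO("MultiInputPolicy", env, verbose=1)
    model.learn(total_timesteps=5_000)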

Parameters:
  • num_col (int) – Number of columns in the grid

  • num_row (int) – Number of rows in the grid

  • random_start (bool) – If true, the agent starts in a random position

  • discrete_actions (bool) – Whether to use discrete or continuous actions

  • channel_last (bool) – If true, the image will be channel-last, else it will be channel-first

get_state_mapping()[source]

Looks up the observation corresponding to the current state.

Return type:

Dict[str, ndarray]

Returns:

observation dict {‘vec’: …, ‘img’: …}

init_possible_transitions()[source]

Initializes the transitions of the environment. The environment exploits the cardinal directions of the grid by noting that they correspond to simple addition and subtraction from the cell id within the grid:

  • up => means moving up a row => means subtracting the length of a column

  • down => means moving down a row => means adding the length of a column

  • left => means moving left by one => means subtracting 1

  • right => means moving right by one => means adding 1

Thus one only needs to specify in which states each action is possible in order to define the transitions of the environment.

Return type:

None
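A hypothetical sketch of this cell-id arithmetic (names are illustrative; num_col is the row width of the row-major grid):

    num_col = 4
    state = 7  # cell in row 1, column 3 of the 4x4 grid above

    up = state - num_col    # 3: one row up
    down = state + num_col  # 11: one row down
    left = state - 1        # 6: one column left
    right = state + 1       # 8: would wrap to the next row, which is why the
                            # states where each action is possible are listed explicitly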

init_state_mapping(num_col, num_row)[source]

Initializes the state_mapping array, which holds the observation values for each state.

Parameters:
  • num_col (int) – Number of columns.

  • num_row (int) – Number of rows.

Return type:

None
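A hypothetical sketch of this initialization, following the class description (one random vector per column, one random image per row; names are illustrative):

    import numpy as np

    num_col, num_row, vec_size = 4, 4, 5
    col_vecs = np.random.random((num_col, vec_size))  # one random vector per column
    row_imgs = np.random.randint(0, 256, (num_row, 64, 64), dtype=np.uint8)  # one random image per row

    # Cell s sits in row s // num_col and column s % num_col
    state_mapping = [
        {"vec": col_vecs[s % num_col], "img": row_imgs[s // num_col]}
        for s in range(num_col * num_row)
    ]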

render(mode='human')[source]

Prints the log of the environment.

Parameters:

mode (str) – The render mode.

Return type:

None

reset()[source]

Resets the environment state and step count and returns reset observation.

Return type:

Dict[str, ndarray]

Returns:

observation dict {‘vec’: …, ‘img’: …}

step(action)[source]

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state. Accepts an action and returns a tuple (observation, reward, done, info).

Parameters:

action (Union[float, ndarray]) – The action taken by the agent.

Return type:

Tuple[Union[Tuple, Dict[str, Any], ndarray, int], float, bool, Dict]

Returns:

tuple (observation, reward, done, info).
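As with the bit flipping env, a short rollout sketch (random actions, for illustration):

    env = SimpleMultiObsEnv()
    obs = env.reset()
    done = False
    while not done:
        obs, reward, done, info = env.step(env.action_space.sample())
        # obs is a dict: {'vec': ..., 'img': ...}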