Custom Environments

These environments were created for testing purposes.

BitFlippingEnv

class stable_baselines3.common.envs.BitFlippingEnv(n_bits=10, continuous=False, max_steps=None, discrete_obs_space=False, image_obs_space=False, channel_first=True)[source]

Simple bit flipping env, useful to test HER. The goal is to flip all the bits to get a vector of ones. In the continuous variant, if the ith action component has a value > 0, then the ith bit will be flipped.

Parameters
  • n_bits (int) – Number of bits to flip

  • continuous (bool) – Whether to use the continuous-action version. By default, the discrete one is used.

  • max_steps (Optional[int]) – Maximum number of steps. Defaults to n_bits.

  • discrete_obs_space (bool) – Whether to use the discrete observation version. By default, the MultiBinary one is used.

  • image_obs_space (bool) – Use an image as input instead of the MultiBinary one.

  • channel_first (bool) – Whether to use a channel-first or channel-last image.
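
The continuous flipping rule described above can be sketched in plain Python. This is an illustration only, not the library's implementation: flip_bits and is_solved are hypothetical helpers, and the state is modelled as a list of 0/1 integers.

```python
from typing import List, Sequence


def flip_bits(state: Sequence[int], action: Sequence[float]) -> List[int]:
    """Flip bit i whenever action[i] > 0 (hypothetical helper, not part of the library)."""
    if len(state) != len(action):
        raise ValueError("state and action must have the same length")
    return [1 - bit if a > 0 else bit for bit, a in zip(state, action)]


def is_solved(state: Sequence[int]) -> bool:
    """The goal is a vector of ones."""
    return all(bit == 1 for bit in state)


state = [0, 1, 0]
state = flip_bits(state, [0.5, -0.2, 0.0])  # only the first component is > 0
print(state, is_solved(state))
```

Note that a component equal to exactly 0 leaves its bit untouched, since the rule requires a value strictly greater than 0.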

close()[source]

Override close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

Return type

None

compute_reward(achieved_goal, desired_goal, _info)[source]

Compute the step reward. This externalizes the reward function and makes it dependent on a desired goal and the one that was achieved. If you wish to include additional rewards that are independent of the goal, you can include the necessary values to derive it in ‘info’ and compute it accordingly.

Args:

achieved_goal (object): the goal that was achieved during execution

desired_goal (object): the desired goal that we asked the agent to attempt to achieve

info (dict): an info dictionary with additional information

Returns:

float: The reward that corresponds to the provided achieved goal with respect to the desired goal. Note that the following should always hold true:

    ob, reward, done, info = env.step()
    assert reward == env.compute_reward(ob['achieved_goal'], ob['desired_goal'], info)

Return type

float32
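
A minimal sketch of a goal-dependent reward that is compatible with the invariant above. It assumes a sparse reward (0.0 when the achieved goal equals the desired goal, -1.0 otherwise). The function name sparse_reward and this reward scheme are assumptions made for illustration, not a statement of how the library computes the value.

```python
from typing import Any, Dict, Optional, Sequence


def sparse_reward(
    achieved_goal: Sequence[int],
    desired_goal: Sequence[int],
    info: Optional[Dict[str, Any]] = None,
) -> float:
    """Assumed sparse reward: 0.0 on success, -1.0 otherwise. `info` is unused here."""
    return 0.0 if list(achieved_goal) == list(desired_goal) else -1.0


# The reward depends only on the two goals, so it can be recomputed later
# (for example with a relabelled goal) without stepping the environment again.
desired = [1, 1, 1]
print(sparse_reward([1, 0, 1], desired))
print(sparse_reward([1, 1, 1], desired))
```

Keeping the reward a pure function of the goals is what lets a method such as HER re-evaluate stored transitions against a different desired goal.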

convert_if_needed(state)[source]

Convert to discrete space if needed.

Parameters

state (ndarray) –

Return type

Union[int, ndarray]

convert_to_bit_vector(state, batch_size)[source]

Convert to bit vector if needed.

Parameters
  • state (Union[int, ndarray]) –

  • batch_size (int) –

Return type

ndarray
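
One way to picture the two conversions is as an encoding between a bit vector and a single integer. The sketch below assumes a little-endian layout (bit i has weight 2**i) and handles one state rather than a batch. Both choices, and the helper names bits_to_int and int_to_bits, are assumptions made for illustration.

```python
from typing import List, Sequence


def bits_to_int(bits: Sequence[int]) -> int:
    """Encode a bit vector as an integer, bit i contributing 2**i (assumed layout)."""
    return sum(int(bit) << i for i, bit in enumerate(bits))


def int_to_bits(value: int, n_bits: int) -> List[int]:
    """Decode an integer back into a bit vector of length n_bits."""
    return [(value >> i) & 1 for i in range(n_bits)]


bits = [1, 0, 1, 1]
encoded = bits_to_int(bits)  # 1 + 4 + 8
decoded = int_to_bits(encoded, len(bits))
print(encoded, decoded)
```

With n_bits bits the integer ranges over 2**n_bits values, which is why a discrete observation space of that size can represent every bit vector.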

render(mode='human')[source]

Renders the environment.

The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:

  • human: render to the current display or terminal and return nothing. Usually for human consumption.

  • rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.

  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).

Note:

Make sure that your class’s metadata ‘render.modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.

Args:

mode (str): the mode to render with

Example:

    import numpy as np
    from gym import Env  # base class assumed to come from gym

    class MyEnv(Env):
        metadata = {'render.modes': ['human', 'rgb_array']}

        def render(self, mode='human'):
            if mode == 'rgb_array':
                return np.array(...)  # return RGB frame suitable for video
            elif mode == 'human':
                ...  # pop up a window and render
            else:
                super(MyEnv, self).render(mode=mode)  # just raise an exception

Return type

Optional[ndarray]

reset()[source]

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns:

observation (object): the initial observation.

Return type

Dict[str, Union[int, ndarray]]

seed(seed)[source]

Sets the seed for this env’s random number generator(s).

Note:

Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.

Returns:

list<bigint>: Returns the list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.

Return type

None

step(action)[source]

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Args:

action (object): an action provided by the agent

Returns:

observation (object): agent’s observation of the current environment

reward (float): amount of reward returned after previous action

done (bool): whether the episode has ended, in which case further step() calls will return undefined results

info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

Tuple[Union[Tuple, Dict[str, Any], ndarray, int], float, bool, Dict]
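
The calling convention described above (step until done is True, then call reset() yourself) can be sketched with a small stand-in environment. CountdownEnv and run_episodes are hypothetical names introduced here; only the reset()/step() shape follows the documented interface.

```python
from typing import Any, Dict, List, Tuple


class CountdownEnv:
    """Hypothetical stand-in exposing the reset()/step() interface described above."""

    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self.current_step = 0

    def reset(self) -> int:
        self.current_step = 0
        return self.current_step

    def step(self, action: int) -> Tuple[int, float, bool, Dict[str, Any]]:
        self.current_step += 1
        done = self.current_step >= self.max_steps
        reward = 0.0 if done else -1.0
        return self.current_step, reward, done, {}


def run_episodes(env: CountdownEnv, n_episodes: int) -> List[float]:
    """Roll out episodes, calling reset() ourselves at the start of each one."""
    returns = []
    for _ in range(n_episodes):
        obs = env.reset()
        done = False
        total = 0.0
        while not done:
            obs, reward, done, info = env.step(0)  # a real agent would pick the action
            total += reward
        returns.append(total)
    return returns


print(run_episodes(CountdownEnv(max_steps=3), n_episodes=2))
```

The loop never calls step() after done is True, which avoids the undefined results mentioned in the Returns description.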

SimpleMultiObsEnv

class stable_baselines3.common.envs.SimpleMultiObsEnv(num_col=4, num_row=4, random_start=True, discrete_actions=True, channel_last=True)[source]

Base class for GridWorld-based MultiObs environments. The default layout is a 4x4 grid world.

 ____________
| 0  1  2   3|
| 4|¯5¯¯6¯| 7|
| 8|_9_10_|11|
|12 13  14 15|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯

start is 0

states 5, 6, 9, and 10 are blocked

goal is 15

actions are [left, down, right, up]

Simple linear state env of 15 states, but encoded with a vector and an image observation: each column is represented by a random vector and each row is represented by a random image, both sampled once at creation time.

Parameters
  • num_col (int) – Number of columns in the grid

  • num_row (int) – Number of rows in the grid

  • random_start (bool) – If true, agent starts in random position

  • channel_last (bool) – If true, the image will be channel last, else it will be channel first

get_state_mapping()[source]

Uses the state to get the observation mapping.

Return type

Dict[str, ndarray]

Returns

observation dict {‘vec’: …, ‘img’: …}

init_possible_transitions()[source]

Initializes the transitions of the environment. The environment exploits the cardinal directions of the grid by noting that they correspond to simple addition and subtraction from the cell id within the grid:

  • up => means moving up a row => means subtracting the number of columns (the length of a row)

  • down => means moving down a row => means adding the number of columns (the length of a row)

  • left => means moving left by one => means subtracting 1

  • right => means moving right by one => means adding 1

Thus one only needs to specify in which states each action is possible in order to define the transitions of the environment.

Return type

None
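
The cell-id arithmetic above can be sketched as follows for the default 4x4 grid with cells 5, 6, 9, and 10 blocked. The names OFFSETS, can_move, and next_state are hypothetical, and leaving the agent in place when a move is not possible is an assumption of this sketch rather than documented behaviour.

```python
from typing import Dict, Set

NUM_COL, NUM_ROW = 4, 4
BLOCKED: Set[int] = {5, 6, 9, 10}

# Offset added to the cell id for each action.
OFFSETS: Dict[str, int] = {"left": -1, "down": NUM_COL, "right": 1, "up": -NUM_COL}


def can_move(state: int, action: str) -> bool:
    """Return True if `action` keeps the agent on the grid and off blocked cells."""
    row, col = divmod(state, NUM_COL)
    if action == "left" and col == 0:
        return False
    if action == "right" and col == NUM_COL - 1:
        return False
    if action == "up" and row == 0:
        return False
    if action == "down" and row == NUM_ROW - 1:
        return False
    return state + OFFSETS[action] not in BLOCKED


def next_state(state: int, action: str) -> int:
    """Apply the offset when the move is allowed, otherwise stay in place (assumed)."""
    return state + OFFSETS[action] if can_move(state, action) else state


print(next_state(0, "down"), next_state(4, "right"), next_state(3, "right"))
```

Here moving down from cell 0 reaches cell 4, while moving right from cell 4 would enter blocked cell 5 and moving right from cell 3 would leave the grid, so both of those leave the state unchanged.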

init_state_mapping(num_col, num_row)[source]

Initializes the state_mapping array, which holds the observation values for each state.

Parameters
  • num_col (int) – Number of columns.

  • num_row (int) – Number of rows.

Return type

None
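
A rough sketch of how such a mapping could be built, following the class description: one random vector per column and one random image per row, sampled once, then combined per cell. The sizes VEC_DIM and IMG_SIZE, the use of nested lists instead of arrays, and the helper name observe are all assumptions made for illustration.

```python
import random
from typing import Any, Dict, List

NUM_COL, NUM_ROW = 4, 4
VEC_DIM, IMG_SIZE = 5, 2  # illustrative sizes, not taken from the library

rng = random.Random(0)

# One random vector per column and one random image per row, sampled once.
col_vecs = [[rng.random() for _ in range(VEC_DIM)] for _ in range(NUM_COL)]
row_imgs = [
    [[rng.randint(0, 255) for _ in range(IMG_SIZE)] for _ in range(IMG_SIZE)]
    for _ in range(NUM_ROW)
]

# state_mapping[state] holds the observation dict for cell id row * NUM_COL + col.
state_mapping: List[Dict[str, Any]] = [
    {"vec": col_vecs[col], "img": row_imgs[row]}
    for row in range(NUM_ROW)
    for col in range(NUM_COL)
]


def observe(state: int) -> Dict[str, Any]:
    """Look up the observation dict {'vec': ..., 'img': ...} for a cell id."""
    return state_mapping[state]


print(observe(5)["vec"] == observe(1)["vec"], observe(4)["img"] == observe(7)["img"])
```

Cells in the same column share the vector and cells in the same row share the image, so a cell is identified only by the combination of both observations.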

render(mode='human')[source]

Prints the log of the environment.

Parameters

mode (str) –

Return type

None

reset()[source]

Resets the environment state and step count, and returns the initial observation.

Return type

Dict[str, ndarray]

Returns

observation dict {‘vec’: …, ‘img’: …}

step(action)[source]

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state. Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (Union[int, float, ndarray]) –

Return type

Tuple[Union[Tuple, Dict[str, Any], ndarray, int], float, bool, Dict]

Returns

tuple (observation, reward, done, info).