Custom Environments
These environments were created for testing purposes.
BitFlippingEnv
- class stable_baselines3.common.envs.BitFlippingEnv(n_bits=10, continuous=False, max_steps=None, discrete_obs_space=False, image_obs_space=False, channel_first=True)
Simple bit flipping env, useful to test HER. The goal is to flip all the bits to get a vector of ones. In the continuous variant, if the ith action component has a value > 0, then the ith bit will be flipped.
- Parameters:
n_bits (int) – Number of bits to flip
continuous (bool) – Whether to use the continuous actions version or not, by default, it uses the discrete one
max_steps (Optional[int]) – Max number of steps, by default, equal to n_bits
discrete_obs_space (bool) – Whether to use the discrete observation version or not, by default, it uses the MultiBinary one
image_obs_space (bool) – Use image as input instead of the MultiBinary one.
channel_first (bool) – Whether to use channel-first or last image.
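The flipping rule described above can be sketched in plain Python. This is only an illustration, not the library's implementation: `flip_bits` and `is_solved` are hypothetical helper names, and the behaviour of the discrete variant (the action is the index of the single bit to flip) is an assumption, since the text above only spells out the continuous rule.

```python
from typing import List, Sequence, Union


def flip_bits(
    state: Sequence[int],
    action: Union[int, Sequence[float]],
    continuous: bool,
) -> List[int]:
    """Return a new bit list after applying one action (hypothetical helper)."""
    new_state = list(state)
    if continuous:
        # Continuous variant: flip bit i whenever the ith action component is > 0.
        for i, component in enumerate(action):
            if component > 0:
                new_state[i] = 1 - new_state[i]
    else:
        # Discrete variant (assumption): the action is the index of the bit to flip.
        new_state[action] = 1 - new_state[action]
    return new_state


def is_solved(state: Sequence[int]) -> bool:
    """The goal is reached when every bit equals one."""
    return all(bit == 1 for bit in state)


state = [0, 1, 0, 1]
# Components 0 and 2 are positive, so bits 0 and 2 are flipped.
state = flip_bits(state, [0.5, -1.0, 0.3, 0.0], continuous=True)
print(state, is_solved(state))
```

Note that a component equal to exactly 0 does not flip its bit, because the rule uses a strict comparison.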
- close()
Override close in your subclass to perform any necessary cleanup.
Environments will automatically close() themselves when garbage collected or when the program exits.
- Return type:
None
- convert_if_needed(state)
Convert to discrete space if needed.
- Parameters:
state (ndarray) –
- Return type:
Union[int, ndarray]
- Returns:
- convert_to_bit_vector(state, batch_size)
Convert to bit vector if needed.
- Parameters:
state (Union[int, ndarray]) –
batch_size (int) –
- Return type:
ndarray
- Returns:
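As a rough illustration of what converting between a bit vector and a discrete observation can look like, the sketch below encodes a list of bits as a single integer and decodes it again. The helper names `bits_to_int` and `int_to_bits` are hypothetical, the least-significant-bit-first ordering is an assumption, and the batch handling and image observation case of the real methods are left out.

```python
from typing import List


def bits_to_int(bits: List[int]) -> int:
    """Encode a bit vector as one integer; bit i contributes bits[i] * 2**i (assumed ordering)."""
    return sum(bit << i for i, bit in enumerate(bits))


def int_to_bits(value: int, n_bits: int) -> List[int]:
    """Decode an integer back into a bit vector of length n_bits (same assumed ordering)."""
    return [(value >> i) & 1 for i in range(n_bits)]


bits = [1, 0, 1, 1]
encoded = bits_to_int(bits)  # 1*1 + 0*2 + 1*4 + 1*8 = 13
decoded = int_to_bits(encoded, n_bits=len(bits))
print(encoded, decoded)
```

With n_bits bits the discrete encoding needs 2**n_bits distinct values, which is why this kind of observation space is only practical for small n_bits.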
- render(mode='human')
Renders the environment.
The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:
human: render to the current display or terminal and return nothing. Usually for human consumption.
rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).
- Return type:
Optional[ndarray]
- Note:
Make sure that your class’s metadata ‘render.modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
- Args:
mode (str): the mode to render with
Example:
    class MyEnv(Env):
        metadata = {'render.modes': ['human', 'rgb_array']}

        def render(self, mode='human'):
            if mode == 'rgb_array':
                return np.array(...)  # return RGB frame suitable for video
            elif mode == 'human':
                ...  # pop up a window and render
            else:
                super(MyEnv, self).render(mode=mode)  # just raise an exception
- reset()
Resets the environment to an initial state and returns an initial observation.
Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.
- Return type:
Dict[str, Union[int, ndarray]]
- Returns:
observation (object): the initial observation.
- seed(seed)
Sets the seed for this env’s random number generator(s).
- Return type:
None
- Note:
Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
- Returns:
list<bigint>: Returns the list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
- step(action)
Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.
Accepts an action and returns a tuple (observation, reward, done, info).
- Return type:
Tuple[Union[Tuple, Dict[str, Any], ndarray, int], float, bool, Dict]
- Args:
action (object): an action provided by the agent
- Returns:
observation (object): agent’s observation of the current environment
reward (float): amount of reward returned after previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
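The reset()/step() contract above can be exercised with a short episode loop. The sketch below uses `ToyBitFlipEnv`, a hypothetical stand-in written only for this illustration so that it runs without the library. Its sparse reward (0 when solved, -1 otherwise), its plain list observation, and its empty info dict are assumptions and are not taken from BitFlippingEnv itself.

```python
import random
from typing import Dict, List, Tuple


class ToyBitFlipEnv:
    """Hypothetical stand-in that follows the reset()/step() contract described above."""

    def __init__(self, n_bits: int = 4, seed: int = 0) -> None:
        self.n_bits = n_bits
        self.max_steps = n_bits  # mirrors the documented default for max_steps
        self.rng = random.Random(seed)
        self.state: List[int] = [0] * n_bits
        self.current_step = 0

    def reset(self) -> List[int]:
        # Start a new, independent episode with freshly sampled bits.
        self.current_step = 0
        self.state = [self.rng.randint(0, 1) for _ in range(self.n_bits)]
        return list(self.state)

    def step(self, action: int) -> Tuple[List[int], float, bool, Dict]:
        # Discrete action (assumption): index of the bit to flip.
        self.state[action] = 1 - self.state[action]
        self.current_step += 1
        solved = all(self.state)
        reward = 0.0 if solved else -1.0  # assumed sparse reward
        done = solved or self.current_step >= self.max_steps
        return list(self.state), reward, done, {}


def run_episode(env: ToyBitFlipEnv) -> Tuple[float, int]:
    """Run one episode until done and return (total reward, number of steps)."""
    obs = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        # Simple policy: flip the first bit that is still zero.
        action = obs.index(0) if 0 in obs else 0
        obs, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


total_reward, steps = run_episode(ToyBitFlipEnv(n_bits=4, seed=0))
print(total_reward, steps)
```

Once done is True the loop stops, and the caller is expected to call reset() before stepping again, as the description of step() requires.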
SimpleMultiObsEnv
- class stable_baselines3.common.envs.SimpleMultiObsEnv(num_col=4, num_row=4, random_start=True, discrete_actions=True, channel_last=True)
Base class for GridWorld-based MultiObs Environments 4x4 grid world.
     ____________
    | 0  1  2   3|
    | 4|¯5¯¯6¯| 7|
    | 8|_9_10_|11|
    |12 13  14 15|
    ¯¯¯¯¯¯¯¯¯¯¯¯¯¯
start is 0
states 5, 6, 9, and 10 are blocked
goal is 15
actions are = [left, down, right, up]
simple linear state env of 15 states but encoded with a vector and an image observation: each column is represented by a random vector and each row is represented by a random image, both sampled once at creation time.
- Parameters:
num_col (int) – Number of columns in the grid
num_row (int) – Number of rows in the grid
random_start (bool) – If true, agent starts in random position
channel_last (bool) – If true, the image will be channel last, else it will be channel first
- get_state_mapping()
Uses the state to get the observation mapping.
- Return type:
Dict[str, ndarray]
- Returns:
observation dict {‘vec’: …, ‘img’: …}
- init_possible_transitions()
Initializes the transitions of the environment. The environment exploits the cardinal directions of the grid by noting that they correspond to simple addition and subtraction from the cell id within the grid:
up => means moving up a row => means subtracting the length of a row (the number of columns)
down => means moving down a row => means adding the length of a row (the number of columns)
left => means moving left by one => means subtracting 1
right => means moving right by one => means adding 1
Thus one only needs to specify in which states each action is possible in order to define the transitions of the environment.
- Return type:
None
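The cell id arithmetic can be sketched as follows for the 4x4 grid shown earlier, with cells numbered row by row and cells 5, 6, 9, and 10 blocked. The names `OFFSETS`, `possible_states`, and `move` are hypothetical and chosen for this illustration. The sketch derives the allowed states from the grid geometry, which may differ from how the environment itself stores them.

```python
from typing import Set

NUM_COL, NUM_ROW = 4, 4
BLOCKED = {5, 6, 9, 10}

# Offset added to the cell id for each action, in the documented action order.
# Moving one row up or down changes the id by the number of columns.
OFFSETS = {"left": -1, "down": NUM_COL, "right": 1, "up": -NUM_COL}


def possible_states(action: str) -> Set[int]:
    """Cells from which `action` stays on the grid and does not enter a blocked cell."""
    allowed = set()
    for cell in range(NUM_COL * NUM_ROW):
        if cell in BLOCKED:
            continue
        row, col = divmod(cell, NUM_COL)
        on_grid = {
            "left": col > 0,
            "right": col < NUM_COL - 1,
            "up": row > 0,
            "down": row < NUM_ROW - 1,
        }[action]
        if on_grid and (cell + OFFSETS[action]) not in BLOCKED:
            allowed.add(cell)
    return allowed


def move(state: int, action: str) -> int:
    """Apply the offset if the action is possible in this state, otherwise stay put."""
    if state in possible_states(action):
        return state + OFFSETS[action]
    return state


print(sorted(possible_states("down")))
print(move(0, "right"), move(1, "down"), move(11, "down"))
```

Because each action is just an offset, the whole transition function reduces to four sets of states plus four integers.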
- init_state_mapping(num_col, num_row)
Initializes the state_mapping array which holds the observation values for each state
- Parameters:
num_col (int) – Number of columns.
num_row (int) – Number of rows.
- Return type:
None
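One way to picture the state mapping is a list with one entry per cell, where cells in the same column share a vector and cells in the same row share an image. The sketch below is a plain-Python approximation using nested lists. `build_state_mapping`, `vector_size`, `img_size`, and `seed` are hypothetical names, the tiny sizes are chosen for readability, and the row-by-row ordering of the entries is an assumption.

```python
import random
from typing import Dict, List


def build_state_mapping(
    num_col: int,
    num_row: int,
    vector_size: int = 3,
    img_size: int = 2,
    seed: int = 0,
) -> List[Dict[str, list]]:
    """Build one {'vec': ..., 'img': ...} observation per cell (hypothetical helper)."""
    rng = random.Random(seed)
    # One random vector per column, sampled once.
    col_vecs = [[rng.random() for _ in range(vector_size)] for _ in range(num_col)]
    # One random single-channel image per row, sampled once.
    row_imgs = [
        [[rng.randint(0, 255) for _ in range(img_size)] for _ in range(img_size)]
        for _ in range(num_row)
    ]
    mapping = []
    for row in range(num_row):
        for col in range(num_col):
            # Assumed ordering: the entry index equals row * num_col + col.
            mapping.append({"vec": col_vecs[col], "img": row_imgs[row]})
    return mapping


mapping = build_state_mapping(num_col=4, num_row=4)
# Cells 0 and 4 are in the same column, so they share the same vector.
print(mapping[0]["vec"] is mapping[4]["vec"])
# Cells 0 and 1 are in the same row, so they share the same image.
print(mapping[0]["img"] is mapping[1]["img"])
```

Neither part of the observation identifies a cell on its own, so an agent has to combine the vector and the image to recover its position.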
- render(mode='human')
Prints the log of the environment.
- Parameters:
mode (str) –
- Return type:
None
- reset()
Resets the environment state and step count and returns reset observation.
- Return type:
Dict[str, ndarray]
- Returns:
observation dict {‘vec’: …, ‘img’: …}
- step(action)
Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state. Accepts an action and returns a tuple (observation, reward, done, info).
- Parameters:
action (Union[float, ndarray]) –
- Return type:
Tuple[Union[Tuple, Dict[str, Any], ndarray, int], float, bool, Dict]
- Returns:
tuple (observation, reward, done, info).