- stable_baselines3.common.evaluation.evaluate_policy(model, env, n_eval_episodes=10, deterministic=True, render=False, callback=None, reward_threshold=None, return_episode_rewards=False, warn=True)
Runs the policy for n_eval_episodes episodes and returns the average reward. If a vector env is passed in, this divides the episodes to evaluate onto the different elements of the vector env. This static division of work is done to remove bias. See https://github.com/DLR-RM/stable-baselines3/issues/402 for more details and discussion.
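For illustration, a minimal sketch of evaluating on a vector env (the env id, number of envs, and training budget are arbitrary choices, not part of this API):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# Four parallel copies of CartPole; make_vec_env wraps each sub-env with Monitor.
vec_env = make_vec_env("CartPole-v1", n_envs=4)
model = PPO("MlpPolicy", vec_env).learn(total_timesteps=10_000)

# The 10 evaluation episodes are statically divided across the 4 sub-envs.
mean_reward, std_reward = evaluate_policy(model, vec_env, n_eval_episodes=10)
```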
If the environment has not been wrapped with a Monitor wrapper, rewards and episode lengths are counted as they appear from env.step calls. If the environment contains wrappers that modify rewards or episode lengths (e.g. reward scaling, early episode reset), these will affect the evaluation results as well. You can avoid this by wrapping the environment with a Monitor wrapper before anything else.
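A minimal sketch of that setup (the Gymnasium import and the CartPole env id are illustrative assumptions):

```python
import gymnasium as gym

from stable_baselines3.common.monitor import Monitor

# Wrap with Monitor first, so evaluation sees the raw rewards and episode
# lengths; any reward-modifying wrappers would then be applied on top of it.
env = Monitor(gym.make("CartPole-v1"))
```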
- Parameters:
  - model (PolicyPredictor) – The RL agent you want to evaluate. This can be any object that implements a predict method, such as an RL algorithm (BaseAlgorithm) or a policy (BasePolicy).
  - env (Union[gym.Env, VecEnv]) – The gym environment or VecEnv environment.
  - n_eval_episodes (int) – Number of episodes to evaluate the agent.
  - deterministic (bool) – Whether to use deterministic or stochastic actions.
  - render (bool) – Whether to render the environment or not.
  - callback (Optional[Callable[[Dict[str, Any], Dict[str, Any]], None]]) – Callback function to do additional checks, called after each step. Gets locals() and globals() passed as parameters.
  - reward_threshold (Optional[float]) – Minimum expected reward per episode; this will raise an error if the performance is not met.
  - return_episode_rewards (bool) – If True, a list of rewards and episode lengths per episode will be returned instead of the mean.
  - warn (bool) – If True (default), warns user about lack of a Monitor wrapper in the evaluation environment.
- Returns:
Mean reward per episode, std of reward per episode. Returns ([float], [int]) when return_episode_rewards is True, the first list containing per-episode rewards and the second containing per-episode lengths (in number of steps).
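A short usage sketch of the two return modes (the env, algorithm, and training budget below are assumptions for illustration):

```python
import gymnasium as gym

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

env = Monitor(gym.make("CartPole-v1"))
model = PPO("MlpPolicy", env).learn(total_timesteps=5_000)

# Default: mean and standard deviation of the per-episode reward.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)

# With return_episode_rewards=True: lists of per-episode rewards and lengths.
episode_rewards, episode_lengths = evaluate_policy(
    model, env, n_eval_episodes=10, return_episode_rewards=True
)
```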