Evaluation Helper

stable_baselines3.common.evaluation.evaluate_policy(model, env, n_eval_episodes=10, deterministic=True, render=False, callback=None, reward_threshold=None, return_episode_rewards=False, warn=True)

Runs the policy for n_eval_episodes episodes and returns the average return per episode (sum of undiscounted rewards). If a vector env is passed in, the episodes to evaluate are divided among the sub-environments of the vector env. This static division of work is done to remove bias. See https://github.com/DLR-RM/stable-baselines3/issues/402 for more details and discussion.
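The static division described above can be sketched in a few lines. This is an illustration only, not the library's code: split_episodes is a hypothetical helper, and the integer-division formula is an assumption about how a fixed quota per sub-environment might be computed.

```python
def split_episodes(n_eval_episodes: int, n_envs: int) -> list[int]:
    """Return a fixed episode quota for each sub-environment.

    Hypothetical helper: the quotas differ by at most one and always
    sum to n_eval_episodes.
    """
    return [(n_eval_episodes + i) // n_envs for i in range(n_envs)]


targets = split_episodes(10, 4)
print(targets)  # [2, 2, 3, 3]
```

Fixing the quotas up front means a sub-environment that happens to produce short episodes cannot contribute more than its share, which is the kind of bias the linked issue discusses.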

Note

If the environment has not been wrapped with a Monitor wrapper, rewards and episode lengths are counted as they appear in env.step calls. If the environment contains wrappers that modify rewards or episode lengths (e.g. reward scaling, early episode reset), these will affect the evaluation results as well. You can avoid this by wrapping the environment with a Monitor wrapper before anything else.

Parameters:
  • model (PolicyPredictor) – The RL agent you want to evaluate. This can be any object that implements a predict method, such as an RL algorithm (BaseAlgorithm) or policy (BasePolicy).

  • env (Env | VecEnv) – The gym environment or VecEnv environment.

  • n_eval_episodes (int) – Number of episodes to evaluate the agent

  • deterministic (bool) – Whether to use deterministic or stochastic actions

  • render (bool) – Whether to render the environment or not

  • callback (Callable[[dict[str, Any], dict[str, Any]], None] | None) – callback function to perform additional checks, called n_envs times after each step. Gets locals() and globals() passed as parameters. See https://github.com/DLR-RM/stable-baselines3/issues/1912 for more details.

  • reward_threshold (float | None) – Minimum expected reward per episode; an error is raised if this performance is not met

  • return_episode_rewards (bool) – If True, a list of rewards and episode lengths per episode will be returned instead of the mean.

  • warn (bool) – If True (default), warns user about lack of a Monitor wrapper in the evaluation environment.
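The callback parameter can be illustrated with a small callable that matches the documented signature. StepCounter is a hypothetical class written for this sketch, and the "done" key is an assumption about a local variable name inside the function; the code reads it with .get() so it still works if that name is absent.

```python
from typing import Any


class StepCounter:
    """Hypothetical callback that counts calls and finished episodes."""

    def __init__(self) -> None:
        self.n_calls = 0
        self.n_done = 0

    def __call__(self, locals_: dict[str, Any], globals_: dict[str, Any]) -> None:
        self.n_calls += 1
        # "done" is an assumed key in the locals() dict passed by the caller.
        if locals_.get("done"):
            self.n_done += 1


counter = StepCounter()
# Two hand-made invocations stand in for the calls made during evaluation.
counter({"done": False}, {})
counter({"done": True}, {})
print(counter.n_calls, counter.n_done)  # 2 1
```

In real use the object would be passed as callback=counter, and its counters inspected after the evaluation finishes.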

Returns:

Mean return per episode (sum of rewards) and standard deviation of the return per episode. Returns (list[float], list[int]) when return_episode_rewards is True, the first list containing per-episode returns and the second containing per-episode lengths (in number of steps).

Return type:

tuple[float, float] | tuple[list[float], list[int]]
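The two return shapes, and the role of reward_threshold, can be sketched with the standard library alone. summarize is a hypothetical helper, not the library's implementation; the population standard deviation and the ValueError are choices made for this sketch, since the text above does not state which estimator or exception type is used.

```python
from statistics import fmean, pstdev


def summarize(episode_rewards, episode_lengths,
              reward_threshold=None, return_episode_rewards=False):
    """Hypothetical helper mirroring the documented return behaviour."""
    mean_reward = fmean(episode_rewards)
    std_reward = pstdev(episode_rewards)  # population std (an assumption)
    if reward_threshold is not None and mean_reward <= reward_threshold:
        # Exception type chosen for this sketch only.
        raise ValueError(f"mean reward {mean_reward:.2f} is below the threshold")
    if return_episode_rewards:
        return episode_rewards, episode_lengths
    return mean_reward, std_reward


rewards, lengths = [10.0, 20.0, 30.0], [10, 20, 30]
mean_reward, std_reward = summarize(rewards, lengths)
per_episode = summarize(rewards, lengths, return_episode_rewards=True)
print(mean_reward, per_episode)
```

Callers that need more than the mean, such as a median or a histogram of returns, would use the per-episode form and aggregate the lists themselves.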