Environments Guide

This guide introduces environments: the adapters that let an external algorithm interact with a scikit-agent model one step at a time. They are the bridge between your model and the reinforcement-learning tools described in the Algorithms Guide.

Why environments?

A BellmanPeriod describes a model: its shocks, dynamics, controls, and rewards. Many algorithms, though, expect to drive a model interactively — propose an action, see what reward and next state result, and repeat. An environment wraps your model in exactly that interactive loop, so you don’t have to wire up the stepping logic yourself.
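
The interactive loop described above can be sketched without any library at all. The example below is only an illustration: `ToyConsumptionEnv`, its parameters, and its dynamics are hypothetical stand-ins and are not part of scikit-agent. It shows the propose-an-action, observe-reward-and-next-state, repeat pattern that an environment packages up.

```python
import math


class ToyConsumptionEnv:
    """Hypothetical stand-in for an environment; NOT a scikit-agent class."""

    def __init__(self, m0=1.0, gross_return=1.02, income=1.0):
        self.m0 = m0                      # initial market resources (assumed)
        self.gross_return = gross_return  # return on savings (assumed)
        self.income = income              # income received each period (assumed)
        self.m = m0

    def reset(self):
        """Return to the initial state and report it."""
        self.m = self.m0
        return self.m

    def step(self, c):
        """Apply consumption c, then return (reward, next state)."""
        reward = math.log(c)              # log utility, an assumption
        savings = self.m - c
        self.m = self.gross_return * savings + self.income
        return reward, self.m


env = ToyConsumptionEnv()
m = env.reset()
history = []
for _ in range(3):
    c = m / 2                             # propose an action
    reward, next_m = env.step(c)          # see the reward and next state
    history.append((m, c, reward, next_m))
    m = next_m                            # repeat from the new state
```

Each entry of `history` is one transition. The scikit-agent environments described below do this bookkeeping for a real model.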

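scikit-agent provides two environments, depending on who drives the loop: Environment, for stepping a model from your own Python code, and GymEnv, for reinforcement-learning libraries.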

`Environment`: stepping a model in plain Python

Environment advances a model one period at a time and hands back the full transition — the state, the action taken, the reward, the next state, the period’s discount factor, and the observation the policy saw. You supply a decision rule (a {control: callable} dict) and call step:

import numpy as np
from skagent.env import Environment
from skagent.distributions import Uniform
from skagent.models.benchmarks import d2_block, d2_calibration
from skagent.bellman import BellmanPeriod

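# Build the model, then wrap it with an initial-state distribution for "a".
bp = BellmanPeriod(d2_block, "DiscFac", d2_calibration)
env = Environment(bp, {"a": Uniform(low=0.5, high=2.0)}, rng=np.random.default_rng(0))

# Reset, then advance one period under the decision rule c = m / 2.
env.reset()
state, action, reward, next_state, discount, obs = env.step({"c": lambda m: m / 2})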

This is useful whenever you want direct control over the simulation loop — for custom analysis, or for algorithms that consume full transitions.

Scoring a policy with rollouts

A common use of an Environment is to score a decision rule by the total discounted reward it earns over a rollout. The helper discounted_rollout_reward() does this for you:

from skagent.env import discounted_rollout_reward

total = discounted_rollout_reward(
    bp,
    {"c": lambda m: m / 2},  # the decision rule to score
    {"a": Uniform(low=0.5, high=2.0)},  # initial state distribution
    steps=200,
    rng=np.random.default_rng(0),
)

Running this for several policies (and averaging over many rollouts) is a simple way to compare how well different decision rules actually perform.
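
To make the scoring arithmetic concrete, here is a minimal NumPy sketch of a discounted sum and of averaging it over rollouts. It is not the library's implementation: `discounted_return` and `mean_score` are hypothetical helpers, and the weighting convention is an assumption (the reward in period t is weighted by the product of the discount factors of all earlier periods).

```python
import numpy as np


def discounted_return(rewards, discounts):
    """Sum rewards, weighting period t by discounts[0] * ... * discounts[t-1].

    Hypothetical helper; the weighting convention is an assumption.
    """
    rewards = np.asarray(rewards, dtype=float)
    discounts = np.asarray(discounts, dtype=float)
    # The first period is undiscounted; later weights are running products.
    weights = np.concatenate(([1.0], np.cumprod(discounts[:-1])))
    return float(np.sum(weights * rewards))


def mean_score(rollouts):
    """Average the discounted return over (rewards, discounts) pairs."""
    return float(np.mean([discounted_return(r, d) for r, d in rollouts]))


# Two made-up rollouts for one decision rule, each with a 0.5 discount factor.
rollouts = [
    ([1.0, 1.0, 1.0], [0.5, 0.5, 0.5]),  # 1 + 0.5 + 0.25 = 1.75
    ([2.0, 0.0, 0.0], [0.5, 0.5, 0.5]),  # 2 + 0 + 0 = 2.0
]
score = mean_score(rollouts)             # (1.75 + 2.0) / 2 = 1.875
```

Computing `score` for each candidate rule, ideally on the same random seeds, gives a like-for-like comparison.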

`GymEnv`: a gymnasium environment for RL libraries

GymEnv presents your model through the standard gymnasium interface, so reinforcement-learning libraries can train on it directly. It is what skagent.algos.sb3.PPOAgent uses under the hood; most users never need to construct it by hand.

A couple of details worth knowing:

Actions are normalised. The agent works with actions in [-1, 1], and GymEnv automatically rescales them to each control’s real bounds (for example, the borrowing constraint c ≤ m) before applying them. Your model’s bounds are respected without any extra effort.
Single control, single agent. For now, GymEnv handles models with one control variable and one agent.
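
The rescaling of normalised actions can be illustrated with the usual affine map from [-1, 1] to [low, high]. This is a sketch under assumptions: `rescale_action` is a hypothetical helper, and the exact formula GymEnv applies is not stated in this guide.

```python
import numpy as np


def rescale_action(u, low, high):
    """Map a normalised action u in [-1, 1] onto [low, high].

    Hypothetical helper; GymEnv's internal formula may differ.
    """
    u = np.clip(u, -1.0, 1.0)  # guard against out-of-range actions
    return low + 0.5 * (u + 1.0) * (high - low)


# Example: consumption bounded by 0 <= c <= m, with m = 2.0 in this state.
m = 2.0
c_min = rescale_action(-1.0, 0.0, m)  # lower bound
c_mid = rescale_action(0.0, 0.0, m)   # midpoint of the bounds
c_max = rescale_action(1.0, 0.0, m)   # upper bound, c = m
```

Because the upper bound here depends on the current state, the same normalised action maps to a different consumption level as m changes.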

If you are using PPO through PPOAgent, all of this happens for you; see the Algorithms Guide to get started.
