Environments

This section documents the environment adapters that step a BellmanPeriod one transition at a time, for use by reinforcement-learning algorithms.

Two interfaces are provided:

  • Environment — a plain Python environment that returns the full transition (state, action, reward, next_state, discount, obs) as dicts keyed by symbol, suitable for off-policy RL and for rollouts.

  • GymEnv — a gymnasium adapter so Stable-Baselines3 algorithms (PPO, SAC, TD3, …) can drive a BellmanPeriod directly. It is the backend used by skagent.algos.sb3.PPOAgent (see Algorithms).

Environment adapters for BellmanPeriod models.

Two interfaces are provided:

  • Environment — plain Python environment that steps a BellmanPeriod by drawing shocks and applying a decision rule. Returns the full transition (state, action, reward, next_state, discount, obs) as dicts keyed by symbol, suitable for off-policy RL algorithms that consume full transitions.

  • GymEnv — gymnasium adapter wrapping a BellmanPeriod so Stable-Baselines3 algorithms (PPO, SAC, TD3, …) can drive it directly. The action space is normalised to [-1, 1]; the env unscales each action to the control’s per-state bounds (taken from Control.lower_bound / Control.upper_bound) before applying it.

class skagent.env.Environment(bp, initial, rng=None)

Step a BellmanPeriod one transition at a time.

Single agent. step() returns dicts of torch tensors keyed by symbol, so downstream code can index by name.

Parameters:
  • bp (BellmanPeriod) – Model definition (block dynamics, calibration, discount variable).

  • initial (dict) – Maps arrival-state symbols to skagent Distribution objects, used to sample fresh initial states each reset().

  • rng (Generator | None) – RNG used for shock and initial-state draws.

step(decision_rule)

Advance one period.

Parameters:

decision_rule (dict) – Maps control symbol to a callable on the control’s information set.

Returns:

(state_t, action, reward, state_t_plus_1, discount, obs). obs is the information set seen by the policy.

Return type:

tuple
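As a rough illustration of consuming the 6-tuple, the sketch below collects transitions into a list of dicts. It assumes only the step(decision_rule) signature and return order documented above. StubEnv is a hypothetical stand-in used so the example runs without skagent: it returns plain floats rather than torch tensors, and calling the decision rule with a single positional value is an assumption, not the library's confirmed calling convention.

```python
def collect_transitions(env, decision_rule, n_steps):
    """Step `env` n_steps times and return the transitions as a list of dicts."""
    fields = ("state", "action", "reward", "next_state", "discount", "obs")
    transitions = []
    for _ in range(n_steps):
        # step() returns (state_t, action, reward, state_t_plus_1, discount, obs).
        transitions.append(dict(zip(fields, env.step(decision_rule))))
    return transitions


class StubEnv:
    """Hypothetical stand-in for skagent.env.Environment (illustration only)."""

    def __init__(self):
        self.m = 1.0  # toy resource level

    def step(self, decision_rule):
        state = {"m": self.m}
        obs = dict(state)  # here the information set is just the state
        # Assumption: the rule is called with the iset value as a positional argument.
        action = {"c": decision_rule["c"](obs["m"])}
        reward = action["c"]
        self.m = self.m - action["c"] + 1.0  # toy dynamics with unit income
        return state, action, reward, {"m": self.m}, 0.96, obs


transitions = collect_transitions(StubEnv(), {"c": lambda m: 0.5 * m}, n_steps=3)
```

Keeping each transition as a dict of named fields makes it straightforward to push into a replay buffer for an off-policy algorithm.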

class skagent.env.GymEnv(bp, initial, max_episode_steps=200, *, control_sym=None, default_lower=0.0, default_upper=1.0, bound_clearance=0.001, seed=None)

gymnasium adapter for a single-agent, single-control BellmanPeriod.

Designed to be driven by Stable Baselines3 (PPO, SAC, TD3, …).

Action space

Box(-1, 1, shape=(1,)), normalised. Each action is unscaled to the control’s per-state bounds via

a_real = lo + (a_norm + 1) / 2 * (hi - lo)

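where lo and hi are evaluated at the current pre-decision state. Missing bounds fall back to default_lower / default_upper. This is the transform with bound_clearance = 0; the exact transform, which also pulls the range in from each bound, is given under unscale_action().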
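The bound resolution and the simplified formula above can be sketched as follows. This is not the library's code: resolve_bounds and unscale_simple are hypothetical helper names, and representing each bound as an optional callable on a dict of state values is an assumption about how Control.lower_bound / Control.upper_bound behave.

```python
def resolve_bounds(state, lower_bound=None, upper_bound=None,
                   default_lower=0.0, default_upper=1.0):
    """Evaluate per-state bounds, falling back to the defaults when one is missing."""
    lo = lower_bound(state) if lower_bound is not None else default_lower
    hi = upper_bound(state) if upper_bound is not None else default_upper
    return float(lo), float(hi)


def unscale_simple(a_norm, lo, hi):
    """Map a_norm in [-1, 1] linearly onto [lo, hi] (no bound clearance)."""
    return lo + (a_norm + 1.0) / 2.0 * (hi - lo)


# Example: consumption bounded above by market resources m, no lower bound given.
lo, hi = resolve_bounds({"m": 2.0}, upper_bound=lambda s: s["m"])
midpoint = unscale_simple(0.0, lo, hi)  # a_norm = 0 maps to the middle of [lo, hi]
```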

Observation space

Box(-inf, inf, shape=(|iset|,)) over the control’s information set (Control.iset), in the iset’s declared order.
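To make the ordering convention concrete, here is a minimal sketch of packing named values into an observation vector in the iset's declared order. pack_observation is a hypothetical helper, and the float32 dtype is an assumption (it is the usual default for gymnasium Box spaces), not something this page states.

```python
import numpy as np


def pack_observation(values, iset):
    """Return the iset values as a 1-D array, in the iset's declared order."""
    return np.array([float(values[sym]) for sym in iset], dtype=np.float32)


# Symbols outside the information set (here "a") are simply not observed.
obs = pack_observation({"a": 1.0, "m": 2.0, "p": 3.0}, iset=["m", "p"])
```

A trained policy sees only this vector, so any code that rebuilds observations by hand has to use the same order.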

Episode timing

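terminated is always False (there is no native end-of-life signal yet); truncated is set to True when max_episode_steps is reached. Because episodes end by truncation rather than termination, PPO can still bootstrap from the value of the final state, provided the wrapper sets the truncated flag, which GymEnv does.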
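The distinction matters for value targets. The sketch below shows the generic one-step target used in many RL implementations; it is an illustration of the idea, not Stable-Baselines3's actual code, and td_target is a hypothetical name.

```python
def td_target(reward, next_value, gamma, terminated):
    """One-step value target: bootstrap from next_value unless the episode terminated."""
    continuation = 0.0 if terminated else 1.0
    return reward + gamma * continuation * next_value


# Truncated at the horizon (terminated=False): the next state's value still counts.
truncated_target = td_target(reward=1.0, next_value=10.0, gamma=0.9, terminated=False)

# A true terminal state would drop the continuation value entirely.
terminal_target = td_target(reward=1.0, next_value=10.0, gamma=0.9, terminated=True)
```

Reporting a horizon cut-off as terminated would bias value estimates downward near the end of each episode, which is why the two flags are kept separate.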

Parameters:
  • bp (BellmanPeriod) – Model definition.

  • initial (dict) – Maps arrival-state symbols to skagent Distribution objects used to sample initial states on reset().

  • max_episode_steps (int) – Episode horizon. Default 200.

  • control_sym (str | None) – Which control to drive. If omitted and the block has exactly one control, that control is used.

  • default_lower (float) – Fallback lower bound used when a control omits lower_bound. Default 0.0. The 0.0 / 1.0 defaults only make sense for controls whose range happens to lie in [0, 1]; for other ranges pass explicit values.

  • default_upper (float) – Fallback upper bound used when a control omits upper_bound. Default 1.0. The same caveat applies as for default_lower.

  • bound_clearance (float) – Fraction of the (hi - lo) span pulled in from each bound when unscaling, to avoid degenerate values at the edge (e.g. c=0 under log utility). Default 1e-3.

  • seed (int | None) – Seed for the env’s internal numpy RNG.

reset(*, seed=None, options=None)

Sample a fresh initial state and return (observation, info).

Draws arrival states from initial and the first period’s shocks, then returns the observation over the control’s information set. info is an empty dict. Reseeds the internal RNG when seed is given.

Parameters:

seed (int | None) – If given, reseeds the env’s internal RNG before sampling.

step(action)

Apply a normalised action and return the gymnasium 5-tuple.

action is a 1-element array in [-1, 1]; it is unscaled to the control’s per-state bounds before being applied. Returns (observation, reward, terminated, truncated, info), where terminated is always False and truncated fires at max_episode_steps. info carries the resolved discount, the action_unscaled value, and the bounds used for unscaling.
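A driver loop over reset() and step() might look like the sketch below. It relies only on the gymnasium-style signatures documented here, so it works with any object exposing them; run_episode and the policy callable are hypothetical names introduced for illustration.

```python
def run_episode(env, policy, seed=None):
    """Run one episode and return (undiscounted total reward, number of steps)."""
    obs, info = env.reset(seed=seed)
    total_reward, n_steps = 0.0, 0
    terminated = truncated = False
    while not (terminated or truncated):
        action = policy(obs)  # expected: a 1-element array in [-1, 1]
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += float(reward)
        n_steps += 1
    return total_reward, n_steps
```

With GymEnv, terminated never fires, so the loop always exits through truncated after max_episode_steps steps.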

unscale_action(action_norm, obs)

Map normalised action(s) in [-1, 1] to the control’s real value.

For each row of obs, evaluates the control’s per-state bounds lo/hi (from Control.lower_bound / Control.upper_bound applied to the iset values) and unscales the corresponding action via

a_real = (lo + ε·span) + ½ (a_norm + 1) (span − 2 ε·span)

where span = hi − lo and ε = bound_clearance. This is the same transform step() applies internally, so callers who run a trained SB3 model outside the gym loop (e.g. for diagnostics or to build a skagent decision rule) can use it directly instead of re-deriving the unscaling.

Parameters:
  • action_norm (array-like) – Normalised action(s). Scalar, shape (N,), or shape (N, 1). Values outside [-1, 1] are clipped.

  • obs (array-like) – Observation(s). Shape (|iset|,) for a single state, or (N, |iset|) for a batch. Must broadcast with action_norm along the leading axis.

Returns:

Unscaled action values. Always 1-D; scalar inputs return shape (1,).

Return type:

ndarray
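For readers who want to check the formula numerically, here is a standalone NumPy sketch of the transform. It is not the method itself: unscale_with_clearance is a hypothetical name, and it takes lo and hi directly instead of evaluating them from obs.

```python
import numpy as np


def unscale_with_clearance(action_norm, lo, hi, bound_clearance=1e-3):
    """Map normalised actions in [-1, 1] onto [lo + eps*span, hi - eps*span]."""
    # Flatten to 1-D and clip out-of-range actions, as documented above.
    a = np.clip(np.asarray(action_norm, dtype=float).reshape(-1), -1.0, 1.0)
    lo = np.asarray(lo, dtype=float).reshape(-1)
    hi = np.asarray(hi, dtype=float).reshape(-1)
    span = hi - lo
    inner_lo = lo + bound_clearance * span            # lo + eps*span
    inner_span = span - 2.0 * bound_clearance * span  # span - 2*eps*span
    return inner_lo + 0.5 * (a + 1.0) * inner_span


# With lo=0, hi=2 and eps=1e-3, the reachable range is [0.002, 1.998].
edges = unscale_with_clearance([-1.0, 1.0], lo=[0.0, 0.0], hi=[2.0, 2.0])
```

The clearance keeps the extreme actions strictly inside the bounds, which avoids values such as c = 0 under log utility.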

skagent.env.discounted_rollout_reward(bp, decision_rule, initial, steps, rng=None)

Realized discounted reward of a single rollout under decision_rule.

Simulates one episode of steps periods through an Environment, accumulating per-period rewards weighted by the running product of the model’s (possibly per-period) discount factor.

Parameters:
  • bp (BellmanPeriod) – Model definition.

  • decision_rule (dict) – Maps control symbol to a callable on the control’s information set, as consumed by Environment.step().

  • initial (dict) – Maps arrival-state symbols to skagent Distribution objects used to sample the initial state.

  • steps (int) – Number of periods to simulate.

  • rng (Generator | None) – RNG used for the initial-state and shock draws.

Returns:

sum_t (prod_{s<t} discount_s) * reward_t.

Return type:

float
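The returned quantity can be reproduced from recorded per-period rewards and discounts with a running product. The sketch below is a plain-Python illustration of that sum under the formula above, not the function's implementation; discounted_sum is a hypothetical name.

```python
def discounted_sum(rewards, discounts):
    """Compute sum_t (prod_{s<t} discount_s) * reward_t."""
    total, weight = 0.0, 1.0  # the empty product for t = 0 is 1
    for reward, discount in zip(rewards, discounts):
        total += weight * reward
        weight *= discount  # this period's discount applies to later rewards only
    return total


# Three periods of unit reward with a constant discount of 0.5: 1 + 0.5 + 0.25.
value = discounted_sum([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
```

Because the weight is updated after each reward is added, the first reward is undiscounted and per-period discount factors are handled without special cases.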