Environments
This section documents the environment adapters that step a
BellmanPeriod one transition at a time, for use by
reinforcement-learning algorithms.
Two interfaces are provided:
- Environment: a plain Python environment that returns the full transition (state, action, reward, next_state, discount, obs) as dicts keyed by symbol, suitable for off-policy RL and for rollouts.
- GymEnv: a gymnasium adapter so Stable-Baselines3 algorithms (PPO, SAC, TD3, …) can drive a BellmanPeriod directly. It is the backend used by skagent.algos.sb3.PPOAgent (see Algorithms).
Environment adapters for BellmanPeriod models.
Two interfaces are provided:
- Environment: plain Python environment that steps a BellmanPeriod by drawing shocks and applying a decision rule. Returns the full transition (state, action, reward, next_state, discount, obs) as dicts keyed by symbol, suitable for off-policy RL algorithms that consume full transitions.
- GymEnv: gymnasium adapter wrapping a BellmanPeriod so Stable Baselines3 algorithms (PPO, SAC, TD3, …) can drive it directly. The action space is normalised to [-1, 1]; the env unscales each action to the control’s per-state bounds (taken from Control.lower_bound / Control.upper_bound) before applying it.
- class skagent.env.Environment(bp, initial, rng=None)
Step a BellmanPeriod one transition at a time.
Single agent. Returns torch tensors keyed by symbol so downstream code can index by name.
- Parameters:
bp (BellmanPeriod) – Model definition (block dynamics, calibration, discount variable).
initial (dict) – Maps arrival-state symbols to skagent Distribution objects, used to sample fresh initial states on each reset().
rng (Generator | None) – RNG used for shock and initial-state draws.
- class skagent.env.GymEnv(bp, initial, max_episode_steps=200, *, control_sym=None, default_lower=0.0, default_upper=1.0, bound_clearance=0.001, seed=None)
gymnasium adapter for a single-agent, single-control BellmanPeriod.
Designed to be driven by Stable Baselines3 (PPO, SAC, TD3, …).
- Action space
Box(-1, 1, shape=(1,)), normalised. Each action is unscaled to the control’s per-state bounds via
a_real = lo + (a_norm + 1) / 2 * (hi - lo)
where lo, hi are evaluated at the current pre-decision state. Missing bounds fall back to default_lower / default_upper. This expression leaves out the bound_clearance margin; unscale_action() documents the full transform.
- Observation space
Box(-inf, inf, shape=(|iset|,)) over the control’s information set (Control.iset), in the iset’s declared order.
- Episode timing
terminated is always False (no native end-of-life signal yet); truncated fires when max_episode_steps is reached. PPO bootstraps correctly across truncations as long as the wrapper sets the flag, which this adapter does.
- Parameters:
bp (BellmanPeriod) – Model definition.
initial (dict) – Maps arrival-state symbols to skagent Distribution objects used to sample initial states on reset().
max_episode_steps (int) – Episode horizon. Default 200.
control_sym (str | None) – Which control to drive. If omitted and the block has exactly one control, that control is used.
default_lower (float) – Fallback lower bound when a control omits lower_bound. The defaults 0.0 / 1.0 only make sense for action spaces that happen to lie in [0, 1]; for other ranges pass explicit values.
default_upper (float) – Fallback upper bound when a control omits upper_bound. The same caveat as for default_lower applies.
bound_clearance (float) – Fraction of the (hi - lo) span pulled in from each bound when unscaling, to avoid degenerate values at the edge (e.g. c=0 under log utility). Default 1e-3.
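The episode-timing note in the class description matters because a truncated episode is cut off by the horizon, not ended by the model, so a learner should still bootstrap from the value of the next state. The sketch below illustrates that rule for a one-step target. It is not SB3's or skagent's code; `one_step_target` and its arguments are hypothetical names chosen for this example.

```python
def one_step_target(reward, discount, next_value, terminated):
    """One-step value target (hypothetical helper, for illustration only).

    Bootstraps from next_value unless the episode truly terminated.
    A truncation flag is deliberately not an argument: hitting the
    step limit does not change the target.
    """
    bootstrap = 0.0 if terminated else discount * next_value
    return reward + bootstrap


# With terminated always False, every transition bootstraps, including
# the last one before the horizon cuts the episode.
target_at_truncation = one_step_target(1.0, 0.9, 10.0, terminated=False)
target_if_terminal = one_step_target(1.0, 0.9, 10.0, terminated=True)
```

Because `terminated` is always `False` in this adapter, only the first branch is ever relevant in practice.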
- reset(*, seed=None, options=None)
Sample a fresh initial state and return (observation, info).
Draws arrival states from initial and the first period’s shocks, then returns the observation over the control’s information set. info is an empty dict. Reseeds the internal RNG when seed is given.
- step(action)
Apply a normalised action and return the gymnasium 5-tuple.
action is a 1-element array in [-1, 1]; it is unscaled to the control’s per-state bounds before being applied. Returns (observation, reward, terminated, truncated, info), where terminated is always False and truncated fires at max_episode_steps. info carries the resolved discount, the action_unscaled value, and the bounds used for unscaling.
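The loop below is a minimal sketch of how a caller might consume the 5-tuple returned by step(). `ToyEnv` is a hypothetical stand-in written only for this example: it is not skagent's GymEnv and does not model a BellmanPeriod. It copies just the documented contract (terminated is always False, truncated fires at max_episode_steps, info carries a discount), and its dynamics, reward, and discount value are made up.

```python
import random


class ToyEnv:
    """Hypothetical stand-in mimicking the documented reset/step contract."""

    def __init__(self, max_episode_steps=200, seed=None):
        self.max_episode_steps = max_episode_steps
        self._rng = random.Random(seed)
        self._t = 0
        self._state = 0.0

    def reset(self, *, seed=None, options=None):
        if seed is not None:
            self._rng = random.Random(seed)  # reseed only when asked
        self._t = 0
        self._state = self._rng.uniform(0.5, 1.5)
        return [self._state], {}

    def step(self, action):
        a_norm = max(-1.0, min(1.0, float(action[0])))  # keep in [-1, 1]
        self._state = 0.9 * self._state + 0.1 * a_norm  # made-up dynamics
        self._t += 1
        reward = -abs(self._state)  # made-up reward
        terminated = False  # no end-of-life signal
        truncated = self._t >= self.max_episode_steps
        info = {"discount": 0.96}  # made-up constant discount
        return [self._state], reward, terminated, truncated, info


def run_episode(env, policy, seed=None):
    """Step until the env reports termination or truncation."""
    obs, info = env.reset(seed=seed)
    steps, total_reward = 0, 0.0
    while True:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        steps += 1
        total_reward += reward
        if terminated or truncated:
            return steps, total_reward, terminated, truncated


steps, total_reward, terminated, truncated = run_episode(
    ToyEnv(max_episode_steps=5, seed=0), policy=lambda obs: [0.0]
)
```

Checking both flags in the loop condition keeps the caller correct even if a native termination signal is added later.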
- unscale_action(action_norm, obs)
Map normalised action(s) in [-1, 1] to the control’s real value.
For each row of obs, evaluates the control’s per-state bounds lo / hi (from Control.lower_bound / Control.upper_bound applied to the iset values) and unscales the corresponding action via
a_real = (lo + ε·span) + ½ (a_norm + 1) (span − 2 ε·span)
where span = hi − lo and ε = bound_clearance. This is the same transform step() applies internally, so callers who run a trained SB3 model outside the gym loop (e.g. for diagnostics or to build a skagent decision rule) can use it directly instead of re-deriving the unscaling.
- Parameters:
action_norm (array-like) – Normalised action(s). Scalar, shape (N,), or shape (N, 1). Values outside [-1, 1] are clipped.
obs (array-like) – Observation(s). Shape (|iset|,) for a single state, or (N, |iset|) for a batch. Must broadcast with action_norm along the leading axis.
- Returns:
Unscaled action values. Always 1-D; scalar inputs return shape (1,).
- Return type:
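As a worked illustration of the formula above, the following sketch applies the transform to a single scalar action. It is an assumption-laden simplification, not the library's implementation: `unscale_scalar` is a hypothetical helper, and it takes lo and hi as plain numbers, whereas the real method evaluates them from obs.

```python
def unscale_scalar(a_norm, lo, hi, bound_clearance=1e-3):
    """Map a normalised action in [-1, 1] into [lo + eps*span, hi - eps*span].

    Hypothetical helper for illustration; lo and hi are passed in directly.
    """
    a = max(-1.0, min(1.0, a_norm))  # clip values outside [-1, 1]
    span = hi - lo
    inner_lo = lo + bound_clearance * span  # lower bound pulled inwards
    inner_span = span - 2.0 * bound_clearance * span  # margin on both sides
    return inner_lo + 0.5 * (a + 1.0) * inner_span


# With lo = 0, hi = 2 and the default clearance, the extremes land
# 0.002 inside each bound and the midpoint is unchanged.
low_edge = unscale_scalar(-1.0, 0.0, 2.0)
midpoint = unscale_scalar(0.0, 0.0, 2.0)
high_edge = unscale_scalar(1.0, 0.0, 2.0)
```

Setting bound_clearance to 0 reduces the expression to lo + (a_norm + 1) / 2 * (hi - lo), the simpler form given in the class description.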
- skagent.env.discounted_rollout_reward(bp, decision_rule, initial, steps, rng=None)
Realized discounted reward of a single rollout under decision_rule.
Simulates one episode of steps periods through an Environment, accumulating per-period rewards weighted by the running product of the model’s (possibly per-period) discount factor.
- Parameters:
bp (BellmanPeriod) – Model definition.
decision_rule (dict) – Maps control symbol to a callable on the control’s information set, as consumed by Environment.step().
initial (dict) – Maps arrival-state symbols to skagent Distribution objects used to sample the initial state.
steps (int) – Number of periods to simulate.
rng (Generator | None) – RNG used for the initial-state and shock draws.
- Returns:
sum_t (prod_{s<t} discount_s) * reward_t.
- Return type:
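The returned quantity can be illustrated on plain lists. The sketch below computes sum_t (prod_{s<t} discount_s) * reward_t from per-period rewards and discounts that are assumed to have been collected already. `discounted_sum` is a hypothetical helper for this example and does not simulate a BellmanPeriod.

```python
def discounted_sum(rewards, discounts):
    """Accumulate rewards weighted by the running product of past discounts.

    Hypothetical helper: rewards[t] and discounts[t] are the reward and
    discount factor realised in period t.
    """
    total = 0.0
    weight = 1.0  # empty product for t = 0
    for reward, discount in zip(rewards, discounts):
        total += weight * reward
        weight *= discount  # period t's discount applies from t + 1 onwards
    return total


# Constant discount: 1 + 0.5 + 0.25
constant_case = discounted_sum([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
# Per-period discounts: 1 + 0.9 * 2 + (0.9 * 0.5) * 3
varying_case = discounted_sum([1.0, 2.0, 3.0], [0.9, 0.5, 0.1])
```

Note that the first reward is undiscounted and the last period's discount never enters the sum, which matches the strict inequality s < t in the product.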