Algorithms¶
This section contains the API documentation for solution algorithms, neural network components, and grid tools used to solve dynamic stochastic optimization problems.
Value Backwards Induction (VBI)¶
The value backwards induction (VBI) algorithm derives the arrival value function from a continuation value function and the stage dynamics of model blocks.
- skagent.algos.vbi.ar_from_data(da)¶
Produce a function from any inputs to a given value. This is useful for constructing decision rules with fixed actions.
- skagent.algos.vbi.get_action_rule(action)¶
Produce a function from any inputs to a given value. This is useful for constructing decision rules with fixed actions.
- skagent.algos.vbi.grid_to_data_array(grid={})¶
Construct a zero-valued DataArray with the coordinates based on the Grid passed in.
- skagent.algos.vbi.solve(block, continuation, state_grid, disc_params={}, calibration={})¶
Solve a DBlock using backwards induction on the value function.
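As a rough illustration of what one backwards-induction step over a state grid computes, the sketch below solves a single stage of a toy consumption problem with NumPy. It does not call skagent: the name backward_step, the log-utility reward, and the transition a = m - c are assumptions chosen for illustration only.

```python
import numpy as np


def backward_step(state_grid, continuation, beta=0.95, n_actions=200):
    """One backward-induction step for a toy consumption problem.

    Assumed (hypothetical) stage: the agent arrives with resources m,
    chooses consumption c in (0, m), earns log(c), and carries a = m - c
    into the continuation value function.
    """
    values = np.empty(len(state_grid))
    policy = np.empty(len(state_grid))
    for i, m in enumerate(state_grid):
        # Candidate actions strictly inside the feasible interval (0, m).
        c = np.linspace(0.0, m, n_actions + 2)[1:-1]
        objective = np.log(c) + beta * continuation(m - c)
        best = np.argmax(objective)
        values[i] = objective[best]
        policy[i] = c[best]
    return values, policy


# Terminal-style continuation: leftover resources a are worth log(a).
state_grid = np.linspace(1.0, 10.0, 10)
arrival_value, consumption = backward_step(state_grid, np.log)
```

The arrival value at each grid point is the maximized sum of the stage reward and the discounted continuation value; the maximizing action at each point is the decision rule on the grid.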
Maliar-Style Algorithms¶
Neural network-based solution methods following Maliar et al.
Tools for the implementation of the Maliar, Maliar, and Winant (JME ’21) method.
This method relies on a simpler problem representation than that elaborated by the skagent Block system.
Note
generate_givens_from_states currently accesses bellman_period.block
directly rather than working through the BellmanPeriod interface. A future
refactoring could route shock generation through BellmanPeriod itself.
Similarly, shock draws are currently Monte Carlo only; structured draws
(e.g. exact discretizations) could be supported via BellmanPeriod.
- skagent.algos.maliar.generate_givens_from_states(states, model_block, shock_copies)¶
Generate omega_i values of the MMW JME ’21 method.
- Parameters:
- Returns:
Grid containing states augmented with shock copies.
- Return type:
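To make "states augmented with shock copies" concrete, here is a minimal NumPy sketch that attaches several independent Monte Carlo shock draws to each state row. It is not the library routine: the function name, the draw_shock callable, and the theta_0, theta_1 column naming are all assumptions made for this example.

```python
import numpy as np


def augment_with_shock_copies(states, draw_shock, shock_copies, seed=None):
    """Attach `shock_copies` independent shock draws to every state row.

    `states` maps symbol -> 1-D array; `draw_shock(rng, n)` returns n draws.
    The `theta_<k>` column naming is an assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)
    n = len(next(iter(states.values())))
    givens = {sym: np.asarray(vals) for sym, vals in states.items()}
    for k in range(shock_copies):
        # Each copy is a fresh, independent draw for every state row.
        givens[f"theta_{k}"] = draw_shock(rng, n)
    return givens


states = {"a": np.linspace(0.5, 5.0, 4)}
givens = augment_with_shock_copies(
    states, lambda rng, n: rng.lognormal(0.0, 0.1, n), shock_copies=2, seed=0
)
```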
- skagent.algos.maliar.maliar_training_loop(bellman_period, loss_function, states_0_n, parameters, shock_copies=2, max_iterations=5, tolerance=1e-06, random_seed=None, simulation_steps=1, network_width=16, epochs_per_iteration=250, lr=0.001)¶
Run the Maliar, Maliar, and Winant (JME ’21) training loop.
Trains a single neural network policy to minimize empirical risk (loss) on a panel of states drawn forward through the model dynamics. This helper constructs and trains a BlockPolicyNet internally and does not currently accept a pre-built shared-backbone BlockPolicyValueNet. If value-aware training is needed (e.g. for a Bellman residual loss with a value head), call train_block_nn() directly on a BlockPolicyValueNet; a future refactor may add value-network support here.
The loop maps onto the MMW JME ’21 algorithm steps as follows: _validate_training_inputs() and the network construction below cover Step 1 (initialize topology and coefficients); the per-iteration train_block_nn() call is Step 2 (minimize the empirical risk \(\Xi^n(\theta)\)); the returned network is the Step 3 trained approximation \(\varphi(\cdot, \theta)\).
- Parameters:
bellman_period (BellmanPeriod) – A model definition containing block dynamics and transitions.
loss_function (Callable) – The empirical risk function \(\Xi^n\) from MMW JME ’21. This function is passed to the neural network training routine as loss_function(decision_function, input_grid) -> loss_tensor.
states_0_n (Grid) – A panel of starting states for training. Must contain at least one state.
parameters (dict) – Given parameters for the model.
shock_copies (int) – Number of shock copies to include in the training set \(\{\omega_i\}\). Must match the expected number of shock copies in the loss function. Must be >= 1. Default is 2.
max_iterations (int) – Maximum number of training loop iterations before stopping. Must be >= 1. Default is 5.
tolerance (float) – Convergence tolerance. Training stops when either the L2 norm of parameter changes or the absolute difference in loss is below this threshold. Satisfying either criterion alone is sufficient. Must be > 0. Default is 1e-6.
random_seed (Optional[int]) – Random seed for reproducibility. Default is None.
simulation_steps (int) – Number of time steps to simulate forward when determining the next training set \(\{\omega_i\}\). Higher values let the training states explore more of the state space at higher computational cost. Must be >= 1. Default is 1.
network_width (int) – Width of hidden layers in the policy neural network. Must be >= 1. Default is 16.
epochs_per_iteration (int) – Number of training epochs per iteration. Must be >= 1. Default is 250.
lr (float) – Learning rate for the internal Adam optimizer. The optimizer is created once and reused across iterations to preserve momentum. Must be > 0. Default is 0.001.
- Returns:
(trained_policy_network, training_states) where trained_policy_network is the trained BlockPolicyNet and training_states is the Grid of states from the final iteration (the convergence point if training converged early, otherwise the states after max_iterations steps).
- Return type:
- Raises:
ValueError – If max_iterations < 1, tolerance <= 0, shock_copies < 1, simulation_steps < 1, network_width < 1, epochs_per_iteration < 1, or states_0_n contains no states.
TypeError – If bellman_period is None or loss_function is not callable.
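The stopping rule described for tolerance (either criterion alone is sufficient) can be sketched in plain Python as follows. The helper name has_converged and the flat-list representation of the network parameters are hypothetical; the library's internal check may be organized differently.

```python
import math


def has_converged(prev_params, params, prev_loss, loss, tolerance=1e-6):
    """Return True if either stopping criterion is met.

    Criterion 1: L2 norm of the parameter change is below `tolerance`.
    Criterion 2: absolute change in the loss is below `tolerance`.
    """
    param_change = math.sqrt(
        sum((new - old) ** 2 for new, old in zip(params, prev_params))
    )
    loss_change = abs(loss - prev_loss)
    return param_change < tolerance or loss_change < tolerance


# Parameters moved a lot, but the loss has stalled: counts as converged.
stalled_loss = has_converged([0.0, 0.0], [1.0, 1.0], 0.5, 0.5 + 1e-9)
# Both the parameters and the loss are still moving: keep iterating.
still_moving = has_converged([0.0, 0.0], [1.0, 1.0], 0.5, 0.4)
```

Because the two criteria are joined with "or", a loss plateau can end training even while the parameters are still changing, which is worth keeping in mind when choosing tolerance.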
- skagent.algos.maliar.simulate_forward(states_t, bellman_period, decision_function, parameters, big_t)¶
Simulate the model forward for a specified number of periods.
- Parameters:
bellman_period (BellmanPeriod) – The Bellman period containing model dynamics.
decision_function (Callable) – Function mapping (states, shocks, parameters) to controls.
parameters (dict) – Model parameters.
big_t (int) – Number of time periods to simulate forward. If 0, returns the initial states unchanged.
- Returns:
Final state values after big_t periods.
- Return type:
- Raises:
ValueError – If big_t < 0 or if states_t is an empty dict.
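The control flow of a forward simulation, including the big_t = 0 and error cases listed above, can be sketched with a toy model. Everything here other than the (states, shocks, parameters) decision-function signature is an assumption: the transition and draw_shocks callables stand in for what a BellmanPeriod would provide.

```python
import numpy as np


def simulate_forward_sketch(states_t, transition, decision_function,
                            draw_shocks, parameters, big_t):
    """Iterate a toy model forward `big_t` periods.

    `transition` and `draw_shocks` are hypothetical stand-ins for the
    dynamics and shock draws that a BellmanPeriod would supply.
    """
    if big_t < 0:
        raise ValueError("big_t must be non-negative")
    if not states_t:
        raise ValueError("states_t must not be empty")
    states = dict(states_t)
    for _ in range(big_t):
        shocks = draw_shocks(states)
        controls = decision_function(states, shocks, parameters)
        states = transition(states, shocks, controls, parameters)
    return states


rng = np.random.default_rng(0)
params = {"R": 1.02}


def consume_half(states, shocks, parameters):
    # Toy rule: consume half of cash on hand (assets plus income shock).
    return {"c": 0.5 * (states["a"] + shocks["y"])}


def next_assets(states, shocks, controls, parameters):
    m = states["a"] + shocks["y"]
    return {"a": parameters["R"] * (m - controls["c"])}


def income(states):
    return {"y": rng.uniform(0.9, 1.1, size=states["a"].shape)}


start = {"a": np.array([1.0, 2.0, 3.0])}
unchanged = simulate_forward_sketch(start, next_assets, consume_half, income, params, 0)
later = simulate_forward_sketch(start, next_assets, consume_half, income, params, 10)
```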
Reinforcement Learning (Stable-Baselines3)¶
Proximal Policy Optimization (PPO) for BellmanPeriod models, via a
Stable-Baselines3 backend. The
agent wraps a model in a gymnasium environment (see Environments), trains
PPO, and emits a standard skagent decision rule.
Stable Baselines3 wrappers for BellmanPeriod models.
Provides PPOAgent, a thin wrapper around SB3’s PPO that:
- builds a skagent.env.GymEnv from a BellmanPeriod + initial state distribution,
- delegates training to stable_baselines3.PPO.learn,
- exposes a PPOAgent.decision_rule() that returns the trained policy as a skagent-style {control_sym: callable} dict, i.e. the same shape consumed by skagent.env.Environment and the rest of the skagent decision-rule API. Actions are unscaled back to real units via GymEnv.unscale_action(), so downstream code does not see the [-1, 1] SB3 representation.
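The exact formula used by GymEnv.unscale_action() is not given here. A common convention, assumed in this sketch, is an affine map between the [-1, 1] representation and the control's real bounds; the function names and the bounds used below are illustrative only.

```python
def unscale_action(scaled, lower, upper):
    """Map an action in [-1, 1] to [lower, upper] by an affine transform."""
    return lower + 0.5 * (scaled + 1.0) * (upper - lower)


def scale_action(action, lower, upper):
    """Inverse map: real units back to the [-1, 1] representation."""
    return 2.0 * (action - lower) / (upper - lower) - 1.0


# With hypothetical bounds 0 <= c <= 4, the midpoint action 0 maps to c = 2.
consumption = unscale_action(0.0, lower=0.0, upper=4.0)
```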
- class skagent.algos.sb3.PPOAgent(bp, initial, *, max_episode_steps=200, seed=None, gym_kwargs=None, ppo_kwargs=None, policy='MlpPolicy', device='cpu', verbose=0)¶
Train SB3’s PPO on a BellmanPeriod and emit a skagent decision rule.
- Parameters:
bp (BellmanPeriod) – Model definition.
initial (dict) – Maps arrival-state symbols to skagent Distribution objects, used by GymEnv to sample fresh initial states on reset.
max_episode_steps (int) – Episode horizon for the underlying GymEnv. Default 200.
seed (Optional[int]) – Seed for both the environment and the PPO algorithm.
gym_kwargs (Optional[dict]) – Extra keyword arguments forwarded to GymEnv (e.g. default_lower, default_upper, bound_clearance, control_sym).
ppo_kwargs (Optional[dict]) – Extra keyword arguments forwarded to stable_baselines3.PPO (e.g. n_steps, batch_size, learning_rate, n_epochs, policy_kwargs). gamma defaults to bp.calibration[bp.discount_variable] if it is a finite scalar; callers can override by passing gamma here.
policy (str) – SB3 policy class string. Default "MlpPolicy".
device (str) – Torch device for PPO. Default "cpu" (SB3’s recommended default for MlpPolicy).
verbose (int) – Verbosity passed to PPO. Default 0.
- model¶
The SB3 model. None until learn() is called the first time (constructed lazily so callers can inspect env without paying PPO’s setup cost).
- Type:
stable_baselines3.PPO
- decision_rule(deterministic=True)¶
Return a skagent decision rule that uses the trained policy.
The returned dict has the form {control_sym: callable} where the callable accepts positional arguments matching the control’s iset order (i.e. the same signature skagent Environment.step and BellmanPeriod.decision_function call). The callable’s output is a torch.Tensor of unscaled action values; the inputs may be scalars, numpy arrays, or torch tensors of compatible length.
- learn(total_timesteps, callback=None, **kwargs)¶
Run PPO.learn. Returns self.
Every completed episode’s undiscounted reward is appended to self.episode_rewards via an internal SB3 callback; repeated learn calls accumulate. A user-supplied callback is merged with the internal one via CallbackList. Extra **kwargs forward to model.learn.
- predict_unscaled(obs, deterministic=True)¶
Predict an unscaled action for obs.
obs may be a single observation (shape (|iset|,)) or a batch ((N, |iset|)). Returns a 1-D array of shape (N,).
- snapshot()¶
Capture the current trained policy as a frozen PolicySnapshot.
The snapshot holds an independent copy of the policy network, so it is unaffected by later learn() calls. This is the supported way to retain the policy at intermediate points during training (e.g. to compare checkpoints) without re-running training or re-implementing the unscaling logic.
- Return type:
PolicySnapshot
- class skagent.algos.sb3.PolicySnapshot(policy, env)¶
Frozen copy of a trained policy, decoupled from further training.
Returned by PPOAgent.snapshot(). Holds a deep copy of the policy network taken at snapshot time, so subsequent learn calls on the source agent do not change its predictions. Exposes the same predict_unscaled() and decision_rule() interface as PPOAgent.
The GymEnv is shared with the source agent (not copied): it is used only for stateless action unscaling, which does not depend on training state.
- Parameters:
env (GymEnv)
- decision_rule(deterministic=True)¶
Return a skagent decision rule; see PPOAgent.decision_rule().
- predict_unscaled(obs, deterministic=True)¶
Predict an unscaled action for obs; see PPOAgent.predict_unscaled().
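The reason a snapshot takes a deep copy can be seen with a toy example that uses only the standard library. ToyPolicy and ToySnapshot are hypothetical stand-ins, not skagent or SB3 classes; changing the weight stands in for a later learn() call.

```python
import copy


class ToyPolicy:
    """Hypothetical stand-in for a trainable policy with one weight."""

    def __init__(self, weight):
        self.weight = weight

    def predict(self, obs):
        return self.weight * obs


class ToySnapshot:
    """Frozen copy of a policy: deep-copied so later training cannot touch it."""

    def __init__(self, policy):
        self.policy = copy.deepcopy(policy)

    def predict(self, obs):
        return self.policy.predict(obs)


policy = ToyPolicy(weight=1.0)
checkpoint = ToySnapshot(policy)
policy.weight = 3.0  # stands in for a further learn() call
before, after = checkpoint.predict(2.0), policy.predict(2.0)
```

Holding a plain reference instead of a copy would make the checkpoint track the live policy, which defeats the purpose of comparing intermediate policies.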
Solvers¶
High-level routines that drive the neural-network training utilities to solve a model. See the Algorithms Guide for worked examples.
- skagent.solver.solve_multiple_controls(control_order, bellman_period, givens, calibration, epochs=200, loss=None)¶
Solve a block with more than one control by training a policy network for each control in turn.
Each control is given its own skagent.ann.BlockPolicyNet. The networks are trained one at a time, in the order given by control_order, with every network treating the other networks’ current policies as fixed. A control may appear in control_order more than once to refine it after its neighbours have been updated (e.g. ["c", "d", "c"]), which is the multi-control analogue of a best-response sweep.
Currently restricted to single-period (non-recurring) reward objectives; by default the negative immediate reward (skagent.loss.StaticRewardLoss) is minimized.
TODO: allow a variable ‘loss function generator’ once the API has solidified.
- Parameters:
control_order (list of str) – Control symbols, in the order they should be solved. Symbols may repeat to schedule additional refinement passes.
bellman_period (BellmanPeriod) – The model period whose controls are being solved.
givens (skagent.grid.Grid) – Grid of arrival states and shock realizations to train over.
calibration (dict) – Calibration parameters passed to the loss function.
epochs (int, optional) – Training epochs per pass. Default is 200.
loss (type, optional) – A loss-function class with signature loss(bellman_period, parameters, other_dr). Defaults to skagent.loss.StaticRewardLoss.
- Returns:
Mapping from each control symbol to its trained decision rule.
- Return type:
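The best-response sweep over control_order can be illustrated without any neural networks. In this sketch each "training pass" is replaced by a closed-form maximizer of a toy two-control objective; the objective and the helper names are assumptions, and in the library each pass would instead train that control's network against the others' current policies.

```python
def best_response(control, policies):
    """Closed-form maximizer of r(c, d) = -(c - 1)**2 - (d - c)**2
    in one control, holding the other fixed (a toy, assumed objective)."""
    if control == "c":
        return (1.0 + policies["d"]) / 2.0
    if control == "d":
        return policies["c"]
    raise KeyError(control)


def sweep(control_order, policies):
    """Update one control at a time, in order; repeats refine a control."""
    policies = dict(policies)
    for control in control_order:
        policies[control] = best_response(control, policies)
    return policies


# "c" appears twice, so it is refined after "d" has been updated.
result = sweep(["c", "d", "c"], {"c": 0.0, "d": 0.0})
```

Repeating the sweep moves both controls toward the joint optimum c = d = 1 of the toy objective, which is the sense in which repeated entries in control_order act as refinement passes.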
Loss Functions¶
Objective functions passed to skagent.ann.train_block_nn(). The
reward-based losses (StaticRewardLoss,
EstimatedDiscountedLifetimeRewardLoss) solve a block
directly for the non-recurring case; the equation-residual losses
(BellmanEquationLoss,
EulerEquationLoss) target the recurring, dynamic case.
See Loss Functions for the full reference.
Neural Network Components¶
Net¶
Base neural network class with device management.
- class skagent.ann.Net(n_inputs, n_outputs, width=32, n_layers=2, activation='silu', transform=None, init_seed=None, copy_weights_from=None)¶
Bases: Module
A flexible feedforward neural network with configurable architecture.
- Parameters:
n_inputs (int) – Number of input features
n_outputs (int) – Number of output features
width (int, optional) – Width of hidden layers. Default is 32.
n_layers (int, optional) – Number of hidden layers (1-10). Default is 2.
activation (str, list, callable, or None, optional) – Activation function(s) to use. Options:
- str: Apply same activation to all layers (‘silu’, ‘relu’, ‘tanh’, ‘sigmoid’)
- list: Apply different activations to each layer, e.g., [‘relu’, ‘tanh’, ‘silu’]
- callable: Custom activation function
- None: No activation (identity function)
Available activations: ‘silu’, ‘relu’, ‘tanh’, ‘sigmoid’, ‘identity’. Default is ‘silu’.
transform (str, list, callable, or None, optional) – Transformation to apply to outputs. Options:
- str: Apply same transform to all outputs (‘sigmoid’, ‘exp’, ‘tanh’, etc.)
- list: Apply different transforms to each output, e.g., [‘sigmoid’, ‘exp’]
- callable: Custom transformation function
- None: No transformation
Available transforms: ‘sigmoid’, ‘exp’, ‘tanh’, ‘relu’, ‘softplus’, ‘softmax’, ‘abs’, ‘square’, ‘identity’. Default is None.
- property device¶
Device property for backward compatibility.
- forward(x)¶
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
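The str, list, callable, and None forms accepted by the transform parameter can be pictured with a small NumPy dispatcher. This is a sketch of the dispatch idea only, covering a subset of the listed transform names; it is not the Net implementation, and the helper name apply_transform is hypothetical.

```python
import numpy as np

TRANSFORMS = {
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "exp": np.exp,
    "tanh": np.tanh,
    "identity": lambda x: x,
}


def apply_transform(outputs, transform):
    """Apply one transform to all columns, or one transform per column."""
    if transform is None:
        return outputs
    if callable(transform):
        return transform(outputs)
    if isinstance(transform, str):
        return TRANSFORMS[transform](outputs)
    if len(transform) != outputs.shape[1]:
        raise ValueError("need one transform per output column")
    columns = [TRANSFORMS[name](outputs[:, j]) for j, name in enumerate(transform)]
    return np.stack(columns, axis=1)


# Column 0 is squashed into (0, 1); column 1 is made strictly positive.
raw = np.array([[0.0, 0.0], [2.0, -1.0]])
shaped = apply_transform(raw, ["sigmoid", "exp"])
```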
BlockPolicyNet¶
A neural network for policy functions in dynamic programming problems.
- class skagent.ann.BlockPolicyNet(bellman_period, control_sym=None, apply_open_bounds=True, width=32, **kwargs)¶
Bases: BellmanPeriodMixin, Net
A neural network for policy functions in dynamic programming problems.
This network wraps a Net and integrates with the BellmanPeriod interface. It automatically determines input/output dimensions from the model block specification and enforces control variable bounds.
- Parameters:
bellman_period (BellmanPeriod) – The model Bellman Period
apply_open_bounds (bool, optional) – If True, then the network forward output is normalized by the upper and/or lower bounds, computed as a function of the input tensor. These bounds are “open” because output can be arbitrarily close to, but not equal to, the bounds. Default is True.
control_sym (string, optional) – The symbol for the control variable.
width (int, optional) – Width of hidden layers. Default is 32.
**kwargs – Additional keyword arguments passed to Net. See Net class documentation for all available options including activation, transform, n_layers, init_seed, copy_weights_from, etc.
- decision_function(states_t, shocks_t, parameters)¶
A decision function, from states, shocks, and parameters, to control variable values.
- forward(x)¶
Note that this uses the same architecture of the superclass but adds on a normalization layer appropriate to the bounds of the decision rule.
- get_core_function(length=None)¶
- get_decision_rule(length=None)¶
Returns the decision rule corresponding to this neural network.
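One way to picture the "open bounds" normalization is to squash the raw network output into an open interval whose endpoints may depend on the input. The sketch below uses a sigmoid for two-sided bounds and a softplus for one-sided bounds; that pairing, and the function names, are assumptions for illustration rather than a description of the class internals.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softplus(x):
    # log(1 + exp(x)), written to stay stable for large |x|.
    return np.logaddexp(0.0, x)


def apply_open_bounds(raw, lower=None, upper=None):
    """Squash raw network output into an open interval.

    The choice of sigmoid (two-sided) and softplus (one-sided) is an
    assumption for this sketch.
    """
    if lower is not None and upper is not None:
        return lower + (upper - lower) * sigmoid(raw)
    if lower is not None:
        return lower + softplus(raw)
    if upper is not None:
        return upper - softplus(raw)
    return raw


# Consumption bounded by cash on hand m: 0 < c < m, row by row.
m = np.array([1.0, 2.0, 4.0])
raw = np.array([-3.0, 0.0, 3.0])
c = apply_open_bounds(raw, lower=0.0, upper=m)
```

Because the upper bound is passed as an array, each row gets its own interval, which mirrors bounds "computed as a function of the input tensor".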
BlockValueNet¶
A neural network for value functions in dynamic programming problems.
- class skagent.ann.BlockValueNet(bellman_period, control_sym=None, width=32, **kwargs)¶
Bases: BellmanPeriodMixin, Net
Standalone value-function network for a Bellman problem.
Maps a control’s information set (the same pre-decision states a policy network sees) to a single unconstrained scalar value. It is the value-only counterpart of BlockPolicyNet, kept for algorithms that approximate a value function separately from the policy. The Maliar/MMW path in this package uses the shared-backbone BlockPolicyValueNet instead, so BlockValueNet is not wired into maliar_training_loop().
- Parameters:
bellman_period (BellmanPeriod) – The model Bellman period.
control_sym (str, optional) – Control whose information set defines the value function’s domain. Defaults to the first control.
width (int) – Width of hidden layers. Default 32.
**kwargs – Passed to Net (activation, n_layers, init_seed, etc.).
- get_core_function(length=None)¶
Return the value function (the trainable core for this net).
- get_value_function()¶
Return a callable (states, shocks, parameters) -> value tensor.
- value_function(states_t, shocks_t=None, parameters=None)¶
Evaluate the value function at the control’s information set.
Arrival states_t (with shocks_t and parameters) are mapped to the control’s information set via compute_pre_state(), mirroring BlockPolicyNet.decision_function(), then the network is evaluated.
- Returns:
Flattened value estimates, one per input row.
- Return type:
BlockPolicyValueNet¶
A shared-backbone neural network that jointly represents the policy and value functions.
- class skagent.ann.BlockPolicyValueNet(bellman_period, control_sym=None, apply_open_bounds=True, width=32, **kwargs)¶
Bases: BellmanPeriodMixin, Net
Single neural network with shared backbone for both policy and value.
Architecture: shared hidden layers → two output heads:
- Policy head: bounded output (sigmoid-scaled to satisfy constraints)
- Value head: unconstrained scalar output
Sharing the backbone means one optimizer updates all weights simultaneously, and the value head anchors the control level that first-order-condition-only training (e.g. an Euler residual loss) cannot identify.
- Parameters:
bellman_period (BellmanPeriod) – The model Bellman Period.
control_sym (str, optional) – Control variable symbol. Defaults to first control.
apply_open_bounds (bool, optional) – Apply sigmoid/softplus scaling to the policy head. Default True.
width (int, optional) – Width of hidden layers. Default 32.
**kwargs – Passed to Net (activation, n_layers, init_seed, etc.).
- decision_function(states_t, shocks_t, parameters)¶
Map states, shocks, and parameters to a controls dict.
- Parameters:
- Returns:
{control_sym: tensor} of policy-head outputs. The arrival states are mapped to the control’s information set via compute_pre_state() before the network is evaluated.
- Return type:
- forward(x)¶
Run shared backbone, then policy head (bounded) + value head.
Returns the (policy, value) pair. The policy tensor is scaled into the control’s open bounds; the value tensor is unconstrained. Both have shape (n, 1).
- get_core_function(length=None)¶
Return decision rules (policy head) for use with train_block_nn.
- get_decision_rule(length=None)¶
Decision rule returning only the policy output.
- get_policy_and_value_functions(length=None)¶
Return both policy decision rules and value function.
- get_value_function()¶
- value_function(states_t, shocks_t=None, parameters=None)¶
Evaluate the value head at the control’s information set.
The input domain mirrors decision_function(): arrival states_t (with shocks_t and parameters) are mapped to the control’s information set via compute_pre_state(), then the shared backbone’s value head is evaluated on that pre-decision representation.
- Parameters:
- Returns:
Flattened value estimates, one per input row.
- Return type:
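The shared-backbone layout (common hidden layers feeding a bounded policy head and an unconstrained value head) can be sketched as a NumPy forward pass. TwoHeadNet is a hypothetical toy with one hidden layer and random, untrained weights; it only illustrates the shapes and the bounding of the policy output, not the class's actual layers.

```python
import numpy as np


class TwoHeadNet:
    """Toy shared-backbone network: one hidden layer feeding two heads."""

    def __init__(self, n_inputs, width=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w_hidden = rng.normal(0.0, 0.5, (n_inputs, width))
        self.b_hidden = np.zeros(width)
        self.w_policy = rng.normal(0.0, 0.5, (width, 1))
        self.w_value = rng.normal(0.0, 0.5, (width, 1))

    def forward(self, x, lower, upper):
        hidden = np.tanh(x @ self.w_hidden + self.b_hidden)  # shared features
        raw_policy = hidden @ self.w_policy
        # Policy head: sigmoid-scaled into the open interval (lower, upper).
        policy = lower + (upper - lower) / (1.0 + np.exp(-raw_policy))
        # Value head: unconstrained scalar per row.
        value = hidden @ self.w_value
        return policy, value


net = TwoHeadNet(n_inputs=1)
m = np.linspace(0.5, 5.0, 6).reshape(-1, 1)
policy, value = net.forward(m, lower=0.0, upper=m)
```

Both heads read the same hidden features, so a gradient step on either head's loss also moves the shared weights.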
Training Functions¶
- skagent.ann.train_block_nn(block_policy_nn, inputs, loss_function, epochs=50, lr=0.01, optimizer=None, grad_clip=1.0, verbose=True)¶
Train a policy network by minimizing a loss function over a grid.
This is a generic stochastic-gradient-descent driver, not a solution algorithm in itself. It runs epochs Adam updates that minimize whatever loss_function is supplied, evaluated on a single, fixed grid of inputs; it is agnostic to where that grid came from or which method the loss encodes (Euler residual, Bellman residual, FOC, or a custom loss).
Because it trains on whatever inputs it is given, accuracy depends on the caller re-sampling those states across calls: Maliar, Maliar, and Winant (2021) keep the training data “constantly re-sampled,” and minimizing on a single fixed grid instead lets the solution over-fit those points while drifting elsewhere. Re-draw inputs each call (threading the returned optimizer back in to keep Adam’s momentum), or use maliar_training_loop(), which wraps this driver in the full MMW’21 outer loop: it alternates these inner SGD updates with a forward-simulation step that refreshes the training states toward the model’s ergodic set.
- Parameters:
block_policy_nn (BlockPolicyNet or BlockPolicyValueNet) – The network to train. Its get_core_function supplies the decision rule(s) the loss is evaluated against.
inputs (Grid) – Input grid containing states and shocks.
loss_function (Callable) – Loss function (decision_function, input_grid) -> loss_tensor.
epochs (int) – Number of training epochs (default 50).
lr (float) – Learning rate for Adam optimizer (default 0.01).
optimizer (Optional[Optimizer]) – Pre-existing optimizer to reuse (preserves momentum across calls). If None, a new Adam optimizer is created.
grad_clip (Optional[float]) – Maximum gradient norm for clipping (default 1.0). Set to None to disable.
verbose (bool) – Emit a logging.info message with the loss every 100 epochs (default True). Configure the root logger to suppress these.
- Returns:
(trained_network, final_loss, optimizer). The optimizer is the one passed in, or the Adam instance created internally when none was supplied; returning it always lets callers warm-start a later call by threading it back in.
- Return type:
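The calling pattern recommended above (re-draw the inputs on each call and thread the returned optimizer back in) can be sketched on a one-parameter least-squares problem. Nothing here is skagent code: train is a toy stand-in for the driver, and MomentumSGD is a deliberately simple substitute for Adam, used only because it also carries state between calls.

```python
import numpy as np


class MomentumSGD:
    """Tiny optimizer with state (velocity) that persists across calls."""

    def __init__(self, lr=0.05, beta=0.9):
        self.lr, self.beta, self.velocity = lr, beta, 0.0

    def step(self, params, grad):
        self.velocity = self.beta * self.velocity + grad
        return params - self.lr * self.velocity


def train(params, inputs, targets, epochs=50, optimizer=None, grad_clip=1.0):
    """Fit y = params * x by least squares on one fixed batch of inputs."""
    if optimizer is None:
        optimizer = MomentumSGD()
    for _ in range(epochs):
        residual = params * inputs - targets
        grad = 2.0 * np.mean(residual * inputs)
        if grad_clip is not None:
            grad = float(np.clip(grad, -grad_clip, grad_clip))
        params = optimizer.step(params, grad)
    final_loss = float(np.mean((params * inputs - targets) ** 2))
    return params, final_loss, optimizer


rng = np.random.default_rng(0)
theta, optimizer = 0.0, None
for _ in range(20):
    x = rng.uniform(0.5, 2.0, size=64)  # re-draw the training inputs
    theta, loss, optimizer = train(theta, x, 3.0 * x, optimizer=optimizer)
```

Passing optimizer=None on every call would instead rebuild the optimizer each time and discard the accumulated velocity, which is the situation the returned optimizer is meant to avoid.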
Grid and Computational Tools¶
Grid Class¶
- class skagent.grid.Grid(labels, values, torched=True)¶
Bases: object
A class representing a labeled grid of numerical values.
- Parameters:
config (dict) – dictionary with the following keys: “min” (float): The minimum value for the variable; “max” (float): The maximum value for the variable; “count” (int): The number of points to generate for the variable.
- classmethod from_config(config={}, torched=True)¶
- classmethod from_dict(kv={}, torched=False)¶
- len()¶
Returns the number of columns, similar to a dict.
- n()¶
Returns the number of values for each symbol
- shape()¶
Returns the shape of the grid values.
- to_dict()¶
Returns a data structure, key: column, similar to tensordict or structured array.
- torch()¶
- update_from_dict(kv)¶
Grid Utility Functions¶
- skagent.grid.make_grid(config)¶
Make a ‘grid’ of values based on the provided configuration.
- Parameters:
config (dict) – dictionary with the following keys: “min” (float): The minimum value for the variable; “max” (float): The maximum value for the variable; “count” (int): The number of points to generate for the variable.
- Returns:
numpy.ndarray – A NumPy array of shape (product_of_counts, num_variables), where product_of_counts is the product of all count values in the config dictionary, and num_variables is the number of keys in the config.
- skagent.grid.cartesian_product(*arrays)¶
Create a Cartesian product of input arrays.
- Parameters:
*arrays – Variable length arrays to compute product
- Returns:
Array of shape (product_of_lengths, num_arrays)
where product_of_lengths is the product of the lengths of the input arrays,
and num_arrays is the number of input arrays. Each row contains one element
of the Cartesian product.
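The output shape documented for make_grid and cartesian_product can be reproduced with NumPy and itertools. This is a sketch, not the library code: the name make_grid_sketch, the evenly spaced points, and the row ordering (last variable varying fastest) are assumptions.

```python
import itertools

import numpy as np


def make_grid_sketch(config):
    """Build a (product_of_counts, num_variables) array from a config dict.

    Each value is a dict with "min", "max", and "count"; each variable gets
    `count` evenly spaced points, and rows enumerate every combination.
    Even spacing is an assumption of this sketch.
    """
    axes = [
        np.linspace(spec["min"], spec["max"], spec["count"])
        for spec in config.values()
    ]
    return np.array(list(itertools.product(*axes)))


config = {
    "a": {"min": 0.0, "max": 1.0, "count": 3},
    "p": {"min": 1.0, "max": 2.0, "count": 2},
}
grid = make_grid_sketch(config)
```

With counts 3 and 2 the result has 3 × 2 = 6 rows and 2 columns, one column per configured variable.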