Algorithms

This section contains the API documentation for solution algorithms, neural network components, and grid tools used to solve dynamic stochastic optimization problems.

Value Backwards Induction (VBI)

The value backwards induction (VBI) algorithm derives arrival value functions from a continuation value function and the stage dynamics of model blocks.

Core VBI Functions

skagent.algos.vbi.solve(block, continuation, state_grid, disc_params={}, calibration={})

Solve a DBlock using backwards induction on the value function.

Parameters:
  • block (DBlock)

  • continuation

  • state_grid (Mapping[str, Sequence]) – This is a grid over all variables that the optimization will range over. This should be just the information set of the decision variables.

  • disc_params

  • calibration
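As a rough illustration of what a single backwards-induction step computes, the standalone sketch below maximizes current reward plus discounted continuation value at each point of a state grid. It is plain Python, not the skagent implementation: the names backward_induction_step, reward, transition, feasible, and beta are hypothetical stand-ins and do not correspond to arguments of skagent.algos.vbi.solve.

```python
import math


def backward_induction_step(state_grid, actions, reward, transition,
                            continuation, feasible, beta):
    """One step of value backwards induction on a discrete grid.

    Returns two dicts keyed by grid point: the arrival value and the
    maximizing action (a tabulated decision rule).
    """
    values, policy = {}, {}
    for state in state_grid:
        best_value, best_action = -math.inf, None
        for action in actions:
            if not feasible(state, action):
                continue
            # Bellman right-hand side: reward today + discounted value tomorrow.
            candidate = reward(state, action) + beta * continuation(
                transition(state, action)
            )
            if candidate > best_value:
                best_value, best_action = candidate, action
        values[state] = best_value
        policy[state] = best_action
    return values, policy


# Toy terminal stage (assumed for illustration): log utility and a zero
# continuation value, so the best feasible choice is to consume everything.
values, policy = backward_induction_step(
    state_grid=[1.0, 2.0, 3.0],
    actions=[0.5 * k for k in range(1, 7)],  # 0.5, 1.0, ..., 3.0
    reward=lambda m, c: math.log(c),
    transition=lambda m, c: m - c,
    continuation=lambda a: 0.0,
    feasible=lambda m, c: 0.0 < c <= m,
    beta=0.96,
)
```

With a zero continuation value the tabulated rule consumes all resources at every grid point; passing the resulting values back in as the continuation would produce the preceding stage.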

skagent.algos.vbi.get_action_rule(action)

Produce a function from any inputs to a given value. This is useful for constructing decision rules with fixed actions.
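A fixed-action rule of this kind can be pictured as a closure over the action value. The sketch below is a minimal standalone version under the assumption that the returned callable accepts and ignores arbitrary arguments; the name constant_action_rule is hypothetical, and the exact call signature of the rule returned by skagent is not stated here.

```python
def constant_action_rule(action):
    """Return a rule that ignores its inputs and always yields `action`."""

    def rule(*args, **kwargs):
        return action

    return rule


# A fixed-action rule: choose 0.5 regardless of the state passed in.
fixed_consumption = constant_action_rule(0.5)
result = fixed_consumption(m=3.0, p=1.0)
```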

skagent.algos.vbi.ar_from_data(da)

Produce a function from any inputs to a given value. This is useful for constructing decision rules with fixed actions.

skagent.algos.vbi.grid_to_data_array(grid={})

Construct a zero-valued DataArray with the coordinates based on the Grid passed in.

Parameters:

grid (Mapping[str, Sequence]) – A mapping from variable labels to a sequence of numerical values.

Returns:

An xarray.DataArray with coordinates given by the grid.

Return type:

xarray.DataArray

Maliar-Style Algorithms

Neural network-based solution methods following Maliar et al.

Tools for the implementation of the Maliar, Maliar, and Winant (JME ’21) method.

This method relies on a simpler problem representation than that elaborated by the skagent Block system.

Note

generate_givens_from_states currently accesses bellman_period.block directly rather than working through the BellmanPeriod interface. A future refactoring could route shock generation through BellmanPeriod itself. Similarly, shock draws are currently Monte Carlo only; structured draws (e.g. exact discretizations) could be supported via BellmanPeriod.

skagent.algos.maliar.generate_givens_from_states(states, model_block, shock_copies)

Generate omega_i values of the MMW JME ’21 method.

Parameters:
  • states (Grid) – A grid of starting state values (exogenous and endogenous).

  • model_block (Block) – Block information (used to get the shock names).

  • shock_copies (int) – Number of copies of the shocks to be included. Must be >= 1.

Returns:

Grid containing states augmented with shock copies.

Return type:

Grid
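The idea of augmenting a panel of states with several independent shock draws can be sketched without the library. In the standalone example below, the function name, the "name_index" column labels, and the standard normal draws are all assumptions made for illustration; skagent takes the shock names and distributions from the model block, and its labelling convention is not documented here.

```python
import random


def augment_with_shock_copies(states, shock_names, shock_copies, seed=None):
    """Append `shock_copies` independent draws of each shock to a state panel.

    `states` maps each state symbol to a list of values (one per row).
    """
    if shock_copies < 1:
        raise ValueError("shock_copies must be >= 1")
    rng = random.Random(seed)
    n_rows = len(next(iter(states.values())))
    givens = {sym: list(vals) for sym, vals in states.items()}
    for name in shock_names:
        for copy_index in range(shock_copies):
            # Hypothetical column label; N(0, 1) is a placeholder for the
            # shock distribution that the model would supply.
            givens[f"{name}_{copy_index}"] = [
                rng.gauss(0.0, 1.0) for _ in range(n_rows)
            ]
    return givens


givens = augment_with_shock_copies(
    {"a": [0.5, 1.0, 1.5]}, shock_names=["theta"], shock_copies=2, seed=0
)
```

Each row of the result pairs one starting state with two independent realizations of the shock, which is what a loss built on a product of two residuals needs.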

skagent.algos.maliar.maliar_training_loop(bellman_period, loss_function, states_0_n, parameters, shock_copies=2, max_iterations=5, tolerance=1e-06, random_seed=None, simulation_steps=1, network_width=16, epochs_per_iteration=250, lr=0.001)

Run the Maliar, Maliar, and Winant (JME ’21) training loop.

Trains a single neural network policy to minimize empirical risk (loss) on a panel of states drawn forward through the model dynamics. This helper constructs and trains a BlockPolicyNet internally and does not currently accept a pre-built shared-backbone BlockPolicyValueNet. If value-aware training is needed (e.g. for a Bellman residual loss with a value head), call train_block_nn() directly on a BlockPolicyValueNet; a future refactor may add value-network support here.

The loop maps onto the MMW JME’21 algorithm steps as follows: _validate_training_inputs() and the network construction below cover Step 1 (initialize topology and coefficients); the per-iteration train_block_nn() call is Step 2 (minimize the empirical risk \(\Xi^n(\theta)\)); the returned network is the Step 3 trained approximation \(\varphi(\cdot, \theta)\).

Parameters:
  • bellman_period (BellmanPeriod) – A model definition containing block dynamics and transitions.

  • loss_function (Callable) – The empirical risk function \(\Xi^n\) from MMW JME’21. This function is passed to the neural network training routine as loss_function(decision_function, input_grid) -> loss_tensor.

  • states_0_n (Grid) – A panel of starting states for training. Must contain at least one state.

  • parameters (dict) – Given parameters for the model.

  • shock_copies (int) – Number of shock copies to include in the training set \(\{\omega_i\}\). Must match the expected number of shock copies in the loss function. Must be >= 1. Default is 2.

  • max_iterations (int) – Maximum number of training loop iterations before stopping. Must be >= 1. Default is 5.

  • tolerance (float) – Convergence tolerance. Training stops when either the L2 norm of parameter changes or the absolute difference in loss is below this threshold. Satisfying either criterion alone is sufficient. Must be > 0. Default is 1e-6.

  • random_seed (Optional[int]) – Random seed for reproducibility. Default is None.

  • simulation_steps (int) – Number of time steps to simulate forward when determining the next training set \(\{\omega_i\}\). Higher values let the training states explore more of the state space at higher computational cost. Must be >= 1. Default is 1.

  • network_width (int) – Width of hidden layers in the policy neural network. Must be >= 1. Default is 16.

  • epochs_per_iteration (int) – Number of training epochs per iteration. Must be >= 1. Default is 250.

  • lr (float) – Learning rate for the internal Adam optimizer. The optimizer is created once and reused across iterations to preserve momentum. Must be > 0. Default is 0.001.

Returns:

(trained_policy_network, training_states) where trained_policy_network is the trained BlockPolicyNet and training_states is the Grid of states from the final iteration (the convergence point if training converged early, otherwise the states after max_iterations steps).

Return type:

tuple

Raises:
  • ValueError – If max_iterations < 1, tolerance <= 0, shock_copies < 1, simulation_steps < 1, network_width < 1, epochs_per_iteration < 1, or states_0_n contains no states.

  • TypeError – If bellman_period is None or loss_function is not callable.
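The stopping rule described for tolerance (either criterion alone is sufficient) can be written out as a small predicate. This is a standalone sketch: has_converged is a hypothetical name, and representing the network parameters as flat lists of floats is an assumption, since how the weights are flattened internally is not shown here.

```python
import math


def has_converged(prev_params, params, prev_loss, loss, tolerance=1e-6):
    """Return True if either stopping criterion falls below `tolerance`."""
    # L2 norm of the change in the (flattened) parameter vector.
    param_change = math.sqrt(
        sum((new - old) ** 2 for new, old in zip(params, prev_params))
    )
    # Absolute change in the loss between consecutive iterations.
    loss_change = abs(loss - prev_loss)
    return param_change < tolerance or loss_change < tolerance


# Parameters still moving, but the loss has stalled: this counts as converged.
stalled_loss = has_converged([0.0, 0.0], [0.3, 0.4], prev_loss=1.25, loss=1.25)
# Both the parameters and the loss are still changing: keep iterating.
still_moving = has_converged([0.0, 0.0], [0.3, 0.4], prev_loss=1.25, loss=1.0)
```

Because the two tests are joined with "or", a flat loss can end training even while the weights are still moving, which is worth keeping in mind when choosing tolerance.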

skagent.algos.maliar.simulate_forward(states_t, bellman_period, decision_function, parameters, big_t)

Simulate the model forward for a specified number of periods.

Parameters:
  • states_t (Grid | dict) – Initial state values.

  • bellman_period (BellmanPeriod) – The Bellman period containing model dynamics.

  • decision_function (Callable) – Function mapping (states, shocks, parameters) to controls.

  • parameters (dict) – Model parameters.

  • big_t (int) – Number of time periods to simulate forward. If 0, returns the initial states unchanged.

Returns:

Final state values after big_t periods.

Return type:

dict

Raises:

ValueError – If big_t < 0 or if states_t is an empty dict.
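The forward simulation amounts to applying the decision function and then the model transition big_t times. The sketch below shows that loop in plain Python with the documented error conditions; the transition callable is a hypothetical stand-in for the dynamics that BellmanPeriod provides, and shocks are omitted (passed as an empty dict) to keep the toy model deterministic.

```python
def simulate_forward_sketch(states, decision_function, transition,
                            parameters, big_t):
    """Advance `states` by `big_t` periods under a fixed decision function."""
    if big_t < 0:
        raise ValueError("big_t must be >= 0")
    if not states:
        raise ValueError("states must not be empty")
    current = dict(states)
    for _ in range(big_t):
        # Decision functions map (states, shocks, parameters) to controls.
        controls = decision_function(current, {}, parameters)
        current = transition(current, controls, parameters)
    return current


# Toy model (assumed): consume half of assets, save the rest at gross return R.
def consume_half(states, shocks, parameters):
    return {"c": 0.5 * states["a"]}


def next_assets(states, controls, parameters):
    return {"a": parameters["R"] * (states["a"] - controls["c"])}


final = simulate_forward_sketch({"a": 8.0}, consume_half, next_assets, {"R": 1.5}, 2)
unchanged = simulate_forward_sketch({"a": 8.0}, consume_half, next_assets, {"R": 1.5}, 0)
```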

Reinforcement Learning (Stable-Baselines3)

Proximal Policy Optimization (PPO) for BellmanPeriod models, via a Stable-Baselines3 backend. The agent wraps a model in a gymnasium environment (see Environments), trains PPO, and emits a standard skagent decision rule.

Stable Baselines3 wrappers for BellmanPeriod models.

Provides PPOAgent, a thin wrapper around SB3’s PPO that:

  • builds a skagent.env.GymEnv from a BellmanPeriod + initial state distribution,

  • delegates training to stable_baselines3.PPO.learn,

  • exposes a PPOAgent.decision_rule() that returns the trained policy as a skagent-style {control_sym: callable} dict — i.e. the same shape consumed by skagent.env.Environment and the rest of the skagent decision-rule API. Actions are unscaled back to real units via GymEnv.unscale_action(), so downstream code does not see the [-1, 1] SB3 representation.
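To make the unscaling step concrete, the sketch below maps an action from the [-1, 1] range used by SB3 back to real units with an affine transform. The formula is the common convention and is assumed here; the exact transform used by GymEnv.unscale_action() is not given in this section, and affine_unscale is a hypothetical name.

```python
def affine_unscale(scaled, lower, upper):
    """Map an action from [-1, 1] to [lower, upper] with an affine transform."""
    return lower + 0.5 * (scaled + 1.0) * (upper - lower)


# With bounds [0, 4]: -1 maps to the lower bound, 0 to the midpoint,
# and +1 to the upper bound.
low_end = affine_unscale(-1.0, 0.0, 4.0)
midpoint = affine_unscale(0.0, 0.0, 4.0)
high_end = affine_unscale(1.0, 0.0, 4.0)
```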

class skagent.algos.sb3.PPOAgent(bp, initial, *, max_episode_steps=200, seed=None, gym_kwargs=None, ppo_kwargs=None, policy='MlpPolicy', device='cpu', verbose=0)

Train SB3’s PPO on a BellmanPeriod and emit a skagent decision rule.

Parameters:
  • bp (BellmanPeriod) – Model definition.

  • initial (dict) – Maps arrival-state symbols to skagent Distribution objects, used by GymEnv to sample fresh initial states on reset.

  • max_episode_steps (int) – Episode horizon for the underlying GymEnv. Default 200.

  • seed (Optional[int]) – Seed for both the environment and the PPO algorithm.

  • gym_kwargs (Optional[dict]) – Extra keyword arguments forwarded to GymEnv (e.g. default_lower, default_upper, bound_clearance, control_sym).

  • ppo_kwargs (Optional[dict]) – Extra keyword arguments forwarded to stable_baselines3.PPO (e.g. n_steps, batch_size, learning_rate, n_epochs, policy_kwargs). gamma defaults to bp.calibration[bp.discount_variable] if it is a finite scalar; callers can override by passing gamma here.

  • policy (str) – SB3 policy class string. Default "MlpPolicy".

  • device (str) – Torch device for PPO. Default "cpu" (SB3’s recommended default for MlpPolicy).

  • verbose (int) – Verbosity passed to PPO. Default 0.
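The defaulting rule for gamma described under ppo_kwargs can be pictured as follows. This is a sketch of the stated behaviour rather than the library's code: resolve_ppo_kwargs is a hypothetical helper, the calibration is treated as a plain dict, and the key "DiscFac" is an invented example of a discount variable.

```python
import math


def resolve_ppo_kwargs(calibration, discount_variable, ppo_kwargs=None):
    """Fill in gamma from the calibration unless the caller supplied one."""
    kwargs = dict(ppo_kwargs or {})
    if "gamma" in kwargs:
        return kwargs  # an explicit gamma always wins
    value = calibration.get(discount_variable)
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if is_number and math.isfinite(value):
        kwargs["gamma"] = float(value)
    return kwargs


from_calibration = resolve_ppo_kwargs({"DiscFac": 0.96}, "DiscFac")
overridden = resolve_ppo_kwargs({"DiscFac": 0.96}, "DiscFac", {"gamma": 0.9})
non_finite = resolve_ppo_kwargs(
    {"DiscFac": float("inf")}, "DiscFac", {"n_steps": 256}
)
```

In the last case no gamma is set by this sketch, which would leave the choice to whatever default the PPO constructor applies.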

env

The constructed gymnasium environment.

Type:

GymEnv

model

The SB3 model. None until learn() is called the first time (constructed lazily so callers can inspect env without paying PPO’s setup cost).

Type:

stable_baselines3.PPO

decision_rule(deterministic=True)

Return a skagent decision rule that uses the trained policy.

The returned dict has the form {control_sym: callable} where the callable accepts positional arguments matching the control’s iset order (i.e. the same signature skagent Environment.step and BellmanPeriod.decision_function call). The callable’s output is a torch.Tensor of unscaled action values; the inputs may be scalars, numpy arrays, or torch tensors of compatible length.

Parameters:

deterministic (bool) – Whether to use a deterministic (mean) policy. Default True — matches typical skagent decision-rule semantics.

Return type:

dict[str, Callable]

learn(total_timesteps, callback=None, **kwargs)

Run PPO.learn. Returns self.

Every completed episode’s undiscounted reward is appended to self.episode_rewards via an internal SB3 callback; repeated learn calls accumulate. A user-supplied callback is merged with the internal one via CallbackList. Extra **kwargs forward to model.learn.

Parameters:
  • total_timesteps (int)

  • callback (Any)

  • kwargs (Any)

Return type:

PPOAgent

predict_unscaled(obs, deterministic=True)

Predict an unscaled action for obs.

obs may be a single observation (shape (|iset|,)) or a batch ((N, |iset|)). Returns a 1-D array of shape (N,).

Parameters:

deterministic (bool)

Return type:

ndarray

snapshot()

Capture the current trained policy as a frozen PolicySnapshot.

The snapshot holds an independent copy of the policy network, so it is unaffected by later learn() calls. This is the supported way to retain the policy at intermediate points during training (e.g. to compare checkpoints) without re-running training or re-implementing the unscaling logic.

Return type:

PolicySnapshot
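The reason a snapshot is unaffected by later training is that it holds a deep copy rather than a reference. The toy example below illustrates that with the standard library only; TinyPolicy is a hypothetical stand-in for a policy network and is not part of skagent or SB3.

```python
import copy


class TinyPolicy:
    """Hypothetical stand-in for a trainable policy with a single weight."""

    def __init__(self, weight):
        self.weight = weight

    def predict(self, obs):
        return self.weight * obs


policy = TinyPolicy(weight=0.5)
frozen = copy.deepcopy(policy)  # independent copy, analogous to a snapshot
policy.weight = 0.9             # stands in for further training

# The frozen copy still predicts with the old weight.
before, after = frozen.predict(2.0), policy.predict(2.0)
```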

class skagent.algos.sb3.PolicySnapshot(policy, env)

Frozen copy of a trained policy, decoupled from further training.

Returned by PPOAgent.snapshot(). Holds a deep copy of the policy network taken at snapshot time, so subsequent learn calls on the source agent do not change its predictions. Exposes the same predict_unscaled() and decision_rule() interface as PPOAgent.

The GymEnv is shared with the source agent (not copied): it is used only for stateless action unscaling, which does not depend on training state.

Parameters:

env (GymEnv)

decision_rule(deterministic=True)

Return a skagent decision rule; see PPOAgent.decision_rule().

Parameters:

deterministic (bool)

Return type:

dict[str, Callable]

predict_unscaled(obs, deterministic=True)

Predict an unscaled action for obs; see PPOAgent.predict_unscaled().

Parameters:

deterministic (bool)

Return type:

ndarray

Solvers

High-level routines that drive the neural-network training utilities to solve a model. See the Algorithms Guide for worked examples.

skagent.solver.solve_multiple_controls(control_order, bellman_period, givens, calibration, epochs=200, loss=None)

Solve a block with more than one control by training a policy network for each control in turn.

Each control is given its own skagent.ann.BlockPolicyNet. The networks are trained one at a time, in the order given by control_order, with every network treating the other networks’ current policies as fixed. A control may appear in control_order more than once to refine it after its neighbours have been updated (e.g. ["c", "d", "c"]), which is the multi-control analogue of a best-response sweep.

Currently restricted to single-period (non-recurring) reward objectives; by default the negative immediate reward (skagent.loss.StaticRewardLoss) is minimized, which is equivalent to maximizing the immediate reward.

TODO: allow a variable ‘loss function generator’ once the API has solidified.

Parameters:
  • control_order (list of str) – Control symbols, in the order they should be solved. Symbols may repeat to schedule additional refinement passes.

  • bellman_period (BellmanPeriod) – The model period whose controls are being solved.

  • givens (skagent.grid.Grid) – Grid of arrival states and shock realizations to train over.

  • calibration (dict) – Calibration parameters passed to the loss function.

  • epochs (int, optional) – Training epochs per pass. Default is 200.

  • loss (type, optional) – A loss-function class with signature loss(bellman_period, parameters, other_dr). Defaults to skagent.loss.StaticRewardLoss.

Returns:

Mapping from each control symbol to its trained decision rule.

Return type:

dict
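The scheduling idea behind control_order can be illustrated with scalars in place of networks. In the standalone sketch below each "policy" is a single number chosen from a candidate list and the objective is an invented quadratic reward; best_response_sweep is a hypothetical name, and nothing here reproduces how skagent trains its BlockPolicyNet instances.

```python
def best_response_sweep(control_order, initial, candidates, reward):
    """Update one control at a time, holding the others at current values."""
    current = dict(initial)
    for sym in control_order:
        others = {k: v for k, v in current.items() if k != sym}
        # Best response of `sym` to the other controls' current values.
        current[sym] = max(
            candidates, key=lambda x: reward(**{sym: x}, **others)
        )
    return current


def reward(c, d):
    # Invented objective: c wants to match d, and d is pulled toward 1.
    return -((c - d) ** 2) - (d - 1.0) ** 2


rules = best_response_sweep(
    ["c", "d", "c"],
    initial={"c": 0.0, "d": 0.0},
    candidates=[0.0, 0.5, 1.0, 1.5, 2.0],
    reward=reward,
)
```

The first pass leaves c at 0.0 because d has not moved yet; after d is updated to 0.5, the repeated "c" entry lets c catch up to 0.5. That is the refinement the repeated symbol is meant to schedule.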

Loss Functions

Objective functions passed to skagent.ann.train_block_nn(). The reward-based losses (StaticRewardLoss, EstimatedDiscountedLifetimeRewardLoss) solve a block directly for the non-recurring case; the equation-residual losses (BellmanEquationLoss, EulerEquationLoss) target the recurring, dynamic case. See Loss Functions for the full reference.

Neural Network Components

Net

Base neural network class with device management.

class skagent.ann.Net(n_inputs, n_outputs, width=32, n_layers=2, activation='silu', transform=None, init_seed=None, copy_weights_from=None)

Bases: Module

A flexible feedforward neural network with configurable architecture.

Parameters:
  • n_inputs (int) – Number of input features

  • n_outputs (int) – Number of output features

  • width (int, optional) – Width of hidden layers. Default is 32.

  • n_layers (int, optional) – Number of hidden layers (1-10). Default is 2.

  • activation (str, list, callable, or None, optional) –

    Activation function(s) to use. Options:

      • str: Apply the same activation to all layers (‘silu’, ‘relu’, ‘tanh’, ‘sigmoid’)

      • list: Apply different activations to each layer, e.g., [‘relu’, ‘tanh’, ‘silu’]

      • callable: Custom activation function

      • None: No activation (identity function)

    Available activations: ‘silu’, ‘relu’, ‘tanh’, ‘sigmoid’, ‘identity’. Default is ‘silu’.

  • transform (str, list, callable, or None, optional) –

    Transformation to apply to outputs. Options:

      • str: Apply the same transform to all outputs (‘sigmoid’, ‘exp’, ‘tanh’, etc.)

      • list: Apply different transforms to each output, e.g., [‘sigmoid’, ‘exp’]

      • callable: Custom transformation function

      • None: No transformation

    Available transforms: ‘sigmoid’, ‘exp’, ‘tanh’, ‘relu’, ‘softplus’, ‘softmax’, ‘abs’, ‘square’, ‘identity’. Default is None.

property device

Device property for backward compatibility.

forward(x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

BlockPolicyNet

A neural network for policy functions in dynamic programming problems.

class skagent.ann.BlockPolicyNet(bellman_period, control_sym=None, apply_open_bounds=True, width=32, **kwargs)

Bases: BellmanPeriodMixin, Net

A neural network for policy functions in dynamic programming problems.

This network wraps a Net and integrates with the BellmanPeriod interface. It automatically determines input/output dimensions from the model block specification and enforces control variable bounds.

Parameters:
  • bellman_period (BellmanPeriod) – The model Bellman Period

  • apply_open_bounds (bool, optional) – If True, then the network forward output is normalized by the upper and/or lower bounds, computed as a function of the input tensor. These bounds are “open” because output can be arbitrarily close to, but not equal to, the bounds. Default is True.

  • control_sym (string, optional) – The symbol for the control variable.

  • width (int, optional) – Width of hidden layers. Default is 32.

  • **kwargs – Additional keyword arguments passed to Net. See Net class documentation for all available options including activation, transform, n_layers, init_seed, copy_weights_from, etc.
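One common way to obtain "open" bounds is to squash the unconstrained network output with a sigmoid (two-sided bounds) or a softplus (one-sided bounds). The sketch below shows that construction for scalar inputs and constant bounds. It is an assumption about the general technique, not a description of BlockPolicyNet's forward pass, where the bounds are computed from the input tensor; the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x):
    """Numerically stable log(1 + exp(x)), always strictly positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def apply_open_bounds(raw, lower=None, upper=None):
    """Squash an unconstrained output into the open interval set by the bounds."""
    if lower is not None and upper is not None:
        return lower + (upper - lower) * sigmoid(raw)
    if lower is not None:
        return lower + softplus(raw)
    if upper is not None:
        return upper - softplus(raw)
    return raw  # no bounds: leave the output untouched


two_sided = [apply_open_bounds(raw, lower=0.0, upper=2.0) for raw in (-5.0, 0.0, 5.0)]
lower_only = apply_open_bounds(-5.0, lower=0.0)
upper_only = apply_open_bounds(5.0, upper=3.0)
```

For moderate inputs the outputs stay strictly inside the bounds; in floating point a very large raw value can still round to the bound itself.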

decision_function(states_t, shocks_t, parameters)

A decision function, from states, shocks, and parameters, to control variable values.

Parameters:
  • states_t (dict) – symbols : values

  • shocks_t (dict) – symbols: values

  • parameters (dict) – symbols : values

Returns:

  • decisions (dict) – symbols : values

forward(x)

Note that this uses the same architecture as the superclass but adds a normalization layer appropriate to the bounds of the decision rule.

get_core_function(length=None)
get_decision_rule(length=None)

Returns the decision rule corresponding to this neural network.

BlockValueNet

A neural network for value functions in dynamic programming problems.

class skagent.ann.BlockValueNet(bellman_period, control_sym=None, width=32, **kwargs)

Bases: BellmanPeriodMixin, Net

Standalone value-function network for a Bellman problem.

Maps a control’s information set (the same pre-decision states a policy network sees) to a single unconstrained scalar value. It is the value-only counterpart of BlockPolicyNet, kept for algorithms that approximate a value function separately from the policy. The Maliar/MMW path in this package uses the shared-backbone BlockPolicyValueNet instead, so BlockValueNet is not wired into maliar_training_loop().

Parameters:
  • bellman_period (BellmanPeriod) – The model Bellman period.

  • control_sym (str, optional) – Control whose information set defines the value function’s domain. Defaults to the first control.

  • width (int) – Width of hidden layers. Default 32.

  • **kwargs – Passed to Net (activation, n_layers, init_seed, etc.).

get_core_function(length=None)

Return the value function (the trainable core for this net).

get_value_function()

Return a callable (states, shocks, parameters) -> value tensor.

value_function(states_t, shocks_t=None, parameters=None)

Evaluate the value function at the control’s information set.

Arrival states_t (with shocks_t and parameters) are mapped to the control’s information set via compute_pre_state(), mirroring BlockPolicyNet.decision_function(), then the network is evaluated.

Returns:

Flattened value estimates, one per input row.

Return type:

torch.Tensor

BlockPolicyValueNet

A shared-backbone neural network that jointly represents the policy and value functions.

class skagent.ann.BlockPolicyValueNet(bellman_period, control_sym=None, apply_open_bounds=True, width=32, **kwargs)

Bases: BellmanPeriodMixin, Net

Single neural network with shared backbone for both policy and value.

Architecture: shared hidden layers → two output heads:

  • Policy head: bounded output (sigmoid-scaled to satisfy constraints)

  • Value head: unconstrained scalar output

Sharing the backbone means one optimizer updates all weights simultaneously, and the value head anchors the control level that first-order-condition-only training (e.g. an Euler residual loss) cannot identify.

Parameters:
  • bellman_period (BellmanPeriod) – The model Bellman Period.

  • control_sym (str, optional) – Control variable symbol. Defaults to first control.

  • apply_open_bounds (bool, optional) – Apply sigmoid/softplus scaling to the policy head. Default True.

  • width (int, optional) – Width of hidden layers. Default 32.

  • **kwargs – Passed to Net (activation, n_layers, init_seed, etc.).

decision_function(states_t, shocks_t, parameters)

Map states, shocks, and parameters to a controls dict.

Parameters:
  • states_t (dict) – Arrival state values, symbol -> tensor.

  • shocks_t (dict or None) – Shock values, symbol -> tensor (None is treated as {}).

  • parameters (dict) – Model parameters, symbol -> value.

Returns:

{control_sym: tensor} of policy-head outputs. The arrival states are mapped to the control’s information set via compute_pre_state() before the network is evaluated.

Return type:

dict

forward(x)

Run shared backbone, then policy head (bounded) + value head.

Returns the (policy, value) pair. The policy tensor is scaled into the control’s open bounds; the value tensor is unconstrained. Both have shape (n, 1).

Return type:

tuple[Tensor, Tensor]

get_core_function(length=None)

Return decision rules (policy head) for use with train_block_nn.

get_decision_rule(length=None)

Decision rule returning only the policy output.

get_policy_and_value_functions(length=None)

Return both policy decision rules and value function.

get_value_function()
value_function(states_t, shocks_t=None, parameters=None)

Evaluate the value head at the control’s information set.

The input domain mirrors decision_function(): arrival states_t (with shocks_t and parameters) are mapped to the control’s information set via compute_pre_state(), then the shared backbone’s value head is evaluated on that pre-decision representation.

Parameters:
  • states_t (dict) – Arrival state values, symbol -> tensor.

  • shocks_t (dict or None, optional) – Shock values (None is treated as {}).

  • parameters (dict or None, optional) – Model parameters.

Returns:

Flattened value estimates, one per input row.

Return type:

torch.Tensor

Training Functions

skagent.ann.train_block_nn(block_policy_nn, inputs, loss_function, epochs=50, lr=0.01, optimizer=None, grad_clip=1.0, verbose=True)

Train a policy network by minimizing a loss function over a grid.

This is a generic stochastic-gradient-descent driver, not a solution algorithm in itself. It runs epochs Adam updates that minimize whatever loss_function is supplied, evaluated on a single, fixed grid of inputs; it is agnostic to where that grid came from or which method the loss encodes (Euler residual, Bellman residual, FOC, or a custom loss).

Because it trains on whatever inputs it is given, accuracy depends on the caller re-sampling those states across calls: Maliar, Maliar, and Winant (2021) keep the training data “constantly re-sampled,” and minimizing on a single fixed grid instead lets the solution over-fit those points while drifting elsewhere. Re-draw inputs each call (threading the returned optimizer back in to keep Adam’s momentum), or use maliar_training_loop(), which wraps this driver in the full MMW’21 outer loop: it alternates these inner SGD updates with a forward-simulation step that refreshes the training states toward the model’s ergodic set.

Parameters:
  • block_policy_nn (BlockPolicyNet or BlockPolicyValueNet) – The network to train. Its get_core_function supplies the decision rule(s) the loss is evaluated against.

  • inputs (Grid) – Input grid containing states and shocks.

  • loss_function (Callable) – Loss function (decision_function, input_grid) -> loss_tensor.

  • epochs (int) – Number of training epochs (default 50).

  • lr (float) – Learning rate for Adam optimizer (default 0.01).

  • optimizer (Optional[Optimizer]) – Pre-existing optimizer to reuse (preserves momentum across calls). If None, a new Adam optimizer is created.

  • grad_clip (Optional[float]) – Maximum gradient norm for clipping (default 1.0). Set to None to disable.

  • verbose (bool) – Emit a logging.info message with the loss every 100 epochs (default True). Configure the root logger to suppress these.

Returns:

(trained_network, final_loss, optimizer). The optimizer is the one passed in, or the Adam instance created internally when none was supplied. It is always returned so that callers can warm-start a later call by threading it back in.

Return type:

tuple
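The recommended calling pattern, re-drawing inputs on each call while threading the returned optimizer back in, can be sketched as a small wrapper. To keep the example runnable without the library, the trainer is passed in as train_fn (in real use this would be skagent.ann.train_block_nn) and a stand-in trainer is used for the demonstration. The names train_with_resampling, draw_inputs, and fake_train are hypothetical.

```python
def train_with_resampling(train_fn, network, draw_inputs, loss_function,
                          n_rounds=3, epochs=50):
    """Call a train_block_nn-style trainer repeatedly on fresh inputs."""
    optimizer = None  # first call lets the trainer create its own optimizer
    losses = []
    for _ in range(n_rounds):
        inputs = draw_inputs()  # fresh training states each round
        network, final_loss, optimizer = train_fn(
            network, inputs, loss_function, epochs=epochs, optimizer=optimizer
        )
        losses.append(final_loss)
    return network, losses, optimizer


# Stand-in trainer used only to demonstrate the call pattern; it records the
# optimizer it receives and mimics the (network, loss, optimizer) return value.
seen_optimizers = []


def fake_train(network, inputs, loss_function, epochs=50, optimizer=None):
    seen_optimizers.append(optimizer)
    optimizer = optimizer or object()
    return network, loss_function(network, inputs), optimizer


_, losses, opt = train_with_resampling(
    fake_train,
    network="net",
    draw_inputs=lambda: [1.0, 2.0],
    loss_function=lambda net, grid: sum(grid),
    n_rounds=3,
)
```

Only the first call sees optimizer=None; every later call reuses the same optimizer object, which is what preserves Adam's moment estimates across rounds.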

skagent.ann.aggregate_net_loss(inputs, df, loss_function)

Compute a loss function over a tensor of inputs, given a decision function df. Return the mean.

Parameters:

inputs (Grid)

Grid and Computational Tools

Grid Class

class skagent.grid.Grid(labels, values, torched=True)

Bases: object

A class representing a labeled grid of numerical values.

Parameters:

config (dict) – Dictionary with the following keys: “min” (float): the minimum value for the variable; “max” (float): the maximum value for the variable; “count” (int): the number of points to generate for the variable.

classmethod from_config(config={}, torched=True)
classmethod from_dict(kv={}, torched=False)
len()

Returns the number of columns, similar to a dict.

n()

Returns the number of values for each symbol

shape()

Returns the shape of the grid values.

to_dict()

Returns a data structure, key: column, similar to tensordict or structured array.

torch()
update_from_dict(kv)

Grid Utility Functions

skagent.grid.make_grid(config)

Make a ‘grid’ of values based on the provided configuration.

Parameters:

config (dict) – Dictionary with the following keys: “min” (float): the minimum value for the variable; “max” (float): the maximum value for the variable; “count” (int): the number of points to generate for the variable.

Returns:

A NumPy array of shape (product_of_counts, num_variables), where product_of_counts is the product of all count values in the config dictionary, and num_variables is the number of keys in the config.

Return type:

numpy.ndarray
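The shape described above follows from taking evenly spaced points per variable and then forming their Cartesian product. The standalone sketch below reproduces that with the standard library, returning a list of rows instead of a NumPy array. It assumes that config maps each variable name to a dict with "min", "max", and "count" keys; the helper names are hypothetical, and the row ordering is that of itertools.product, which may differ from the library's.

```python
import itertools


def linspace(lo, hi, count):
    """Return `count` evenly spaced points from lo to hi, inclusive."""
    if count == 1:
        return [float(lo)]
    step = (hi - lo) / (count - 1)
    return [lo + i * step for i in range(count)]


def make_grid_sketch(config):
    """Build rows of a grid: one row per combination, one column per variable."""
    axes = [
        linspace(spec["min"], spec["max"], spec["count"])
        for spec in config.values()
    ]
    return [list(point) for point in itertools.product(*axes)]


grid = make_grid_sketch({
    "a": {"min": 0.0, "max": 1.0, "count": 3},
    "p": {"min": 1.0, "max": 2.0, "count": 2},
})
```

Here the counts are 3 and 2, so the grid has 3 × 2 = 6 rows and 2 columns.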

skagent.grid.cartesian_product(*arrays)

Create a Cartesian product of input arrays.

Parameters:

*arrays – Variable length arrays to compute product

Returns:

Array of shape (product_of_lengths, num_arrays), where product_of_lengths is the product of the lengths of the input arrays and num_arrays is the number of input arrays. Each row contains one element of the Cartesian product.