diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index 715c9c3..26739a2 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.9.4","generation_timestamp":"2023-12-20T02:01:15","documenter_version":"1.2.1"}} \ No newline at end of file +{"documenter":{"julia_version":"1.9.4","generation_timestamp":"2023-12-20T02:16:25","documenter_version":"1.2.1"}} \ No newline at end of file diff --git a/dev/index.html b/dev/index.html index 021235a..3348bb6 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,5 +1,5 @@ -MDPs.jl: Markov Decision Processes · MDPs.jl

MDPs.jl: Markov Decision Processes

Models

This section describes the data structures that can be used to model various types of MDPs.

MDP

This is a general MDP data structure that supports basic functions. See IntMDP and TabMDP below for more models that can be used more directly to model and solve.

MDPs.MDPType

A general MDP representation with time-independent transition probabilities and rewards. The model makes no assumption that the states can be efficiently enumerated, but assumes that there is a small number of actions.

S: state type
A: action type

source
MDPs.getnextMethod
getnext(model, s, a)

Compute next states using transition function.

Returns an object that can return a NamedTuple with states, probabilities, and transitions as AbstractArrays. This is a more-efficient version of transition (when supported).

The standard implementation is not memory efficient.

source
MDPs.transitionFunction
(sn, p, r) ∈ transition(model, s, a)

Return a list with next states, probabilities, and rewards. Returns an iterator.

Use getnext instead, which is more efficient and convenient to use.

source
MDPs.valuefunctionFunction
valuefunction(mdp, state, valuefunction)

Evaluates the value function for an MDP in a state

source

Tabular MDPs

This is an MDP instance that assumes that the states and actions are tabular.

MDPs.TabMDPType

An abstract tabular Markov Decision Process, time independent.

Default interpretation

  • State: Positive integer (>0) is non-terminal, zero or negative integer is terminal
  • Action: Positive integer, anything else is invalid

Functions that should be defined for any subtype for value and policy iterations to work are: state_count, action_count, transition

The methods state_count and states should only include non-terminal states

source
MDPs.transformMethod
transform(T::DataFrame, model)

Convert a tabular MDP to a data frame representation

source

Integral MDPs

This is a specific MDP instance in which states and actions are specified by integers.

MDPs.IntActionType

Represents transitions that follow an action. The lengths of nextstate, probability, and reward must be the same.

Nextstate may not be unique and each transition can have a different reward associated with it. The transitions are not aggregated to allow for computing the risk of a transition. Aggregating the values by state would change the risk value of the transition.

source
MDPs.IntMDPType

MDP with integral states and stationary transitions. State and action indexes are all 1-based integers.

source
MDPs.compressMethod
compress(nextstate, probability, reward)

The command combines multiple transitions to the same state into a single transition. The reward is computed as a weighted average of the individual rewards, assuming an expected-reward objective.

source
MDPs.load_mdpMethod
load_mdp(input, idoutcome)

Load the MDP from input. The function assumes 0-based indexes of states and actions, which are transformed to 1-based indexes.

Input formats are anything that is supported by DataFrame. Some options are CSV.File(...) or Arrow.Table(...).

States that have no transition probabilities defined are assumed to be terminal and are set to transition to themselves.

If docombine is true then the method combines transitions that have the same statefrom, action, stateto. This makes risk-neutral value iteration faster, but may change the value of a risk-averse solution.

The formulation allows for multiple transitions s,a → s'. When this is the case, the transition probability is assumed to be their sum and the reward is the weighted average of the rewards.

The method can also process CSV files for MDPO/MMDP, in which case idoutcome specifies a 1-based outcome to load.

Examples

Load the model from a CSV

using CSV: File
+MDPs.jl: Markov Decision Processes · MDPs.jl

MDPs.jl: Markov Decision Processes

Models

This section describes the data structures that can be used to model various types of MDPs.

MDP

This is a general MDP data structure that supports basic functions. See IntMDP and TabMDP below for more models that can be used more directly to model and solve.

MDPs.MDPType

A general MDP representation with time-independent transition probabilities and rewards. The model makes no assumption that the states can be efficiently enumerated, but assumes that there is a small number of actions.

S: state type
A: action type

source
MDPs.getnextMethod
getnext(model, s, a)

Compute next states using transition function.

Returns an object that can return a NamedTuple with states, probabilities, and transitions as AbstractArrays. This is a more-efficient version of transition (when supported).

The standard implementation is not memory efficient.

source
MDPs.transitionFunction
(sn, p, r) ∈ transition(model, s, a)

Return a list with next states, probabilities, and rewards. Returns an iterator.

Use getnext instead, which is more efficient and convenient to use.

source
MDPs.valuefunctionFunction
valuefunction(mdp, state, valuefunction)

Evaluates the value function for an MDP in a state

source

Tabular MDPs

This is an MDP instance that assumes that the states and actions are tabular.

MDPs.TabMDPType

An abstract tabular Markov Decision Process, time independent.

Default interpretation

  • State: Positive integer (>0) is non-terminal, zero or negative integer is terminal
  • Action: Positive integer, anything else is invalid

Functions that should be defined for any subtype for value and policy iterations to work are: state_count, action_count, transition

The methods state_count and states should only include non-terminal states

source
MDPs.transformMethod
transform(T::DataFrame, model)

Convert a tabular MDP to a data frame representation

source

Integral MDPs

This is a specific MDP instance in which states and actions are specified by integers.

MDPs.IntActionType

Represents transitions that follow an action. The lengths of nextstate, probability, and reward must be the same.

Nextstate may not be unique and each transition can have a different reward associated with it. The transitions are not aggregated to allow for computing the risk of a transition. Aggregating the values by state would change the risk value of the transition.

source
MDPs.IntMDPType

MDP with integral states and stationary transitions. State and action indexes are all 1-based integers.

source
MDPs.compressMethod
compress(nextstate, probability, reward)

The command combines multiple transitions to the same state into a single transition. The reward is computed as a weighted average of the individual rewards, assuming an expected-reward objective.
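The aggregation described above can be illustrated with a minimal stand-alone sketch. The helper name compress_sketch, its return layout, and the handling of zero-probability states are assumptions made for illustration only; this is not the MDPs.jl implementation.

```julia
# Hypothetical helper (not part of MDPs.jl): merge transitions that share a next state.
function compress_sketch(nextstate::Vector{Int}, probability::Vector{Float64},
                         reward::Vector{Float64})
    psum = Dict{Int,Float64}()    # total probability per next state
    prsum = Dict{Int,Float64}()   # probability-weighted reward sum per next state
    for (s, p, r) in zip(nextstate, probability, reward)
        psum[s] = get(psum, s, 0.0) + p
        prsum[s] = get(prsum, s, 0.0) + p * r
    end
    states = sort!(collect(keys(psum)))
    probs = [psum[s] for s in states]
    # Weighted average of rewards; assumed convention: reward 0.0 when the
    # total probability is zero, to avoid dividing by zero.
    rewards = [psum[s] > 0 ? prsum[s] / psum[s] : 0.0 for s in states]
    return states, probs, rewards
end

# Two transitions lead to state 2 and are merged into one.
states, probs, rewards = compress_sketch([2, 3, 2], [0.2, 0.5, 0.3], [1.0, 0.0, 2.0])
```

In this example, state 2 receives probability 0.2 + 0.3 = 0.5 and reward (0.2 ⋅ 1 + 0.3 ⋅ 2) / 0.5 = 1.6. The expected reward of the action is unchanged, but the distribution of rewards is not, which is why a risk-averse solution may differ.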

source
MDPs.load_mdpMethod
load_mdp(input, idoutcome)

Load the MDP from input. The function assumes 0-based indexes of states and actions, which are transformed to 1-based indexes.

Input formats are anything that is supported by DataFrame. Some options are CSV.File(...) or Arrow.Table(...).

States that have no transition probabilities defined are assumed to be terminal and are set to transition to themselves.

If docombine is true then the method combines transitions that have the same statefrom, action, stateto. This makes risk-neutral value iteration faster, but may change the value of a risk-averse solution.

The formulation allows for multiple transitions s,a → s'. When this is the case, the transition probability is assumed to be their sum and the reward is the weighted average of the rewards.

The method can also process CSV files for MDPO/MMDP, in which case idoutcome specifies a 1-based outcome to load.

Examples

Load the model from a CSV

using CSV: File
 using MDPs
 filepath = joinpath(dirname(pathof(MDPs)), "..",
                     "data", "riverswim.csv")
@@ -14,4 +14,4 @@
 state_count(model)
 
 # output
-20
source
MDPs.make_int_mdpMethod
make_int_mdp(mdp::TabMDP, docompress = false)

Transform any tabular MDP mdp to a numeric one. This helps to accelerate operations and value function computation. The actions are also turned into 1-based integer values.

The option docompress combines transitions to the same state into a single transition. This improves efficiency in risk-neutral settings, but may change the outcome in risk-averse settings.

The function adds one more state at the end, which represents a catch-all terminal state.

source
MDPs.make_int_mdpMethod
make_int_mdp(Ps, rs)

Build IntMDP from a list of transition probabilities Ps and reward vectors rs for each action in the MDP. Each row of the transition matrix represents the probabilities of transitioning to next states.

source

Objectives

MDPs.FiniteHType

Finite-horizon discounted model. The discount factor γ can be in [0,1]. The optimal policy is Markov but time dependent.

source
MDPs.InfiniteHType

Infinite-horizon discounted objective. The discount factor γ can be in [0,1]. The optimal policy is stationary.

source
MDPs.MarkovType

Objective solved by a randomized Markov non-stationary policy. In other words, the solution is time-dependent.

source
MDPs.MarkovDetType

Objective solved by a deterministic Markov non-stationary policy. In other words, the solution is time-dependent.

source

Algorithms

MDPs.value_iterationFunction
value_iteration(model, objective; [v_terminal, iterations = 1000, ϵ = 1e-3] )

Compute value function and policy for a tabular MDP model with an objective objective. The time steps go from 1 to T+1, the last decision happens at time T.

The supported objectives are FiniteH and InfiniteH. When provided with a real number γ ∈ [0,1], the objective is treated as an infinite-horizon problem.

Finite Horizon

Use finite-horizon value iteration for a tabular MDP model with a discount factor γ and horizon T (time steps 1 to T+1); the last decision happens at time T. Returns a vector of value functions, one for each time step.

The argument v_terminal represents the terminal value function. It should be provided as a function that maps the state id to its terminal value (at time T+1). If this value is provided, then it is used in place of 0.

Infinite Horizon

For a Bellman error ϵ, the computed value function is guaranteed to be within ϵ ⋅ γ / (1 - γ) of the optimal value function (all in terms of the L_∞ norm).

The computation of the value function is parallelized when parallel is true. This is also known as a Jacobi type of value iteration (as opposed to Gauss-Seidel).

Note that for the purpose of the greedy policy, minimizing the span seminorm is more efficient, but the goal of this function is also to compute the value function.

The time steps go from 1 to T+1.

source
MDPs.value_iteration!Function
value_iteration!(v, π, model, objective; [v_terminal] )

Run value iteration using the provided v and π storage for the value function and the policy. See value_iteration for more details.

Only supports the FiniteH objective.

source
MDPs.mrp!Method
mrp!(P_π, r_π, model, π)

Save the transition matrix P_π and reward vector r_π for the MDP model and policy π. Also supports terminal states.

Does not support duplicate entries in transition probabilities.

source
MDPs.mrpMethod
mrp(model, π)

Compute the transition matrix P_π and reward vector r_π for the MDP model and policy π. See mrp! for more details.

source
MDPs.mrp_sparseMethod
mrp_sparse(model, π)

Compute a sparse transition matrix P_π and reward vector r_π for the MDP model and policy π.

This function does not support duplicate entries in transition probabilities.

source
MDPs.policy_iterationMethod
policy_iteration(model, γ; [iterations=1000])

Implements policy iteration for MDP model with a discount factor γ. The algorithm runs until the policy stops changing or the number of iterations is reached.

Does not support duplicate entries in transition probabilities.

source
MDPs.policy_iteration_sparseMethod
policy_iteration_sparse(model, γ; iterations)

Implements policy iteration for MDP model with a discount factor γ. The algorithm runs until the policy stops changing or the number of iterations is reached. The value function is computed using sparse linear algebra.

Does not support duplicate entries in transition probabilities.

source

Value Function Manipulation

MDPs.make_valueMethod
make_value(model, objective)

Creates an undefined policy and value function for the model and objective.

See Also

value_iteration!

source
MDPs.bellmanMethod
bellman(model, γ, s, v)

Compute the Bellman operator for state s, and value function v assuming an objective obj.

A real-valued objective obj is interpreted as a discount factor.

source
MDPs.bellmangreedyMethod
bellmangreedy(model, obj, s, v)

Compute the Bellman operator and greedy action for state s, and value function v assuming an objective obj.

The function uses qvalue to compute the Bellman operator and the greedy policy.

source
MDPs.greedy!Method
greedy!(π, model, obj, v)

Update policy π with the greedy policy for value function v and MDP model and an objective obj.

source
MDPs.greedyMethod
greedy(model, obj, v)

Compute the greedy action for all states and value function v assuming an objective obj.

source
MDPs.greedyMethod
greedy(model, obj, [s,] v)

Compute the greedy action for state s and value function v assuming an objective obj.

If s is not provided, then the greedy action is computed for all states. The model must support the states function.

source
MDPs.qvalueMethod
qvalue(model, γ, s, a, v)

Compute the state-action-values for state s, action a, and value function v for a discount factor γ.

This function is just a more efficient version of the standard definition.

source
MDPs.qvalueMethod
qvalue(model, objective, s, a, v)

Compute the state-action-values for state s, action a, and value function v for an objective.

There is no set representation for the value function.

source
MDPs.qvalues!Method
qvalues!(qvalues, model, objective, s, v)

Compute the state-action-values for state s, and value function v for the objective.

Saves the values to qvalues, which should be at least as long as the number of actions. Values of elements in qvalues that are beyond the action count are set to -Inf.

See qvalues for more information.

source
MDPs.qvaluesMethod
qvalues(model, objective, s, v)

Compute the state-action-value for state s, and value function v for objective. There is no set representation of the value function v.

The function is tractable only if there are a small number of actions and transitions.

source

Simulation

MDPs.PolicyType

Defines a policy, whether stationary deterministic, randomized, Markov, or even history-dependent. The policy should support the functions make_internal and append_history, which initialize and update the internal state. The function take_action then chooses an action to take.

It is important that each tracker keeps its own internal state in order to be thread safe.

source
MDPs.TabPolicyMDType

Markov deterministic policy for tabular MDPs. The policy π has an outer array over time steps and an inner array over states.

source
MDPs.TransitionType

Information about a transition from state to nstate after taking an action. time is the time at which nstate is observed.

source
MDPs.append_historyFunction
append_history(policy, internal, transition) :: internal

Update the internal state for a policy by the transition information.

source
MDPs.cumulativeMethod
cumulative(rewards, γ)

Computes the cumulative return from rewards returned by the simulation function.

source
MDPs.make_internalFunction
make_internal(model, policy, state) -> internal

Initialize the internal state for a policy with the initial state. Returns the initial state.

source
MDPs.simulateMethod
simulate(model, π, initial, horizon, episodes; [stationary = true])

Simulate a policy π in a model and generate states and actions for the horizon decisions and episodes episodes. The initial state is initial.

The policy π can be a function, an array, or an array of arrays, depending on whether the policy is stationary, Markovian, deterministic, or randomized. When the policy is provided as a function, then the parameter stationary is used.

There are horizon+1 states generated in every episode including the terminal state at T+1.

The function requires that each state and action transition to a reasonably small number of next states.

See Also

cumulative to compute the cumulative rewards

source
MDPs.take_actionFunction
take_action(policy, internal, state) -> action

Return which action to take with the internal state and the MDP state state.

source

Domains

MDPs.Domains.Gambler.RuinType

Gambler's ruin. Can decide how much to bet at any point in time. With some probability p, the bet is doubled, and with 1-p it is lost. The reward is 1 if it achieves some terminal capital and 0 otherwise.

Capital = state - 1
Bet = action - 1

Available actions are 1, ..., state - 1.

Special states: state=1 is broke and state=max_capital+1 is a terminal winning state.

source
MDPs.transitionMethod
transition(params, stock, order, demand)

Update the inventory value and compute the profit.

Starting with stock items, order items arrive, after which demand items are sold. The sale price is collected even if the demand is backlogged (but not beyond the backlog level). Negative stock means backlog.

Stocking costs are assessed after all the orders are fulfilled.

Causes an error when the order is too large, but no error when the demand cannot be satisfied or backlogged.

source
MDPs.Domains.Machine.ReplacementType

Standard machine replacement simulator. See Figure 3 in Delage 2009 for details.

States are:

  • 1: repair 1
  • 2: repair 2
  • 3 - 10: utility state

Actions:

  • 1: Do nothing
  • 2: Repair

source
+20
source
MDPs.make_int_mdpMethod
make_int_mdp(mdp::TabMDP, docompress = false)

Transform any tabular MDP mdp to a numeric one. This helps to accelerate operations and value function computation. The actions are also turned into 1-based integer values.

The option docompress combines transitions to the same state into a single transition. This improves efficiency in risk-neutral settings, but may change the outcome in risk-averse settings.

The function adds one more state at the end, which represents a catch-all terminal state.

source
MDPs.make_int_mdpMethod
make_int_mdp(Ps, rs)

Build IntMDP from a list of transition probabilities Ps and reward vectors rs for each action in the MDP. Each row of the transition matrix represents the probabilities of transitioning to next states.

source

Objectives

MDPs.FiniteHType

Finite-horizon discounted model. The discount factor γ can be in [0,1]. The optimal policy is Markov but time dependent.

source
MDPs.InfiniteHType

Infinite-horizon discounted objective. The discount factor γ can be in [0,1]. The optimal policy is stationary.

source
MDPs.MarkovType

Objective solved by a randomized Markov non-stationary policy. In other words, the solution is time-dependent.

source
MDPs.MarkovDetType

Objective solved by a deterministic Markov non-stationary policy. In other words, the solution is time-dependent.

source

Algorithms

MDPs.value_iterationFunction
value_iteration(model, objective; [v_terminal, iterations = 1000, ϵ = 1e-3] )

Compute value function and policy for a tabular MDP model with an objective objective. The time steps go from 1 to T+1, the last decision happens at time T.

The supported objectives are FiniteH and InfiniteH. When provided with a real number γ ∈ [0,1], the objective is treated as an infinite-horizon problem.

Finite Horizon

Use finite-horizon value iteration for a tabular MDP model with a discount factor γ and horizon T (time steps 1 to T+1); the last decision happens at time T. Returns a vector of value functions, one for each time step.

The argument v_terminal represents the terminal value function. It should be provided as a function that maps the state id to its terminal value (at time T+1). If this value is provided, then it is used in place of 0.
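The finite-horizon recursion can be sketched as plain backward induction. The function name finite_horizon_sketch and the data layout (one dense transition matrix P[a] and one reward vector r[a] per action) are assumptions chosen to keep the example self-contained; the package itself works on TabMDP models and may be organized differently.

```julia
# Hypothetical stand-alone backward induction (not the MDPs.jl implementation).
# P[a] is an n×n transition matrix and r[a] a length-n reward vector for action a.
function finite_horizon_sketch(P, r, γ, T; v_terminal = s -> 0.0)
    n = length(r[1])
    v = [zeros(n) for _ in 1:T+1]            # value functions for t = 1, ..., T+1
    policy = [zeros(Int, n) for _ in 1:T]    # decisions for t = 1, ..., T
    v[T+1] .= v_terminal.(1:n)               # terminal values at time T+1
    for t in T:-1:1, s in 1:n
        # q-value of each action: reward plus discounted expected next value
        qs = [r[a][s] + γ * sum(P[a][s, :] .* v[t+1]) for a in eachindex(P)]
        v[t][s], policy[t][s] = findmax(qs)
    end
    return v, policy
end

# Two states, two actions: action 1 stays (reward 1 in state 2), action 2 swaps states.
P = [[1.0 0.0; 0.0 1.0], [0.0 1.0; 1.0 0.0]]
r = [[0.0, 1.0], [0.0, 0.0]]
v, policy = finite_horizon_sketch(P, r, 1.0, 2)
```

With horizon T = 2 the sketch returns three value functions (times 1 to 3) and two decision rules (times 1 and 2). In state 1 at time 1 it is optimal to swap to state 2 in order to collect the reward at time 2.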

Infinite Horizon

For a Bellman error ϵ, the computed value function is guaranteed to be within ϵ ⋅ γ / (1 - γ) of the optimal value function (all in terms of the L_∞ norm).
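A short derivation of a bound of this form, under the assumptions that γ < 1, that the Bellman error means ϵ ≥ ‖v_{k+1} − v_k‖_∞ with v_{k+1} = T v_k, and that T is the Bellman operator (a γ-contraction in the L_∞ norm with fixed point v*). These symbols are introduced here for illustration and do not appear in the docstring.

```latex
\begin{align*}
\|v_{k+1} - v^\star\|_\infty
  &= \|T v_k - T v^\star\|_\infty
   \le \gamma \, \|v_k - v^\star\|_\infty \\
  &\le \gamma \left( \|v_k - v_{k+1}\|_\infty + \|v_{k+1} - v^\star\|_\infty \right) \\
\Longrightarrow \quad
(1 - \gamma) \, \|v_{k+1} - v^\star\|_\infty
  &\le \gamma \, \|v_{k+1} - v_k\|_\infty \le \gamma \, \epsilon \\
\Longrightarrow \quad
\|v_{k+1} - v^\star\|_\infty
  &\le \frac{\epsilon \, \gamma}{1 - \gamma}
\end{align*}
```

The bound is vacuous at γ = 1, where the division by 1 − γ is undefined.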

The computation of the value function is parallelized when parallel is true. This is also known as a Jacobi type of value iteration (as opposed to Gauss-Seidel).

Note that for the purpose of the greedy policy, minimizing the span seminorm is more efficient, but the goal of this function is also to compute the value function.

The time steps go from 1 to T+1.

source
MDPs.value_iteration!Function
value_iteration!(v, π, model, objective; [v_terminal] )

Run value iteration using the provided v and π storage for the value function and the policy. See value_iteration for more details.

Only supports the FiniteH objective.

source
MDPs.mrp!Method
mrp!(P_π, r_π, model, π)

Save the transition matrix P_π and reward vector r_π for the MDP model and policy π. Also supports terminal states.

Does not support duplicate entries in transition probabilities.

source
MDPs.mrpMethod
mrp(model, π)

Compute the transition matrix P_π and reward vector r_π for the MDP model and policy π. See mrp! for more details.

source
MDPs.mrp_sparseMethod
mrp_sparse(model, π)

Compute a sparse transition matrix P_π and reward vector r_π for the MDP model and policy π.

This function does not support duplicate entries in transition probabilities.

source
MDPs.policy_iterationMethod
policy_iteration(model, γ; [iterations=1000])

Implements policy iteration for MDP model with a discount factor γ. The algorithm runs until the policy stops changing or the number of iterations is reached.

Does not support duplicate entries in transition probabilities.

source
MDPs.policy_iteration_sparseMethod
policy_iteration_sparse(model, γ; iterations)

Implements policy iteration for MDP model with a discount factor γ. The algorithm runs until the policy stops changing or the number of iterations is reached. The value function is computed using sparse linear algebra.

Does not support duplicate entries in transition probabilities.

source

Value Function Manipulation

MDPs.make_valueMethod
make_value(model, objective)

Creates an undefined policy and value function for the model and objective.

See Also

value_iteration!

source
MDPs.bellmanMethod
bellman(model, γ, s, v)

Compute the Bellman operator for state s, and value function v assuming an objective obj.

A real-valued objective obj is interpreted as a discount factor.

source
MDPs.bellmangreedyMethod
bellmangreedy(model, obj, s, v)

Compute the Bellman operator and greedy action for state s, and value function v assuming an objective obj.

The function uses qvalue to compute the Bellman operator and the greedy policy.

source
MDPs.greedy!Method
greedy!(π, model, obj, v)

Update policy π with the greedy policy for value function v and MDP model and an objective obj.

source
MDPs.greedyMethod
greedy(model, obj, v)

Compute the greedy action for all states and value function v assuming an objective obj.

source
MDPs.greedyMethod
greedy(model, obj, [s,] v)

Compute the greedy action for state s and value function v assuming an objective obj.

If s is not provided, then the greedy action is computed for all states. The model must support the states function.

source
MDPs.qvalueMethod
qvalue(model, γ, s, a, v)

Compute the state-action-values for state s, action a, and value function v for a discount factor γ.

This function is just a more efficient version of the standard definition.
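The standard definition referred to here is presumably the expected one-step reward plus the discounted value of the next state. A minimal sketch under that assumption follows; qvalue_sketch and the per-action tuple layout are hypothetical and work on raw transition vectors rather than on an MDP model.

```julia
# Hypothetical stand-alone q-value: sum over next states of p * (r + γ * v[s']).
function qvalue_sketch(nextstate, probability, reward, γ, v)
    q = 0.0
    for (sn, p, r) in zip(nextstate, probability, reward)
        q += p * (r + γ * v[sn])
    end
    return q
end

# Transitions of one state, one tuple (nextstate, probability, reward) per action.
transitions = [([1, 2], [0.5, 0.5], [0.0, 1.0]),   # action 1
               ([2], [1.0], [0.5])]                # action 2
v = [0.0, 2.0]
qs = [qvalue_sketch(ns, p, r, 0.9, v) for (ns, p, r) in transitions]

# Bellman update for this state: the largest q-value and the action attaining it.
bestvalue, bestaction = findmax(qs)
```

Here action 1 has q-value 0.5 ⋅ 0 + 0.5 ⋅ (1 + 0.9 ⋅ 2) = 1.4 and action 2 has 0.5 + 0.9 ⋅ 2 = 2.3, so the greedy action is 2.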

source
MDPs.qvalueMethod
qvalue(model, objective, s, a, v)

Compute the state-action-values for state s, action a, and value function v for an objective.

There is no set representation for the value function.

source
MDPs.qvalues!Method
qvalues!(qvalues, model, objective, s, v)

Compute the state-action-values for state s, and value function v for the objective.

Saves the values to qvalues, which should be at least as long as the number of actions. Values of elements in qvalues that are beyond the action count are set to -Inf.

See qvalues for more information.

source
MDPs.qvaluesMethod
qvalues(model, objective, s, v)

Compute the state-action-value for state s, and value function v for objective. There is no set representation of the value function v.

The function is tractable only if there are a small number of actions and transitions.

source

Simulation

MDPs.PolicyType

Defines a policy, whether stationary deterministic, randomized, Markov, or even history-dependent. The policy should support the functions make_internal and append_history, which initialize and update the internal state. The function take_action then chooses an action to take.

It is important that each tracker keeps its own internal state in order to be thread safe.

source
MDPs.TabPolicyMDType

Markov deterministic policy for tabular MDPs. The policy π has an outer array over time steps and an inner array over states.

source
MDPs.TransitionType

Information about a transition from state to nstate after taking an action. time is the time at which nstate is observed.

source
MDPs.append_historyFunction
append_history(policy, internal, transition) :: internal

Update the internal state for a policy by the transition information.

source
MDPs.cumulativeMethod
cumulative(rewards, γ)

Computes the cumulative return from rewards returned by the simulation function.
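As an illustration, a discounted return for a single episode could be computed as below. This assumes the rewards arrive as a plain vector ordered by time step and that the first reward is not discounted; the actual input layout produced by simulate and the exact convention used by cumulative are not stated here, so cumulative_sketch is a hypothetical stand-in.

```julia
# Hypothetical stand-alone discounted return: sum over t of γ^(t-1) * rewards[t].
function cumulative_sketch(rewards::AbstractVector{<:Real}, γ::Real)
    total = 0.0
    for (t, r) in enumerate(rewards)
        total += γ^(t - 1) * r   # the reward at time step t is discounted t-1 times
    end
    return total
end

ret = cumulative_sketch([1.0, 1.0, 1.0], 0.5)
```

For three unit rewards and γ = 0.5 the sum is 1 + 0.5 + 0.25 = 1.75.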

source
MDPs.make_internalFunction
make_internal(model, policy, state) -> internal

Initialize the internal state for a policy with the initial state. Returns the initial state.

source
MDPs.simulateMethod
simulate(model, π, initial, horizon, episodes; [stationary = true])

Simulate a policy π in a model and generate states and actions for the horizon decisions and episodes episodes. The initial state is initial.

The policy π can be a function, an array, or an array of arrays, depending on whether the policy is stationary, Markovian, deterministic, or randomized. When the policy is provided as a function, then the parameter stationary is used.

There are horizon+1 states generated in every episode including the terminal state at T+1.

The function requires that each state and action transition to a reasonably small number of next states.

See Also

cumulative to compute the cumulative rewards

source
MDPs.take_actionFunction
take_action(policy, internal, state) -> action

Return which action to take with the internal state and the MDP state state.

source

Domains

MDPs.Domains.Gambler.RuinType

Gambler's ruin. Can decide how much to bet at any point in time. With some probability p, the bet is doubled, and with 1-p it is lost. The reward is 1 if it achieves some terminal capital and 0 otherwise.

Capital = state - 1
Bet = action - 1

Available actions are 1, ..., state - 1.

Special states: state=1 is broke and state=max_capital+1 is a terminal winning state.

source
MDPs.transitionMethod
transition(params, stock, order, demand)

Update the inventory value and compute the profit.

Starting with stock items, order items arrive, after which demand items are sold. The sale price is collected even if the demand is backlogged (but not beyond the backlog level). Negative stock means backlog.

Stocking costs are assessed after all the orders are fulfilled.

Causes an error when the order is too large, but no error when the demand cannot be satisfied or backlogged.

source
MDPs.Domains.Machine.ReplacementType

Standard machine replacement simulator. See Figure 3 in Delage 2009 for details.

States are:

  • 1: repair 1
  • 2: repair 2
  • 3 - 10: utility state

Actions:

  • 1: Do nothing
  • 2: Repair

source
diff --git a/dev/simulation/index.html b/dev/simulation/index.html index 03980ed..6fa9540 100644 --- a/dev/simulation/index.html +++ b/dev/simulation/index.html @@ -1,2 +1,2 @@ -- · MDPs.jl

Simulation

This will be more extended documentation that also discusses how to simulate policies that are history dependent.

+- · MDPs.jl

Simulation

This will be more extended documentation that also discusses how to simulate policies that are history dependent.