MDPs.jl: Markov Decision Processes
Models
This section describes the data structures that can be used to model various types on MDPs.
MDP
This is a general MDP data structure that supports basic functions. See IntMDP and TabMDP below for more models that can be used more directly to model and solve.
MDPs.MDP
— TypeA general MDP representation with time-independent transition probabilities and rewards. The model makes no assumption that the states can be efficiently enumerated, but assumes that there is small number of actions
S: state type A: action type
MDPs.getnext
— Methodgetnext(model, s, a)
Compute next states using transition
function.
Returns an object that can return a NamedTuple
with states
, probabilities
, and transitions
as AbstractArrays
. This is a more-efficient version of transition (when supported).
The standard implementation is not memory efficient.
MDPs.isterminal
— Functionisterminal(mdp, state)
Return true if the state is terminal
MDPs.transition
— Function(sn, p, r) ∈ transition(model, s, a)
Return a list with next states, probabilities, and rewards. Returns an iterator.
Use getnext
instead, which is more efficient and convenient to use.
MDPs.valuefunction
— Functionvaluefunction(mdp, state, valuefunction)
Evaluates the value function for an MDP in a state
Tabular MDPs
This is an MDP instance that assumes that the states and actions are tabular.
MDPs.TabMDP
— TypeAn abstract tabular Markov Decision Process, time independent.
Default interpretation
- State: Positive integer (>0) is non-terminal, zero or negative integer is terminal
- Action: Positive integer, anything else is invalid
Functions that should be defined for any subtype for value and policy iterations to work are: state_count
, action_count
, transition
The methods state_count
and states
should only include non-terminal states
MDPs.transform
— Methodtransform(T::DataFrame, model)
Convert a tabular MDP to a data frame representation
Integral MDPs
This is a specific MDP instance in which states and actions are specified by integers.
MDPs.FormatError
— TypeAn incorrect parameter value
MDPs.IntAction
— TypeRepresents transitions that follow an action. The lengths nextstate
, probability
, and reward
must be the same.
Nextstate may not be unique and each transition can have a different reward associated with the transition. The transitions are not aggregated to allow for comuting the risk of a transition. Aggregating the values by state would change the risk value of the transition.
MDPs.IntMDP
— TypeMDP with integral states and stationary transitions State and action indexes are all 1-based integers
MDPs.IntState
— TypeRepresents a discrete state
MDPs.compress
— Methodcompress(nextstate, probability, reward)
The command will combine mulitple transitions to the same state into a single transition. Reward is computed as a weigted average of the individual rewards, assuming expected reward objective.
MDPs.load_mdp
— Methodload_mdp(input, idoutcome)
Load the MDP from input
. The function assumes 0-based indexes of states and actions, which is transformed to 1-based index.
Input formats are anything that is supported by DataFrame. Some options are CSV.File(...)
or Arrow.Table(...)
.
States that have no transition probabilities defined are assumed to be terminal and are set to transition to themselves.
If docombine
is true then the method combines transitions that have the same statefrom, action, stateto. This makes risk-neutral value iteration faster, but may change the value of a risk-averse solution.
The formulation allows for multiple transitions s,a → s'. When this is the case, the transition probability is assumed to be their sum and the reward is the weighted average of the rewards.
The method can also process CSV files for MDPO/MMDP, in which case idoutcome
specifies a 1-based outcome to load.
Examples
Load the model from a CSV
using CSV: File
+MDPs.jl: Markov Decision Processes · MDPs.jl MDPs.jl: Markov Decision Processes
Models
This section describes the data structures that can be used to model various types on MDPs.
MDP
This is a general MDP data structure that supports basic functions. See IntMDP and TabMDP below for more models that can be used more directly to model and solve.
MDPs.MDP
— TypeA general MDP representation with time-independent transition probabilities and rewards. The model makes no assumption that the states can be efficiently enumerated, but assumes that there is small number of actions
S: state type A: action type
sourceMDPs.getnext
— Methodgetnext(model, s, a)
Compute next states using transition
function.
Returns an object that can return a NamedTuple
with states
, probabilities
, and transitions
as AbstractArrays
. This is a more-efficient version of transition (when supported).
The standard implementation is not memory efficient.
sourceMDPs.isterminal
— Functionisterminal(mdp, state)
Return true if the state is terminal
sourceMDPs.transition
— Function(sn, p, r) ∈ transition(model, s, a)
Return a list with next states, probabilities, and rewards. Returns an iterator.
Use getnext
instead, which is more efficient and convenient to use.
sourceMDPs.valuefunction
— Functionvaluefunction(mdp, state, valuefunction)
Evaluates the value function for an MDP in a state
sourceTabular MDPs
This is an MDP instance that assumes that the states and actions are tabular.
MDPs.TabMDP
— TypeAn abstract tabular Markov Decision Process, time independent.
Default interpretation
- State: Positive integer (>0) is non-terminal, zero or negative integer is terminal
- Action: Positive integer, anything else is invalid
Functions that should be defined for any subtype for value and policy iterations to work are: state_count
, action_count
, transition
The methods state_count
and states
should only include non-terminal states
sourceMDPs.transform
— Methodtransform(T::DataFrame, model)
Convert a tabular MDP to a data frame representation
sourceIntegral MDPs
This is a specific MDP instance in which states and actions are specified by integers.
MDPs.FormatError
— TypeAn incorrect parameter value
sourceMDPs.IntAction
— TypeRepresents transitions that follow an action. The lengths nextstate
, probability
, and reward
must be the same.
Nextstate may not be unique and each transition can have a different reward associated with the transition. The transitions are not aggregated to allow for comuting the risk of a transition. Aggregating the values by state would change the risk value of the transition.
sourceMDPs.IntMDP
— TypeMDP with integral states and stationary transitions State and action indexes are all 1-based integers
sourceMDPs.IntState
— TypeRepresents a discrete state
sourceMDPs.compress
— Methodcompress(nextstate, probability, reward)
The command will combine mulitple transitions to the same state into a single transition. Reward is computed as a weigted average of the individual rewards, assuming expected reward objective.
sourceMDPs.load_mdp
— Methodload_mdp(input, idoutcome)
Load the MDP from input
. The function assumes 0-based indexes of states and actions, which is transformed to 1-based index.
Input formats are anything that is supported by DataFrame. Some options are CSV.File(...)
or Arrow.Table(...)
.
States that have no transition probabilities defined are assumed to be terminal and are set to transition to themselves.
If docombine
is true then the method combines transitions that have the same statefrom, action, stateto. This makes risk-neutral value iteration faster, but may change the value of a risk-averse solution.
The formulation allows for multiple transitions s,a → s'. When this is the case, the transition probability is assumed to be their sum and the reward is the weighted average of the rewards.
The method can also process CSV files for MDPO/MMDP, in which case idoutcome
specifies a 1-based outcome to load.
Examples
Load the model from a CSV
using CSV: File
using MDPs
filepath = joinpath(dirname(pathof(MDPs)), "..",
"data", "riverswim.csv")
@@ -14,4 +14,4 @@
state_count(model)
# output
-20
sourceMDPs.make_int_mdp
— Methodmake_int_mdp(mdp::TabMDP, docompress = false)
Transform any tabular MDP mdp
to a numeric one. This helps to accelerate operations and value function computation. The actions are also turned into 1-based integer values.
The option docompress
combined transitions to the same state into a single transition. This improves efficiency in risk-neutral settings, but may change the outcome in risk-averse settings.
The function adds one more state at the end which represents a catch-all terminal state
sourceMDPs.make_int_mdp
— Methodmake_int_mdp(Ps, rs)
Build IntMDP from a list of transition probabilities Ps
and reward vectors rs
for each action in the MDP. Each row of the transition matrix represents the probabilities of transitioning to next states.
sourceObjectives
MDPs.FiniteH
— TypeFinite-horizon discounted model. The discount factor γ
can be in [0,1]. The optimal policy is Markov but time dependent.
sourceMDPs.InfiniteH
— TypeInifinite-horizon discounted objective. The discount factor γ
can be in [0,1]. The optimal policy is stationary.
sourceMDPs.Markov
— TypeObjective solved by a randomized Markov non-stationary policy. In other words, the solution is time-dependent.
sourceMDPs.MarkovDet
— TypeObjective solved by a deterministic Markov non-stationary policy. In other words, the solution is time-dependent.
sourceMDPs.Objective
— TypeAbstract objective for an MDP.
sourceMDPs.Stationary
— TypeObjective that is solved by a randomized stationary policy
sourceMDPs.StationaryDet
— TypeObjective that is solved by a randomized stationary policy
sourceMDPs.horizon
— Functionhorizon(objective)
Return the horizon length for objective
.
sourceAlgorithms
MDPs.value_iteration
— Functionvalue_iteration(model, objective; [v_terminal, iterations = 1000, ϵ = 1e-3] )
Compute value function and policy for a tabular MDP model
with an objective objective
. The time steps go from 1 to T+1, the last decision happens at time T.
The supported objectives are FiniteH
, and InfiniteH
. When provided with a a real number γ ∈ [0,1]
then the objective is treated as an infinite horizon problem.
Finite Horizon
Use finite-horizon value iteration for a tabular MDP model
with a discount factor γ
and horizon T
(time steps 1
to T+1
) the last decision happens at time T. Returns a vector of value functions for each time step.
The argument v_terminal
represents the terminal value function. It should be provided as a function that maps the state id to its terminal value (at time T+1). If this value is provided, then it is used in place of 0.
Infinite Horizon
For a Bellman error ϵ
, the computed value function is quaranteed to be within ϵ ⋅ γ / (1 - γ) of the optimal value function (all in terms of the L_∞ norm).
The value function is parallelized when parallel
is true. This is also known as a Jacobi type of value iteration (as opposed to Gauss-Seidel)
Note that for the purpose of the greedy policy, minimizing the span seminorm is more efficient, but the goal of this function is also to compute the value function.
The time steps go from 1 to T+1.
sourceMDPs.value_iteration!
— Functionvalue_iteration!(v, π, model, objective; [v_terminal] )
Run value iteration using the provided v
and π
storage for the value function and the policy. See value_iteration
for more details.
Only support FiniteH
objective.
sourceMDPs.mrp!
— Methodmrp!(P_π, r_π, model, π)
Save the transition matrix P_π
and reward vector r_π
for the MDP model
and policy π
. Also supports terminal states.
Does not support duplicate entries in transition probabilities.
sourceMDPs.mrp
— Methodmrp(model, π)
Compute the transition matrix P_π
and reward vector r_π
for the MDP model
and policy π
. See mrp! for more details.
sourceMDPs.mrp_sparse
— Methodmrp(model, π)
Compute a sparse transition matrix P_π
and reward vector r_π
for the MDP model
and policy π
.
This function does not support duplicate entries in transition probabilities.
sourceMDPs.policy_iteration
— Methodpolicy_iteration(model, γ; [iterations=1000])
Implements policy iteration for MDP model
with a discount factor γ
. The algorithm runs until the policy stops changing or the number of iterations is reached.
Does not support duplicate entries in transition probabilities.
sourceMDPs.policy_iteration_sparse
— Methodpolicy_iteration_sparse(model, γ; iterations)
Implements policy iteration for MDP model
with a discount factor γ
. The algorithm runs until the policy stops changing or the number of iterations is reached. The value function is computed using sparse linear algebra.
Does not support duplicate entries in transition probabilities.
sourceValue Function Manipulation
MDPs.make_value
— Methodmake_value(model, objective)
Creates an undefined policy and value function for the model
and objective
.
See Also
value_iteration!
sourceMDPs.bellman
— Methodbellman(model, γ, s, v)
Compute the Bellman operator for state s
, and value function v
assuming an objective obj
.
A real-valued objective obj
is interpreted as a discount factor.
sourceMDPs.bellmangreedy
— Methodbellmangreedy(model, obj, s, v)
Compute the Bellman operator and greedy action for state s
, and value function v
assuming an objective obj
.
The function uses qvalue
to compute the Bellman operator and the greedy policy.
sourceMDPs.greedy!
— Methodgreedy!(π, model, obj, v)
Update policy π
with the greedy policy for value function v
and MDP model
and an objective obj
.
sourceMDPs.greedy
— Methodgreedy(model, obj, v)
Compute the greedy action for all states and value function v
assuming an objective obj
.
sourceMDPs.greedy
— Methodgreedy(model, obj, [s,] v)
Compute the greedy action for state s
and value function v
assuming an objective obj
.
If s
is not provided, then computes a value function for all states. The model must support states
function.
sourceMDPs.qvalue
— Methodqvalue(model, γ, s, a, v)
Compute the state-action-values for state s
, action a
, and value function v
for a discount factor γ
.
This function is just a more efficient version of the standard definition.
sourceMDPs.qvalue
— Methodqvalue(model, objective, s, a, v)
Compute the state-action-values for state s
, action a
, and value function v
for an objective
.
There is no set representation for the value function.
sourceMDPs.qvalues!
— Methodqvalues!(qvalues, model, objective, s, v)
Compute the state-action-values for state s
, and value function v
for the objective
.
Saves the values to qvalue
which should be at least as long as the number of actions. Values of elements in qvalues
that are beyond the action count are set to -Inf
.
See qvalues
for more information.
sourceMDPs.qvalues
— Methodqvalues(model, objective, s, v)
Compute the state-action-value for state s
, and value function v
for objective
. There is no set representation of the value function v
.
The function is tractable only if there are a small number of actions and transitions.
The function is tractable only if there are a small number of actions and transitions.
sourceSimulation
MDPs.FPolicyM
— TypeGeneral stationary policy specified by a function s,t → a
sourceMDPs.FPolicyS
— TypeGeneral stationary policy specified by a function s → a
sourceMDPs.Policy
— TypeDefines a policy, whether a stationary deterministic, or randomized, Markov, or even history-dependent. The policy should support functions make_internal
, append_history
that initialize and update the internal state. The function take_action
then chooses an action to take.
It is important that the tracker keeps their own internal states in order to be thread safe.
sourceMDPs.TabPolicyMD
— TypeMarkov deterministic policy for tabular MDPs. The policy π
has an outer array over time steps and an inner array over states.
sourceMDPs.TabPolicySD
— TypeStationary deterministic policy for tabular MDPs
sourceMDPs.TabPolicyStationary
— TypeGeneric policy for tabular MDPs
sourceMDPs.Transition
— TypeInformation about a transition from state
to nstate
after than an action
. time
is the time at which nstate
is observed.
sourceMDPs.append_history
— Functionappend_history(policy, internal, transition) :: internal
Update the internal state for a policy by the transition information.
sourceMDPs.cumulative
— Methodcumulative(rewards, γ)
Computes the cumulative return from rewards returned by the simulation function.
sourceMDPs.make_internal
— Functionmake_internal(model, policy, state) -> internal
Initialize the internal state for a policy with the initial state. Returns the initial state.
sourceMDPs.random_π
— Methodrandom_π(model)
Construct a random policy for a tabular MDP
sourceMDPs.simulate
— Methodsimulate(model, π, initial, horizon, episodes; [stationary = true])
Simulate a policy π
in a model
and generate states and actions for the horizon
decisions and episodes
episodes. The initial state is initial
.
The policy π
can be a function, or a array, or an array of arrays depending on whether the policy is stationary, Markovian, deterministic, or randomized. When the policy is provided as a function, then the parameter stationary
is used.
There are horizon+1 states generated in every episode including the terminal state at T+1.
The function requires that each state and action transition to a reasonable small number of next states.
See Also
cumulative
to compute the cumulative rewards
sourceMDPs.take_action
— Functiontake_action(policy, internal, state) -> action
Return which action to take with the internal
state and the MDP state state
.
sourceDomains
MDPs.Domains.Gambler.Ruin
— TypeGambler's ruin. Can decide how much to bet at any point in time. With some probability p
, the bet is doubled, and with 1-p
it is lost. The reward is 1 if it achieves some terminal capital and 0 otherwise.
Capital = state - 1 Bet = action - 1
Available actions are 1, ..., state - 1.
Special states: state=1 is broke and state=max_capital+1 is a terminal winning state.
sourceMDPs.Domains.Inventory.Demand
— TypeModels values of demand in values
and probabilities in probabilities
.
sourceMDPs.Domains.Inventory.Model
— TypeAn inventory MDP problem simulator
The states and actions are 1-based integers.
sourceMDPs.Domains.Inventory.Parameters
— TypeParameters that define an inventory problem
sourceMDPs.transition
— Methodtransition(params, stock, order, demand)
Update the inventory value and compute the profit.
Starting with a stock
number of items, then order
of items arrive, after demand
of items are sold. Sale price is collected even if it is backlogged (not beyond backlog level). Negative stock means backlog.
Stocking costs are asessed after all the orders are fulfilled.
Causes an error when the order
is too large, but no error when the demand cannot be satisfied or backlogged.
sourceMDPs.Domains.Machine.Replacement
— TypeStandard machine replacement simulator. See Figure 3 in Delage 2009 for details.
States are: 1: repair 1 2: repair 2 3 - 10: utility state
Actions: 1: Do nothing 2: Repair
sourceSettings
This document was generated with Documenter.jl version 1.2.1 on Wednesday 20 December 2023. Using Julia version 1.9.4.
+20
MDPs.make_int_mdp
— Methodmake_int_mdp(mdp::TabMDP, docompress = false)
Transform any tabular MDP mdp
to a numeric one. This helps to accelerate operations and value function computation. The actions are also turned into 1-based integer values.
The option docompress
combined transitions to the same state into a single transition. This improves efficiency in risk-neutral settings, but may change the outcome in risk-averse settings.
The function adds one more state at the end which represents a catch-all terminal state
MDPs.make_int_mdp
— Methodmake_int_mdp(Ps, rs)
Build IntMDP from a list of transition probabilities Ps
and reward vectors rs
for each action in the MDP. Each row of the transition matrix represents the probabilities of transitioning to next states.
Objectives
MDPs.FiniteH
— TypeFinite-horizon discounted model. The discount factor γ
can be in [0,1]. The optimal policy is Markov but time dependent.
MDPs.InfiniteH
— TypeInifinite-horizon discounted objective. The discount factor γ
can be in [0,1]. The optimal policy is stationary.
MDPs.Markov
— TypeObjective solved by a randomized Markov non-stationary policy. In other words, the solution is time-dependent.
MDPs.MarkovDet
— TypeObjective solved by a deterministic Markov non-stationary policy. In other words, the solution is time-dependent.
MDPs.Objective
— TypeAbstract objective for an MDP.
MDPs.Stationary
— TypeObjective that is solved by a randomized stationary policy
MDPs.StationaryDet
— TypeObjective that is solved by a randomized stationary policy
MDPs.horizon
— Functionhorizon(objective)
Return the horizon length for objective
.
Algorithms
MDPs.value_iteration
— Functionvalue_iteration(model, objective; [v_terminal, iterations = 1000, ϵ = 1e-3] )
Compute value function and policy for a tabular MDP model
with an objective objective
. The time steps go from 1 to T+1, the last decision happens at time T.
The supported objectives are FiniteH
, and InfiniteH
. When provided with a a real number γ ∈ [0,1]
then the objective is treated as an infinite horizon problem.
Finite Horizon
Use finite-horizon value iteration for a tabular MDP model
with a discount factor γ
and horizon T
(time steps 1
to T+1
) the last decision happens at time T. Returns a vector of value functions for each time step.
The argument v_terminal
represents the terminal value function. It should be provided as a function that maps the state id to its terminal value (at time T+1). If this value is provided, then it is used in place of 0.
Infinite Horizon
For a Bellman error ϵ
, the computed value function is quaranteed to be within ϵ ⋅ γ / (1 - γ) of the optimal value function (all in terms of the L_∞ norm).
The value function is parallelized when parallel
is true. This is also known as a Jacobi type of value iteration (as opposed to Gauss-Seidel)
Note that for the purpose of the greedy policy, minimizing the span seminorm is more efficient, but the goal of this function is also to compute the value function.
The time steps go from 1 to T+1.
MDPs.value_iteration!
— Functionvalue_iteration!(v, π, model, objective; [v_terminal] )
Run value iteration using the provided v
and π
storage for the value function and the policy. See value_iteration
for more details.
Only support FiniteH
objective.
MDPs.mrp!
— Methodmrp!(P_π, r_π, model, π)
Save the transition matrix P_π
and reward vector r_π
for the MDP model
and policy π
. Also supports terminal states.
Does not support duplicate entries in transition probabilities.
MDPs.mrp
— Methodmrp(model, π)
Compute the transition matrix P_π
and reward vector r_π
for the MDP model
and policy π
. See mrp! for more details.
MDPs.mrp_sparse
— Methodmrp(model, π)
Compute a sparse transition matrix P_π
and reward vector r_π
for the MDP model
and policy π
.
This function does not support duplicate entries in transition probabilities.
MDPs.policy_iteration
— Methodpolicy_iteration(model, γ; [iterations=1000])
Implements policy iteration for MDP model
with a discount factor γ
. The algorithm runs until the policy stops changing or the number of iterations is reached.
Does not support duplicate entries in transition probabilities.
MDPs.policy_iteration_sparse
— Methodpolicy_iteration_sparse(model, γ; iterations)
Implements policy iteration for MDP model
with a discount factor γ
. The algorithm runs until the policy stops changing or the number of iterations is reached. The value function is computed using sparse linear algebra.
Does not support duplicate entries in transition probabilities.
Value Function Manipulation
MDPs.make_value
— Methodmake_value(model, objective)
Creates an undefined policy and value function for the model
and objective
.
See Also
value_iteration!
MDPs.bellman
— Methodbellman(model, γ, s, v)
Compute the Bellman operator for state s
, and value function v
assuming an objective obj
.
A real-valued objective obj
is interpreted as a discount factor.
MDPs.bellmangreedy
— Methodbellmangreedy(model, obj, s, v)
Compute the Bellman operator and greedy action for state s
, and value function v
assuming an objective obj
.
The function uses qvalue
to compute the Bellman operator and the greedy policy.
MDPs.greedy!
— Methodgreedy!(π, model, obj, v)
Update policy π
with the greedy policy for value function v
and MDP model
and an objective obj
.
MDPs.greedy
— Methodgreedy(model, obj, v)
Compute the greedy action for all states and value function v
assuming an objective obj
.
MDPs.greedy
— Methodgreedy(model, obj, [s,] v)
Compute the greedy action for state s
and value function v
assuming an objective obj
.
If s
is not provided, then computes a value function for all states. The model must support states
function.
MDPs.qvalue
— Methodqvalue(model, γ, s, a, v)
Compute the state-action-values for state s
, action a
, and value function v
for a discount factor γ
.
This function is just a more efficient version of the standard definition.
MDPs.qvalue
— Methodqvalue(model, objective, s, a, v)
Compute the state-action-values for state s
, action a
, and value function v
for an objective
.
There is no set representation for the value function.
MDPs.qvalues!
— Methodqvalues!(qvalues, model, objective, s, v)
Compute the state-action-values for state s
, and value function v
for the objective
.
Saves the values to qvalue
which should be at least as long as the number of actions. Values of elements in qvalues
that are beyond the action count are set to -Inf
.
See qvalues
for more information.
MDPs.qvalues
— Methodqvalues(model, objective, s, v)
Compute the state-action-value for state s
, and value function v
for objective
. There is no set representation of the value function v
.
The function is tractable only if there are a small number of actions and transitions.
The function is tractable only if there are a small number of actions and transitions.
Simulation
MDPs.FPolicyM
— TypeGeneral stationary policy specified by a function s,t → a
MDPs.FPolicyS
— TypeGeneral stationary policy specified by a function s → a
MDPs.Policy
— TypeDefines a policy, whether a stationary deterministic, or randomized, Markov, or even history-dependent. The policy should support functions make_internal
, append_history
that initialize and update the internal state. The function take_action
then chooses an action to take.
It is important that the tracker keeps their own internal states in order to be thread safe.
MDPs.TabPolicyMD
— TypeMarkov deterministic policy for tabular MDPs. The policy π
has an outer array over time steps and an inner array over states.
MDPs.TabPolicySD
— TypeStationary deterministic policy for tabular MDPs
MDPs.TabPolicyStationary
— TypeGeneric policy for tabular MDPs
MDPs.Transition
— TypeInformation about a transition from state
to nstate
after than an action
. time
is the time at which nstate
is observed.
MDPs.append_history
— Functionappend_history(policy, internal, transition) :: internal
Update the internal state for a policy by the transition information.
MDPs.cumulative
— Methodcumulative(rewards, γ)
Computes the cumulative return from rewards returned by the simulation function.
MDPs.make_internal
— Functionmake_internal(model, policy, state) -> internal
Initialize the internal state for a policy with the initial state. Returns the initial state.
MDPs.random_π
— Methodrandom_π(model)
Construct a random policy for a tabular MDP
MDPs.simulate
— Methodsimulate(model, π, initial, horizon, episodes; [stationary = true])
Simulate a policy π
in a model
and generate states and actions for the horizon
decisions and episodes
episodes. The initial state is initial
.
The policy π
can be a function, or a array, or an array of arrays depending on whether the policy is stationary, Markovian, deterministic, or randomized. When the policy is provided as a function, then the parameter stationary
is used.
There are horizon+1 states generated in every episode including the terminal state at T+1.
The function requires that each state and action transition to a reasonable small number of next states.
See Also
cumulative
to compute the cumulative rewards
MDPs.take_action
— Functiontake_action(policy, internal, state) -> action
Return which action to take with the internal
state and the MDP state state
.
Domains
MDPs.Domains.Gambler.Ruin
— TypeGambler's ruin. Can decide how much to bet at any point in time. With some probability p
, the bet is doubled, and with 1-p
it is lost. The reward is 1 if it achieves some terminal capital and 0 otherwise.
Capital = state - 1 Bet = action - 1
Available actions are 1, ..., state - 1.
Special states: state=1 is broke and state=max_capital+1 is a terminal winning state.
MDPs.Domains.Inventory.Demand
— TypeModels values of demand in values
and probabilities in probabilities
.
MDPs.Domains.Inventory.Model
— TypeAn inventory MDP problem simulator
The states and actions are 1-based integers.
MDPs.Domains.Inventory.Parameters
— TypeParameters that define an inventory problem
MDPs.transition
— Methodtransition(params, stock, order, demand)
Update the inventory value and compute the profit.
Starting with a stock
number of items, then order
of items arrive, after demand
of items are sold. Sale price is collected even if it is backlogged (not beyond backlog level). Negative stock means backlog.
Stocking costs are asessed after all the orders are fulfilled.
Causes an error when the order
is too large, but no error when the demand cannot be satisfied or backlogged.
MDPs.Domains.Machine.Replacement
— TypeStandard machine replacement simulator. See Figure 3 in Delage 2009 for details.
States are: 1: repair 1 2: repair 2 3 - 10: utility state
Actions: 1: Do nothing 2: Repair