Markov Decision Processes | AI Strategy & Policy Blog

Andrei Andreevich Markov (1856-1922) developed his idea of states chained (or connected) by probabilities after his retirement at the old age of 50 (i.e., it is never too late to get brilliant ideas). This was at the turn of the 20th century. One of the most famous Markov chains, which we all make use of pretty much every day, is the pages of the world wide web, with 1.5+ billion indexed pages as designated states and maybe more than 150 billion links between those web pages, which are equivalent to the Markov chain transitions taking us from one state (web page) to another state (another web page). Google's PageRank algorithm, for example, is built upon the fundamentals of Markov chains. The usefulness of Markov chains spans many fields, e.g., physics, chemistry, biology, information science/theory, game theory, decision theory, language theory, speech processing, communications networks, etc.

There are a few concepts that are important to keep in mind for Markov Chains and Markov Decision Processes.

Concepts.

Environment: The relevant space that the Markov chain operates in, e.g., the physical surroundings of a logistics storage facility where a robot is moving around.

State: A state is a set of variables describing a system that does not include anything about its history (the physics definition). E.g., in classical mechanics the state of a point mass is given by its position and its velocity vector (i.e., where it is and where it goes). It is good to keep in mind that the computer science meaning of state is different, in the sense that a stateful agent is designed to remember preceding events (i.e., it “remembers” its history). This is however not how a state for a Markov chain should be understood. A sequence (or chain) of random variables {S_0, S_1, S_2, … , S_n}, describing a stochastic process, is said to have the Markov property if

P[ S_{n+1} | S_n ] = P[ S_{n+1} | S_0, S_1, …, S_n ],

that is, a future state of the stochastic process depends only on the state immediately prior and on no other past states. To make the concept of state a bit more tangible, think of a simple customer life-cycle process with (only) 4 states considered: (S0) Conversion, (S1) Retention, (S2) Upsell and (S3) Churn. Thus, in Python we would define the states as a dictionary,

# Example: Customer life-cycle process (simple)
# Defining States


states = {
    0 : 'Conversion',
    1 : 'Retention',
    2 : 'Upsell',
    3 : 'Churn'
}

In our example, the states S form a vector of dimension 4×1, represented either by S = (0 1 2 3) or alternatively by S = (Conversion, Retention, Upsell, Churn). More generally, S is an n×1 vector for n states.

If a reward or penalty has been assigned to the end-state, that terminates your decision or reward process, it is worth being extra careful in your Markov chain design and respective transition probability matrix. You may want to introduce a zero-value end-state. Though, it will of course depend on the structure of the decision process you are attempting to capture with the Markov Chain.

Transition: Describes how the system moves from one state s to another state s’ (which can be the same state) in a single unit of time increment. The associated (state) transition probability matrix T provides the probabilities of all state-to-state transitions for a Markov chain within a single unit of time. T is a square stochastic matrix whose size is defined by the number of states making up the Markov chain (i.e., for n states, T is an n×n matrix). We write the state transition, facilitated by T, as:

s(t+1) = s(t) ∙T per unit time step increment (iteration).

Action: an action a is defined as a choice or decision taken at the current unit of time (or iteration) that will trigger a transition from the current state into another state in the subsequent single unit of time. An action may be deterministic or random. The consequence of an action a, choice or decision, is described by the (state) transition matrix. Thus, the choice of an action is the same as a choice of a state transformation. The set of actions for a given Markov Chain is typically known in advance. Actions are typically associated with what is called a Markov Decision Process. Choosing an action a at time t, in a given state s transitioning to state s’, may result in a reward R(s, a, s’).

Policy: A policy represents the set (distribution) of actions a associated with a given set of states s (representing the Markov chain) and the respective (state) transition probability matrix. Think about a customer life-cycle process with two policies, (1) No churn remedies (or actions) and (2) Churn mitigating remedies (or actions). Policies can differ only slightly (i.e., different actions on a few states) or be substantially different. It is customary to denote a policy as π( a | s), which is the math way of saying that our policy is a distribution of actions conditional on the given states,

π is a function such that π : S → A, with π( a | s) = P[ A(t) = a | S(t) = s ].

A policy, strategy or plan, specifies the set of rules governing the decision that will be used at every unit time increment.
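A minimal sketch of π( a | s) as a table of action probabilities per state, using the customer life-cycle states. The action labels and all probabilities are hypothetical and chosen only for illustration; they are not taken from the examples in this post.

```python
import numpy as np

states = ['Conversion', 'Retention', 'Upsell', 'Churn']
actions = ['no_remedy', 'churn_remedy']   # hypothetical action labels

# policy[s, a] = P[A(t) = a | S(t) = s]; illustrative numbers only.
policy = np.array([
    [1.0, 0.0],   # Conversion
    [0.3, 0.7],   # Retention
    [0.5, 0.5],   # Upsell
    [1.0, 0.0],   # Churn
])

# Each row must be a probability distribution over the actions.
assert np.allclose(policy.sum(axis=1), 1.0)


def sample_action(state_index, rng):
    """Draw one action for the given state according to the policy table."""
    return rng.choice(actions, p=policy[state_index])


rng = np.random.default_rng(seed=0)
chosen = sample_action(1, rng)   # an action for the Retention state
print(chosen)
```

A deterministic policy is the special case where every row puts probability 1 on a single action.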

Reward: Is defined for a given state s and is the expected reward value R over all possible states that one can transition to from a given state. A reward can also be associated with a given action a (and thus may also be different for different policies π). The reward is received in state s subject to action a transitioning into state s’ (which can be the same state as s). Thus, we can write the reward as R(s, a, s’) or in case the reward is independent of the state that is transitioned to, R(s, a).

The concept of reward is important in so-called Markov Reward Processes and essential to the Markov Decision Process. It is customary (and good for convergence as well) to introduce a reward discount factor 0 ≤ γ ≤ 1 that discounts a reward received t time steps into the future by γ^t, essentially attributing less value (or reward) to events in the future (making the present more important). A positive reward can be seen as an income and a negative reward as a cost.
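To make the effect of γ concrete, here is a small sketch that sums a finite reward sequence with the γ^t weighting. The function name and the reward sequence are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence (t = 0, 1, 2, ...)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


rewards = [2.0, 1.0, 2.0, 1.0]   # illustrative reward sequence

print(discounted_return(rewards, gamma=0.9))   # later rewards count for less
print(discounted_return(rewards, gamma=1.0))   # no discounting: plain sum
```

With γ = 1 an infinite reward sequence like this one would grow without bound, which is why γ < 1 helps convergence.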

Thus, a Markov Chain is defined by the (S, T)-tuple, where S are the states and T the (state) transition probability matrix facilitating the state transition. A Markov Reward Process is defined by the (S, T, R, γ)-tuple, with the addition of R representing the rewards associated with the states and γ the discount factor. Finally, a Markov Decision Process can be defined by the (S, A, T, R, γ)-tuple, with A representing the actions associated with the respective states.

The Markov Chain.

The conditional probability of being in a given state at time t+1 (i.e., S(t+1)) given all the previous states {S(0), S(1), …, S(t)} is equal to the conditional probability of S(t+1) conditioned only upon the immediately previous state S(t),

P[ S(t+1) | S(0), S(1), …, S(t) ] = P[ S(t+1) | S(t) ],

∀ S(t) ∊ Ω, where S(t) is a given state at time t that belongs to the environment Ω the Markov chain exists in.

In other words, the state your system is in now, S(t), depends only on the state you were in one unit time step ago, S(t-1). All other past states have no influence on your present state. Or said in another way, the future depends only on what happens now, not on what happened prior.

T: S(t) = i → S(t+1) = j, with the transition likelihood of p_ij = P[S(t+1) = j | S(t) = i ] representing the probability of transitioning from state i to state j upon a given action a taken in state i. We will regard the T as a (n × n) transition matrix, describing how states map to each other.

Here the rows represent the current states and the columns the states they may be mapped to. Moreover, as we deal with probabilities, each row needs to add up to 1, i.e.,

Σ_j p_ij = 1 for every state i.

Let’s simplify a bit by considering a 4-state Markov chain, the customer life-cycle example, together with its transition probability matrix T.

[Figure: 4-state Markov chain diagram and its transition probability matrix T]

From the above illustration we have that our states i, j ∈ {Conversion (0), Retention (1), Upsell (2), Churned (3)}. Thus, T(1,1) = 0.75 is the probability that an action in the Retention state results in ending up in the same Retention state. Note that the first row (first column) is designated 0, the second row (column) 1, etc. As we sum the 2nd row T(1, 0 → 3) we get 1 (i.e., 0.00 + 0.75 + 0.20 + 0.05 = 1) as we require.

Let us consider the following initial condition at time t = 0 for the above Markov model,

s0 = ( 1 0 0 0 ) we are starting out in the Conversion (initial) state s0.

s1 = s0 ∙T = ( 0 1 0 0 ), at first time step (iteration) we end up in the Retention state.

s2 = s1 ∙T = s0 ∙T∙T = s0 ∙T^2 = ( 0.00 0.75 0.20 0.05 ). So already in the 2nd time step (iteration) we have a 75% likelihood of again ending up in the Retention state, a 20% likelihood of ending up in the Upsell state, and a 5% chance that our customer churns and thus ends the Markov process.

s3 = s2 ∙T = s0 ∙T∙T∙T = s0 ∙T^3 =( 0.00 0.76 0.15 0.09 )

…

s10 = s9 ∙T = s0 ∙T^10 = ( 0.00 0.56 0.12 0.32)

…

s36 = s35 ∙T = s0 ∙T^36 = ( 0.00 0.19 0.04 0.77 )
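The iteration above can be sketched with NumPy. Rows 0 and 1 of T follow directly from the text (s0 ∙T = ( 0 1 0 0 ) and the Retention row 0.00, 0.75, 0.20, 0.05). Rows 2 and 3 are assumptions on my part, chosen so that the state vectors quoted above are approximately reproduced; the original matrix may differ.

```python
import numpy as np

# Rows 0 and 1 follow the text. Rows 2 and 3 are ASSUMPTIONS:
# Upsell always returns to Retention, and Churn is absorbing.
T = np.array([
    [0.00, 1.00, 0.00, 0.00],   # Conversion
    [0.00, 0.75, 0.20, 0.05],   # Retention
    [0.00, 1.00, 0.00, 0.00],   # Upsell (assumed)
    [0.00, 0.00, 0.00, 1.00],   # Churn  (assumed)
])

# A transition matrix must be row-stochastic: every row sums to 1.
assert np.allclose(T.sum(axis=1), 1.0)

s0 = np.array([1.0, 0.0, 0.0, 0.0])   # start in the Conversion state


def state_after(n_steps):
    """Return s_n = s0 . T^n, the state distribution after n_steps iterations."""
    return s0 @ np.linalg.matrix_power(T, n_steps)


for n in (1, 2, 3, 10, 36):
    print(n, np.round(state_after(n), 3))
```

Using the matrix power T^n gives the same result as applying T step by step, which is what the s0 ∙T^n notation above expresses.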

Eventually, our overall Markov chain will reach a steady state where s ∙ T = s. It is common to use π for the Markov chain steady state. Thus, we will frequently see π ∙ T = π, reflecting that the steady state has been reached (usually within some level of pre-defined accuracy). To avoid confusion with the policy mapping, which is often also described by π, I prefer to use π∞ to denote that the steady state has been reached.

Within a pre-set accuracy requirement of ε < 0.01, we have that s36 is approximately the steady state, and thus s36 ∙T ≈ s36.
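One way to read the accuracy requirement is as a stopping rule: keep applying T until no entry of the state vector changes by more than ε from one iteration to the next. The sketch below assumes that reading. The function name is hypothetical, and rows 2 and 3 of T are again my assumptions rather than values given in the text.

```python
import numpy as np

# Rows 0 and 1 follow the text; rows 2 and 3 are ASSUMPTIONS.
T = np.array([
    [0.00, 1.00, 0.00, 0.00],   # Conversion
    [0.00, 0.75, 0.20, 0.05],   # Retention
    [0.00, 1.00, 0.00, 0.00],   # Upsell (assumed)
    [0.00, 0.00, 0.00, 1.00],   # Churn  (assumed)
])


def iterate_until_stable(s0, T, epsilon=0.01, max_steps=10_000):
    """Apply s <- s . T until the largest entry-wise change is below epsilon.

    Returns the last state vector and the number of iterations used.
    """
    s = np.asarray(s0, dtype=float)
    for step in range(1, max_steps + 1):
        s_next = s @ T
        if np.max(np.abs(s_next - s)) < epsilon:
            return s_next, step
        s = s_next
    raise RuntimeError("no convergence within max_steps")


s_stable, n_steps = iterate_until_stable([1.0, 0.0, 0.0, 0.0], T)
print(n_steps, np.round(s_stable, 2))
```

A small step-to-step change is only a stopping rule, not proof of stationarity. With an absorbing Churn row, as assumed here, probability mass keeps drifting slowly into Churn if the iteration is continued, so a tighter ε would give a different vector.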

It should be noted (and easy to verify) that introducing a 5th End-state (i.e., splitting up the churn-and-end-state into two states) in our example, will not change the steady-state outcome except for breaking up the churn’s steady-state value (from the 4-state steady-state analysis) into two values with their sum being equal to the 4-state churn value.

Value Iteration.

We start out with a Markov chain characterized by the (S, T)-tuple that describes the backbone of our decision process. We have the option to add actions (which can be a single action as well) and associate rewards with the respective states and actions in our Markov chain. Thus, we expand the description of our Markov chain to that of a Markov Decision Process (MDP), that is the (S, A, T, R, γ)-tuple (or, for a Markov Reward Process, the (S, T, R, γ)-tuple), with γ being the discount factor (0 ≤ γ ≤ 1). Rohan Jagtap in his “Understanding Markov Decision Process (MDP)” has written a great, intuitive and very accessible account of the mathematical details of MRPs and MDPs. It is really recommended reading.

We have been given a magic coin that always ends up at the opposite face of the previous coin flip, e.g., Head → Tail → Head → Tail → etc. Thus we are dealing with a 2-state process that cycles periodically between the two states (i.e., after 2 tosses we are back at the previous face), each state having probability 1 of transitioning to the other. Also, we are given a reward of +2 (R(H)) when we are transitioning into the Head-state (S0) and a reward of +1 (R(T)) when we are transitioning into the Tail-state (S1). We thus have 2 initial conditions: (a) starting with Head and (b) starting with Tail.

How does the long-run (i.e., steady-state) expected value for each of the two states H & T develop over time?

(a) Assume our magic coin’s first face is Head (H); this earns us a reward of R(H) = +2. At the next unit time step we end up in Tail (T) with probability 1 (= P[T|H]) and a reward of R(T) = +1. Next step we are back in Head with probability 1 (= P[H|T]), and so forth. The future value we may choose to discount with γ (and if γ is less than 1, it even guarantees that the value converges). For (b), interchange H with T (and of course the rewards accordingly).

It is good to keep in mind that the reward R is banked when in the state, after transitioning into it from the previous state. The value accrued over time at a given state is the present reward R(s) as well as the expected (discounted) reward for the subsequent states. It is customary to start out with zero-value states at t=0. Though, one could also choose to use the reward vector instead to initialize the value of the states. So, here it goes, for case (a),

V[H] = R(H) + γ∙R(T) + γ^2∙R(H) + γ^3∙R(T) + … = 2 + γ + 2γ^2 + γ^3 + …

Alright, no, I did not sum all the way to infinity (I wouldn’t have finished yet). I “cheated” and used the ‘mdp_valueIteration()’ function;

# NumPy is needed for the arrays below
import numpy as np

# Import own Markov chain (MC) & Markov Decision Process (MDP) library
import mcmdp_v2 as mdp


# States
states = {
    0 : 'Head',
    1 : 'Tail'
}


# Transition Matrix - Magic Coin
T = np.array([[0.00, 1.00],
              [1.00, 0.00]])


# Reward Matrix - Magic Coin
R = np.array([[2], 
              [1]])


pi = np.array([1, 0,]) # Initial state, could also be [0, 1].


# Define the markov chain mc for the MDP value iteration.
mc = mdp.Mdp(states = states, pi = pi, T = T, R = R, gamma = 0.9, epsilon = 0.01)


state_values, expected_total_value, policy, ctime = mc.mdp_valueIteration() # Value iteration on mc


print('Long-run state value V[H]   :', np.round(state_values[0],1))
print('Long-run state value V[T]   :', np.round(state_values[1],1)) 

output>> Long-run state value V[H]   : 15.2
output>> Long-run state value V[T]   : 14.7
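Since mcmdp_v2 is the author's own library, the following is an independent NumPy sketch of the same magic-coin calculation, not that library's implementation. It assumes the update V ← R + γ∙T∙V (reward banked in the current state, no choice of action), starts from zero values, and cross-checks against the closed-form solution of (I − γT)∙V = R. The function name is hypothetical.

```python
import numpy as np

T = np.array([[0.0, 1.0],
              [1.0, 0.0]])      # magic coin: always flips to the other face
R = np.array([2.0, 1.0])        # R(H) = +2, R(T) = +1
gamma = 0.9


def value_iteration_mrp(T, R, gamma, epsilon=1e-8, max_iter=100_000):
    """Iterate V <- R + gamma * T @ V until the largest change is below epsilon."""
    V = np.zeros(len(R))        # customary start: all state values are zero
    for _ in range(max_iter):
        V_next = R + gamma * (T @ V)
        if np.max(np.abs(V_next - V)) < epsilon:
            return V_next
        V = V_next
    raise RuntimeError("no convergence within max_iter")


V_iter = value_iteration_mrp(T, R, gamma)

# Closed form: the fixed point satisfies (I - gamma * T) V = R.
V_exact = np.linalg.solve(np.eye(2) - gamma * T, R)

print('V[H], V[T] by iteration  :', V_iter)
print('V[H], V[T] by linear solve:', V_exact)
```

The exact fixed point is V[H] = 2.9/0.19 ≈ 15.26 and V[T] = 2.8/0.19 ≈ 14.74. The slightly lower 15.2 and 14.7 printed above are plausibly due to the looser stopping tolerance ε = 0.01 used there.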

In general, we have the following value iteration algorithms representing the state-value function V as we iterate over time (i),

[1] V_{i+1}(s) = max_a Σ_{s’} T(s, a, s’) ∙ [ R(s, a, s’) + γ ∙ V_i(s’) ]

[2] V_{i+1}(s) = R(s) + γ ∙ max_a Σ_{s’} T(s, a, s’) ∙ V_i(s’)

[3] V_{i+1}(s) = R(s) + γ ∙ Σ_{s’} T(s, s’) ∙ V_i(s’)

with formula [1] describing a general MDP algorithm. Formula [2] is an MDP where the state reward function R is independent of actions and the subsequent state s’, and formula [3] describes a Markov Reward Process, where the reward function R is independent of the subsequent state s’. In order to get the value iteration started it is customary to begin with an initial condition (i.e., i = 0) of V_0 = 0 ∀ s ∊ S, e.g., for a 5-state process V_0 = [0, 0, 0, 0, 0] at i = 0, that is, the initial value of all states in the Markov chain is set to zero.
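For the general MDP case, a minimal sketch of value iteration could look as follows. It assumes the standard Bellman update, in which each state takes the best action's expected value of R(s, a, s’) + γ∙V(s’). The array layout T[a, s, s’] and R[a, s, s’], the function name, and the tiny 2-state, 2-action example are all illustrative assumptions.

```python
import numpy as np


def value_iteration(T, R, gamma, epsilon=1e-8, max_iter=100_000):
    """Value iteration for an MDP.

    T[a, s, s2]: probability of moving from s to s2 under action a.
    R[a, s, s2]: reward for that transition.
    Returns the state values and the greedy action index per state.
    """
    V = np.zeros(T.shape[1])    # start with zero-value states
    for _ in range(max_iter):
        # Q[a, s] = sum over s2 of T[a, s, s2] * (R[a, s, s2] + gamma * V[s2])
        Q = np.sum(T * (R + gamma * V), axis=2)
        V_next = Q.max(axis=0)
        if np.max(np.abs(V_next - V)) < epsilon:
            return V_next, Q.argmax(axis=0)
        V = V_next
    raise RuntimeError("no convergence within max_iter")


# Hypothetical example: action 0 = stay, action 1 = switch state.
T = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # stay
    [[0.0, 1.0], [1.0, 0.0]],   # switch
])
landing_reward = np.array([2.0, 1.0])       # reward depends on the state landed in
R = np.tile(landing_reward, (2, 2, 1))      # R[a, s, s2] = landing_reward[s2]

V, greedy_policy = value_iteration(T, R, gamma=0.9)
print(np.round(V, 3), greedy_policy)
```

In this toy example the best plan is to get into state 0 and stay there, so the greedy policy is "stay" in state 0 and "switch" in state 1. With a single action per state the max disappears and the update reduces to the Markov Reward Process case.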

The long-run steady-state state values are the outcome of iterating the above formulas [1 – 3] until the state values are no longer changing (within a pre-determined level of accuracy). We can write the long-run steady-state values as,

V∞[S] = ( V∞[S1], V∞[S2], …, V∞[Sn] ),

where V∞[Sj] is the j-th state’s steady-state value and n is the number of states in the underlying Markov chain representing the MDP (or MRP for that matter).

The long-run average (overall) value G in steady-state is

G = π∞ ∙ V∞[S],

where V∞[S] is the steady-state value vector that the value iteration provided us with and π∞ is the steady state of the decision process’s underlying Markov chain.
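As a sketch for the magic coin, G can be computed by weighting the steady-state values with the steady-state distribution, reading G as π∞ ∙ V∞[S] (my reading of the definitions above). The steady-state distribution is solved for directly from π ∙ T = π with entries summing to 1, rather than by iterating, because the magic-coin chain is periodic and s ∙ T^n keeps flipping between the two states.

```python
import numpy as np

T = np.array([[0.0, 1.0],
              [1.0, 0.0]])      # magic coin transition matrix
R = np.array([2.0, 1.0])        # R(H) = +2, R(T) = +1
gamma = 0.9
n = len(R)

# Steady-state values: fixed point of V = R + gamma * T @ V.
V_inf = np.linalg.solve(np.eye(n) - gamma * T, R)

# Steady-state distribution: pi @ T = pi together with sum(pi) = 1,
# written as an over-determined linear system and solved by least squares.
A = np.vstack([T.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi_inf, *_ = np.linalg.lstsq(A, b, rcond=None)

G = pi_inf @ V_inf
print('pi_inf:', pi_inf)
print('V_inf :', V_inf)
print('G     :', G)
```

Here the coin spends half its time on each face, so G is simply the average of the two state values.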

One of the simpler examples to look at would be a “coin toss” process. In order to make it a bit more interesting, let’s consider an unfair-ish coin to toss around. In the first example, we assume to have only 1 action and that the state rewards only depend on the state itself. Thus, we are in the Formula [3] situation above. How we work through the above value-iteration algorithm is shown in the accompanying illustration.

[Illustration: value iteration for the unfair coin toss]

Let us have another look at our customer life-cycle process. We would like to have a better appreciation of the value of each state in the decision-making process. The value iteration approach is shown in the accompanying illustration.

[Illustration: value iteration for the customer life-cycle process]

Coding reference.

Kim Kyllesbech Larsen, “MarkovChains-and-MDPs”, the Python code used for all examples in this blog (December 2021).

Deep Dive – Markov chain & decision process fundamentals.
