A hands-on guide to data-driven business decisions – How to gain insights into an uncertain world.

“Would you like an adventure now or shall we have tea first”, Alice in Wonderland (Lewis Carroll).

Google parent company Alphabet breached a US$2 (~ €1.8) trillion valuation in November this year (2021). What few realize is that this valuation, as well as the success of Google, is based on a Markov Decision Processes and its underlying Markov chain. Making PageRank, the brain child of Sergey Brin and Larry Page, likely the most valuable and most used applications of a Markov chain ever, period. We all make use of a Markov Chain and a Markov Decision Process pretty much every day, many many times a day. The pages of the world wide web, with its 1.5+ billion indexed pages, can be designated states of a humongous huge Markov chain. And the in excess of 150+ billion hyperlinks, between those web pages, can be seen as the equivalent of Markov state transitions, taken us from one State (web page) to another State (another web page). The Markov Decision Process value and policy iteration ensures that the consumers of Google’s search engine gets the best search results in the fastest possible way. Many hyperlink paths may lead to your page of interest, but many (maybe most) of those paths will not get you where you want to be fastest. Optimization algorithms develop on the world-wide web Markov chain will find the best path to the content you want (or the algorithm “thinks” you want). I cannot think of a better motivating example for business people to get acquainted with data-driven decision processes and the underlying Markov chains than that.

In the course of my analysis of corporate decision makers sentiment towards data-driven decision making and processes, it became clear that there is very little comprehensive material (for business people) on structured hands-on approaches to decision processes. How do you go around implementing a data-driven decision process in your company? and this despite, such approaches can be fantastic tools for structuring and optimizing a business decision making processes. Not only that. We can integrate such decision processes algorithms into our digital platforms. Achieving a very high degree of autonomous business decisions (e.g., Google’s PageRank decision process), that in turn will enhancing our customer’s experience. Also, it allows us to monitor the overall quality of customer interactions with our digital environment(s) (e.g., web-based environments, user-interfaces, closed-loop user-experience, digital sales, apps, bots, etc..). Such data-driven integration would result in more efficient business processes internally and towards external customers and partners.

So, my goal, with this blog, is to take you through the essentials, the enablers, to better understand what a decision process may look like and its wide applicability towards business-relevant decision making.

I like hands-on. The focus of this blog will be on providing coding examples for the reader to implement and in parallel play with the examples given. Maybe even better encourage you to create your own examples, relevant to your area of business interest. I will take you through two major important enablers for digitizing data-driven decision processes. One enabler is so-called Markov Chains which is essential in understanding the other, Markov Decision Processes (MDPs). I will attempt to do this intuitively, via coding examples, rather than write down a lot of mathematical notation (which is the normal approach). I will demonstrate some relative simple business examples that hopefully will create an appetite to learn much more about this field.

At the very end of this blog, you can either stop reading or you can continue (an MDP maybe?) and will then find a more formalistic treatment of Markov Chains and MDPs. Although somewhat more “long haired” (coming from a bald guy), I attempted to make it assessable to readers that may not have a math degree.

In my previous blog, “Data-driven decision making … what’s not to like about that?”, I wrote about data-driven decision making for businesses and public institutions. I described, or more accurately pulled a so-called Markov Decision Process out of my magic hat, that we could view as a data-driven decision process,

In the example above, I mapped out how a data-driven decision process might (or should) look like. The process consist of 6 states or stages (i.e., Idea, Data Gathering, Insights, Consultation, Decision, Stop) and actions that takes us from one state to the other (e.g., Consult → Decision), until the final Decision state, where we may decide to continue, develop further, thus back to the Idea, or decide to terminate (i.e., Stop). We can associate our transitions and the associated actions with likelihoods, based on empirical evidence that fancy people may describe as process mining, of a given state transition (e.g.., Insights → Consult vs Insights → Decision, …) to occur.

The described decision process is not static. It is dynamic and is likely to evolve over time. Transitions from one state to another can be seen as moving forward in time increments. You find yourself in one stage of the process and making up your mind, that is you make a decision or take an action, what next state you “want” to move to. With a given likelihood of making such a decision if there are several stages to move to and depending on the level of stochasticity. The rules of how to move from one state to the next is given by the scenario you are considering.

The above illustration may (or may not) look complex. So let’s break down on a higher level what we see above. We have Circles that represent a Stage in the decision process, e.g., Idea, Data gathering Insights, Consult, Decision, Stop. These stages are what we also could call States which is the official name if we speak Markovian. I will use stage and state interchangeably. In between the stages we have white colored arrows that takes us, or Transition in Markovian, from one stage to the next (e.g., from Insights to Consult). A transition from one stage to the next is triggered by an Action (e.g., a decision) in the stage we are in that leads us to the following stage in our process. For each transition and associated action we can assign a Transition Probability that describes the likelihood of a particular transition & action combination to take place. For example, in the above chart, I have a transition probability from Insights stage to the Consult stage of 40%. I can also associate a reward R associated with a given transition. This reward is “banked” as you enter the new stage. For example, I can expect a reward of €R23 (see illustration above), as I transition from Insights to Consult stage. An arrow that goes back to its own stage simply mean that a decision, or action, is taken that results in us remaining in that particular state (e.g., 20% chance of remaining in the Decision stage which here implies that we continue as is). As transitions are associated with probabilities, it follows that sum of transitions probabilities (arrows) leaving a state is required to be equal to 1. It is possible to define several transition probability scenarios per state. Such scenarios is called policies in the language of Markov. The idea is that we can run through several scenarios, or policies, to determine if some are better than others and thus optimize our decision making process (assuming the optimal policy is also feasible). Typically, actions, (state) transitions, and associated rewards will not be are not symmetric. In the sense, that the likelihood (& reward) of going from State 1 to State 2 may not be the same as going from State 2 back to State 1.

What I have describe above is the fundamental setup of a so-called Markov Chains and how such can be extended to action (e.g., decision), rewards (e.g., income & cost) and policy (i.e., scenario) estimation and optimization. Thus, into what we call Markov Decision Processes (MDP) or its “cousins” Markov Reward Processes (MRP).

The question remains, of course, how do we actually do some meaningful analysis on a decision process as illustrated above? How do we code this?

While you will find some examples in the public domain on analysis of Markov Chains, it becomes a bit more technical as you move up the “intellectual food chain” to Markov Decision Processes. I have provided some simple, but generalized, Python codes throughout this blog. These will allow you to run some of this analysis yourself and gain a lot of insights into Markov chains and (Markov) Decision Processes.

A simpler customer life-cycle & retention example.

Let’s take a simple example of customer life-cycle of a typical subscription (e.g., online, magazine, services, …). We start our process with the Conversion of a prospective customer to a real customer, kicking of the customer life-cycle. After the sale, the customer starts the service which below is defined as Retention, we want to keep our customer with that service. During the retention period, or life as a customer, the illustration below assumes 3 events may happen; (1) Our Customer is okay with service and wish to continue as is (i.e., remains in retention stage), (2) Our Customer is interested to add additional features to the existing subscription, thus accepting an Upsell after which the customer falls back into the retention phase, and finally (3) Our Customer may decide to discontinue the subscribed service. Thus, Churn and end the engagement and customer life-cycle process. In fact, once the churn state has been reached, it cannot be left. We call such a state and absorbing state. A Markov chain that includes absorbing states (or at least one) is called an absorbing Markov chain.

In the above customer life-cycle illustration, we have 4 states (or stages); Conversion (S0), Retention (S1), Upsell (S2) and Churn (S3). The transition from one stage to the next (e.g., Retention → Upsell) is driven by a decision (i.e., action) to do so. Each transition is associated with a likelihood of that particular decision is being made, e.g., there is 20% that a customer decides to accept an Upsell when located in the Retention state.

In a Python code we can operationalize the process as follows;

# Customer life-cycle process (simple)
# Define States

states = {
    0 : 'Conversion',
    1 : 'Retention',
    2 : 'Upsell',
    3 : 'Churn'

c = 0.05      # Churn likelihood
u = 0.20      # Upsell likelihood

#Transition probability matrix:
T = np.array([
    [0.00, 1.00, 0.00, 0.00],
    [0.00, 1-u-c, u, c],
    [0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 1.00]])

Each row (and column) in the transition probability matrix represents the stages in the process. Note that each row is required to sum up to 1. Our curiosity of the dynamics of our process, our chain of events, should lead us to ask “In the long run, what are the chance of ending up in each of the defined states of our process?”. Actually, for our customer life-cycle process, we might like understand what is our long-term churn proportion given the current scenario. Maybe even trade-offs between upsell and churn rate.

How do get about that question solved?

Our process have reached steady-state when the likelihood of ending up in a given state does not change any longer with subsequent evolution of time. What this means is that for a given overall process state, represented by π, applying the transition matrix no longer changes that overall process state (i.e., πT = π). Thus, π then represents the expected likelihood of being in a given state once steady-state has been reached.

In our illustration above, we can kick off our process with π0 = [1, 0, 0, 0] which represents that our initial stage is in the Conversion state. Applying the transition matrix T to our initial state will transition it into the Retention stage, i.e., π0T = [0, 1, 0, 0] = π1. The next step then is π1T = [0, 0.79, 0.20, 0.01] = π2 and so forth until πT = π for all subsequent time steps (algorithmically we can describe this iterative process as πT ← π). Following this recipe we can create a small Python code that will do the work for us;

def steady_state(pi, T, epsilon=0.01):

# pi      : Given n states pi is an array of dim (n,) .
# T       : The transition probability matrix of dim (n,n)
# epsilon : Provides the convergence criteria.

    j = 0 #Counter
    while True:
        oldpi = pi
        pi = pi.dot(T)

        # Check Convergence
        if np.max(np.abs(pi - oldpi)) <= epsilon:
        # In case of no Convergence
        if j == 1000:
    return pi # Returning the likelihood of the steady-state states.

Using the above code, the transition matrix given above and our customer life-cycle process initial condition π0 = [1, 0, 0, 0], we find that

# Finding Customer life-cycle Steady-State
pi0 = np.array([1, 0, 0, 0])

pi = steady_state(pi0, T)  # Using above steady-state function


output>> steady-state pi = [0, 0.19, 0.04, 0.77]  

So, within the existing customer life-cycle scenario a Churn rate of 5% will in the long run lead to an overall churn likelihood of almost 80%. This is with a Retention transition probability of of 75% and an Upsell transition probability of 20%. It may be a surprising outcome, that a relative low probability action (or event) can lead to such a dominant business impact. Intuitively, we should remember that churn is a terminal event. Retention and even Upsell simply continue to operate within the life-cycle process with the ever present “doom” of a customer leaving by churning. More detailed analysis of this process will show that as long as we keep the Churn transition probability below 1.3% the Churn State’s steady-state likelihood to transition probability is below 3.3 times that of the transition probability. Above 1.3% transition probability, the steady-state Churn likelihood rapidly increase to10 – 20 times that of the base transition probability. Needless to say, the current policy, as represented by the transition matrix, would require optimization in order to minimize the churn impact and ideally we would need to aim for measures keeping churn below 1.3%.

The purpose of this blog, as stated in the beginning, is not so much studying the dynamics of a hypothetical customer life-cycle management processes or any other for that matter. The purpose of this blog is to provide you with an understanding of decision process modelling, the analysis possible, and tools to go and do that yourself.

A simple customer life-cycle example with rewards.

How do we get from the process dynamics to an assessment of value of our given strategy or policy? There are several dimensions to this question. Firstly, we would of course like to enhance the value of the overall decision process. Next, we also would like to ensure that each stage of our process has been optimized as well, in terms of value, with value being monetary, time-wise, number of stages, order of states, etc..

I will provide you with a reasonably simple approach as well as the code to work out your own examples. It is in general straightforward to add complexity to your process and still be able to use the provided code.

In the above illustration two different scenarios have been provided. Both scenarios are represented by the same transition probability matrix. This is not a requirement but makes the example simpler. Policy 1 mimics an “annual subscription model” with a subscription value of 100 per annum, an upsell value of 20 per annum and a (1-time) loss of value by churn of -100. Policy 2 represents a “monthly subscription model” with a subscription value of 10 per month, an upsell value of 2 per month and a (1-time) loss of value by churn of -60. I would like to know what the overall expected value is for each of the two scenarios.

However, Houston we (may) have a problem here. Remember that once we end up in the “Churn” state we are stuck in that state (i.e., we don’t “un-churn” … at least not in this process). From a transition probability matrix perspective once we are in the churn state every new iteration transition back to the same state. In fact, as already discussed previously, the churn state acts as an absorbing state, making our customer life-cycle Markov chain an absorbing Markov Chain. In this particular case, we would like to assign a one-time penalty (i.e., negative reward) to the churn state and then end the process. Of course this may not always be the case. But, here it is the case. From a (state) transition matrix perspective we have created our own equivalent of “Groundhog day” as we keep returning to same state and thus might end up multiplying the churn penalty times the number of iterations it takes us to get to convergence of the overall system. Unless we are careful. We have two solutions (1) Choose an appropriate small churn penalty that with a given discount factor (lets call it γ) ensures that the penalty quickly converges to a reasonable figure (yeah … not the nicest solution, but it could work) and (2) Introduce an End-state with zero reward/penalty where the churn-state has a probability of 1 to end up in. Thus, pushing the absorbing state away from the churn-state which will be convenient for the valuation estimation process. This, will ensure that as we end up in the churn-state, its penalty is only counted once. This will be an important consideration as we commence on value iteration for a Markov Decision Process. So, our above illustration needs a bit of revision and get’s to look like this,

and our Python implementation of the states, transition matrix and reward vector will look like this,

# Customer life-cycle process (simple)
# Define States

states = {
    0 : 'Conversion',
    1 : 'Retention',
    2 : 'Upsell',
    3 : 'Churn',
    4 : 'End'

c = 0.05      # Churn likelihood
u = 0.20      # Upsell likelihood

#Transition probability matrix:
T = np.array([
    [0.00, 1.00, 0.00, 0.00, 0.00],
    [0.00, 1-u-c, u,    c,   0.00],
    [0.00, 1.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 1.00],    # Churn transition to End State
    [0.00, 0.00, 0.00, 0.00, 1.00]])   # End state

If you run the 5-state transition matrix through the above Python “steady_state()” function, you would get π = (0 0.18 0.04 0.01 0.77) where the last two-states (i.e., the Churn- & End-state) are the part of the same state, thus mapping it back to the 4-state model we have π = (0 0.18 0.04 0.78) which is very close to the above “true” 4-state model (i.e., with the churn-state also being the end-state) π = (0 0.19 0.04 0.01 0.77). From π we see that we have a steady-state likelihood of 77% ending up in the absorbing state in our customer life-cycle process.

For value estimation purposes, we would like to ignore the absorbing state and renormalize the steady-state vector π. Thus, π’ = (0 0.19 0.04 0.01 0)/sum((0 0.19 0.04 0.01 0)) which results in a renormalized steady-state vector π’ = (0, 0.79, 0.17, 0.04, 0) that we will use to asses the long-run average value of our customer life-cycle process.

So let’s do a bit of random walk across our decision process as depicted above. I will make use of repeated random sampling (also called a Monte Carlo simulation among friends) on my decision process. The transition matrix will bring me from one stage to another with a rate provided by the respective transition probability matric (T). The reward vector (R), see above illustration, provides the reward expectation for each state (e.g., R = [0, 100, 20, -100, 0] for Policy 1). As the repeated random sampling progresses and the various decision process states are visited, we add up the respective state’s rewards until we end up in the Churn state that will end the simulation. The simulation is then repeated and the average of all the simulated process values provides our expected overall value for the decision process. We have the option to discount future values with gamma (γ < 1);

with n* = min { N, n @ EndState} with N being the maximum allowed simulation time steps iterations and “n @ EndState” is the time step where our simulation end up in the EndState terminating the simulation. This principle is similar to financial net present value calculations that value the present more than the future at for example a weighted average cost of capital (WACC) or discount rate. Also it will turn out mathematically convenient to include a discount rate as it ensures that our value iteration converges (which is pretty convenient).

The Python code for the described simulation is provided here;

def mc_random_walk_reward(states, state0, T, R, gamma=0.9, tot_steps, EndState)

# states     : Array which includes the names of the n states being studied. 
# T          : The transition probability matrix of dim (n,n)
# R          : The reward array of dim (n,) e.g., can have several columns pending 
#              policies.
# gamma      : Value Discount Factor, < 1.0
# tot_ steps : Maximum number of time step iterations unless EndState is 
#              encounted.
# EndState   : The end-state that would stop the simulation unless steps is 
#              encounted.
    n = steps
    stp = 0
    start_state = state0
    path = [states[start_state]]
    value_state = R[state0]*(gamma**0)
    prev_state = start_state
    i = 1 
    while n:
        curr_state = np.random.choice(list(states.keys()), p = T[prev_state])
        value_state += R[curr_state]*(gamma**i)
        prev_state = curr_state
        if states[prev_state] == EndState: 
            stp = n + 1

    return (path, value_state, stp)

The Transition matrix remain, thus we need to define the Reward vector;

# Reward vector for Policy 1 - Annual Subsctiption Model.

R1 = np.array([

# Reward vector for Policy 2 - Monthly Subsctiption Model.

R2 = np.array([

Thus, we are ready to call run the decision process valuation;

j = 0
stop = 10000 # Total amount of Monte Carlo simulations

df_value = []
df_time = []

# For R use R1 for Policy 1 and R2 for Policy 2.

while True:
    a,b,c = mc_random_walk_reward(states, 0, T, R, 0.90, 100, 'Churn')
    if j == stop:

unit_time = 1 # 1 month in 1 time unit iteration.

print('Total Value of Policy 2: ', np.round(np.mean(df_value)/unit_time,0))
print('Mean time to Churn of Policy 2: ', np.round(np.mean(df_time),0))

output>> Expected Total Value of Policy 2:  42.0
output>> Average time to Churn of Policy 2:  77.0

The expected total value of Policy 2 comes out at ca. 42. Applying the above code to Policy 1, get’s us an expected total value of 46 and thus a bit more attractive than Policy 2. This provides a fairly easy way to get an idea about a given decision process value. As we enter the maximum amount of time step iteration steps (e.g., 100 in the code snippet) it is good to check that the average time to churn over the total amount of Monte Carlo simulations is less than this number (e.g., 77 < 100 in the above code snippet). It is wise to play a bit with the maximum steps number to check whether your expected total value of your policy changes significantly.

It should be noted that as the Monte Carlo simulation of the Markov Chain terminates upon Churn, the churn penalty is only accounted for once as is appropriate in this particular example. In other words, for this part we would not strictly speaking require the fifth state to end the process.

If we compare the expected value of our process of 42.0 (Policy 2) with the Value Iteration algorithmic approach (more on that below), used on Markov Decision Processes, we find steady-state state values of V[Policy 2] = [ 42.2, 46.9, 44.2, -60, 0] = [Conversion, Retention, Upsell, Churn, End]. With the End-state representing our absorbing state. Using our renormalized steady-state state π’ = (0.00, 0.79, 0.17, 0.04, 0.00), we find that our long-run average value is

G = V ∙ π’ = 42.2 ∙ 0.00 + 46.9 ∙ 0.79 + 44.2 ∙ 0.17 + (-60) ∙ 0.04 + 0 ∙ 0.00 = 42.2

Which is a pretty good agreement with the Monte Carlo simulations of our Customer Lifecycle Process. And even better … a much faster way of getting the long-run value. However, if you are dealing with an (or several) absorbing states, some caution in how to compensate for those should be considered. I like in general to run a Monte Carlo process simulation just to ensure that my value iteration extraction logic is correct.

Decision process optimization … Dual policy and value iteration.

We are often in situations where we need to evaluate different assumptions or policies in our decision process. Imagine you are looking at a subscription process, as shown below, where you have broken down the customer life-cycle in an 3 parts (i.e., states or stages); (1) Start of subscription, (2) Average life-time and (3) long-time subscriptions. You are contemplating on two different policies; Policy 0 (white arrows): No churn intervention and Policy 1 (orange arrows): Churn intervention measures at each stage in the subscription process. After a churn intervention, at a given state, your system will treat you as a new customer (note: this might not be the smartest thing to do, but it is the easiest to illustrate).

The decision process for such a customer life-cycle process is illustrated below. I have deliberately not added an end-state after churn in this example (i.e., so strictly speaking once we end up in the Churn state we are in our “Groundhog state” of “perpetual” churn … which is btw why we like to call it an absorbing state). It adds complexity to the Markov chain and the purpose of this example is to show the policy and value optimization in a situation where we have 2 policies to consider in our decision process.

You would like to know, what the value is at each state, given observed churn. Moreover, you also require to know when (time-wise) and where (state-wise) it might become important to shift from one policy to the other. So let’s (Python) code the above customer loyalty decision process;

# Customer life-cycle process - dual value and policy iteration
# Define States

# Subcription state

states = {
    0 : 'Start',
    1 : 'Average life-time ',
    2 : 'Long-time',
    3 : 'Churn'

#Transition probability matrix:

c = 0.05 # Churn rate; try to increase this with steps of 0.05
         # and you will see that value and policies begin to change as
         # the churn rate increases
    # Policy 0 - No churn intervention measures.
    policy_1 = np.array([
        [0.00, 1-c, 0.00, c],
        [0.00, 0.00, 1-c, c],
        [0.00, 0.00, 1-c, c],
        [0.00, 0.00, 0.00, 1.00]
    # Policy 1 - Churn Intervention measures.
    policy_2 = np.array([
        [1.00, 0.00, 0.00, 0.00],
        [1.00, 0.00, 0.00, 0.00],
        [1.00, 0.00, 0.00, 0.00],
        [0.00, 0.00, 0.00, 1.00]
    T =  np.array([policy_1, policy_2]) #Transition probability matrix
    # Reward array for the two policies
    # 1st column reflects rewards in Policy 0
    # 2nd column reflects rewards in Policy 1

    R = np.array([
        [0, 0],
        [0, 1],
        [4, 2],
        [-1, -1]])

In order for us to get the optimal decision process value and respective policy we are making use of an iterative computing algorithm called Value Iteration. The value iteration procedure provides the values of the various states of our decision process with known transition probabilities and their corresponding rewards. The same algorithm also provides for the optimal policy matching the optimum values (i.e., Policy Iteration). I have implemented the Value and Policy Iteration algorithm in the code below.

def mdp_valueIteration(states,T,R, gamma = 0.90, epsilon = 0.01):

# states  : Array which includes the names of the n states being studied. 
# T       : The transition probability matrix of dim (n,n)
# R       : The reward array of dim (n,) e.g., can have several columns pending
#           the number of policies.
# gamma   : Value discount factor, < 1.0
# epsilon : Provides the convergence criteria.

    # Initialize V_0 to zero
    values = np.zeros(len(states))

    ctime = 0
    # Value iteration
    # Continue until convergence.

    while True:

        # To be used for convergence check
        oldValues = np.copy(values)

        values = np.transpose(R) + gamma*np.dot(T,values)   # Value iteration step
        policy = values.argmax(0)   # Take the best policy.
        values = values.max(0)      # Take the highest value
        ctime +=1
        # Check Convergence
        if np.max(np.abs(values - oldValues)) <= epsilon:
    return(values, policy, ctime)        

All we have to do is to call the above “mdp_valueIteration” function with the transition probability matrix and respective reward vectors for Policy 0 and Policy 1.

# Call ValueIteration function and get optimum value per state and
# Optimum Policy strategy per state.

values, policy, time = mdp_valueIteration(states,T,R,gamma = 0.90)

print('Optimum State Values: ', np.round(values,0))
print('Optimum Policy      : ', policy)
print('Optimum Time steps  : ', time)

# Output for Churn rate c = 0.05

output>> Optimum State Values: [17  21  25  -10]
output>> Optimum Policy      : [ 0   0   0    0]

# Output for Churn rate c = 0.3

output>> Optimum State Values: [ 0   1   4  -10]
output>> Optimum Policy      : [ 1   1   0    0]

We have 4 states defined in our subscription decision process, “Start of subscription” (S0), “Average life-time subscription” (S1), “Long-time subscription” (S2) and “Churn”(S3). Not surprisingly we find that the highest value is delivered by the subscribers in the long-time subscription category. For relative low churn rates, the policy without churn intervention measures (i.e., Policy 0) is the most appropriate. As the churn rate increases, we observe that the best policy (in terms of value optimization between the two policies) for State 0 and for State 1, is to have a churn intervention policy (i.e., Policy 1) in place.

In summary, with a more detailed analysis of our dual-policy customer life-cycle decision process we find the following process dynamics as a function of the churn rate.

For comparison, I have also tabulated the value outcome in the case that we do not consider a churn mitigation policy.

Using the provided code and examples, it is straightforward to consider more complex decision process applications. The codes I have provide are very general and your work will be in defining your process underlying Markov chain with its transition probabilities and the respective decision process rewards (positives as well as negatives). Once that is setup, it referencing the provided code functions.

The hidden Markov model … An intro.

We are often in the situation that we get customer feedback (observable) without being complete sure what underlying (“hidden”) processes that may have influenced the customer to provide a particular feedback. A customer may interact in a particular way with our web store (again observable). That customer interaction is likely also to be influenced by hidden processes and behaviors of the underlying system dynamics (e.g., user interface architecture, front-end interactions, back-end interactions, etc…).

Let’s add some observable “happiness” and “unhappiness” to our customer retention and life-cycle process from the beginning of this blog.

We have already defined the transition probability matrix above. We also need to to write the observable sentiments associated with states into a matrix form as well (this is called the emission matrix in Markovian).

To visualize this a bit better we observe;

The following code function provides the likelihood of observing a sequence of observable states (e.g., Happy, Happy, Not Happy, Happy, Not Happy, Not Happy, …) with a given hidden Markov chain describing the underlying process that may influence the observable states.

def hmm_find_prob(seq, T ,E, pi):
# seq  : The observable sequence.
# T    : Transition Probability matrix (hidden states).
# E    : Emission matrix (observable states).
# pi   : The steady-state of the hidden markov chain.

    start_state = seq[0]
    alpha = pi*E[:,start_state]

    for i in range(1, len(seq)):
        curr_state = seq[i]
        alpha = (alpha.dot(T))*E[:,curr_state]
        prob = sum(alpha)

    return prob

In order to estimate the likelihood of observable sentiment states (i.e., Happy and Not Happy) of our hidden Markov process (illustrated above) we need to code the matrices, identify the sentiment sequence we would like to access the likelihood of.

# Setting up the pre-requisites for a Hidden Markov Model

hidden_states = {
    0 : 'Conversion',
    1 : 'Retention',
    2 : 'Upsell',
    3 : 'Churn'

observable_states = {
    0: 'Happy',
    1: 'Sad'

# Note that even if the hidden and observable states are not directly used
# in this example, it is always good dicipline to write them out as they
# tie into both the transition and the emission matrix.

# Transition Probability Matrix (Hidden States)
T = np.array(
    [[0.00, 1.00, 0.00, 0.00],
     [0.00, 0.75, 0.20, 0.05],         # Note we assume 5% churn here!
     [0.00, 1.00, 0.00, 0.00],
     [0.00, 0.00, 0.00, 1.00]])

# Emission Matrix (Observable States)
E = np.array(
    [[1.0, 0.0],
     [0.7, 0.3],
     [0.8, 0.2],
     [0.0, 1.0]])

pi0 = np.array([1, 0, 0, 0])
pi = steady_state(pi0, T)          # Steady State of the underlying Markov chain.

seq = [0,0,0,0,0]       # Happy, Happy, Happy, Happy, Happy

print('Likelihood of having 5 consecutive happy experiences: ', np.round(100*hmm_find_prob(seq, T, E, pi),0), '%')

seq = [1,1,1,1,1]       # Not Happy, Not Happy, Not Happy, Not Happy, Not Happy

print('Likelihood of having 5 consecutive negative experiences: ', np.round(100*hmm_find_prob(seq, T, E, pi),0), '%')

output>> Likelihood of having 5 consecutive happy experiences:  4.0 %
output>> Likelihood of having 5 consecutive negative experiences:  78.0 %

We have to draw the conclusion that there is a much higher chance of having 5 consecutive negative experiences (78%) than 5 positive experiences (4%). This should not be too surprising as our steady state (long-term behavior) of the customer life-cycle process with 5% churn rate is 77%. As we assess that a customer in the churn state is 100% likely to be unhappy, it is clear that unhappiness would be the overwhelming expectations for the defined process. It should be noted that as an alternative, we could also run the decision process replacing monetary value with sentiment value and optimize our decision process for customer sentiment.

If I could lower my Churn rate, in my decision process, down to 1% (from 5%), increasing my Retention rate to 79% and keeping the Upsell rate at 20%, I would have 18% chance of 5 consecutive happy experiences and only 4% of 5 unhappy experiences.

It is not that difficult to see how these principles can be applied to many different settings in the digital domain interfacing with customers.

Wrapping it up.

In this blog, I have provided the reader with some insights into how to apply Markov Chains and Markov Decision Processes to important business applications. The Python code snippets you have met throughout this blog can directly be used in much more complex decision processes or deeper dives on Markov chains in general.

I have found it difficult to find reasonable comprehensive examples of how we can move from Markov chains (or models) to Markov Decision Processes as it applies to a data-driven business environment. You can find some good example on applied Markov Chains (see “For further study” below. I really recommend Normalized Nerd YouTube videos for an introduction). There is however very little material bridging the gap from Markov models to decision processes. While the math behind both Markov Chains and Markov Decision Processes are not very difficult, it is frequently presented in a way that makes it incomprehensible unless you have an advanced degree in mathematics. When you write it out for simpler Markov chains or decision processes you will see that it is not that difficult (I am still pondering on giving a few examples in this blog).

I recommend that you always run a couple of Monte Carlo simulations on your Markov chain and decision process. Comparing the outcome with the algorithmic approaches you might apply, such as steady-state state derivation, value and policy optimization, etc.. If your process contains absorbing states, be extra careful how that should be reflected in your overall process valuation and optimization. For example, if you consider churn in a customer lifecycle process, this would be an absorbing state and unless you are careful in your design on the underlying Markov chain, it may overshadow any other dynamics on your process. A Monte Carlo simulation might reveal such issues. Also, start simple, even if your target decision process may be very complex. This will allow you to understand and trust changes as you add complexity to your process.

I hope that my code snippets likewise will make this field more approachable. Anyway, if you want to deep dive into the math as well, you will find some good starting points in my literature list below.

“We should not fault an agent for not knowing something that matters, but only for having known something and then forgotten.”, The unremembered.


I greatly acknowledge my wife Eva Varadi for her support, patience and understanding during the creative process of writing this Blog. Also many of my Deutsche Telekom AG, T-Mobile NL & Industry colleagues in general have in countless of ways contributed to my thinking and ideas leading to this little Blog. Thank you!

For further study.

Romain Hollander, “On the policy iteration algorithm for PageRank Optimization”, MIT Report (June 2010).

Kim Kyllesbech Larsen, “Data-driven decision making … what’s not to like about that?”, LinkedIn Article (November 2021).

On Markov Chains in particular I recommend Normalized Nerd‘s lectures (super well done and easy to grasp, respect!). I recommend to have a Python notebook on the side and build up the lectures there. In any case if this is new to you start here; “Markov Chains Clearly Explained! Part – 1” (There are 7 parts in total).

Somnath Banerjee, “Real World Applications of Markov Decision Process”, towardsdatascience.com, (January 2021). Source of examples that can be worked out with the Python codes provided in this blog.

Ridhima Kumar, “Marketing Analytics through Markov Chain”, towardsdatascience.com, (January 2019). Source of examples that can be worked out with the Python codes provided in this blog.

Richard Bellman, “The theory of dynamic programming”, The RAND Corporation, P-550 (July 1954). A classic with the strength of providing a lot of intuition around value and policy iteration.

Dorsa Sadigh, Assistant Professor at Stanford University’s Computer Science Department, “Markov Decision Processes – Value Iteration | Stanford CS221”, (Autumn 2019).

Neil Walton, Reader at University of Manchester’s Department of Mathematics, “Algorithms for MDPs” (2018). Very good lectures on Markov Decision Processes and the algorithms used. They are somewhat mathematical but the examples and explanations given are really good (imo).

Rohan Jagtap, “Understanding Markov Decision Process (MDP)”, towardsdatascience.com, (September, 2020). Providing a intuitive as well as mathematical treatment of value iteration. For the mathematically inclined this is a very good treatment of the topic.

A. Aylin Tokuc, “Value iteration vs Policy iteration in Reinforcement learning”, (October 2021). Provides a nice and comprehensible overview of value and policy iteration.

Paul A. Gagniuc, “Markov Chains – From theory to implementation and experimentation“, Wiley, (2017). Lots of great examples of applied Markov Chains.

Richard J. Boucherie & Nico M. van Dijk, “Markov Decision Processes in Practice“, Springer (2017). Great book with many examples of MDP implementations.

Ankur Ankan & Abinash Pranda, “Hands -on Markov Models with Python“, Packt (2018). Provides many good ideas and inspiration of how to code Markov chains in Python.

Brian Hayes, “First Links in the Markov Chain”, American Scientist (March-April 2014). Provides a very easy to read and interesting account of Markov Chains.

Kim Kyllesbech Larsen, “MarkovChains-and-MDPs“, The Python code used for all examples in this blog, (December 2021). The link does require you to have a Github account.

Deep Dive – Markov chain & decision process fundamentals.

Andrei Andreevich Markov (1856-1922) developed his idea of states chained (or connected) by probabilities after his retirement at the old age of 50 (i.e., never too late to get brilliant ideas). This was at the turn of the 20th century. One of the most famous Markov chains, that we all make use of pretty much every day, is the pages of the world wide web with 1.5+ billion indexed pages as designated states and maybe more than 150+ billion links between those web pages which are equivalent to the Markov chain transitions taken us from one State (web page) to another State (another web page). Googles PageRank algorithm, for example, is build upon the fundamentals of Markov chains. The usefulness of Markov chains spans many many fields, e.g., physics, chemistry biology, information science/theory, game theory, decision theory, language theory, speech processing, communications networks, etc…

There are a few concepts that are important to keep in mind for Markov Chains and Markov Decision processes.


Environment: is the relevant space that the Markov chain operates in. E.g., could be the physical surroundings of a logistic storage facility where a robot is moving around.

State: A state is a set of variables describing a system that does not include anything about its history (the physics definition). E.g., in classical mechanics the state of a point mass is given by its position and its velocity vector (i.e., where it is and where it goes). It is good to keep in mind that the computer science meaning of state is different, in the sense that a stateful agent is designed to remember preceding events (i.e., it “remembers” its history). This is however not how a state for a Markov chain should be understood. A sequence (or chain) of random variables {S0, S1, S2, … , Sn}, describing a stochastic process, is said to have a Markov property if

that is, a future state of the stochastic process depends only the state immediately prior and on no other past states. To make the concept of state a bit more tangible, think of a simple customer life-cycle process with (only) 4 states considered; (S0) Conversion, (S1) Retention, (S2) Upsell and (S3) Churn. Thus, in Python we would define the states as a dictionary,

# Example: Customer life-cycle process (simple)
# Defining States

states = {
    0 : 'Conversion',
    1 : 'Retention',
    2 : 'Upsell',
    3 : 'Churn'

In our example, the states is a vector of dimension 4×1, either represented by S = (0 1 2 3) or alternatively S = (Conversion, Retention, Upsell, Churn). More generally, is n×1 vector for n states.

If a reward or penalty has been assigned to the end-state, that terminates your decision or reward process, it is worth being extra careful in your Markov chain design and respective transition probability matrix. You may want to introduce a zero-value end-state. Though, it will of course depend on the structure of the decision process you are attempting to capture with the Markov Chain.

Transition: Describes how a given state transition from one state s to another s’ (e.g, can be the same state) in a single unit of time increment. The associated (state) transition probability matrix T provides the probabilities of all state-to-state transitions for a Markov chain with a single unit of time. is square stochastic matrix defined by the number of states making up the Markov chain (i.e., for n states, is an n×n matrix). We write the state transition, facilitated by T, as:

s(t+1) = s(t) ∙per unit time step increment (iteration).

Action: an action a is defined as a choice or decision taken at the current unit of time (or iteration) that will trigger a transition from the current state into another state in the subsequent single unit of time. An action may be deterministic or random. The consequence of an action a, choice or decision, is described by the (state) transition matrix. Thus, the choice of an action is the same as a choice of a state transformation. The set of actions for a given Markov Chain is typically known in advance. Actions are typically associated with what is called a Markov Decision Process. Choosing an action at time t, in a given state transitioning to state s’, may result in a reward R(s, a, s’).

Policy: A policy represents the set (distribution) of actions associated with a given set of states (representing the Markov chain) and the respective (state) transition probability matrix. Think about a customer life-cycle process with two policies, (1) No churn remedies (or actions) and (2) Churn mitigating remedies (or actions). Policies can differ only slightly (i.e., different actions on a few states) or be substantially different. It is customary to denote a policy as πa | s), which is the math way of saying that our policy is a distribution of actions conditional to given states,

π is a function such that π : → A, with π( a | s) = PA(t) = a | S(t) = s ].

A policy, strategy or plan, specifies the set of rules governing the decision that will be used at every unit time increment.

Reward: Is defined for a given state s and is the expected reward value over all possible states that one can transition to from a given state. A reward can also be associated with a given action a (and thus may also be different for different policies π). The reward is received in state s subject to action transitioning into state s’ (which can be the same state as s). Thus, we can write the reward as R(sas’) or in case the reward is independent of the state that is transitioned to, R(sa).

The concept of reward is important in so called Markov Reward Processes and essential to the Markov Decision Process. It is customary (and good for convergence as well) to introduce a reward discount factor 0 ≤ γ ≤ 1 that discounts future rewards with γ^t. Essentially attributing less value (or reward) to events in the future (making the present more important). A positive reward can be seen as an income and a negative reward as a cost.

Thus, a Markov Chain is defined by (ST)-tuple, where S are the states and T the (state) transition probability matrix facilitating the state transition. And a Markov Reward Process is thus defined by (STR, γ)-tuple with the addition of R representing the rewards associated with the states and γ the discount factor. Finally, a Markov Decision Process can be defined by (S, ATR, γ)-tuple, with A representing the actions associated with the respective states.

The Markov Chain.

The conditional probability of being in a given state S at time t+1 (i.e., S(t+1)) given all the previous states {S(t=0), S(t=1), …, S(t=t)} is equal to the conditional probability of state S(t+1) only considering (conditioned upon) the immediate previous state S(t),

∀ S(t) ∊ Ω is a given state at time t that belong to the environment Ω the Markov chain exist in.

In other words, the state your system is in now S(t) only depends only on the previous state you where in one unit time step ago S(t-1). All other past states have no influence on your present state. Or said in another way, the future only depends on what happens now not what happened prior.

TS(t) = i → S(t+1) = j, with the transition likelihood of p_ij = P[S(t+1) = j | S(t) = i ] representing the probability of transitioning from state i to state j upon a given action a taken in state i. We will regard the T as a (n × n) transition matrix, describing how states map to each other.

Where the rows represent States and the column where a state may be mapped to. Moreover, as we deal with probabilities, each row needs to add up to 1, e.g.,

Let’s simplify a bit by considering a 4-state Markov chain;

with the following Markov chain 4-state example,

with the following transition probability matrix T,

From the above illustration we have that our states (i,j) ∈ {Conversion (0), Retention (1), Upsell (2), Churned (3)}. Thus, T(1,1) = 0.75 is the probability that an action in the Retention state results in ending up in the same Retention state. Note that the first row (first column) is designated 0, second row (column) 1, etc.. As we sum the 2nd row T(1, 0 → 3) we get 1 (i.e., 0.00 + 0.75 + 0.20 + 0.05 = 1) as we require.

Let us consider the following initial condition at time t = 0 for the above Markov model,

s0 = ( 1 0 0 0 ) we are starting out in the Conversion (initial) state s0.

s1 = s0 T = ( 0 1 0 0 ), at first time step (iteration) we end up in the Retention state.

s2 = s1 ∙T = s0 ∙T∙T = s0 ∙T^2 = ( 0.00 0.75 0.20 0.05 ). So already in 2nd time step (iteration) we have 75% likelihood of again ending up in the Retention state, 20% of ending up in the Upsell state as well as 5% chance that our customer Churn and thus ends the Markov process.

s3 = s2 ∙T = s0 ∙T∙T∙T s0 ∙T^3 =( 0.00 0.76 0.15 0.09 )

s10 = s9 ∙T = s0 ∙T^10 = ( 0.00 0.56 0.12 0.32)

s36 = s35 ∙T = s0 ∙T^36 = ( 0.00 0.19 0.04 0.77 )

Eventually, our overall Markov chain will reach steady state and ∙ T = s. It is common to use π for the Markov chain steady-state. Thus, we will frequently see π ∙ T = π, reflecting that steady state has been reached (usually within some level of pre-defined accuracy). To avoid confusion with policy mapping, which is often also described by π, I prefer to use π∞ to denote that a steady-state state has been reached.

Within a pre-set accuracy requirement of ε < 0.01, we have that s36 ≈ steady-state s-state and thus s36 ≈ s36.

It should be noted (and easy to verify) that introducing a 5th End-state (i.e., splitting up the churn-and-end-state into two states) in our example, will not change the steady-state outcome except for breaking up the churn’s steady-state value (from the 4-state steady-state analysis) into two values with their sum being equal to the 4-state churn value.

Value Iteration.

We start out with a Markov chain characterized by (S,T)-tuple that describes the backbone of our decision process. We have the option to add actions (e.g., can be a single action as well) and associate reward with the respective states and actions in our Markov chain. Thus, we expand the description of our Markov chain to that of a Markov Decision Process (MDP), that is (SATR, γ)-tuple (or for a Markov Reward Process (STR, γ)-tuple), with γ being the discount factor (0 ≤ γ ≤ 1). Rohan Jagtap in his “Understanding Markov Decision Process (MDP)” has written a great, intuitive and very assessable account of the mathematical details of MRPs and MDPs. Really a recommended reading.

We have been given a magic coin that always ends up at the opposite face of the previous coin flip, e.g., Head → Tail → Head → Tail → etc.. Thus we are dealing with a 2-state process with period cycling between the two states (i.e., after 2 tosses we are back at the at the previous face). Each state with probability 1 of transitioning to the other. Also, we are given a reward of +2 (R(H))when we are transitioning into the Head-state (S0) and a reward of +1 (R(T)) when we are transitioning into the Tail-state (S1). We have thus 2 initial conditions (a) starting with Head and (b) starting with Tail.

How does the long-run (i.e., steady-state) expected value for each of the two states H & T develop over time?

(a) Assume our magic coin’s first face is Head (H), this earns us a reward of R(H) = +2. At the next unit time step we end up in Tail (T) with probability 1 (= P[T|H)) and reward of R(T) = +1. Next step we are back in Head with probability 1 (=P(H|T)), and so forth. The future value we may choose to discount with γ (and if γ less than 1, it even guaranty that the value converges). For (b) interchange, interchange H with T (and of course rewards accordingly).

It is good to keep in mind that the reward is banked when in the state, after the transitioning into it from the previous state. The value accrued over time at a given state, is the present reward R(s) as well as the expected (discounted) reward for the subsequent states. It is customary to start out with zero value states at t=0. Though, one could also choose to use the reward vector instead to initialize the value of the states. So, here it goes,

Alright, no, I did not sum all the way infinite (I wouldn’t have finished yet). I “cheated” and used the ‘mdp_valueIteration()’ function;

# Import own Markov chain (MC) & Markov Decision Process (MDP) library
import mcmdp_v2 as mdp

# States
states = {
    0 : 'Head',
    1 : 'Tail'

# Transition Matrix - Magic Coin
T = np.array([[0.00, 1.00],
              [1.00, 0.00]])

# Reward Matrix - Magic Coin
R = np.array([[2], 

pi = np.array([1, 0,]) # Initial state, could also be [0, 1].

# Define the markov chain mc for the MDP value iteration.
mc = mdp.Mdp(states = states, pi = pi, T = T, R = R, gamma = 0.9, epsilon = 0.01)

state_values, expected_total_value, policy, ctime = mc.mdp_valueIteration() # Value iteration on mc

print('Long-run state value V[H]   :', np.round(state_values[0],1))
print('Long-run state value V[T]   :', np.round(state_values[1],1)) 

output>> Long-run state value V[H]   : 15.2
output>> Long-run state value V[T]   : 14.7

In general, we have the following value iteration algorithms representing the state-value function as we iterate over time (i),

With [1] formula describing a general MDP algorithm. Formula [2] is an MDP where the state reward function is independent of actions and subsequent state s’, and formula [3] describes a Markov Reward Process, where the reward function R is independent of the subsequent state s’. In order to get the value iteration started it is customary to begin with an initial condition (i.e., i = 0) of V_0 = 0 ∀ s ∊ S, e.g., for a 5-state process V_0 = [0, 0, 0, 0, 0] at i = 0, that is the initial value of all states in the Markov chain is set to zero.

The long-run steady-state state values are the out come of iterating the above formulas [1 – 3] until the state values are no longer changing (within a pre-determined level of accuracy). We can write the long-run steady-state values as,

with V[Sj] is the j-th state’s steady-state value and n is the number of states in the underlying Markov chain representing the MDP (or MRP for that matter).

The long-run average (overall ) value G in steady-state is

where V∞[S] is the steady-state value vector that the value iteration provided us with. π∞ is the decision process’s underlying Markov chain’s steady-state state.

One of the simpler examples to look at would be a “coin toss” process. In order to make it a bit more interesting, lets consider a unfair-ish coin to toss around. In the first example immediately below, we assume to have only 1 action and that the state rewards only depends on the state itself. Thus, we are in Formula [3] situation above. How we go around the above value-iteration algorithm is illustrated below,

Let us have another look at our customer life-cycle process. We would like to have a better appreciation of the value of each state in the decision-making process. The value iteration approach is provided in the illustration below,

Coding reference.

Kim Kyllesbech Larsen, “MarkovChains-and-MDPs“, The Python code used for all examples in this blog, (December 2021).