Markov Chains

Markov chains are reasonably common on the internet these days: they're often used as a way of generating random data that looks similar, at least in a sense, to some existing data set. A lot of the great examples of Markov chains come from combining two very different data sets and building a single Markov model from them.

I also have my own Markov bot, which right now is trained just on my own tweets[1], and I might eventually decide to add my prose writing as a secondary source of data. It produces a few wonderful tweets and a whole lot of nonsensical crap.

I've just added a little feature which helps to visualize what's going on under the surface, so I wanted to explain how Markov chains work at a high level first, and then show how @guwan_aisamanra now visualizes the underlying mechanisms at work.

What is a Markov Chain?

There are various ways of thinking about Markov chains, but underlying them is the kind of abstract notion of steps in a sequence. You can think of these steps as being actions over time, or as words in a sentence, or as abstract symbols in any kind of sequence of symbols. It doesn't matter what: just a bunch of things one after another, in time or in space.

In many sequences like these, the next step might be random, but the odds of choosing a particular next step are often related in some way to the steps that came before. Let's take a simple contrived example: the capricious chef at your local cafeteria. Every day, the chef chooses one of three meal options: tacos, pasta, or soup. The chef chooses the next meal randomly, but is not allowed to repeat the same meal two days in a row. The chef is also inordinately fond of tacos[2] and is twice as likely to pick tacos as any other option, but still won't make tacos two days in a row. A possible sequence of meal choices might go: “tacos, pasta, tacos, soup, pasta, tacos, pasta, tacos...”

This example has a very particular property: the choice of each step depends only on the previous step, and not on any of the steps before that. If the last lunch was tacos, then we have equal chances of choosing pasta or soup as the next lunch, because the chef has no preference between the two. On the other hand, if the last lunch was pasta or soup, then we have a two-thirds chance of choosing tacos as the next lunch and a one-third chance of choosing the other non-taco option, because the chef is allowed to make tacos and is twice as likely to pick them as anything else. We can write a table that conveys the odds of a given meal choice given yesterday's meal choice:

yesterday's lunch   today's lunch   probability
tacos               pasta           ½
tacos               soup            ½
pasta               tacos           ⅔
pasta               soup            ⅓
soup                tacos           ⅔
soup                pasta           ⅓

The mathematical way of saying that the probability of a given lunch choice doesn't depend on any history of lunches except for the most recent one is that our sequence of lunches has the Markov property.
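
In symbols: if X_1, X_2, X_3, ... is our sequence of lunches, the Markov property says that conditioning on the entire history of the sequence is the same as conditioning on just the most recent step:

    P(X_{n+1} = x \mid X_n, X_{n-1}, \ldots, X_1) = P(X_{n+1} = x \mid X_n)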

If some sequence of things (words, mathematical objects, lunches, whatever) has the Markov property, then we can build up a mathematical model of the underlying process. One convenient way of visualizing this model is as a graph with directed edges: each step is a node, and each directed edge between nodes is tagged with the probability of picking that step next. The graph of the lunch pattern I described would have three nodes, one per meal, with the edges between them weighted by the probabilities from the table above.

We can come up with a probable sequence of lunches by starting in one state and then following the edges from state to state, choosing each next edge according to the provided probabilities.

Implementing Markov Models

In software, we can implement a directed graph as a table that maps steps to a set of possible next steps, where the possible next steps are randomly chosen according to some set of probabilities. For simplicity in this implementation, I'm going to treat the set of next steps as a multiset from which we can choose any element with equal probability: if we choose at random from the multiset {tacos, tacos, pasta}, then we'll get tacos two-thirds of the time and pasta one-third of the time. This is a terrible representation in practice (for certain situations, it can use a ridiculously wasteful amount of space) but it makes the code in this blog post short and easy!
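
If you want to convince yourself of that two-thirds/one-third split, a quick simulation does the trick. (The exact counts shown in the comment are made up for illustration and will vary from run to run.)

import random
from collections import Counter

# draw lots of samples from the multiset and tally them up;
# tacos should show up roughly twice as often as pasta
samples = [random.choice(['tacos', 'tacos', 'pasta'])
           for _ in range(10000)]
print(Counter(samples))
# e.g. Counter({'tacos': 6653, 'pasta': 3347})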

Let's implement a basic, inefficient-but-working Python version of a Markov model! We can represent our model in Python using a plain dictionary, with strings as our step names, and lists as our possible next steps:

model = {
  'tacos': ['pasta', 'soup'],
  # both of the following have double the chance
  # of choosing tacos
  'pasta': ['tacos', 'tacos', 'soup'],
  'soup':  ['tacos', 'tacos', 'pasta']
}

If we want to generate a sequence of lunches with a particular length, we can do so by starting with a random starting lunch, and then choosing our next lunch randomly according to the set of possible “next lunches” provided by the model:

import random

def generate_sequence(model, length=10):
    # choose our initial state randomly (random.choice
    # needs a list, not a dict keys view)
    state = random.choice(list(model.keys()))
    # our output sequence starts empty
    output = []

    # we loop once for each step of the desired length
    for _ in range(length):
        # each time, we add the current state
        # to the sequence
        output.append(state)
        # if the current state has no known next steps,
        # we have no edge to follow and must stop early
        if state not in model:
            break
        # otherwise, choose a new state randomly from
        # the set of possible next states
        state = random.choice(model[state])

    # once we've gotten as many as we want, we
    # can return our sequence
    return output

If we run this model several times, we'll get various random sequences that conform to the patterns we'd expect:

>>> generate_sequence(model, length=5)
['pasta', 'tacos', 'soup', 'tacos', 'soup']
>>> generate_sequence(model, length=5)
['soup', 'tacos', 'soup', 'tacos', 'soup']
>>> generate_sequence(model, length=5)
['tacos', 'pasta', 'soup', 'tacos', 'pasta']
>>> generate_sequence(model, length=5)
['tacos', 'soup', 'tacos', 'soup', 'pasta']

In the example we've been using, our model was based on perfect mathematical knowledge of the underlying problem domain: we know exactly the logic the chef uses to determine the next lunch. In most cases, however, we don't have that kind of knowledge, so instead we want to calculate a model based on some set of existing observed data. Using the same representation of a Markov model, we can take a sequence of any kind of thing and build a model out of it by looking at each pair of adjacent elements in the sequence, and building up the model accordingly[3]:

def build_model(sequence):
    # our empty model knows nothing about anything
    model = {}

    # walking along the sequence, taking each step
    # as well as its subsequent step...
    for step, next_step in zip(sequence, sequence[1:]):

        # make sure that step has an entry in our model
        # data structure:
        if step not in model:
            model[step] = []

        # add the next_step to the list of possible next
        # steps
        model[step].append(next_step)

    return model
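
As a sanity check, feeding build_model a short (made-up) lunch history gives back the same shape of model we wrote by hand earlier:

>>> build_model(['tacos', 'pasta', 'tacos', 'soup', 'tacos'])
{'tacos': ['pasta', 'soup'], 'pasta': ['tacos'], 'soup': ['tacos']}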

Now we can build a Markov model from a sequence of data even if we don't know what probabilistic relationships hold in that data set. This lets us take data we've found “in the wild” and build a model that resembles the raw data! Well, in a limited sense, as we'll see.

Markov Models of Non-Markov Sequences

It turns out that English text (both on the level of individual letters, and on the level of sentences) doesn't have the Markov property, because the choice of “next word” depends on a lot more than just the previous word: grammatical structures, semantic context, and so forth. Building and using a Markov model from raw English text (in the examples below, I'm using the Project Gutenberg text of Alice in Wonderland, converted to lowercase and stripped of everything but spaces and alphabetic characters) produces some nonsense that's English-like from a distance but still mostly gibberish:

>>> alice = open('alice-in-wonderland.txt').read().lower()
>>> model = build_model([ch for ch in alice if ch.isalpha() or ch == ' '])
>>> ''.join(generate_sequence(model))
'xpr yoube '
>>> ''.join(generate_sequence(model))
'be che sof'
>>> ''.join(generate_sequence(model))
'ver athean'
>>> ''.join(generate_sequence(model))
'k wanigroo'
>>> ''.join(generate_sequence(model))
'jeado sshe'
>>> ''.join(generate_sequence(model))
'f tond mpo'

Similarly, a Markov chain run over the sequence of words in the text, rather than the characters, produces strings of English words which lack grammar or real sense:

>>> model = build_model(alice.split())
>>> ' '.join(generate_sequence(model))
'bat, and said alice, she wandered about in an opportunity'
>>> ' '.join(generate_sequence(model))
'coaxing tone, going to leave it had lost something; and'
>>> ' '.join(generate_sequence(model))
'oneself speak--and they are removed. of it; and then the'

Why does it produce such terrible nonsense? Well, because Markov chains have no memory! Each pair of words in sequence is sensible, but higher-level relationships like sentence structure don't get captured in the model, because they rely on larger amounts of context. Let's take a closer look at a four-word sequence generated by the Alice in Wonderland chain:

escape; so much evidence

We can look at it as being composed of three overlapping sequential pairs of words, where each ordered pair of words must have appeared at least once in the source text, or else the model would not have provided a path between them. The first pair is escape; so, and it turns out that this word pair appears only once in the source text, in Chapter IV:

This seemed to Alice a good opportunity for making her escape; so she set off at once...

The pair so much appears at least nine times (it is a very likely pair of words in English more generally, so this makes sense!). I'll choose my favorite sentence in which that pair appears, just as an arbitrary example:

'Curiouser and curiouser!' cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good English)...

And the word pair much evidence appears only once, in Chapter XI:

Alice watched the White Rabbit as he fumbled over the list, feeling very curious to see what the next witness would be like, '—for they haven't got much evidence YET,' she said to herself.

In each case, training the Markov model produced the knowledge that these words appear in sequence somewhere in the source text, so it was okay for them to appear together in the output:

making her ESCAPE; SO she set off at once, and ran till s
ied alice (she was SO MUCH surprised, that for the moment
-for they haven't got MUCH EVIDENCE yet,' she said to her
           ESCAPE; SO MUCH EVIDENCE

But it's not really a full sentence, even if it seems more realistic than some of the others produced. Because we're choosing each word based only on the previous word, the model doesn't know about things like clauses or verbs or sentences, just word-pair frequency. If our training text contains a sentence like

behold the dishwasher, the greatest invention of mankind

then it might learn that it's okay to follow “the” with “dishwasher,” and also to follow “dishwasher,” with “the”, and produce a nonsensical sequence like this:

the dishwasher, the dishwasher, the dishwasher, the dishwasher,
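
We can watch this happen with the build_model function from earlier; trained on that single sentence, it learns exactly the two-way connection described above:

>>> model = build_model('behold the dishwasher, the greatest invention of mankind'.split())
>>> model['the']
['dishwasher,', 'greatest']
>>> model['dishwasher,']
['the']

Once a random walk reaches “dishwasher,” it has no choice but to go back to “the”, and from there it has a one-in-two chance of picking “dishwasher,” again, so the loop above is a very likely output.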

(There's an even sillier example that arose from my own Markov bot: I once tweeted something that included the phrase “stabby stabby”, and consequently my Markov model learned that it was perfectly fine to follow the word “stabby” with “stabby”.)

Visualizing Markov Output Directly

It's sometimes very interesting and illustrative to compare the output of a Markov chain with the input provided to that Markov chain, because you start to see the places where the model “learned” particular patterns. To draw an example from my own bot: earlier today it tweeted a short phrase:

bad painter? it's happened in it

My bot is programmed to only start with words which appear at the start of a tweet elsewhere, so it chose to start with the word bad because it came at the start of this old tweet:

Bad idea: double-shot of espresso added to drip coffee (a “redeye”). OTOH now I can see infinity. So I got that going for me. Which is nice.

The word bad also appears in this more recent tweet, where it borrowed several pairs of words in sequence:

Did you know that in addition to being a bad programmer and a bad-opinion-haver, I am also a bad painter? It's true!

The word it's also appears in this tweet where it is followed by the word happened:

It usually happens with weird dream-like books. It's happened before with Dunsany and MacDonald. Last night, it was an R. A. Lafferty novel.

And the word happened also appears in this tweet, followed by a few more words:

Last time I had a fever, I watched all of the Tron cartoon. A+++ would feverishly watch again. ...no idea what happened in it.

In this case, this isn't just a reconstruction of what might have influenced the model, like in the Alice example above. To facilitate exactly this kind of introspection of Markov chain output, I've modified my own Markov bot to record not just which words it selects, but also where it learned the relationship between each pair of words. That data is presented in a human-readable form on a new web page generated for each tweet, and every new tweet will be included on a chronological index page.
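
The bot's actual code isn't reproduced here, but the core idea can be sketched as a small variant of build_model: every possible next step is stored alongside an identifier for the source it was learned from. (The function names and the source-ID scheme here are stand-ins of my own, not the bot's real interface.)

import random

def build_model_with_sources(sequences):
    # like build_model, but each sequence comes tagged with
    # a source ID (say, the ID of the tweet it came from),
    # and every learned pair remembers that ID
    model = {}
    for source_id, sequence in sequences:
        for step, next_step in zip(sequence, sequence[1:]):
            if step not in model:
                model[step] = []
            model[step].append((next_step, source_id))
    return model

def generate_with_sources(model, start, length=10):
    # walk the model as before, but record which source
    # justified each transition so it can be shown later
    output, provenance = [start], []
    state = start
    for _ in range(length - 1):
        if state not in model:
            break
        next_state, source_id = random.choice(model[state])
        provenance.append((state, next_state, source_id))
        output.append(next_state)
        state = next_state
    return output, provenance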

A Markov model is an almost embarrassingly simple example of machine learning, but it's a good, approachable example to start with. It's also a great object lesson in the way that machine learning can only pick up on trends that appear in its training data, and in the way that a model might not be nuanced enough to pick up on every kind of pattern.


[1] They are very bad tweets.
[2] And who isn't?
[3] Again, this is a terrible representation for the list of possible next steps: a better representation might be a set of pairs, each pairing a possible next step with the number of times it occurred in that context.
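
For the curious, the better representation that footnote 3 alludes to might look something like this sketch, which stores counts instead of duplicated list entries and samples with weights:

from collections import Counter
import random

def build_weighted_model(sequence):
    # count how many times each next step follows each
    # step, instead of storing one list entry per occurrence
    model = {}
    for step, next_step in zip(sequence, sequence[1:]):
        if step not in model:
            model[step] = Counter()
        model[step][next_step] += 1
    return model

def choose_next(model, state):
    # pick a next step with probability proportional to how
    # often it followed this step in the training data
    counts = model[state]
    return random.choices(list(counts.keys()),
                          weights=list(counts.values()))[0]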