Constructing languages with Markov chains

I really don’t know if I should put this under “coding” or “gaming”. Like a good number of things I enjoy immensely, it’s a combination of two equally geeky subjects.


The process of building a constructed language (or “conlanging”, as it’s also known) is something that I consider integral to the whole world-building process. It’s easy enough to throw down some consonants, put a couple of apostrophes, exclamation points, or dashes in the mix, and call it a day, but that will just leave you with a hodgepodge of badly-formed words with no unifying character, like you’ve just raided the remains of your Scrabble game.

A far better approach, in terms of continuity at least, is to build a consistent set of rules by which your places are named. This can be as simple as finding a real language and basing your names off of those, or you can pretty much go all out.

A lot of the time, I find myself making up worlds for RPGs - either one-off games or campaigns. It’s one of the hazards of running games. Sometimes I have a strong cultural basis on which to name my places, but sometimes I have fantasy races which are by default disconnected from the usual array of cultures we find on earth. I’ve recently found several of these in the development of Blueshift, my ongoing RPG campaign world, which has a number of vaguely fantastic races which at best echo traits of Terrestrial cultures. I could make up names as before – by picking sounds out of the air and stringing them together – but I generally find this approach unsatisfying, and in addition it’s much harder to build up some sort of cultural identity when all of your names are randomly generated. The solution is to build up a small conlang framework for each of these species, which can in turn regularly produce a number of different names for places, people and events.

Just enough language

As I’ve mentioned, you can really go overboard with conlanging, and sometimes this is fun. If you want you can work out entirely new grammar structures, tenses, moods, genders, scripts, and so forth. The experience of creating a constructed language is incredibly educational, I find: I’ve learned all sorts of things about the IPA, abjads, and linguistic reduplication through conlanging, as well as getting stuck on Wikipedia pages until ridiculous hours reading up on the obscure language features of, for example, ancient Mayan. As far as hobbies go, there’s a lot worse you can do with your brain.

However, if you’re trying to produce a number of languages at once, this “Waterfall method”-esque approach probably isn’t the best way to attack your problem. I find that concentrating too hard on one aspect of the world often means I leave the rest of it “to be filled in later”, which often means “never”. In the meantime, I can’t put names to faces or places until I have a language, and if I’m still working my way through first-person verb modifiers, I’m not addressing what words sound like.

The problem here, I think, is the divide between what I’m doing and what others see. No one is going to speak in my constructed languages, because they’re not there for people to learn to use in-game. They’re there to give each race a realised cultural identity, and as such the minutiae of correct grammar are fun but not necessary.

The solution is to retool the conlanging workflow, to focus on high-yield techniques that give you what you can use, and not much more. The solution should be:

  1. Fast: ideally, I could spend ten minutes on it and be done, at least until I revisit and add more detail.
  2. Flexible: as I detail other parts of the world, I may wish to modify my conlang slightly to deal with changes.
  3. Predictive: the ultimate goal is to generate names and words, so my conlang had better do that.

After thinking about this one night, I decided that I could make a pretty good conlang word generator using maths.

Introducing: Markov chains

A Markov chain is a series of states that you shuttle between based on your current state. Here’s an example:

A three-state Markov chain

In this situation, when I’m in state A I have a 30% chance of going to State B, a 40% chance of going to State C, and a 30% chance of staying on State A. I have different probabilities of moving if I end up on other states. One of the important things here is that these probabilities are not affected by my history - it doesn’t matter if I reached A from B or C, I’ve still got a 40% chance of heading to C on my next “turn”.

The nice thing about Markov chains is that given a sample document to use as a template, you can generate “real-looking” text. If you break down your states into letters, you can also use it to generate “real-looking” words.

How it works

First, you start off with a sample document. This document gives us our state transition frequencies or what-have-you. For this example I’m going to generate some pseudo-Latin from the first sixty-nine words of the “Lorem ipsum” filler text:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

This is how our Markov chain word generator will work:

  1. Pick a letter to start with.
  2. Work out the next letter to add to the word (the next “state”) based on the last letter we added; or finish the word.
  3. GOTO 2.

That’s pretty much all it does. Let’s run an example.

First, we need to pick the letter to start with. In this case I’ll look through the sample text to determine what letters words start with:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Letter Frequency
a7
c6
d7
e11
f1
i7
l4
m3
n4
o2
p2
q2
r1
s4
t1
u4
v3

This means, for example, there’s a 4/69 or ~5.8% chance that our word should start with an s. Using a random number generator, I pick one of these frequencies and get the letter c.

Next, we need to work out the probability of moving from the c state to another state. We do this by re-analysing the original text for every instance of the letter c and seeing what comes after it:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

We build up a similar frequency table to before:

Letter Frequency
a2
c1
e1
i5
o4
t1
u2

Sixteen total occurrences, including the chance to loop back to the c state. Random number generation gives me an i as the next letter. Our word so far is:

ci

Now we repeat the step, finding all instances of the letter i in our sample text to determine the next state (which in this case is a d), then looking for all ds to find the state after that (an o in this case):

cido

The question is: how do we end this word? When we look through our sample text, we may find that some letters end a word - for example, there are 22 rs in the text, and 6 of them end the word. In this case, whenever we run across an r, there’s a 6/22 (or 27.2%) chance that this is the last letter in our word.

Improving the Chain

So far we have a pretty good method for generating words. These words will have about the right frequency of letters given our sample document, and they’ll also have about the right series of letters: if you have lots of words ending in c in your sample document, the words you generate by this method will probably end in c, and so forth.

However, it turns out you can do a little bit more than what we’ve done above. In order to make your new words sounds more authentic, you can make your algorithm remember a bit of its history.

Previously, we’ve been using the current letter of our new word as the state to use in our Markov Chain. For example, consider the word “streets”. If we built this word by shuttling between states in a Markov Chain, our trip would look like this:

Traversing states to spell “streets” using a Markov chain.

Note here that each state is a single letter. Now let’s consider a Markov Chain algorithm that considers not just the current letter, but the previous letter as well. In effect, our states change:

Spelling “streets” using two-letter states.

Our first state is still one-letter, but after that our states are more complex. This means:

  1. We suddenly need to do a lot more work on the analysis end. Assuming we care only about the twenty-six different letters of the alphabet, there’s a grand total of 26 one-letter states, but 702 one- and two-letter states.
  2. This, in turn, means that we need a larger sample document, if we want generated words to sound like the original language.
  3. Our generated words are less original than those created using one-letter-state Markov Chain algorithms, but they also mimic the sound of the sample document more closely.

As an example, let’s collect twenty words generated from the above Lorem ipsum text, using either a one- or two-letter-state Markov Chain algorithm:

One-letter Two-letter
it esseriameturehendenim
rolla si
d pre
quriam estrurempore
s etur
esuru si
dod quatat
reruremem lore
fffinintioise laborureptatet
vex officit
ocom dollabore
upret nullum
es sin
umomoredoriadonia re
isisiquatetiquabo ulpa
amm anis
uidete qua
monimad elitature
n enia
lladorun co

You can see that the one-letter-state algorithm generates some new words, but also some garbage; the two-letter-state algorithm generates (mostly!) saner words which look a bit more like the faux-Latin we started with. If we try to use a three-letter-state algorithm, we can’t actually generate any new words: everything we generate is a mirror of one of the words that already exists in the sample document.

Using this in conlangs

Now we’ve had an example of how the Markov Chain word generation algorithm works, the question remains: how do we use this to generate a constructed language?

The answer is to generate some sort of sample document, made up of words in the constructed language, and work from there. This is the hard part. The sample document needs to be quite long: long enough that the program can generate new words based on combinations present in your old ones. However, once you have a sample document, you can use these combinations of letters to generate as many words as you need.

You don’t even need to take the above and implement it as your own program: you can do that right here. Just put your sample words into a plaintext document, and let the program do the hard work.