Infinite Lorem Ipsum with Markov chains

Monday, December 16, 2013

Summary: You can train a Markov chain on Latin text and build an infinite Lorem Ipsum string.

There are hundreds of websites, apps, plugins, and test pages using the semi-coherent, utterly incomprehensible, and much loved Lorem Ipsum text. After using it countless times to fill gaps in preliminary designs, I took a moment, in the middle of a late-night coding session, to stare into it, wondering if I too could build something of such infinite semi-coherence.

According to a Straight Dope post, the main source for the Lorem Ipsum Wikipedia article, the semi-coherent text is based on selections from Cicero’s De finibus bonorum et malorum. Since my goal was an infinite string of semi-coherent text, the simplest solution was to train a Markov chain on Cicero’s text. Then I could lazily pull as many words as I needed. The generated words are still valid Latin, but letter-order scrambling could be applied to them afterward.
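The letter-order scrambling mentioned above could be sketched roughly as follows. `scramble-word` is a hypothetical helper name, not something taken from the original code, and shuffling every letter uniformly is only one assumption about what scrambling means here.

```clojure
;; Hypothetical helper: shuffles the letters of a single word.
(defn scramble-word [word]
  ;; `vec` turns the string into a vector of characters, which `shuffle` accepts.
  (apply str (shuffle (vec word))))

;; `map` is lazy, so scrambling can sit on top of an endless word sequence.
;; `cycle` stands in here for the Markov-generated words.
(take 3 (map scramble-word (cycle ["lorem" "ipsum" "dolor"])))
```

The output is random, but each result keeps exactly the letters of its source word.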

I first became aware of Markov chains for text synthesis through Garkov, a Garfield comic strip where the characters’ dialog is replaced by the output of a probabilistic model trained on genuine Garfield comic strips. I had never had a use for Markov chains in normal projects. Most of my efforts involve finding libraries and gluing them together (PostModern Programming style), and I had never found a way to sneak such a fun semi-coherent text generator into a project.

The Wikipedia page does a great job explaining the details behind Markov chains, so I will only include the following images visualizing the training and text synthesis. In this case I used the letters in the string “at noon you can” to train the model, and generated the string “at nou”.


[Image: Generated Model]

[Image: Generated Text]
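For readers without the images, the letter-level training and synthesis can be approximated with the minimal sketch below. The names `build-model`, `pick-next`, and `generate` are assumptions made for illustration, as is the choice to store the model as a map from each state to a frequency map of its successors. It is not necessarily how the original code was written.

```clojure
(defn build-model
  "Maps each state to a frequency map of the states that followed it."
  [items]
  (reduce (fn [model [current following]]
            (update-in model [current following] (fnil inc 0)))
          {}
          (partition 2 1 items)))

(defn pick-next
  "Picks a successor at random, weighted by how often it was seen."
  [freqs]
  (rand-nth (vec (mapcat (fn [[state n]] (repeat n state)) freqs))))

(defn generate
  "Lazy sequence of states beginning at `start`. It ends only if a state
  with no recorded successor is reached."
  [model start]
  (take-while (complement nil?)
              (iterate (fn [state]
                         (when-let [freqs (get model state)]
                           (pick-next freqs)))
                       start)))

;; Training on letters: the string is walked as a sequence of characters.
(def model (build-model "at noon you can"))

(get model \o)   ; => {\o 1, \n 1, \u 1}

;; Pull six letters lazily. The result is random; "at nou" is one possibility.
(apply str (take 6 (generate model \a)))
```

Because `generate` is lazy, `take` decides how much text is produced. With this training string every letter has at least one successor, so the sequence never runs dry.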

The function that generates the probability hash-map relies on Clojure’s treatment of strings as a seqable collection of characters, and on hash-map keys being of nearly arbitrary type. So the code works on a seqable collection of strings without modification.
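To illustrate the point, here is a minimal sketch in which one model-building function is trained first on a string and then on a vector of words. `build-model` is an assumed name and shape rather than the original function, and the word list is a tiny made-up sample, not Cicero’s text.

```clojure
(require '[clojure.string :as string])

;; Assumed shape of the model: state -> {successor -> count}.
(defn build-model [items]
  (reduce (fn [model [current following]]
            (update-in model [current following] (fnil inc 0)))
          {}
          (partition 2 1 items)))

;; A string is seqable, so the states (and hash-map keys) are characters.
(build-model "noon")
;; => {\n {\o 1}, \o {\o 1, \n 1}}

;; A vector of strings is seqable too, so the states are whole words.
(def words (string/split "lorem ipsum dolor sit amet ipsum dolor lorem" #"\s+"))

(build-model words)
;; => {"lorem" {"ipsum" 1}, "ipsum" {"dolor" 2}, "dolor" {"sit" 1, "lorem" 1},
;;     "sit" {"amet" 1}, "amet" {"ipsum" 1}}
```

Nothing in `build-model` inspects the items themselves. `partition` only needs something seqable, and the hash-map only needs keys it can hash and compare.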

The probability hash-map is limited only by the types allowed as hash-map keys, so numbers, objects, other collections, and functions will also work.

Note: As long as the functions in the training list can be chained in order without error, the generated model only emits function pairs that appeared in training, which prevents invalid function chaining.
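A small sketch of what probabilistic function chaining might look like is shown below. The names `build-model` and `function-chain` and the training list are all assumptions chosen for illustration. The sketch also assumes every function in the model has at least one recorded successor, which holds for this training list.

```clojure
;; Assumed model shape: state -> vector of observed successors.
;; Repeats are kept, so rand-nth samples by observed frequency.
(defn build-model [items]
  (reduce (fn [model [current following]]
            (update-in model [current] (fnil conj []) following))
          {}
          (partition 2 1 items)))

;; Every adjacent pair here chains cleanly:
;; inc (number -> number), str (number -> string), count (string -> number).
(def training [inc str count inc inc str count])

(def model (build-model training))

(defn function-chain
  "Endless lazy sequence of functions, each one a successor seen in training."
  [model start]
  (iterate (fn [f] (rand-nth (get model f))) start))

;; Thread a value through the first ten functions of a random chain.
;; str is never followed by inc, because that pair was never observed.
(reduce (fn [value f] (f value)) 0 (take 10 (function-chain model inc)))
```

The result is a number or a string depending on where the random chain happens to stop, but no step ever calls `inc` on a string.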

I’m not exactly sure how infinite probabilistic function chaining is useful, but it … um … sounds cool.