Infinite Lorem Ipsum with Markov chains

Monday, December 16, 2013

Summary: You can train a Markov chain on Latin text and build an infinite Lorem Ipsum string.

There are hundreds of websites, apps, plugins, and test pages that use the semi-coherent, utterly incomprehensible, and much loved Lorem Ipsum text. After using it countless times to fill gaps in preliminary designs, I found myself, in the middle of a late-night coding session, staring into it and wondering if I too could build something of such infinite semi-coherence.

According to a Straight Dope post, the main source for the Lorem Ipsum Wikipedia article, the semi-coherent text is based on selections from Cicero’s De finibus bonorum et malorum. Since my goal was an infinite string of semi-coherent text, the simplest solution was to train a Markov chain on Cicero’s text and then lazily pull as many words as I needed. The output would still consist of valid Latin words, but the letter-order scrambling found in real Lorem Ipsum could always be applied to the resulting words afterward.

I first became aware of using Markov chains for text synthesis through Garkov, a Garfield parody where the characters’ dialog is replaced with text generated by a probabilistic model trained on genuine Garfield comic strips. I have never had a use for Markov chains in normal projects. Most of my efforts involve finding libraries and gluing them together (PostModern Programming style), and I have never found a way to sneak such a fun method of generating semi-coherent text into one of them.



The Wikipedia page does a great job explaining the details behind Markov chains, so I will only include the following images visualizing the training and text synthesis. In this case, I used the letters in the string “at noon you can” to train the model and generate the string “at nou”.

Input

Generated Model

Generated Text

In this case, the function that generates the probability hash-map relies on Clojure’s treatment of strings as a Seqable collection of characters, and on hash-map keys being of nearly arbitrary type. So, the code will also work on a seqable collection of strings without modification.
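The code itself isn’t shown in this extract, but a minimal sketch of the idea (the train and generate names are illustrative, not necessarily what the original code used) looks something like this:

(defn train
  "Markov model: map each element to a vector of the elements that followed it."
  [coll]
  (reduce (fn [model [a b]] (update model a (fnil conj []) b))
          {}
          (partition 2 1 coll)))

(defn generate
  "Lazily walk the model from a starting element, picking random successors."
  [model start]
  (lazy-seq
    (cons start
          (when-let [nexts (seq (get model start))]
            (generate model (rand-nth nexts))))))

;; Training on the characters of "at noon you can", as in the images above:
;; (apply str (take 6 (generate (train "at noon you can") \a)))
;; ;=> possibly "at nou"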


Since the probability hash-map is limited only by the allowed types for hash-map keys, numbers, objects, other collections, and even functions will also work.
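Under the same assumptions as the sketch above, the training data can just as easily be a vector of words, keywords, or anything else that hashes:

;; Word-level training (the word vector is just an illustration):
(def words ["lorem" "ipsum" "dolor" "sit" "amet" "lorem" "ipsum"])
(take 5 (generate (train words) "lorem"))

;; Keyword keys work just as well:
(take 3 (generate (train [:a :b :a :c]) :a))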


Note: As long as the training list can be chained without error, the generated model prevents invalid function chaining.

I’m not exactly sure how infinite probabilistic function chaining is useful, but it… um… sounds cool.

Adventures in Validating Email Addresses with Instaparse

Saturday, September 07, 2013

Summary: The email address specifications are complex enough to require a full parser for proper validation.


It started simply enough. I wanted a function to validate the format of an email address.
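The regex in question isn’t shown in this extract; a naive first pass of the sort being described (the pattern below is mine, not the post’s) might be:

;; A deliberately naive format check: something@something.tld
(defn naive-valid-email? [s]
  (boolean (re-matches #"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}" s)))

;; (naive-valid-email? "user@example.com") ;=> true
;; (naive-valid-email? "not an email")     ;=> false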


The regex handled all the invalid cases I could think of. I just needed to test some valid addresses to make sure I didn't have any false negatives. So, I looked through the Wikipedia article on email addresses.


OK. I've covered those cases.



I didn't know TLDs were valid domains. But, the regex had that case covered.



Really? I guess that makes sense. I just needed to add a few characters to the regex.



Quoted strings?



That's valid?!



Email addresses can have comments?! Alright. Screw regex. I'm using the EBNF hammer.



Note: I'm using some of the PEG extensions for Instaparse, so this is neither pure EBNF nor a pure CFG.
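The full grammar isn’t reproduced in this extract. As a rough, nowhere-near-RFC-complete illustration of the approach, a cut-down Instaparse grammar covering only the plain dotted local-part and domain (no quoted strings, comments, or IP literals) could look like:

(require '[instaparse.core :as insta])

(def simple-email
  (insta/parser
    "address = local <'@'> domain
     local   = atom (<'.'> atom)*
     domain  = label (<'.'> label)*
     atom    = #'[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+'
     label   = #'[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?'"))

(defn email-format? [s]
  (not (insta/failure? (simple-email s))))

;; (email-format? "first.last@example.co.uk") ;=> true
;; (email-format? "no-at-sign")               ;=> false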

With something this complex, I wanted a more extensive test suite to validate my validator. I found just that in Dominic Sayers’s project is_email(), which was built to solve the same problem in PHP. In the process, he created a test suite covering invalid, deprecated, standard-specific, and other syntax cases. It looked like a good metric. At the time of writing, the above grammar identifies 74% of the valid test cases as valid and 100% of the invalid test cases as invalid. Since this is meant as a generic format-validation function, I count any test in the “ISEMAIL_ERR” category as an “invalid” case and all other categories, including deprecated syntax and length restrictions, as “valid” cases.

While writing my own grammar, I also ran across a similar post by George Pollard, done with Ruby in 2009. While I didn’t use any of his grammar, it’s nice to know I’m on the right track: while you can use regex, you shouldn’t use regex to parse, or validate, an email address.

Improving the Clojure-Git Interface with a Nice Facade

Monday, September 02, 2013

Summary: A more composable Git interface can be built with a facade that implements standard Clojure interfaces.

In a previous post, I used clj-jgit to interact with a local Git repository. The functions and general workflow matched what I would have performed at the command line (git-add followed by git-commit). That made it very easy to get started, and it is preferable to using JGit directly, but, to me, it didn’t feel very Clojure-like. It felt like Bash.

While I was using git-add and git-commit, I wanted conj. A Git branch can almost be imagined as a very persistent, strongly ordered hash-map. Every commit is addressable by a “key” and has a sequence of commits behind it. I should be able to get commits, map and reduce over them, and conj on new ones using the built-in Clojure functions.

Associative, IFn, IObj, Oh My!
Dig into the Clojure (JVM) internals and you will quickly find a list of common interfaces for everything from metadata to data-structure access. I only wanted the behavior of a hash-map, so my implementation list, after much exploration, shrank to Associative, IFn, IObj, and Object. Associative accounted for most of the functionality, but did not provide map-as-function, metadata handling, or toString.
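The proof-of-concept gist isn’t reproduced below, but the general shape of such a type, backed here by a plain map instead of lazy JGit queries (the BranchMap name is mine), could look roughly like this:

(deftype BranchMap [m mta]                               ; m stands in for the real repository queries
  clojure.lang.Associative
  (containsKey [_ k] (contains? m k))
  (entryAt [_ k] (find m k))
  (assoc [_ k v] (BranchMap. (assoc m k v) mta))         ; assoc/conj -> "new commit"
  clojure.lang.ILookup
  (valAt [_ k] (get m k))
  (valAt [_ k not-found] (get m k not-found))
  clojure.lang.IPersistentCollection
  (count [_] (count m))
  (cons [this kv] (.assoc this (nth kv 0) (nth kv 1)))   ; makes conj work
  (empty [_] (BranchMap. {} mta))
  (equiv [_ other] (= m other))
  clojure.lang.Seqable
  (seq [_] (seq m))                                      ; enables map/reduce over commits
  clojure.lang.IFn
  (invoke [this k] (.valAt this k))                      ; map-as-function
  clojure.lang.IMeta
  (meta [_] mta)
  clojure.lang.IObj
  (withMeta [_ new-meta] (BranchMap. m new-meta))
  Object
  (toString [_] (str m)))

;; Hypothetical usage:
;; (def b (BranchMap. {"abc123" {:message "initial commit"}} nil))
;; (b "abc123")                                  ;=> {:message "initial commit"}
;; (conj b ["def456" {:message "add readme"}])   ;=> a new BranchMap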

Proof of Concept


Usage


Some points of interest:

  • The idea of a staging area disappears because you can now construct a commit, as a hash-map, independently of Git.
  • The metadata and commit information are queried lazily, and not cached, because the repository could be changed from outside the application.
  • I have reused the TreeIterator from a previous post to allow commits without writing to the file system.

There is missing functionality (commit totals, changing branches, and a clean commit data format, to name a few), but most of it was outside my original goal of using conj to add a new commit. There are no technical hurdles preventing those features in the future; my application simply did not call for them. If I ever fill them in, this might make a nice feature to submit to the clj-jgit project.

Automating a Clojure Project's File Layout

Saturday, August 03, 2013

I have come to accept the file's place in application development. In Leiningen-based Clojure projects, a clean layout has a file per namespace and a directory per parent namespace. But, as a developer, I write functions, not files. The files are just artifacts. The creation and organization of code files is so systematic that it could be automated with something that maps functions and declarations into the hierarchical file structure used by the project tooling.

Proof of concept: (https://gist.github.com/Jared314/6144582)

The idea behind this “object-hierarchical mapper” is to abstract out the project file layout from the functions and namespaces. There is too much tooling around files to replace them as a code storage format, but there is very little reason to think about file organization.

This code works by passing the function declarations (thing1, thing2, and thing3), with their corresponding namespace declarations (tester1.core, tester1.server), to the parse function, which builds a tree whose nodes correspond to the namespace hierarchy. That tree (nested hash-maps) is then written out to the file system, using the write! function, into a valid project structure. The current issues center on optimizations and messy code: in the generated files, the namespace declarations are not combined efficiently, and the functions are preceded by a list of declares to prevent reference-ordering issues.
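The gist isn’t reproduced here, but the core of the namespace-to-file mapping that write! ultimately performs can be sketched like this (ns->path and write-ns! are illustrative names, not the gist’s parse and write!):

(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn ns->path
  "tester1.core -> src/tester1/core.clj, munging dashes to underscores."
  [ns-sym]
  (str "src/"
       (-> (name ns-sym)
           (str/replace "-" "_")
           (str/replace "." "/"))
       ".clj"))

(defn write-ns!
  "Write an ns declaration plus its forms into the conventional file."
  [ns-sym forms]
  (let [f (io/file (ns->path ns-sym))]
    (io/make-parents f)
    (spit f (str/join "\n\n" (map pr-str (cons (list 'ns ns-sym) forms))))))

;; (write-ns! 'tester1.core '[(defn thing1 [] :one) (defn thing2 [] :two)])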

This is not a new concept. I would guess the idea extends as far back as the 1960s, because that really appears to be the case with most ideas in CS. I just remember first being amazed by it after watching the Code Bubbles demo. More recently, Light Table also had a similar demo, where code was edited in a single visual workspace. I believe they plan to bring that feature back in a future version. No matter where the idea came from, it seems like a useful layer of indirection.

This might end up working best as a lein plugin, with editors saving their changes “through” it, because the project file layout is tied to lein and other lein plugins. It might also be an interesting experiment to use a Git repository as an additional backend, but that would have to wait until the write! function is refactored into something more functionally pure.

Learning core.async: A High School Tale

Wednesday, July 24, 2013

The examples use the 2013-07-24 snapshot of core.async. At the time of writing, the API is still in flux and very experimental.

[org.clojure/core.async "0.1.0-SNAPSHOT"]
from:
["sonatype-oss-public" "https://oss.sonatype.org/content/repositories/snapshots/"]

Once upon a time, Alice wanted to invite Bob to an awesome party she had heard about, but she didn’t know where the party was going to be yet. So, she told Bob to wait by his phone, and she would send him the location when she found out.
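The original snippets aren’t shown in this extract; a minimal sketch of the kind of code this story stands in for (channel names and the location are made up) would be:

(require '[clojure.core.async :refer [chan go >! <!]])

;; Bob waits by his phone (a channel); Alice puts the location on it later.
(def bobs-phone (chan))

(go (println "Bob heads to" (<! bobs-phone)))

(go (>! bobs-phone "the warehouse on 5th"))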



Unfortunately for them, the police were monitoring all the text messages. So, the police showed up, about an hour after the party had started, and arrested everyone for underage drinking.


Broadcasting


Alice, now feeling bad about the last party, decided to make it up to Bob and Carol by inviting them both to the next party she heard about. But, this time, she wanted to make sure no one else could see the message. So, she made a Facebook group, and only included Bob and Carol.
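Again, a sketch rather than the original code: the experimental broadcast helper mentioned in the takeaways isn’t used here; the fan-out is hand-rolled by putting the same message on each friend’s channel.

(require '[clojure.core.async :refer [chan go >! <!]])

(def bobs-feed   (chan))
(def carols-feed (chan))

;; Both friends wait on their own channel.
(go (println "Bob heads to" (<! bobs-feed)))
(go (println "Carol heads to" (<! carols-feed)))

;; Alice posts the location once to the whole "group".
(go (>! bobs-feed   "the abandoned pool")
    (>! carols-feed "the abandoned pool"))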



Unfortunately for them, the police were monitoring all the Facebook groups and messages. So, the police showed up, about an hour after the party had started, and arrested everyone for underage drinking.


Chaining Go Blocks


Alice decided to try one more time, this time also including Dave to make up for the last failed party. But Alice was getting slightly paranoid, so she decided to tell only Bob the location of the party, then have Bob tell Carol, and then have Carol pass it on to Dave.
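A sketch of the relay (again, not the original code): each go block takes the message off one channel and passes it along to the next person.

(require '[clojure.core.async :refer [chan go >! <!]])

(def to-bob   (chan))
(def to-carol (chan))
(def to-dave  (chan))

(go (>! to-carol (<! to-bob)))                 ; Bob tells Carol
(go (>! to-dave  (<! to-carol)))               ; Carol passes it on to Dave
(go (println "Dave heads to" (<! to-dave)))

(go (>! to-bob "the lake house"))              ; Alice only tells Bob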



Unfortunately for them, Dave got pulled over by the police, for a broken tail light, on his way to the party. Because he was within one hundred miles of the border, the police searched his car and found the handwritten note from Carol. So, the police showed up, about an hour after the party had started, and arrested everyone for underage drinking.


Timeouts


This being the third time Alice, Bob, Carol, Dave, and Eve had appeared in his court, Judge Frank was ready to throw them in jail. He sentenced each of them to several months in jail, based on their honesty (represented by a random number), with the option of letting them out early on good behavior (represented by an extra channel for each person).
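A sketch of the sentencing logic (the sentence! function and its details are mine, not the post’s): each person waits on whichever comes first, a timeout sized by their honesty or a value on their good-behavior channel.

(require '[clojure.core.async :refer [chan go timeout alts! >!]])

(defn sentence!
  "Race the honesty-sized timeout (the jail term) against the
  person's good-behavior channel."
  [person good-behavior]
  (let [honesty (inc (rand-int 6))             ; the random honesty number
        term    (timeout (* honesty 1000))]    ; months, shrunk to seconds for the demo
    (go (let [[_ port] (alts! [term good-behavior])]
          (println person (if (= port term)
                            "served the full term."
                            "got out early on good behavior."))))))

;; (def bobs-parole (chan))
;; (sentence! "Bob" bobs-parole)
;; (go (>! bobs-parole :granted))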




Takeaways:

  1. Go blocks will block and execute in their own thread pool or, for the cljs version, their own state-machine-context-like-thing.
  2. The broadcast and multiplex (a.k.a. fan-out and fan-in) currently exist only as experiments. I hope they get rolled into the library, because I don’t want to have to write them myself.
  3. There is no “wait for all channels to have a value” function; alts! and the experimental multiplex return as soon as the first channel has a value.
  4. It looks like you could change the underlying channel implementation to something like ZeroMQ (zmq-async) or RabbitMQ, but I’m not sure if it would actually be a useful abstraction.

Tree storage in Git with JGit and Clojure

Tuesday, May 28, 2013

Summary: You can store parse trees, or ASTs, directly in a Git repo, but then you have new problems.

Git stores a file structure as trees, using blob objects for files and tree objects for directories. The tools usually ignore empty directories, but the internal object store has no problem handling them. So, every so often, I wonder why someone doesn’t store the full AST, or just the parse tree, in the repository instead of just the files.

Proof of Concept

https://gist.github.com/Jared314/5655934

To summarize the code: I’m using a grammar, from the Instaparse tests, that groups the 'a's and 'b's in the string “aaaaabbbaaaabb”. After parsing the text, I apply an Enlive transformation that removes the A nodes from the parse tree. I then render the result back into a tree, made of lists and hash-maps, to be stored in the Git repo using JGit and a custom JGit WorkingTreeIterator. Inside the custom tree iterator, I lazily convert each node into a JGit Entry with a FileMode of TREE or REGULAR_FILE, depending on whether the node has children.
Text > Instaparse > Enlive transform > Render tree > Custom TreeIterator > AddCommand

The Enlive transform and the render function are not required, but, in this case, I wanted to manipulate the tree before saving it.
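The custom WorkingTreeIterator lives in the gist and isn’t reproduced here, but the front half of the pipeline, parsing into Enlive-style nodes and dropping the :A nodes, can be sketched as follows (assuming Instaparse and Enlive are on the classpath):

(require '[instaparse.core :as insta]
         '[net.cgrand.enlive-html :as html])

;; The a/b grouping grammar, with Enlive-style output so the parse tree
;; can be manipulated with Enlive selectors.
(def ab-parser
  (insta/parser
    "S  = AB*
     AB = A B
     A  = 'a'+
     B  = 'b'+"
    :output-format :enlive))

(def tree (ab-parser "aaaaabbbaaaabb"))

;; Remove the :A nodes before rendering and storing the tree.
(def transformed (html/at tree [:A] nil))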


The Good

  • Style Independent: The stored representation is unaffected by formatting differences between statements that parse, or even compile, equivalently. The arguments over tabs versus spaces, variable declaration alignment, and other slightly pedantic coding-style debates become meaningless, because only the parser’s or compiler’s “view” of the code is stored.

The Bad

  • Hard Version Dependency: The stored structure is directly dependent on a compiler version and any bugs or limitations of that version. Assuming the stored tree is tagged with language version information, the language developers would need to release a migration script, a translator, or keep previous versions available for long periods of time, as opposed to just relying on customers to change their own code.
  • Code Generator Required: Assuming only the tree is stored, checking out code from a repository would require a human-readable code generator, like what most decompilers do today. While this would allow for personalized code styling, it is an additional language specific tool to build and maintain.

  • Diff and Merge Tool Limitations: Most mainstream diff and merge tools focus solely on the contents of the stored blobs and assume node order independence. Whether this requires different algorithms or a different workflow, the tools will have to change to handle large trees and forests.


The Code