Adventures in Validating Email Addresses with Instaparse

Saturday, September 07, 2013

Summary: The email address specifications are complex enough to require a full parser for proper validation.


It started simply enough. I wanted a function to validate the format of an email address.
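The regex from this first attempt is not reproduced in this text. As a rough stand-in, a first-pass validator of this kind might look like the sketch below; the pattern and the name `naive-valid-email?` are hypothetical and are not the original code.

```clojure
;; Hypothetical first-pass pattern: a run of common local-part characters,
;; an @, then one or more dot-separated domain labels.
(def naive-email-re
  #"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*")

(defn naive-valid-email?
  "True when the whole string matches the naive pattern."
  [s]
  (boolean (re-matches naive-email-re s)))

(naive-valid-email? "user@example.com")  ; => true
(naive-valid-email? "user@@example.com") ; => false
```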


The regex handled all the invalid cases I could think of. I just needed to test some valid addresses to make sure I didn't have any false negatives. So, I looked through the Wikipedia article on Email Addresses.


OK. I've covered those cases.



I didn't know TLDs were valid domains. But, the regex had that case covered.



Really? I guess that makes sense. I just needed to add a few characters to the regex.



Quoted strings?



That's valid?!



Email addresses can have comments?! Alright. Screw regex. I'm using the EBNF hammer.



Note: I'm using some of the PEG extensions for Instaparse, so this is neither pure EBNF nor a pure CFG.
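The full grammar is not reproduced in this text. Purely as an illustration of the approach, here is a much smaller Instaparse sketch, assuming the instaparse library is on the classpath. It handles dot-separated atoms and simple quoted local parts only; comments, escapes inside quotes, the apostrophe in atoms, and domain literals are left out, and the rule names are illustrative. The test-suite figures quoted in this post do not apply to it.

```clojure
(require '[instaparse.core :as insta])

;; Simplified sketch: angle brackets hide punctuation from the parse tree.
(def address-parser
  (insta/parser
   "address       = local-part <'@'> domain
    local-part    = dot-atom | quoted-string
    domain        = dot-atom
    dot-atom      = atom (<'.'> atom)*
    atom          = #'[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+'
    quoted-string = <'\"'> #'[^\"]*' <'\"'>"))

(defn valid-address?
  "True when the sketch grammar can parse the whole string."
  [s]
  (not (insta/failure? (address-parser s))))

(valid-address? "user@example.com")           ; => true
(valid-address? "\"john doe\"@example.com")   ; => true
(valid-address? "john..doe@example.com")      ; => false
```

Because `dot-atom` requires an atom on both sides of every dot, doubled or trailing dots fail without any special-case rule, which is the kind of constraint that gets awkward to express in a single regex.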

With something this complex, I wanted a more extensive test suite to validate my validator. I found just that in Dominic Sayers’s project is_email(), which was built to solve the same problem in PHP. In the process, he created a test suite covering invalid, deprecated, standard-specific, and other syntax cases. It looked like a good metric. At the time of writing, my grammar identifies 74% of valid test cases as valid, and 100% of the invalid test cases as invalid. Since this is meant as a generic format validation function, I consider any test in the “ISEMAIL_ERR” category an “invalid” case, and all other categories “valid” cases, including deprecated syntax and length restrictions.
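The scoring rule described above can be sketched roughly as follows. The case format (maps with `:address` and `:category`), the function names, and the toy data are assumptions for illustration; only the "ISEMAIL_ERR" category name comes from the text, and "SOME_OTHER_CATEGORY" is a placeholder for every other category in the suite.

```clojure
(defn expected-valid?
  "Everything outside the ISEMAIL_ERR category counts as a valid address."
  [category]
  (not= category "ISEMAIL_ERR"))

(defn pass-rates
  "Given a predicate valid? and a seq of test cases, returns the fraction of
  expected-valid cases accepted and of expected-invalid cases rejected."
  [valid? cases]
  (let [{should-pass true should-fail false}
        (group-by #(expected-valid? (:category %)) cases)
        rate (fn [pred cs]
               (when (seq cs)
                 (double (/ (count (filter pred cs)) (count cs)))))]
    {:valid-accepted   (rate #(valid? (:address %)) should-pass)
     :invalid-rejected (rate #(not (valid? (:address %))) should-fail)}))

;; Toy data and a toy validator, just to exercise the function.
(def toy-cases
  [{:address "a@b"  :category "SOME_OTHER_CATEGORY"}
   {:address "ab"   :category "SOME_OTHER_CATEGORY"}
   {:address "nope" :category "ISEMAIL_ERR"}])

(pass-rates #(boolean (re-find #"@" %)) toy-cases)
;; => {:valid-accepted 0.5, :invalid-rejected 1.0}
```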

While writing my own grammar, I also ran across a similar post by George Pollard, done with Ruby in 2009. While I didn’t use any of his grammar, it’s nice to know I’m on the right track. Because, while you can use regex, you shouldn’t use regex to parse, or validate, an email address.

Improving the Clojure-Git Interface with a Nice Facade

Monday, September 02, 2013

Summary: A more composable Git interface can be built with a facade that implements standard Clojure interfaces.

In a previous post, I used clj-jgit to interact with a local Git repository. The functions, and general workflow, matched what I would have performed at the command line (git-add followed by git-commit). The workflow made it very easy to get started, and is preferable to using jgit directly, but, to me, it didn’t feel very Clojure-like. It felt like Bash.

While I was using git-add and git-commit, I wanted conj. A Git branch can almost be imagined as a very-persistent strongly-ordered hashmap. Every commit is addressable by a “key”, and has a sequence of commits behind it. I should be able to get commits, map and reduce over them, and conj on new ones using the built-in Clojure functions.

Associative, IFn, IObj, Oh My!
Dig into the Clojure (JVM) internals and you will quickly find a list of common interfaces for everything from metadata to data structure access. I only wanted the behavior of a hashmap, so my implementation list, after much exploration, shrank to Associative, IFn, IObj, and Object. Associative accounted for most of the functionality, but did not provide map-as-function, metadata handling, or toString.
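To make the interface list concrete, here is a minimal in-memory sketch of a type that implements those interfaces. It does not touch Git or clj-jgit at all: the `CommitLog` name, its fields, and the `:id`/`:message` commit shape are assumptions for illustration, and a real facade would read from and write to the repository instead of holding a map. Associative pulls in ILookup, IPersistentCollection, and Seqable, which are listed separately here for readability.

```clojure
;; commits: map of id -> commit map; order: vector of ids, oldest first.
(deftype CommitLog [commits order metadata]
  clojure.lang.ILookup
  (valAt [_ k] (get commits k))
  (valAt [_ k not-found] (get commits k not-found))

  clojure.lang.Seqable
  (seq [_] (seq (map commits order)))

  clojure.lang.IPersistentCollection
  (count [_] (count order))
  ;; conj ends up here: a commit map is added under its own :id.
  (cons [this commit] (assoc this (:id commit) commit))
  (empty [_] (CommitLog. {} [] metadata))
  (equiv [this other]
    (and (instance? clojure.lang.Seqable other)
         (= (seq this) (seq other))))

  clojure.lang.Associative
  (containsKey [_ k] (contains? commits k))
  (entryAt [_ k]
    (when (contains? commits k)
      (clojure.lang.MapEntry. k (get commits k))))
  (assoc [_ k commit]
    (CommitLog. (assoc commits k commit)
                (if (contains? commits k) order (conj order k))
                metadata))

  ;; Map-as-function, which Associative alone does not provide.
  clojure.lang.IFn
  (invoke [this k] (get this k))
  (invoke [this k not-found] (get this k not-found))

  clojure.lang.IObj
  (meta [_] metadata)
  (withMeta [_ m] (CommitLog. commits order m))

  Object
  (toString [_] (str "CommitLog[" (count order) " commits]")))

(def history
  (-> (CommitLog. {} [] nil)
      (conj {:id "a1" :message "Initial commit"})
      (conj {:id "b2" :message "Add README"})))

(get history "a1")        ; => {:id "a1", :message "Initial commit"}
(history "b2")            ; => {:id "b2", :message "Add README"}
(map :message history)    ; => ("Initial commit" "Add README")
(str history)             ; => "CommitLog[2 commits]"
```

The ids here are short placeholders; in a Git-backed version they would be commit hashes, and `cons` would create the commit rather than update a map.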

Proof of Concept


Usage


Some points of interest:

  • The idea of a staging area disappears because you can now construct a commit, as a hashmap, independently of Git.
  • The metadata, and commit information, is queried lazily, and not cached, because the repository could be changed from outside the application.
  • I have reused the TreeIterator from a previous post to allow commits without writing to the file system.

There is missing functionality (commit totals, changing branches, and a clean commit data format to name a few), but most of it was outside of my original goal of using conj to add a new commit. There are no technical hurdles to prevent those features in the future, but my application did not call for them. This might make for a nice feature to submit to the clj-jgit project, if I ever fill in the missing features.