Adventures in Validating Email Addresses with Instaparse

Saturday, September 07, 2013

Summary: The email address specifications are complex enough to require a full parser for proper validation.

It started simply enough. I wanted a function to validate the format of an email address.

The regex handled all the invalid cases I could think of. I just needed to test some valid addresses to make sure I didn't have any false negatives. So I looked through the Wikipedia article on email addresses.

OK. I've covered those cases.

I didn't know TLDs were valid domains. But, the regex had that case covered.

Really? I guess that makes sense. I just needed to add a few characters to the regex.

Quoted strings?

That's valid?!

Email addresses can have comments?! Alright. Screw regex. I'm using the EBNF hammer.
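Comments are the real regex-killer: RFC 5322 lets a comment contain another comment, and matching parentheses to arbitrary depth takes recursion (or at least a counter). Here is a rough sketch of just that one rule, written in Python rather than Instaparse so it stands on its own. It is not the grammar from this post; `skip_comment` is a hypothetical helper, and it deliberately ignores folding whitespace and the exact set of characters allowed inside a comment.

```python
def skip_comment(text, pos=0):
    """Return the index just past the comment that starts at text[pos].

    Returns None if text[pos] does not begin a well-formed comment.
    Simplified: any character is accepted as comment text.
    """
    if pos >= len(text) or text[pos] != "(":
        return None
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == ")":
            # End of this comment.
            return pos + 1
        if ch == "(":
            # A nested comment: recurse, then continue after it.
            pos = skip_comment(text, pos)
            if pos is None:
                return None
        elif ch == "\\":
            # Quoted-pair: the backslash escapes the next character.
            if pos + 1 >= len(text):
                return None
            pos += 2
        else:
            pos += 1
    # Ran out of input before the closing parenthesis.
    return None


address = "(a (nested) comment)john@example.com"
rest = address[skip_comment(address):]  # the part after the leading comment
```

The recursive call is the part a classic regular expression cannot express, which is why a grammar is the more natural fit.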

Note: I'm using some of the PEG extensions for Instaparse, so this is neither pure EBNF nor a pure CFG.

With something this complex, I wanted a more extensive test suite to validate my validator. I found just that in Dominic Sayers’s project is_email(), which was built to solve the same problem in PHP. Along the way, he created a test suite covering invalid, deprecated, standard-specific, and other syntax cases. It looked like a good metric. At the time of writing, my grammar identifies 74% of the valid test cases as valid and 100% of the invalid test cases as invalid. Since this is meant as a generic format validation function, I count any test in the “ISEMAIL_ERR” category as an “invalid” case and every other category as a “valid” case, including deprecated syntax and length restrictions.
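The scoring rule is simple enough to sketch. The Python below is only an illustration of the bookkeeping, not the harness I actually ran: `score` is a hypothetical helper, the four test cases are made up, the category labels other than “ISEMAIL_ERR” are placeholders, and the naive regex stands in for whatever validator is being measured.

```python
import re


def score(cases, is_valid):
    """Score a validator against (address, category) pairs.

    Any case in the ISEMAIL_ERR category is expected to be rejected;
    every other category is expected to be accepted.
    Returns (share of valid cases accepted, share of invalid cases rejected).
    """
    valid_total = valid_ok = invalid_total = invalid_ok = 0
    for address, category in cases:
        accepted = bool(is_valid(address))
        if category != "ISEMAIL_ERR":
            valid_total += 1
            valid_ok += accepted
        else:
            invalid_total += 1
            invalid_ok += not accepted
    valid_rate = valid_ok / valid_total if valid_total else 0.0
    invalid_rate = invalid_ok / invalid_total if invalid_total else 0.0
    return valid_rate, invalid_rate


# Made-up sample data; only "ISEMAIL_ERR" is a category name from the suite.
cases = [
    ("user@example.com", "PLACEHOLDER_VALID"),
    ('"quoted local"@example.com', "PLACEHOLDER_VALID"),
    ("user@", "ISEMAIL_ERR"),
    ("@example.com", "ISEMAIL_ERR"),
]


def naive(address):
    # Stand-in validator: something, an @, then something, with no spaces.
    return re.fullmatch(r"[^@\s]+@[^@\s]+", address) is not None


valid_rate, invalid_rate = score(cases, naive)
```

On this toy data the naive pattern rejects both invalid addresses but also rejects the quoted local part, so it accepts only half of the valid cases. That is the same shape of result as above: easy to reach 100% on the invalid side, hard on the valid side.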

While writing my own grammar, I also ran across a similar post by George Pollard, written in Ruby in 2009. I didn’t use any of his grammar, but it’s nice to know I’m on the right track. You can use regex to parse or validate an email address, but you shouldn’t.