Tree storage in Git with JGit and Clojure

Tuesday, May 28, 2013

Summary: You can store parse trees, or ASTs, directly in a Git repo, but then you have new problems.

Git stores a file structure as trees, using Objects for files and Tree Objects for directories. The tools usually ignore empty directories, but the internal object store has no problem handling them. So, every so often, I wonder why someone doesn’t store the full AST, or just the parse tree, in the repository, instead of just the files.

Proof of Concept

https://gist.github.com/Jared314/5655934

To summarize the code, I’m using a grammar, from the Instaparse tests, that groups 'a's and 'b's from the string “aaaaabbbaaaabb”. After parsing the text, I apply an Enlive transformation that removes the A nodes from the parse tree. I then render the result back into a tree, made of lists and hash-maps, to be stored in the Git repo using JGit and a custom JGit WorkingTreeIterator. Inside the custom tree iterator, I lazily convert each node into a JGit Entry with a FileMode of TREE or REGULAR_FILE, depending on if the node has children.
Text > Instaparse > Enlive transform > Render tree > Custom TreeIterator > AddCommand

The Enlive transform and the render function are not required, but, in this case, I wanted to manipulate the tree before saving it.


The Good

  • Style Independent: The stored representation is immune to statements that parse, or even compile, equivalently. The argument of tabs versus spaces, variable declaration alignment, and other, slightly pedantic, coding style arguments become meaningless, because only the parser or compiler’s “view” of the code is stored.

The Bad

  • Hard Version Dependency: The stored structure is directly dependent on a compiler version and any bugs or limitations of that version. Assuming the stored tree is tagged with language version information, the language developers would need to release a migration script, a translator, or keep previous versions available for long periods of time, as opposed to just relying on customers to change their own code.
  • Code Generator Required: Assuming only the tree is stored, checking out code from a repository would require a human-readable code generator, like what most decompilers do today. While this would allow for personalized code styling, it is an additional language specific tool to build and maintain.

  • Diff and Merge Tool Limitations: Most mainstream diff and merge tools focus solely on the contents of the stored blobs and assume node order independence. Whether this requires different algorithms or a different workflow, the tools will have to change to handle large trees and forests.


The Code