
Repost: ANTLR Trinity

This post is a repost of an article I had on a previous incarnation of this blog. I hadn't intended to transfer it over, as the technology is old now (ANTLR is on version 4), but I recently came across a slide deck online where the post was referenced, so I am reposting it in case anyone was looking for it.

There are 3 components to a really useful software development technology: innovative features, clear and comprehensive documentation, and solid tools. The recent release of ANTLR v3.0 is a perfect example: this parser generator has all 3 components, and each one is done superbly.

ANTLR is a parser generator tool that is capable of targeting multiple output languages. Out of the box it will generate Java, Python, C, C#, or Ruby code for parsers. Other target languages are possible if code generators are written for them. Amongst its cool features are:

  • LL(*) parsing: This is an extension to the normal, top down with lookahead, LL(k) parsing technique. It allows for more powerful parsers than those possible with LL(k). Not something I have needed yet, but I can see it being useful in the future.

  • Semantic and syntactic predicates: These are tests that can be embedded in grammar rules to turn certain choices in rules on and off based on a boolean test. They allow parsers to be built that simulate recognizers for context-dependent grammars, which makes automated parser generation applicable to a lot more problems. Again, this is something I haven't used yet, but expect to in the future.

  • Memoized backtracking: This is a performance-improving feature for grammars that use a lot of lookahead. If a parser fails at a choice near the end of a rule, it has to go back to the start of the rule and start matching again. Memoizing the intermediate matches within a rule speeds this up.

  • Unicode support: parsers built with ANTLR recognize Unicode input. ANTLR grammar files themselves do not recognize Unicode characters yet, but Unicode characters can be specified as escape sequences.

  • Hierarchical lexers: Most parser generator tools define tokens by means of a regular-expression-like language, in which lexer rules are independent of each other and cannot refer to one another. ANTLR is different: it allows lexer rules to reference other lexer sub-rules, and it also allows recursive lexer rules. This is a very useful feature and, for me, tidies up lexer rules a lot.

  • Abstract Syntax Tree (AST) generation features: ANTLR has a powerful AST generation feature. When generating parsers I prefer to generate a parse tree or AST and pass that on to a second stage, rather than embedding actions/code in the grammar. This allows a parser to be used as a module in numerous tools. ANTLR's AST build support is superb and really facilitates that type of development.

  • Tree grammars: This is a feature of ANTLR that allows a parser to match tree structures such as ASTs. I used to use JavaCC; what I normally did was use it to generate a parse tree and then process that tree with the visitor pattern. If the action at a particular node depended on the structure of the tree, it was up to me to track that in the Java code in the visitor class. ANTLR's tree grammars really simplify situations like that, and I am really looking forward to trying them out.
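To give a flavour of a few of the features above, here is a minimal sketch of an ANTLR v3 grammar. It is illustrative only (the rule names are my own, and a real grammar would need more care), but it shows the backtracking/memoization options, an AST rewrite using the `^` root operator, and hierarchical lexer rules built from `fragment` sub-rules:

```antlr
grammar Calc;

options {
  output = AST;       // build an abstract syntax tree
  backtrack = true;   // allow automatic backtracking...
  memoize = true;     // ...with memoization of partial matches
}

// AST construction: '^' makes the '+' token the root of the subtree
addExpr : atom ('+'^ atom)* ;

atom : INT | ID ;

// Hierarchical lexer: ID and INT are composed from fragment sub-rules,
// which exist only to be referenced and never become tokens themselves
ID  : LETTER (LETTER | DIGIT)* ;
INT : DIGIT+ ;
fragment LETTER : 'a'..'z' | 'A'..'Z' | '_' ;
fragment DIGIT  : '0'..'9' ;

WS : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;
```

Parsing `1 + x + 2` with a grammar like this yields a tree with `+` nodes at the roots rather than a flat token sequence, which is exactly what makes the two-stage (parse, then walk) style of development so pleasant.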

The Definitive ANTLR Reference, available from the Pragmatic Bookshelf website, is the main documentation for ANTLR v3.0. The book is well named. It clearly and concisely describes the features of ANTLR v3.0 and how to use them. The book is divided into 3 sections: an introduction to ANTLR and language translation, a reference section for the ANTLR syntax, and a section on how to write predicated LL(*) grammars. I have read through the first section and skimmed parts of the reference, and this has enabled me to put together a basic parser/recognizer for Scheme. It really is that straightforward.
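My Scheme grammar isn't reproduced here, but a stripped-down s-expression recognizer along the same lines might look something like this (a sketch only; a usable Scheme lexer needs a much richer symbol rule):

```antlr
grammar MiniScheme;

// A program is a sequence of s-expressions
program : sexpr+ ;

// Recursive rule: a list is any number of nested s-expressions
sexpr : atom
      | '(' sexpr* ')'
      ;

atom : SYMBOL | NUMBER ;

SYMBOL : ('a'..'z' | 'A'..'Z' | '+' | '-' | '*' | '/')+ ;
NUMBER : ('0'..'9')+ ;
WS     : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;
```

The recursion in `sexpr` is all that is needed to recognize arbitrarily nested lists such as `(define (sq x) (* x x))`.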

In addition to The Definitive ANTLR Reference, there is a really good wiki with tutorials and FAQs. Lastly, the distribution comes with great sample grammars. There is a Java grammar and a Python grammar in there, and these are worth referring to in order to see ANTLR put through its paces.

ANTLRWorks is the IDE for building ANTLR grammars. It is a standalone editor, written in Java Swing, that provides the standard features that you would expect from an IDE, such as syntax highlighting and error detection. In addition, it has a few handy features that you would not expect. Firstly, there is a syntax diagram pane in the GUI. Selecting a grammar rule in the editor pane causes a syntax diagram for the rule to be displayed in the syntax diagram pane.

This is very useful for documenting grammars, as the diagrams can be saved as graphics (JPEG, PNG, EPS, etc.). There is also a debugger and an interpreter. The interpreter will interpret the grammar and draw a graphical representation of the parse tree for an input string based on the grammar. If the parse fails, it still draws the diagram for what was recognized and puts a node in the tree, at the point where the parse failed, to represent the type of error detected. The parse tree graphics can also be saved and are a real help in visualizing how the rules fit together and what choices the parser is taking.

