Textmapper


Language Reference

Introduction

The Textmapper input file contains definitions of the lexer and parser, as well as various options affecting the generated output. Either the parser or the lexer section can be omitted, producing only the lexer class, or the parser with a prepared stub for a handwritten lexer.

The source text is a sequence of Unicode characters. By default, it is stored in a file with the .tm extension, encoded in UTF-8.

Whitespace (including line breaks) has no special meaning and is used only to separate tokens. Single-line comments start with a hash character (#). Larger pieces of text can be commented out with traditional block comments (/* .. */). Identifiers are case-sensitive sequences of alphanumeric or underscore characters and cannot start with a digit. As an exception, they may also contain dashes in the middle, as is common in many reference grammars. Another possible form of an identifier is a sequence of characters in single quotes ('), which is useful for naming keywords and punctuation tokens.
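
For illustration, here are a few definitions that use these conventions (the patterns and names are placeholders):

# a single-line comment
/* a block
   comment */
list-separator: /,/      # a dash in the middle of an identifier
'{': /\{/                # a quoted identifier for a punctuation token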

Options

Textmapper files must start with a header specifying a name for the grammar and a target language for the generated code. Semantic actions and value types associated with symbols are considered to be in that language. The header is followed by zero or more generation options, which can be used to tweak the Textmapper output.

language js(go);     # A JavaScript parser in Go.

eventBased = true                          # generate an event-based parser
reportTokens = [Comment, invalid_token]    # report these tokens as parser events

# ... lexer specification
# ... parser

Option values are checked against the expected types declared in the target language templates, and can contain integers, booleans, strings, grammar symbol references, and arrays of the above.

The list of available options for each output language can be found in the language section below.

Lexer

A lexical analyzer (lexer, scanner) is a component which breaks an input stream of characters into tokens (lexemes). The lexer specification consists of a set of rules. Each rule maps a lexical pattern to a terminal symbol, optionally adding attributes and associated code. The generated lexer returns tokens one by one (in order of their appearance in the input), identifying them by applying the rule with the longest matching lexical pattern (i.e. /==/ wins over /=/). On every successful match of a rule with associated code, the code is executed. Typically, the lexer generated by Textmapper accepts a UTF-8 encoded string (Java strings are in UTF-16).

# <token identifier>: <regular expression> [<attributes>] [<code>]
IntegerConstant:  /0|[1-9][0-9]*/		{ reportInteger(l.Text()) }

Lexemes marked with the (space) attribute are excluded from the stream.

# skip them
WhiteSpace:   /[\n\r\t ]+/   (space)
comment:      /#.*(\r?\n)?/  (space)

There is one predefined symbol, eoi, meaning end-of-input. It is returned automatically when the end of the stream is reached. You can, however, define your own lexical patterns for eoi.

# treat Ctrl-Z as end-of-input, see http://en.wikipedia.org/wiki/Control-Z
eoi: /\x1a/

It is possible to extract a part of a regular expression into a reusable named pattern.

digit = /[0-9]/
IntegerConstant: /{digit}+/
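
Named patterns compose like any other part of a regular expression; a small sketch:

hex = /[0-9a-fA-F]/
HexConstant: /0[xX]{hex}+/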

By omitting a lexical pattern, you can introduce a terminal symbol without a lexer rule. For example, adding the error terminal turns on the error recovery mechanism in the parser (see below).

# a marker of the broken part of the stream
error:

The terminal reported for a matched string can also be changed in the associated code.

# some of the identifiers are types
TYPE:
ID: /[a-zA-Z][a-zA-Z_0-9]*/
  {
    if isType(l.Text()) {
      token = TYPE
    }
  }

Associated values

During the parsing process, each symbol can optionally hold an associated value. Different symbols can have values of different types. The type can be specified in braces next to the symbol name. For example, IntegerConstant represents any number in the code, while the value of the number is usually attached to the symbol as the associated data.

IntegerConstant {int}: /[0-9]+/

By default, the associated value is null (nil, NULL, 0, etc., depending on the target language and type). To compute it for a terminal, add a semantic action to the lexer rule. “$$” is an alias for the associated value variable of the current terminal under construction.

IntegerConstant {int}: /[0-9]+/ { $$ = mustParseInt(l.Text()) }

When the string type is chosen for a terminal, the matched text is taken as the default associated value. Semantic actions can override this default value (to perform decoding, for example).

# value is taken from the parsed text as is
ID {string}: /[a-z]+/
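
A semantic action can still override this default, for example to decode the matched text (unquote here is a hypothetical helper):

# strip the quotes and decode escape sequences
StringLiteral {string}: /"([^"\\]|\\.)*"/   { $$ = unquote(l.Text()) }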

Lexer conflicts

When a string of characters is matched by two rules, Textmapper reports a conflict. In most cases, one of the regular expressions can be rewritten to exclude the conflicting string. Although that is possible in theory, in practice it may lead to enormously long and unreadable definitions. Textmapper has two mechanisms to avoid this problem. The first one is applicable when one of the rules matches only one specific string, i.e. when we have a more general and a more specific rule matching the same string. Marking the general one with the (class) attribute tells Textmapper to prefer specific rules over it.

# 'keyword' takes priority over identifier, since it is a more specific rule.
identifier: /[a-zA-Z]+/   (class)
'keyword': /keyword/

For all constant regular expressions, Textmapper tries to find a matching class rule. If there are two, a fatal error is reported.

Another solution is to use numerical priorities. The rule with the highest priority value is chosen. When not specified, the priority value is considered to be zero.

# 'void' is returned as kw_void, since -1 < 0
identifier: /[a-zA-Z]+/   -1
kw_void: /void/

Intensive use of priorities can be dangerous, as it prevents Textmapper from detecting potential conflicts as you change the grammar. The conflicts get resolved silently in accordance with the priorities, masking the actual problem, which is often a poorly defined rule.

Start conditions

Parsing heterogeneous input can require adjusting the set of active lexer rules for different parts of the text. This can be done by introducing lexer start conditions and mapping rules to them. The lexer can be in only one start condition (i.e. state) at a time. Depending on the state, it uses the appropriate set of active rules to match and return the next token. The current start condition can be changed in the associated code upon a successful token match. There is one predefined condition, called initial, in which the lexer starts scanning the input string.

# on '/*' set the current start condition (also known as state) to inComment
commentStart: /\/\*/  (space)    { l.State = StateInComment }

A rule prefixed with a list of start conditions is considered to be active in those conditions only.

<initial, inComment> /[\t\x20]+/   (space)

There are two types of start conditions: inclusive and exclusive, which get declared with ‘%s’ and ‘%x’ directives respectively. Rules without explicit restrictions automatically become active in all inclusive start conditions. By default, the initial state is inclusive but it can be re-declared as exclusive by mentioning it in a ‘%x’ directive.

%s initial, afterComma;
%x inComment, inStringLiteral;

identifier: /[a-zA-Z_]+/        # This rule is active in initial and afterComma states.
<initial> ',':  /,/                      { l.State = StateAfterComma }

Substituting a list of start conditions with a star character activates a rule in all available inclusive and exclusive start conditions.

<*> /[\t\x20]+/   (space)

If several lexer rules are applicable in the same set of start conditions, they can be put together into a start conditions clause. Each rule in the clause inherits the applicability from the clause but can override it if needed. Such clauses combined with exclusive start conditions are a good way of representing lexers within lexers, which tokenize pieces of the input that are syntactically different from the rest (e.g. comments and embedded languages). Start condition clauses can nest.

<inComment, inJavadoc> {
  commentText: /\*|[^*]+/  (space)
  commentEnd: /\*\// (space)   { l.State = StateInitial }

  <inJavadoc> {
    # some extra rules available in Javadoc comments only
  }
}

Using start conditions, the same string can be interpreted differently depending on the lexer context.

<newLine> TAB:  /\t/
<initial> ignoredTab: /\t/  (space)

Backtracking and invalid tokens

When the lexer cannot match the next portion of the input against any rule, it stops and reports everything it has consumed so far as an invalid token. Such tokens are reported to the parser, which ignores them.

# Unfinished string literals are reported as INVALID_TOKEN and span from the
# opening double quote until the end of the line.
StringLiteral: /"[^"\n\r]*"/

The lexer uses backtracking and prefers longer tokens, which means that it will keep consuming characters as long as there is a chance to find a longer one. If it cannot proceed anymore, it returns the current longest token. The following rule will consume everything between the first and the last double quote in the input string.

BigStringLiteral: /"(.|\n)*"/

Backtracking might lead to unexpected results on broken input: the prefix of an unfinished multi-line comment gets interpreted as a pair of division and multiplication operators if such tokens are present in the grammar.

commentChars = /([^*]|\*+[^*\/])*\**/
MultiLineComment:  /\/\*{commentChars}\*\//

'*': /\*/
'/': /\//

This can be improved by having an explicit rule for invalid_token capturing broken comments.

# Note: the following rule disables backtracking for incomplete multiline comments, which
# would otherwise be reported as '/', '*', etc.
invalid_token: /\/\*{commentChars}/

A similar situation happens in grammars which have a token for three dots.

# Without an explicit invalid_token rule, two dots would be reported as two separate tokens.
'.': /\./
invalid_token: /\.\./
'...': /\.\.\./

Regular expressions

The syntax of regular expressions in Textmapper follows standard practices and is very close to what you know from Perl or flex. It differs from them slightly, though: there are no capturing groups, lookbehind assertions, or non-greedy quantifiers, as they make no sense in lexical patterns. The notation for character sets partly complies with Unicode Technical Standard #18. A brief description of the syntax follows.

Characters

c              matches the character `c'
.              any character except newline
\a             alert (bell, '\x07')
\b             backspace ('\x08')
\f             form-feed ('\x0c')
\n             newline (line feed, '\x0a')
\r             carriage-return ('\x0d')
\t             tab ('\x09')
\v             vertical tab ('\x0b')
\ooo           the character with octal value 0ooo
\xhh           the character with hexadecimal value 0xhh
\uhhhh         the character with hexadecimal value 0xhhhh
\              quote the following non-alphabetic character (to escape brackets, slashes, quantifiers etc.)
\\             backslash

Character classes

[abc]          a, b, or c (class)
[_a-zA-Z]      underscore, a through z or A through Z, inclusive (ranges)
[^abc]         any character except a, b, or c (negation)
\d             a digit: [0-9]
\D             a non-digit: [^0-9]
\s             a whitespace character: [ \t\n\v\f\r]
\S             a non-whitespace character: [^\s]
\w             a word character: [a-zA-Z_0-9]
\W             a non-word character: [^\w]
\p{prop}       any character in the unicode category or block `prop' (see below)
\P{prop}       a non-prop character: [^\p{prop}]
\p{Lu}         an uppercase letter (simple category)

Logical operators

RS             R followed by S (concatenation)
R|S            either R or S (union)
(R)            matches an `R' (parentheses are used to override precedence)

Quantifiers

R?             R, once or not at all
R*             R, zero or more times
R+             R, one or more times
R{n}           R, exactly n times
R{n,}          R, at least n times
R{n,m}         R, at least n but not more than m times

Pattern definitions

{name}         the substitution of the `name' pattern definition

Special symbols

{eoi}          an end-of-input (can appear at the end of a pattern only)

Supported unicode categories (in \p{xx})

Lu             Letter, uppercase
Ll             Letter, lowercase
Lt             Letter, titlecase
Lm             Letter, modifier
Lo             Letter, other
Mn             Mark, nonspacing
Mc             Mark, spacing combining
Me             Mark, enclosing
Nd             Number, decimal digit
Nl             Number, letter
No             Number, other
Pc             Punctuation, connector
Pd             Punctuation, dash
Ps             Punctuation, open
Pe             Punctuation, close
Pi             Punctuation, initial quote (may behave like Ps or Pe depending on usage)
Pf             Punctuation, final quote (may behave like Ps or Pe depending on usage)
Po             Punctuation, other
Sm             Symbol, math
Sc             Symbol, currency
Sk             Symbol, modifier
So             Symbol, other
Zs             Separator, space
Zl             Separator, line
Zp             Separator, paragraph
Cc             Other, control
Cf             Other, format
Cs             Other, surrogate
Co             Other, private use
Cn             Other, not assigned (including non-characters)

Notes

[^a-z] matches a newline, unless '\n' is explicitly added into the negated class: [^a-z\n]
bar* is equivalent to ba(r*); use parentheses to override precedence: (bar)*

Parser

A parser component, given a stream of tokens, tries to build a parse tree, or reports a syntax error in case of failure. The parser specification consists of EBNF-like production rules, which are used to rewrite the input into nonterminal symbols (in a bottom-up manner). There should be at least one grammar rule defining a start symbol (the root of the parse tree), which by convention is called input.

# the minimum grammar: input is empty
input : /* empty */ ;

The right-hand side of a rule is a sequence of terminal and nonterminal symbols. Multiple alternative definitions for the same nonterminal can be combined using the vertical bar character (|).

input : Identifier | IntegerConstant ;

The %input directive allows you to override the start symbol, or even introduce several start symbols (each gets a separate parse method in the generated code).

%input start, expr;
start : Identifier '=' expr ;
expr : IntegerConstant ;

As in lexer rules, the type of the associated value can be specified right after the symbol name.

# associated value is an AST class
IfStatement {AstIfStatement} : if '(' expr ')' statement ;

To use an optional symbol (meaning either that symbol or an empty string is accepted), the opt suffix can be added to the symbol reference. Explicitly defining a symbol whose name ends in opt is forbidden.

# a new rule is implicitly added: Identifieropt : Identifier | /* empty */ ;
input : Identifieropt ;

Rule syntax extensions

Textmapper supports extended syntax for production rules. It includes grouping, lists, unordered sequences, alternations and options. These extensions are substituted before the table generation step to form a context-free grammar. As a part of this process, rules containing any kind of internal alternation are replaced with several context-free rules having the same derivation.

Parentheses are used for grouping in the usual manner. A question mark after a symbol reference or a group means that reference or group is optional. The following rule is expanded into 4 rules, where all combinations are expressed individually.

javaImport : import static? qualifiedID ('.' '*')? ';' ;

# is equivalent to
javaImport :
    import qualifiedID ';'
  | import qualifiedID '.' '*' ';'
  | import static qualifiedID ';'
  | import static qualifiedID '.' '*' ';'
  ;

The difference between using the opt suffix and a question mark operator is that the latter does not introduce a new nonterminal. This may result in slightly more generated code, but helps to prevent LR conflicts.
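
To illustrate the difference, the same optional keyword expressed both ways:

# introduces a helper nonterminal:  staticopt : static | /* empty */ ;
javaImport : import staticopt qualifiedID ';' ;

# inlines both alternatives into javaImport itself
javaImport : import static? qualifiedID ';' ;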

In-group alternation is a way to avoid repetitions.

# accepts "block ID code" and "block ID semicolon"
element : block ID (code | semicolon) ;

Unordered sequences accept their elements in any order but each element must appear exactly once (unless it is optional). Such sequences are formed by joining the elements with an ampersand (‘&’) operator. Parentheses can be used to override precedence or to improve readability.

# accepts all permutations of A, B and C as well as an empty string
ABC : (A & B & C)? ;

# accepts: int i, const int i, int volatile i, etc.
Var : Type & Modifiers? ID ;
Modifiers : const | volatile ;

Lists can be introduced on the right-hand side of a rule without the need for a separate nonterminal. An element which may appear multiple times should be followed by a quantifier. This can be either a plus (one or more elements are accepted) or a star (in this case, an empty string is accepted as well). There is a special syntax for lists with a separator. The sequence of separator tokens follows the list element, prefixed with the separator keyword. A quantifier should then be applied to the whole block enclosed in parentheses.

# one or more methods
input : method+ ;

# comma separated list of zero or more parameters
methodHeader : Identifier '(' (parameter separator comma)* ')' ;

As with opt versus ?, there are two ways to express lists accepting an empty string. The star quantifier handles an empty production in the created list nonterminal, while the +? operator inlines it on the caller side. Inlining is more verbose, but less likely to cause LR conflicts.
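
To illustrate, both of the following rules accept zero or more methods, but they expand differently (the generated nonterminal is sketched in the comments; its actual name may differ):

# method* introduces a list nonterminal with an empty production:
#   method_list : method_list method | /* empty */ ;
input : method* ;

# method+? keeps the list nonterminal non-empty and inlines the
# empty alternative into input itself
input : method+? ;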

Semantic Actions

Semantic actions are code in the target language, which is executed when the parser reaches that point in the grammar. Most actions are declared at the end of the rule, where they can be used to compute the resulting associated value for the left-hand nonterminal. $$ stands for the resulting variable.

# simple action
One {int} : '1'   { $$ = 1; } ;

To refer to the values of the right-hand side symbols, use a dollar sign followed by the position or the name of the symbol. Symbols are numbered from left to right starting with 0, so the leftmost one can be referred to as $0. If a symbol is not matched, its associated value will be null (nil, NULL, 0, etc., depending on the target language and type).

# using names
IfExpression {*IfNode} : 'if' '(' expr ')' stmt   { $$ = &IfNode{$expr, $stmt} } ;

# using positions. For "a - b": $0 = "a", $2 = null, $4 = "b"
input : id ('+' id | '-' id)? { /* action */ } ;

An associated value for the symbol can be used only if its type is declared. In the previous example $1 and $3 are undefined and their usages are reported as compile-time errors.

When the same symbol is used twice on the right-hand side of a rule, it cannot be referred to by its own name due to ambiguity (reported as an error). In this case we can provide an alias for it: alias=symbol.

expr {int} :
    left=expr '+' right=expr  { $$ = $left + $right; }
  | ID '(' argumentsopt ')'   { $$ = call($ID, $argumentsopt); }
;

It is possible to declare actions in the middle of a rule. Only symbols to the left of the action are available for referring to (as the rest of the rule has not been parsed by then). Each such mid-rule action introduces a new auxiliary nonterminal in the grammar (which triggers its execution), so it may cause LR conflicts.

# actions conflict: lalr(1) is not enough to decide which action to execute
expr :
    expr { fieldAccess($expr); } '.' 'ID'
  | expr { methodAccess($expr); } '.' 'ID' '(' arguments ')'
;

Although mid-rule actions have associated nonterminals, they don’t have associated values (the use of $$ is forbidden). This keeps the syntax clean, while for scenarios where associated data is needed one can still extract an action into a separate nonterminal.
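
For instance, when a value has to be produced in the middle of a rule, the action can be turned into an empty nonterminal with its own type (enterScope and leaveScope are hypothetical helpers):

scopeMarker {*Scope} : /* empty */   { $$ = enterScope() } ;

block : '{' scopeMarker statement* '}'   { leaveScope($scopeMarker) } ;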

Due to syntax extensions, it may not be completely clear whether an action is a mid-rule action (and thus requires an additional nonterminal). If no symbol can appear after the action, it is considered a rule action. If there are multiple rule actions, they are executed left to right when the rule is reduced.

# runs aParsed() when "read a" is reduced
input : read (a { aParsed(); } | b | c) ;

# the following two actions are executed one after another
# when "minus expr" is matched
input : (minus { handleMinus(); })? expr { handleExpr(); } ;

Semantic actions are processed by the Textmapper template engine, which uses a dollar sign to introduce all template constructs. All dollar signs in the target language code must be escaped. Here is a brief explanation of the most used constructs:

${expr}     evaluates expression and prints out the result
$name       shortcut for ${name} (or ${self.name})
$0, $1..    shortcuts for ${self[index]}
$$          shortcut for ${self.'$'}
${'$'}      prints a single dollar sign
${self}     a string representation of the rule, like 'expr : minus expr'
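
For example, to produce a literal dollar sign in the generated code, use the ${'$'} construct (an illustrative rule that prefixes the matched number with a dollar sign):

# the template engine turns ${'$'} into a single '$' in the output
Price {string}: /[0-9]+/   { $$ = "${'$'}" + l.Text() }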

Expressions in rule actions are evaluated in the context of the current rule. In addition to property and index access to the right-hand side values, a rule object also provides the following operations:

${left()}   refers to the left-hand side nonterminal
${first()}  refers to the first symbol on the RHS (only for non-empty rules)
${last()}   refers to the last symbol on the RHS (only for non-empty rules)

Although ${left()} and $$ denote the same entity, their output differs for strictly-typed target languages:

$$: Must be used for writing data to the associated value of the current rule. This value is usually stored in a variable of the most general type (e.g. Object in Java). If the data is read through the $$ shortcut, the value is returned “as is” with no type casting.

${left()}: Should be used for reading typed data; it returns an associated value cast to the rule’s nonterminal type. It cannot be used to modify values.

# Building lists in Java. ${left()} is equivalent to ((List<Method>)$$)
methods {List<Method>}
  : method         { $$ = new ArrayList<Method>(); ${left()}.add($method); }
  | methods method { ${first()}.add($method); }
  ;

Symbol Locations

The generated lexer and parser track byte offsets of the parsed elements in the source text. Location attributes of terminal symbols are filled out by the lexer. A nonterminal inherits its start and end positions from the first and last symbols on the right-hand side of the matched rule. If the rule is empty, then both locations are set to the starting position of the next non-whitespace token in the stream.

# store line and offsets in the AST class
assignment {Assignment} :
  ID '=' expr
      { $$ = new Assignment($ID, $expr, ${ID.offset}, ${expr.endoffset}); }
;

Typically, element boundaries should not include comments and whitespace, but if a nonterminal ends with an omitted opt symbol, its range extends until the next non-whitespace token. Prefer the question mark syntax to avoid such problems.

# when parsing `( prefix  )`, the clause range spans until the closing parenthesis
clause : prefix suffixopt ;

# the range correctly ends at the end of prefix or suffix symbol
clause : prefix suffix? ;

The lexer also tracks the current line number for error reporting, which can be useful for other purposes as well.

IntegerConstant {int}: /__LINE__/  { $$ = l.Line() }

Parser Algorithm

The generated parser is a shift-reduce parser. It builds a parse tree from the leaves upwards by scanning the input left to right. The processed part of the input is stored as a set of partially parsed trees on a stack. Initially the stack (called the parsing stack) is empty, and at the end it contains a single parse tree for the whole input (if parsing succeeded). The stack is also used to maintain the associated attributes of each symbol (such as values and locations). The parser proceeds step by step by applying one of two simple actions:

Shift takes the next token from the input stream and pushes it onto the parsing stack.

Reduce takes a rule index as an argument and expects that the top N symbols of the stack form the right-hand side of the rule. These N symbols are then replaced with the rule’s left-hand nonterminal. The reduction process also includes computing the associated attributes for this nonterminal (e.g. executing semantic actions).

The following example shows the parsing steps for the variable declaration construct, which is common to many languages.

stack                    input stream       action
<empty>                  int i = 5 + 3 ;    Shift
int                      i = 5 + 3 ;        Reduce (type : int)
type                     i = 5 + 3 ;        Shift
type ID(i)               = 5 + 3 ;          Shift (2 times)
type ID = 5              + 3 ;              Reduce (expr : integer_const)
type ID = expr           + 3 ;              Shift (2 times)
type ID = expr + 3       ;                  Reduce (expr : integer_const)
type ID = expr + expr    ;                  Reduce (expr : expr '+' expr)
type ID = expr           ;                  Shift
type ID = expr ;                            Reduce (var_decl : type ID '=' expr ';')
var_decl                                    Reduce (input : var_decl)
input                                       <success>

The decision on which action to execute is made at each step, depending on the stack content and the following items in the input stream (so-called lookahead tokens). Textmapper uses LR parsing techniques to come to this decision quickly.

Instead of analyzing the whole stack each time, an aggregated value describing the stack content is used. This value is called a parser state. The parser state is effectively maintained for every element of the stack using a special table in the generated code (called the goto table). When a terminal or nonterminal symbol is pushed onto the stack, the new state is calculated as goto[s, newSymbol], where s is the previous state on top of the stack.

In each state there may be one or more actions that lead to a successful parse. If there are many, we have to examine lookahead tokens to disambiguate. The generated parser scans them one by one until the ambiguity is resolved. Then it applies the only remaining valid action, or reports a syntax error. The pre-computed action table is used by the parser to look up the next action for the current state and the next input terminal. In addition to Shift and Reduce, the action table may contain two more types of actions:

Error reports a syntax error. The parser may also try to recover if this feature is on.

Lookahead means that one more token is needed to disambiguate. It takes a new lookahead state as an argument. Lookahead states extend a set of parser states, and can be used as keys in the action table (but not in the goto table). When the parser encounters this action, it fetches the next lookahead token. It then continues the lookup for the new state and the new token until no more Lookahead actions are returned.

There is no Success action, instead the parser continues until it reaches the success state.

Grammar Ambiguities

While building the action table, Textmapper reports all unresolved ambiguities as compile-time errors. Most of these errors originate in the grammar design and can be fixed by changing the derivation and precedence rules. If a grammar was successfully compiled, the generated parser is deterministic and produces one unique parse tree for each input.

Two types of conflicts can be encountered:

Shift/Reduce appears when both shifting the next token and reducing a rule may lead to a successful parse. One of the classic examples is the Dangling Else problem.

Reduce/Reduce means there are two rules which can be reduced in this state.
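
For example, the following sketch contains a reduce/reduce conflict: after shifting ID with ';' as the lookahead, the parser cannot decide whether to reduce it to type or to expr.

input : decl | stmt ;
decl : type ';' ;
stmt : expr ';' ;
type : ID ;
expr : ID ;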

These conflicts also appear when the chosen LR algorithm variation is not strong enough for the grammar: the number of lookahead symbols may be too low, or states may have been merged too aggressively (to save space). There is usually a way to rewrite parts of the grammar to make it compatible with the restrictions imposed by the algorithm.

Each error contains an example of the parsing stack content, conflicting rules, and lookahead tokens:

input: statementList if '(' Expression ')' Statement
shift/reduce conflict (next: else)
	IfStatement : if '(' Expression ')' Statement

We can resolve this conflict manually by marking the appropriate rule with a %shift modifier, which takes a comma-separated list of terminal symbols for which Shift is the preferred solution. Excess modifiers (i.e. for non-conflicting rules or terminals) are reported as errors.

IfStatement :
    kw_if '(' Expression ')' Statement %shift else
  | kw_if '(' Expression ')' Statement else Statement
;

The rule modifier should be used with care, as it may silently resolve more conflicts than you expect. A good practice is to limit its use to well-studied cases only.

Operator Precedence

Mathematical expressions appear in many languages, and every language defines its own order in which parts of an expression are evaluated. For example, without parentheses, multiplication is typically done before addition. From the parsing point of view, it does make sense to have an AST that reflects the chosen operator precedence, i.e. operators with higher precedence are reduced first, and remain closer to the leaves in the AST. This allows us to evaluate an expression using a simple recursive tree traversal.

For operators of the same precedence, the order of evaluation remains unclear. For example, a+b+c can be interpreted either as (a+b)+c or as a+(b+c). We resolve this by declaring an operator as left-associative (operations are grouped left-to-right), right-associative, or non-associative. Non-associative operators produce a syntax error when applied to a term of the same precedence (e.g. the range operator 1..100 is typically non-associative).

The following grammar snippet produces a bunch of Shift/Reduce conflicts. Obviously, without precedences and associativities, there are multiple ways to parse any complex expression.

expr :
     IntegerConstant
   | expr '+' expr
   | expr '*' expr
;

We can rewrite it so that operator priorities and associativities are encoded in the grammar. The rewritten grammar would look like this:

expr :
    plus_expr ;

plus_expr :
    mult_expr | plus_expr '+' mult_expr ;

mult_expr :
    IntegerConstant | mult_expr '*' IntegerConstant ;

It is heavy, verbose, and requires many excess reductions (every single constant makes its way through all the nonterminals). Instead, we can add precedence rules to resolve all the ambiguities and stick to the original version. Note that operators are represented by terminal symbols.

%left '+', '-';
%left '*', '/', '%';

Operators on the first line have the lowest precedence. Each subsequent line introduces a new, higher precedence level. On the same level, operators are interchangeable and have the same associativity (left, right, or nonassoc).
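
Right-associative and non-associative operators are declared similarly (a sketch, assuming the %right and %nonassoc directives mirror %left):

%left '+', '-';
%left '*', '/', '%';
%right '^';        # a^b^c groups as a^(b^c)
%nonassoc '..';    # a..b..c is a syntax error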

Textmapper uses precedences to resolve shift/reduce conflicts. First, it assigns a precedence to a rule which can be reduced. By default, it takes the precedence of the last terminal on the right-hand side of the rule. This can be overridden using a %prec <terminal> rule modifier. If precedences are defined for both the rule and the lookahead token, Textmapper can compare them and perform:

  • Shift if the lookahead token has higher precedence than the rule, or they have equal precedence with right associativity
  • Reduce if the lookahead token has lower precedence than the rule, or they have equal precedence with left associativity
  • Error if the lookahead token and the rule are of the same precedence and are non-associative

For example, the following conflict will be resolved as Shift for the '*' lookahead token and as Reduce for '+' (which is left-associative).

input: MethodHeader '{' expr '+' expr
shift/reduce conflict (next: '+', '*')
    expr : expr '+' expr

The unary minus requires a separate operator terminal as it has higher precedence than the usual minus operator. This means we must specify this custom precedence in a rule modifier.

:: lexer
unaryMinus:

:: parser
%left '+', '-';
%left '*';
%left unaryMinus;

expr :
     IntegerConstant
   | expr '+' expr
   | expr '-' expr
   | expr '*' expr
   | '-' expr %prec unaryMinus
;

Without the unaryMinus precedence, -1-1 can be parsed in two ways: either as -(1-1), or as (-1)-1.

Error Recovery

By default, the generated parser stops when it encounters the first syntax error. This may not be desirable in some scenarios, such as parsing user-typed text in an editor, where the syntactic structure is broken most of the time (often only locally). Introducing a token with the predefined name error turns on the recovery mechanism. The error terminal represents a part of the stream which cannot be successfully parsed.

# Note: the error token cannot have an associated regular expression
error:

This terminal can then be used in production rules to specify the contexts where an error is expected and can be properly handled.

statement :
    expr_statement ';'
  | error ';'		# if something is broken here, it is broken up to the next semicolon
;

When the generated parser encounters a syntax error, it starts discarding the parsing context from the stack until it finds a state in which the error terminal is accepted. This means that even if an error occurred in an expression, the stack is unwound up to the statement context in which the appropriate recovery rule is defined. The parser then creates a new error token, pushes it onto the stack, and continues parsing.

It may happen that the next few tokens in the stream contribute to the error, so the parser remains in recovery mode until three consecutive input tokens have been successfully shifted. No new syntax errors are reported during this time.
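
Recovery rules can be provided at several levels of the grammar, so the stack is unwound only up to the nearest context that can handle the error; a sketch:

statement :
    expr_statement ';'
  | error ';'            # resynchronize at the next semicolon
;

block :
    '{' statement* '}'
  | '{' error '}'        # or at the closing brace of the whole block
;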

Abstract Syntax Tree

In most grammars, authors use semantic actions to build an intermediate representation of the input, moving most or all of the semantic processing out of the parser. Usually, this representation is a compact version of the parse tree and is called an Abstract Syntax Tree (AST). The generated parser then becomes a simple component which is able to transform an input text into an AST. This approach pays off when you need to reuse the parser in different environments (e.g. for providing code completion in an IDE).

… to be continued