The Textmapper input file contains definitions of the lexer and parser as well as various options affecting the generated output. Either the parser or the lexer section can be omitted to get only the lexer class, or the parser with a prepared stub for a handwritten lexer.
The source text is a sequence of Unicode characters. By default, it is stored in a file with the .tm extension, encoded in UTF-8.
Whitespace (including line breaks) has no special meaning and is used only as a token separator. Single-line comments start with a hash character (#). Larger pieces of text can be commented out with traditional block comments (/* .. */). Identifiers are case-sensitive sequences of alphanumeric or underscore characters and cannot start with a digit. As an exception, they are also allowed to have dashes in the middle, as is common in many reference grammars. Another possible form of an identifier is a sequence of characters in single quotes, which is useful for naming keywords and punctuation tokens.
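For illustration, a hypothetical rule using both forms, dashed identifiers and quoted token names:
# dashes are allowed in the middle of identifiers; quoted names suit keywords and punctuation
import-declaration : 'import' qualified-name ';' ;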
Textmapper files must start with a header specifying a name for the grammar and a target language for the generated code. Semantic actions and value types associated with symbols are considered to be in that language. The header is followed by zero or more generation options, which can be used to tweak the Textmapper output.
language js(go); # A Javascript parser in Go.
eventBased = true # generate an event-based parser
reportTokens = [Comment, invalid_token] # report these tokens as parser events
# ... lexer specification
# ... parser
Option values are checked against the expected types declared in the target language templates, and can contain integers, booleans, strings, grammar symbol references, and arrays of the above.
The list of available options for each output language can be found in the language section below.
A lexical analyzer (lexer, scanner) is a component which breaks an input stream of characters into tokens (lexemes). The lexer specification consists of a set of rules. Each rule maps a lexical pattern to a terminal symbol, with the option of adding attributes and associated code. The generated lexer returns tokens one by one (in order of their appearance in the input), identifying them by applying the rule with the longest matching lexical pattern (i.e. /==/ wins over /=/). On every successful match of a rule with associated code, the code is executed. Typically, the lexer generated by Textmapper accepts a UTF-8 encoded string (Java strings are in UTF-16).
# <token identifier>: <regular expression> [<attributes>] [<code>]
IntegerConstant: /0|[1-9][0-9]*/ { reportInteger(l.Text()) }
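For instance, given the two rules below, the input == is matched by the longer pattern and reported as a single '==' token rather than two '=' tokens (a minimal illustration):
'=': /=/
'==': /==/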
Lexemes marked with the (space) attribute are excluded from the stream.
# skip them
WhiteSpace: /[\n\r\t ]+/ (space)
comment: /#.*(\r?\n)?/ (space)
There is one predefined symbol eoi, meaning end-of-input. It is returned automatically when the end of the stream is reached. You can define your own lexical patterns for eoi, though.
# treat Ctrl-Z as end-of-input, see http://en.wikipedia.org/wiki/Control-Z
eoi: /\x1a/
It is possible to extract a part of a regular expression into a reusable named pattern.
digit = /[0-9]/
IntegerConstant: /{digit}+/
By omitting a lexical pattern, you can introduce a terminal symbol without a lexer rule. For example, adding the error terminal turns on the error recovery mechanism in the parser (see below).
# a marker of the broken part of the stream
error:
Mapping an input string to a terminal can be handled by the associated code (see the lexer hack).
# some of the identifiers are types
TYPE:
ID: /[a-zA-Z][a-zA-Z_0-9]*/
{
if isType(l.Text()) {
token = TYPE
}
}
During the parsing process, each symbol can optionally hold an associated value. Different symbols can have values of different types. The type can be specified in braces next to the symbol name. For example, IntegerConstant represents any number in the code, while the value of the number is usually attached to the symbol as the associated data.
IntegerConstant {int}: /[0-9]+/
By default, the associated value is null (nil, NULL, 0, etc., depending on the target language and type). To compute it for a terminal, add a semantic action to the lexer rule. “$$” is an alias for the associated value variable of the current terminal under construction.
IntegerConstant {int}: /[0-9]+/ { $$ = mustParseInt(l.Text()) }
When the string type is chosen for a terminal, the matched text is taken as the default associated value. Semantic actions can override this default value (to perform decoding, for example).
# value is taken from the parsed text as is
ID {string}: /[a-z]+/
When a sequence of characters is matched by two rules, Textmapper reports a conflict. In most cases one of the regular expressions can be rewritten to exclude the conflicting string. Although that is possible in theory, in practice it may lead to enormously long and unreadable definitions. Textmapper has two mechanisms to avoid this problem. The first one is applicable when one of the rules matches only one specific string. I.e. we have a more general and a more specific rule matching the same text. Marking the general one with the (class) attribute tells Textmapper to prefer specific rules over it.
# 'keyword' has a priority over identifier, since it is a more specific rule.
identifier: /[a-zA-Z]+/ (class)
'keyword': /keyword/
For all constant regular expressions, Textmapper tries to find a matching class rule. If there are two, a fatal error is reported.
Another solution is to use numerical priorities. The rule with the highest priority value is chosen. When not specified, the priority value is considered to be zero.
# 'void' is returned as kw_void, since -1 < 0
identifier: /[a-zA-Z]+/ -1
kw_void: /void/
Intensive usage of priorities can be dangerous, as it prevents Textmapper from detecting potential conflicts as you change the grammar. The conflicts get resolved silently in accordance with priorities, masking the actual problem, which is often a poorly-defined rule.
Parsing heterogeneous input can require adjusting the set of active lexer rules for different parts of the text. This can be done by introducing lexer start conditions and mapping rules to them. The lexer can be in only one start condition (i.e. state) at a time. Depending on the state, it uses the appropriate set of active rules to match and return the next token. The current start condition can be changed in the associated code upon a successful token match. There is one predefined condition, called initial, in which the lexer starts scanning the input string.
# on '/*' set the current start condition (also known as state) to inComment
commentStart: /\/\*/ (space) { l.State = StateInComment }
A rule prefixed with a list of start conditions is considered to be active in those conditions only.
<initial, inComment> /[\t\x20]+/ (space)
There are two types of start conditions: inclusive and exclusive, which get declared with ‘%s’ and ‘%x’ directives respectively. Rules without explicit restrictions automatically become active in all inclusive start conditions. By default, the initial state is inclusive but it can be re-declared as exclusive by mentioning it in a ‘%x’ directive.
%s initial, afterComma;
%x inComment, inStringLiteral;
identifier: /[a-zA-Z_]+/ # This rule is active in initial and afterComma states.
<initial> ',': /,/ { l.State = StateAfterComma }
Substituting a list of start conditions with a star character activates a rule in all available inclusive and exclusive start conditions.
<*> /[\t\x20]+/ (space)
By default, the EOI token is reported in inclusive start conditions only. In exclusive start conditions, whenever the lexer reaches the end of the input it produces an invalid_token. This behavior can be changed by providing explicit rules handling /{eoi}/ in specific start conditions.
# Accept end-of-input in all start conditions.
<*> eoi: /{eoi}/
If several lexer rules are applicable in the same set of start conditions, they can be put together into a start conditions clause. Each rule in the clause inherits the applicability from the clause but can override it if needed. Such clauses combined with exclusive start conditions are a good way of representing lexers within lexers, which tokenize pieces of the input that are syntactically different from the rest (e.g. comments and embedded languages). Start condition clauses can nest. It is typical to handle eoi in such clauses and reset the lexer state to initial.
<inComment, inJavadoc> {
commentText: /\*|[^*]+/ (space)
commentEnd: /\*\// (space) { l.State = StateInitial }
eoi: /{eoi}/ { ... report an unexpected end of file ... ; l.State = StateInitial }
<inJavadoc> {
# some extra rules available in Javadoc comments only
}
}
Using start conditions, the same string can be interpreted differently depending on the lexer context.
<newLine> TAB: /\t/
<initial> ignoredTab: /\t/ (space)
When the lexer cannot match the next portion of the input against some rule, it stops and reports everything it consumed so far as an invalid token. Such tokens get reported and ignored by the parser.
# Unfinished string literals are reported as INVALID_TOKEN and span from the
# opening double quote until the end of the line.
StringLiteral: /"[^"\n\r]*"/
The lexer uses backtracking and prefers longer tokens, which means that it will keep consuming characters as long as there is a chance to find a longer one. If it cannot proceed anymore, it returns the current longest token. The following rule will consume everything between the first and the last double quote in the input string.
BigStringLiteral: /"(.|\n)*"/
Backtracking might lead to unexpected results for broken input: the prefix of an unfinished multi-line comment gets interpreted as a pair of division and multiplication operators if such tokens are present in the grammar.
commentChars = /([^*]|\*+[^*\/])*\**/
MultiLineComment: /\/\*{commentChars}\*\//
'*': /\*/
'/': /\//
This can be improved by having an explicit rule for invalid_token capturing broken comments.
# Note: the following rule disables backtracking for incomplete multiline comments, which
# would otherwise be reported as '/', '*', etc.
invalid_token: /\/\*{commentChars}/
A similar situation happens in grammars which have a token for three dots.
# Without an explicit invalid_token rule, two dots would be reported as two separate tokens.
'.': /\./
invalid_token: /\.\./
'...': /\.\.\./
Most grammars do not require backtracking, and it is best to ensure that the generated code does not contain a backtracking table. If it does, keep adding invalid_token rules until the table disappears.
Excerpt from lexer_tables.go for the previous example, without the invalid_token rule:
var tmBacktracking = []int{
4, 3, // in '.'
}
There are legitimate use-cases for backtracking such as allowing identifiers to have dashes or escaped Unicode characters but those are relatively rare.
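For example, the following sketch of an identifier rule that allows single dashes in the middle relies on backtracking: after tentatively consuming a trailing dash, the lexer may fail to find another letter and has to fall back to the shorter match.
dashed-id: /[a-zA-Z]+(-[a-zA-Z]+)*/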
The syntax of regular expressions in Textmapper follows standard practices and is very close to what you know from Perl or flex. It differs from them slightly, though. There are no capturing groups, lookbehind assertions, or non-greedy quantifiers, as they make no sense in lexical patterns. The notation for character sets partly complies with the Unicode Technical Standard #18. A brief description of the syntax follows.
Characters
c matches the character `c'
. any character except newline
\a alert (bell, '\x07')
\b backspace ('\x08')
\f form-feed ('\x0c')
\n newline (line feed, '\x0a')
\r carriage-return ('\x0d')
\t tab ('\x09')
\v vertical tab ('\x0b')
\ooo the character with octal value 0ooo
\xhh the character with hexadecimal value 0xhh
\uhhhh the character with hexadecimal value 0xhhhh
\ quote the following non-alphabetic character (to escape brackets, slashes, quantifiers etc.)
\\ backslash
Character classes
[abc] a, b, or c (class)
[_a-zA-Z] underscore, a through z or A through Z, inclusive (ranges)
[^abc] any character except a, b, or c (negation)
\d a digit: [0-9]
\D a non-digit: [^0-9]
\s a whitespace character: [ \t\n\v\f\r]
\S a non-whitespace character: [^\s]
\w a word character: [a-zA-Z_0-9]
\W a non-word character: [^\w]
\p{prop} any character in the unicode category or block `prop' (see below)
\P{prop} a non-prop character: [^\p{prop}]
\p{Lu} an uppercase letter (simple category)
[^a-z] matches a newline, unless ‘\n’ is explicitly added into the negated class: [^a-z\n]
Logical operators
RS R followed by S (concatenation)
R|S either R or S (union)
(R) matches an `R' (parentheses are used to override precedence)
Coming soon: case-insensitive regular expressions. /(?i)foo/
Quantifiers
R? R, once or not at all
R* R, zero or more times
R+ R, one or more times
R{n} R, exactly n times
R{n,} R, at least n times
R{n,m} R, at least n but not more than m times
bar* is equivalent to ba(r*); use parentheses to override precedence: (bar)*
Pattern definitions
{name} the substitution of the `name' pattern definition
Special symbols
{eoi} an end-of-input (can appear at the end of a pattern only)
Supported Unicode categories (in \p{xx})
Lu Letter, uppercase
Ll Letter, lowercase
Lt Letter, titlecase
Lm Letter, modifier
Lo Letter, other
Mn Mark, nonspacing
Mc Mark, spacing combining
Me Mark, enclosing
Nd Number, decimal digit
Nl Number, letter
No Number, other
Pc Punctuation, connector
Pd Punctuation, dash
Ps Punctuation, open
Pe Punctuation, close
Pi Punctuation, initial quote (may behave like Ps or Pe depending on usage)
Pf Punctuation, final quote (may behave like Ps or Pe depending on usage)
Po Punctuation, other
Sm Symbol, math
Sc Symbol, currency
Sk Symbol, modifier
So Symbol, other
Zs Separator, space
Zl Separator, line
Zp Separator, paragraph
Cc Other, control
Cf Other, format
Cs Other, surrogate
Co Other, private use
Cn Other, not assigned (including non-characters)
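For example, a hypothetical rule that uses a Unicode category to match names starting with an uppercase letter:
TypeName: /\p{Lu}\w*/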
A parser component, given a stream of tokens, tries to build a parse tree, or reports a syntax error in case of failure. The parser specification consists of EBNF-like production rules, which are used to rewrite input to nonterminal start symbols (in a bottom-up manner). There should be at least one grammar rule defining a start symbol (the root of the parse tree), which by convention is called input.
# the minimum grammar: input is empty
input : /* empty */ ;
The right-hand side of a rule is a sequence of terminal and nonterminal symbols. Multiple alternative definitions for the same nonterminal can be combined using the vertical bar character (|).
input : Identifier | IntegerConstant ;
The %input directive allows you to override the start symbol, or even introduce several start symbols (each gets a separate parse method in the generated code).
%input start, expr;
start : Identifier '=' expr ;
expr : IntegerConstant ;
As in lexer rules, the type of the associated value can be specified right after the symbol name.
# associated value is an AST class
IfStatement {AstIfStatement} : if '(' expr ')' statement ;
To use an optional symbol (meaning either that symbol or an empty string is accepted), the opt suffix can be added to the symbol reference. Defining a symbol whose name ends in ‘opt’ is forbidden.
# a new rule is implicitly added: Identifieropt : Identifier | /* empty */ ;
input : Identifieropt ;
Textmapper supports extended syntax for production rules. It includes grouping, lists, alternations and options. These extensions are substituted before the table generation step to form a context-free grammar. As a part of this process, rules containing any kind of internal alternation are replaced with several context-free rules having the same derivation.
Parentheses are used for grouping in the usual manner. A question mark after a symbol reference or a group means that reference or group is optional. The following rule is expanded into 4 rules, where all combinations are expressed individually.
javaImport : import static? qualifiedID ('.' '*')? ';' ;
# is equivalent to
javaImport :
import qualifiedID ';'
| import qualifiedID '.' '*' ';'
| import static qualifiedID ';'
| import static qualifiedID '.' '*' ';'
;
The difference between using the opt suffix and a question mark operator is that the latter does not introduce a new nonterminal. This may result in slightly more generated code but helps to prevent LR conflicts.
In-group alternation is a way to avoid repetitions.
# accepts "block ID code" and "block ID semicolon"
element : block ID (code | semicolon) ;
Lists can be introduced on the right-hand side of a rule without the need for a separate nonterminal. An element which may appear multiple times should be followed by a quantifier. This can be either a plus (one or more elements are accepted) or a star (in this case, an empty string is accepted as well). There is a special syntax for lists with a separator. The sequence of separator tokens follows the list element, prefixed with the separator keyword. A quantifier should then be applied to the whole block enclosed in parentheses.
# one or more methods
input : method+ ;
# comma separated list of zero or more parameters
methodHeader : Identifier '(' (parameter separator comma)* ')' ;
As with opt versus ?, there are two ways to express lists accepting an empty string. The star quantifier handles the empty case via an empty production in the created list nonterminal, while the +? operator inlines it on the caller side. Inlining is more verbose, but less likely to cause LR conflicts.
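For instance (a minimal sketch), both definitions below accept zero or more methods:
# an empty production is added to the generated list nonterminal
input : method* ;
# the empty case is inlined into the rules for input
input : method+? ;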
Semantic actions are code in the target language, which is executed when the parser reaches that point in the grammar. Most actions are declared at the end of the rule, where they can be used to compute the resulting associated value for the left-hand nonterminal. $$ stands for the resulting variable.
# simple action
One {int} : '1' { $$ = 1; } ;
To refer to the values of the right-hand side symbols, use a dollar sign followed by the position or the name of the symbol. Symbols are numbered from left to right starting with 0, so the leftmost one can be referred to as $0. If a symbol is not matched, its associated value will be null (nil, NULL, 0, etc., depending on the target language and type).
# using names
IfExpression {*IfNode} : 'if' '(' expr ')' stmt { $$ = &IfNode{$expr, $stmt} } ;
# using positions. For "a - b": $0 = "a", $2 = null, $4 = "b"
input : id ('+' id | '-' id)? { /* action */ } ;
An associated value for the symbol can be used only if its type is declared. In the previous example, $1 and $3 are undefined and their usages are reported as compile-time errors.
When the same symbol is used twice on the right-hand side of a rule, it cannot be referred to by its own name due to ambiguity (reported as an error). In this case we can provide an alias for it: alias=symbol.
expr {int} :
left=expr '+' right=expr { $$ = $left + $right; }
| ID '(' argumentsopt ')' { $$ = call($ID, $argumentsopt); }
;
It is possible to declare actions in the middle of a rule. Only symbols to the left of the action are available for referring to (as the rest of the rule has not been parsed by then). Each such mid-rule action introduces a new auxiliary nonterminal in the grammar (which triggers its execution), so it may cause LR conflicts.
# actions conflict: lalr(1) is not enough to decide which action to execute
expr :
expr { fieldAccess($expr); } '.' 'ID'
| expr { methodAccess($expr); } '.' 'ID' '(' arguments ')'
;
Although mid-rule actions have associated nonterminals, they don’t have associated values (the use of $$ is forbidden). This keeps the syntax clean; for scenarios where associated data is needed, one can still extract an action into a separate nonterminal.
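For example, a hedged sketch of such an extraction (lexerOffset() and Node are hypothetical): the empty nonterminal carries a typed value that a mid-rule action could not.
position {int} : /* empty */ { $$ = lexerOffset() } ;
expr {*Node} : position ID { $$ = &Node{Offset: $position} } ;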
Due to syntax extensions, it may not be completely clear if the action is a mid-rule action (and so requires an additional nonterminal). If no symbol can appear after the action, it is considered as a rule action. If there are multiple rule actions, they are executed left to right when the rule is reduced.
# runs aParsed() when "read a" is reduced
input : read (a { aParsed(); } | b | c) ;
# the following two actions are executed one after another
# when "minus expr" is matched
input : (minus { handleMinus(); })? expr { handleExpr(); } ;
Semantic actions are processed using the Textmapper templates engine, which uses a dollar sign as the start symbol of all template constructs. All dollar signs in the target language must be escaped. Here is a brief explanation of the most used constructs:
${expr} evaluates expression and prints out the result
$name shortcut for ${name} (or ${self.name})
$0, $1.. shortcuts for ${self[index]}
$$ shortcut for ${self.'$'}
${'$'} prints a single dollar sign
${self} a string representation of the rule, like 'expr : minus expr'
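For example, a literal dollar sign in target-language code has to be written through the template engine (a hypothetical Go-targeted snippet):
# the associated value becomes the Go string "$"
DollarSign {string} : '$' { $$ = "${'$'}" } ;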
Expressions in rule actions are evaluated in the context of the current rule. In addition to property and index access to the right-hand side values, a rule object also provides the following operations:
${left()} refers to the left-hand side nonterminal
${first()} refers to the first symbol on the RHS (only for non-empty rules)
${last()} refers to the last symbol on the RHS (only for non-empty rules)
Although ${left()} and $$ denote the same entity, their output differs for statically-typed target languages:
$$: Must be used for writing data to the associated value of the current rule. This value is usually stored in a variable of the most general type (e.g. Object in Java). If the data is read through the $$ shortcut, the value is returned “as is” with no type casting.
${left()}: Should be used for reading typed data; it returns the associated value cast to the rule’s nonterminal type. It cannot be used to modify values.
# Building lists in Java. ${left()} is equivalent to ((List<Method>)$$)
methods {List<Method>}
: method { $$ = new ArrayList<Method>(); ${left()}.add($method); }
| methods method { ${first()}.add($method); }
;
The generated lexer and parser track byte offsets of the parsed elements in the source text. Location attributes of terminal symbols are filled out by the lexer. A nonterminal inherits its start and end positions from the first and last symbols on the right-hand side of the matched rule respectively. If the rule is empty, then both locations are set to the starting position of the next non-whitespace token in the stream.
# store line and offsets in the AST class
assignment {Assignment} :
ID '=' expr
{ $$ = new Assignment($ID, $expr, ${ID.offset}, ${expr.endoffset}); }
;
Typically element boundaries should not include comments and whitespaces, but if a nonterminal ends with an omitted nullable symbol, its range expands until the next non-whitespace token. Prefer using the question mark syntax to avoid such problems.
# when parsing `( prefix )`, the clause range spans until the closing parenthesis
clause : prefix suffixopt ;
# the range correctly ends at the end of prefix or suffix symbol
clause : prefix suffix? ;
The lexer also tracks the current line number for error reporting, and it can be useful for other purposes.
IntegerConstant {int}: /__LINE__/ { $$ = l.Line() }
Nullable nonterminals must be used with care if you rely on the default AST construction mechanism that comes with Textmapper (see the Arrow notation section below).
From the grammar, Textmapper knows which tokens can start, follow, or be found within a nonterminal, and it makes these sets available to use in the same grammar via the token set syntax. When set(<setexpr>) is found on the right-hand side of a grammar rule, it matches any terminal from <setexpr>.
Expression syntax:
<term> represents itself
<nonterm> all terminals that can be found within nonterm
first <nonterm> all terminals that can start this nonterminal
last <nonterm> all terminals that can end this nonterminal
follow <nonterm> all terminals that can follow this nonterminal (careful about eoi)
precede <nonterm> all terminals that can precede this nonterminal
<setexpr> | <setexpr> union
<setexpr> & <setexpr> intersection
~<setexpr> negation (careful about eoi)
(<setexpr>) grouping
Only follow and negation expressions might contain eoi, and it is often preferable to remove it:
set(follow X & ~(eoi))
set(~(first X | eoi))
The following code uses token sets to skip nested braces and avoids enumerating all terminals in the grammar.
BracesContent:
'{' BracesContent+? '}'
| set(~(eoi | '{' | '}'))
;
Grammar lookaheads (see below) can use token sets to reduce the number of tokens they need to scan before making a decision.
# Accept that this is a type cast after looking at exactly one token after parentheses.
StartOfCast : '(' Type ')' set(first Primary) ;
It is possible to make certain token sets available in generated code for the lexer and/or tests. The following set is generated automatically when error recovery is enabled.
%generate afterErr = set(follow Error);
parser_tables.go will contain:
// set(follow error) = RPAREN, RBRACK, RBRACE, COMMA, SEMICOLON
var afterErr = []int32{
17, 40, 41, 42, 43,
}
Sometimes it is useful to parse a prefix of a string and ignore the remaining text (for lookaheads, for example). We can enable such behavior for any start nonterminal by adding no-eoi after its name in the %input directive.
%input input no-eoi;
input : '(' ... ')' ;
The main requirement for such inputs is that the state after consuming the last terminal of an accepted string is an LR(0) state, which means that it does not require a lookahead. Consider the following grammar:
file : declaration* ;
After parsing each declaration, the parser needs to know if there are more declarations in the stream, and for this it has to check the next token (up to eoi). This nonterminal always consumes the full input.
Most input symbols that end with a terminal (or a fixed set of them) fulfil this requirement.
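For example, a sketch of a prefix input that ends with a fixed terminal and therefore satisfies this requirement:
%input header no-eoi;
header : 'package' qualifiedID ';' ;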
The generated parser is a shift-reduce parser. It builds a parse tree from the leaves upwards by scanning the input left-to-right. The processed part of the input is stored as a set of partially parsed trees in a stack. Initially the stack (called the parsing stack) is empty, and in the end it contains the only parse tree for the whole input (if parsing succeeded). The stack is also used to maintain the associated attributes for each symbol (like values and locations). The parser proceeds step-by-step by applying one of two simple actions:
Shift takes the next token from the input stream and pushes it onto the parsing stack. Shifting is only valid if the augmented parsing stack is a prefix of one of the production rules.
Reduce takes a rule index as an argument and expects that the top N symbols of the stack form the right-hand side of the rule. These N symbols are then replaced with the rule’s left-hand nonterminal. The reduction process also includes computing the associated attributes for this nonterminal (e.g. executing semantic actions).
At any step, either a shift or a single reduce is possible; any ambiguities are caught during compilation.
The following example shows the parsing steps for the variable declaration construct, which is common to many languages.
stack | input stream | action |
---|---|---|
<empty> | int i = 5 + 3; | Shift |
int | i = 5 + 3; | Reduce (type : int) |
type | i = 5 + 3; | Shift |
type ID(i) | = 5 + 3; | Shift (2 times) |
type ID = INT(5) | + 3; | Reduce (expr : integer_const) |
type ID = expr | + 3; | Shift (2 times) |
type ID = expr + INT(3) | ; | Reduce (expr : integer_const) |
type ID = expr + expr | ; | Reduce (expr : expr + expr) |
type ID = expr | ; | Shift |
type ID = expr ; | <empty> | Reduce (var_decl : type ID = expr ;) |
var_decl | <empty> | Reduce (input : var_decl) |
input | <empty> | <success> |
The decision on which action to execute is made at each step, depending on the stack content and the following items in the input stream (so-called lookahead tokens). Textmapper uses LR parsing techniques to come to this decision quickly.
Instead of analyzing the whole stack each time, an aggregated value describing the stack content is used. This value is called a parser state. The parser state is effectively maintained for every element of the stack using a special table in the generated code (called the goto table). When a terminal or nonterminal symbol is pushed onto the stack, the new state is calculated as goto[s, newSymbol], where s is the previous state on top of the stack.
In each state there may be one or more actions that lead to a successful parse. If there are many, we have to examine lookahead tokens to disambiguate. The generated parser scans them one by one until the ambiguity is resolved. Then it applies the only remaining valid action, or reports a syntax error. The pre-computed action table is used by the parser to look up the next action for the current state and the next input terminal. In addition to Shift and Reduce, the action table may contain two more types of actions:
Error reports a syntax error. The parser may also try to recover if error recovery is enabled.
Lookahead means that one more token is needed to disambiguate. It takes a new lookahead state as an argument. Lookahead states extend a set of parser states, and can be used as keys in the action table (but not in the goto table). When the parser encounters this action, it fetches the next lookahead token. It then continues the lookup for the new state and the new token until no more Lookahead actions are returned.
The success state is reached at the end of file and when the parsing stack has been reduced to a single element, the start symbol a.k.a. input.
LR generates deterministic parsers, which means there is a single parse tree for each input. To guarantee this property, Textmapper stops on all ambiguities, or conflicts, during the compilation of the grammar. Most of these conflicts can be fixed by changing the derivation rules or by making precedence rules explicit. Two types of conflicts can be encountered:
Shift/Reduce means that a state offers two transitions: shifting the next token or reducing the stack.
Reduce/Reduce means there exists a state in which two rules match to reduce the stack.
The classic example of a shift/reduce conflict is the Dangling Else problem. The parsing rules for the grammar are as follows:
:: parser
expr:
num | id | pred | ifexpr ;
pred:
id '==' num;
ifexpr:
'if' pred expr # 1
| 'if' pred expr 'else' expr ; # 2
num, id, 'if', 'else' and '==' are terminals.
When compiling the rules above, a shift-reduce error message describing the conflict is raised. Conflict error messages include an example of the content of the parsing stack, the lookahead token that can be shifted, the type of conflict and the rule that can be reduced. For example:
input: 'if' pred expr # example parsing stack leading to the conflict
shift/reduce conflict (next: 'else') # type of conflict and lookahead terminal that can shift
ifexpr : 'if' pred expr # rule whose last elements match the parsing stack.
Let’s use a concrete source: if a==1 3 else 4. When the parsing stack contains 'if', a==1, 3, and the next token is else, the parser can either:
- shift else onto the parsing stack, according to rule #2, or
- reduce 'if' pred expr to the non-terminal ifexpr, via rule #1.
Any two rules sharing the same prefix can lead to a shift/reduce conflict. In this case, the grammar is said to be not left-factored.
In practice, there is only a single way to parse the expression if a==1 3 else 4, so this example provides little help in understanding the conflict. The ambiguity is clearer with another expression allowed by this grammar: if a==1 if b==2 3 else 4. This is either:
- if a==1 (if b==2 3) else 4: the else clause refers to the outer if, or
- if a==1 (if b==2 3 else 4): the else clause refers to the inner if.
In C and Java, the else clause refers to the inner if. In other words, the inner if should not be reduced until the else clause is shifted. We need to augment the grammar to express this policy. Operator precedence, described in the next section, is a mechanism to resolve shift/reduce conflicts.
An operator is similar to a function in the sense that both can have arguments and return a value: 2 + 3 is comparable to add(2, 3). Operators are usually more compact: typically the name is one or two characters. More relevant to our concern, parentheses can be omitted for infix operators. This omission leads to shift/reduce conflicts when using operators instead of functions.
First example, 1-2-3 can be interpreted as:
- subtract(1, subtract(2, 3)) → subtract(1, -1) → 2 ❌, or
- subtract(subtract(1, 2), 3) → subtract(-1, 3) → -4 ✔️
Second example, 2**2**3:
- power(power(2, 2), 3) → 64 ❌, or
- power(2, power(2, 3)) → 256 ✔️
Whether the correct interpretation requires evaluating from the left-most or from the right-most operator first depends on the operator (on the conventions of arithmetic, for these examples). This is called the associativity of the operator, and it is one of:
- left-associative: chains evaluate from the left, as with subtraction,
- right-associative: chains evaluate from the right, as with exponentiation,
- non-associative: the operator cannot be chained; a range operator, as in 1..100, is non-associative.
When operators are composed, another type of ambiguity arises: which arguments belong to which operator. For example, 3*2+1 can be interpreted as:
- product(3, add(2, 1)) → product(3, 3) → 9 ❌, or
- add(product(3, 2), 1) → add(6, 1) → 7 ✔️
In arithmetic, multiplication is done before addition; that is, multiplication binds its arguments more tightly than addition. Multiplication has a higher precedence than addition.
Note that we don’t have to build a parser that uses the same precedence rules as evaluation requires, but it makes sense to do so. When the AST reflects the chosen operator precedence, operators with higher precedence are reduced first, and remain closer to the leaves in the AST. This allows us to evaluate an expression using a simple recursive tree traversal: the order of reduces matches the sequence of evaluations.
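A minimal sketch in Go (with hypothetical AST types, not generated by Textmapper): because '*' nodes end up closer to the leaves, a plain post-order traversal evaluates them before the '+' nodes that use their results.
type Expr interface{}
type Lit struct{ Val int }
type Bin struct {
	Op          byte // '+' or '*'
	Left, Right Expr
}
// eval walks the tree bottom-up: operands are computed before the operator that combines them.
func eval(e Expr) int {
	switch n := e.(type) {
	case Lit:
		return n.Val
	case Bin:
		l, r := eval(n.Left), eval(n.Right)
		if n.Op == '*' {
			return l * r
		}
		return l + r
	}
	return 0
}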
Let’s see how Textmapper encodes associativity and precedence in a grammar. The following grammar snippet produces the Shift/Reduce conflicts described above.
expr :
IntegerConstant
| expr '+' expr
| expr '*' expr
;
We could rewrite it so that operator precedences and associativities are encoded in the grammar:
expr :
plus_expr ;
plus_expr :
mult_expr | plus_expr '+' mult_expr ;
mult_expr :
IntegerConstant | mult_expr '*' IntegerConstant ;
It is heavy, verbose, and requires many excess reductions (every single constant passes its way through all nonterminals). Instead, we can add precedence rules to solve all the ambiguities and stick to the original version. There is no operator type in Textmapper; operators must be terminals. Textmapper provides the %left, %right and %nonassoc directives to define associativity. There is no directive for fully associative operators like addition; use %left or %right. The precedence is implicitly defined by the order of the directives. Example:
%left '+' '-';
%left '*' '/' '%';
Operators that appear first in the file have the lowest precedence. Each subsequent precedence directive introduces a new, higher precedence level. On the same level, operators are interchangeable and have the same associativity (left, right, or nonassoc). The precedence rules must be located in the parser section of a grammar and can be interspersed with nonterminal definitions.
To resolve shift/reduce conflicts, Textmapper first assigns a precedence to a rule. By default, it takes the precedence of the last terminal on the right-hand side of the rule. This can be overridden using a %prec <terminal> rule modifier. The rule modifier must be placed at the end of the rule, after a semantic action and before an arrow declaring an AST node.
When precedences are defined for both the rule and the lookahead token, Textmapper can compare them and perform:
- a Shift, if the lookahead token has a higher precedence than the rule,
- a Reduce, if the rule has a higher precedence,
- for equal precedences, the associativity decides: a Reduce for left-associative operators, a Shift for right-associative ones, and an error for non-associative ones.
For example, the following conflict will be resolved as Shift for the '*' lookahead token and as Reduce for '+' (which is left-associative).
input: MethodHeader '{' expr '+' expr
shift/reduce conflict (next: '+', '*')
expr : expr '+' expr
Another example: the unary minus requires a separate operator terminal, as it has a higher precedence than the usual minus operator. This means we must specify this custom precedence in a rule modifier.
:: lexer
unaryMinus: # introduce a "precedence" terminal
:: parser
%left '+', '-';
%left '*';
%left unaryMinus;
expr :
IntegerConstant
| expr '+' expr
| expr '-' expr
| expr '*' expr
| '-' expr %prec unaryMinus
;
Without the unaryMinus precedence, -1-1 can be parsed in two ways: either as -(1-1), or as (-1)-1.
Finally, let’s resolve the Dangling Else problem from the previous section. When the parsing stack is 'if', pred, expr and the next token is else, the statement must be interpreted as an if/else statement and not prematurely reduced as a short if statement. The next token else must have a higher priority than the rule that would be reduced, 'if' pred expr. To do this, we set the precedence of that rule with %prec, assigning it the lowest precedence via a virtual token conventionally named resolveShift.
%left resolveShift;
%left 'else';
expr:
num | id | pred | ifexpr ;
pred:
id '==' num;
ifexpr:
'if' pred expr %prec resolveShift # 1
| 'if' pred expr 'else' expr # 2
;
Some grammar ambiguities require more advanced lookaheads than those provided by LALR, or have to be addressed with backtracking. Both cases are covered with a special declarative syntax that instructs Textmapper to spawn sub-parsers and do the disambiguation at runtime.
The grammar can contain lookahead terms in the form of (?= ...). Each term acts like a predicate and declares a set of alternative nonterminals that either match or do not match a prefix of the remaining input.
Lambda : (?= StartOfLambda) '(' Parameters ')' '->' LambdaBody ;
StartOfLambda : '(' Parameters ')' '->' ;
A lookahead term becomes an empty nonterminal which needs to be reduced, so as long as there are no conflicts, the only price we pay for having it around is an extra reduce action when parsing the containing rule; otherwise, it is a no-op.
# Negative lookahead.
Parenthesized : (?= !StartOfLambda) '(' Expression ')' ;
If two or more lookahead terms can be reduced at the same time, Textmapper decides on the smallest set of lookaheads and the order in which they need to be performed at runtime to decide which production to prefer. It actually builds a decision tree, so all conflicting terms must have mutually exclusive predicates. This is integrated into the LALR algorithm and produces compile-time errors if there is no way to properly disambiguate the grammar.
In the example above, as soon as the parser sees an opening parenthesis, it creates a copy of the current lexer and spawns a prefix sub-parser for StartOfLambda to look past the parentheses and decide whether this is a lambda or a parenthesized expression. The result is pushed back onto the parsing stack, and the parser continues with one production or the other.
Disambiguating N productions requires N-1 lookaheads, and they are performed in the order in which they are mentioned in the lookahead terms. In the following case, the parser will attempt CastStart first and then LambdaStart if CastStart fails.
Expr :
(?= !CastStart & !LambdaStart) '(' Expr ')'
| (?= CastStart & !LambdaStart) '(' Type ')' Expr
| (?= LambdaStart) '(' Parameters ')' '->' LambdaBody
;
If your lookahead parsers contain rules that require lookaheads themselves, enable the recursive lookahead option:
recursiveLookaheads = true
This option also adds memoization for decisions made by inner parsers, guarding against exponential complexity in lookaheads (turning it into worst-case quadratic). Consider the following input for the above rules: ((((((1)))))).
Lookahead sub-parsers never execute semantic actions or produce AST nodes.
It is possible to mark any position on the right-hand side of a rule with a state marker. The syntax for state markers is a period followed by an identifier. The main purpose of state markers is to expose parser internals to generated code for advanced parsing techniques, such as semicolon insertion. The same marker identifier can occur multiple times in a grammar, capturing parser states that include at least one of its positions. Keep in mind that one parser state corresponds to multiple positions in the grammar, and vice versa, one position in the grammar usually participates in more than one state.
ReturnStatement:
'return' ';'
| 'return' .noLineBreak Expression ';'
;
The previous snippet generates a map available to the parser code at runtime. This is the only visible effect of state markers, so one can also use them for purely documentation purposes.
var noLineBreakStates = map[int]bool{
27: true,
92: true,
}
Now, we can implement the rules of semicolon insertion by inspecting the current parser state at line breaks and checking whether the state allows or disallows semicolons (noLineBreak means eagerly insert a semicolon even if the next line parses fine without one).
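A minimal sketch of how such a check could look on the integration side (willInsertSemicolon is a hypothetical helper; the actual hook depends on how the generated parser is wired to the lexer):
// Called when the next token starts on a new line; state is the current parser state.
// In a noLineBreak state, the line break eagerly terminates the statement.
func willInsertSemicolon(state int) bool {
	return noLineBreakStates[state]
}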
By default, the generated parser stops when it encounters the first syntax error. This may not be desirable in some scenarios, such as parsing user-typed text in an editor, where most of the time the syntactic structure is broken (often locally). Introducing a token with the predefined name error turns on the recovery mechanism. The error terminal represents a part of the stream which cannot be successfully parsed.
# Note: the error token cannot have an associated regular expression
error:
This terminal can then be used in production rules to specify the contexts in which an error is expected and can be properly handled.
statement :
expr_statement ';'
| error ';' # if something is broken here, it is broken up to the next semicolon
;
When the generated parser encounters a syntax error, it starts discarding the parsing context from the stack until it finds a state in which the error terminal is accepted. This means that even if an error occurred in an expression, the stack is unwound up to the statement context in which the appropriate recovery rule is defined. The parser then creates a new error token, pushes it onto the stack, and continues parsing.
The error token range covers all skipped tokens, even those that were successfully reduced into nonterminals before. If the range is empty, it gets positioned right before the next accepted token.
It may happen that the next few tokens in the stream contribute to the error, so the parser remains in the recovery mode until three consecutive input tokens have been successfully shifted. No new syntax errors are reported during this time. The logic for error recovery tries to skip as few tokens as possible and makes sure that the parser will accept the next token after the error. This significantly reduces the number of reported errors compared to Bison and other LALR parser generators.
Textmapper supports restricting the error recovery mechanism to a given production rule. Once the parser passes by a .recoveryScope marker in a rule, the recovery becomes confined within that rule. The text range of any injected error token will start after the marker. The feature teaches the parser to respect structural parentheses and makes the parsing output more predictable on broken input.
# Input:
# delay(func() {
# foo) // <- unexpected parenthesis, this whole line is reported as BrokenStatement
# })
Block:
'{' .recoveryScope Statement* '}' ;
Without .recoveryScope, the parser matches the offending closing parenthesis with the opening parenthesis of the delay call, completely messing up the parse tree.
In most grammars, authors use semantic actions to build an intermediate representation of the input, moving most or all of the semantic processing out of the parser. Usually, this representation is a compact version of the parse tree and is called an Abstract Syntax Tree (AST). The generated parser then becomes a simple component which is able to transform input text into an AST. This approach pays off when you need to reuse the parser in different environments (e.g. in a compiler and for providing code completion in an IDE).
expr {Expression} :
id { $$ = IdExpr{$id} }
| '(' expr ')' { $$ = Parenthesized{$expr} }
;
expr_list {[]Expression} :
expr { $$ = []Expression{$expr} }
| expr_list ',' expr { $$ = append($expr_list, $expr) }
;
input {[]Expression} :
expr_list ;
This grammar is quite verbose but your generated parse function will immediately start returning the assembled AST (or a syntax error).
func (p *Parser) Parse(lexer *Lexer) ([]Expression, error)
If you prefer to keep your grammar clean, there is a different way to produce ASTs out of Textmapper grammars. You can declare your parser ‘event-based’, and use an arrow notation to map nonterminals (or their parts) into AST nodes.
Each token and each nonterminal during parsing have start and end byte offsets. Arrow notation tells the parser to report the range on the left-hand side of the arrow as an AST node of a given type. This turns the parser into a producer of (type, start, end) triples, and that stream completely defines the resulting AST (child nodes are reported before their parents; they nest within parent nodes and don’t overlap).
Enable the event-based APIs:
eventBased = true
The most typical thing to start with is to report each non-list and non-nullable nonterminal as a separate node:
IfStatement -> IfStatement :
'if' expr block
| 'if' expr block 'else' block
;
The arrow notation can also be applied to individual rules (with fallback to the nonterminal node type):
IfStatement -> IfStatement :
'if' expr block # inherits the default type from its nonterminal
| 'if' expr block 'else' block -> IfElseStatement
;
.. and even to individual parts of a rule (requires parentheses):
Foo :
'(' (Parameters ','? -> Parameters) ')' ;
The arrow notation is compatible with semantic actions and associated values but it must always be the last thing in a rule or nonterminal header.
Foo {int} -> Foo :
IntegerConstant { $$ = $IntegerConstant; } %prec IntegerConstant -> IntFoo ;
Alternations require special treatment to make them easier to work with in reconstructed ASTs. We can declare an interface node that will represent all nodes that come out of certain production(s), and then use a generated selector for the interface node to locate its instances. Interface nodes are not reported but they do guarantee that all productions under them produce exactly one child.
%interface Statement;
Statement -> Statement:
IfStatement
| WhileStatement
| ';' -> EmptyStatement
...
;
# Statement is now a set of (IfStatement, WhileStatement, EmptyStatement, ...)
# ifStmt.Children(selector.Statement) returns one or two nodes ("then" + optional "else")
The right-hand side of the arrow can contain both a node and an interface node (joined with as), in which case the interface node starts covering the reported node.
CtorCall -> CtorCall as Statement:
'this' '(' args ')' ';' ;
Then we can merge CtorCall with Statements in a single production. stmt+= and stmts= show that we are merging statements coming from different productions (with different arity) into a single field in the AST.
CtorBody -> Body:
'{' stmt+=CtorCall? stmts=Statements* '}' ;
Simple terminals can also be turned into events by adding them to the reportTokens list. This works even if they are marked as (space), in which case the filtering gets moved into the parser.
reportTokens = [Comment, Identifier, invalid_token]
The final AST gets reconstructed from those events. This is a very convenient way to write parsers since the grammar is not polluted with random chunks of code. Enable the AST code generation:
eventFields = true
eventAST = true
Note: eventFields will enable more checks on the shape of the produced AST:
expr -> Expr : left=expr '+' right=expr ;
In Textmapper, nonterminals can be parameterized, and their parameters can be used to conditionally enable or disable individual productions, adjusting parsed language to the outer context. This happens statically, at compile time, via a grammar instantiation process that lazily duplicates and rewrites nonterminals.
%flag withComma;
list<withComma> : # here we enumerate the required parameters
elem
| [withComma] list ',' elem
| [!withComma] list elem
;
file :
list<withComma: true>
| '[' list<withComma: false> ']'
;
There is also a shorter form of specifying boolean arguments using + and ~.
file :
list<+withComma> # withComma: true
| '[' list<~withComma> ']' # withComma: false
;
Most parameters are declared at the grammar level to make use of the implicit propagation of parameters from the context into referenced nonterminals (in the above example, both references to list inherit the value of withComma from the context). It is also possible to introduce a local flag inside a nonterminal header.
# Each nonterminal can introduce its own parameters.
a<flag foo> :
... ;
b<flag foo> :
a<foo: foo> ; # then we have to explicitly propagate one into another
Parameters without a default value have to be provided every time a nonterminal that requires them is referenced from a context where they are not available.
%flag a;
%flag b = true;
nt<a, b> :
'(' nt ')' | .... ; # both "a" and "b" will be taken from the context.
file:
nt<+a> ; # we can omit specifying "b" since it has a default value (true)
Predicates are full boolean expressions that can use parameter values, negation, and operators of boolean logic (&& and ||).
nt<a, b, c>:
'(' ')'
| [a && b || c] '(' nt<a: b> ')'
;
The normal parameters we’ve discussed so far get auto-propagated into all nonterminals that require them on the right-hand side of a rule, but they have to be mentioned explicitly in all nonterminals on the way between the argument and the predicate.
Textmapper has a second flavor of parameters: lookahead flags. Lookahead flags don’t need to be mentioned in any of the nonterminals, and they propagate only into the left-most nonterminal in every rule, making it very easy to exclude or rewrite productions in certain contexts. See the following example from EcmaScript.
%lookahead flag NoFuncClass = false;
PrimaryExpression:
....
| [!NoFuncClass] ClassExpression
;
# Class expressions are not allowed in statement contexts.
ExpressionStatement<Yield, Await> -> ExprStmt :
Expression<+NoFuncClass> ';' ;
Note: if a lookahead flag argument is not used in some predicate inside the nonterminal it was applied to, Textmapper will fail the grammar compilation with an error.