Parsing with yacc

Basic specifications

Names refer to either tokens or nonterminal symbols. yacc requires token names to be declared as such. While the lexical analyzer may be included as part of the specification file, it is perhaps more in keeping with modular design to keep it as a separate file. Like the lexical analyzer, other subroutines may be included as well. Thus, every specification file theoretically consists of three sections: the declarations, (grammar) rules, and subroutines. The sections are separated by double percent signs (%%; the percent sign is generally used in yacc specifications as an escape character).

A full specification file looks like

when all sections are used. The declarations and subroutines sections are optional. The smallest valid yacc specification might be

Blanks, tabs, and new-lines are ignored, but they may not appear in names or multicharacter reserved symbols. Comments may appear wherever a name is valid. They are enclosed in /* and */, as in the C language.

The rules section is made up of one or more grammar rules. A grammar rule has the form

   A  :  BODY  ;
where A represents a nonterminal symbol, and BODY represents a sequence of zero or more names and literals. The colon and the semicolon are yacc punctuation.

Names may be of any length and may be made up of letters, periods, underscores, and digits although a digit may not be the first character of a name. Uppercase and lowercase letters are distinct. The names used in the body of a grammar rule may represent tokens or nonterminal symbols.

A literal consists of a character enclosed in single quotes. As in the C language, the backslash is an escape character within literals. yacc recognizes all the C language escape sequences. For a number of technical reasons, the null character should never be used in grammar rules.

If there are several grammar rules with the same left-hand side, the vertical bar can be used to avoid rewriting the left-hand side. In addition, the semicolon at the end of a rule is dropped before a vertical bar. Thus the grammar rules

   A   :  B  C  D   ;
   A   :  E  F   ;
   A   :  G   ;
can be given to yacc as
   A   :  B  C  D
       |  E  F
       |  G
by using the vertical bar. It is not necessary that all grammar rules with the same left side appear together in the grammar rules section although it makes the input more readable and easier to change.

If a nonterminal symbol matches the empty string, this can be indicated by

   epsilon :   ;
The blank space following the colon is understood by yacc to be a nonterminal symbol named ``epsilon''.

Names representing tokens must be declared. This is most simply done by writing

   %token   name1  name2  name3
and so on in the declarations section. Every name not defined in the declarations section is assumed to represent a nonterminal symbol. Every nonterminal symbol must appear on the left side of at least one rule.

Of all the nonterminal symbols, the start symbol has particular importance. By default, the symbol is taken to be the left-hand side of the first grammar rule in
the rules section. It is possible and desirable to declare the start symbol explicitly in the declarations section using the %start keyword:

   %start   symbol

The end of the input to the parser is signaled by a special token, called the end-marker. The end-marker is represented by either a zero or a negative number. If the tokens up to but not including the end-marker form a construct that matches the start symbol, the parser function returns to its caller after the end-marker is seen and accepts the input. If the end-marker is seen in any other context, it is an error.

It is the job of the user-supplied lexical analyzer to return the end-marker when appropriate. Usually the end-marker represents some reasonably obvious I/O status, such as end of file or end of record.

Next topic: Actions
Previous topic: Parsing with yacc

© 2004 The SCO Group, Inc. All rights reserved.
UnixWare 7 Release 7.1.4 - 27 April 2004