Parsing with yacc

Parsing with yacc

yacc provides a general tool for imposing structure on the input to a computer program. When you use yacc, you prepare a specification that includes

yacc then turns the specification into a C language function that examines the input stream. This function, called a parser, works by calling the low-level scanner. The scanner, called a lexical analyzer, picks up items from the input stream. The selected items are known as tokens. Tokens are compared to the input construct rules, called grammar rules. When one of the rules is recognized, the code you have supplied for the rule is invoked. This code is called an action. Actions are fragments of C language code. They can return values and make use of values returned by other actions.

The heart of the yacc specification is the collection of grammar rules. Each rule describes a construct and gives it a name. For example, one grammar rule might be

   date  :  month_name  day  ´,´  year   ;
where date, month_name, day, and year represent constructs of interest; presumably, month_name, day, and year are defined in greater detail elsewhere. In the example, the comma is enclosed in single quotes. This means that the comma is to appear literally in the input. The colon and semicolon merely serve as punctuation in the rule and have no significance in evaluating the input. With proper definitions, the input
   July  4, 1776
might be matched by the rule.

The lexical analyzer is an important part of the parsing function. This user-supplied routine reads the input stream, recognizes the lower-level constructs, and communicates these as tokens to the parser. The lexical analyzer recognizes constructs of the input stream as terminal symbols; the parser recognizes constructs as nonterminal symbols. To avoid confusion, we will refer to terminal symbols as tokens.

There is considerable leeway in deciding whether to recognize constructs using the lexical analyzer or grammar rules. For example, the rules

   month_name : 'J' 'a' 'n'  ;
   month_name : 'F' 'e' 'b'  ;
               . . .
   month_name : 'D' 'e' 'c'  ;
might be used in the above example. While the lexical analyzer only needs to recognize individual letters, such low-level rules tend to waste time and space, and may complicate the specification beyond the ability of yacc to deal with it. Usually, the lexical analyzer recognizes the month names and returns an indication that a month_name is seen. In this case, month_name is a token and the detailed rules are not needed.

Literal characters such as a comma must also be passed through the lexical analyzer and are also considered tokens.

Specification files are very flexible. It is relatively easy to add to the above example the rule

   date  :  month '/' day '/' year   ;
as a synonym for
   July 4, 1776
on input. In most cases, this new rule could be slipped into a working system with minimal effort and little danger of disrupting existing input.

The input being read may not conform to the specifications. With a left-to-right scan, input errors are detected as early as is theoretically possible. Thus, not only is the chance of reading and computing with bad input data substantially reduced, but the bad data usually can be found quickly. Error handling, provided as part of the input specifications, permits the reentry of bad data or the continuation of the input process after skipping over the bad data.

In some cases, yacc fails to produce a parser when given a set of specifications. For example, the specifications may be self-contradictory, or they may require a more powerful recognition mechanism than that available to yacc. The former cases represent design errors; the latter cases often can be corrected by making the lexical analyzer more powerful or by rewriting some of the grammar rules. While yacc cannot handle all possible specifications, its power compares favorably with similar systems. Moreover, the constructs that are difficult for yacc to handle are also frequently difficult for human beings to handle. Some users have reported
that the discipline of formulating valid yacc specifications for their input revealed errors of conception or design early in program development.

The remainder of this topic describes the following subjects:

In addition, there are two examples and a summary of the yacc input syntax.

Next topic: Basic specifications

© 2004 The SCO Group, Inc. All rights reserved.
UnixWare 7 Release 7.1.4 - 27 April 2004