Parsing with yacc

Lexical analysis

You must supply a lexical analyzer to read the input stream and communicate tokens (with values, if desired) to the parser. The lexical analyzer is an integer-valued function called yylex(). The function returns an integer, the token number, representing the kind of token read. If there is a value associated with that token, it should be assigned to the external variable yylval.

The parser and the lexical analyzer must agree on these token numbers in order for communication between them to take place. The numbers may be chosen by yacc or the user. In either case, the #define mechanism of C language is used to allow the lexical analyzer to return these numbers symbolically. For example, suppose that the token name DIGIT has been defined in the declarations section of the yacc specification file. The relevant portion of the lexical analyzer might look like

   int yylex()
   {
   	extern int yylval;
   	int c;
   	...
   	c = getchar();
   	...
   	switch (c)
   	{
      	   ...
   	   case '0':
   	   case '1':
     	   ...
   	   case '9':
      	   yylval = c - '0';
      	   return (DIGIT);
      	   ...
   	}
   	...
   }

to return the appropriate token.

The intent is to return a token number of DIGIT and a value equal to the numerical value of the digit. You put the lexical analyzer code in the subroutines section and the declaration for DIGIT in the declarations section. Alternatively, you can put the lexical analyzer code in a separately compiled file, provided

you invoke yacc with the -d option, which generates a file called y.tab.h that contains #define statements for the tokens, and
you #include y.tab.h in the separately compiled lexical analyzer.

This mechanism leads to clear, easily modified lexical analyzers. The only pitfall to avoid is using any token names in the grammar that are reserved or significant in C language or the parser. For example, the use of token names if or while will almost certainly cause severe difficulties when the lexical analyzer is compiled. The token name error is reserved for error handling and should not be used naively.

In the default situation, token numbers are chosen by yacc. The default token number for a literal character is the numerical value of the character in the local character set. Other names are assigned token numbers starting at 257.

If you prefer to assign the token numbers, the first appearance of the token name or literal in the declarations section must be followed immediately by a nonnegative integer. This integer is taken to be the token number of the name or literal. Names and literals not defined this way are assigned default definitions by yacc. The potential for duplication exists here. Care must be taken to make sure that all token numbers are distinct.

For historical reasons, the end-marker must have token number 0 or be negative. You cannot redefine this token number. Thus, all lexical analyzers should be prepared to return 0 or a negative number as a token upon reaching the end of their input.

As noted in ``Lexical analysis with lex'', lexical analyzers produced by lex are designed to work in close harmony with yacc parsers. The specifications for these lexical analyzers use regular expressions instead of grammar rules. lex can be used to produce quite complicated lexical analyzers, but there remain some languages that do not fit any theoretical framework and whose lexical analyzers must be crafted by hand.