Lexical analysis with lex

Using lex with yacc

If you work on a compiler project or develop a program to check the validity of an input language, you may want to use the UNIX system tool yacc (see ``Parsing with yacc''). yacc generates parsers, programs that analyze input to ensure that it is syntactically correct. lex and yacc are often used together in compiler development. Whether or not you plan to use lex with yacc, be sure to read this section because it covers information of interest to all lex programmers.

As noted, a program uses the lex-generated scanner by repeatedly calling the function yylex(). This name is used because a yacc-generated parser calls its lexical analyzer with this very name. To use lex to create the lexical analyzer for a compiler, you want to end each lex action with the statement return token, where token is a defined term whose value is an integer. The integer value of the token returned indicates to the parser what the lexical analyzer has found. The parser, called yyparse() by yacc, then resumes control and makes another call to the lexical analyzer when it needs another token.
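The calling protocol described above can be sketched in plain C. The following is a hand-written stand-in for a scanner, not lex output: the names toy_yylex(), toy_set_input(), and toy_count_tokens() and the TOK_ codes are hypothetical, chosen only for illustration. It shows the one point that matters here: each call returns one integer token code, and 0 signals the end of the input.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes; a real project would take these from y.tab.h. */
enum { TOK_EOF = 0, TOK_IDENTIFIER = 258, TOK_INTEGER = 259, TOK_OTHER = 260 };

static const char *scan_pos = "";   /* next unread character of the input */

/* Point the toy scanner at a NUL-terminated input string. */
void toy_set_input(const char *text)
{
    assert(text != NULL);
    scan_pos = text;
}

/* Hand-written stand-in for yylex(): returns one token code per call,
   and 0 once the input is exhausted. */
int toy_yylex(void)
{
    while (isspace((unsigned char)*scan_pos))
        scan_pos++;
    if (*scan_pos == '\0')
        return TOK_EOF;
    if (isalpha((unsigned char)*scan_pos)) {
        while (isalnum((unsigned char)*scan_pos))
            scan_pos++;
        return TOK_IDENTIFIER;
    }
    if (isdigit((unsigned char)*scan_pos)) {
        while (isdigit((unsigned char)*scan_pos))
            scan_pos++;
        return TOK_INTEGER;
    }
    scan_pos++;                      /* any other single character */
    return TOK_OTHER;
}

/* Mimics the parser's side of the protocol: keep asking for tokens
   until the scanner reports end of input. Returns the tokens seen. */
int toy_count_tokens(const char *text)
{
    int count = 0;

    toy_set_input(text);
    while (toy_yylex() != TOK_EOF)
        count++;
    return count;
}
```

A yacc-generated yyparse() does more than count, of course, but it pulls tokens from yylex() in the same one-at-a-time fashion.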

In a compiler, the different values of the token indicate what, if any, reserved word of the language has been found or whether an identifier, constant, arithmetic operator, or relational operator has been found. In the latter cases, the analyzer must also specify the exact value of the token: what the identifier is, whether the constant is, say, 9 or 888, whether the operator is + or -, and whether the relational operator is ``='' or ``>''. Consider the following portion of lex source (discussed in another context earlier) for a scanner that recognizes tokens in a "C-like" language:

    begin                          return(BEGIN);
    end                            return(END);
    while                          return(WHILE);
    if                             return(IF);
    package                        return(PACKAGE);
    reverse                        return(REVERSE);
    loop                           return(LOOP);
    [a-zA-Z][a-zA-Z0-9]*          { tokval = put_in_tabl();
                                   return(IDENTIFIER); }
    [0-9]+                         { tokval = put_in_tabl();
                                   return(INTEGER); }
    \+                             { tokval = PLUS;
                                   return(ARITHOP); }
    \-                             { tokval = MINUS;
                                   return(ARITHOP); }
    >                              { tokval = GREATER;
                                   return(RELOP); }
    >=                             { tokval = GREATEREQL;
                                   return(RELOP); }

Despite appearances, the tokens returned, and the values assigned to tokval, are indeed integers. Good programming style dictates that we use informative terms such as BEGIN, END, WHILE, and so forth to signify the integers the parser understands, rather than use the integers themselves. You establish the association by using #define statements in the C routine that calls your parser. For example,

   #define BEGIN  1
   #define END  2
     .
   #define PLUS 7
     .

If the need arises to change the integer for some token type, you then change the #define statement in the parser rather than hunt through the entire program changing every occurrence of the particular integer. If you use yacc to generate your parser, insert the statement

   #include "y.tab.h"

in the definitions section of your lex source. The file y.tab.h, which is created when yacc is invoked with the -d option, provides #define statements that associate token names such as BEGIN, END, and so on with the integers of significance to the generated parser.

To indicate the reserved words in the example, the returned integer values suffice. For the other token types, an additional integer describing the particular token is stored in the programmer-defined variable tokval. This variable, whose definition was shown earlier as an example of the definitions section, is globally defined so that the parser as well as the lexical analyzer can access it. yacc provides the variable yylval for the same purpose.
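As a rough illustration of how a global such as tokval carries information from the scanner to the parser, consider the following C sketch. It is not lex or yacc output. The functions scan_operator() and apply_arithop() are hypothetical, and all of the numeric codes except PLUS (defined as 7 above) are invented for the example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical codes: only PLUS = 7 comes from the text. */
#define PLUS        7
#define MINUS       8
#define GREATER     9
#define GREATEREQL 10
#define ARITHOP   300
#define RELOP     301

int tokval;   /* shared: written by the scanner, read by the parser */

/* Scanner side: record which operator was seen in tokval and return
   the general class of operator. Returns 0 for an unknown lexeme. */
int scan_operator(const char *lexeme)
{
    assert(lexeme != NULL);
    if (strcmp(lexeme, "+") == 0)  { tokval = PLUS;       return ARITHOP; }
    if (strcmp(lexeme, "-") == 0)  { tokval = MINUS;      return ARITHOP; }
    if (strcmp(lexeme, ">") == 0)  { tokval = GREATER;    return RELOP; }
    if (strcmp(lexeme, ">=") == 0) { tokval = GREATEREQL; return RELOP; }
    return 0;
}

/* Parser side: having received ARITHOP, consult tokval to learn
   which arithmetic operator to apply. */
int apply_arithop(int left, int right)
{
    return tokval == PLUS ? left + right : left - right;
}
```

The return value tells the parser only that some arithmetic or relational operator was found; the global holds the detail. A yacc-based project would normally use yylval in place of a hand-declared variable.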

Note that the example shows two ways to assign a value to tokval. First, a function put_in_tabl() places the name and type of the identifier or constant in a symbol table so that the compiler can refer to it in this or a later stage of the compilation process. More to the present point, put_in_tabl() assigns a type value to tokval so that the parser can use the information immediately to determine the syntactic correctness of the input text. The function put_in_tabl() would be a routine that the compiler writer might place in the user routines section of the parser. Second, in the last few actions of the example, tokval is assigned a specific integer indicating which arithmetic or relational operator the scanner recognized. If the name PLUS, for instance, is associated with the integer 7 by means of the #define statement above, then when a + is recognized, the action assigns to tokval the value 7, which indicates the +. The scanner indicates the general class of operator by the value it returns to the parser (that is, the integer signified by ARITHOP or RELOP).
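The text does not show put_in_tabl() itself, so the following C fragment is only one guess at its general shape. Everything in it is an assumption made for illustration: the name put_in_tabl_sketch(), the fixed-size table, the explicit lexeme and type arguments (a real routine called from a lex action would more likely read the matched text from yytext), and the choice to return the table index.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE   64   /* hypothetical capacity */
#define NAME_MAX_LEN 32   /* hypothetical longest lexeme, including NUL */

struct symbol {
    char name[NAME_MAX_LEN];
    int  type;            /* token type code, e.g. IDENTIFIER or INTEGER */
};

static struct symbol table[TABLE_SIZE];
static int table_used = 0;

/* Returns the index of the lexeme in the table, inserting it first if
   it is not already there. Returns -1 if the table is full or the
   lexeme is too long to store. */
int put_in_tabl_sketch(const char *lexeme, int type)
{
    int i;

    assert(lexeme != NULL);
    for (i = 0; i < table_used; i++)
        if (table[i].type == type && strcmp(table[i].name, lexeme) == 0)
            return i;                       /* already recorded */
    if (table_used == TABLE_SIZE || strlen(lexeme) >= NAME_MAX_LEN)
        return -1;
    strcpy(table[table_used].name, lexeme); /* length checked above */
    table[table_used].type = type;
    return table_used++;
}
```

With a routine of this kind, an action such as tokval = put_in_tabl(); leaves the parser holding an integer it can use to look up the identifier or constant later. A linear search is used only to keep the sketch short; a production symbol table would usually be hashed.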

When you use lex with yacc, you can run either tool first. The command

   $ yacc -d grammar.y

generates a parser in the file y.tab.c. As noted, the -d option creates the file y.tab.h, which contains the #define statements that associate the yacc-assigned integer token values with the user-defined token names. Now you can invoke lex with the command

   $ lex lex.l

then compile and link the output files with the command

   $ cc lex.yy.c y.tab.c -ly -ll

Note that the yacc library is loaded (via -ly) before the lex library (via -ll). Both libraries supply a main(); loading the yacc library first ensures that the main() linked into the program is the one that calls the yacc parser.