Lexical analysis with lex

Some special features

Besides storing the matched input text in yytext[], the scanner automatically counts the number of characters in a match and stores it in the variable yyleng. You may use this variable to refer to any specific character just placed in the array yytext[]. Remember that C language array indexes start with 0, so to print out the third digit (if there is one) in a just recognized integer, you might enter

   [0-9]+         { if (yyleng > 2)
                        printf("%c", yytext[2]); }
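The relationship between the matched text and its length can be sketched in plain C, outside of lex. The names match_text, match_leng, match_digits() and third_char() are hypothetical stand-ins chosen for this illustration; they are not part of lex, and the fixed buffer size is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scanner's yytext[] and yyleng. */
char   match_text[64];
size_t match_leng;

/* Copy the leading run of digits from input into match_text and record
   its length in match_leng, much as a scanner would for a digit pattern.
   Returns the number of characters matched (0 if there is no digit). */
size_t match_digits(const char *input)
{
    assert(input != NULL);
    match_leng = 0;
    while (isdigit((unsigned char)input[match_leng]) &&
           match_leng < sizeof match_text - 1) {
        match_text[match_leng] = input[match_leng];
        match_leng++;
    }
    match_text[match_leng] = '\0';
    return match_leng;
}

/* Return the third character of the current match, or '\0' when the
   match is shorter than three characters. Index 2 is the third element
   because C arrays start at 0. */
char third_char(void)
{
    return (match_leng > 2) ? match_text[2] : '\0';
}
```

After a call such as match_digits("90210 rest"), third_char() returns the character at index 2 of the match. For a two-digit match it returns '\0' instead of reading past the end of the text.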

lex follows a number of high-level rules to resolve ambiguities that may arise from the set of rules that you write. In the following lexical analyzer example, the ``reserved word'' end could match the second rule as well as the eighth, the one for identifiers:

    begin                          return(BEGIN);
    end                            return(END);
    while                          return(WHILE);
    if                             return(IF);
    package                        return(PACKAGE);
    reverse                        return(REVERSE);
    loop                           return(LOOP);
    [a-zA-Z][a-zA-Z0-9]*           { tokval = put_in_tabl();
                                   return(IDENTIFIER); }
    [0-9]+                         { tokval = put_in_tabl();
                                   return(INTEGER); }
    \+                             { tokval = PLUS;
                                   return(ARITHOP); }
    \-                             { tokval = MINUS;
                                   return(ARITHOP); }
    >                              { tokval = GREATER;
                                   return(RELOP); }
    >=                             { tokval = GREATEREQL;
                                   return(RELOP); }

When two or more rules match a string of the same length, lex executes the action of the rule that appears first in the specification. By placing the rule for end and the other reserved words before the rule for identifiers, we ensure that our reserved words will be duly recognized.

Another potential problem arises from cases where one pattern you are searching for is the prefix of another. For instance, the last two rules in the lexical analyzer example above are designed to recognize > and >=. If the text has the string >= at one point, you might worry that the lexical analyzer would stop as soon as it recognized the > character and execute the rule for >, rather than read the next character and execute the rule for >=. lex resolves this by always matching the longest possible character string and executing the rule for that match. Here the scanner would recognize >= and act accordingly. The same principle lets a scanner distinguish + from ++ in a C program.
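The two disambiguation rules, longest match first and then earliest rule on a tie, can be illustrated with a small hand-written C sketch. This is not the code that lex generates. The rule table, the matcher functions and classify() are hypothetical names invented for this example, and only four of the rules from the specification above are modeled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A matcher returns how many leading characters of s it accepts (0 = none). */
typedef size_t (*matcher)(const char *s);

static size_t lit(const char *s, const char *word)
{
    size_t n = strlen(word);
    return strncmp(s, word, n) == 0 ? n : 0;
}

static size_t m_end(const char *s) { return lit(s, "end"); }
static size_t m_gt(const char *s)  { return lit(s, ">"); }
static size_t m_ge(const char *s)  { return lit(s, ">="); }

/* Letter followed by letters or digits, as in [a-zA-Z][a-zA-Z0-9]*
   (assuming the default "C" locale). */
static size_t m_ident(const char *s)
{
    size_t n = 0;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

struct rule { const char *name; matcher match; };

/* Rules in the same relative order as in the specification. */
static const struct rule rules[] = {
    { "END",        m_end   },
    { "IDENTIFIER", m_ident },
    { "GREATER",    m_gt    },
    { "GREATEREQL", m_ge    },
};

/* Pick the rule with the longest match; on a tie the earlier rule wins.
   Stores the match length in *leng and returns the rule name, or NULL
   if no rule matches. */
const char *classify(const char *s, size_t *leng)
{
    const char *best = NULL;
    size_t best_len = 0;

    assert(s != NULL && leng != NULL);
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > best_len) {      /* strict '>' keeps the earlier rule on a tie */
            best_len = n;
            best = rules[i].name;
        }
    }
    *leng = best_len;
    return best;
}
```

With this table, the input end is classified as END because both candidate matches are three characters long and the reserved-word rule comes first. The input endless is classified as IDENTIFIER because the identifier match is longer, and >= is classified as GREATEREQL rather than GREATER for the same reason.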

Still another potential problem exists when the analyzer must read characters beyond the string you are seeking because you cannot be sure that you've in fact found it until you've read the additional characters. These cases reveal the importance of trailing context. The classic example here is the DO statement in FORTRAN. In the statement

   DO 50 k = 1 , 20, 1

we cannot be sure that the first 1 is the initial value of the index k until we read the first comma. Until then, we might have the assignment statement

   DO50k = 1

(Remember that FORTRAN ignores all blanks.) The way to handle this is to use the slash, /, which signifies that what follows is trailing context, something not to be stored in yytext[], because it is not part of the pattern itself. So the rule to recognize the FORTRAN DO statement could be

   DO/([ ][0-9]+[ ][a-zA-Z0-9]+=[a-zA-Z0-9]+,)  {
   	printf("found DO");
   	}

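Different versions of FORTRAN limit the length of identifiers, including the index name used here. To simplify the example, the rule accepts an index name of any length. As written, it also expects exactly one blank after DO and one after the statement label, with no blanks around the equal sign or before the comma. See ``Start conditions'' for a discussion of lex's similar handling of prior context.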
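The effect of trailing context can be sketched in C as a lookahead that is checked but not consumed. The function match_do() and its helpers are hypothetical names for this illustration. The sketch follows the DO pattern exactly as written above, so it accepts only a single blank in each of the two positions and no blanks around the equal sign.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Advance *p over one or more characters accepted by pred.
   Returns 1 if at least one character was consumed, else 0. */
static int some(const char **p, int (*pred)(int))
{
    const char *start = *p;
    while (pred((unsigned char)**p))
        (*p)++;
    return *p > start;
}

/* Advance *p over the single character c. Returns 1 on success, else 0. */
static int one(const char **p, char c)
{
    if (**p != c)
        return 0;
    (*p)++;
    return 1;
}

/* Return the number of characters that would be stored as the match
   (2, for "DO") when the text after DO fits the trailing context
   [ ][0-9]+[ ][a-zA-Z0-9]+=[a-zA-Z0-9]+, and 0 otherwise. The lookahead
   is examined through p only; it is never counted in the result. */
size_t match_do(const char *s)
{
    const char *p;

    assert(s != NULL);
    if (s[0] != 'D' || s[1] != 'O')
        return 0;
    p = s + 2;
    if (one(&p, ' ') && some(&p, isdigit) && one(&p, ' ') &&
        some(&p, isalnum) && one(&p, '=') && some(&p, isalnum) &&
        one(&p, ','))
        return 2;
    return 0;
}
```

Under these assumptions, match_do("DO 50 k=1,20,1") reports a match of length 2, so only DO counts as matched and the rest of the statement is still available to later rules. The assignment DO50k=1 is rejected, as is a DO statement with blanks around the equal sign, which the pattern as written does not allow.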

lex uses the ``$'' symbol as an operator to mark a special trailing context: the end of a line. An example would be a rule to ignore all blanks and tabs at the end of a line:

   [ \t]+$      ;

which could also be written:

   [ \t]+/\n    ;
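For comparison, here is a plain C sketch of what such a rule does to the input when every other character is simply copied through. The function drop_trailing_blanks() is a hypothetical helper written for this illustration, not something lex provides. It follows the second form of the rule, so a run of blanks or tabs is dropped only when a newline immediately follows it.

```c
#include <assert.h>
#include <stddef.h>

/* Copy in to out, dropping every run of blanks and tabs that is
   immediately followed by a newline. Runs elsewhere, including a run at
   the very end of the input with no newline after it, are copied
   unchanged. out must have room for the whole input plus a terminating
   '\0'. Returns the number of characters written, not counting the '\0'. */
size_t drop_trailing_blanks(const char *in, char *out)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    while (in[i] != '\0') {
        if (in[i] == ' ' || in[i] == '\t') {
            size_t end = i;

            /* Find the end of this run of blanks and tabs. */
            while (in[end] == ' ' || in[end] == '\t')
                end++;
            if (in[end] != '\n') {
                /* Not at the end of a line: keep the run. */
                while (i < end)
                    out[o++] = in[i++];
            }
            i = end;            /* otherwise the run is skipped */
        } else {
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
    return o;
}
```

Given a line such as a b followed by a blank, a tab and a newline, the blank between a and b is kept and the blank and tab before the newline are removed.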

On the other hand, if you want to match a pattern only when it starts a line or a file, you can use the ^ operator. Suppose a text-formatting program requires that you not start a line with a blank. You might want to check input to the program with some such rule as

   ^[ ]      printf("error: remove leading blank");

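Note the difference in meaning when the ^ operator appears immediately after the left bracket of a character class, where it complements the class instead of anchoring the pattern to the start of a line, as described in ``Operators''.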