Lexical analysis with lex

lex routines

Some of your action statements may themselves require reading another character, putting one back to be read again a moment later, or writing a character to an output device. lex supplies three macros to handle these tasks: input(), unput(c), and output(c), respectively. One way to ignore all characters between two special characters, say between a pair of double quotation marks, would be to use input(), thus:

   \"         while (input() != '"');
Upon finding the first double quotation mark, the scanner simply continues reading all subsequent characters, without looking for another match, until it finds a second double quotation mark. (See the further examples of input() and unput(c) usage in ``User routines''.)

By default, these routines are provided as macro definitions. To handle special I/O needs, such as writing to several files, you may use standard I/O routines in C to rewrite the functions. Note, however, that they must be modified consistently. In particular, the character set used must be consistent in all routines, and a value of 0 returned by input() must mean end of file. The relationship between input() and unput(c) must be maintained or the lex lookahead will not work.
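As a sketch of what a consistent replacement pair might look like, the following C fragment keeps a small pushback buffer so that every character handed to unput() is returned by the next call to input(), and returns 0 at end of file. The buffer size, names, and file handling here are illustrative assumptions, not the lex-supplied definitions:

```c
#include <stdio.h>

/* Illustrative replacements, not the lex-supplied macros: a matched
 * input()/unput() pair built on standard I/O.  A return of 0 from
 * input() signals end of file, and every character pushed back by
 * unput() is handed out again by the next input() call, preserving
 * the relationship that lex lookahead depends on. */
static char pushback[64];          /* characters waiting to be re-read */
static int  npushed = 0;

static FILE *src;                  /* current input source (assumption) */

int input(void)
{
    int c;
    if (npushed > 0)
        return pushback[--npushed];
    c = getc(src);
    return (c == EOF) ? 0 : c;     /* 0 means end of file */
}

void unput(int c)
{
    if (npushed < (int)sizeof pushback)
        pushback[npushed++] = (char)c;
}

void output(int c)
{
    putc(c, stdout);
}
```

Because input() checks the pushback buffer before touching the stream, lookahead characters that the scanner returns with unput() are always re-read in the right order.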

If you do provide your own input(), output(c), or unput(c), you must first write #undef input and so on in your definitions section:

   #undef input
   #undef output
   #define input()  . . . etc.
   more declarations
Your new routines will replace the standard ones. See ``Definitions'' for further details.

A lex library routine that you may sometimes want to redefine is yywrap(), which is called whenever the scanner reaches end of file. If yywrap() returns 1, the scanner continues with normal wrapup on end of input. Occasionally, however, you may want to arrange for more input to arrive from a new source. In that case, redefine yywrap() to return 0 whenever further processing is required. The default yywrap() always returns 1. Note that it is not possible to write a normal rule that recognizes end of file; the only access to that condition is through yywrap(). Unless a private version of input() is supplied, a file containing nulls cannot be handled because a value of 0 returned by input() is taken to be end of file.
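A redefined yywrap() along these lines might look like the following sketch, which switches the scanner to a further input file when one remains. The list of follow-up files, its names, and the direct assignment to yyin are illustrative assumptions, not a fixed lex interface:

```c
#include <stdio.h>

/* Sketch of a redefined yywrap() that arranges for more input from a
 * new source.  yyin stands in for the scanner's input stream; the
 * file list is an assumption for illustration. */
FILE *yyin;

static const char *more_files[] = { "second.txt", NULL };
static int next_file = 0;

int yywrap(void)
{
    const char *name = more_files[next_file];
    if (name != NULL) {
        FILE *f = fopen(name, "r");
        if (f != NULL) {
            next_file++;
            yyin = f;      /* scanner continues reading from here */
            return 0;      /* 0: further processing is required */
        }
    }
    return 1;              /* 1: really done; normal wrapup */
}
```

Each time the scanner reaches end of file it calls yywrap(); returning 0 after repointing the input keeps the scan going, and returning 1 lets it finish normally.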

There are a number of lex routines that let you handle sequences of characters to be processed in more than one way. These include yymore(), yyless(n), and REJECT. Recall that the text that matches a given specification is stored in the array yytext[]. In general, once the action is performed for the specification, the characters in yytext[] are overwritten with succeeding characters in the input stream to form the next match. The function yymore(), by contrast, ensures that the succeeding characters recognized are appended to those already in yytext[]. This lets you do one thing and then another, when one string of characters is significant and a longer one including the first is significant as well. Consider a language that defines a string as a set of characters between double quotation marks and specifies that to include a double quotation mark in a string it must be preceded by a backslash. The regular expression matching that is somewhat confusing, so it might be preferable to write:

   \"[^"]*	{
   		if (yytext[yyleng-1] == '\\')
   			yymore();
   		else
   			. . . normal processing
   		}
When faced with the string ``"abc\"def"'', the scanner will first match the characters "abc\, whereupon the call to yymore() will cause the next part of the string "def to be tacked on the end. The double quotation mark terminating the string should be picked up in the code labeled ``normal processing.''
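The appending behavior of yymore() can be pictured with a small stand-alone simulation. Here yytext, yyleng, and set_match() are local stand-ins for the scanner's internals, not the real lex implementation:

```c
#include <string.h>

/* Simulation of the yymore() effect: when it has been called, the
 * next match is appended to yytext rather than overwriting it.
 * These variables are stand-ins, not the scanner's own. */
static char yytext[256];
static int  yyleng;
static int  more_pending;

void yymore(void) { more_pending = 1; }

/* Called by the simulated scanner when a new token is matched. */
void set_match(const char *tok)
{
    if (more_pending) {
        strcat(yytext, tok);      /* append to what is already there */
        more_pending = 0;
    } else {
        strcpy(yytext, tok);      /* normal case: overwrite */
    }
    yyleng = (int)strlen(yytext);
}
```

After set_match("abc"), a call to yymore() followed by set_match("def") leaves "abcdef" in yytext, just as the quoted-string rule above accumulates the pieces of the string.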

The function yyless(n) lets you specify the number of matched characters on which an action is to be performed: only the first n characters of the expression are retained in yytext[]. Subsequent processing resumes at character n+1. Suppose you are again in the code-deciphering business and the idea is to work with only half the characters in a sequence that ends with a certain one, say an upper- or lowercase Z. The code you want might be

   [a-yA-Y]+[Zz]    {  yyless(yyleng/2);
                       . . . process first half of string . . . }
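What yyless(n) does to the match can likewise be sketched outside lex. Again, yytext, yyleng, and the rescan buffer below are illustrative stand-ins for the scanner's internals:

```c
#include <string.h>

/* Simulation (not the real lex internals) of what yyless(n) does to
 * the match buffer: the first n characters stay in yytext[], and the
 * rest are returned to the input to be rescanned. */
static char yytext[128];
static int  yyleng;

static char rescan[128];   /* characters handed back to the input */

void yyless(int n)
{
    strcpy(rescan, yytext + n);   /* tail goes back to the input */
    yytext[n] = '\0';             /* only the first n chars retained */
    yyleng = n;
}
```

With yytext holding "abcdefZ" and yyleng 7, yyless(yyleng/2) keeps "abc" for the action and returns "defZ" to the input stream.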
Finally, the function REJECT lets you more easily process strings of characters even when they overlap or contain one another as parts. REJECT does this by immediately jumping to the next rule and its specification without changing the contents of yytext[]. If you want to count the number of occurrences both of the regular expression snapdragon and of its subexpression dragon in an input text, the following will do:
   snapdragon     {countflowers++; REJECT;}
   dragon         countmonsters++;
As an example of one pattern overlapping another, the following counts the number of occurrences of the expressions comedian and diana, even where the input text has sequences such as comediana:
   comedian      {comiccount++; REJECT;}
   diana         princesscount++;
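Outside lex, the effect of these REJECT rules amounts to counting occurrences even when they overlap or nest inside one another. The helper below is an illustrative simulation, not part of lex:

```c
#include <string.h>

/* Counts occurrences of pat in text, including ones that overlap,
 * by advancing one position past each match before searching again.
 * An illustrative helper, not a lex routine. */
int count_occurrences(const char *text, const char *pat)
{
    int n = 0;
    const char *p = text;
    while ((p = strstr(p, pat)) != NULL) {
        n++;
        p++;               /* step one character so overlaps count */
    }
    return n;
}
```

Scanning "comediana" this way finds one comedian and one diana, just as the REJECT rules above would increment both counters for that input.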
Note that the actions here may be considerably more complicated than simply incrementing a counter. In all cases, you declare the counters and other necessary variables in the definitions section commencing the lex specification.

© 2004 The SCO Group, Inc. All rights reserved.
UnixWare 7 Release 7.1.4 - 27 April 2004