Lexical analysis with lex


Once the scanner recognizes a string matching the regular expression at the start of a rule, it looks to the right of the rule for the action to be performed. You supply the actions. Kinds of actions include recording the token type found and its value, if any; replacing one token with another; and counting the number of instances of a token or token type. You write these actions as program fragments in C. An action may consist of as many statements as are needed for the job at hand. You may want to change the text in some way or simply print a message noting that the text has been found. So, to recognize the expression Amelia Earhart and to note such recognition, the rule

   "Amelia Earhart"   printf("found Amelia");
would do. And to replace in a text lengthy medical terms with their equivalent acronyms, a rule such as
   Electroencephalogram    printf("EEG");
would be called for. To count the lines in a text, we need to recognize the ends of lines and increment a line counter. As we have noted, lex uses the standard C escape sequences, including \n for new-line. So, to count lines we might have
   \n   lineno++;
where lineno, like other C variables, is declared in the definitions section that we discuss later.
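
As a rough illustration of what the line-counting rule accomplishes, here is a minimal hand-written C sketch. It is not the scanner that lex generates; the function name count_lines is hypothetical, and the sketch assumes the whole input is available as a NUL-terminated string.

```c
#include <stdio.h>

/* Counter that the rule  \n  lineno++;  would increment. */
static int lineno = 0;

/* Hypothetical helper: walk the input once and treat every '\n'
   as a match for the rule. Other characters are not acted on here. */
int count_lines(const char *input)
{
    lineno = 0;
    for (const char *p = input; *p != '\0'; p++) {
        if (*p == '\n')
            lineno++;       /* the action attached to the \n rule */
    }
    return lineno;
}
```

A call such as count_lines("one\ntwo\n") leaves lineno at 2. In a real lex specification the loop is supplied by the generated scanner, and you write only the pattern and the action.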

Input is ignored when the C language null statement ; is specified. So the rule

   [ \t\n]  ;
causes blanks, tabs, and new-lines to be ignored. Note that the alternation operator | can also be written in place of an action to indicate that a rule shares the action of the next rule. The previous example could have been written:
   " "   |
   \t    |
   \n    ;
with the same result.
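
One way to picture the shared-action form is as a C switch whose case labels fall through to a single body. The sketch below is only an analogy written by hand, not lex output; the name strip_whitespace is hypothetical, and it assumes the caller supplies a destination buffer at least as large as the source string.

```c
#include <stddef.h>

/* Hypothetical helper: copy src to dst, dropping blanks, tabs, and
   new-lines. dst must have room for strlen(src) + 1 bytes.
   Returns the number of characters kept. */
size_t strip_whitespace(const char *src, char *dst)
{
    size_t kept = 0;
    for (; *src != '\0'; src++) {
        switch (*src) {
        case ' ':    /* like  " "  |   (use the next rule's action) */
        case '\t':   /* like  \t   |   (use the next rule's action) */
        case '\n':   /* like  \n   ;   (null statement: ignore)     */
            break;
        default:     /* any other character is kept in this sketch */
            dst[kept++] = *src;
            break;
        }
    }
    dst[kept] = '\0';
    return kept;
}
```

Just as the three case labels share one body, the three rules share the single null action written after the last of them.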

The scanner stores text that matches an expression in a character array called yytext[]. You can print or manipulate the contents of this array as you like.
In fact, lex provides a macro called ECHO that is equivalent to printf("%s", yytext). We will see an example of its use in "Start conditions".
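
To make the idea of manipulating the matched text concrete, the following sketch shows what an action body might do with it. The function shout_match and its parameter matched are hypothetical stand-ins: in a real action you would operate on yytext directly, and the final printf mirrors the expansion the text gives for ECHO.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical action body: upper-case the matched text in place,
   then print it. 'matched' stands in for yytext. */
void shout_match(char *matched)
{
    for (char *p = matched; *p != '\0'; p++)
        *p = (char)toupper((unsigned char)*p);
    printf("%s", matched);   /* same form as the ECHO expansion */
}
```

The sketch changes characters without making the string longer, so it never writes past the end of the buffer it was given.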

Sometimes your action may consist of a long C statement, or two or more C statements, and you wish to write it on several lines. To inform lex that the action is for one rule only, simply enclose the C code in braces. For example, to count the total number of all digit strings in an input text, print the running total of the number of digit strings, and print out each one as soon as it is found, your lex code might be

   \+?[0-9]+           { digstrngcount++;
                         printf("%d", digstrngcount);
                         printf("%s", yytext);   }
This specification matches digit strings whether they are preceded by a plus sign or not, because the ? indicates that the preceding plus sign is optional. It also matches the digits of a negative number, because the portion following the minus sign fits the pattern; the minus sign itself is not part of the match. The next section explains how to distinguish negative from positive integers.
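
The behavior described above can be sketched in plain C. This is a hand-written approximation of the matching, not code produced by lex: the function name count_digit_strings is hypothetical, it assumes that all ten digits 0 through 9 are meant to count, and it prints each match on its own line with the running total in front, which is a formatting choice made for readability.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: count substrings made of an optional '+'
   followed by one or more digits, printing each as it is found. */
int count_digit_strings(const char *s)
{
    int digstrngcount = 0;
    size_t i = 0;

    while (s[i] != '\0') {
        size_t start = i;
        if (s[i] == '+')                 /* the optional plus sign */
            i++;
        size_t first_digit = i;
        while (isdigit((unsigned char)s[i]))
            i++;                         /* take the longest run of digits */
        if (i > first_digit) {           /* at least one digit: a match */
            digstrngcount++;
            printf("%d: %.*s\n", digstrngcount, (int)(i - start), s + start);
        } else {
            i = start + 1;               /* no match here: skip one character */
        }
    }
    return digstrngcount;
}
```

Given the input "12 +3 -45", the sketch finds three digit strings: 12, +3, and 45. The minus sign is skipped rather than matched, which is why this approach alone cannot tell negative integers from positive ones.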

© 2004 The SCO Group, Inc. All rights reserved.
UnixWare 7 Release 7.1.4 - 27 April 2004