Introduction to LEX [Compiler Design]

The word “lexical” in the traditional sense means pertaining to words. In terms of programming languages, words are objects like variable names, numbers, and keywords. Such words are traditionally called tokens.

A lexical analyzer, or lexer for short, takes as input a string of individual characters and divides this string into tokens. Additionally, it filters out whatever separates the tokens (the so-called whitespace), i.e., layout characters (spaces, newlines, etc.) and comments.

The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. This interaction, summarized schematically in the following figure, is commonly implemented by making the lexical analyzer a subroutine or a coroutine of the parser.

[Figure: Interaction of lexical analyzer with parser]
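
As a minimal sketch of this arrangement, assuming the usual Lex + Yacc setup discussed below: the parser function yyparse() (generated by Yacc) calls the lexer function yylex() (generated by Lex) each time it needs the next token, so parsing drives lexical analysis on demand.

    /* Sketch of the parser/lexer interface in a typical Lex + Yacc setup. */
    extern int yylex(void);    /* generated by Lex; returns the next token
                                  code, or 0 at end of input */
    extern int yyparse(void);  /* generated by Yacc; calls yylex() whenever
                                  it needs another token */

    int main(void)
    {
        return yyparse();      /* the parser pulls tokens as it goes */
    }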

A Language for Specifying Lexical Analyzers

Several tools have been built for constructing lexical analyzers from special-purpose notations based on regular expressions.

In this section, we describe a particular tool, called Lex, that has been widely used to specify lexical analyzers for a variety of languages. We refer to the tool as the Lex compiler, and to its input specification as the Lex language.

First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex language. Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c. The program lex.yy.c consists of a tabular representation of a transition diagram constructed from the regular expressions of lex.l, together with a standard routine that uses the table to recognize lexemes. The actions associated with the regular expressions in lex.l are pieces of C code and are carried over directly to lex.yy.c. Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is the lexical analyzer that transforms an input stream into a sequence of tokens.

[Figure: Creating a lexical analyzer with Lex]
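
On a typical POSIX system, this pipeline can be driven from the shell roughly as follows (a sketch; input.txt is a hypothetical source file, and -ll links the Lex support library):

    lex lex.l             # produces lex.yy.c
    cc lex.yy.c -ll       # compile and link with the Lex library, producing a.out
    ./a.out < input.txt   # run the lexical analyzer on an input stream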

Lex Specifications

A Lex program consists of three parts. The general format of a Lex source file is:
{definitions}
%%
{rules}
%%
{user subroutines}

The definitions section includes declarations of variables, manifest constants, and regular definitions.
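
For instance, a definitions section might look like the following sketch (the names and token codes are illustrative, not prescribed by Lex):

    %{
    /* C declarations, copied verbatim into lex.yy.c. The token codes are
       hypothetical and would normally be shared with the parser. */
    #include <stdio.h>
    #define ID   256
    #define NUM  257
    %}
    delim    [ \t\n]
    ws       {delim}+
    letter   [A-Za-z]
    digit    [0-9]
    id       {letter}({letter}|{digit})*
    number   {digit}+(\.{digit}+)?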

The rules of a Lex program are statements of the form

R1 {action1}
R2 {action2}
.... ....
Rn {actionN}

where each Ri is a regular expression and each actioni is a program fragment describing what action the lexical analyzer should take when pattern Ri matches a lexeme. Typically, actioni will return control to the parser. In Lex the actions are written in C; in general, however, they can be in any implementation language.
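
Continuing the illustrative definitions above, a rules section might read (again a sketch; yytext is the Lex-provided buffer holding the matched lexeme):

    %%
    {ws}       { /* no action: white space separates tokens and is discarded */ }
    {id}       { return ID;  }
    {number}   { return NUM; }
    .          { printf("unexpected character: %s\n", yytext); }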

The third section holds whatever auxiliary procedures are needed by the actions.
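
To complete the running sketch, the user subroutines section could supply yywrap(), which Lex calls at end of input, and, when the lexer is run standalone rather than under a parser, a main():

    %%
    int yywrap(void) { return 1; }   /* returning 1 signals "no more input" */

    int main(void)
    {
        int token;
        while ((token = yylex()) != 0)
            printf("token code: %d\n", token);
        return 0;
    }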

The Role of the Parser

The parser obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language. We expect the parser to report any syntax errors in an intelligible fashion. It should also recover from commonly occurring errors so that it can continue processing the remainder of its input. We know that programs can contain errors at many different levels. For example, errors can be:

  1. Lexical, such as misspelling an identifier, keyword, or operator.
  2. Syntactic, such as an arithmetic expression with unbalanced parentheses.
  3. Semantic, such as an operator applied to an incompatible operand.
  4. Logical, such as an infinitely recursive call.


Often much of the error detection and recovery in a compiler is centered around the syntax analysis phase.

Lexical Conventions

The following notations are used for specifying token patterns (several are illustrated in the sketch after the list):

  1. . Matches any single character except the newline ('\n').
  2. * Matches zero or more copies of the preceding expression.
  3. [ ] A character class, which matches any character within the brackets.
  4. ^ Matches the beginning of a line when used as the first character of a regular expression.
  5. $ Matches the end of a line when used as the last character of a regular expression.
  6. \ Used to escape a metacharacter.
  7. + Matches one or more occurrences of the preceding regular expression. For example, [0-9]+ matches "12" or "9" but not an empty string.
  8. ? Matches zero or one occurrence of the preceding regular expression.
  9. | Matches either the preceding regular expression or the following one. For example, are|is|because matches any of those three words.
  10. / Matches the preceding regular expression, but only if it is followed by the following regular expression (trailing context).
  11. ( ) Groups a series of regular expressions together into a new regular expression.
  12. Blanks between tokens are optional, with the exception that keywords must be surrounded by blanks, newlines, the beginning of the program, or the final dot.
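
As a brief, illustrative sketch (the patterns and messages here are hypothetical), some of these operators might appear in a rules section as:

    %%
    ^#.*$             { /* '^' and '$': a line that is entirely a comment */ }
    -?[0-9]+          { printf("INTEGER\n"); }   /* '?' makes the sign optional */
    [0-9]+\.[0-9]+    { printf("REAL\n");    }   /* '\' escapes the '.' metacharacter */
    if|then|else      { printf("KEYWORD\n"); }   /* '|' alternation */
    [a-z]+/"("        { printf("CALL\n");    }   /* '/' trailing context: a name only
                                                    when followed by '(' */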

