Applications of Regular Expressions and Finite Automata: Lexical Analy

ABSTRACT

Compared to Chapter 3, this application-oriented chapter differs in size as well as style� Indeed, this chapter is much shorter� Second, regarding its presentation style, all concepts underlying lexical analysis are presented in a more descriptive way rather than in the form of strictly rigorous mathematical notions used in Chapter 3� A special attention is paid to their implementation to reflect the pragmatically orientated focus of this chapter�

As far as its organization is concerned, this chapter consists of two sections� Section 4�1 explains how to implement completely specified deterministic finite automata (csDFAs) (see Section 3�2�2) because this implementation is needed in Section 4�2, which discusses the lexical analysis-the subject of this chapter� More precisely, Section 4�2 consists of two subsections� First, Section 4�2�1 gives an introduction to lexical analysis� Second, Section 4�2�2 explains how to implement a lexical analyzer�

Before this pragmatically oriented chapter is started, a note concerning its description of algorithms is in order� Just like the previous chapter, the present chapter describes them by using Pascal-like notation, which is very intuitive and readable� However, some of these algorithms, such as Algorithms 4�2 and 4�3, are actually meant as sample programs that implement components of lexical analyzers� All of them are straightforwardly convertible to any common programming language, such as Java�

4.1 Implementation of Finite Automata This section explains how to implement csDFAs because grasping the construction of a lexical analyzer presupposes a familiarity with this implementation� As a matter of fact, we give two alternative algorithms that implement these automata� The first algorithm actually implements them based on their state tables� The other algorithm is based on a nested case statement; based on this algorithm, a lexical analyzer is implemented in Section 4�2�2� In both of them, we make use of Convention 4�1�

Convention 4.1 Algorithms 4�2 and 4�3 assume that a given input string w is followed by ◃, which thus acts as an input end marker� As their output, the algorithms announces ACCEPT if w ∈ L(M) and REJECT if w ∉ L(M)�

As obvious, in practice, the end of a given input string is always somehow specified, so ◃ only symbolically expresses this end� For instance, if the input string is written as a file, then the end of the file fulfills the role of ◃�

4.1.1 Table-Based Implementation First, we give the algorithm that implements any completely specified DFA, M, based on its state table� In this algorithm, we assume that its states are numbered by 1 through n, for some n ∈ ℕ, where 1 is the start state�

Algorithm 4.2 FA Implementation: A Tabular Method�

Input. A csDFA, M = (MΣ, MR) with MQ = {1, …, n}, MΔ = {a1, …, am}, s = 1, and w◃ with w ∈ MΔ*�

Output� ACCEPT if w ∈ L(M ), and REJECT if w ∉ L(M )�

Note� Recall that MF denotes the set of final states in M according to Convention 3�2�

Method�

type states = 1‥n symbols = 'a1'‥'am' rules = array[states, symbols] of states stateset = set of states

var state: states symbol: symbols rule: rules {array rule represents the state table of M} finalstates: stateset

◾

begin

for all i = 1, …, n and all j = 1, …, m, {filling in the state table} if iaj → k ∈ MR then set rule[i, 'aj'] = k set finalstates to MF {initialization of final states} set state = 1 {1 is the start state} read(symbol) while symbol ≠ ◃ do begin state = rule[state, symbol] {simulation of a move} read(symbol) end if state ∈ finalstates then {decision of acceptance} ACCEPT {w ∈ L(M )} else REJECT {w ∉ L(M )}

end.

Applications of Regular Expressions and Finite Automata: Lexical Analysis

ABSTRACT