ABSTRACT

In this chapter, we demonstrate the use of context-free grammars (CFGs) and pushdown automata (PDAs) in practice. More specifically, we apply them to the syntax analysis of programming languages because this analysis is literally unthinkable without these models. First, we explain how to specify the programming language syntax by using CFGs. Then, we construct syntax analyzers based on PDAs.

Today, CFGs are recognized as the most widely used specification tool for the syntactic structures of programming languages. Accordingly, PDAs, which represent the basic automaton-based counterpart to CFGs (see Theorem 6.64), usually underlie syntax analyzers, whose fundamental task is to decide whether the syntactic structures are correctly specified according to the grammatical rules. If they are not, the syntax analyzers detect and specify all the errors in them. If the structures are syntactically correct, the syntax analyzers produce parses, that is, the sequences of rules according to which the syntactic structures are generated; these parses are usually important to the engineering techniques for language processing within which the syntax analyzer is applied. For instance, compilers make use of parses obtained in this way to direct the translation of programs written in high-level programming languages into functionally equivalent machine-language programs, which perform the computations specified by the input high-level programs.

This chapter consists of three sections. By using the terminology and results obtained in the theoretically oriented discussion in Chapter 6, Section 7.1 conceptualizes two fundamental approaches to syntax analysis: top-down parsing and bottom-up parsing. Then, Section 7.2 explains the former approach while Section 7.3 explores the latter. The style of presentation in Sections 7.2 and 7.3 slightly differs from that used in Section 7.1, which is more theoretically oriented and, therefore, gives all results and concepts in the form of mathematical statements and formulas. Instead of this theoretical approach, Sections 7.2 and 7.3 approach parsing less formally because they are primarily application-oriented. Indeed, they describe parsing in terms of easy-to-implement tables and algorithms based on them.


7.1 Introduction to Syntax Analysis

This section gives the basics of syntax analysis in terms of CFGs and PDAs, introduced in Chapter 6. Its purpose is threefold. First, it gives a quite general insight into the syntax analysis of programming languages. Second, it explains how CFGs and PDAs conceptually underlie syntax analyzers. Finally, it introduces important conventions that make syntax analyzers simple to describe and easy to implement in Sections 7.2 and 7.3.

The syntax of a programming language L is almost always specified by a CFG, G = (Σ, R), satisfying L = L(G). In essence, a syntax analyzer for G is a PDA M that decides whether a string w ∈ L(G). M makes this decision by accepting w exactly when G generates w; consequently, L(M) = L(G). In greater detail, to demonstrate that w ∈ L(G), M simulates the construction of GS ⇒* w [ρ], where ρ represents a parse of w, that is, a sequence of rules from GR according to which G derives w. If M successfully completes this construction, it usually produces the parse ρ as its output; hence, it is customarily referred to as a G-based parser. Typically, M is designed so it constructs GS ⇒* w either in a leftmost way or in a rightmost way. Accordingly, there exist two different approaches to parsing, customarily referred to as top-down parsing and bottom-up parsing, which produce left parses and right parses, respectively. Both approaches, together with the two notions of parses, are described as follows:

I. M simulates GS lm⇒* w [ρ] so it starts from GS and proceeds toward w by simulating leftmost derivation steps according to rules from GR. If and when it completes GS lm⇒* w [ρ], it usually also produces ρ as the left parse of w corresponding to GS lm⇒* w [ρ], that is, the sequence of rules according to which G makes this leftmost derivation. In terms of the corresponding derivation tree ♣(GS lm⇒* w), this approach can be naturally rephrased as the construction of ♣(GS lm⇒* w) so it starts from the root and proceeds down toward the frontier; hence, a parser that works in this top-down way is called a G-based top-down parser.

II. If M simulates GS rm⇒* w [ρ], it makes this simulation in reverse. That is, it starts from w and proceeds toward GS by making rightmost derivation steps, each of which is performed in reverse by reducing the right-hand side of a rule to its left-hand side. If and when it completes this reverse construction of GS rm⇒* w [ρ], it produces reversal(ρ) as the right parse of w corresponding to GS rm⇒* w [ρ], that is, the reverse sequence of rules according to which G makes this rightmost derivation. To express this construction in terms of ♣(GS rm⇒* w), a parser like this constructs the tree so it starts from its frontier w and proceeds up toward the root; hence, a parser that works in this way is referred to as a G-based bottom-up parser.
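To make the two kinds of parses concrete, here is a brief sketch of ours (not part of the text) that replays a left parse top-down and a right parse bottom-up for a tiny hypothetical grammar with the rules 1: S → aSb and 2: S → c; locating the handle by its rightmost occurrence is a simplification that suffices for this grammar.

```python
# Sketch: replaying parses for the hypothetical grammar
# 1: S -> aSb, 2: S -> c (uppercase letters act as nonterminals).
RULES = {1: ("S", ["a", "S", "b"]), 2: ("S", ["c"])}

def replay_left_parse(parse):
    """Top-down: expand the leftmost nonterminal by each rule in parse."""
    sent = ["S"]
    for r in parse:
        lhs, rhs = RULES[r]
        i = next(k for k, s in enumerate(sent) if s.isupper())
        assert sent[i] == lhs
        sent[i:i + 1] = rhs
    return "".join(sent)

def replay_right_parse(word, right_parse):
    """Bottom-up: reduce handles to left-hand sides, consuming the
    right parse reversal(rho); the handle is found as the rightmost
    occurrence of a rule's right-hand side (a simplification)."""
    sent = list(word)
    for r in right_parse:
        lhs, rhs = RULES[r]
        for i in range(len(sent) - len(rhs), -1, -1):
            if sent[i:i + len(rhs)] == rhs:
                sent[i:i + len(rhs)] = [lhs]
                break
    return sent == ["S"]

print(replay_left_parse([1, 1, 2]))            # aacbb
print(replay_right_parse("aacbb", [2, 1, 1]))  # True
```

Here [1, 1, 2] is the left parse of aacbb, while its right parse is reversal(112) = 211.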

Whichever way a parser is designed, it is always based on a PDA. Convention 7.1 simplifies the upcoming discussion of parsers by considering only one-state PDAs, which are equivalent to ordinary PDAs, as follows from Algorithm 6.52 and Theorems 6.53 and 6.58. In addition, some pragmatically motivated conventions concerning configurations are introduced too.

Convention 7.1 Throughout this chapter, we assume that every PDA M has a single state denoted by ⬦, so MQ = MF = {⬦}. Instead of a configuration of the form x⬦y from MΧ (see Convention 6.51), we write ▹x⬦y◃, where ▹ and ◃ are two special symbols such that {▹, ◃} ∩ MΣ = ∅. By pd, we refer to the pushdown ▹x, whose rightmost symbol represents the pd top symbol and ▹ is called the pd bottom. We consider pd empty if x = ε and, therefore, ▹ occurs on the pushdown top. By ins, we refer to the input symbol defined as the leftmost symbol of y◃. When ins = ◃, referred to as the input end, all the input string has been read. As a result, M always accepts in a configuration of the form ▹⬦◃.

7.1.1 Syntax Specified by Context-Free Grammars

Of course, parsers can verify the syntax of a programming language only if it is precisely specified. Today, CFGs are almost exclusively used for this purpose. The following example illustrates how to specify the syntax of common syntactic structures, such as logical expressions, by using CFGs.

Example 7.1 We want to describe logical expressions of the form

ι0 o1 ι1 o2 ι2 … on ιn

where oj ∈ {∨, ∧}, and ιk is a logical variable, symbolically denoted by i (intuitively, i stands for an identifier), or another logical expression enclosed in parentheses, for all 1 ≤ j ≤ n and 0 ≤ k ≤ n, for some n ∈ 0ℕ. For instance, (i ∨ i) ∧ i is an expression like this. A logical variable can be simply derived by this rule

S → i

We also introduce

S → (S)

to derive a parenthesized expression, that is, any valid expression enclosed by parentheses. To derive the logical operators ∨ and ∧, we add these two rules

S → S ∨ S and S → S ∧ S

As a result, we define the expressions by the four-rule CFG with these rules

1: S → S ∨ S, 2: S → S ∧ S, 3: S → (S), 4: S → i

This CFG is obviously ambiguous (see Definition 6.13); for instance, S lm⇒* i ∨ i ∧ i [14244] as well as S lm⇒* i ∨ i ∧ i [21444]. Observe, however, that the same language that the four-rule CFG generates is also generated by the unambiguous six-rule CFG defined as

1: S → S ∨ A, 2: S → A, 3: A → A ∧ B, 4: A → B, 5: B → (S), 6: B → i
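As a mechanical check of this ambiguity, the following sketch of ours replays the two left parses of i ∨ i ∧ i in the four-rule CFG and confirms that both derive the same sentence:

```python
# Replaying two distinct left parses in the ambiguous four-rule CFG;
# both rule sequences derive i∨i∧i, witnessing the ambiguity.
RULES = {1: ("S", ["S", "∨", "S"]),
         2: ("S", ["S", "∧", "S"]),
         3: ("S", ["(", "S", ")"]),
         4: ("S", ["i"])}

def leftmost(parse):
    """Apply each rule of parse to the leftmost nonterminal in turn."""
    sent = ["S"]
    for r in parse:
        lhs, rhs = RULES[r]
        i = sent.index("S")        # S is the only nonterminal here
        assert sent[i] == lhs
        sent[i:i + 1] = rhs
    return "".join(sent)

print(leftmost([1, 4, 2, 4, 4]))   # i∨i∧i
print(leftmost([2, 1, 4, 4, 4]))   # i∨i∧i
```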

Both previous CFGs are left-recursive (see Definition 6.37). However, some important methods of syntax analysis work only with non-left-recursive CFGs; in fact, all the methods described in Section 7.2 are of this kind. Therefore, we give one more equivalent non-left-recursive unambiguous CFG, obtained from the previous CFG by Algorithm 6.38:

1: S → AC, 2: C → ∨AC, 3: C → ε, 4: A → BD, 5: D → ∧BD, 6: D → ε, 7: B → (S), 8: B → i
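Because this third CFG is non-left-recursive, each nonterminal can be turned directly into a procedure that chooses its rule by looking at the current input symbol, which is the essence of the top-down methods of Section 7.2. The following recursive-descent sketch is our illustration, with deliberately minimal error handling; it recognizes the expressions and records each rule it applies, so a successful run yields the left parse.

```python
# Sketch: a recursive-descent recognizer for the non-left-recursive
# eight-rule CFG; one procedure per nonterminal, appending rule numbers.
class Parser:
    def __init__(self, tokens):
        self.toks, self.pos, self.parse = tokens, 0, []

    def ins(self):
        # current input symbol; ◃ marks the input end
        return self.toks[self.pos] if self.pos < len(self.toks) else "◃"

    def eat(self, t):
        if self.ins() != t:
            raise SyntaxError(f"expected {t}, got {self.ins()}")
        self.pos += 1

    def S(self):  # 1: S -> AC
        self.parse.append(1); self.A(); self.C()

    def C(self):
        if self.ins() == "∨":  # 2: C -> ∨AC
            self.parse.append(2); self.eat("∨"); self.A(); self.C()
        else:                  # 3: C -> ε
            self.parse.append(3)

    def A(self):  # 4: A -> BD
        self.parse.append(4); self.B(); self.D()

    def D(self):
        if self.ins() == "∧":  # 5: D -> ∧BD
            self.parse.append(5); self.eat("∧"); self.B(); self.D()
        else:                  # 6: D -> ε
            self.parse.append(6)

    def B(self):
        if self.ins() == "(":  # 7: B -> (S)
            self.parse.append(7); self.eat("("); self.S(); self.eat(")")
        else:                  # 8: B -> i
            self.parse.append(8); self.eat("i")

p = Parser(list("(i∨i)∧i"))
p.S()
p.eat("◃")        # accept only at the input end
print(p.parse)    # the left parse of (i∨i)∧i
```

Running it on (i ∨ i) ∧ i succeeds, while a malformed input such as i∨ raises a syntax error at the missing operand.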


Compare the three equivalent CFGs introduced in this example. Intuitively, we see that in syntax analysis, any grammatical ambiguity may represent an undesirable phenomenon, and if it does, we prefer the second CFG to the first. On the other hand, the definition of the first CFG is more succinct than that of the second because the first contains a single nonterminal and four rules while the second has three nonterminals and six rules. As already noted, some methods of syntax analysis necessitate using non-left-recursive CFGs, in which case we obviously use the third CFG, which has more nonterminals and rules than the other two CFGs in this example. Simply put, all three CFGs have their pros and cons.

Consequently, from a broader perspective, the previous example illustrates a typical process of designing an appropriate grammatical specification for a programming language in practice. Indeed, we often design several equivalent CFGs that generate the language, carefully consider their advantages and disadvantages, and based on this consideration, we choose the CFG that is optimal under the given circumstances.

In what follows, we make use of the CFGs from the previous example so often that we introduce Convention 7.2 for the sake of brevity.

Convention 7.2 Consider the CFGs defined in Example 7.1. Throughout the rest of this chapter, E, H, and J denote its first, second, and third CFG, respectively. That is,

I. E denotes 1: S → S ∨ S, 2: S → S ∧ S, 3: S → (S), 4: S → i;

II. H denotes 1: S → S ∨ A, 2: S → A, 3: A → A ∧ B, 4: A → B, 5: B → (S), 6: B → i;

III. J denotes 1: S → AC, 2: C → ∨AC, 3: C → ε, 4: A → BD, 5: D → ∧BD, 6: D → ε, 7: B → (S), 8: B → i.

7.1.2 Top-Down Parsing

Let I ∈ CFGΨ. Given w ∈ IΔ*, an I-based top-down parser works on w so it simulates a leftmost derivation of w in I. If there is no leftmost derivation of w in I, the parser rejects w because w ∉ L(I). On the other hand, if the parser successfully completes the simulation of IS lm⇒* w [ρ] in I, which means that w ∈ L(I), the parser accepts w to express that I generates w; in addition, it often produces the left parse ρ as output too.

Algorithm 7.3 transforms any CFG I to an equivalent PDA O that acts as an I-based top-down parser. In many respects, this transformation resembles Algorithm 6.52, reformulated in terms of Convention 7.1. Notice that O performs only the following two pushdown operations: popping and expanding.

I. If a terminal a occurs as the pushdown top symbol, O pops a off the pushdown by a rule of the form a⬦a → ⬦, by which O removes a from the pushdown top and, simultaneously, reads the input symbol a, so it actually verifies their identity.

II. If a nonterminal A occurs as the pushdown top symbol, O simulates a leftmost derivation step made by a rule r: A → X1X2…Xn ∈ IR, where each Xi ∈ IΣ, 1 ≤ i ≤ n, for some n ∈ 0ℕ (n = 0 means X1X2…Xn = ε). O performs this simulation so it expands its pushdown by using

A⬦ → reversal(X1X2…Xn)⬦

so it replaces the pushdown top A with Xn…X1. To describe an expansion like this more formally, consider IS lm⇒* w [ρ] in I, where w ∈ IΔ*, expressed as

IS lm⇒* vAy lm⇒ vX1X2…Xny lm⇒* vu

where w = vu, so v, u ∈ IΔ*. Suppose that O has just simulated the first portion of this derivation, IS lm⇒* vAy; at this point, the pushdown of O contains Ay in reverse, and u is the remaining input to be read. In symbols, the PDA O occurs in the configuration ▹reversal(y)A⬦u◃, from which it simulates vAy lm⇒ vX1X2…Xny [r] by performing

▹reversal(y)A⬦u◃ ⇒ ▹reversal(y)Xn…X1⬦u◃

according to the rule A⬦ → Xn…X1⬦ from the set of rules in O; notice that reversal(y)Xn…X1 = reversal(X1…Xny).
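The two operations above can be sketched directly as stack manipulations. In the following sketch of ours, O's pushdown is simulated for the CFG J on the input i ∧ i, with the expansion rules supplied as a known left parse; choosing these rules deterministically from the input is the subject of Section 7.2.

```python
# Sketch: popping and expanding on O's pushdown for the CFG J,
# driven by a given left parse rho.
J = {1: ("S", "AC"), 2: ("C", "∨AC"), 3: ("C", ""),
     4: ("A", "BD"), 5: ("D", "∧BD"), 6: ("D", ""),
     7: ("B", "(S)"), 8: ("B", "i")}
NONTERMINALS = {"S", "C", "A", "D", "B"}

def top_down_parse(rho, w):
    pd = ["▹", "S"]                    # pushdown; the top is pd[-1]
    inp = list(w) + ["◃"]              # input followed by the input end
    rho = list(rho)
    while pd[-1] != "▹":
        top = pd.pop()
        if top in NONTERMINALS:        # expansion: A⬦ -> reversal(x)⬦
            if not rho:
                return False           # parse exhausted too early
            lhs, rhs = J[rho.pop(0)]
            assert lhs == top          # the parse must fit the pd top
            pd.extend(reversed(rhs))
        elif top == inp[0]:            # popping: a⬦a -> ⬦
            inp.pop(0)
        else:
            return False               # terminal mismatch: reject
    return inp == ["◃"] and not rho    # accept in configuration ▹⬦◃

print(top_down_parse([1, 4, 8, 5, 8, 6, 3], "i∧i"))  # True
```

The left parse [1, 4, 8, 5, 8, 6, 3] corresponds to IS lm⇒* i∧i in J; supplying a wrong rule sequence makes the simulation reject.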

Algorithm 7.3 Top-Down Parser.

Input. I ∈ CFGΨ.

Output. O ∈ PDAΨ such that O works as an I-based top-down parser.

Method.