Models for Context-Free Languages | 10 | Formal Languages and Computat

ABSTRACT

Context-free grammars (CFGs) and pushdown automata (PDA), that is, two rudimental models for context-free languages (CFLs) represent special cases of rewriting systems, defined in Section 2�2� A CFG is a language-generating rewriting system� Each of its rewriting rules has a single symbol on its left-hand sides� By repeatedly applying these rules, the grammar generates sentences of its language� A PDA is a language-accepting rewriting system� In essence, consider the notion of a finite automaton (FA) (see Definition 3�1)� A PDA represents an FA extended by a potentially infinite pushdown list� During a move, according to one of its rules, it reads a symbol, changes the current state, and rewrites a string of symbols occurring on the pushdown top� If it reads the entire input string, empties the pushdown list and enters a final state, the automaton accepts the input string; the set of all accepted strings in this way is the language that the automaton accepts�

This chapter, consisting of Sections 6�1 through 6�3, introduces and studies CFGs and PDA� Section 6�1 defines general versions of CFGs, after which Section 6�2 presents several grammatical transformations that can be applied to these general versions to make their representation more convenient� Section 6�3 defines PDA and proves their equivalence with CFGs; in other words, it demonstrates that they both characterize the family of CFLs�

It is noteworthy that some notions, such as the notion of a CFG and the notion of a linear grammar (LG), were already introduced in Section 2�2 in an informal way� This chapter defines them fully rigorously�

6.1 Context-Free Grammars As already mentioned and illustrated in Section 2�2�2, a CFG represents a language-generating rewriting system based on finitely many terminal symbols, nonterminal symbols, and grammatical rules� Terminal symbols occur in the language generated by the grammar while nonterminal symbols do not� Each rule has a single nonterminal symbol on its left-hand side while its righthand side is a string, which may contain both terminal and nonterminal symbols� Starting from a special start symbol, the grammar repeatedly rewrites nonterminal symbols according to its rules

until it obtains a sentence, that is, a string that consists solely of terminal symbols� The set of all sentences represents the language generated by the grammar�

Definition 6.1 A context-free grammar (CFG) is a rewriting system G = (Σ, R), where

◾ Σ is divided into two disjoint subalphabets, denoted by N and Δ� ◾ R is a finite set of rules of the form A → x, where A ∈ N and x ∈ Σ*�

N and Δ are referred to as the alphabet of nonterminal symbols and the alphabet of terminal symbols, respectively� N contains a special start symbol, denoted by S�

If S ⇒* w, where w ∈ Σ*, G derives w, and w is a sentential form� F(G) denotes the set of all sentential forms derived by G� The language generated by G, symbolically denoted by L(G), is defined as L(G) = F(G) ∩ Δ*� Members of L(G) are called sentences� If S ⇒* w and w is a sentence, S ⇒* w is a successful derivation in G�

To explain Definition 6�1 less formally, consider a string xAy and a rule A → u ∈ R, where A ∈ N and x, y, u ∈ Σ*� By using A → u, G makes a derivation step from xAy to xuy so it changes A to u in xAy, symbolically written as xAy ⇒ xuy� If G makes a sequence of derivation steps from S to a string w ∈ Σ*, then w is a sentential form� Every sentential form over Δ is a sentence, and the set of all sentences is the language of G, denoted by L(G)� As a result, every sentence w satisfies S ⇒* w with w ∈ Δ*, so L(G) = {w ∈ Δ*| S ⇒* w}�

Since any CFG, G = (Σ, R), represents a rewriting system, all the notions introduced for rewriting systems are applicable to G as well (see Section 2�2)� That is, by u ⇒ v [r], where u, v ∈ Σ*, and r ∈ R, we say that G directly rewrites u to v by r or, as we customarily say in terms of CFGs, G makes a derivation step from u to v by r� Furthermore, to express that G makes u ⇒* w according to a sequence of rules, r1r2…rn, we write u ⇒* v [r1r2…rn], which is read as a derivation from u to v by using r1r2…rn� On the other hand, whenever the information regarding the applied rules is immaterial, we omit these rules� In other words, we often simplify u ⇒ v [r] and u ⇒* v [r1r2…rn] to u ⇒ v and u ⇒* v, respectively�

Convention 6.2 For any CFG G, we automatically assume that Σ, N, Δ, S, and R denote the total alphabet, the alphabet of nonterminal symbols, the alphabet of terminal symbols, the start symbol, and the set of rules, respectively� If there exists a danger of confusion, we mark Σ, N, Δ, S, and R with G as GΣ, GN, GΔ, GS, and GR, respectively, to clearly relate these components to G (in particular, we make these marks when several CFGs are simultaneously discussed)� For brevity, we often abbreviate nonterminal symbols and terminal symbols to nonterminals and terminals, respectively� If we want to express that a nonterminal A forms the left-hand side of a rule, we refer to this rule as an A-rule� If we want to specify that A is rewritten during a derivation step, xAy ⇒ xuy, we underline this A as xAy ⇒ xuy�

For brevity, G is often defined by simply listing its rules together with specifying its nonterminals and terminals, usually denoted by uppercases and lowercases, respectively�

We denote the family of CFGs by CFGΨ� We set CFGΦ = {L(G)| G ∈ CFGΨ}, and we refer to CFGΦ as the family of CFLs, so any language in CFGΦ is called a CFL�

Example 6�1 demonstrates how to determine the language generated by a CFG in a rigorous way� The CFG considered in the example is relatively simple; in fact, it represents an LG, in which

◾

every rule has no more than one occurrence of a nonterminal on its right-hand side as already stated in Example 2�7� As a result, every sentential form contains no more than one occurrence of a nonterminal, and this property makes the determination of the generated language relatively simple� Before giving the example, we repeat the notion of an LG (see Example 2�7) as a CFG in which the right-hand side of every rule contains no more than one occurrence of a nonterminal�

Definition 6.3 Let G = (Σ, R) be a CFG such that each A → x ∈ R satisfies x ∈ Δ*(N ∪ {ε})Δ*� Then, G is said to be a linear grammar (LG)�

Example 6.1 Consider L = {akbk| k ≥ 1}� In principle, we generate the strings from L by a CFG, G, so G derives sentential forms

S, aSb, aaSbb, aaaSbbb, …

in which G rewrites S with ab during the very last derivation step� Formally, G = (Σ, R), where Σ = {a, b, S} and R = {S → aSb, S → ab}� In Σ, Δ = {a, b} is the alphabet of terminals and N = {S} is the alphabet of nonterminals, where S is the start symbol of G� Under Convention 6�2, we can specify G simply as

1: S → aSb 2: S → ab

Consider aaSbb� By using rule 1, G rewrites S with aSb in this string, so aaSbb ⇒ aaaSbbb [1]� By using rule 2, G rewrites S with ab, so aaSbb ⇒ aaabbb [2]� By using the sequence of rules 112, G makes

aaSbb ⇒ aaaSbbb [1] ⇒ aaaaSbbbb [1] ⇒ aaaaabbbbb [2]

Briefly, we write aaSbb ⇒* aaaaabbbbb [112] or, even more simply, aaSbb ⇒* aaaaabbbbb� To verify that G generates {akbk| k ≥ 1}, recall that by using rule 1, G replaces S with aSb, and by

rule 2, G replaces S with ab� Consequently, every successful derivation has the form

which G makes according to a sequence of rules of the form 1k2, for some k ≥ 0� In symbols, S ⇒* ak+1bk+1 [1k2]� From these observations, we see L(G) = {akbk| k ≥ 1}; a detailed verification of this identity is left as an exercise�

The two-rule linear CFG considered in Example 6�1 generates every sentence by a unique derivation, so the analysis of its derivation process is rather easy� As a result, the determination of its generated language represents a simple task as well� In a general case, however, G ∈ CFGΨ may contain several occurrences of nonterminals on the right-hand sides of their rules and generate their sentences by a variety of different derivations� Under these circumstances, the determination of L(G) may be more complicated as shown in Example 6�2� This example also illustrates a typical two-phase approach to achieving complicated results: (1) first, a more general result is established then (2) as its straightforward consequence, the desired result is derived�

Example 6.2 Consider L from Example 6�1� Let K be the set of all permutations of strings in L� Equivalently speaking, K contains all nonempty strings consisting of an equal number of as and bs� Formally,

K = {w| w ∈ {a, b}+ and occur(w, a) = occur(w, b)}

In this example, we prove that K is generated by the CFG G defined as

1: S → aB, 2: S → bA, 3: A → a, 4: A → aS, 5: A → bAA, 6: B → b, 7: B → bS, 8: B → aBB

We first prove the following claim, which says something more than we actually need to establish K = L(G)� From this claim, we subsequently obtain K = L(G) as a straightforward consequence of this claim�

Claim� For all w ∈ {a, b}*, these three equivalences hold

I� S ⇒* w iff occur(a, w) = occur(b, w); II� A ⇒* w iff occur(a, w) = occur(b, w) + 1; III� B ⇒* w iff occur(b, w) = occur(a, w) + 1�

Proof� This claim is proved by induction on |w| ≥ 1�

Basis� Let |w| = 1�

I� From S, G generates no sentence of length one� On the other hand, no sentence of length one satisfies occur(a, w) = occur(b, w)� Thus, in this case, the basis holds vacuously�

II� Examine G to see that if A ⇒* w with |w| = 1, then w = a� For w = a, A ⇒* w [3]� Therefore, II holds in this case�

III� Prove III by analogy with the proof of II�

Consequently, the basis holds�

Induction Hypothesis� Assume that there exists a positive integer n ≥ 1 such that the claim holds for every w ∈ {a, b}* satisfying 1 ≤ |w| ≤ n�

Induction Step� Let w ∈ {a, b}* with |w| = n + 1�

Consider I in the claim� To prove its only-if part, consider any derivation of the form S ⇒* w [ρ], where ρ is a sequence of rules� This derivation starts from S� As only rules 1 and 2 have S on the left-hand side, express S ⇒* w [ρ] as S ⇒* w [rπ], where ρ = rπ and r ∈ {1, 2}�

i� If r = 1, S ⇒* w [1π], where 1: S → aB� At this point, w = av, and B ⇒* v [π], where |v| = n� By the induction hypothesis, III holds for v, so occur(b, v) = occur(a, v) + 1� Therefore, occur(a, w) = occur(b, w)�

ii� If r = 2, S ⇒* w [2π], where 2: S → bA� Thus, w = bv, and A ⇒* v [π], where |v| = n� By the induction hypothesis, II holds for v, so occur(a, v) = occur(b, v) + 1� As w = bv, occur(a, w) = occur(b, w)�

To prove the if part of I, suppose that occur(a, w) = occur(b, w)� Clearly, w = av or w = bv, for some v ∈ {a, b}* with |v| = n�

i� Let w = av� Then, |v| = n and occur(a, v) + 1 = occur(b, v)� As |v| = n, by the induction hypothesis, we have B ⇒* v iff occur(b, v) = occur(a, v) + 1 from III� By using 1: S → aB, we obtain S ⇒ aB [1]� Putting S ⇒ aB and B ⇒* v together, we have S ⇒ aB ⇒* av, so S ⇒* w because w = av�

ii� Let w = bv� Then, |v| = n and occur(a, v) = occur(b, v) + 1� By the induction hypothesis, we have A ⇒* v iff occur(a, v) = occur(b, v) + 1 (see II)� By 2: S → bA, G makes S ⇒ bA� Thus, S ⇒ bA and A ⇒* v, so S ⇒* w�

◾

Take II in the claim� To prove its only-if part, consider any derivation of the form A ⇒* w [ρ], where ρ is a sequence of rules in G� Express A ⇒* w [ρ] as A ⇒* w [rπ], where ρ = rπ and r ∈ {3, 4, 5} because rules 3: A → a, 4: A → aS, and 5: A → bAA are all the A-rules in G�

i� If r = 3, A ⇒* w [rπ] is a one-step derivation A ⇒ a [3], so w = a, which satisfies occur(a, w) = occur(b, w) + 1�

ii� If r = 4, A ⇒* w [4π], where 4: A → aS� Thus, w = av, and S ⇒* v [π], where |v| = n� By the induction hypothesis, from I, occur(a, v) = occur(b, v), so occur(a, w) = occur(b, w) + 1�

iii� If r = 5, A ⇒* w [5π], where 5: A → bAA� Thus, w = buv, A ⇒* u, A ⇒* v, where |u| ≤ n, |v| ≤ n� By the induction hypothesis, from II, occur(a, u) = occur(b, u) + 1 and occur(a, v) = occur(b, v) + 1, so occur(a, uv) = occur(b, uv) + 2� Notice that occur(b, uv) = occur(b, w) – 1 implies occur(a, uv) – 2 = occur(b, w) – 1� Furthermore, from occur(a, uv) – 2 = occur(b, w) – 1, it follows that occur(a, uv) = occur(a, w), so occur(a, w) = occur(b, w) + 1�

To prove its if part of II, suppose that occur(a, w) = occur(b, w) + 1� Obviously, w = av or w = bv, for some v ∈ {a, b}* with |v| = n�

i� Let w = av� At this point, |v| = n and occur(a, v) = occur(b, v)� As |v| = n, by the induction hypothesis, we have S ⇒* v� By using 4: A → aS, A ⇒ aS [4]� Putting A ⇒ aS and S ⇒* v together, we obtain A ⇒ aS ⇒* av, so A ⇒* w because w = av�

ii� Let w = bv� At this point, |v| = n and occur(a, v) = occur(b, v) + 2� Express v as v = uz so that occur(a, u) = occur(b, u) + 1 and occur(a, z) = occur(b, z) + 1; as an exercise, we leave a proof that occur(a, v) = occur(b, v) + 2 implies that v can always be expressed in this way� Since |v| = n, |u| ≤ n ≥ |z|� Thus, by the induction hypothesis (see II), we have A ⇒* u and A ⇒* z� By using 5: A → bAA, A ⇒ bAA [5]� Putting A ⇒ bAA, A ⇒* u, and A ⇒* z together, we obtain A ⇒ bAA ⇒* buz, so A ⇒* w because w = bv = buz�

Prove III by analogy with the proof of the inductive step of II, given earlier�

Having established this claim, we easily obtain the desired equation L(G) = {w| w ∈ {a, b}+ and occur(w, a) = occur(w, b)} as a consequence of I� Indeed, this equivalence says that for all w ∈ {a, b}*, S ⇒* w iff occur(a, w) = occur(b, w)� Consequently, w ∈ L(G) iff occur(a, w) = occur(b, w)� As G has no ε-rules, ε ∉ L(G), so L(G) = {w| w ∈ {a, b}+ and occur(w, a) = occur(w, b)}�

6.2 Restricted Context-Free Grammars In this section, we introduce several reasonable restrictions placed on CFGs to simplify their investigation in theory and use in practice� We struggle to introduce all the restricted versions so they are as powerful as their general versions, defined in Section 6�1�

Section 6�2�1 reduces the derivation multiplicity of CFGs by restricting its attention only to canonical derivations and derivation trees� This section also explores the phenomenon of ambiguity in CFGs and their languages, and it points out the existence of CFLs that are always generated in an ambiguous way� Section 6�2�2 explains how to remove all redundant symbols from grammars in CFGΨ� Section 6�2�3 describes how to turn any CFG to an equivalent CFG in which no rule has ε as its right-hand side� Section 6�2�4 transforms any CFG to an equivalent CFG in which a single nonterminal does not form the right-hand side of any rule� Continuing with the same topic in a more restrictive way, Section 6�2�5 describes how to convert any grammar in CFGΨ to an equivalent Chomsky normal form grammar in which every rule has on its right-hand side a terminal or two nonterminals� Section 6�2�6 discusses the grammatical phenomenon of left recursion, which causes the grammars to go into an infinite loop, and explains how to remove

this phenomenon� Finally, Section 6�2�7 describes how to transform any grammar in CFGΨ to an equivalent Greibach normal form grammar in which every rule has on its right-hand side a terminal followed by zero or more nonterminals�

6.2.1 Canonical Derivations and Derivation Trees As illustrated by Example 6�2, in a general case, G ∈ CFGΨ may generate the same sentence by many different derivations, and this derivation multiplicity obviously complicates the discussion of G and L(G) in both theory and practice� To reduce this derivation multiplicity, we first introduce two special types of canonical derivations-namely, leftmost derivations and rightmost derivations-and demonstrate that x ∈ L(G) iff G generates x by either of these two canonical derivations� In addition, in terms of graph theory, we simplify the discussion concerning grammatical derivations by derivation trees, which represent derivations by graphically displaying rules but suppressing the order of their applications�

During this section, the reader should carefully keep in mind that we frequently and automatically make use of Convention 6�2�

A derivation is leftmost if during its every single derivation step, the leftmost occurrence of a nonterminal is rewritten in the sentential form�

Definition 6.4 Let G = (Σ, R) be a CFG�

I� Let r: A → z ∈ R, t ∈ Δ*, o ∈ Σ*� Then, G makes the leftmost derivation step from tAo to tzo according to r, symbolically written as tAo lm⇒ tzo [r]�

II� Leftmost derivations in G are defined recursively as follows: i� For all u ∈ Σ*, then G makes the leftmost derivation from u to u according to ε, symboli-

cally written u lm⇒* u [ε]� ii� If u, w, v ∈ Σ*, ρ = σr, σ ∈ R*, r ∈ R, u lm⇒* w [σ] and w lm⇒ v [r] in G, then G makes the

leftmost derivation from u to v according to ρ, symbolically written as u lm⇒* v [ρ] in G�

To point out the crucial parts of Definition 6�4, notice that in I, A is the leftmost occurrence of a nonterminal in the rewritten string tAo� According to II, u lm⇒* u [ε] in G, for every u ∈ Σ*� If ρ = r1r2…rn, where rj ∈ R, for some n ∈ ℕ, and there are w0, w1, …, wn ∈ Σ* such that wj−1lm⇒ wj [rj] in G (see I), for all 1 ≤ j ≤ n, then G makes the leftmost derivation from w0 to wn according to ρ, w0lm⇒* wn [ρ]� If ρ represents an immaterial piece of information, we omit it and simplify w0lm⇒* wn [ρ] to w0lm⇒* wn� It is worth noting that apart from w0lm⇒* wn [ρ], there may exist σ ∈ R* such that ρ ≠ σ and w0lm⇒* wn [σ] in G as well� In fact, in a general case, G can make w0lm⇒* wn according to several different sequences of rules from R*�

Next, we demonstrate that every CFG can generate each sentence by a leftmost derivation�

Theorem 6.5 Let G ∈ CFGΨ� Then, w ∈ L(G) iff S lm⇒* w in G�

Proof. The if part of the proof says that S lm⇒* w implies w ∈ L(G), for every w ∈ Δ*� As S lm⇒* w is a special case of a derivation from S to w, this implication surely holds� Therefore, we need only to prove the only-if part that says that w ∈ L(G) implies S lm⇒* w in G� This implication straightforwardly follows from the next claim�

◾

Claim� For every w ∈ L(G), S ⇒n w implies S lm⇒n w, for all n ≥ 0�

Proof. (by induction on n ≥ 0)�

Basis� For n = 0, this implication is trivial�

Induction Hypothesis� Assume that there exists an integer n ≥ 0 such that the claim holds for all derivations of length n or less�

Induction Step� Let S ⇒n+1 w [ρ], where w ∈ L(G), ρ ∈ R+, and |ρ| = n + 1� If S ⇒n+1 w [ρ] is leftmost, the induction step is completed� Assume that this derivation is not leftmost� Express S ⇒n+1 w [ρ] as

S lm⇒* uAvBx [σ] ⇒ uAvyx [r: B → y] ⇒* w [θ]

where σ, θ ∈ R*, ρ = σrθ, r: B → y ∈ R, u ∈ prefixes(w), A ∈ N, and v, x, y ∈ Σ*� In other words, S lm⇒* uAvBx is the longest leftmost derivation that begins S ⇒n+1 w� As w ∈ L(G) and L(G) ⊆ Δ* (see Definition 6�1), w ∈ Δ*� Thus, A ∉ symbols(w) because A ∈ N� Hence, A is surely rewritten during uAvyx ⇒* w� Express S ⇒n+1 w as

S lm⇒* uAvBx [σ] ⇒ uAvyx [r: B → y] ⇒* uAz [π] lm⇒ utz [p: A → t] ⇒* w [ο]

where π, ο ∈ R*, θ = πpο, p: A → t ∈ R, vyx ⇒* z, z ∈ Σ*� Rearrange this derivation so the derivation step according to p is made right after the initial part S lm⇒* uAvBx [σ]; more formally written,

S lm⇒* uAvBx [σ] lm⇒ utvBx [p: A → t] ⇒ utvyx [r: B → y] ⇒* utz [π] ⇒* w [ο]

The resulting derivation S ⇒* w [σprπο] begins with at least |σp| leftmost steps, so its leftmost beginning is definitely longer than the leftmost beginning of the original derivation S ⇒n+1 w [ρ]� If S ⇒* w [σprπο] is leftmost, the induction step is completed� If not, apply the derivation rearrangement described above to S ⇒* w [σprπο]� After no more than n − 2 repetitions of this rearrangement, we necessarily obtain S lm⇒* w, which completes the induction step, so the proof of the claim is completed�

By this claim, we see that w ∈ L(G) implies S lm⇒* w in G, so the theorem holds�

We might naturally ask whether for every G ∈ CFGΨ, G = (Σ, R), S ⇒* w iff S lm⇒* w, for all w ∈ Σ*� To rephrase this question less formally, we might ask whether Theorem 6�5 can be generalized in terms of all sentential forms, not just sentences� Surprisingly, the answer is no� Indeed, to give a trivial counterexample, consider a two-rule CFG defined as S → AA and A → a� Observe that this grammar makes S ⇒ AA ⇒ Aa; however, there is no leftmost derivation of Aa in G�

As their name indicates, in rightmost derivations, the rightmost occurrence of a nonterminal is rewritten during its every derivation step�

Definition 6.6 Let G = (Σ, R) be a CFG�

I� Let r: A → z ∈ R, t ∈ Δ*, o ∈ Σ*� Then, G makes the rightmost derivation step from oAt to ozt according to r, symbolically written as oAt rm⇒ ozt [r]�

II� Rightmost derivations in G are defined recursively as follows: i� For all u ∈ Σ*, then G makes the rightmost derivation from u to u according to ε, symboli-

cally written u rm⇒* u [ε]� ii� If u, w, v ∈ Σ*, ρ = σr, σ ∈ R*, r ∈ R, u rm⇒* w [σ] and w rm⇒ v [r] in G, then G makes the

rightmost derivation from u to v according to ρ, symbolically written as u rm⇒* v [ρ] in G�

Let u, v ∈ Σ*, ρ ∈ R*, and u rm⇒* v [ρ] in G� If ρ represents an immaterial piece of information, we usually simplify u rm⇒* v [ρ] to u rm⇒* v� Of course, in a general case, G can make u rm⇒* v according to several different sequences of rules from R*�

As an exercise, by analogy with Theorem 6�5, prove Theorem 6�7�

Theorem 6.7 Let G ∈ CFGΨ� Then, w ∈ L(G) iff S rm⇒* w�

Apart from the canonical derivations, we often simplify the discussion concerning grammatical derivations graphically by derivation trees, composed of rule trees� As their names indicate, rule trees describe grammatical rules while derivation trees represent derivations by specifying the rules, expressed as rule trees, according to which they are made together with the nonterminals to which the rules are applied� On the other hand, derivation trees suppress the order of the applications of the rules, so we make use of these trees when this order is immaterial�

A derivation tree in a CFG, G = (Σ, R), is any tree such that its root is from Σ, and for each of its elementary subtrees (see Section 1�4), e, there is a rule, l ∈ R, such that e represents l (see I in Definition 6�8)� Apart from the notion of a derivation tree, the following definition also specifies its correspondence to the derivation it represents� Let us note that this definition makes use of the terminology concerning trees introduced in Section 1�4�

Definition 6.8 Let G = (Σ, R) be a CFG�

I� For l: A → x ∈ R, A〈x〉 is the rule tree that represents l� II� The derivation trees representing derivations in G are defined recursively as follows: i� One-node tree X is the derivation tree corresponding to X ⇒0 X in G, where X ∈ Σ� ii� Let d be the derivation tree representing A ⇒* uBv [ρ] with frontier(d ) = uBv, and let l:

B → z ∈ R� The derivation tree that represents

A ⇒* uBv [ρ] ⇒ uzv [l]

is obtained by replacing the (|u|+1)st leaf in d, B, with the rule tree corresponding to l, B〈z〉�

III� A derivation tree in G is any tree t for which there is a derivation represented by t (see II)�

◾

Convention 6.9 Let G = (Σ, R) be a CFG� For any l: A → x ∈ R, G♣(l) denotes the rule tree corresponding to l� For any A ⇒* x [ρ] in G, where A ∈ N, x ∈ Σ*, and ρ ∈ R*, G♣(A ⇒* x [ρ]) denotes the derivation tree corresponding to A ⇒* x [ρ]� Just like we often write A ⇒* x instead of A ⇒* x [ρ] (see Convention 6�2), we sometimes simplify G♣(A ⇒* x [ρ]) to G♣(A ⇒* x) in what follows if there is no danger of confusion� Finally, G♣all denotes the set of all derivation trees for G�

Theorem 6.10 Let G ∈ CFGΨ, G = (Σ, R), A ∈ N, and x ∈ Σ*� Then, A ⇒* x in G iff t ∈ G♣all with root(t) = A and frontier(t) = x�

Proof� Consider any CFG, G = (Σ, R)� The only-if part of the equivalence says that for every derivation A ⇒* x, where A ∈ N and x ∈ Σ*, there exists t ∈ G♣all such that root(t) = A and frontier(t) = x� From Definition 6�8, we know how to construct G♣(A ⇒* x), which satisfies these properties�

The if part says that for every t ∈ G♣all with root(t) = A and frontier(t) = x, where A ∈ N and x ∈ Σ*, there exists A ⇒* x in G� We prove the if part by induction on depth(t) ≥ 0�

Basis� Consider any t ∈ G♣all such that depth(t) = 0� As depth(t) = 0, t is a tree consisting of one node, so root(t) = frontier(t) = A, where A ∈ N� Observe that A ⇒0 A in G; therefore, the basis holds�

Induction Hypothesis� Suppose that the if part holds for all trees of depth n or less, where n ∈ 0ℕ�

Induction Step� Consider any t ∈ G♣all with depth(t) = n + 1, root(t) = A, frontier(t) = x, A ∈ N, x ∈ Σ*� Consider the topmost rule tree, G♣(p), occurring in t� That is, G♣(p) is the rule tree whose root coincides with root(t)� Let p: A → u ∈ R� Distinguish these two cases-(a) u = ε and (b) u ≠ ε�

a� If u = ε, t has actually the form A〈〉, which means u = ε and depth(t) = 1, and at this point, A ⇒ ε [p], so the induction step is completed�

b� Assume u ≠ ε� Let u = X1X2…Xm, where Xi ∈ Σ, 1 ≤ i ≤ m, for some m ≥ 1� Thus, t is of the form A〈t1t2…tm〉, where each ti is in G♣all and satisfies root(ti) = Xi, 1 ≤ i ≤ m, with depth(ti) ≤ n� Let frontier(ti) = yi, where yi ∈ Σ*, so x = y1y2…ym� As depth(ti) ≤ n, by the induction hypothesis, Xi ⇒* yi in G, 1 ≤ i ≤ m� Since A → u ∈ R with u = X1X2…Xm, we have A ⇒ X1X2…Xm� Putting together A ⇒ X1X2…Xm and Xi ⇒* yi for all 1 ≤ i ≤ m, we obtain

A ⇒ X1X2…Xm ⇒* y1X2…Xm ⇒* y1y2…Xm ⋮ ⇒* y1y2…ym

Thus, A ⇒* x in G, the induction step is completed, and the if part of the equivalence holds true�

Corollary 6.11 Let G ∈ CFGΨ� Then, w ∈ L(G) iff G♣all contains t such that root(t) = S and frontier(t) = w�

Proof� This corollary follows from Theorem 6�10 for S ⇒* w with w ∈ GΔ*�

Theorem 6�10 and Corollary 6�11 imply the following important Corollary 6�12, which says that without any loss of generality, we can always restrict our attention to the canonical derivations or derivation trees when discussing the language generated by CFGs�

Corollary 6.12 For every G ∈ CFGΨ, I through III, as follows, coincide with L(G) = {w ∈ GΔ*| S ⇒* w}�

I� {w ∈ GΔ*| S lm⇒* w}; II� {w ∈ GΔ*| S rm⇒* w}; III� {w ∈ GΔ*| w = frontier(t), where t ∈ G♣all with root(t) = S}�

Unfortunately, even if we reduce our attention only to canonical derivations or derivation trees, we may still face a derivation multiplicity of some sentences� Indeed, some CFGs make several different canonical derivations of the same sentences; even worse, some languages in CFGΦ are generated only by CFGs of this kind�

Definition 6.13 Let G ∈ CFGΨ�

I� G ∈ CFGΨ is ambiguous if L(G) contains a sentence w such that S lm⇒* w [ρ] and S lm⇒* w [σ] for some ρ, σ ∈ R* with ρ ≠ σ; otherwise, G is unambiguous�

II� L ∈ CFGΦ is inherently ambiguous if every G ∈ CFGΨ such that L(G) = L is ambiguous�

Less formally, according to I in Definition 6�13, G is ambiguous if it generates a sentence by two different leftmost derivations� To rephrase I in terms of rightmost derivations, G is ambiguous if there exist S rm⇒* w [ρ] and S rm⇒* w [σ] with ρ ≠ σ, for some w ∈ L(G)� In terms of G♣all (see Convention 6�9), G is ambiguous if G♣all contains t and u such that t ≠ u while frontier(t) = frontier(u)�

We close this section by illustrating its key notions-canonical derivations, rule trees, derivation trees, and ambiguity�

Example 6.3 Consider

Z = {aib jck| i, j, k ∈ ℕ, i = j or j = k}

Observe that Z = L(G), where G ∈ CFGΨ is defined by the following 10 rules:

→ → → → → → → → → →

0 : ,1: , 2 : , 3 : , 4 : , 5 : , 6 : , 7 : , 8 : , 9 :

S AB A aAb A ab B cB B c S CD C aC C a D bDc D bc

Indeed, G uses rules 0 through 4 to generate {aib jck| i, j, k ∈ ℕ, i = j}� By using rules 5 through 9, it generates {aib jck| i, j, k ∈ ℕ, j = k}� As the union of these two languages coincides with Z, L(G) = Z�

Notice that G can generate every sentence by a variety of different derivations� For instance, consider aabbcc ∈ L(G)� Observe that G generates this sentence by the twelve different derivations, I through XII, listed in Figure 6�1 (according to Convention 6�2, we specify the rewritten symbols by underlining)�

Figure 6�2 describes the rule trees G♣(0) through G♣(9) corresponding to the 10 rules in G� In addition, G♣(0) is pictorially shown in Figure 6�3�

Consider, for instance, the first derivation in Figure 6�1� Figure 6�4 presents this derivation together with its corresponding derivation tree constructed in a step-by-step way� In addition, the resulting derivation tree, S〈A〈aA〈ab〉b〉B〈cB〈c〉〉〉, is pictorially shown in Figure 6�5�

In Figure 6�1, derivations I and VII represent two different leftmost derivations that generate aabbcc� Thus, G is an ambiguous CFG�

◾

Figure 6.2 G♣(0) through G♣(9).