Formal Languages and Rewriting Systems | 6

ABSTRACT

The purpose of this chapter is threefold� First, Section 2�1 introduces the terminology concerning formal languages� Second, Section 2�2 introduces rewriting systems� Then, based upon these systems, in an intuitive and preliminary way, it outlines major topics discussed later in this book in a more rigorous and thorough way� Third, Section 2�3 gives a synopsis of this book�

2.1 Formal Languages An alphabet Σ is a finite nonempty set, whose members are called symbols� Any nonempty subset of Σ is a subalphabet of Σ� A finite sequence of symbols from Σ is a string over Σ; specifically, ε is referred to as the empty string-that is, the string consisting of zero symbols� By Σ*, we denote the set of all strings over Σ; Σ+ = Σ* − {ε}� Let x ∈ Σ*� Like for any sequence, |x| denotes the length of x-that is, the number of symbols in x� For any a ∈ Σ, occur(x, a) denotes the number of occurrences of as in x, so occur(x, a) always satisfies 0 ≤ occur(x, a) ≤ |x|� Furthermore, if x ≠ ε, symbol(x, i) denotes the ith symbol in x, where i = 1, …, |x|� Any subset L ⊆ Σ* is a formal language or, briefly, a language over Σ� Set symbol(L, i) = {a| a = symbol(x, i), x ∈ L − {ε}, 1 ≤ i ≤ |x|}� Any subset of L is a sublanguage of L� If L represents a finite set of strings, L is a finite language; otherwise, L is an infinite language� For instance, Σ*, which is called the universal language over Σ, is an infinite language while ∅ and {ε} are finite; noteworthy, ∅ ≠ {ε} because card(∅) = 0 ≠ card({ε}) = 1� Sets whose members are languages are called families of languages�

Example 2.1 The English alphabet, consisting of 26 letters, illustrates the definition of an alphabet as stated earlier, except that we refer to its members as symbols in this book� Our definition of a language includes all common artificial and natural languages� For instance, programming languages represent formal languages in terms of this definition, and so do English, Navaho, and Japanese� Any family of natural languages, including Indo-European, Sino-Tibetan, Niger-Congo, Afro-Asiatic, Altaic, and Japonic families of languages, is a language family according to the definition�

Convention 2.1 In strings, for brevity, we simply juxtapose the symbols and omit the parentheses and all separating commas� That is, we write a1a2…an instead of (a1, a2, …, an)�

Let finΦ and infinΦ denote the families of finite and infinite languages, respectively� Let allΦ denote the family of all languages; in other words, allΦ = finΦ ∪ infinΦ�

Operations� Let x, y ∈ Σ* be two strings over an alphabet Σ, and let L, K ⊆ Σ* be two languages over Σ� As languages are defined as sets, all set operations apply to them� Specifically, L ∪ K, L ∩ K, and L – K denote the union, intersection, and difference of languages L and K, respectively� Perhaps most importantly, the concatenation of x with y, denoted by xy, is the string obtained by appending y to x� Notice that for every w ∈ Σ*, wε = εw = w� The concatenation of L and K, denoted by LK, is defined as LK = {xy| x ∈ L, y ∈ K }�

Apart from binary operations, we also make some unary operations with strings and languages� Let x ∈ Σ* and L ⊆ Σ*� The complement of L is denoted by ~L and defined as ~L = Σ* − L� The reversal of x, denoted by reversal(x), is x written in the reverse order, and the reversal of L, reversal(L), is defined as reversal(L) = {reversal(x)| x ∈ L}� For all i ≥ 0, the ith power of x, denoted by x i, is recursively defined as (1) x0 = ε, and (2) x i = xx i−1, for i ≥ 1� Observe that this definition is based on the recursive definitional method� To demonstrate the recursive aspect, consider, for instance, the ith power of x i with i = 3� By the second part of the definition, x3 = xx2� By applying the second part to x2 again, x2 = xx1� By another application of this part to x1, x1 = xx0� By the first part of this definition, x0 = ε� Thus, x1 = xx0 = xε = x� Hence, x2 = xx1 = xx� Finally, x3 = xx2 = xxx� By using this recursive method, we frequently introduce new notions, including the ith power of L, Li, which is defined as (1) L0 = {ε} and (2) Li = LLi−1, for i ≥ 1� The closure of L, L*, is defined as L* = L0 ∪ L1 ∪ L2 ∪ …, and the positive closure of L, L+, is defined as L+ = L1 ∪ L2 ∪ … � Notice that L+ = LL* = L*L, and L* = L+ ∪ {ε}� Let w, x, y, z ∈ Σ*� If xz = y, then x is a prefix of y; if in addition, x ∉ {ε, y}, x is a proper prefix of y� By prefixes(y), we denote the set of all prefixes of y� Set prefixes(L) = {x| x ∈ prefixes(y) for some y ∈ L}� For i = 0, …, |y|, prefix(y, i) denotes y’s prefix of length i; notice that prefix(y, 0) = ε and prefix(y, |y|) = y� If zx = y, x is a suffix of y; if in addition, x ∉ {ε, y}, x is a proper suffix of y� By suffixes(y), we denote the set of all suffixes of y� Set suffixes(L) = {x| x ∈ suffixes(y) for some y ∈ L}� For i = 0, …, |y|, suffix(y, i) denotes y’s suffix of length i� If wxz = y, x is a substring of y; if in addition, x ∉ {ε, y}, x is a proper substring of y� By substrings(y), we denote the set of all substrings of y� Observe that for all v ∈ Σ*, prefixes(v) ⊆ substrings(v), suffixes(v) ⊆ substrings(v), and {ε, v} ∈ prefixes(v) ∩ suffixes(v) ∩ substrings(v)� Set symbols(y) = {a| a ∈ substrings(y), |a| = 1}� Furthermore, set substrings(L) = {x| x ∈ substrings(y) for some y ∈ L} and symbols(L) = {a| a ∈ symbols(y) for some y ∈ L}�

Example 2.2 Consider the alphabet {0, 1}� For instance, ε, 1, and 010 are strings over {0, 1}� Notice that |ε| = 0, |1| = 1, and |010| = 3� The concatenation of 1 and 010 is 1010� The third power of 1010 equals 101010101010� Observe that reversal(1010) = 0101� We have prefixes(1010) = {ε, 1, 10, 101, 1010}, where 1, 10, and 101 are proper prefixes of 1010 while ε and 1010 are not� We have suffixes(1010) = {ε, 0, 10, 010, 1010}, substrings(1010) = {ε, 0, 1, 01, 10, 010, 101, 1010}, and symbols(1010) = {0, 1}�

Set K = {0, 01} and L = {1, 01}� Observe that L ∪ K, L ∩ K, and L – K are equal to {0, 1, 01}, {01}, and {1}, respectively� The concatenation of K and L is KL = {01, 001, 011, 0101}� For L, ~L = Σ* − L, so ~L contains all strings in {0, 1}* but 1 and 01� Furthermore, reversal(L) = {1, 10} and L2 = {11, 101, 011, 0101}� The strings in L* that consists of four or fewer symbols are ε, 1, 01, 11, 011, 101, 111, 0101, 0111, 1011, 1101, and 1111� L+ = L* − {ε}� Notice that prefixes(L) = {ε, 1, 0, 01}, suffixes(L) = {ε, 1, 01}, substrings(L) = {ε, 0, 1, 01}, and symbols(L) = {0, 1}�

◾

Let T and U be two alphabets� A total function τ from T * to power (U *) such that τ(uv) = τ(u) τ(v) for every u, v ∈ T * is a substitution from T * to U *� By this definition, τ(ε) = {ε} and τ(a1a2… an) = τ(a1)τ(a2)…τ(an), where ai ∈ T, 1 ≤ i ≤ n, for some n ≥ 1, so τ is completely specified by defining τ(a) for every a ∈ T� A total function υ from T * to U * such that υ(uv) = υ(u)υ(v) for every u, v ∈ T * is a homomorphism from T * to U *� As any homomorphism is obviously a special case of a substitution, we simply specify υ by defining υ(a) for every a ∈ T; if υ(a) ≠ ε for all a ∈ T, υ is said to be an ε-free homomorphism� It is worth noting that a homomorphism from T * to U * may not represent an injection from T * to U * as illustrated in Example 2�3�

Example 2.3 Let EnglishΔ denote the English alphabet� The Morse code, denoted by µ, can be seen as a homomorphism from EnglishΔ* to {·, −}* (see Figure 2�1)� For instance,

µ(SOS) = · · · – – – · · ·

Notice that µ is no injection from English∆* to {⋅, −}*; for instance, µ(SOS) = µ(IJS)�

We conclude this section by Example 2�4, which demonstrates how to represent nonnegative integers by strings in a very simple way� More specifically, it introduces function unary, which represents all nonnegative integers by strings consisting of as� Later in this book, especially in Section IV, we frequently make use of unary�

Example 2.4 Let a be a symbol� To represent nonnegative integers by strings over {a}, define the total function unary from 0ℕ to {a}* as unary(i) = ai, for all i ≥ 0. For instance, unary(0) = ε, unary(2) = aa, and unary(1000000) = a1000000�

2.2 Rewriting Systems Rewriting systems, defined and discussed in this section, are used in many mathematical and computer science areas, ranging from purely theoretically oriented areas, such as the investigation of computational principals in terms of logic, up to quite pragmatically oriented areas, such as the construction of translators� Considering the subject of this book, it comes as no surprise that we primarily make use of rewriting systems as language-defining models and computational models�

This section is divided into three subsections� Section 2�2�1 introduces rewriting systems in general� Section 2�2�2 treats them as language-defining models� Finally, Section 2�2�3 formalizes the intuitive notion of a procedure by them�

Figure 2.1 Morse code.