ABSTRACT

Based on years of instruction and field expertise, this volume offers the necessary tools to understand all scientific, computational, and technological aspects of speech processing. The book emphasizes mathematical abstraction, the dynamics of the speech process, and the engineering optimization practices that promote effective problem solving in this area of research, and it draws on many years of the authors' own research on speech processing. Speech Processing builds valuable analytical skills for meeting future scientific and technological challenges in the field and considers the complex transition from human speech processing to computer speech processing.

part 1|2 pages

Part I ANALYTICAL BACKGROUND AND TECHNIQUES

chapter |5 pages

$$X_s(\omega) = \frac{1}{T_0}\sum_{k=-\infty}^{\infty} X\!\Big(\omega - \frac{2\pi k}{T_0}\Big), \qquad -\infty < \omega < \infty, \qquad (1.4)$$

where each replica is shifted horizontally by an integer multiple of the angular sampling frequency $\omega_0 = 2\pi/T_0$. Proof: Using the inverse Fourier transform, Eq. 1.1.2, at a sampling time $t = nT_0$, we have

$$x[n] = x(nT_0) = \frac{1}{2\pi}\int_{-\infty}^{\infty} X(\omega)\, e^{j\omega n T_0}\, d\omega.$$

This integral can be decomposed into an infinite sum of integrals, each over an interval of length $2\pi/T_0$, to give

$$x[n] = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty}\int_{(2k-1)\pi/T_0}^{(2k+1)\pi/T_0} X(\omega)\, e^{j\omega n T_0}\, d\omega.$$

Using the variable substitution $\tilde{\omega} = \omega - 2\pi k/T_0$, this becomes
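The spectral replication at multiples of the sampling frequency can be checked numerically. The following is a minimal sketch (assuming NumPy is available; the sampling rate and tone frequencies are illustrative choices, not values from the book): two analog tones that differ by exactly one sampling frequency yield identical sample sequences.

```python
import numpy as np

fs = 8_000.0          # sampling rate in Hz, so omega_0 = 2*pi*fs in rad/s
T0 = 1.0 / fs
n = np.arange(32)

f1 = 1_000.0          # a baseband tone
f2 = f1 + fs          # the same tone shifted by one full sampling frequency
x1 = np.cos(2 * np.pi * f1 * n * T0)
x2 = np.cos(2 * np.pi * f2 * n * T0)

# The two sample sequences are numerically identical: the spectrum of the
# sampled signal repeats at integer multiples of the sampling frequency.
print(np.max(np.abs(x1 - x2)))   # ~1e-12
```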

chapter |1 pages

$$X_s(\omega) = \frac{1}{T}\sum_{k=-\infty}^{\infty} X\big(\omega - 2\pi k(\omega_h - \omega_l)\big).$$

Using the definition of the band-pass signal, Eq. 1.6, which specifies only a narrow frequency range, of width $\omega_h - \omega_l$, over which the spectrum is non-zero, we see that, in the above, $X\big(\omega - 2\pi k(\omega_h - \omega_l)\big)$ is non-zero only in the range $\omega_l < |\omega - 2\pi k(\omega_h - \omega_l)| < \omega_h$.

chapter 1|13 pages

2 Discrete-Time Systems and z-Transforms

chapter |5 pages

$$\cdots\,, \qquad 0 \le n \le N - 1. \qquad (1.24)$$

chapter 2|24 pages

Analysis of Discrete-Time Speech Signals

chapter |10 pages

chapter 2|2 pages

7 Summary

chapter 3|7 pages

Probability and Random Processes

chapter |3 pages

$$p(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\; x^{\alpha-1}(1 - x)^{\beta-1}, \qquad 0 \le x \le 1.$$

The following summary statistics can be easily verified: the mean is $\alpha/(\alpha + \beta)$ and the variance is $\alpha\beta/\big[(\alpha + \beta)^2(\alpha + \beta + 1)\big]$. The beta distribution can be generalized to its multivariate counterpart, called the Dirichlet distribution, which has found useful applications in speech processing (Chapter 13). A vector-valued random variable, $\mathbf{x} = (x_1, \ldots, x_K)^T$, has a Dirichlet
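As a quick numerical check of these moments (not from the book; a minimal sketch assuming NumPy is available, with illustrative parameter values $\alpha = 2$, $\beta = 5$), Monte Carlo estimates can be compared against the closed-form mean and variance, and a Dirichlet draw can be inspected to confirm that its components form a probability vector:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 5.0

# Closed-form moments of the Beta(alpha, beta) distribution
mean = alpha / (alpha + beta)                                    # ~0.286
var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))  # ~0.0255

# Monte Carlo estimates for comparison
samples = rng.beta(alpha, beta, size=200_000)
print(mean, samples.mean())
print(var, samples.var())

# A Dirichlet draw: a length-K vector of non-negative components summing to 1
x = rng.dirichlet([2.0, 3.0, 5.0])
print(x, x.sum())
```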

chapter 3|7 pages

2 Conditioning, Total Probability Theorem, and Bayes' Rule

The notion of conditioning has a fundamental importance in probability theory, statistics, and their engineering applications, including speech processing. We first discuss several key concepts related to conditioning.

3.2.1 Conditional probability, conditional PDF, and conditional independence
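As a minimal numerical illustration of conditional probability, the total probability theorem, and Bayes' rule (the two hypotheses, priors, and likelihoods below are invented for illustration; this is not an example from the book):

```python
# Two-hypothesis Bayes' rule: P(H1|o) = P(o|H1)P(H1) / sum_j P(o|Hj)P(Hj)
prior = {"H1": 0.3, "H2": 0.7}
likelihood = {"H1": 0.8, "H2": 0.1}   # P(o | Hj) for one observed event o

evidence = sum(likelihood[h] * prior[h] for h in prior)   # total probability theorem
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)   # {'H1': 0.774..., 'H2': 0.225...}
```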

chapter 3|5 pages

3 Conditional Expectations

chapter 3|7 pages

5 Markov Chain and Hidden Markov Sequence

In this section, we provide the basics for two very important random sequences related to the general Markov sequence discussed above.

3.5.1 Markov chain as discrete-state Markov sequence

A Markov chain, or discrete-state Markov sequence, is a special case of a general Markov sequence. The state space of a Markov chain is of a discrete nature and is finite, and its one-step transition probabilities can be collected into a matrix, which is called the transition matrix of the Markov chain. Given the transition probabilities of a Markov chain, the state-occupation probability can be easily computed. The computation is recursive according to
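A minimal sketch of this recursion (assuming NumPy; the 3-state transition matrix below is illustrative): the state-occupation probability vector at time t is obtained by multiplying the vector at time t-1 by the transition matrix.

```python
import numpy as np

# Transition matrix of a 3-state Markov chain (rows sum to 1)
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
p = np.array([1.0, 0.0, 0.0])   # initial state-occupation probabilities

# Recursive computation: p_t = p_{t-1} @ A
for t in range(1, 6):
    p = p @ A
    print(t, p)
```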

chapter |3 pages

$$\cdots = \sum_{i} p\big(s_t = j,\; o_t \mid s_{t-1} = i\big)\,\cdots$$

chapter 4|4 pages

Linear Model and Dynamic System Model

chapter |14 pages

$$o[n] = \sum_{k} \theta_k\, \phi_k[n] + v[n].$$

This can be put in the canonical form of a linear model with the parameter vector $\theta$. This is a time-invariant linear model because the parameter vector $\theta$ is not a function of time, $n$. To turn this linear model into a time-varying one, we impose time dependence
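Before time dependence is introduced, the time-invariant case can be sketched numerically as follows (assuming NumPy; the polynomial basis and parameter values are illustrative choices, not the book's example). The model is put into the canonical form $o = H\theta + v$ and the fixed parameter vector is recovered by least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
N, theta_true = 50, np.array([1.0, -0.5, 0.02])

# Canonical linear model o = H @ theta + v, with an illustrative polynomial basis
n = np.arange(N)
H = np.column_stack([n**0, n**1, n**2])          # observation matrix (N x p)
o = H @ theta_true + rng.normal(0, 0.5, size=N)  # noisy observations

# Least-squares estimate of the time-invariant parameter vector theta
theta_hat, *_ = np.linalg.lstsq(H, o, rcond=None)
print(theta_hat)   # close to [1.0, -0.5, 0.02]
```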

chapter 4|1 pages

4 Time-Varying Linear Dynamic System Model

4.4.1 From time-invariant model to time-varying model The linear state-space model defined earlier by Eqs. 4.14 and 4.15 is time invariant. This is so because the parameters 0 = {A, u, Q, C, R} that characterize this model do not change as a function of time k. When these model parameters are constant, it can be easily shown that the first-, second-, and higher-order statistics are all invariant with
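A minimal simulation sketch of such a time-invariant linear state-space model (assuming NumPy; the numerical values of A, u, C, Q, and R below are arbitrary illustrations, not from the book), showing the state and observation equations driven by fixed parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

# Time-invariant linear state-space model:
#   x[k+1] = A x[k] + u + w[k],   w ~ N(0, Q)
#   o[k]   = C x[k] + v[k],       v ~ N(0, R)
A = np.array([[0.95, 0.10], [0.0, 0.90]])
u = np.array([0.05, 0.0])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])

x = np.zeros(2)
for k in range(5):
    o = C @ x + rng.multivariate_normal(np.zeros(1), R)
    print(k, x, o)
    x = A @ x + u + rng.multivariate_normal(np.zeros(2), Q)
```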

chapter 4|5 pages

5 Non-Linear Dynamic System Model

4.5.1 From linear model to nonlinear model

Many physical systems are characterized by nonlinear relationships between various physical variables. Taking the speech process as an example, we can use a dynamic system model to describe the speech production process. The state equation can be used to describe the dynamic articulation process, while the observation equation can be used to
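The following is a toy sketch of such a nonlinear dynamic system (assuming NumPy; the target-directed first-order state dynamics and the tanh observation mapping are illustrative stand-ins, not the book's equations): the state equation drives an articulatory variable toward a target, and a nonlinear observation equation maps it to an acoustic measurement.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative nonlinear dynamic system (not the book's exact equations):
# a target-directed articulatory state with a nonlinear observation mapping.
Phi, target = 0.9, 1.5           # state dynamics coefficient and articulatory target
def h(x):                        # nonlinear articulatory-to-acoustic mapping
    return np.tanh(x)

x = 0.0
for k in range(8):
    o = h(x) + rng.normal(0, 0.05)                            # observation equation
    print(k, round(x, 3), round(o, 3))
    x = Phi * x + (1 - Phi) * target + rng.normal(0, 0.02)    # state equation
```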

chapter 5|1 pages

1 Classical Optimization Techniques

chapter |20 pages

$$f(x^* + \Delta x) = f(x^*) + \sum_{r \ge 1} \frac{1}{r!} \sum_{i_1} \cdots \sum_{i_r} \frac{\partial^{\,r} f(x^*)}{\partial x_{i_1} \cdots \partial x_{i_r}}\; \Delta x_{i_1} \cdots \Delta x_{i_r},$$

where, for each $r$, the number of summations equals $r$.
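As a small numerical sanity check of a truncated expansion of this kind (not from the book; a sketch assuming NumPy and an arbitrary scalar test function), the second-order Taylor approximation around $x^*$ agrees with the function value up to a cubic remainder:

```python
import numpy as np

# Second-order Taylor approximation of f around x*, as used in classical
# optimality analysis (illustrative scalar function, not from the book).
def f(x):      return np.sin(x) + 0.5 * x**2
def df(x):     return np.cos(x) + x
def d2f(x):    return -np.sin(x) + 1.0

x_star, dx = 0.3, 0.05
taylor2 = f(x_star) + df(x_star) * dx + 0.5 * d2f(x_star) * dx**2
print(f(x_star + dx), taylor2)   # agree to O(dx^3)
```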

chapter |6 pages

The several estimators discussed in the above preliminary section, although fundamental to estimation theory, are often not widely used in engineering and signal/speech processing applications. This is because they are either difficult to find and to compute (while having highly desirable qualities, such as an MVU estimator), or they are too empirical in nature (such as the method of moments). The requirement for knowing

chapter 5|14 pages

6 Maximum Likelihood Estimation

estimation of deterministic parameters in statistical signal processing, and in speech processing in particular. In addition to its asymptotic optimality properties, the MLE technique is especially powerful in handling complex situations in which the data are only partially, rather than fully, observable. In this section, we will discuss the simplest case
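For the fully observed case, a minimal sketch of maximum likelihood estimation (assuming NumPy; the Gaussian data and its true parameters are illustrative) uses the closed-form ML estimates of a Gaussian mean and variance; the partially observed case, handled by the EM algorithm, is treated later in the book.

```python
import numpy as np

rng = np.random.default_rng(4)
o = rng.normal(loc=2.0, scale=1.5, size=1_000)   # fully observed Gaussian data

# Closed-form maximum likelihood estimates for a Gaussian model
mu_ml = o.mean()
var_ml = ((o - mu_ml) ** 2).mean()   # note: 1/N, not 1/(N-1)
print(mu_ml, var_ml)                 # close to 2.0 and 2.25
```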

chapter |3 pages

$$\hat{\theta} = E(\theta \mid o) = \mu_\theta + C_\theta H^T \big(H C_\theta H^T + C_w\big)^{-1} \big(o - H\mu_\theta\big),$$

which is the classic estimator.

5.7.2 Bayesian linear model

Now consider the Bayesian linear model (for which the MMSE computation is also relatively easy): $o = H\theta + w$, where $\theta$ is the unknown (random) parameter to be estimated, with prior PDF $\mathcal{N}(\mu_\theta, C_\theta)$. If $o$ and $\theta$ are jointly Gaussian, then the conditional PDF, $p(\theta \mid o)$, is also Gaussian, with $E(\theta \mid o) = E(\theta) + C_{\theta o}\, C_{oo}^{-1}\big(o - E(o)\big)$. Applying this property to $o$ and $\theta$, which are jointly Gaussian in the Bayesian linear model,
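A minimal numerical sketch of the MMSE estimate $\hat{\theta} = \mu_\theta + C_\theta H^T (H C_\theta H^T + C_w)^{-1}(o - H\mu_\theta)$ for this Bayesian linear model (assuming NumPy; the dimensions, prior, and noise covariance below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(5)

# Bayesian linear model o = H theta + w, theta ~ N(mu_t, C_t), w ~ N(0, C_w)
N, p = 20, 2
H = rng.normal(size=(N, p))
mu_t = np.zeros(p)
C_t = np.eye(p)
C_w = 0.25 * np.eye(N)

theta_true = rng.multivariate_normal(mu_t, C_t)
o = H @ theta_true + rng.multivariate_normal(np.zeros(N), C_w)

# MMSE estimate: E(theta|o) = mu_t + C_t H^T (H C_t H^T + C_w)^{-1} (o - H mu_t)
S = H @ C_t @ H.T + C_w
theta_mmse = mu_t + C_t @ H.T @ np.linalg.solve(S, o - H @ mu_t)
print(theta_true, theta_mmse)
```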

chapter |1 pages

$$\hat{\theta} = E(\theta) + C_{\theta o}\, C_{oo}^{-1}\,\big(o - E(o)\big), \qquad (5.36)$$

where now $C_{\theta o} = E\big[(\theta - E(\theta))(o - E(o))^T\big]$ is the $p \times N$ cross-covariance matrix. The error covariance matrix is

$$M_{\hat{\theta}} = C_{\theta\theta} - C_{\theta o}\, C_{oo}^{-1}\, C_{o\theta},$$

where $C_{\theta\theta} = E\big[(\theta - E(\theta))(\theta - E(\theta))^T\big]$ is the $p \times p$ covariance matrix.

chapter |1 pages

e are assumed zero mean. Thus the

chapter |11 pages

$$\hat{\theta}[n] = \hat{\theta}[n-1] + K[n]\big(o[n] - h^T[n]\,\hat{\theta}[n-1]\big), \qquad
K[n] = \frac{M[n-1]\,h[n]}{\sigma^2 + h^T[n]\,M[n-1]\,h[n]}, \qquad
M[n] = \big(I - K[n]\,h^T[n]\big)\,M[n-1],$$

where $M[n]$ is the error covariance matrix, $M[n] = E\big[(\theta - \hat{\theta}[n])(\theta - \hat{\theta}[n])^T\big]$.

8 State Estimation

Most of the material presented so far in this chapter has concerned the problem of estimating parameters in statistical models. Some common statistical models used in signal processing, and speech processing in particular, have been covered in Chapters 3 and 4. These models can be classified into two groups. The first group comprises the models with no "hidden" random variables or states; that is, all random variables defined in the
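A minimal sketch of the sequential least-squares update shown above (assuming NumPy; the two-parameter regression problem, noise level, and initial covariance are illustrative, and the gain uses the scalar-observation form):

```python
import numpy as np

rng = np.random.default_rng(6)

# Sequential update: theta[n] = theta[n-1] + K[n] * (o[n] - h^T theta[n-1])
p = 2
theta_true = np.array([1.0, -2.0])
theta = np.zeros(p)                 # initial estimate
M = 100.0 * np.eye(p)               # initial error covariance (large = vague prior)
sigma2 = 0.25                       # observation-noise variance

for n in range(200):
    h = rng.normal(size=p)                          # regressor for sample n
    o = h @ theta_true + rng.normal(0, np.sqrt(sigma2))
    K = M @ h / (sigma2 + h @ M @ h)                # gain vector
    theta = theta + K * (o - h @ theta)             # estimate update
    M = (np.eye(p) - np.outer(K, h)) @ M            # covariance update

print(theta)   # close to [1.0, -2.0]
```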

chapter 6|1 pages

Statistical Pattern Recognition

chapter 6|2 pages

1 Bayes' Decision Theory

Bayes' decision theory is the foundation for optimal pattern classifier design, and provides the "fundamental equation" for modern speech recognition. The theory quantifies the concept of "accuracy" in pattern classification and recognition in statistical terms. That is, it defines the measure of accuracy in terms of the minimum expected risk, which can be achieved via the use of the Bayes decision rule.
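As a minimal sketch of the Bayes (MAP) decision rule for two classes (assuming SciPy is available; the class priors and the Gaussian class-conditional densities are invented for illustration), the rule picks the class with the largest prior-weighted likelihood:

```python
from scipy.stats import norm

# Two-class MAP decision rule: choose the class maximizing P(class) * p(o | class).
priors = {"c1": 0.6, "c2": 0.4}
models = {"c1": norm(loc=0.0, scale=1.0), "c2": norm(loc=2.0, scale=1.0)}

def map_decide(o):
    return max(priors, key=lambda c: priors[c] * models[c].pdf(o))

print(map_decide(0.3), map_decide(1.8))   # 'c1', 'c2'
```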

chapter 6|15 pages

2 Minimum Classification Error Criterion for Recognizer Design

The practical issues discussed above, which weigh against the theoretical optimality of the MAP-based classifier design, motivate the search for alternative classifier designs. In this section, we discuss one such alternative, namely the use of the minimum classification error (MCE) criterion. This particular approach was originally proposed in [Juang 92],

chapter |4 pages


part 2|2 pages

Part II FUNDAMENTALS OF SPEECH SCIENCE

chapter 7|16 pages

Phonetic Process

7.1 Introduction

chapter |44 pages

[Figure: formant chart (F2)]

chapter 8|11 pages

Phonological Process

8.1 Introduction

chapter 8|6 pages

5 Feature Geometry — Internal Organization of Speech Sounds

8.5.1 Introduction

The discussion in the preceding section illustrates that the basic units of phonological representation are not phonemes but features. Features play the same linguistically meaningful distinctive role as phonemes, but the use of features offers straightforward

chapter |1 pages

[Feature-geometry diagram for the segments n, p, ou, z, with feature specifications including [+cons], [−son], [−cont], [+nas], [+vc], [−vc], [cor], and [lab]]

chapter |8 pages

[Feature-geometry diagram for the segments k, u, l, with feature specifications including [−cons], [+son], [+cont], and [+lat]]

chapter |1 pages

[Syllable-structure diagram: onset node]

chapter |3 pages


part 3|2 pages

Part III COMPUTATIONAL PHONOLOGY AND PHONETICS

chapter 9|19 pages

Computational Phonology

chapter 9|17 pages

4 Use of High-Level Linguistic Constraints

chapter 10|9 pages

Computational Models for Speech Production

chapter |12 pages

$$Q_c(\theta \mid \bar{\theta}) = \sum_{S} p\big(S \mid o_1^T, \bar{\theta}\big)\, \log P\big(o_1^T, S \mid \theta\big)
= \sum_{S} p\big(S \mid o_1^T, \bar{\theta}\big) \sum_{t=1}^{T} \Big[N_t(s_t) + \log a_{s_t s_{t+1}}\Big]. \qquad (10.4)$$

To simplify the writing, denote by $N_t(i)$ the quantity

$$N_t(i) = -\frac{D}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_i| - \frac{1}{2}\big(o_t - g_t(\Lambda_i)\big)^T \Sigma_i^{-1}\big(o_t - g_t(\Lambda_i)\big). \qquad (10.5)$$
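The Gaussian term $N_t(i)$ can be computed directly; here is a minimal sketch (assuming NumPy and SciPy are available; the observation, the predicted mean $g_t$, and the covariance $\Sigma_i$ are placeholder values), with the result cross-checked against a library log-density:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Per-frame Gaussian log-likelihood N_t(i) for observation o_t against the
# state-i predicted mean g_t and covariance Sigma_i (illustrative values).
o_t = np.array([1.0, 0.5])
g_t = np.array([0.8, 0.7])
Sigma_i = np.array([[0.2, 0.05], [0.05, 0.3]])

D = o_t.size
diff = o_t - g_t
N_t = (-0.5 * D * np.log(2 * np.pi)
       - 0.5 * np.log(np.linalg.det(Sigma_i))
       - 0.5 * diff @ np.linalg.solve(Sigma_i, diff))
print(N_t, multivariate_normal(mean=g_t, cov=Sigma_i).logpdf(o_t))  # equal
```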

chapter |9 pages


chapter 10|5 pages

4 Hidden Dynamic Model Implemented Using Piece-wise Linear Approximation

In this section, we will describe a particular computational implementation of the hidden dynamic model in Eqs. 10.40 and 10.41, where a deterministic target vector is assumed and the general nonlinear function h[x(k)] in Eq. 10.41 is approximated by a set of

chapter |1 pages

$$p\big(\{o, m\}_1^N \mid \Theta\big) = \prod_{n=1}^{N} p\big(o_n \mid m_n, \Theta\big)\, p\big(m_n \mid \Theta\big),
\qquad
p\big(\{o\}_1^N \mid \Theta\big) = \prod_{n=1}^{N} \sum_{m} p\big(o_n \mid m, \Theta\big)\, p\big(m \mid \Theta\big), \qquad (10.52)$$

where we define the (token-dependent) mixture weighting factors to be the posterior component probabilities computed with $\bar{\Theta}$. The EM auxiliary function is

$$\sum_{\{M\}^N} \log p\big(\{o, x, M\}^N \mid \Theta\big)\; p\big(\{M\}^N \mid \{o\}^N, \bar{\Theta}\big),$$

where $\bar{\Theta}$ denotes the model parameters associated with the immediately previous iteration of the EM algorithm. Since $x$ is continuously distributed and

chapter |3 pages

$$Q(\Theta \mid \bar{\Theta}) = \sum_{n}\sum_{m} \Big\{ \log p\big(x_1^n \mid m\big) + \sum_{k} \log p\big(x_k^n \mid x_{k-1}^n, m\big) + \sum_{k} \log p\big(o_k^n \mid x_k^n, m\big) + \log p(m \mid \Theta) \Big\}. \qquad (10.56)$$

After substituting Eq. 10.52 into Eq. 10.56 and changing the order of the summations in

chapter |5 pages

Differentiating $Q$ with respect to the covariance matrices $Q_m$ and $R_m$ involves sums of the form $\sum_{n=1}^{N}\sum_{k=1}^{K_n}(\cdot)$. Setting the derivatives equal to zero, we obtain the estimates for $Q_m$ and $R_m$:

chapter |4 pages

$$P\big(s_t \mid s_{t-1}\big) = \prod_{l=1}^{L} P\big(s_t^{(l)} \mid s_{t-1}^{(l)}\big), \qquad P\big(s_1^{(l)} = i\big) = \pi_i^{(l)}.$$

The transition structure of this constrained (uncoupled) factorial Markov chain can be parameterized by $L$ distinct $K^{(l)} \times K^{(l)}$ matrices. This is significantly simpler than the original $K^L \times K^L$ transition matrix, as in the unconstrained case. This model is called an overlapping feature model because the independent dynamics of the features at different tiers cause many ways in which different feature values
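A minimal sketch of this factored parameterization (assuming NumPy; the two tiers and their small transition matrices are invented for illustration): the joint transition probability is the product of per-tier transition probabilities.

```python
import numpy as np

# Factorial Markov chain with L independent tiers: the joint transition
# probability factors into a product of small per-tier transition matrices.
A_tiers = [np.array([[0.9, 0.1], [0.2, 0.8]]),          # tier 1, K=2 states
           np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.0, 0.3, 0.7]])]                 # tier 2, K=3 states

def joint_transition(prev, curr):
    """P(s_t = curr | s_{t-1} = prev) as a product over tiers."""
    return np.prod([A[i, j] for A, i, j in zip(A_tiers, prev, curr)])

print(joint_transition(prev=(0, 1), curr=(1, 1)))   # 0.1 * 0.8 = 0.08
```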

chapter |2 pages

6 Summary

where $y_j = \sum_{i=1}^{I} W_{ji}\, x_i$. Besides the MLP, the radial basis function (RBF) network is an attractive alternative choice, as another form of universal function approximator, for implementing the articulatory-to-acoustic mapping.
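A minimal sketch of an RBF mapping of this kind (assuming NumPy; the number of centers, the Gaussian width, and the random weights are placeholders rather than trained values); in practice the centers and weights would be fitted to paired articulatory-acoustic data:

```python
import numpy as np

rng = np.random.default_rng(7)

# Minimal RBF-network mapping from an articulatory vector x to an acoustic
# vector y: y = sum_j w_j * exp(-||x - c_j||^2 / (2 sigma^2)).
n_centers, x_dim, y_dim = 10, 3, 2
centers = rng.normal(size=(n_centers, x_dim))
weights = rng.normal(size=(n_centers, y_dim))
sigma = 1.0

def rbf_map(x):
    phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2 * sigma ** 2))
    return phi @ weights

print(rbf_map(np.array([0.1, -0.2, 0.3])))
```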

chapter |31 pages

$$\frac{d^2 y}{dx^2}\bigg|_{x_i} \approx \frac{y_{i+1} - 2 y_i + y_{i-1}}{(\Delta x)^2}.$$

For the boundary point at $i = 1$, the forward-difference approximation to the derivative is used, which gives $k(x)\,\dfrac{u_2 - u_1}{\Delta x}$; no simplification was made, and all the terms involved are taken into account. Further, because there is no coupling at the edges, the longitudinal stiffness coupling coefficient
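These difference approximations are easy to verify numerically; a minimal sketch (assuming NumPy; the test function and grid spacing are arbitrary) applies the central second difference at interior points and a forward first difference at the boundary:

```python
import numpy as np

# Second-order central difference (y[i+1] - 2*y[i] + y[i-1]) / dx^2 for interior
# points, with a first-order forward difference at the boundary point i = 0.
dx = 0.01
x = np.arange(0.0, 1.0, dx)
y = np.sin(2 * np.pi * x)

d2y = (y[2:] - 2 * y[1:-1] + y[:-2]) / dx**2       # interior second derivative
dy_boundary = (y[1] - y[0]) / dx                   # forward difference at i = 0

print(d2y[:3])        # close to -(2*pi)^2 * sin(2*pi*x) at interior points
print(dy_boundary)    # close to 2*pi at x = 0
```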

chapter |2 pages


chapter |9 pages

part 4|2 pages

Part IV SPEECH TECHNOLOGY IN SELECTED AREAS

chapter 12|10 pages

Speech Recognition

chapter 12|14 pages

4 Use of HMMs in Acoustic Modeling

chapter |12 pages

$$A = \sum_{k=0}^{N-1} E\big[z(k) \mid o, \bar{\Theta}\big], \qquad B = \sum_{k=0}^{N-1} E\big[z(k)\, z(k)^T \mid o, \bar{\Theta}\big], \qquad C = \cdots, \qquad D = \cdots$$

Eq. 12.8 is another third-order nonlinear algebraic equation of the form

$$\cdots = 0. \qquad (12.12)$$

chapter |3 pages

$$\cdots\, w_{i,j}, \qquad 0 \le i \le 2Q,\;\; i < j \le 2Q,\;\; 1 \le v \le V. \qquad (12.24)$$

Re-estimation of the remaining model parameters via maximization of $Q_2(\bar{\Phi}, \Phi)$ is described in detail below. Referring to the constraint expressed in Eq. 12.20, we note that the constraint has been imposed only on $\mu_{v,m}$ and $b_v$. The transition parameters $a_{ij}$ are free of the constraint and hence can be re-estimated using the conventional formulae [Baum 72]:

chapter |1 pages

$$\cdots + E\big\{\log f(b_i, \sigma^2)\big\}, \qquad (12.44)$$

where the first term, the conditional expectation (the E-step in the EM algorithm) involving the log-likelihood function for the observation data, was derived previously in [Deng 92b] and is rewritten as

chapter |12 pages

which, after use of Eq. 12.46 again, becomes

$$\sum_{t} \gamma_t(i)\,\big\{\log(r_i) - r_i\,(o_t - b_i)^2\big\} + (p_i - 1)\log(r_i) - q_i\, r_i + \cdots = 0. \qquad (12.51)$$

chapter |7 pages

$$\hat{m}_d = \frac{\displaystyle\sum_{r=1}^{R}\sum_{k=1}^{T_r} \gamma_r(k)\, o_r(k)}{\displaystyle\sum_{r=1}^{R}\sum_{k=1}^{T_r} \gamma_r(k)}, \qquad
\hat{\sigma}_d^2 = \frac{\displaystyle\sum_{r=1}^{R}\sum_{k=1}^{T_r} \gamma_r(k)\,\big(o_r(k) - m_d\big)^2}{\displaystyle\sum_{r=1}^{R}\sum_{k=1}^{T_r} \gamma_r(k)}$$

chapter 12|5 pages

11 Statistical Language Modeling

chapter 12|2 pages

12 Summary

chapter 13|48 pages

Speech Enhancement

chapter 14|21 pages

Speech Synthesis

chapter 14|1 pages

10 Summary