ABSTRACT

DNA microarray technology has provided an efficient way of measuring the expression levels of thousands of genes simultaneously in a single experiment on a single “chip.” The potential of such technologies for functional genomics is tremendous. Measuring gene expression levels under different conditions may prove useful in medical diagnosis, treatment and drug design. Microarray technology has been heralded as the new biological revolution after the advent of the Human Genome Project, since it has become possible to extract important information from gene expression time-series data.

Through the extensive microarray image processing stages conducted in previous chapters, we obtain a global view of the expression levels of all genes when the cell undergoes specific conditions or processes. The next step is to infer useful biological information and determine the relationships between individual genes. In this regard, many current research efforts have focused on clustering. Cluster analysis of gene expression data first appeared in [70] and quickly attracted considerable research attention. A number of clustering algorithms have been examined on gene expression data, such as hierarchical clustering [70], the self-organizing map [202], k-means [203] and Gaussian model-based clustering [171, 228], to name just a few [110]. However, a fundamental shortcoming of such clustering schemes is that they assume that a correlation-based similarity exists between genes. Recently, there has been growing research interest in reconstructing models for gene regulatory networks from time series data [56, 190], such as the Boolean network model [5, 106, 131, 194], the linear differential equation model [43, 55, 58, 105], the Bayesian model [96, 119, 132, 146], the state space model [22, 173, 222] and the stochastic model [48, 207].

Selecting a good model to fit gene regulatory networks is essential to a meaningful analysis of the expression data. It turns out that a model for gene regulatory networks should possess the following four properties. First, the model should make it easy to evolve the biological information, as a linear dynamical model does. Second, the model should reflect “stochastic” characteristics, since it is well known that gene expression is an inherently stochastic phenomenon [121, 136, 146, 204]. Third, the observations (measurement outputs) of the model should be regarded as noisy, due to our inability to measure gene expression levels perfectly (that is, noise-free). Fourth, in biology and medicine, the available time series (e.g., gene expression time series) typically consist of a large number of variables but only a small number of observations. Therefore, the modeling method should be capable of tackling short time series with acceptable accuracy.

There have been attempts to reconstruct models for gene regulatory networks by taking into account the aforementioned properties. Dynamic Bayesian networks have been proposed to model gene expression time series data [119, 132, 146]. The merits of dynamic Bayesian networks include the ability to model stochasticity and to handle noisy or hidden variables. However, dynamic Bayesian networks require more complex algorithms, such as the genetic algorithm [119, 201], to infer gene regulatory networks. Another model is the state space model [22, 173, 222], whose main feature is that the gene expression value depends not only on the current internal state variables but also on the external inputs. Interestingly, the external input is viewed as the observation at the previous time step, and the gene regulation matrix is obtained from the relationship between the current measurement, the previous measurement, and the internal state variables [22, 173]. To use state space models, the measurements need to be accurate and a suitable dimension for the internal state variables needs to be determined beforehand, which raises considerable difficulties in experimentation and computation.

In this chapter, we view the gene regulatory network as a dynamic stochastic model, which is composed of the gene measurement equation and the gene regulation equation. In order to reflect reality, we consider the gene measurement from the microarray as noisy, and assume that the gene regulation equation is a first-order autoregressive (AR) stochastic dynamic process. Note that it is very important to regard the models as stochastic, since gene expression is inherently stochastic. Stochastic models can help conduct more realistic simulations of biological systems, and also set up a practical criterion for measuring the robustness of the mathematical models against stochastic noise. After specifying the model structure, we apply, for the first time, the expectation-maximization (EM) algorithm to identify both the model parameters and the actual values of the gene expression levels. Note that the EM algorithm is a learning algorithm that can handle sparse parameter identification and noisy data very well. It is also shown that the EM algorithm can cope with microarray gene expression data that have a large number of variables but a small number of observations. Four real-world gene expression data sets are employed to demonstrate the effectiveness of our algorithm, and some indices are used to evaluate the models of inferred gene regulatory networks from the viewpoint of bioinformatics.

The remainder of this chapter is organized as follows. In Section 10.2, a stochastic dynamic model is described for the genetic regulatory network, which takes into account the noisy measurement as well as the inherently stochastic phenomenon of the genetic regulatory process. The EM algorithm is introduced in Section 10.3 for handling the sparse parameter identification problem and the noisy data analysis. In Section 10.4, our developed algorithm is applied to four real-world gene expression data sets, and the biological significance is discussed in terms of certain criteria. Further discussion is made in Section 10.5 to explain the advantages and shortcomings of our method. Some concluding remarks and future research topics are provided in Section 10.6.
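
As a rough illustration of the model structure outlined in this introduction (a first-order autoregressive gene regulation equation observed through a noisy gene measurement equation), the following Python sketch simulates a short time series for a handful of genes. The specific form x(k+1) = A x(k) + w(k), y(k) = x(k) + v(k), the Gaussian noise, the example matrix A, and every function and parameter name are assumptions made purely for illustration; the precise model is the one given in Section 10.2, and no parameter estimation (EM or otherwise) is attempted here.

```python
import numpy as np


def simulate_ar1_network(A, n_steps, process_std=0.1, meas_std=0.1, seed=0):
    """Simulate a hypothetical first-order AR gene regulation model.

    Assumed form (illustrative only, not taken from the chapter):
        regulation:  x[k] = A @ x[k-1] + w[k],  w ~ N(0, process_std^2 I)
        measurement: y[k] = x[k] + v[k],        v ~ N(0, meas_std^2 I)
    """
    rng = np.random.default_rng(seed)
    n_genes = A.shape[0]
    x = np.zeros((n_steps, n_genes))  # actual (unobserved) expression levels
    y = np.zeros((n_steps, n_genes))  # noisy microarray-style measurements

    x[0] = rng.normal(size=n_genes)   # arbitrary random initial state
    y[0] = x[0] + meas_std * rng.normal(size=n_genes)
    for k in range(1, n_steps):
        x[k] = A @ x[k - 1] + process_std * rng.normal(size=n_genes)
        y[k] = x[k] + meas_std * rng.normal(size=n_genes)
    return x, y


# Hypothetical sparse regulation matrix for three genes:
# entry (i, j) describes how gene j at step k-1 influences gene i at step k.
A = np.array([
    [0.5, 0.0, -0.3],
    [0.2, 0.4, 0.0],
    [0.0, 0.0, 0.6],
])

# A deliberately short series, mimicking "few observations" data.
x, y = simulate_ar1_network(A, n_steps=12)
print(x.shape, y.shape)
```

In this sketch the identification task described above corresponds to recovering both A and the hidden trajectory x from the noisy observations y alone, which is hard precisely because the series is short.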