
Although Shannon entropy shares many striking mathematical properties with the thermodynamic entropy of physics, engineering, and chemistry, there are also some nuanced differences. Thermodynamic entropy is viewed as the amount of disorder in a macroscopic system, and the probabilities used in entropy calculations are based on the equilibrium distribution over the possible states. Notably, thermodynamic entropy is an extensive measure whose value is proportional to the size of the system.

In epidemiological and pharmacogenomics studies, the relationship of the genetic and environmental predictors, X, to the response variable or phenotype, Y, is of primary interest. Shannon entropy can be extended to two random variables, X and Y, as the joint entropy, H(X,Y), which is defined as

H(X,Y) = -\sum_{x,y} p(x,y) \log p(x,y)    (2.2)

The summation is now taken over all of the states (x,y) in the joint distribution of (X,Y).
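To make Eq. 2.2 concrete, the joint entropy can be computed directly from a table of joint probabilities. The short Python sketch below is illustrative only: the joint distribution and the use of base-2 logarithms (entropy in bits) are assumptions for the example, not values taken from the text.

```python
from math import log2

def joint_entropy(p_xy):
    """H(X,Y) = -sum over (x,y) of p(x,y)*log2 p(x,y); zero-probability states contribute nothing."""
    return -sum(p * log2(p) for p in p_xy.values() if p > 0)

# Hypothetical joint distribution of a binary predictor X and a binary phenotype Y.
p_xy = {("x0", "y0"): 0.4, ("x0", "y1"): 0.1,
        ("x1", "y0"): 0.2, ("x1", "y1"): 0.3}

print(joint_entropy(p_xy))  # ~1.85 bits
```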

2.2.1.2 Mutual information

Mutual information, I(X,Y), is an information-theoretic metric that measures the amount of information one variable carries about another. Although the concept of mutual information is embedded in Shannon’s work [7], McGill [8] described its properties and interpretation in detail. Mutual information is defined as

I(X,Y) = H(X) + H(Y) – H(X,Y)    (2.3)

If two variables are independent, then knowing the value of one gives us no information about the second: the mutual information I(X,Y) is zero when X and Y are independent. Mutual information is maximal when X and Y are completely dependent, that is, when one variable completely describes the other. The mutual information is proportional to the log-likelihood relative to independence; the larger the mutual information, the more informative the variable is about the phenotype. In statistical terminology, mutual information can be viewed as assessing the main effects, which in our information-theoretic approach is a first-order interaction.

The Kullback-Leibler divergence (KLD) between two probability mass functions p(x) and q(x) is denoted by KLD(p||q) and is also known as the relative entropy. The definition of the KLD is [9]

KLD(p||q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}    (2.4)

The KLD measures the inefficiency of assuming that the distribution is q when the true distribution is p. The KLD always takes nonnegative values and is zero only if p = q [10]. If the distribution p can be viewed as representing a statistical hypothesis, the KLD is the expected log-likelihood ratio.
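As an illustration of Eqs. 2.3 and 2.4, the sketch below computes the mutual information of a joint distribution together with the KLD between that joint distribution and the product of its marginals; the two quantities coincide, which is one way to see that mutual information quantifies departure from independence. As before, the joint probability table and the base-2 logarithms are illustrative assumptions.

```python
from math import log2

def entropy(p):
    """Shannon entropy -sum p(s)*log2 p(s) over states s with nonzero probability."""
    return -sum(v * log2(v) for v in p.values() if v > 0)

def marginals(p_xy):
    """Marginal distributions p(x) and p(y) from a joint table keyed by (x, y)."""
    px, py = {}, {}
    for (x, y), p in p_xy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def mutual_information(p_xy):
    """I(X,Y) = H(X) + H(Y) - H(X,Y), as in Eq. 2.3."""
    px, py = marginals(p_xy)
    return entropy(px) + entropy(py) - entropy(p_xy)

def kld(p, q):
    """KLD(p||q) = sum_x p(x)*log2(p(x)/q(x)), as in Eq. 2.4; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(p[x] * log2(p[x] / q[x]) for x in p if p[x] > 0)

# Same hypothetical joint distribution as in the joint-entropy sketch.
p_xy = {("x0", "y0"): 0.4, ("x0", "y1"): 0.1,
        ("x1", "y0"): 0.2, ("x1", "y1"): 0.3}
px, py = marginals(p_xy)
p_indep = {(x, y): px[x] * py[y] for x in px for y in py}

# I(X,Y) equals KLD(joint || product of marginals): both quantify departure from independence.
print(mutual_information(p_xy))  # ~0.125 bits
print(kld(p_xy, p_indep))        # same value
```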

We now discuss how entropy-based metrics can be used to search for, identify, and measure GGIs and GEIs.

2.2.1.3 The k-way interaction information

The analysis of GGI and GEI requires multivariate extensions of entropy-based metrics. The k-way interaction information (KWII) is a multivariate information-theoretic measure that quantitates the information about the phenotype of interest that can be obtained only from specific subsets of variables in a given data set [8, 11]. For the three-variable case (A,B,Y), where A and B are genetic or environmental predictor variables and Y is the phenotype of interest, the KWII is written in terms of the individual entropies H(A), H(B), and H(Y) and the joint entropies H(A,B), H(A,Y), H(B,Y), and H(A,B,Y) as

KWII(A,B,Y) = – H(A) – H(B) – H(Y) + H(A,B) + H(A,Y) + H(B,Y) – H(A,B,Y)    (2.5)

Figure 2.1 is an information Venn diagram that highlights the relationship of the KWII in the three-variable case to the entropies of the lower-order subsets. The central core of the Venn diagram represents the KWII. For the general case of the set υ = {X1,X2,…,Xk,Y}, containing k predictors and the phenotype Y, the KWII is written as an alternating sum over all possible subsets T of υ. Using the difference-operator notation of Han [12],

KWII(υ) = -\sum_{T \subseteq υ} (-1)^{|υ|-|T|} H(T)    (2.6)

The symbols |υ| and |T| represent the sizes of the set υ and its subset T, respectively. The number of genetic and environmental variables, k (not including the phenotype), in a combination is called the order of the interaction.
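The alternating sum in Eq. 2.6 translates directly into code. The sketch below computes the KWII of a variable combination from a joint probability table; for the three-variable set (A,B,Y) it reproduces Eq. 2.5. The joint distribution and the base-2 logarithms are, as in the earlier sketches, illustrative assumptions.

```python
from itertools import combinations
from math import log2

def marginal_entropy(p_joint, names, subset):
    """Entropy of the marginal distribution over the variables in `subset`.

    `p_joint` maps full states (tuples aligned with `names`) to probabilities.
    """
    idx = [names.index(v) for v in subset]
    marg = {}
    for state, p in p_joint.items():
        key = tuple(state[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def kwii(p_joint, names):
    """KWII(v) = -sum over nonempty subsets T of v of (-1)^(|v|-|T|) * H(T), as in Eq. 2.6."""
    k = len(names)
    total = 0.0
    for size in range(1, k + 1):  # the empty subset has H = 0 and can be skipped
        for subset in combinations(names, size):
            total -= (-1) ** (k - size) * marginal_entropy(p_joint, names, subset)
    return total

# Hypothetical joint distribution over a SNP A, an environmental exposure B, and a binary phenotype Y.
names = ("A", "B", "Y")
p_aby = {(0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
         (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.15, (1, 1, 1): 0.10}

print(kwii(p_aby, names))  # three-way interaction information for (A, B, Y), in bits
```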