
Pattern Classification and Diagnostic Decision

The final purpose of biomedical image analysis is to classify a given image, or the features that have been detected in the image, into one of a few known categories. In medical applications, a further goal is to arrive at a diagnostic decision regarding the condition of the patient. A physician or medical specialist may achieve this goal via visual analysis of the image and data presented. Comparative analysis of the given image with others of known diagnoses, or the application of established protocols and sets of rules, may assist in such a decision-making process. Images taken earlier of the same patient may also be used, when available, for comparative or differential analysis. Some measurements may also be made from the given image to assist in the analysis. The basic knowledge, clinical experience, expertise, and intuition of the physician play significant roles in this process.

When image analysis is performed via the application of computer algorithms, the typical result is the extraction of a number of numerical features. When the numerical features relate directly to measurements of organs or features represented by the image, such as an estimate of the size of the heart or the volume of a tumor, the clinical specialist may be able to use the features directly in his or her diagnostic logic. However, when parameters such as measures of texture and shape complexity are derived, a human analyst is not likely to be able to analyze or comprehend the features. Furthermore, as the number of the computed features increases, the associated diagnostic logic may become too complicated and unwieldy for human analysis. Computer methods would then be desirable to perform the classification and decision process.

At the outset, it should be borne in mind that a biomedical image forms but one piece of information in arriving at a diagnosis: the classification of a given image into one of many categories may assist in the diagnostic procedure, but will almost never be the only factor. Regardless, pattern classification based upon image analysis is indeed an important aspect of biomedical image analysis, and forms the theme of the present chapter. Remaining within the realm of CAD as introduced in Figure and Section , it would be preferable to design methods so as to aid a medical specialist in arriving at a diagnosis, rather than to provide a decision.

A generic problem statement for pattern classification may be expressed as follows: A number of measures and features have been derived from a biomedical image. Develop methods to classify the image into one of a few specified categories. Investigate the relevance of the features and the classification methods in arriving at a diagnostic decision about the patient.

Observe that the features mentioned above may have been derived manually or by computer methods. Recognize the distinction between classifying the given image and arriving at a diagnosis regarding the patient: the connection between the two tasks or steps may not always be direct. In other words, a pattern classification method may facilitate the labeling of a given image as being a member of a particular class; arriving at a diagnosis of the condition of the patient will most likely require the analysis of several other items of clinical information. Although it is common to work with a prespecified number of pattern classes, many problems do exist where the number of classes is not known a priori. A special case is screening, where the aim is to simply decide on the presence or absence of a certain type of abnormality or disease. The initial decision in screening may be further focused on whether the subject appears to be free of the specific abnormality of concern, or requires further investigation.

The problem statement and description above are rather generic. Several considerations arise in the practical application of the concepts mentioned above to medical images and diagnosis. Using the detection of breast cancer as an example, the following questions illustrate some of the problems encountered in practice:

- Is a mass or tumor present? (Yes/No)
- If a mass or tumor is present:
  - Give or mark its location.
  - Compare the density of the mass to that of the surrounding tissues: hypodense, isodense, hyperdense.
  - Describe the shape of its boundary: round, ovoid, irregular, macrolobulated, microlobulated, spiculated.
  - Describe its texture: homogeneous, heterogeneous, fatty.
  - Describe its edge: sharp, well-circumscribed, ill-defined, fuzzy.
  - Decide if it is a benign mass (a cyst, solid, or fluid-filled) or a malignant tumor.
- Are calcifications present? (Yes/No)
- If calcifications are present:
  - Estimate their number per cm².
  - Describe their shape: round, ovoid, elongated, branching, rough, punctate, irregular, amorphous.
  - Describe their spatial distribution or cluster.
  - Describe their density: homogeneous, heterogeneous.
- Are there signs of architectural distortion? (Yes/No)
- Are there signs of bilateral asymmetry? (Yes/No)
- Are there major changes compared to the previous mammogram of the patient?
- Is the case normal? (Yes/No)
- If the case is abnormal: is the disease benign, or malignant (cancer)?

The items listed above give a selection of the many features of mammograms that a radiologist would investigate; see Ackerman et al. and the BI-RADS™ manual for more details. Figure shows a graphical user interface developed by Alto et al. for the categorization of breast masses, related to some of the questions listed above. Figure illustrates four segments of mammograms demonstrating masses and tumors of different characteristics, progressing from a well-circumscribed and homogeneous benign mass to a highly spiculated and heterogeneous tumor.

The subject matter of this book, image analysis and pattern classification, can provide assistance in responding to only some of the questions listed above. Even an entire set of mammograms may not lead to a final decision: other modes of diagnostic imaging and means of investigation may be necessary to arrive at a definite diagnosis.

In the following sections, a number of methods for pattern classification, decision making, and evaluation of the results of classification are reviewed and illustrated.

Note: Parts of this chapter are reproduced, with permission, from R.M. Rangayyan, Biomedical Signal Analysis: A Case-Study Approach, IEEE Press and Wiley, New York, NY. © IEEE.

Pattern Classification

Pattern recognition or classification may be defined as the categorization of the input data into identifiable classes via the extraction of significant features or attributes of the data from a background of irrelevant detail. In biomedical image analysis, after quantitative features have been extracted from the given images, each image or ROI may be represented by a feature vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$, which is also known as the measurement vector or a pattern vector.

FIGURE
Graphical user interface for the categorization of breast masses. Reproduced with permission from H. Alto, R.M. Rangayyan, R.B. Paranjape, J.E.L. Desautels, and H. Bryant, "An indexed atlas of digital mammograms for computer-aided diagnosis of breast cancer", Annales des Télécommunications. © GET-Lavoisier. Figure courtesy of C. LeGuillou, École Nationale Supérieure des Télécommunications de Bretagne, Brest, France.


FIGURE
Examples of breast mass regions and contours with the corresponding values of fractional concavity $f_{cc}$, spiculation index $SI$, compactness $cf$, acutance $A$, and sum entropy $F$. (a) Circumscribed benign mass. (b) Macrolobulated benign mass. (c) Microlobulated malignant tumor. (d) Spiculated malignant tumor. Note that the masses and their contours are of widely differing size, but have been scaled to the same size in the illustration. The first letter of the case identifier indicates a malignant diagnosis with m, and a benign diagnosis with b, based upon biopsy. The symbols after the first numerical portion of the identifier represent: l left; r right; c craniocaudal view; o mediolateral oblique view; x axillary view. The last two digits represent the year of acquisition of the mammogram. An additional character of the identifier after the year (a to f), if present, indicates the existence of multiple masses visible in the same mammogram. Reproduced with permission from H. Alto, R.M. Rangayyan, and J.E.L. Desautels, "Content-based retrieval and analysis of mammographic masses", Journal of Electronic Imaging, in press. © SPIE and IS&T.

When the values $x_i$ are real numbers, $\mathbf{x}$ is a point in an $n$-dimensional Euclidean space: vectors of similar objects may be expected to form clusters, as illustrated in Figure .

FIGURE
Two-dimensional feature vectors of two classes, $C_1$ and $C_2$. The prototypes of the two classes are indicated by the vectors $\mathbf{z}_1$ and $\mathbf{z}_2$. The linear decision function $d(\mathbf{x})$, shown as the solid line, is the perpendicular bisector of the straight line joining the two prototypes (dashed line). Reproduced with permission from R.M. Rangayyan, Biomedical Signal Analysis: A Case-Study Approach, IEEE Press and Wiley, New York, NY. © IEEE.

For efficient pattern classification, measurements that could lead to disjoint sets or clusters of feature vectors are desired. This point underlines the importance of the appropriate design of the preprocessing and feature extraction procedures. Features or characterizing attributes that are common to all patterns belonging to a particular class are known as intraset or intraclass features. Discriminant features that represent the differences between pattern classes are called interset or interclass features.

The pattern classification problem is that of generating optimal decision boundaries or decision procedures to separate the data into pattern classes, based on the feature vectors provided. Figure illustrates a simple linear decision function or boundary to separate 2D feature vectors into two classes.

Supervised Pattern Classification

The problem considered in supervised pattern classification may be stated as follows: You are provided with a number of feature vectors with classes assigned to them. Propose techniques to characterize and parameterize the boundaries that separate the classes.

A given set of feature vectors of known categorization is often referred to as a training set. The availability of a training set facilitates the development of mathematical functions that can characterize the separation between the classes. The functions may then be applied to new feature vectors of unknown classes, to classify or recognize them. This approach is known as supervised pattern classification. A set of feature vectors of known categorization that is used to evaluate a classifier designed in this manner is referred to as a test set. After adequate testing and confirmation of the method with satisfactory results, the classifier may be applied to new feature vectors of unknown classes; the results may then be used to arrive at diagnostic decisions. The following subsections describe a few methods that can assist in the development of discriminant and decision functions.

Discriminant and decision functions

A general linear discriminant or decision function is of the form

$d(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_{n+1} = \mathbf{w}^T \mathbf{x},$

where $\mathbf{x} = [x_1, x_2, \ldots, x_n, 1]^T$ is the feature vector augmented by an additional entry equal to unity, and $\mathbf{w} = [w_1, w_2, \ldots, w_n, w_{n+1}]^T$ is a correspondingly augmented weight vector. A two-class pattern classification problem may be stated as

$d(\mathbf{x}) = \mathbf{w}^T \mathbf{x} \; \begin{cases} > 0 & \text{if } \mathbf{x} \in C_1 \\ < 0 & \text{if } \mathbf{x} \in C_2, \end{cases}$

where $C_1$ and $C_2$ represent the two classes. The discriminant function may be interpreted as the boundary separating the classes $C_1$ and $C_2$, as illustrated in Figure .

In the general case of an $M$-class pattern classification problem, we will need $M$ weight vectors and $M$ decision functions to perform the following decisions:

$d_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} \; \begin{cases} > 0 & \text{if } \mathbf{x} \in C_i \\ < 0 & \text{otherwise,} \end{cases} \quad i = 1, 2, \ldots, M,$

where $\mathbf{w}_i = [w_{i1}, w_{i2}, \ldots, w_{in}, w_{i,n+1}]^T$ is the weight vector for the class $C_i$.

Three cases arise in solving this problem.

Case 1: Each class is separable from the rest by a single decision surface:
if $d_i(\mathbf{x}) > 0$, then $\mathbf{x} \in C_i$.

Case 2: Each class is separable from every other individual class by a distinct decision surface; that is, the classes are pairwise separable. There are $M(M-1)/2$ decision surfaces, given by $d_{ij}(\mathbf{x}) = \mathbf{w}_{ij}^T \mathbf{x}$:
if $d_{ij}(\mathbf{x}) > 0 \; \forall \, j \neq i$, then $\mathbf{x} \in C_i$.
[Note: $d_{ij}(\mathbf{x}) = -d_{ji}(\mathbf{x})$.]

Case 3: There exist $M$ decision functions $d_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x}$, $k = 1, 2, \ldots, M$, with the property that,
if $d_i(\mathbf{x}) > d_j(\mathbf{x}) \; \forall \, j \neq i$, then $\mathbf{x} \in C_i$.

This is a special instance of Case 2. We may define

$d_{ij}(\mathbf{x}) = d_i(\mathbf{x}) - d_j(\mathbf{x}) = (\mathbf{w}_i - \mathbf{w}_j)^T \mathbf{x} = \mathbf{w}_{ij}^T \mathbf{x}.$

If the classes are separable under Case 3, they are separable under Case 2; the converse is, in general, not true.

Patterns that may be separated by linear decision functions as above are said to be linearly separable. In other situations, an infinite variety of complex decision boundaries may be formulated by using generalized decision functions based upon nonlinear functions of the feature vectors, as

$d(\mathbf{x}) = w_1 f_1(\mathbf{x}) + w_2 f_2(\mathbf{x}) + \cdots + w_K f_K(\mathbf{x}) + w_{K+1} = \sum_{i=1}^{K+1} w_i f_i(\mathbf{x}).$

Here, $\{f_i(\mathbf{x})\}$, $i = 1, 2, \ldots, K$, are real, single-valued functions of $\mathbf{x}$, and $f_{K+1}(\mathbf{x}) = 1$.

Whereas the functions $f_i(\mathbf{x})$ may be nonlinear in the $n$-dimensional space of $\mathbf{x}$, the decision function may be formulated as a linear function by defining a transformed feature vector $\mathbf{x}^\dagger = [f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_K(\mathbf{x}), 1]^T$. Then, $d(\mathbf{x}) = \mathbf{w}^T \mathbf{x}^\dagger$, with $\mathbf{w} = [w_1, w_2, \ldots, w_K, w_{K+1}]^T$. Once evaluated, $\{f_i(\mathbf{x})\}$ is just a set of numerical values, and $\mathbf{x}^\dagger$ is simply a $K$-dimensional vector augmented by an entry equal to unity. Several methods exist for the derivation of optimal linear discriminant functions.

Example of application: The ROIs of breast masses are shown in Figure , arranged in the order of decreasing acutance $A$ (see Sections and ). Figure shows the contours of the masses arranged in the increasing order of fractional concavity $f_{cc}$ (see Section ). Most of the contours of the benign masses are seen to be smooth, whereas most of the contours of the malignant tumors are rough and spiculated. Furthermore, most of the benign masses have well-defined, sharp edges and are well-circumscribed, whereas the majority of the malignant tumors possess ill-defined and fuzzy borders. It is seen that the shape factor $f_{cc}$ facilitates the ordering of the contours in terms of shape complexity. However, the contours of a few benign masses and a few malignant tumors do not follow the expected trend. In addition, the acutance measure has lower values for most of the malignant tumors than for a majority of the benign masses.

The three shape factors $cf$, $f_{cc}$, and $SI$ (see Chapter ), the texture measures as defined by Haralick (see Section ), and four measures of edge sharpness as defined by Mudigonda et al. (see Section ) were computed for the ROIs and their contours. [Note: The factor $SI$ was divided by two in this example to reduce it to the range $[0, 1]$.] Figure gives a plot of the 3D feature-vector space $(f_{cc}, A, F)$ for the masses. The feature $F$ shows poor separation between the benign and malignant samples, whereas the feature $A$ demonstrates some degree of separation. A scatter plot of the three shape factors $(f_{cc}, cf, SI)$ of the masses is given in Figure . Each of the three shape factors demonstrates high discriminant capability.

Figure shows a 2D plot of the shape-factor vectors $(f_{cc}, SI)$ for a training set formed by selecting the vectors for benign masses and malignant tumors. The prototypes for the benign and malignant classes, obtained by averaging the vectors over all the members of the two classes in the training set, are marked as B and M, respectively, on the plot. The solid straight line is the perpendicular bisector of the line joining the two prototypes (dashed line), and represents a linear discriminant function. The equation of the straight line is $SI + \ldots f_{cc} = \ldots$ The decision function is represented by the following rule:

    if $SI + \ldots f_{cc} \leq \ldots$ then
        benign mass
    else
        malignant tumor
    end

It is seen that the rule given above will correctly classify all of the training samples.

Figure shows the result of application of the linear discriminant function designed and shown in Figure to a test set of benign masses and malignant tumors. The test set does not include any of the cases from the training set. It is seen that the classifier will lead to three false negatives in the test set.
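The following sketch illustrates how such a linear decision function may be derived from class prototypes as the perpendicular bisector of the line joining them; the feature values below are arbitrary placeholders, not the $(f_{cc}, SI)$ data of the figures:

```python
import numpy as np

# Sketch of a prototype-based linear decision function: the boundary
# is the perpendicular bisector of the line joining the two class
# prototypes. The training vectors are illustrative placeholders.

benign = np.array([[0.1, 0.1], [0.2, 0.15], [0.15, 0.05]])    # class C1
malignant = np.array([[0.7, 0.8], [0.8, 0.75], [0.75, 0.9]])  # class C2

z1 = benign.mean(axis=0)      # prototype B
z2 = malignant.mean(axis=0)   # prototype M

# d(x) = w^T x + w0 = 0 is the perpendicular bisector of the segment z1-z2:
w = z1 - z2
w0 = -0.5 * (z1 + z2).dot(z1 - z2)

def classify(x):
    return "benign mass" if w.dot(x) + w0 > 0 else "malignant tumor"

print(classify(np.array([0.2, 0.2])))   # near B -> benign
print(classify(np.array([0.9, 0.9])))   # near M -> malignant
```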

Distance functions

Consider $M$ pattern classes represented by their prototype patterns $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_M$. The prototype of a class is typically computed as the average of all the feature vectors belonging to the class. Figure illustrates schematically the prototypes $\mathbf{z}_1$ and $\mathbf{z}_2$ of the two classes shown.

FIGURE
ROIs of breast masses, including benign masses and malignant tumors. The ROIs are arranged in the order of decreasing acutance $A$. Note that the masses are of widely differing size, but have been scaled to the same size in the illustration. For details regarding the case identifiers, see Figure . Reproduced with permission from H. Alto, R.M. Rangayyan, and J.E.L. Desautels, "Content-based retrieval and analysis of mammographic masses", Journal of Electronic Imaging, in press. © SPIE and IS&T.

FIGURE
Contours of breast masses, including benign masses and malignant tumors. The contours are arranged in the order of increasing $f_{cc}$. Note that the masses and their contours are of widely differing size, but have been scaled to the same size in the illustration. For details regarding the case identifiers, see Figure . See also Figure . Reproduced with permission from H. Alto, R.M. Rangayyan, and J.E.L. Desautels, "Content-based retrieval and analysis of mammographic masses", Journal of Electronic Imaging, in press. © SPIE and IS&T.

FIGURE
Plot of the 3D feature-vector space $(f_{cc}, A, F)$ for the set of masses in Figure : o benign masses, malignant tumors. Reproduced with permission from H. Alto, R.M. Rangayyan, and J.E.L. Desautels, "Content-based retrieval and analysis of mammographic masses", Journal of Electronic Imaging, in press. © SPIE and IS&T.

FIGURE
Plot of the 3D feature-vector space $(f_{cc}, cf, SI)$ for the set of contours in Figure : o benign masses, malignant tumors. Figure courtesy of H. Alto.

The Euclidean distance between an arbitrary pattern vector $\mathbf{x}$ and the $i$th prototype is given as

$D_i = \|\mathbf{x} - \mathbf{z}_i\| = \sqrt{(\mathbf{x} - \mathbf{z}_i)^T (\mathbf{x} - \mathbf{z}_i)}.$

A simple rule to classify the pattern vector $\mathbf{x}$ would be to choose that class for which the vector has the smallest distance:

if $D_i < D_j \; \forall \, j \neq i$, then $\mathbf{x} \in C_i$.

See Section for the description of an application of the Euclidean distance to the analysis of breast masses and tumors.

A simple relationship may be established between discriminant functions and distance functions as follows:

$D_i^2 = \|\mathbf{x} - \mathbf{z}_i\|^2 = (\mathbf{x} - \mathbf{z}_i)^T (\mathbf{x} - \mathbf{z}_i) = \mathbf{x}^T \mathbf{x} - 2\, \mathbf{x}^T \mathbf{z}_i + \mathbf{z}_i^T \mathbf{z}_i = \mathbf{x}^T \mathbf{x} - 2 \left( \mathbf{x}^T \mathbf{z}_i - \tfrac{1}{2}\, \mathbf{z}_i^T \mathbf{z}_i \right).$

Choosing the minimum of $D_i$ is equivalent to choosing the minimum of $D_i^2$, because all $D_i \geq 0$.

FIGURE
Plot of the 2D feature-vector space $(f_{cc}, SI)$ for the training set of benign masses (o) and malignant tumors (x) selected from the dataset in Figure . The prototypes of the two classes are indicated by the vectors marked B and M. The solid line shown is a linear decision function, obtained as the perpendicular bisector of the straight line joining the two prototypes (dashed line).

FIGURE
Plot of the 2D feature-vector space $(f_{cc}, SI)$ for the test set of benign masses (o) and malignant tumors (x) selected from the dataset in Figure . The solid line shown is a linear decision function, designed as illustrated in Figure . Three malignant cases are misclassified by the decision function shown.

Furthermore, from the equation above, it follows that choosing the minimum of $D_i^2$ is equivalent to choosing the maximum of $\left( \mathbf{x}^T \mathbf{z}_i - \tfrac{1}{2}\, \mathbf{z}_i^T \mathbf{z}_i \right)$. Therefore, we may define the decision function

$d_i(\mathbf{x}) = \mathbf{x}^T \mathbf{z}_i - \tfrac{1}{2}\, \mathbf{z}_i^T \mathbf{z}_i, \quad i = 1, 2, \ldots, M.$

A decision rule may then be stated as:

if $d_i(\mathbf{x}) > d_j(\mathbf{x}) \; \forall \, j \neq i$, then $\mathbf{x} \in C_i$.

This is a linear discriminant function, which becomes obvious from the following representation: If $z_{ij}$, $j = 1, 2, \ldots, n$, are the components of $\mathbf{z}_i$, let $w_{ij} = z_{ij}$, $j = 1, 2, \ldots, n$; $w_{i,n+1} = -\tfrac{1}{2}\, \mathbf{z}_i^T \mathbf{z}_i$; and $\mathbf{x} = [x_1, x_2, \ldots, x_n, 1]^T$. Then, $d_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x}$, $i = 1, 2, \ldots, M$, where $\mathbf{w}_i = [w_{i1}, w_{i2}, \ldots, w_{in}, w_{i,n+1}]^T$. Therefore, distance functions may be formulated as linear discriminant or decision functions.
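A minimal sketch of the resulting minimum-distance classifier, using arbitrary illustrative prototypes, is given below:

```python
import numpy as np

# Sketch of the minimum-distance classifier expressed as the linear
# decision functions d_i(x) = x^T z_i - (1/2) z_i^T z_i.
# The prototypes are arbitrary illustrative values.

prototypes = np.array([[0.1, 0.1],    # z_1
                       [0.8, 0.8]])   # z_2

def classify(x, z):
    d = z @ x - 0.5 * np.sum(z * z, axis=1)   # d_i(x) for all i
    return int(np.argmax(d))                   # index of the largest d_i

x = np.array([0.2, 0.3])
print("x assigned to class", classify(x, prototypes) + 1)
```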

The nearest-neighbor rule

Suppose that we are provided with a set of $N$ sample patterns $\{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_N\}$ of known classification: each pattern belongs to one of $M$ classes $\{C_1, C_2, \ldots, C_M\}$, with $N \geq M$. We are then given a new feature vector $\mathbf{x}$ whose class needs to be determined. Let us compute a distance measure $D(\mathbf{s}_i, \mathbf{x})$ between the vector $\mathbf{x}$ and each sample pattern. Then, the nearest-neighbor rule states that the vector $\mathbf{x}$ is to be assigned to the class of the sample that is the closest to $\mathbf{x}$:

$\mathbf{x} \in C_i \ \text{if} \ D(\mathbf{s}_i, \mathbf{x}) = \min \{ D(\mathbf{s}_l, \mathbf{x}) \}, \quad l = 1, 2, \ldots, N.$

A major disadvantage of the above method is that the classification decision is made based upon a single sample vector of known classification. The nearest neighbor may happen to be an outlier that is not representative of its class. It would be more reliable to base the classification upon several samples: we may consider a certain number $k$ of the nearest neighbors of the sample to be classified, and then seek a majority opinion. This leads to the so-called k-nearest-neighbor or k-NN rule: Determine the $k$ nearest neighbors of $\mathbf{x}$, and use the majority of equal classifications in this group as the classification of $\mathbf{x}$. See Section for the description of an application of the k-NN method to the analysis of breast masses and tumors.
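A minimal sketch of the k-NN rule with the Euclidean distance is given below; the sample patterns and labels are arbitrary placeholders:

```python
import numpy as np

# Minimal sketch of the k-NN rule with the Euclidean distance.

def knn_classify(x, samples, labels, k=3):
    dist = np.linalg.norm(samples - x, axis=1)   # D(s_l, x), l = 1..N
    nearest = np.argsort(dist)[:k]               # the k nearest neighbors
    votes = labels[nearest]
    # majority opinion among the k neighbors:
    classes, counts = np.unique(votes, return_counts=True)
    return classes[np.argmax(counts)]

samples = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8], [0.7, 0.7]])
labels = np.array(["benign", "benign", "malignant", "malignant", "malignant"])
print(knn_classify(np.array([0.75, 0.8]), samples, labels, k=3))
```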

Unsupervised Pattern Classification

Let us consider the situation where we are given a set of feature vectors with no categorization or classes attached to them. No prior training information is available. How may we group the vectors into multiple categories?

The design of distance functions and decision boundaries requires a training set of feature vectors of known classes. The functions so designed may then be applied to a new set of feature vectors or samples to perform pattern classification. Such a procedure is known as supervised pattern classification, due to the initial training step. In some situations, a training step may not be possible, and we may be required to classify a given set of feature vectors into either a prespecified or unknown number of categories. Such a problem is labeled as unsupervised pattern classification, and may be solved by cluster-seeking methods.

Cluster-seeking methods

Given a set of feature vectors, we may examine them for the formation of inherent groups or clusters. This is a simple task in the case of 2D or 3D vectors, where we may plot them, visually identify groups, and label each group with a pattern class. Allowance may have to be made to assign the same class to multiple disjoint groups. Such an approach may be used even when the number of classes is not known at the outset. When the vectors have a dimension higher than three, visual analysis will not be feasible. It then becomes necessary to define criteria to group the given vectors on the basis of similarity, dissimilarity, or distance measures. A few examples of such measures are described below, followed by a sketch in code.

- Euclidean distance:

  $D_E = \|\mathbf{x} - \mathbf{z}\| = \sqrt{(\mathbf{x} - \mathbf{z})^T (\mathbf{x} - \mathbf{z})} = \sqrt{\sum_{i=1}^{n} (x_i - z_i)^2}.$

  Here, $\mathbf{x}$ and $\mathbf{z}$ are two feature vectors; the latter could be a class prototype, if available. A small value of $D_E$ indicates greater similarity between the two vectors than a large value of $D_E$.

- Manhattan or city-block distance:

  $D_C = \sum_{i=1}^{n} | x_i - z_i |.$

  The Manhattan distance is the shortest path between $\mathbf{x}$ and $\mathbf{z}$, with each segment being parallel to a coordinate axis.

- Mahalanobis distance:

  $D_M = (\mathbf{x} - \mathbf{m})^T \mathbf{C}^{-1} (\mathbf{x} - \mathbf{m}),$

  where $\mathbf{x}$ is a feature vector being compared to a pattern class for which $\mathbf{m}$ is the class mean vector and $\mathbf{C}$ is the covariance matrix. A small value of $D_M$ indicates a higher potential membership of the vector $\mathbf{x}$ in the class than a large value of $D_M$. See Section for the description of an application of the Mahalanobis distance to the analysis of breast masses and tumors.

- Normalized dot product (cosine of the angle between the vectors $\mathbf{x}$ and $\mathbf{z}$):

  $D_d = \frac{\mathbf{x}^T \mathbf{z}}{\|\mathbf{x}\|\, \|\mathbf{z}\|}.$

  A large dot product value indicates a greater degree of similarity between the two vectors than a small value.
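The following sketch computes the four measures listed above for arbitrary illustrative vectors; the covariance matrix used for the Mahalanobis distance is an assumed value:

```python
import numpy as np

# Sketch of the four similarity/dissimilarity measures listed above,
# applied to arbitrary illustrative vectors.

def euclidean(x, z):
    return np.sqrt(np.sum((x - z) ** 2))

def manhattan(x, z):
    return np.sum(np.abs(x - z))

def mahalanobis(x, m, C):
    d = x - m
    return d @ np.linalg.inv(C) @ d

def normalized_dot(x, z):
    return x @ z / (np.linalg.norm(x) * np.linalg.norm(z))

x = np.array([0.3, 0.6])
z = np.array([0.5, 0.5])
m = np.array([0.4, 0.4])                     # class mean vector (assumed)
C = np.array([[0.04, 0.01], [0.01, 0.09]])   # class covariance (assumed)
print(euclidean(x, z), manhattan(x, z), mahalanobis(x, m, C), normalized_dot(x, z))
```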

The covariance matrix is defined as

$\mathbf{C} = E[(\mathbf{y} - \mathbf{m})(\mathbf{y} - \mathbf{m})^T],$

where the expectation operation is performed over all feature vectors $\mathbf{y}$ that belong to the class. The covariance matrix provides the covariance of all possible pairs of the features in the feature vector, over all samples belonging to the given class being considered. The elements along the main diagonal of the covariance matrix provide the variance of the individual features that make up the feature vector. The covariance matrix represents the scatter of the features that belong to the given class. The mean and covariance need to be updated as more samples are added to a given class in a clustering procedure.

When the Mahalanobis distance needs to be calculated between a sample vector and a number of classes represented by their mean and covariance matrices, a pooled covariance matrix may be used if the numbers of members in the various classes are unequal and low. If the covariance matrices of two classes are $\mathbf{C}_1$ and $\mathbf{C}_2$, and the numbers of members in the two classes are $N_1$ and $N_2$, the pooled covariance matrix is given by

$\mathbf{C} = \frac{N_1 \mathbf{C}_1 + N_2 \mathbf{C}_2}{N_1 + N_2}.$

Various performance indices may be designed to measure the success of a clustering procedure. A measure of the tightness of a cluster is the sum of the squared errors performance index:

$J = \sum_{j=1}^{N_c} \sum_{\mathbf{x} \in S_j} \|\mathbf{x} - \mathbf{m}_j\|^2,$

where $N_c$ is the number of cluster domains, $S_j$ is the set of samples in the $j$th cluster,

$\mathbf{m}_j = \frac{1}{N_j} \sum_{\mathbf{x} \in S_j} \mathbf{x}$

is the sample mean vector of $S_j$, and $N_j$ is the number of samples in $S_j$.

A few other examples of performance indices are:

- average of the squared distances between the samples in a cluster domain;
- intracluster variance;
- average of the squared distances between the samples in different cluster domains;
- intercluster distances;
- scatter matrices;
- covariance matrices.

A simple cluster-seeking algorithm: Suppose we have $N$ sample patterns $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$. (A sketch of this procedure in code follows the steps below.)

1. Let the first cluster center $\mathbf{z}_1$ be equal to any one of the samples, say $\mathbf{z}_1 = \mathbf{x}_1$. Choose a nonnegative threshold $\theta$.

2. Compute the distance $D_{21}$ between $\mathbf{x}_2$ and $\mathbf{z}_1$. If $D_{21} \leq \theta$, assign $\mathbf{x}_2$ to the domain (class) of cluster center $\mathbf{z}_1$; otherwise, start a new cluster with its center as $\mathbf{z}_2 = \mathbf{x}_2$. For the subsequent steps, let us assume that a new cluster with center $\mathbf{z}_2$ has been established.

3. Compute the distances $D_{31}$ and $D_{32}$ from the next sample $\mathbf{x}_3$ to $\mathbf{z}_1$ and $\mathbf{z}_2$, respectively. If $D_{31}$ and $D_{32}$ are both greater than $\theta$, start a new cluster with its center as $\mathbf{z}_3 = \mathbf{x}_3$; otherwise, assign $\mathbf{x}_3$ to the domain of the closer cluster.

4. Continue to apply Steps 2 and 3 by computing and checking the distance from every new (unclassified) pattern vector to every established cluster center, and applying the assignment or cluster-creation rule.

5. Stop when every given pattern vector has been assigned to a cluster.
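A sketch of this procedure, with an arbitrary threshold value and placeholder sample patterns, is given below:

```python
import numpy as np

# Sketch of the simple cluster-seeking (threshold) algorithm above.
# The threshold and the sample patterns are arbitrary choices; the
# results depend upon them, as noted in the text.

def simple_clustering(X, theta):
    centers = [X[0]]                 # Step 1: z_1 = x_1
    assignment = [0]
    for x in X[1:]:                  # Steps 2-4
        dist = [np.linalg.norm(x - z) for z in centers]
        nearest = int(np.argmin(dist))
        if dist[nearest] <= theta:   # assign to the closer cluster
            assignment.append(nearest)
        else:                        # start a new cluster
            centers.append(x)
            assignment.append(len(centers) - 1)
    return np.array(centers), np.array(assignment)

X = np.array([[0.1, 0.1], [0.15, 0.1], [0.9, 0.8], [0.85, 0.9], [0.5, 0.5]])
centers, labels = simple_clustering(X, theta=0.3)
print(len(centers), "clusters; labels:", labels)
```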

Observe that the procedure does not require a priori knowledge of the number of classes. Recognize also that the procedure does not assign a real-world class to each cluster: it merely groups the given vectors into disjoint clusters. A subsequent step is required to label each cluster with a class related to the actual problem. Multiple clusters may relate to the same real-world class, and may have to be merged.

A major disadvantage of the simple cluster-seeking algorithm is that the results depend upon:

- the first cluster center chosen for each domain or class;
- the order in which the sample patterns are considered;
- the value of the threshold $\theta$; and
- the geometrical properties (or distributions) of the data, that is, the feature-vector space.

The maximin-distance clustering algorithm: This method is similar to the previous simple algorithm, but first identifies the cluster regions that are the farthest apart. The term maximin refers to the combined use of maximum and minimum distances between the given vectors and the centers of the clusters already formed. (A code sketch follows the steps below.)

1. Let $\mathbf{x}_1$ be the first cluster center, $\mathbf{z}_1$.

2. Determine the farthest sample from $\mathbf{x}_1$, and label it as cluster center $\mathbf{z}_2$.

3. Compute the distance from each remaining sample to $\mathbf{z}_1$ and to $\mathbf{z}_2$. For every pair of these computations, save the minimum distance, and select the maximum of the minimum distances. If this "maximin" distance is an appreciable fraction of the distance between the cluster centers $\mathbf{z}_1$ and $\mathbf{z}_2$, label the corresponding sample as a new cluster center $\mathbf{z}_3$; otherwise, stop forming new clusters and go to Step 5.

4. If a new cluster center was formed in Step 3, repeat Step 3 using a "typical" or the average distance between the established cluster centers for comparison.

5. Assign each remaining sample to the domain of its nearest cluster center.
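A sketch of the maximin procedure is given below; the fraction used to decide whether the maximin distance is "appreciable" is an arbitrary illustrative choice:

```python
import numpy as np

# Sketch of the maximin-distance algorithm above. The fraction 0.5
# used to judge an "appreciable" maximin distance is an assumption.

def maximin_centers(X, fraction=0.5):
    centers = [X[0]]                                  # Step 1: z_1 = x_1
    d = np.linalg.norm(X - centers[0], axis=1)
    centers.append(X[np.argmax(d)])                   # Step 2: farthest sample -> z_2
    while True:                                       # Steps 3-4
        dmin = np.min([np.linalg.norm(X - z, axis=1) for z in centers], axis=0)
        candidate = np.argmax(dmin)                   # maximum of the minimum distances
        # average distance between established centers, for comparison:
        ref = np.mean([np.linalg.norm(a - b) for i, a in enumerate(centers)
                       for b in centers[i + 1:]])
        if dmin[candidate] > fraction * ref:
            centers.append(X[candidate])
        else:
            break
    return np.array(centers)                          # Step 5 would assign the rest

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.0, 1.0]])
print(maximin_centers(X))
```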

The K-means algorithm: The preceding "simple" and "maximin" algorithms are intuitive procedures. The K-means algorithm is based on iterative minimization of a performance index that is defined as the sum of the squared distances from all points in a cluster domain to the cluster center. (A code sketch follows the steps below.)

1. Choose $K$ initial cluster centers $\mathbf{z}_1(1), \mathbf{z}_2(1), \ldots, \mathbf{z}_K(1)$. Here, $K$ is the number of clusters to be formed. The choice of the cluster centers is arbitrary, and could be the first $K$ of the feature vectors available. The index in parentheses represents the iteration number.

2. At the $k$th iterative step, distribute the samples $\{\mathbf{x}\}$ among the $K$ cluster domains, using the relation

$\mathbf{x} \in S_j(k) \ \text{if} \ \|\mathbf{x} - \mathbf{z}_j(k)\| < \|\mathbf{x} - \mathbf{z}_i(k)\| \quad \forall \, i = 1, 2, \ldots, K, \ i \neq j,$

where $S_j(k)$ denotes the set of samples whose cluster center is $\mathbf{z}_j(k)$.

3. From the results of Step 2, compute the new cluster centers $\mathbf{z}_j(k+1)$, $j = 1, 2, \ldots, K$, such that the sum of the squared distances from all points in $S_j(k)$ to the new cluster center is minimized. In other words, the new cluster center $\mathbf{z}_j(k+1)$ is computed so that the performance index

$J_j = \sum_{\mathbf{x} \in S_j(k)} \|\mathbf{x} - \mathbf{z}_j(k+1)\|^2, \quad j = 1, 2, \ldots, K,$

is minimized. The $\mathbf{z}_j(k+1)$ that minimizes this performance index is simply the sample mean of $S_j(k)$. Therefore, the new cluster center is given by

$\mathbf{z}_j(k+1) = \frac{1}{N_j(k)} \sum_{\mathbf{x} \in S_j(k)} \mathbf{x}, \quad j = 1, 2, \ldots, K,$

where $N_j(k)$ is the number of samples in $S_j(k)$. The name "K-means" is derived from the manner in which cluster centers are sequentially updated.

4. If $\mathbf{z}_j(k+1) = \mathbf{z}_j(k)$ for $j = 1, 2, \ldots, K$, the algorithm has converged: terminate the procedure. Otherwise, go to Step 2.
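A sketch of the procedure, taking the first $K$ samples as the initial cluster centers and using placeholder data, is given below:

```python
import numpy as np

# Sketch of the K-means algorithm as described in Steps 1-4 above.
# The initialization (first K samples) is one arbitrary choice, and
# the sketch assumes that no cluster becomes empty.

def k_means(X, K, max_iter=100):
    z = X[:K].copy()                                   # Step 1: initial centers
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest cluster center
        dist = np.linalg.norm(X[:, None, :] - z[None, :, :], axis=2)
        labels = np.argmin(dist, axis=1)
        # Step 3: new centers are the sample means of the clusters
        z_new = np.array([X[labels == j].mean(axis=0) for j in range(K)])
        # Step 4: stop when the centers no longer change
        if np.allclose(z_new, z):
            break
        z = z_new
    return z, labels

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.15],
              [0.8, 0.9], [0.9, 0.8], [0.85, 0.85]])
centers, labels = k_means(X, K=2)
print(centers, labels)
```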

The behavior of the K-means algorithm is influenced by:

- the number of cluster centers specified, $K$;
- the choice of the initial cluster centers;
- the order in which the sample patterns are considered; and
- the geometrical properties (or distributions) of the data, that is, the feature-vector space.

Example: Figures to show four cluster plots of the shape factors $f_{cc}$ and $SI$ of the breast mass contours shown in Figure (see Section for details). Although the categories of the samples would be unknown in a practical situation, the samples are identified in the plots with the + symbol for malignant tumors, and with a separate symbol for the benign masses. The categorization represents the ground truth, or true classification, of the samples based upon biopsy.

The plots in Figures to show the progression of the K-means algorithm from its initial state to the converged state, with $K = 2$ in this example, representing the benign and malignant categories. The only prior knowledge or assumption used is that the samples are to be split into two clusters; that is, there are two classes. Figure shows two samples selected to represent the cluster centers, marked with the diamond and asterisk symbols. The straight line indicates the decision boundary, which is the perpendicular bisector of the straight line joining the two cluster centers. The K-means algorithm converged, in this case, at the fifth iteration; that is, there was no change in the cluster centers after the fifth iteration. The final decision boundary results in the misclassification of four of the malignant samples as being benign. It is interesting to note that, even though the two initial cluster centers belong to the benign category, the algorithm has converged to a useful solution. See Section for examples of application of other pattern classification techniques to the same dataset.

Probabilistic Models and Statistical Decision

Pattern classification methods such as discriminant functions are dependent upon the set of training samples provided. Their success, when applied to new cases, will depend upon the accuracy of the representation of the various pattern classes by the training samples. How can we design pattern classification techniques that are independent of specific training samples and are optimal in a broad sense?

Probability functions and probabilistic models may be developed to represent the occurrence and statistical attributes of classes of patterns. Such functions may be based upon large collections of data, historical records, or mathematical models of pattern generation. In the absence of information as above, a training step with samples of known categorization will be required to estimate the required model parameters. It is common practice to assume a Gaussian PDF to represent the distribution of the features for each class, and to estimate the required mean and variance parameters from the training sets. When PDFs are available to characterize pattern classes and their features, optimal decision functions may be designed based upon statistical functions and decision theory. The following subsections describe a few methods in this category.

Likelihood functions and statistical decision

Let $P(C_i)$ be the probability of occurrence of class $C_i$, $i = 1, 2, \ldots, M$; this is known as the a priori, prior, or unconditional probability. The a posteriori or posterior probability that an observed sample pattern $\mathbf{x}$ came from $C_i$ is expressed as $P(C_i|\mathbf{x})$. If a classifier decides that $\mathbf{x}$ comes from $C_j$ when it actually came from $C_i$, the classifier is said to incur a loss $L_{ij}$, with $L_{ii} = 0$ or a fixed operational cost, and $L_{ij} > L_{ii} \; \forall \, j \neq i$. Because $\mathbf{x}$ may belong to any one of the $M$ classes under consideration, the expected loss, known as the conditional average risk or loss, in assigning $\mathbf{x}$ to $C_j$ is

$R_j(\mathbf{x}) = \sum_{i=1}^{M} L_{ij}\, P(C_i|\mathbf{x}).$

FIGURE
Initial state of the K-means algorithm. The symbols in the cluster plot represent the 2D feature vectors $(f_{cc}, SI)$ of benign masses and malignant tumors (+). See Figure for the contours of the masses. The cluster centers (class means) are indicated by the solid diamond and the * symbols. The straight line indicates the decision boundary between the two classes. Figure courtesy of F.J. Ayres.

FIGURE
Second iteration of the K-means algorithm. Details as in Figure . Figure courtesy of F.J. Ayres.

FIGURE
Third iteration of the K-means algorithm. Details as in Figure . Figure courtesy of F.J. Ayres.

FIGURE
Fourth iteration of the K-means algorithm. Details as in Figure . Figure courtesy of F.J. Ayres.

FIGURE
Final state of the K-means algorithm, after the fifth iteration. Details as in Figure . Figure courtesy of F.J. Ayres.

A classifier could compute $R_j(\mathbf{x})$, $j = 1, 2, \ldots, M$, for each sample $\mathbf{x}$, and then assign $\mathbf{x}$ to the class with the smallest conditional loss. Such a classifier will minimize the total expected loss over all decisions, and is called the Bayes classifier. From a statistical point of view, the Bayes classifier represents the optimal classifier.

According to Bayes rule, we have

$P(C_i|\mathbf{x}) = \frac{P(C_i)\, p(\mathbf{x}|C_i)}{p(\mathbf{x})},$

where $p(\mathbf{x}|C_i)$ is called the likelihood function of class $C_i$, or the state-conditional PDF of $\mathbf{x}$, and $p(\mathbf{x})$ is the PDF of $\mathbf{x}$ regardless of class membership (unconditional).

[Note: $P(y)$ is used to represent the probability of occurrence of an event $y$; $p(y)$ is used to represent the PDF of a random variable $y$. Probabilities and PDFs involving a multidimensional feature vector are multivariate functions with dimension equal to that of the feature vector.] Bayes rule shows how observing the sample $\mathbf{x}$ changes the a priori probability $P(C_i)$ to the a posteriori probability $P(C_i|\mathbf{x})$. In other words, Bayes rule provides a mechanism to update the a priori probability $P(C_i)$ to the a posteriori probability $P(C_i|\mathbf{x})$, due to the observation of the sample $\mathbf{x}$. Then, we can express the expected loss as

the expected loss as

R

j

x

px

M

X

i

L

ij

pxjC

i

P C

i

Because $\frac{1}{p(\mathbf{x})}$ is common for all $j$, we could modify $R_j(\mathbf{x})$ to

$r_j(\mathbf{x}) = \sum_{i=1}^{M} L_{ij}\, p(\mathbf{x}|C_i)\, P(C_i).$

In a two-class case, with $M = 2$, we obtain the following expressions:

$r_1(\mathbf{x}) = L_{11}\, p(\mathbf{x}|C_1)\, P(C_1) + L_{21}\, p(\mathbf{x}|C_2)\, P(C_2),$

$r_2(\mathbf{x}) = L_{12}\, p(\mathbf{x}|C_1)\, P(C_1) + L_{22}\, p(\mathbf{x}|C_2)\, P(C_2).$

$\mathbf{x} \in C_1$ if $r_1(\mathbf{x}) < r_2(\mathbf{x})$; that is,

$\mathbf{x} \in C_1$ if $L_{11}\, p(\mathbf{x}|C_1)\, P(C_1) + L_{21}\, p(\mathbf{x}|C_2)\, P(C_2) < L_{12}\, p(\mathbf{x}|C_1)\, P(C_1) + L_{22}\, p(\mathbf{x}|C_2)\, P(C_2),$

or, equivalently,

$\mathbf{x} \in C_1$ if $(L_{21} - L_{22})\, p(\mathbf{x}|C_2)\, P(C_2) < (L_{12} - L_{11})\, p(\mathbf{x}|C_1)\, P(C_1).$

This expression may be rewritten as

$\mathbf{x} \in C_1$ if $\dfrac{p(\mathbf{x}|C_1)}{p(\mathbf{x}|C_2)} > \dfrac{P(C_2)}{P(C_1)}\, \dfrac{L_{21} - L_{22}}{L_{12} - L_{11}}.$

The left-hand side of the inequality above, which is a ratio of two likelihood functions, is often referred to as the likelihood ratio:

$l_{12}(\mathbf{x}) = \frac{p(\mathbf{x}|C_1)}{p(\mathbf{x}|C_2)}.$

Then, Bayes decision rule for $M = 2$ is as follows (a code sketch is given after the rule):

1. Assign $\mathbf{x}$ to class $C_1$ if $l_{12}(\mathbf{x}) > \theta_{12}$, where $\theta_{12}$ is a threshold given by $\theta_{12} = \dfrac{P(C_2)}{P(C_1)}\, \dfrac{L_{21} - L_{22}}{L_{12} - L_{11}}$.

2. Assign $\mathbf{x}$ to class $C_2$ if $l_{12}(\mathbf{x}) < \theta_{12}$.

3. Make an arbitrary or heuristic decision if $l_{12}(\mathbf{x}) = \theta_{12}$.
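The following sketch applies the two-class rule with univariate Gaussian models for the likelihood functions; the Gaussian parameters, priors, and losses are assumed illustrative values:

```python
import numpy as np

# Sketch of the two-class Bayes (likelihood-ratio) rule above. The
# likelihood functions are modeled as univariate Gaussians with
# assumed parameters; the priors and losses are illustrative values.

def gaussian_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (np.sqrt(2.0 * np.pi) * s)

P1, P2 = 0.5, 0.5          # prior probabilities P(C1), P(C2)
L11, L22 = 0.0, 0.0        # no loss for correct decisions
L12, L21 = 1.0, 1.0        # unit loss for erroneous decisions

theta_12 = (P2 / P1) * (L21 - L22) / (L12 - L11)   # decision threshold

def classify(x):
    l_12 = gaussian_pdf(x, 0.2, 0.1) / gaussian_pdf(x, 0.7, 0.15)
    if l_12 > theta_12:
        return "C1"
    if l_12 < theta_12:
        return "C2"
    return "arbitrary"      # l_12(x) equals the threshold exactly

print(classify(0.25), classify(0.65))
```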

The rule may be generalized to the $M$-class case as:

$\mathbf{x} \in C_i$ if $\displaystyle\sum_{k=1}^{M} L_{ki}\, p(\mathbf{x}|C_k)\, P(C_k) < \sum_{q=1}^{M} L_{qj}\, p(\mathbf{x}|C_q)\, P(C_q), \quad j = 1, 2, \ldots, M, \ j \neq i.$

In most pattern classification problems, the loss is nil for correct decisions. The loss could be assumed to be equal to a certain quantity for all erroneous decisions. Then, $L_{ij} = 1 - \delta_{ij}$, where

$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise,} \end{cases}$

and

$r_j(\mathbf{x}) = \sum_{i=1}^{M} (1 - \delta_{ij})\, p(\mathbf{x}|C_i)\, P(C_i) = p(\mathbf{x}) - p(\mathbf{x}|C_j)\, P(C_j),$

because

$\sum_{i=1}^{M} p(\mathbf{x}|C_i)\, P(C_i) = p(\mathbf{x}).$

The Bayes classifier will assign a pattern $\mathbf{x}$ to class $C_i$ if

$p(\mathbf{x}) - p(\mathbf{x}|C_i)\, P(C_i) < p(\mathbf{x}) - p(\mathbf{x}|C_j)\, P(C_j), \quad j = 1, 2, \ldots, M, \ j \neq i;$

that is,

$\mathbf{x} \in C_i$ if $p(\mathbf{x}|C_i)\, P(C_i) > p(\mathbf{x}|C_j)\, P(C_j), \quad j = 1, 2, \ldots, M, \ j \neq i.$

This is nothing more than using the decision functions

$d_i(\mathbf{x}) = p(\mathbf{x}|C_i)\, P(C_i), \quad i = 1, 2, \ldots, M,$

where a pattern $\mathbf{x}$ is assigned to class $C_i$ if $d_i(\mathbf{x}) > d_j(\mathbf{x}) \; \forall \, j \neq i$ for that pattern.

Using Bayes rule, we get

$d_i(\mathbf{x}) = P(C_i|\mathbf{x})\, p(\mathbf{x}), \quad i = 1, 2, \ldots, M.$

Because $p(\mathbf{x})$ does not depend upon the class index $i$, this can be reduced to

$d_i(\mathbf{x}) = P(C_i|\mathbf{x}), \quad i = 1, 2, \ldots, M.$

The different decision functions given above provide alternative, yet equivalent, approaches, depending upon whether $p(\mathbf{x}|C_i)$ or $P(C_i|\mathbf{x})$ is used or available. The estimation of $p(\mathbf{x}|C_i)$ would require a training set for each class $C_i$. It is common to assume a Gaussian distribution and estimate its mean and variance using the training set.
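The following sketch estimates univariate Gaussian likelihood functions from placeholder training samples and combines them with the priors via Bayes rule; the feature values are not the $f_{cc}$ data of the examples below:

```python
import numpy as np

# Sketch: estimate p(x|C_i) with univariate Gaussian models from
# training samples, then combine with priors via Bayes rule.
# The training values are illustrative placeholders.

def gaussian_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (np.sqrt(2.0 * np.pi) * s)

train_benign = np.array([0.05, 0.10, 0.12, 0.08, 0.15])      # class C1
train_malignant = np.array([0.55, 0.60, 0.70, 0.65, 0.80])   # class C2

m1, s1 = train_benign.mean(), train_benign.std(ddof=1)
m2, s2 = train_malignant.mean(), train_malignant.std(ddof=1)
P1, P2 = 0.5, 0.5                                            # equal priors

x = 0.4
p1 = gaussian_pdf(x, m1, s1) * P1
p2 = gaussian_pdf(x, m2, s2) * P2
posterior_benign = p1 / (p1 + p2)        # P(C1|x) via Bayes rule
print("P(benign|x) =", posterior_benign)
```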

Bayes classifier for normal patterns

The univariate normal or Gaussian PDF for a single random variable $x$ is given by

$p(x) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - m}{\sigma} \right)^2 \right],$

which is completely specified by two parameters: the mean

$m = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx,$

and the variance

$\sigma^2 = E[(x - m)^2] = \int_{-\infty}^{\infty} (x - m)^2\, p(x)\, dx.$

In the case of $M$ pattern classes and pattern vectors $\mathbf{x}$ of dimension $n$ governed by multivariate normal PDFs, we have

$p(\mathbf{x}|C_i) = \frac{1}{(2\pi)^{n/2}\, |\mathbf{C}_i|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \mathbf{m}_i)^T \mathbf{C}_i^{-1} (\mathbf{x} - \mathbf{m}_i) \right],$

$i = 1, 2, \ldots, M$, where each PDF is completely specified by its mean vector $\mathbf{m}_i$ and its $n \times n$ covariance matrix $\mathbf{C}_i$, with

$\mathbf{m}_i = E_i[\mathbf{x}],$

and

$\mathbf{C}_i = E_i[(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T].$

Here, $E_i$ denotes the expectation operator over the patterns belonging to class $C_i$.

Normal distributions occur frequently in nature, and have the advantage of analytical tractability. A multivariate normal PDF reduces to a product of univariate normal PDFs when the elements of $\mathbf{x}$ are mutually independent, in which case the covariance matrix is a diagonal matrix.

We had earlier formulated the decision functions

$d_i(\mathbf{x}) = p(\mathbf{x}|C_i)\, P(C_i), \quad i = 1, 2, \ldots, M;$

see Equation . Given the exponential in the normal PDF, it is convenient to use

$d_i(\mathbf{x}) = \ln \left[ p(\mathbf{x}|C_i)\, P(C_i) \right] = \ln p(\mathbf{x}|C_i) + \ln P(C_i),$

which is equivalent in terms of classification performance, because the natural logarithm $\ln$ is a monotonically increasing function. Then,

$d_i(\mathbf{x}) = \ln P(C_i) - \frac{n}{2} \ln 2\pi - \frac{1}{2} \ln |\mathbf{C}_i| - \frac{1}{2} (\mathbf{x} - \mathbf{m}_i)^T \mathbf{C}_i^{-1} (\mathbf{x} - \mathbf{m}_i),$

$i = 1, 2, \ldots, M$. The second term does not depend upon $i$; therefore, we can simplify $d_i(\mathbf{x})$ to

$d_i(\mathbf{x}) = \ln P(C_i) - \frac{1}{2} \ln |\mathbf{C}_i| - \frac{1}{2} (\mathbf{x} - \mathbf{m}_i)^T \mathbf{C}_i^{-1} (\mathbf{x} - \mathbf{m}_i), \quad i = 1, 2, \ldots, M.$

The decision functions above are hyperquadrics; hence, the best that a Bayes classifier for normal patterns can do is to place a general second-order decision surface between each pair of pattern classes. In the case of true normal distributions of patterns, the decision functions as above will be optimal on an average basis: they minimize the expected loss with the simplified loss function $L_{ij} = 1 - \delta_{ij}$.

If all the covariance matrices are equal, that is, $\mathbf{C}_i = \mathbf{C}$, $i = 1, 2, \ldots, M$, we get

$d_i(\mathbf{x}) = \ln P(C_i) + \mathbf{x}^T \mathbf{C}^{-1} \mathbf{m}_i - \frac{1}{2}\, \mathbf{m}_i^T \mathbf{C}^{-1} \mathbf{m}_i, \quad i = 1, 2, \ldots, M,$

after omitting terms independent of $i$. The Bayesian classifier is now represented by a set of linear decision functions.

Before one may apply the decision functions as above, it would be appropriate to verify the Gaussian nature of the PDFs of the variables on hand by conducting statistical tests. Furthermore, it would be necessary to derive or estimate the mean vector and covariance matrix for each class; sample statistics computed from a training set may serve this purpose.
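A sketch of the resulting classifier, with the mean vectors and covariance matrices estimated from randomly generated placeholder training sets, is given below:

```python
import numpy as np

# Sketch of the Bayes decision function for normal patterns,
# d_i(x) = ln P(C_i) - (1/2) ln|C_i| - (1/2)(x-m_i)^T C_i^{-1} (x-m_i),
# with parameters estimated from placeholder training samples.

def discriminant(x, m, C, prior):
    d = x - m
    return (np.log(prior) - 0.5 * np.log(np.linalg.det(C))
            - 0.5 * d @ np.linalg.inv(C) @ d)

rng = np.random.default_rng(0)
class1 = rng.normal([0.2, 0.2], 0.05, size=(20, 2))   # training set, C1
class2 = rng.normal([0.7, 0.8], 0.10, size=(20, 2))   # training set, C2

params = []
for data, prior in [(class1, 0.5), (class2, 0.5)]:    # equal priors assumed
    params.append((data.mean(axis=0), np.cov(data, rowvar=False), prior))

x = np.array([0.3, 0.3])
scores = [discriminant(x, m, C, P) for (m, C, P) in params]
print("x assigned to class", int(np.argmax(scores)) + 1)
```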

Example: Figure shows plots of Gaussian PDF models applied to the shape factor $f_{cc}$ (fractional concavity) of the breast mass contours shown in Figure (see Chapter and Section for details). The two Gaussians represent the state-conditional PDFs of $f_{cc}$ for the benign and malignant categories. Also shown are the posterior probabilities that the class of a sample is benign or malignant, given an observed value of $f_{cc}$. The posterior probability functions were derived using Bayes rule, as in Equation , with the values of the prior probabilities of the two classes being equal to 0.5. It is seen that the posterior probabilities are both equal to 0.5 at a particular value of $f_{cc}$; the probability of a malignant classification is higher than that of a benign classification for values of $f_{cc}$ greater than this value. Due to the use of equal prior probabilities, the transition point is the same as the point where the two Gaussian models for the state-conditional PDFs cross each other.

Figure shows the same Gaussian models for the state-conditional PDFs as in Figure . However, the posterior probability functions were derived using a larger prior probability for the benign category; the prior probability for the malignant category is then the complement of this value. In this case, the probability of a malignant classification is higher than that of a benign classification only at larger values of $f_{cc}$. The prior assumption that a majority of all masses encountered will be benign has pushed the decision threshold on $f_{cc}$ to a higher value.

Figure illustrates the 2D cluster plot and Gaussian PDF models for the shape factors $f_{cc}$ and $SI$ for the same dataset as described above (see Chapter and Section for details). The decision boundary, indicated by the solid line, is the optimal boundary under the assumption of 2D Gaussian PDFs for the two features and the two classes. Two malignant samples are misclassified by the decision boundary shown. See Section for examples of application of other pattern classification techniques to the same dataset.

An interesting point to note from the examples above is that the Gaussian PDF models used are not capable of accommodating the prior knowledge that the shape factors are limited to the range $[0, 1]$. Other PDF models, such as the Rayleigh distribution (see Section ), should be used if this aspect is important.

Logistic Regression

Logistic classification is a statistical technique based on a logistic regression model that estimates the probability of occurrence of an event. The technique is designed for problems where patterns are to be classified into one of two classes. When the response variable is binary, theoretical and empirical considerations indicate that the response function is often curvilinear.

FIGURE
Plots of Gaussian state-conditional PDF models for the shape factor $f_{cc}$ for benign (dashed line) and malignant (dotted line) breast masses. See Figure for the contours of the masses. The $f_{cc}$ values of the samples are indicated on the horizontal axis, with distinct markers for benign masses and malignant tumors. The posterior probability functions for the benign (solid line) and malignant (dash-dot line) classes are also shown. Equal prior probabilities of 0.5 were used for the two classes. See also Figure . Figure courtesy of F.J. Ayres.

FIGURE
Same as in Figure , but with a larger prior probability for the benign class. Figure courtesy of F.J. Ayres.

FIGURE
Plots of 2D Gaussian state-conditional PDF models for the shape factors $f_{cc}$ and $SI$ for benign masses (dash-dot line) and malignant tumors (dashed line). See Figure for the contours of the masses. Feature values for benign masses and malignant tumors are indicated with distinct markers. The benign class prototype (mean) is indicated by the solid diamond; that for the malignant class is indicated by the * symbol. The dashed and dash-dot contours indicate two constant-Mahalanobis-distance contours (level sets) each, for the two Gaussian PDF models (see Equation ), for the malignant and benign classes, respectively. The solid contour indicates the decision boundary, as given by Equation to Equation , with the decision function being equal for the two classes. The prior probabilities for the two classes were assumed to be equal to 0.5. Figure courtesy of F.J. Ayres.

The typical response function is shaped as a forward or backward tilted "S", and is known as a sigmoidal function. The function has asymptotes at 0 and 1.

In logistic pattern classification, an event is defined as the membership of a pattern vector in one of the two classes of concern. The method computes a variable that depends upon the given parameters and is constrained to the range $[0, 1]$, so that it may be interpreted as a probability. The probability of the pattern vector belonging to the second class is simply the difference between unity and the estimated value.

For the case of a single feature or parameter, the logistic regression model is given as

$P(\text{event}) = \frac{\exp(b_0 + b_1 x)}{1 + \exp(b_0 + b_1 x)},$

or, equivalently,

$P(\text{event}) = \frac{1}{1 + \exp[-(b_0 + b_1 x)]},$

where $b_0$ and $b_1$ are coefficients estimated from the data, and $x$ is the independent (feature) variable. The relationship between the independent variable and the estimated probability is nonlinear, and follows an S-shaped curve that closely resembles the integral of a Gaussian function. In the case of an $n$-dimensional feature vector $\mathbf{x}$, the model can be written as

$P(\text{event}) = \frac{1}{1 + \exp(-z)},$

where $z$ is the linear combination

$z = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n = \mathbf{b}^T \mathbf{x};$

that is, $z$ is the dot product of the augmented feature vector $\mathbf{x}$ with a coefficient vector $\mathbf{b}$.
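As a closing illustration, the following sketch fits the logistic model to placeholder data by gradient ascent on the log-likelihood; the data, learning rate, and iteration count are arbitrary choices, not a prescribed estimation procedure:

```python
import numpy as np

# Sketch of the logistic model P(event) = 1 / (1 + exp(-z)), z = b^T x,
# with b estimated by gradient ascent on the log-likelihood.
# The data, learning rate, and iteration count are illustrative.

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.10, 0.30, 0.40, 0.50, 0.60, 0.90])   # feature values
y = np.array([0, 0, 1, 0, 1, 1])                     # class labels (the "event")
Xa = np.column_stack([np.ones_like(x), x])           # augment with 1 for b_0

b = np.zeros(2)
for _ in range(5000):
    p = logistic(Xa @ b)          # current estimates of P(event)
    b += 0.1 * Xa.T @ (y - p)     # gradient of the log-likelihood

print("coefficients:", b)
print("P(event) at x = 0.45:", logistic(np.array([1.0, 0.45]) @ b))
```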