ABSTRACT
Pattern Classification and Diagnostic Decision
The final purpose of biomedical image analysis is to classify a given image, or the features that have been detected in the image, into one of a few known categories. In medical applications, a further goal is to arrive at a diagnostic decision regarding the condition of the patient. A physician or medical specialist may achieve this goal via visual analysis of the image and data presented; comparative analysis of the given image with others of known diagnoses, or the application of established protocols and sets of rules, may assist in such a decision-making process. Images taken earlier of the same patient may also be used, when available, for comparative or differential analysis. Some measurements may also be made from the given image to assist in the analysis. The basic knowledge, clinical experience, expertise, and intuition of the physician play significant roles in this process.
When image analysis is performed via the application of computer algorithms, the typical result is the extraction of a number of numerical features. When the numerical features relate directly to measurements of organs or features represented by the image, such as an estimate of the size of the heart or the volume of a tumor, the clinical specialist may be able to use the features directly in his or her diagnostic logic. However, when parameters such as measures of texture and shape complexity are derived, a human analyst is not likely to be able to analyze or comprehend the features. Furthermore, as the number of the computed features increases, the associated diagnostic logic may become too complicated and unwieldy for human analysis. Computer methods would then be desirable to perform the classification and decision process.
At the outset, it should be borne in mind that a biomedical image forms but one piece of information in arriving at a diagnosis: the classification of a given image into one of many categories may assist in the diagnostic procedure, but will almost never be the only factor. Regardless, pattern classification based upon image analysis is indeed an important aspect of biomedical image analysis, and forms the theme of the present chapter. Remaining within the realm of CAD, as introduced in Figure and Section , it would be preferable to design methods so as to aid a medical specialist in arriving at a diagnosis, rather than to provide a decision.
A generic problem statement for pattern classification may be expressed as follows: A number of measures and features have been derived from a biomedical image. Develop methods to classify the image into one of a few specified categories. Investigate the relevance of the features and the classification methods in arriving at a diagnostic decision about the patient.
Observe that the features mentioned above may have been derived manually or by computer methods. Recognize the distinction between classifying the given image and arriving at a diagnosis regarding the patient: the connection between the two tasks or steps may not always be direct. In other words, a pattern classification method may facilitate the labeling of a given image as being a member of a particular class; arriving at a diagnosis of the condition of the patient will most likely require the analysis of several other items of clinical information. Although it is common to work with a prespecified number of pattern classes, many problems do exist where the number of classes is not known a priori. A special case is screening, where the aim is to simply decide on the presence or absence of a certain type of abnormality or disease. The initial decision in screening may be further focused on whether the subject appears to be free of the specific abnormality of concern, or requires further investigation.
The problem statement and description above are rather generic. Several considerations arise in the practical application of the concepts mentioned above to medical images and diagnosis. Using the detection of breast cancer as an example, the following questions illustrate some of the problems encountered in practice:
- Is a mass or tumor present? (Yes/No)
- If a mass or tumor is present:
  - Give or mark its location.
  - Compare the density of the mass to that of the surrounding tissues: hypodense, isodense, or hyperdense.
  - Describe the shape of its boundary: round, ovoid, irregular, macrolobulated, microlobulated, or spiculated.
  - Describe its texture: homogeneous, heterogeneous, or fatty.
  - Describe its edge: sharp, well-circumscribed, ill-defined, or fuzzy.
  - Decide if it is a benign mass (a cyst, solid or fluid-filled) or a malignant tumor.
- Are calcifications present? (Yes/No)
- If calcifications are present:
  - Estimate their number per cm².
  - Describe their shape: round, ovoid, elongated, branching, rough, punctate, irregular, or amorphous.
  - Describe their spatial distribution or cluster.
  - Describe their density: homogeneous or heterogeneous.
- Are there signs of architectural distortion? (Yes/No)
- Are there signs of bilateral asymmetry? (Yes/No)
- Are there major changes compared to the previous mammogram of the patient?
- Is the case normal? (Yes/No)
- If the case is abnormal:
  - Is the disease benign, or malignant (cancer)?
The items listed above give a selection of the many features of mammograms that a radiologist would investigate; see Ackerman et al. and the BI-RADS™ manual for more details. Figure shows a graphical user interface developed by Alto et al. for the categorization of breast masses, related to some of the questions listed above. Figure illustrates four segments of mammograms, demonstrating masses and tumors of different characteristics, progressing from a well-circumscribed and homogeneous benign mass to a highly spiculated and heterogeneous tumor.
The subject matter of this book (image analysis and pattern classification) can provide assistance in responding to only some of the questions listed above. Even an entire set of mammograms may not lead to a final decision; other modes of diagnostic imaging and means of investigation may be necessary to arrive at a definite diagnosis.
In the following sections, a number of methods for pattern classification, decision making, and evaluation of the results of classification are reviewed and illustrated.
Note: Parts of this chapter are reproduced, with permission, from R.M. Rangayyan, Biomedical Signal Analysis: A Case-Study Approach, IEEE Press and Wiley, New York, NY. © IEEE.
Pattern Classification
Pattern recognition or classification may be defined as the categorization of the input data into identifiable classes, via the extraction of significant features or attributes of the data from a background of irrelevant detail. In biomedical image analysis, after quantitative features have been extracted from the given images, each image or ROI may be represented by a feature vector x = [x_1, x_2, ..., x_n]^T, which is also known
FIGURE
Graphical user interface for the categorization of breast masses. Reproduced with permission from H. Alto, R.M. Rangayyan, R.B. Paranjape, J.E.L. Desautels, and H. Bryant, "An indexed atlas of digital mammograms for computer-aided diagnosis of breast cancer", Annales des Télécommunications. © GET Lavoisier. Figure courtesy of C. LeGuillou, École Nationale Supérieure des Télécommunications de Bretagne, Brest, France.
FIGURE
Examples of breast mass regions and contours, with the corresponding values of fractional concavity f_cc, spiculation index SI, compactness cf, acutance A, and sum entropy F. (a) Circumscribed benign mass. (b) Macrolobulated benign mass. (c) Microlobulated malignant tumor. (d) Spiculated malignant tumor. Note that the masses and their contours are of widely differing size, but have been scaled to the same size in the illustration. The first letter of the case identifier indicates a malignant diagnosis with 'm' and a benign diagnosis with 'b', based upon biopsy. The symbols after the first numerical portion of the identifier represent: l = left, r = right, c = craniocaudal view, o = mediolateral oblique view, x = axillary view. The last two digits represent the year of acquisition of the mammogram. An additional character of the identifier after the year (a to f), if present, indicates the existence of multiple masses visible in the same mammogram. Reproduced with permission from H. Alto, R.M. Rangayyan, and J.E.L. Desautels, "Content-based retrieval and analysis of mammographic masses", Journal of Electronic Imaging, in press. © SPIE and IS&T.
as the measurement vector or a pattern vector. When the values x_i are real numbers, x is a point in an n-dimensional Euclidean space; vectors of similar objects may be expected to form clusters, as illustrated in Figure .
FIGURE
Two-dimensional feature vectors of two classes, C_1 and C_2. The prototypes of the two classes are indicated by the vectors z_1 and z_2. The linear decision function d(x) shown (solid line) is the perpendicular bisector of the straight line joining the two prototypes (dashed line). Reproduced with permission from R.M. Rangayyan, Biomedical Signal Analysis: A Case-Study Approach, IEEE Press and Wiley, New York, NY. © IEEE.
For efficient pattern classification, measurements that could lead to disjoint sets or clusters of feature vectors are desired. This point underlines the importance of the appropriate design of the preprocessing and feature extraction procedures. Features or characterizing attributes that are common to all patterns belonging to a particular class are known as intraset or intraclass features. Discriminant features that represent the differences between pattern classes are called interset or interclass features.
The pattern classification problem is that of generating optimal decision boundaries, or decision procedures, to separate the data into pattern classes based on the feature vectors provided. Figure illustrates a simple linear decision function or boundary to separate 2D feature vectors into two classes.
Supervised Pattern Classification
The problem considered in supervised pattern classification may be stated as follows: You are provided with a number of feature vectors with classes assigned to them. Propose techniques to characterize and parameterize the boundaries that separate the classes.
A given set of feature vectors of known categorization is often referred to as a training set. The availability of a training set facilitates the development of mathematical functions that can characterize the separation between the classes. The functions may then be applied to new feature vectors of unknown classes, to classify or recognize them. This approach is known as supervised pattern classification. A set of feature vectors of known categorization that is used to evaluate a classifier designed in this manner is referred to as a test set. After adequate testing and confirmation of the method with satisfactory results, the classifier may be applied to new feature vectors of unknown classes; the results may then be used to arrive at diagnostic decisions. The following subsections describe a few methods that can assist in the development of discriminant and decision functions.
Discriminant and decision functions
A general linear discriminant or decision function is of the form

d(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + w_{n+1} = w^T x,

where x = [x_1, x_2, ..., x_n, 1]^T is the feature vector augmented by an additional entry equal to unity, and w = [w_1, w_2, ..., w_n, w_{n+1}]^T is a correspondingly augmented weight vector. A two-class pattern classification problem may be stated as

d(x) = w^T x,  with d(x) > 0 if x ∈ C_1, and d(x) < 0 if x ∈ C_2,

where C_1 and C_2 represent the two classes. The discriminant function may be interpreted as the boundary separating the classes C_1 and C_2, as illustrated in Figure .
In the general case of an M-class pattern classification problem, we will need M weight vectors and M decision functions to perform the following decisions:

d_i(x) = w_i^T x,  with d_i(x) > 0 if x ∈ C_i, and d_i(x) < 0 otherwise,  i = 1, 2, ..., M,

where w_i = [w_{i1}, w_{i2}, ..., w_{in}, w_{i,n+1}]^T is the weight vector for the class C_i.
Three cases arise in solving this problem:

Case 1: Each class is separable from the rest by a single decision surface:
if d_i(x) > 0, then x ∈ C_i.

Case 2: Each class is separable from every other individual class by a distinct decision surface; that is, the classes are pairwise separable. There are M(M-1)/2 decision surfaces, given by d_ij(x) = w_ij^T x:
if d_ij(x) > 0 for all j ≠ i, then x ∈ C_i.
[Note: d_ij(x) = -d_ji(x).]

Case 3: There exist M decision functions d_k(x) = w_k^T x, k = 1, 2, ..., M, with the property that
if d_i(x) > d_j(x) for all j ≠ i, then x ∈ C_i.
This is a special instance of Case 2. We may define

d_ij(x) = d_i(x) - d_j(x) = (w_i - w_j)^T x = w_ij^T x.

If the classes are separable under Case 3, they are separable under Case 2; the converse is, in general, not true.
Patterns that may be separated by linear decision functions, as above, are said to be linearly separable. In other situations, an infinite variety of complex decision boundaries may be formulated by using generalized decision functions based upon nonlinear functions of the feature vectors, as

d(x) = w_1 f_1(x) + w_2 f_2(x) + ... + w_K f_K(x) + w_{K+1} = Σ_{i=1}^{K+1} w_i f_i(x).

Here, {f_i(x)}, i = 1, 2, ..., K, are real, single-valued functions of x, and f_{K+1}(x) = 1. Whereas the functions f_i(x) may be nonlinear in the n-dimensional space of x, the decision function may be formulated as a linear function by defining a transformed feature vector x' = [f_1(x), f_2(x), ..., f_K(x), 1]^T. Then, d(x) = w^T x', with w = [w_1, w_2, ..., w_K, w_{K+1}]^T. Once evaluated, {f_i(x)} is just a set of numerical values, and x' is simply a K-dimensional vector augmented by an entry equal to unity. Several methods exist for the derivation of optimal linear discriminant functions.
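To make the idea of a generalized decision function concrete, the following sketch evaluates d(x) = w^T x' for a boundary that is nonlinear in the original features but linear in the transformed space. The quadratic feature functions and the weight values are illustrative assumptions, not values from this chapter.

```python
import numpy as np

def transform(x):
    """Map a 2D feature vector x = (x1, x2) into the augmented space
    [f1(x), ..., fK(x), 1]. Here the f_i are illustrative quadratic
    functions; any real, single-valued functions of x may be used."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2, 1.0])

def decide(x, w):
    """Evaluate the generalized decision function d(x) = w^T x',
    where x' is the transformed (augmented) feature vector.
    Returns class 1 if d(x) > 0, and class 2 otherwise."""
    d = np.dot(w, transform(x))
    return 1 if d > 0 else 2

# Illustrative weight vector: d(x) = 1 - x1^2 - x2^2, a circular
# decision boundary of radius 1 (linear in the transformed space).
w = np.array([0.0, 0.0, -1.0, -1.0, 0.0, 1.0])

print(decide((0.2, 0.3), w))  # inside the circle -> class 1
print(decide((1.5, 1.5), w))  # outside the circle -> class 2
```

The boundary is a circle in the original (x1, x2) plane, yet the classifier is a plain linear discriminant in the transformed space.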
Example of application: The ROIs of breast masses shown in Figure are arranged in the order of decreasing acutance A (see Sections and ). Figure shows the contours of the masses, arranged in the increasing order of fractional concavity f_cc (see Section ). Most of the contours of the benign masses are seen to be smooth, whereas most of the contours of the malignant tumors are rough and spiculated. Furthermore, most of the benign masses have well-defined, sharp edges and are well-circumscribed, whereas the majority of the malignant tumors possess ill-defined and fuzzy borders. It is seen that the shape factor f_cc facilitates the ordering of the contours in terms of shape complexity. However, the contours of a few benign masses and a few malignant tumors do not follow the expected trend. In addition, the acutance measure has lower values for most of the malignant tumors than for a majority of the benign masses.
The three shape factors (cf, f_cc, and SI; see Chapter ), the texture measures as defined by Haralick (see Section ), and four measures of edge sharpness as defined by Mudigonda et al. (see Section ) were computed for the ROIs and their contours. [Note: The factor SI was divided by two in this example, to reduce it to the range (0, 1).] Figure gives a plot of the 3D feature-vector space (f_cc, A, F) for the masses. The feature F shows poor separation between the benign and malignant samples, whereas the feature A demonstrates some degree of separation. A scatter plot of the three shape factors (f_cc, cf, SI) of the masses is given in Figure . Each of the three shape factors demonstrates high discriminant capability.
Figure shows a 2D plot of the shape-factor vectors (f_cc, SI) for a training set formed by selecting the vectors for benign masses and malignant tumors. The prototypes for the benign and malignant classes, obtained by averaging the vectors over all the members of the two classes in the training set, are marked as B and M, respectively, on the plot. The solid straight line is the perpendicular bisector of the line joining the two prototypes (dashed line), and represents a linear discriminant function. The equation of the straight line may be written as w_1 SI + w_2 f_cc + w_3 = 0. The decision function is represented by the following rule:

if w_1 SI + w_2 f_cc + w_3 > 0
then benign mass
else malignant tumor
end

It is seen that the rule given above will correctly classify all of the training samples.
Figure shows the result of application of the linear discriminant function designed as shown in Figure to a test set of benign masses and malignant tumors. The test set does not include any of the cases from the training set. It is seen that the classifier will lead to three false negatives in the test set.
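The design procedure used in this example (class prototypes obtained by averaging, and a linear decision function taken as the perpendicular bisector of the line joining them) can be sketched as follows. The feature values below are invented for illustration; they are not the (f_cc, SI) measurements discussed above.

```python
import numpy as np

def bisector_classifier(benign, malignant):
    """Design a linear decision function as the perpendicular bisector
    of the straight line joining the two class prototypes (means)."""
    z_b = np.mean(benign, axis=0)     # benign prototype
    z_m = np.mean(malignant, axis=0)  # malignant prototype
    w = z_b - z_m                     # normal to the decision boundary
    b = 0.5 * (np.dot(z_b, z_b) - np.dot(z_m, z_m))
    # d(x) = w^T x - b is > 0 for points closer to z_b than to z_m
    return lambda x: 'benign' if np.dot(w, x) - b > 0 else 'malignant'

# Hypothetical training vectors (feature1, feature2):
benign = np.array([[0.1, 0.1], [0.2, 0.15], [0.15, 0.2]])
malignant = np.array([[0.7, 0.8], [0.8, 0.75], [0.75, 0.9]])

classify = bisector_classifier(benign, malignant)
print(classify([0.12, 0.18]))  # near the benign prototype
print(classify([0.80, 0.85]))  # near the malignant prototype
```

The sign test d(x) > 0 is algebraically equivalent to asking which prototype is nearer, which is the geometric content of the perpendicular-bisector construction.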
Distance functions
Consider M pattern classes represented by their prototype patterns z_1, z_2, ..., z_M. The prototype of a class is typically computed as the average of all the feature vectors belonging to the class. Figure illustrates schematically the prototypes z_1 and z_2 of the two classes shown.
FIGURE
ROIs of breast masses, including benign masses and malignant tumors. The ROIs are arranged in the order of decreasing acutance A. Note that the masses are of widely differing size, but have been scaled to the same size in the illustration. For details regarding the case identifiers, see Figure . Reproduced with permission from H. Alto, R.M. Rangayyan, and J.E.L. Desautels, "Content-based retrieval and analysis of mammographic masses", Journal of Electronic Imaging, in press. © SPIE and IS&T.
FIGURE
Contours of breast masses, including benign masses and malignant tumors. The contours are arranged in the order of increasing f_cc. Note that the masses and their contours are of widely differing size, but have been scaled to the same size in the illustration. For details regarding the case identifiers, see Figure . See also Figure . Reproduced with permission from H. Alto, R.M. Rangayyan, and J.E.L. Desautels, "Content-based retrieval and analysis of mammographic masses", Journal of Electronic Imaging, in press. © SPIE and IS&T.
FIGURE
Plot of the 3D feature-vector space (f_cc, A, F) for the set of masses in Figure : (o) benign masses; malignant tumors. Reproduced with permission from H. Alto, R.M. Rangayyan, and J.E.L. Desautels, "Content-based retrieval and analysis of mammographic masses", Journal of Electronic Imaging, in press. © SPIE and IS&T.
FIGURE
Plot of the 3D feature-vector space (f_cc, cf, SI) for the set of contours in Figure : (o) benign masses; malignant tumors. Figure courtesy of H. Alto.
The Euclidean distance between an arbitrary pattern vector x and the i-th prototype is given as

D_i = ||x − z_i|| = [(x − z_i)^T (x − z_i)]^{1/2}.

A simple rule to classify the pattern vector x would be to choose that class for which the vector has the smallest distance:

if D_i < D_j for all j ≠ i, then x ∈ C_i.

See Section for the description of an application of the Euclidean distance to the analysis of breast masses and tumors.
A simple relationship may be established between discriminant functions and distance functions as follows:

D_i^2 = ||x − z_i||^2 = (x − z_i)^T (x − z_i)
      = x^T x − 2 x^T z_i + z_i^T z_i
      = x^T x − 2 [x^T z_i − (1/2) z_i^T z_i].

Choosing the minimum of D_i is equivalent to choosing the minimum of D_i^2, because all D_i ≥ 0. Furthermore, from the equation above, it follows
FIGURE
Plot of the 2D feature-vector space (f_cc, SI) for the training set of benign masses (o) and malignant tumors (x), selected from the dataset in Figure . The prototypes of the two classes are indicated by the vectors marked B and M. The solid line shown is a linear decision function, obtained as the perpendicular bisector of the straight line joining the two prototypes (dashed line).
FIGURE
Plot of the 2D feature-vector space (f_cc, SI) for the test set of benign masses (o) and malignant tumors (x), selected from the dataset in Figure . The solid line shown is a linear decision function, designed as illustrated in Figure . Three malignant cases are misclassified by the decision function shown.
that choosing the minimum of D_i^2 is equivalent to choosing the maximum of [x^T z_i − (1/2) z_i^T z_i]. Therefore, we may define the decision function

d_i(x) = x^T z_i − (1/2) z_i^T z_i,  i = 1, 2, ..., M.
A decision rule may then be stated as:

if d_i(x) > d_j(x) for all j ≠ i, then x ∈ C_i.

This is a linear discriminant function, which becomes obvious from the following representation: if z_ij, j = 1, 2, ..., n, are the components of z_i, let

w_ij = z_ij, j = 1, 2, ..., n;  w_{i,n+1} = −(1/2) z_i^T z_i;  and x = [x_1, x_2, ..., x_n, 1]^T.

Then, d_i(x) = w_i^T x, i = 1, 2, ..., M, where w_i = [w_{i1}, w_{i2}, ..., w_{i,n+1}]^T. Therefore, distance functions may be formulated as linear discriminant or decision functions.
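The equivalence derived above can be checked numerically: the minimum-Euclidean-distance rule and the linear decision function d_i(x) = x^T z_i − (1/2) z_i^T z_i produce the same class assignments. The prototypes below are invented for illustration.

```python
import numpy as np

prototypes = {  # hypothetical class prototypes z_i
    'C1': np.array([0.0, 0.0]),
    'C2': np.array([1.0, 1.0]),
    'C3': np.array([2.0, 0.0]),
}

def nearest_prototype(x):
    """Assign x to the class whose prototype has the smallest
    Euclidean distance D_i = ||x - z_i||."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

def linear_discriminant(x):
    """Assign x using d_i(x) = x^T z_i - (1/2) z_i^T z_i; the maximum
    of d_i corresponds to the minimum of the distance D_i."""
    return max(prototypes,
               key=lambda c: np.dot(x, prototypes[c])
                             - 0.5 * np.dot(prototypes[c], prototypes[c]))

x = np.array([0.9, 0.8])
print(nearest_prototype(x), linear_discriminant(x))  # both give C2
```

Both rules agree for every input, since maximizing d_i(x) differs from minimizing D_i^2 only by terms that do not depend on the class index i.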
The nearest-neighbor rule
Suppose that we are provided with a set of N sample patterns {s_1, s_2, ..., s_N} of known classification: each pattern belongs to one of M classes {C_1, C_2, ..., C_M}, with N >> M. We are then given a new feature vector x whose class needs to be determined. Let us compute a distance measure D(s_i, x) between the vector x and each sample pattern. Then, the nearest-neighbor rule states that the vector x is to be assigned to the class of the sample that is the closest to x:

x ∈ C_i if D(s_i, x) = min {D(s_l, x)},  l = 1, 2, ..., N.

A major disadvantage of the above method is that the classification decision is made based upon a single sample vector of known classification. The nearest neighbor may happen to be an outlier that is not representative of its class. It would be more reliable to base the classification upon several samples: we may consider a certain number k of the nearest neighbors of the sample to be classified, and then seek a majority opinion. This leads to the so-called k-nearest-neighbor or k-NN rule: Determine the k nearest neighbors of x, and use the majority of equal classifications in this group as the classification of x. See Section for the description of an application of the k-NN method to the analysis of breast masses and tumors.
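A minimal sketch of the k-NN rule described above, using the Euclidean distance and a majority vote; the labeled samples are invented for illustration.

```python
import numpy as np
from collections import Counter

def knn_classify(x, samples, labels, k=3):
    """k-nearest-neighbor rule: find the k samples closest to x and
    return the majority class among them."""
    d = np.linalg.norm(samples - x, axis=1)      # distances D(s_i, x)
    nearest = np.argsort(d)[:k]                  # indices of k nearest
    votes = Counter(labels[i] for i in nearest)  # majority opinion
    return votes.most_common(1)[0][0]

# Hypothetical labeled samples of two classes:
samples = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
labels = ['C1', 'C1', 'C1', 'C2', 'C2', 'C2']

print(knn_classify(np.array([0.15, 0.15]), samples, labels, k=3))  # C1
print(knn_classify(np.array([1.05, 0.95]), samples, labels, k=3))  # C2
```

Choosing an odd k avoids ties in two-class problems; k = 1 reduces the procedure to the plain nearest-neighbor rule.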
Unsupervised Pattern Classification
Let us consider the situation where we are given a set of feature vectors with no categorization or classes attached to them. No prior training information is available. How may we group the vectors into multiple categories?
The design of distance functions and decision boundaries requires a training set of feature vectors of known classes. The functions so designed may then be applied to a new set of feature vectors or samples to perform pattern classification. Such a procedure is known as supervised pattern classification, due to the initial training step. In some situations, a training step may not be possible, and we may be required to classify a given set of feature vectors into either a prespecified or unknown number of categories. Such a problem is labeled as unsupervised pattern classification, and may be solved by cluster-seeking methods.
Cluster-seeking methods
Given a set of feature vectors, we may examine them for the formation of inherent groups or clusters. This is a simple task in the case of 2D vectors, where we may plot them, visually identify groups, and label each group with a pattern class. Allowance may have to be made to assign the same class to multiple disjoint groups. Such an approach may be used even when the number of classes is not known at the outset. When the vectors have a dimension higher than three, visual analysis will not be feasible. It then becomes necessary to define criteria to group the given vectors on the basis of similarity, dissimilarity, or distance measures. A few examples of such measures are described below.
Euclidean distance:

D_E = ||x − z|| = [(x − z)^T (x − z)]^{1/2} = [Σ_{i=1}^{n} (x_i − z_i)^2]^{1/2}.

Here, x and z are two feature vectors; the latter could be a class prototype, if available. A small value of D_E indicates greater similarity between the two vectors than a large value of D_E.

Manhattan or city-block distance:

D_C = Σ_{i=1}^{n} |x_i − z_i|.

The Manhattan distance is the shortest path between x and z, with each segment being parallel to a coordinate axis.

Mahalanobis distance:

D_M = (x − m)^T C^{-1} (x − m),

where x is a feature vector being compared to a pattern class for which m is the class mean vector and C is the covariance matrix. A small value of D_M indicates a higher potential membership of the vector x in the class than a large value of D_M. See Section for the description of an application of the Mahalanobis distance to the analysis of breast masses and tumors.

Normalized dot product (cosine of the angle between the vectors x and z):

D_d = (x^T z) / (||x|| ||z||).

A large dot product value indicates a greater degree of similarity between the two vectors than a small value.
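The four measures listed above may be computed as follows; the vectors used are arbitrary examples, and the covariance matrix passed to the Mahalanobis measure is an assumed one.

```python
import numpy as np

def euclidean(x, z):
    """D_E = sqrt( sum (x_i - z_i)^2 )."""
    return float(np.sqrt(np.sum((x - z) ** 2)))

def manhattan(x, z):
    """D_C = sum |x_i - z_i| (city-block distance)."""
    return float(np.sum(np.abs(x - z)))

def mahalanobis(x, m, C):
    """D_M = (x - m)^T C^{-1} (x - m), as defined in the text."""
    diff = x - m
    return float(diff @ np.linalg.inv(C) @ diff)

def normalized_dot(x, z):
    """D_d = x^T z / (||x|| ||z||), the cosine of the angle."""
    return float(np.dot(x, z) / (np.linalg.norm(x) * np.linalg.norm(z)))

x = np.array([1.0, 2.0])
z = np.array([4.0, 6.0])
print(euclidean(x, z))       # 5.0
print(manhattan(x, z))       # 7.0
print(normalized_dot(x, z))  # about 0.992
```

With an identity covariance matrix, the Mahalanobis measure reduces to the squared Euclidean distance, which is a convenient sanity check.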
The covariance matrix is defined as

C = E[(y − m)(y − m)^T],

where the expectation operation is performed over all feature vectors y that belong to the class. The covariance matrix provides the covariance of all possible pairs of the features in the feature vector, over all samples belonging to the given class being considered. The elements along the main diagonal of the covariance matrix provide the variance of the individual features that make up the feature vector. The covariance matrix represents the scatter of the features that belong to the given class. The mean and covariance need to be updated as more samples are added to a given class in a clustering procedure.
When the Mahalanobis distance needs to be calculated between a sample vector and a number of classes represented by their mean and covariance matrices, a pooled covariance matrix may be used if the numbers of members in the various classes are unequal and low. If the covariance matrices of two classes are C_1 and C_2, and the numbers of members in the two classes are N_1 and N_2, the pooled covariance matrix is given by

C = (N_1 C_1 + N_2 C_2) / (N_1 + N_2).
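The pooled covariance matrix defined above can be computed as follows, with invented class covariance matrices and membership counts.

```python
import numpy as np

def pooled_covariance(C1, C2, N1, N2):
    """Pooled covariance matrix C = (N1*C1 + N2*C2) / (N1 + N2),
    as given in the text for two classes of unequal, low membership."""
    return (N1 * C1 + N2 * C2) / (N1 + N2)

# Hypothetical per-class covariance matrices and member counts:
C1 = np.array([[1.0, 0.2], [0.2, 1.0]])
C2 = np.array([[2.0, 0.0], [0.0, 2.0]])
C = pooled_covariance(C1, C2, N1=10, N2=30)
print(C)  # [[1.75, 0.05], [0.05, 1.75]]
```

The pooled matrix is simply a membership-weighted average, so the class with more members dominates the result.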
Various performance indices may be designed to measure the success of a clustering procedure. A measure of the tightness of a cluster is the sum of the squared errors performance index:

J = Σ_{j=1}^{N_c} Σ_{x ∈ S_j} ||x − m_j||^2,

where N_c is the number of cluster domains, S_j is the set of samples in the j-th cluster,

m_j = (1/N_j) Σ_{x ∈ S_j} x

is the sample mean vector of S_j, and N_j is the number of samples in S_j.
A few other examples of performance indices are:
- average of the squared distances between the samples in a cluster domain,
- intracluster variance,
- average of the squared distances between the samples in different cluster domains,
- intercluster distances,
- scatter matrices, and
- covariance matrices.
A simple cluster-seeking algorithm: Suppose we have N sample patterns {x_1, x_2, ..., x_N}.
1. Let the first cluster center z_1 be equal to any one of the samples, say z_1 = x_1. Choose a nonnegative threshold θ.
2. Compute the distance D_21 between x_2 and z_1. If D_21 < θ, assign x_2 to the domain (class) of cluster center z_1; otherwise, start a new cluster with its center as z_2 = x_2. For the subsequent steps, let us assume that a new cluster with center z_2 has been established.
3. Compute the distances D_31 and D_32 from the next sample x_3 to z_1 and z_2, respectively. If D_31 and D_32 are both greater than θ, start a new cluster with its center as z_3 = x_3; otherwise, assign x_3 to the domain of the closer cluster.
4. Continue to apply Steps 2 and 3 by computing and checking the distance from every new (unclassified) pattern vector to every established cluster center, and applying the assignment or cluster-creation rule.
5. Stop when every given pattern vector has been assigned to a cluster.

Observe that the procedure does not require a priori knowledge of the number of classes. Recognize also that the procedure does not assign a real-world class to each cluster: it merely groups the given vectors into disjoint clusters. A subsequent step is required to label each cluster with a class related to the actual problem. Multiple clusters may relate to the same real-world class, and may have to be merged.
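The steps of the simple cluster-seeking algorithm can be sketched as follows; the threshold value and the sample vectors are invented for illustration.

```python
import numpy as np

def simple_clustering(samples, theta):
    """Threshold-based cluster seeking: assign each sample to the
    nearest existing cluster center if it lies within theta of it;
    otherwise, make the sample a new cluster center."""
    centers = [samples[0]]           # Step 1: first sample is z_1
    labels = [0]
    for x in samples[1:]:
        d = [np.linalg.norm(x - z) for z in centers]
        if min(d) < theta:           # within threshold: assign
            labels.append(int(np.argmin(d)))
        else:                        # otherwise: new cluster center
            centers.append(x)
            labels.append(len(centers) - 1)
    return centers, labels

samples = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
centers, labels = simple_clustering(samples, theta=1.0)
print(len(centers), labels)  # 2 clusters, labels [0, 0, 1, 1]
```

Rerunning with a different sample order or threshold can change the outcome, which is exactly the sensitivity noted in the text.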
A major disadvantage of the simple cluster-seeking algorithm is that the results depend upon:
- the first cluster center chosen for each domain or class,
- the order in which the sample patterns are considered,
- the value of the threshold θ, and
- the geometrical properties or distributions of the data (that is, the feature-vector space).
The maximin-distance clustering algorithm: This method is similar to the previous simple algorithm, but first identifies the cluster regions that are the farthest apart. The term maximin refers to the combined use of maximum and minimum distances between the given vectors and the centers of the clusters already formed.
1. Let x_1 be the first cluster center, z_1.
2. Determine the farthest sample from x_1, and label it as cluster center z_2.
3. Compute the distance from each remaining sample to z_1 and to z_2. For every pair of these computations, save the minimum distance, and select the maximum of the minimum distances. If this maximin distance is an appreciable fraction of the distance between the cluster centers z_1 and z_2, label the corresponding sample as a new cluster center z_3; otherwise, stop forming new clusters and go to Step 5.
4. If a new cluster center was formed in Step 3, repeat Step 3, using a "typical" or the average distance between the established cluster centers for comparison.
5. Assign each remaining sample to the domain of its nearest cluster center.
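A sketch of the maximin-distance procedure described above. The choice of "appreciable fraction" (one half here) and the sample vectors are assumptions for illustration.

```python
import numpy as np

def maximin_clustering(samples, fraction=0.5):
    """Maximin-distance clustering: create new cluster centers where
    the maximum of the per-sample minimum distances to the existing
    centers is an appreciable fraction of the typical (average)
    inter-center distance; then assign samples to nearest centers."""
    centers = [samples[0]]                       # Step 1: z_1 = x_1
    d0 = [np.linalg.norm(x - centers[0]) for x in samples]
    centers.append(samples[int(np.argmax(d0))])  # Step 2: farthest sample
    while True:
        # minimum distance from each sample to the established centers
        dmin = [min(np.linalg.norm(x - z) for z in centers)
                for x in samples]
        i = int(np.argmax(dmin))                 # maximin candidate
        # average distance between established cluster centers
        pair_d = [np.linalg.norm(a - b)
                  for k, a in enumerate(centers) for b in centers[k + 1:]]
        if dmin[i] > fraction * np.mean(pair_d):
            centers.append(samples[i])           # new cluster center
        else:
            break                                # stop forming clusters
    # assign each sample to the domain of its nearest cluster center
    labels = [int(np.argmin([np.linalg.norm(x - z) for z in centers]))
              for x in samples]
    return centers, labels

samples = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0],
                    [5.1, 5.2], [0.0, 5.0]])
centers, labels = maximin_clustering(samples)
print(len(centers))  # three well-separated groups are found
```

Unlike the simple threshold algorithm, the first two centers are guaranteed to be far apart, which makes the result less sensitive to the ordering of the samples.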
The K-means algorithm: The preceding simple and maximin algorithms are intuitive procedures. The K-means algorithm is based on iterative minimization of a performance index that is defined as the sum of the squared distances from all points in a cluster domain to the cluster center.
1. Choose K initial cluster centers z_1(1), z_2(1), ..., z_K(1); K is the number of clusters to be formed. The choice of the cluster centers is arbitrary, and could be the first K of the feature vectors available. The index in parentheses represents the iteration number.
2. At the k-th iterative step, distribute the samples {x} among the K cluster domains, using the relation

x ∈ S_j(k) if ||x − z_j(k)|| < ||x − z_i(k)|| for all i = 1, 2, ..., K, i ≠ j,

where S_j(k) denotes the set of samples whose cluster center is z_j(k).
3. From the results of Step 2, compute the new cluster centers z_j(k+1), j = 1, 2, ..., K, such that the sum of the squared distances from all points in S_j(k) to the new cluster center is minimized. In other words, the new cluster center z_j(k+1) is computed so that the performance index

J_j = Σ_{x ∈ S_j(k)} ||x − z_j(k+1)||^2,  j = 1, 2, ..., K,

is minimized. The z_j(k+1) that minimizes this performance index is simply the sample mean of S_j(k). Therefore, the new cluster center is given by

z_j(k+1) = [1/N_j(k)] Σ_{x ∈ S_j(k)} x,  j = 1, 2, ..., K,

where N_j(k) is the number of samples in S_j(k). The name "K-means" is derived from the manner in which cluster centers are sequentially updated.
4. If z_j(k+1) = z_j(k) for j = 1, 2, ..., K, the algorithm has converged: terminate the procedure. Otherwise, go to Step 2.

The behavior of the K-means algorithm is influenced by:
- the number of cluster centers specified (K),
- the choice of the initial cluster centers,
- the order in which the sample patterns are considered, and
- the geometrical properties or distributions of the data (that is, the feature-vector space).
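The K-means steps given above can be sketched as follows, using the first K samples as the initial cluster centers; the sample vectors are invented for illustration.

```python
import numpy as np

def k_means(samples, K, max_iter=100):
    """K-means: iteratively assign samples to the nearest cluster
    center (Step 2) and recompute each center as the sample mean of
    its cluster (Step 3), until the centers stop changing (Step 4).
    The first K samples serve as the initial centers (Step 1)."""
    centers = samples[:K].copy()
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest center
        d = np.linalg.norm(samples[:, None, :] - centers[None, :, :],
                           axis=2)
        labels = np.argmin(d, axis=1)
        # Step 3: recompute each center as the mean of its samples
        new_centers = np.array([samples[labels == j].mean(axis=0)
                                for j in range(K)])
        if np.allclose(new_centers, centers):    # Step 4: convergence
            break
        centers = new_centers
    return centers, labels

samples = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])
centers, labels = k_means(samples, K=2)
print(labels)  # the two tight groups are separated
```

Each iteration cannot increase the sum-of-squared-errors index J, which is why the procedure converges; the result, however, still depends on the initial centers, as the list above notes.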
Example: Figures to show four cluster plots of the shape factors f_cc and SI of the breast mass contours shown in Figure (see Section for details). Although the categories of the samples would be unknown in a practical situation, the samples are identified in the plots, with the '+' symbol for the malignant tumors and a distinct symbol for the benign masses. The categorization represents the ground truth, or true classification, of the samples based upon biopsy.
The plots in Figures to show the progression of the K-means algorithm from its initial state to the converged state, with K = 2 in this example, representing the benign and malignant categories. The only prior knowledge or assumption used is that the samples are to be split into two clusters; that is, there are two classes. Figure shows two samples selected to represent the cluster centers, marked with the diamond and asterisk symbols. The straight line indicates the decision boundary, which is the perpendicular bisector of the straight line joining the two cluster centers. The K-means algorithm converged in this case at the fifth iteration; that is, there was no change in the cluster centers after the fifth iteration. The final decision boundary results in the misclassification of four of the malignant samples as being benign. It is interesting to note that, even though the two initial cluster centers belong to the benign category, the algorithm has converged to a useful solution. See Section for examples of application of other pattern classification techniques to the same dataset.
Probabilistic Models and Statistical Decision
Pattern classification methods such as discriminant functions are dependent upon the set of training samples provided. Their success, when applied to new cases, will depend upon the accuracy of the representation of the various pattern classes by the training samples. How can we design pattern classification techniques that are independent of specific training samples, and are optimal in a broad sense?
Probability functions and probabilistic models may be developed to represent the occurrence and statistical attributes of classes of patterns. Such functions may be based upon large collections of data, historical records, or mathematical models of pattern generation. In the absence of information as above, a training step with samples of known categorization will be required to estimate the required model parameters. It is common practice to assume a Gaussian PDF to represent the distribution of the features for each class, and estimate the required mean and variance parameters from the training sets.
When PDFs are available to characterize pattern classes and their features, optimal decision functions may be designed based upon statistical functions and decision theory. The following subsections describe a few methods in this category.
Likelihood functions and statistical decision
Let $P(C_i)$ be the probability of occurrence of class $C_i$, $i = 1, 2, \ldots, M$; this is known as the a priori, prior, or unconditional probability. The a posteriori or posterior probability that an observed sample pattern $\mathbf{x}$ came from $C_i$ is expressed as $P(C_i|\mathbf{x})$. If a classifier decides that $\mathbf{x}$ comes from $C_j$ when it actually came from $C_i$, the classifier is said to incur a loss $L_{ij}$, with $L_{ii} = 0$ or a fixed operational cost, and $L_{ij} > L_{ii}$ for $j \neq i$.

Because $\mathbf{x}$ may belong to any one of the $M$ classes under consideration, the expected loss, known as the conditional average risk or loss, in assigning $\mathbf{x}$ to $C_j$ is

$$R_j(\mathbf{x}) = \sum_{i=1}^{M} L_{ij}\, P(C_i|\mathbf{x}).$$
FIGURE: Initial state of the K-means algorithm. The symbols in the cluster plot represent the 2D feature vectors $(f_{cc}, SI)$ for benign and malignant breast masses. See Figure for the contours of the masses. The cluster centers (class means) are indicated by the solid diamond and the # symbols. The straight line indicates the decision boundary between the two classes. Figure courtesy of F.J. Ayres.

FIGURE: Second iteration of the K-means algorithm. Details as in the figure above. Figure courtesy of F.J. Ayres.

FIGURE: Third iteration of the K-means algorithm. Details as in the figure above. Figure courtesy of F.J. Ayres.

FIGURE: Fourth iteration of the K-means algorithm. Details as in the figure above. Figure courtesy of F.J. Ayres.

FIGURE: Final state of the K-means algorithm, after the fifth iteration. Details as in the figure above. Figure courtesy of F.J. Ayres.
A classifier could compute $R_j(\mathbf{x})$, $j = 1, 2, \ldots, M$, for each sample $\mathbf{x}$, and then assign $\mathbf{x}$ to the class with the smallest conditional loss. Such a classifier will minimize the total expected loss over all decisions, and is called the Bayes classifier. From a statistical point of view, the Bayes classifier represents the optimal classifier.
According to Bayes rule, we have

$$P(C_i|\mathbf{x}) = \frac{P(C_i)\, p(\mathbf{x}|C_i)}{p(\mathbf{x})},$$

where $p(\mathbf{x}|C_i)$ is called the likelihood function of class $C_i$ or the state-conditional PDF of $\mathbf{x}$, and $p(\mathbf{x})$ is the PDF of $\mathbf{x}$ regardless of class membership (unconditional). [Note: $P(y)$ is used to represent the probability of occurrence of an event $y$; $p(y)$ is used to represent the PDF of a random variable $y$. Probabilities and PDFs involving multidimensional feature vectors are multivariate functions with dimension equal to that of the feature vector.] Bayes rule shows how observing the sample $\mathbf{x}$ changes the a priori probability $P(C_i)$ to the a posteriori probability $P(C_i|\mathbf{x})$; in other words, Bayes rule provides a mechanism to update the a priori probability $P(C_i)$ to the a posteriori probability $P(C_i|\mathbf{x})$ due to the observation of the sample $\mathbf{x}$. Then, we can express the expected loss as

$$R_j(\mathbf{x}) = \frac{1}{p(\mathbf{x})} \sum_{i=1}^{M} L_{ij}\, p(\mathbf{x}|C_i)\, P(C_i).$$

Because $\frac{1}{p(\mathbf{x})}$ is common for all $j$, we could modify $R_j(\mathbf{x})$ to

$$r_j(\mathbf{x}) = \sum_{i=1}^{M} L_{ij}\, p(\mathbf{x}|C_i)\, P(C_i).$$
In a two-class case, with $M = 2$, we obtain the following expressions:

$$r_1(\mathbf{x}) = L_{11}\, p(\mathbf{x}|C_1)\, P(C_1) + L_{21}\, p(\mathbf{x}|C_2)\, P(C_2),$$

$$r_2(\mathbf{x}) = L_{12}\, p(\mathbf{x}|C_1)\, P(C_1) + L_{22}\, p(\mathbf{x}|C_2)\, P(C_2).$$

We have $\mathbf{x} \in C_1$ if $r_1(\mathbf{x}) < r_2(\mathbf{x})$; that is, $\mathbf{x} \in C_1$ if

$$L_{11}\, p(\mathbf{x}|C_1)\, P(C_1) + L_{21}\, p(\mathbf{x}|C_2)\, P(C_2) < L_{12}\, p(\mathbf{x}|C_1)\, P(C_1) + L_{22}\, p(\mathbf{x}|C_2)\, P(C_2),$$

or, equivalently, $\mathbf{x} \in C_1$ if

$$(L_{21} - L_{22})\, p(\mathbf{x}|C_2)\, P(C_2) < (L_{12} - L_{11})\, p(\mathbf{x}|C_1)\, P(C_1).$$

This expression may be rewritten as: $\mathbf{x} \in C_1$ if

$$\frac{p(\mathbf{x}|C_1)}{p(\mathbf{x}|C_2)} > \frac{P(C_2)}{P(C_1)}\, \frac{L_{21} - L_{22}}{L_{12} - L_{11}}.$$

The left-hand side of the inequality above, which is a ratio of two likelihood functions, is often referred to as the likelihood ratio:

$$l_{12}(\mathbf{x}) = \frac{p(\mathbf{x}|C_1)}{p(\mathbf{x}|C_2)}.$$

Then, Bayes decision rule for $M = 2$ is:

1. Assign $\mathbf{x}$ to class $C_1$ if $l_{12}(\mathbf{x}) > \theta_{12}$, where $\theta_{12}$ is a threshold given by

$$\theta_{12} = \frac{P(C_2)}{P(C_1)}\, \frac{L_{21} - L_{22}}{L_{12} - L_{11}}.$$

2. Assign $\mathbf{x}$ to class $C_2$ if $l_{12}(\mathbf{x}) < \theta_{12}$.

3. Make an arbitrary or heuristic decision if $l_{12}(\mathbf{x}) = \theta_{12}$.
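As a numerical sketch of this two-class rule, assuming hypothetical univariate Gaussian likelihoods (all densities, priors, and losses below are invented for illustration):

```python
import math

def gaussian_pdf(x, m, s):
    """Univariate normal density with mean m and standard deviation s."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (math.sqrt(2 * math.pi) * s)

def bayes_decide(x, p1, p2, L11=0.0, L22=0.0, L12=1.0, L21=1.0,
                 P1=0.5, P2=0.5):
    """Two-class Bayes rule: compare the likelihood ratio l12(x) to theta12."""
    l12 = p1(x) / p2(x)                            # likelihood ratio
    theta = (P2 / P1) * (L21 - L22) / (L12 - L11)  # decision threshold
    return 1 if l12 > theta else 2                 # ties resolved as class 2 here

# Hypothetical class-conditional PDFs: C1 ~ N(0, 1), C2 ~ N(3, 1).
p1 = lambda x: gaussian_pdf(x, 0.0, 1.0)
p2 = lambda x: gaussian_pdf(x, 3.0, 1.0)

print(bayes_decide(0.5, p1, p2))  # sample near the C1 mean -> 1
print(bayes_decide(2.8, p1, p2))  # sample near the C2 mean -> 2
```

With equal priors and the 0-1 losses used here, $\theta_{12} = 1$, so the rule reduces to picking the class with the larger likelihood; changing the losses or priors shifts the threshold.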
The rule may be generalized to the $M$-class case as:

$$\mathbf{x} \in C_i \quad \text{if} \quad \sum_{k=1}^{M} L_{ki}\, p(\mathbf{x}|C_k)\, P(C_k) < \sum_{q=1}^{M} L_{qj}\, p(\mathbf{x}|C_q)\, P(C_q), \quad j = 1, 2, \ldots, M;\ j \neq i.$$
In most pattern classification problems, the loss is nil for correct decisions. The loss could be assumed to be equal to a certain nonzero quantity for all erroneous decisions. Then, $L_{ij} = 1 - \delta_{ij}$, where

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases}$$

and

$$r_j(\mathbf{x}) = \sum_{i=1}^{M} (1 - \delta_{ij})\, p(\mathbf{x}|C_i)\, P(C_i) = p(\mathbf{x}) - p(\mathbf{x}|C_j)\, P(C_j),$$

because

$$\sum_{i=1}^{M} p(\mathbf{x}|C_i)\, P(C_i) = p(\mathbf{x}).$$
The Bayes classifier will assign a pattern $\mathbf{x}$ to class $C_i$ if

$$p(\mathbf{x}) - p(\mathbf{x}|C_i)\, P(C_i) < p(\mathbf{x}) - p(\mathbf{x}|C_j)\, P(C_j), \quad j = 1, 2, \ldots, M;\ j \neq i;$$

that is,

$$\mathbf{x} \in C_i \quad \text{if} \quad p(\mathbf{x}|C_i)\, P(C_i) > p(\mathbf{x}|C_j)\, P(C_j), \quad j = 1, 2, \ldots, M;\ j \neq i.$$

This is nothing more than using the decision functions

$$d_i(\mathbf{x}) = p(\mathbf{x}|C_i)\, P(C_i), \quad i = 1, 2, \ldots, M,$$

where a pattern $\mathbf{x}$ is assigned to class $C_i$ if $d_i(\mathbf{x}) > d_j(\mathbf{x})$ for all $j \neq i$ for that pattern. Using Bayes rule, we get

$$d_i(\mathbf{x}) = P(C_i|\mathbf{x})\, p(\mathbf{x}), \quad i = 1, 2, \ldots, M.$$

Because $p(\mathbf{x})$ does not depend upon the class index $i$, this can be reduced to

$$d_i(\mathbf{x}) = P(C_i|\mathbf{x}), \quad i = 1, 2, \ldots, M.$$

The different decision functions given above provide alternative, yet equivalent, approaches, depending upon whether $p(\mathbf{x}|C_i)$ or $P(C_i|\mathbf{x})$ is used or available. The estimation of $p(\mathbf{x}|C_i)$ would require a training set for each class $C_i$. It is common to assume a Gaussian distribution and estimate its mean and variance using the training set.
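A minimal sketch of this practice, using made-up univariate training samples for two classes; the data and priors are assumptions for illustration only:

```python
import math

def fit_gaussian(samples):
    """Estimate the mean and variance of a 1D Gaussian from a training set."""
    n = len(samples)
    m = sum(samples) / n
    var = sum((s - m) ** 2 for s in samples) / n
    return m, var

def likelihood(x, m, var):
    """Gaussian PDF value p(x | C_i) for the fitted parameters."""
    return math.exp(-0.5 * (x - m) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Hypothetical training sets for classes C1 and C2, and assumed priors.
train = {1: [0.1, 0.2, 0.15, 0.25], 2: [0.7, 0.8, 0.75, 0.85]}
prior = {1: 0.5, 2: 0.5}
params = {c: fit_gaussian(xs) for c, xs in train.items()}

def classify(x):
    """Assign x to the class maximizing d_i(x) = p(x|C_i) P(C_i)."""
    return max(params, key=lambda c: likelihood(x, *params[c]) * prior[c])

print(classify(0.2), classify(0.9))  # -> 1 2
```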
Bayes classifier for normal patterns
The univariate normal or Gaussian PDF for a single random variable $x$ is given by

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - m}{\sigma} \right)^2 \right],$$

which is completely specified by two parameters: the mean

$$m = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx,$$

and the variance

$$\sigma^2 = E[(x - m)^2] = \int_{-\infty}^{\infty} (x - m)^2\, p(x)\, dx.$$

In the case of $M$ pattern classes and pattern vectors $\mathbf{x}$ of dimension $n$ governed by multivariate normal PDFs, we have

$$p(\mathbf{x}|C_i) = \frac{1}{(2\pi)^{n/2}\, |\mathbf{C}_i|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \mathbf{m}_i)^T \mathbf{C}_i^{-1} (\mathbf{x} - \mathbf{m}_i) \right],$$

$i = 1, 2, \ldots, M$, where each PDF is completely specified by its mean vector $\mathbf{m}_i$ and its $n \times n$ covariance matrix $\mathbf{C}_i$, with

$$\mathbf{m}_i = E_i[\mathbf{x}]$$

and

$$\mathbf{C}_i = E_i[(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T].$$

Here, $E_i[\cdot]$ denotes the expectation operator over the patterns belonging to class $C_i$.

Normal distributions occur frequently in nature, and have the advantage of analytical tractability. A multivariate normal PDF reduces to a product of univariate normal PDFs when the elements of $\mathbf{x}$ are mutually independent, in which case the covariance matrix is a diagonal matrix.
We had earlier formulated the decision functions

$$d_i(\mathbf{x}) = p(\mathbf{x}|C_i)\, P(C_i), \quad i = 1, 2, \ldots, M;$$

see Equation above. Given the exponential in the normal PDF, it is convenient to use

$$d_i(\mathbf{x}) = \ln\left[ p(\mathbf{x}|C_i)\, P(C_i) \right] = \ln p(\mathbf{x}|C_i) + \ln P(C_i),$$

which is equivalent in terms of classification performance, because the natural logarithm ln is a monotonically increasing function. Then,

$$d_i(\mathbf{x}) = \ln P(C_i) - \frac{n}{2} \ln 2\pi - \frac{1}{2} \ln |\mathbf{C}_i| - \frac{1}{2} (\mathbf{x} - \mathbf{m}_i)^T \mathbf{C}_i^{-1} (\mathbf{x} - \mathbf{m}_i),$$

$i = 1, 2, \ldots, M$. The second term does not depend upon $i$; therefore, we can simplify $d_i(\mathbf{x})$ to

$$d_i(\mathbf{x}) = \ln P(C_i) - \frac{1}{2} \ln |\mathbf{C}_i| - \frac{1}{2} (\mathbf{x} - \mathbf{m}_i)^T \mathbf{C}_i^{-1} (\mathbf{x} - \mathbf{m}_i), \quad i = 1, 2, \ldots, M.$$

The decision functions above are hyperquadrics; hence, the best that a Bayes classifier for normal patterns can do is to place a general second-order decision surface between each pair of pattern classes. In the case of true normal distributions of patterns, the decision functions as above will be optimal on an average basis: they minimize the expected loss with the simplified loss function $L_{ij} = 1 - \delta_{ij}$.

If all the covariance matrices are equal, that is, $\mathbf{C}_i = \mathbf{C}$, $i = 1, 2, \ldots, M$, we get

$$d_i(\mathbf{x}) = \ln P(C_i) + \mathbf{x}^T \mathbf{C}^{-1} \mathbf{m}_i - \frac{1}{2} \mathbf{m}_i^T \mathbf{C}^{-1} \mathbf{m}_i, \quad i = 1, 2, \ldots, M,$$

after omitting terms independent of $i$. The Bayesian classifier is now represented by a set of linear decision functions.
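The quadratic decision function above can be evaluated directly; the sketch below uses two hypothetical 2D classes whose means, covariances, and priors are invented for illustration:

```python
import numpy as np

def quadratic_discriminant(x, m, C, prior):
    """d_i(x) = ln P(C_i) - 0.5 ln|C_i| - 0.5 (x-m_i)^T C_i^{-1} (x-m_i)."""
    diff = x - m
    mahal = diff @ np.linalg.inv(C) @ diff  # squared Mahalanobis distance
    return np.log(prior) - 0.5 * np.log(np.linalg.det(C)) - 0.5 * mahal

# Hypothetical class models (e.g., benign vs malignant in a 2D feature space).
m1, C1 = np.array([0.2, 0.2]), np.array([[0.01, 0.0], [0.0, 0.01]])
m2, C2 = np.array([0.7, 0.7]), np.array([[0.04, 0.0], [0.0, 0.04]])
P1, P2 = 0.5, 0.5

def classify(x):
    """Pick the class with the larger log decision function."""
    d1 = quadratic_discriminant(x, m1, C1, P1)
    d2 = quadratic_discriminant(x, m2, C2, P2)
    return 1 if d1 > d2 else 2

print(classify(np.array([0.25, 0.2])), classify(np.array([0.8, 0.6])))  # -> 1 2
```

Because the two covariance matrices here differ, the implied boundary is a second-order curve; making them equal would reduce the discriminants to the linear form given above.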
Before one may apply the decision functions as above, it would be appropriate to verify the Gaussian nature of the PDFs of the variables on hand by conducting statistical tests. Furthermore, it would be necessary to derive or estimate the mean vector and covariance matrix for each class; sample statistics computed from a training set may serve this purpose.
Example: Figure shows plots of Gaussian PDF models applied to the shape factor $f_{cc}$ (fractional concavity) of the breast mass contours shown in Figure (see Chapter and Section for details). The two Gaussians represent the state-conditional PDFs of $f_{cc}$ for the benign and malignant categories. Also shown are the posterior probabilities that the class of a sample is benign or malignant, given an observed value of $f_{cc}$. The posterior probability functions were derived using Bayes rule, as in Equation, with the prior probabilities of the two classes being equal (0.5 each). It is seen that the two posterior probabilities are both equal to 0.5 at the point where the curves cross; beyond this point, the probability of a malignant classification is higher than that of a benign classification. Due to the use of equal prior probabilities, the transition point is the same as the point where the two Gaussian models for the state-conditional PDFs cross each other.
Figure shows the same Gaussian models for the state-conditional PDFs as in the preceding figure; however, the posterior probability functions were derived using a larger prior probability for the benign category, with the prior probability for the malignant category being correspondingly smaller (the two priors sum to unity). In this case, the probability of a malignant classification exceeds that of a benign classification only at a larger value of $f_{cc}$. The prior assumption that most of the masses encountered will be benign has pushed the decision threshold on $f_{cc}$ to a higher value.
Figure illustrates the 2D cluster plot and Gaussian PDF models for the shape factors $f_{cc}$ and $SI$ for the same dataset as described above (see Chapter and Section for details). The decision boundary indicated by the solid line is the optimal boundary under the assumption of 2D Gaussian PDFs for the two features and the two classes. Two malignant samples are misclassified by the decision boundary shown. See Section for examples of the application of other pattern classification techniques to the same dataset.

An interesting point to note from the examples above is that the Gaussian PDF models used are not capable of accommodating the prior knowledge that the shape factors are limited to the range $[0, 1]$. Other PDF models, such as the Rayleigh distribution (see Section), should be used if this aspect is important.
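The effect of the prior on the decision threshold, as in the examples above, can be illustrated numerically; the Gaussian parameters below are invented stand-ins for the benign and malignant $f_{cc}$ models, not the models of the figures:

```python
import math

def gauss(x, m, s):
    """Univariate Gaussian PDF with mean m and standard deviation s."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (math.sqrt(2 * math.pi) * s)

def posterior_benign(x, prior_benign):
    """P(benign | x) via Bayes rule with two Gaussian state-conditional PDFs."""
    pb = gauss(x, 0.2, 0.1) * prior_benign         # hypothetical benign model
    pm = gauss(x, 0.6, 0.15) * (1 - prior_benign)  # hypothetical malignant model
    return pb / (pb + pm)

def threshold(prior_benign):
    """Smallest x (on a coarse grid) at which the malignant posterior dominates."""
    x = 0.0
    while posterior_benign(x, prior_benign) > 0.5:
        x += 0.001
    return x

t_equal = threshold(0.5)   # equal priors
t_benign = threshold(0.9)  # strong prior in favor of benign
# A stronger benign prior pushes the decision threshold to a higher value.
print(t_equal < t_benign)  # -> True
```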
Logistic Regression
Logistic classification is a statistical technique based on a logistic regression model that estimates the probability of occurrence of an event.

The technique is designed for problems where patterns are to be classified into one of two classes. When the response variable is binary, theoretical and empirical considerations indicate that the response function is often curvilinear. The typical response function is shaped as a forward or backward tilted S, and is known as a sigmoidal function. The function has asymptotes at 0 and 1.

FIGURE: Plots of Gaussian state-conditional PDF models for the shape factor $f_{cc}$ for benign (dashed line) and malignant (dotted line) breast masses. See Figure for the contours of the masses. The $f_{cc}$ values of the benign and malignant samples are indicated on the horizontal axis. The posterior probability functions for the benign (solid line) and malignant (dash-dot line) classes are also shown. Equal prior probabilities of 0.5 were used for the two classes. See also the following figure. Figure courtesy of F.J. Ayres.

FIGURE: Same as in the preceding figure, but with a larger prior probability for the benign class. Figure courtesy of F.J. Ayres.

FIGURE: Plots of 2D Gaussian state-conditional PDF models for the shape factors $f_{cc}$ and $SI$ for benign masses (dash-dot line) and malignant tumors (dashed line). See Figure for the contours of the masses. Feature values for benign masses and malignant tumors are indicated by separate symbols. The benign class prototype (mean) is indicated by the solid diamond; that for the malignant class is indicated by the # symbol. The dashed and dash-dot contours indicate two constant-Mahalanobis-distance contours (level sets) each for the two Gaussian PDF models (see Equation above), for the malignant and benign classes, respectively. The solid contour indicates the decision boundary, obtained by setting the decision functions of the two classes equal. The prior probabilities for the two classes were assumed to be equal (0.5 each). Figure courtesy of F.J. Ayres.

In logistic pattern classification, an event is defined as the membership of a pattern vector in one of the two classes of concern. The method computes a variable that depends upon the given parameters and is constrained to the range $(0, 1)$, so that it may be interpreted as a probability. The probability of the pattern vector belonging to the second class is simply the difference between unity and the estimated value.
For the case of a single feature or parameter, the logistic regression model is given as

$$P(\text{event}) = \frac{\exp(b_0 + b_1 x)}{1 + \exp(b_0 + b_1 x)},$$

or, equivalently,

$$P(\text{event}) = \frac{1}{1 + \exp[-(b_0 + b_1 x)]},$$

where $b_0$ and $b_1$ are coefficients estimated from the data, and $x$ is the independent (feature) variable. The relationship between the independent variable and the estimated probability is nonlinear, and follows an S-shaped curve that closely resembles the integral of a Gaussian function. In the case of an $n$-dimensional feature vector $\mathbf{x}$, the model can be written as

$$P(\text{event}) = \frac{1}{1 + \exp(-z)},$$

where $z$ is the linear combination

$$z = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n = \mathbf{b}^T \mathbf{x};$$

that is, $z$ is the dot product of the augmented feature vector $\mathbf{x}$ with a coefficient vector $\mathbf{b}$.
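A minimal sketch of the logistic model; the coefficients below are arbitrary assumptions (in practice, $\mathbf{b}$ would be estimated from training data, for example, by maximum likelihood):

```python
import math

def logistic(x, b):
    """P(event) = 1 / (1 + exp(-z)), with z = b0 + b1*x1 + ... + bn*xn."""
    z = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

b = [-4.0, 10.0]           # hypothetical coefficients for a single feature
print(logistic([0.0], b))  # small feature value -> probability near 0
print(logistic([0.4], b))  # z = 0 -> probability exactly 0.5
print(logistic([1.0], b))  # large feature value -> probability near 1
```

The probability of membership in the other class is obtained, as stated above, as one minus the value returned.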