ABSTRACT
Pattern Classification and Diagnostic Decision
The final purpose of biomedical image analysis is to classify a given image, or the features that have been detected in the image, into one of a few known categories. In medical applications, a further goal is to arrive at a diagnostic decision regarding the condition of the patient. A physician or medical specialist may achieve this goal via visual analysis of the image and data presented; comparative analysis of the given image with others of known diagnoses, or the application of established protocols and sets of rules, may assist in such a decision-making process. Images taken earlier of the same patient may also be used, when available, for comparative or differential analysis. Some measurements may also be made from the given image to assist in the analysis. The basic knowledge, clinical experience, expertise, and intuition of the physician play significant roles in this process.
When image analysis is performed via the application of computer algorithms, the typical result is the extraction of a number of numerical features. When the numerical features relate directly to measurements of organs or features represented by the image, such as an estimate of the size of the heart or the volume of a tumor, the clinical specialist may be able to use the features directly in his or her diagnostic logic. However, when parameters such as measures of texture and shape complexity are derived, a human analyst is not likely to be able to analyze or comprehend the features. Furthermore, as the number of the computed features increases, the associated diagnostic logic may become too complicated and unwieldy for human analysis. Computer methods would then be desirable to perform the classification and decision process.
At the outset, it should be borne in mind that a biomedical image forms but one piece of information in arriving at a diagnosis: the classification of a given image into one of many categories may assist in the diagnostic procedure, but will almost never be the only factor. Regardless, pattern classification based upon image analysis is indeed an important aspect of biomedical image analysis, and forms the theme of the present chapter. Remaining within the realm of CAD, as introduced in Figure and Section , it would be preferable to design methods so as to aid a medical specialist in arriving at a diagnosis, rather than to provide a decision.
A generic problem statement for pattern classification may be expressed as follows: A number of measures and features have been derived from a biomedical image. Develop methods to classify the image into one of a few specified categories. Investigate the relevance of the features and the classification methods in arriving at a diagnostic decision about the patient.
Observe that the features mentioned above may have been derived manually or by computer methods. Recognize the distinction between classifying the given image and arriving at a diagnosis regarding the patient: the connection between the two tasks or steps may not always be direct. In other words, a pattern classification method may facilitate the labeling of a given image as being a member of a particular class; arriving at a diagnosis of the condition of the patient will most likely require the analysis of several other items of clinical information. Although it is common to work with a prespecified number of pattern classes, many problems do exist where the number of classes is not known a priori. A special case is screening, where the aim is to simply decide on the presence or absence of a certain type of abnormality or disease. The initial decision in screening may be further focused on whether the subject appears to be free of the specific abnormality of concern, or requires further investigation.
The problem statement and description above are rather generic. Several considerations arise in the practical application of the concepts mentioned above to medical images and diagnosis. Using the detection of breast cancer as an example, the following questions illustrate some of the problems encountered in practice:
- Is a mass or tumor present? (Yes/No)
- If a mass or tumor is present:
  - Give or mark its location.
  - Compare the density of the mass to that of the surrounding tissues: hypodense, isodense, or hyperdense.
  - Describe the shape of its boundary: round, ovoid, irregular, macrolobulated, microlobulated, or spiculated.
  - Describe its texture: homogeneous, heterogeneous, or fatty.
  - Describe its edge: sharp, well-circumscribed, ill-defined, or fuzzy.
  - Decide if it is a benign mass (a cyst, solid or fluid-filled) or a malignant tumor.
- Are calcifications present? (Yes/No)
- If calcifications are present:
  - Estimate their number per cm².
  - Describe their shape: round, ovoid, elongated, branching, rough, punctate, irregular, or amorphous.
  - Describe their spatial distribution or cluster.
  - Describe their density: homogeneous or heterogeneous.
- Are there signs of architectural distortion? (Yes/No)
- Are there signs of bilateral asymmetry? (Yes/No)
- Are there major changes compared to the previous mammogram of the patient?
- Is the case normal? (Yes/No)
- If the case is abnormal:
  - Is the disease benign, or malignant (cancer)?
The items listed above give a selection of the many features of mammograms that a radiologist would investigate; see Ackerman et al. and the BI-RADS™ manual for more details. Figure shows a graphical user interface developed by Alto et al. for the categorization of breast masses, related to some of the questions listed above. Figure illustrates four segments of mammograms, demonstrating masses and tumors of different characteristics, progressing from a well-circumscribed and homogeneous benign mass to a highly spiculated and heterogeneous tumor.
The subject matter of this book (image analysis and pattern classification) can provide assistance in responding to only some of the questions listed above. Even an entire set of mammograms may not lead to a final decision; other modes of diagnostic imaging and means of investigation may be necessary to arrive at a definite diagnosis.
In the following sections, a number of methods for pattern classification, decision making, and evaluation of the results of classification are reviewed and illustrated.
Note: Parts of this chapter are reproduced, with permission, from R.M. Rangayyan, Biomedical Signal Analysis: A Case-Study Approach, IEEE Press and Wiley, New York, NY. © IEEE.
Pattern Classification
Pattern recognition or classification may be defined as the categorization of the input data into identifiable classes, via the extraction of significant features or attributes of the data from a background of irrelevant detail. In biomedical image analysis, after quantitative features have been extracted from the given images, each image or ROI may be represented by a feature vector x = [x_1, x_2, ..., x_n]^T, which is also known
FIGURE
Graphical user interface for the categorization of breast masses. Reproduced with permission from H. Alto, R.M. Rangayyan, R.B. Paranjape, J.E.L. Desautels, and H. Bryant, "An indexed atlas of digital mammograms for computer-aided diagnosis of breast cancer", Annales des Télécommunications. © GET Lavoisier. Figure courtesy of C. LeGuillou, École Nationale Supérieure des Télécommunications de Bretagne, Brest, France.
FIGURE
Examples of breast mass regions and contours, with the corresponding values of fractional concavity f_cc, spiculation index SI, compactness cf, acutance A, and sum entropy F. (a) Circumscribed benign mass. (b) Macrolobulated benign mass. (c) Microlobulated malignant tumor. (d) Spiculated malignant tumor. Note that the masses and their contours are of widely differing size, but have been scaled to the same size in the illustration. The first letter of the case identifier indicates a malignant diagnosis with 'm' and a benign diagnosis with 'b', based upon biopsy. The symbols after the first numerical portion of the identifier represent: l = left, r = right, c = craniocaudal view, o = mediolateral oblique view, x = axillary view. The last two digits represent the year of acquisition of the mammogram. An additional character of the identifier after the year (a to f), if present, indicates the existence of multiple masses visible in the same mammogram. Reproduced with permission from H. Alto, R.M. Rangayyan, and J.E.L. Desautels, "Content-based retrieval and analysis of mammographic masses", Journal of Electronic Imaging, in press. © SPIE and IS&T.
as the measurement vector or a pattern vector. When the values x_i are real numbers, x is a point in an n-dimensional Euclidean space; vectors of similar objects may be expected to form clusters, as illustrated in Figure .
FIGURE
Two-dimensional feature vectors of two classes, C_1 and C_2. The prototypes of the two classes are indicated by the vectors z_1 and z_2. The linear decision function d(x) shown (solid line) is the perpendicular bisector of the straight line joining the two prototypes (dashed line). Reproduced with permission from R.M. Rangayyan, Biomedical Signal Analysis: A Case-Study Approach, IEEE Press and Wiley, New York, NY. © IEEE.
For efficient pattern classification, measurements that could lead to disjoint sets or clusters of feature vectors are desired. This point underlines the importance of the appropriate design of the preprocessing and feature extraction procedures. Features or characterizing attributes that are common to all patterns belonging to a particular class are known as intraset or intraclass features. Discriminant features that represent the differences between pattern classes are called interset or interclass features.
The pattern classification problem is that of generating optimal decision boundaries, or decision procedures, to separate the data into pattern classes based on the feature vectors provided. Figure illustrates a simple linear decision function or boundary to separate 2D feature vectors into two classes.
Supervised Pattern Classification
The problem considered in supervised pattern classification may be stated as follows: You are provided with a number of feature vectors with classes assigned to them. Propose techniques to characterize and parameterize the boundaries that separate the classes.
A given set of feature vectors of known categorization is often referred to as a training set. The availability of a training set facilitates the development of mathematical functions that can characterize the separation between the classes. The functions may then be applied to new feature vectors of unknown classes, to classify or recognize them. This approach is known as supervised pattern classification. A set of feature vectors of known categorization that is used to evaluate a classifier designed in this manner is referred to as a test set. After adequate testing and confirmation of the method with satisfactory results, the classifier may be applied to new feature vectors of unknown classes; the results may then be used to arrive at diagnostic decisions. The following subsections describe a few methods that can assist in the development of discriminant and decision functions.
Discriminant and decision functions
A general linear discriminant or decision function is of the form

d(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + w_{n+1} = w^T x,

where x = [x_1, x_2, ..., x_n, 1]^T is the feature vector augmented by an additional entry equal to unity, and w = [w_1, w_2, ..., w_n, w_{n+1}]^T is a correspondingly augmented weight vector. A two-class pattern classification problem may be stated as

d(x) = w^T x,  with d(x) > 0 if x ∈ C_1, and d(x) < 0 if x ∈ C_2,

where C_1 and C_2 represent the two classes. The discriminant function may be interpreted as the boundary separating the classes C_1 and C_2, as illustrated in Figure .
In the general case of an M-class pattern classification problem, we will need M weight vectors and M decision functions to perform the following decisions:

d_i(x) = w_i^T x,  with d_i(x) > 0 if x ∈ C_i, and d_i(x) < 0 otherwise,  i = 1, 2, ..., M,

where w_i = [w_{i1}, w_{i2}, ..., w_{in}, w_{i,n+1}]^T is the weight vector for the class C_i.
Three cases arise in solving this problem:

Case 1: Each class is separable from the rest by a single decision surface:
if d_i(x) > 0, then x ∈ C_i.

Case 2: Each class is separable from every other individual class by a distinct decision surface; that is, the classes are pairwise separable. There are M(M-1)/2 decision surfaces, given by d_ij(x) = w_ij^T x:
if d_ij(x) > 0 for all j ≠ i, then x ∈ C_i.
[Note: d_ij(x) = -d_ji(x).]

Case 3: There exist M decision functions d_k(x) = w_k^T x, k = 1, 2, ..., M, with the property that
if d_i(x) > d_j(x) for all j ≠ i, then x ∈ C_i.
This is a special instance of Case 2. We may define

d_ij(x) = d_i(x) - d_j(x) = (w_i - w_j)^T x = w_ij^T x.

If the classes are separable under Case 3, they are separable under Case 2; the converse is, in general, not true.
Patterns that may be separated by linear decision functions, as above, are said to be linearly separable. In other situations, an infinite variety of complex decision boundaries may be formulated by using generalized decision functions based upon nonlinear functions of the feature vectors, as

d(x) = w_1 f_1(x) + w_2 f_2(x) + ... + w_K f_K(x) + w_{K+1} = Σ_{i=1}^{K+1} w_i f_i(x).

Here, {f_i(x)}, i = 1, 2, ..., K, are real, single-valued functions of x, and f_{K+1}(x) = 1. Whereas the functions f_i(x) may be nonlinear in the n-dimensional space of x, the decision function may be formulated as a linear function by defining a transformed feature vector x' = [f_1(x), f_2(x), ..., f_K(x), 1]^T. Then, d(x) = w^T x', with w = [w_1, w_2, ..., w_K, w_{K+1}]^T. Once evaluated, {f_i(x)} is just a set of numerical values, and x' is simply a K-dimensional vector augmented by an entry equal to unity. Several methods exist for the derivation of optimal linear discriminant functions.
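To make the idea of a generalized decision function concrete, the following sketch evaluates d(x) = w^T x' for a boundary that is nonlinear in the original features but linear in the transformed space. The quadratic feature functions and the weight values are illustrative assumptions, not values from this chapter.

```python
import numpy as np

def transform(x):
    """Map a 2D feature vector x = (x1, x2) into the augmented space
    [f1(x), ..., fK(x), 1]. Here the f_i are illustrative quadratic
    functions; any real, single-valued functions of x may be used."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2, 1.0])

def decide(x, w):
    """Evaluate the generalized decision function d(x) = w^T x',
    where x' is the transformed (augmented) feature vector.
    Returns class 1 if d(x) > 0, and class 2 otherwise."""
    d = np.dot(w, transform(x))
    return 1 if d > 0 else 2

# Illustrative weight vector: d(x) = 1 - x1^2 - x2^2, a circular
# decision boundary of radius 1 (linear in the transformed space).
w = np.array([0.0, 0.0, -1.0, -1.0, 0.0, 1.0])

print(decide((0.2, 0.3), w))  # inside the circle -> class 1
print(decide((1.5, 1.5), w))  # outside the circle -> class 2
```

The boundary is a circle in the original (x1, x2) plane, yet the classifier is a plain linear discriminant in the transformed space.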
Example of application: The ROIs of breast masses shown in Figure are arranged in the order of decreasing acutance A (see Sections and ). Figure shows the contours of the masses, arranged in the increasing order of fractional concavity f_cc (see Section ). Most of the contours of the benign masses are seen to be smooth, whereas most of the contours of the malignant tumors are rough and spiculated. Furthermore, most of the benign masses have well-defined, sharp edges and are well-circumscribed, whereas the majority of the malignant tumors possess ill-defined and fuzzy borders. It is seen that the shape factor f_cc facilitates the ordering of the contours in terms of shape complexity. However, the contours of a few benign masses and a few malignant tumors do not follow the expected trend. In addition, the acutance measure has lower values for most of the malignant tumors than for a majority of the benign masses.
The three shape factors (cf, f_cc, and SI; see Chapter ), the texture measures as defined by Haralick (see Section ), and four measures of edge sharpness as defined by Mudigonda et al. (see Section ) were computed for the ROIs and their contours. [Note: The factor SI was divided by two in this example, to reduce it to the range (0, 1).] Figure gives a plot of the 3D feature-vector space (f_cc, A, F) for the masses. The feature F shows poor separation between the benign and malignant samples, whereas the feature A demonstrates some degree of separation. A scatter plot of the three shape factors (f_cc, cf, SI) of the masses is given in Figure . Each of the three shape factors demonstrates high discriminant capability.
Figure shows a 2D plot of the shape-factor vectors (f_cc, SI) for a training set formed by selecting the vectors for benign masses and malignant tumors. The prototypes for the benign and malignant classes, obtained by averaging the vectors over all the members of the two classes in the training set, are marked as B and M, respectively, on the plot. The solid straight line is the perpendicular bisector of the line joining the two prototypes (dashed line), and represents a linear discriminant function. The equation of the straight line may be written as w_1 SI + w_2 f_cc + w_3 = 0. The decision function is represented by the following rule:

if w_1 SI + w_2 f_cc + w_3 > 0
then benign mass
else malignant tumor
end

It is seen that the rule given above will correctly classify all of the training samples.
Figure shows the result of application of the linear discriminant function designed as shown in Figure to a test set of benign masses and malignant tumors. The test set does not include any of the cases from the training set. It is seen that the classifier will lead to three false negatives in the test set.
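The design procedure used in this example (class prototypes obtained by averaging, and a linear decision function taken as the perpendicular bisector of the line joining them) can be sketched as follows. The feature values below are invented for illustration; they are not the (f_cc, SI) measurements discussed above.

```python
import numpy as np

def bisector_classifier(benign, malignant):
    """Design a linear decision function as the perpendicular bisector
    of the straight line joining the two class prototypes (means)."""
    z_b = np.mean(benign, axis=0)     # benign prototype
    z_m = np.mean(malignant, axis=0)  # malignant prototype
    w = z_b - z_m                     # normal to the decision boundary
    b = 0.5 * (np.dot(z_b, z_b) - np.dot(z_m, z_m))
    # d(x) = w^T x - b is > 0 for points closer to z_b than to z_m
    return lambda x: 'benign' if np.dot(w, x) - b > 0 else 'malignant'

# Hypothetical training vectors (feature1, feature2):
benign = np.array([[0.1, 0.1], [0.2, 0.15], [0.15, 0.2]])
malignant = np.array([[0.7, 0.8], [0.8, 0.75], [0.75, 0.9]])

classify = bisector_classifier(benign, malignant)
print(classify([0.12, 0.18]))  # near the benign prototype
print(classify([0.80, 0.85]))  # near the malignant prototype
```

The sign test d(x) > 0 is algebraically equivalent to asking which prototype is nearer, which is the geometric content of the perpendicular-bisector construction.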
Distance functions
Consider M pattern classes represented by their prototype patterns z_1, z_2, ..., z_M. The prototype of a class is typically computed as the average of all the feature vectors belonging to the class. Figure illustrates schematically the prototypes z_1 and z_2 of the two classes shown.
FIGURE
ROIs of breast masses, including benign masses and malignant tumors. The ROIs are arranged in the order of decreasing acutance A. Note that the masses are of widely differing size, but have been scaled to the same size in the illustration. For details regarding the case identifiers, see Figure . Reproduced with permission from H. Alto, R.M. Rangayyan, and J.E.L. Desautels, "Content-based retrieval and analysis of mammographic masses", Journal of Electronic Imaging, in press. © SPIE and IS&T.
FIGURE
Contours of breast masses, including benign masses and malignant tumors. The contours are arranged in the order of increasing f_cc. Note that the masses and their contours are of widely differing size, but have been scaled to the same size in the illustration. For details regarding the case identifiers, see Figure . See also Figure . Reproduced with permission from H. Alto, R.M. Rangayyan, and J.E.L. Desautels, "Content-based retrieval and analysis of mammographic masses", Journal of Electronic Imaging, in press. © SPIE and IS&T.
FIGURE
Plot of the 3D feature-vector space (f_cc, A, F) for the set of masses in Figure : (o) benign masses; malignant tumors. Reproduced with permission from H. Alto, R.M. Rangayyan, and J.E.L. Desautels, "Content-based retrieval and analysis of mammographic masses", Journal of Electronic Imaging, in press. © SPIE and IS&T.
FIGURE
Plot of the 3D feature-vector space (f_cc, cf, SI) for the set of contours in Figure : (o) benign masses; malignant tumors. Figure courtesy of H. Alto.
The Euclidean distance between an arbitrary pattern vector x and the i-th prototype is given as

D_i = ||x − z_i|| = [(x − z_i)^T (x − z_i)]^{1/2}.

A simple rule to classify the pattern vector x would be to choose that class for which the vector has the smallest distance:

if D_i < D_j for all j ≠ i, then x ∈ C_i.

See Section for the description of an application of the Euclidean distance to the analysis of breast masses and tumors.
A simple relationship may be established between discriminant functions and distance functions as follows:

D_i^2 = ||x − z_i||^2 = (x − z_i)^T (x − z_i)
      = x^T x − 2 x^T z_i + z_i^T z_i
      = x^T x − 2 [x^T z_i − (1/2) z_i^T z_i].

Choosing the minimum of D_i is equivalent to choosing the minimum of D_i^2, because all D_i ≥ 0. Furthermore, from the equation above, it follows
FIGURE
Plot of the 2D feature-vector space (f_cc, SI) for the training set of benign masses (o) and malignant tumors (x), selected from the dataset in Figure . The prototypes of the two classes are indicated by the vectors marked B and M. The solid line shown is a linear decision function, obtained as the perpendicular bisector of the straight line joining the two prototypes (dashed line).
FIGURE
Plot of the 2D feature-vector space (f_cc, SI) for the test set of benign masses (o) and malignant tumors (x), selected from the dataset in Figure . The solid line shown is a linear decision function, designed as illustrated in Figure . Three malignant cases are misclassified by the decision function shown.
that choosing the minimum of D_i^2 is equivalent to choosing the maximum of [x^T z_i − (1/2) z_i^T z_i]. Therefore, we may define the decision function

d_i(x) = x^T z_i − (1/2) z_i^T z_i,  i = 1, 2, ..., M.
A decision rule may then be stated as:

if d_i(x) > d_j(x) for all j ≠ i, then x ∈ C_i.

This is a linear discriminant function, which becomes obvious from the following representation: if z_ij, j = 1, 2, ..., n, are the components of z_i, let

w_ij = z_ij, j = 1, 2, ..., n;  w_{i,n+1} = −(1/2) z_i^T z_i;  and x = [x_1, x_2, ..., x_n, 1]^T.

Then, d_i(x) = w_i^T x, i = 1, 2, ..., M, where w_i = [w_{i1}, w_{i2}, ..., w_{i,n+1}]^T. Therefore, distance functions may be formulated as linear discriminant or decision functions.
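The equivalence derived above can be checked numerically: the minimum-Euclidean-distance rule and the linear decision function d_i(x) = x^T z_i − (1/2) z_i^T z_i produce the same class assignments. The prototypes below are invented for illustration.

```python
import numpy as np

prototypes = {  # hypothetical class prototypes z_i
    'C1': np.array([0.0, 0.0]),
    'C2': np.array([1.0, 1.0]),
    'C3': np.array([2.0, 0.0]),
}

def nearest_prototype(x):
    """Assign x to the class whose prototype has the smallest
    Euclidean distance D_i = ||x - z_i||."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

def linear_discriminant(x):
    """Assign x using d_i(x) = x^T z_i - (1/2) z_i^T z_i; the maximum
    of d_i corresponds to the minimum of the distance D_i."""
    return max(prototypes,
               key=lambda c: np.dot(x, prototypes[c])
                             - 0.5 * np.dot(prototypes[c], prototypes[c]))

x = np.array([0.9, 0.8])
print(nearest_prototype(x), linear_discriminant(x))  # both give C2
```

Both rules agree for every input, since maximizing d_i(x) differs from minimizing D_i^2 only by terms that do not depend on the class index i.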
The nearest-neighbor rule
Suppose that we are provided with a set of N sample patterns {s_1, s_2, ..., s_N} of known classification: each pattern belongs to one of M classes {C_1, C_2, ..., C_M}, with N >> M. We are then given a new feature vector x whose class needs to be determined. Let us compute a distance measure D(s_i, x) between the vector x and each sample pattern. Then, the nearest-neighbor rule states that the vector x is to be assigned to the class of the sample that is the closest to x:

x ∈ C_i if D(s_i, x) = min {D(s_l, x)},  l = 1, 2, ..., N.

A major disadvantage of the above method is that the classification decision is made based upon a single sample vector of known classification. The nearest neighbor may happen to be an outlier that is not representative of its class. It would be more reliable to base the classification upon several samples: we may consider a certain number k of the nearest neighbors of the sample to be classified, and then seek a majority opinion. This leads to the so-called k-nearest-neighbor or k-NN rule: Determine the k nearest neighbors of x, and use the majority of equal classifications in this group as the classification of x. See Section for the description of an application of the k-NN method to the analysis of breast masses and tumors.
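A minimal sketch of the k-NN rule described above, using the Euclidean distance and a majority vote; the labeled samples are invented for illustration.

```python
import numpy as np
from collections import Counter

def knn_classify(x, samples, labels, k=3):
    """k-nearest-neighbor rule: find the k samples closest to x and
    return the majority class among them."""
    d = np.linalg.norm(samples - x, axis=1)      # distances D(s_i, x)
    nearest = np.argsort(d)[:k]                  # indices of k nearest
    votes = Counter(labels[i] for i in nearest)  # majority opinion
    return votes.most_common(1)[0][0]

# Hypothetical labeled samples of two classes:
samples = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
labels = ['C1', 'C1', 'C1', 'C2', 'C2', 'C2']

print(knn_classify(np.array([0.15, 0.15]), samples, labels, k=3))  # C1
print(knn_classify(np.array([1.05, 0.95]), samples, labels, k=3))  # C2
```

Choosing an odd k avoids ties in two-class problems; k = 1 reduces the procedure to the plain nearest-neighbor rule.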
Unsupervised Pattern Classification
Let us consider the situation where we are given a set of feature vectors with no categorization or classes attached to them. No prior training information is available. How may we group the vectors into multiple categories?
The design of distance functions and decision boundaries requires a training set of feature vectors of known classes. The functions so designed may then be applied to a new set of feature vectors or samples to perform pattern classification. Such a procedure is known as supervised pattern classification, due to the initial training step. In some situations, a training step may not be possible, and we may be required to classify a given set of feature vectors into either a prespecified or unknown number of categories. Such a problem is labeled as unsupervised pattern classification, and may be solved by cluster-seeking methods.
Cluster-seeking methods
Given a set of feature vectors, we may examine them for the formation of inherent groups or clusters. This is a simple task in the case of 2D vectors, where we may plot them, visually identify groups, and label each group with a pattern class. Allowance may have to be made to assign the same class to multiple disjoint groups. Such an approach may be used even when the number of classes is not known at the outset. When the vectors have a dimension higher than three, visual analysis will not be feasible. It then becomes necessary to define criteria to group the given vectors on the basis of similarity, dissimilarity, or distance measures. A few examples of such measures are described below.
Euclidean distance:

D_E = ||x − z|| = [(x − z)^T (x − z)]^{1/2} = [Σ_{i=1}^{n} (x_i − z_i)^2]^{1/2}.

Here, x and z are two feature vectors; the latter could be a class prototype, if available. A small value of D_E indicates greater similarity between the two vectors than a large value of D_E.

Manhattan or city-block distance:

D_C = Σ_{i=1}^{n} |x_i − z_i|.

The Manhattan distance is the shortest path between x and z, with each segment being parallel to a coordinate axis.

Mahalanobis distance:

D_M = (x − m)^T C^{-1} (x − m),

where x is a feature vector being compared to a pattern class for which m is the class mean vector and C is the covariance matrix. A small value of D_M indicates a higher potential membership of the vector x in the class than a large value of D_M. See Section for the description of an application of the Mahalanobis distance to the analysis of breast masses and tumors.

Normalized dot product (cosine of the angle between the vectors x and z):

D_d = (x^T z) / (||x|| ||z||).

A large dot product value indicates a greater degree of similarity between the two vectors than a small value.
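The four measures listed above may be computed as follows; the vectors used are arbitrary examples, and the covariance matrix passed to the Mahalanobis measure is an assumed one.

```python
import numpy as np

def euclidean(x, z):
    """D_E = sqrt( sum (x_i - z_i)^2 )."""
    return float(np.sqrt(np.sum((x - z) ** 2)))

def manhattan(x, z):
    """D_C = sum |x_i - z_i| (city-block distance)."""
    return float(np.sum(np.abs(x - z)))

def mahalanobis(x, m, C):
    """D_M = (x - m)^T C^{-1} (x - m), as defined in the text."""
    diff = x - m
    return float(diff @ np.linalg.inv(C) @ diff)

def normalized_dot(x, z):
    """D_d = x^T z / (||x|| ||z||), the cosine of the angle."""
    return float(np.dot(x, z) / (np.linalg.norm(x) * np.linalg.norm(z)))

x = np.array([1.0, 2.0])
z = np.array([4.0, 6.0])
print(euclidean(x, z))       # 5.0
print(manhattan(x, z))       # 7.0
print(normalized_dot(x, z))  # about 0.992
```

With an identity covariance matrix, the Mahalanobis measure reduces to the squared Euclidean distance, which is a convenient sanity check.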
The covariance matrix is defined as

C = E[(y − m)(y − m)^T],

where the expectation operation is performed over all feature vectors y that belong to the class. The covariance matrix provides the covariance of all possible pairs of the features in the feature vector, over all samples belonging to the given class being considered. The elements along the main diagonal of the covariance matrix provide the variance of the individual features that make up the feature vector. The covariance matrix represents the scatter of the features that belong to the given class. The mean and covariance need to be updated as more samples are added to a given class in a clustering procedure.
When the Mahalanobis distance needs to be calculated between a sample vector and a number of classes represented by their mean and covariance matrices, a pooled covariance matrix may be used if the numbers of members in the various classes are unequal and low. If the covariance matrices of two classes are C_1 and C_2, and the numbers of members in the two classes are N_1 and N_2, the pooled covariance matrix is given by

C = (N_1 C_1 + N_2 C_2) / (N_1 + N_2).
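The pooled covariance matrix defined above can be computed as follows, with invented class covariance matrices and membership counts.

```python
import numpy as np

def pooled_covariance(C1, C2, N1, N2):
    """Pooled covariance matrix C = (N1*C1 + N2*C2) / (N1 + N2),
    as given in the text for two classes of unequal, low membership."""
    return (N1 * C1 + N2 * C2) / (N1 + N2)

# Hypothetical per-class covariance matrices and member counts:
C1 = np.array([[1.0, 0.2], [0.2, 1.0]])
C2 = np.array([[2.0, 0.0], [0.0, 2.0]])
C = pooled_covariance(C1, C2, N1=10, N2=30)
print(C)  # [[1.75, 0.05], [0.05, 1.75]]
```

The pooled matrix is simply a membership-weighted average, so the class with more members dominates the result.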
Various performance indices may be designed to measure the success of a clustering procedure. A measure of the tightness of a cluster is the sum of the squared errors performance index:

J = Σ_{j=1}^{N_c} Σ_{x ∈ S_j} ||x − m_j||^2,

where N_c is the number of cluster domains, S_j is the set of samples in the j-th cluster,

m_j = (1/N_j) Σ_{x ∈ S_j} x

is the sample mean vector of S_j, and N_j is the number of samples in S_j.
A few other examples of performance indices are:
- average of the squared distances between the samples in a cluster domain,
- intracluster variance,
- average of the squared distances between the samples in different cluster domains,
- intercluster distances,
- scatter matrices, and
- covariance matrices.
A simple cluster-seeking algorithm: Suppose we have N sample patterns {x_1, x_2, ..., x_N}.
1. Let the first cluster center z_1 be equal to any one of the samples, say z_1 = x_1. Choose a nonnegative threshold θ.
2. Compute the distance D_21 between x_2 and z_1. If D_21 < θ, assign x_2 to the domain (class) of cluster center z_1; otherwise, start a new cluster with its center as z_2 = x_2. For the subsequent steps, let us assume that a new cluster with center z_2 has been established.
3. Compute the distances D_31 and D_32 from the next sample x_3 to z_1 and z_2, respectively. If D_31 and D_32 are both greater than θ, start a new cluster with its center as z_3 = x_3; otherwise, assign x_3 to the domain of the closer cluster.
4. Continue to apply Steps 2 and 3 by computing and checking the distance from every new (unclassified) pattern vector to every established cluster center, and applying the assignment or cluster-creation rule.
5. Stop when every given pattern vector has been assigned to a cluster.

Observe that the procedure does not require a priori knowledge of the number of classes. Recognize also that the procedure does not assign a real-world class to each cluster: it merely groups the given vectors into disjoint clusters. A subsequent step is required to label each cluster with a class related to the actual problem. Multiple clusters may relate to the same real-world class, and may have to be merged.
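The steps of the simple cluster-seeking algorithm can be sketched as follows; the threshold value and the sample vectors are invented for illustration.

```python
import numpy as np

def simple_clustering(samples, theta):
    """Threshold-based cluster seeking: assign each sample to the
    nearest existing cluster center if it lies within theta of it;
    otherwise, make the sample a new cluster center."""
    centers = [samples[0]]           # Step 1: first sample is z_1
    labels = [0]
    for x in samples[1:]:
        d = [np.linalg.norm(x - z) for z in centers]
        if min(d) < theta:           # within threshold: assign
            labels.append(int(np.argmin(d)))
        else:                        # otherwise: new cluster center
            centers.append(x)
            labels.append(len(centers) - 1)
    return centers, labels

samples = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
centers, labels = simple_clustering(samples, theta=1.0)
print(len(centers), labels)  # 2 clusters, labels [0, 0, 1, 1]
```

Rerunning with a different sample order or threshold can change the outcome, which is exactly the sensitivity noted in the text.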
A major disadvantage of the simple cluster-seeking algorithm is that the results depend upon:
- the first cluster center chosen for each domain or class,
- the order in which the sample patterns are considered,
- the value of the threshold θ, and
- the geometrical properties or distributions of the data (that is, the feature-vector space).
The maximin-distance clustering algorithm: This method is similar to the previous simple algorithm, but first identifies the cluster regions that are the farthest apart. The term maximin refers to the combined use of maximum and minimum distances between the given vectors and the centers of the clusters already formed.
1. Let x_1 be the first cluster center, z_1.
2. Determine the farthest sample from x_1, and label it as cluster center z_2.
3. Compute the distance from each remaining sample to z_1 and to z_2. For every pair of these computations, save the minimum distance, and select the maximum of the minimum distances. If this maximin distance is an appreciable fraction of the distance between the cluster centers z_1 and z_2, label the corresponding sample as a new cluster center z_3; otherwise, stop forming new clusters and go to Step 5.
4. If a new cluster center was formed in Step 3, repeat Step 3, using a "typical" or the average distance between the established cluster centers for comparison.
5. Assign each remaining sample to the domain of its nearest cluster center.
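A sketch of the maximin-distance procedure described above. The choice of "appreciable fraction" (one half here) and the sample vectors are assumptions for illustration.

```python
import numpy as np

def maximin_clustering(samples, fraction=0.5):
    """Maximin-distance clustering: create new cluster centers where
    the maximum of the per-sample minimum distances to the existing
    centers is an appreciable fraction of the typical (average)
    inter-center distance; then assign samples to nearest centers."""
    centers = [samples[0]]                       # Step 1: z_1 = x_1
    d0 = [np.linalg.norm(x - centers[0]) for x in samples]
    centers.append(samples[int(np.argmax(d0))])  # Step 2: farthest sample
    while True:
        # minimum distance from each sample to the established centers
        dmin = [min(np.linalg.norm(x - z) for z in centers)
                for x in samples]
        i = int(np.argmax(dmin))                 # maximin candidate
        # average distance between established cluster centers
        pair_d = [np.linalg.norm(a - b)
                  for k, a in enumerate(centers) for b in centers[k + 1:]]
        if dmin[i] > fraction * np.mean(pair_d):
            centers.append(samples[i])           # new cluster center
        else:
            break                                # stop forming clusters
    # assign each sample to the domain of its nearest cluster center
    labels = [int(np.argmin([np.linalg.norm(x - z) for z in centers]))
              for x in samples]
    return centers, labels

samples = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0],
                    [5.1, 5.2], [0.0, 5.0]])
centers, labels = maximin_clustering(samples)
print(len(centers))  # three well-separated groups are found
```

Unlike the simple threshold algorithm, the first two centers are guaranteed to be far apart, which makes the result less sensitive to the ordering of the samples.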
The K-means algorithm: The preceding simple and maximin algorithms are intuitive procedures. The K-means algorithm is based on iterative minimization of a performance index that is defined as the sum of the squared distances from all points in a cluster domain to the cluster center.
1. Choose K initial cluster centers z_1(1), z_2(1), ..., z_K(1); K is the number of clusters to be formed. The choice of the cluster centers is arbitrary, and could be the first K of the feature vectors available. The index in parentheses represents the iteration number.
2. At the k-th iterative step, distribute the samples {x} among the K cluster domains, using the relation

x ∈ S_j(k) if ||x − z_j(k)|| < ||x − z_i(k)|| for all i = 1, 2, ..., K, i ≠ j,

where S_j(k) denotes the set of samples whose cluster center is z_j(k).
3. From the results of Step 2, compute the new cluster centers z_j(k+1), j = 1, 2, ..., K, such that the sum of the squared distances from all points in S_j(k) to the new cluster center is minimized. In other words, the new cluster center z_j(k+1) is computed so that the performance index

J_j = Σ_{x ∈ S_j(k)} ||x − z_j(k+1)||^2,  j = 1, 2, ..., K,

is minimized. The z_j(k+1) that minimizes this performance index is simply the sample mean of S_j(k). Therefore, the new cluster center is given by

z_j(k+1) = [1/N_j(k)] Σ_{x ∈ S_j(k)} x,  j = 1, 2, ..., K,

where N_j(k) is the number of samples in S_j(k). The name "K-means" is derived from the manner in which cluster centers are sequentially updated.
4. If z_j(k+1) = z_j(k) for j = 1, 2, ..., K, the algorithm has converged: terminate the procedure. Otherwise, go to Step 2.

The behavior of the K-means algorithm is influenced by:
- the number of cluster centers specified (K),
- the choice of the initial cluster centers,
- the order in which the sample patterns are considered, and
- the geometrical properties or distributions of the data (that is, the feature-vector space).
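The K-means steps given above can be sketched as follows, using the first K samples as the initial cluster centers; the sample vectors are invented for illustration.

```python
import numpy as np

def k_means(samples, K, max_iter=100):
    """K-means: iteratively assign samples to the nearest cluster
    center (Step 2) and recompute each center as the sample mean of
    its cluster (Step 3), until the centers stop changing (Step 4).
    The first K samples serve as the initial centers (Step 1)."""
    centers = samples[:K].copy()
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest center
        d = np.linalg.norm(samples[:, None, :] - centers[None, :, :],
                           axis=2)
        labels = np.argmin(d, axis=1)
        # Step 3: recompute each center as the mean of its samples
        new_centers = np.array([samples[labels == j].mean(axis=0)
                                for j in range(K)])
        if np.allclose(new_centers, centers):    # Step 4: convergence
            break
        centers = new_centers
    return centers, labels

samples = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])
centers, labels = k_means(samples, K=2)
print(labels)  # the two tight groups are separated
```

Each iteration cannot increase the sum-of-squared-errors index J, which is why the procedure converges; the result, however, still depends on the initial centers, as the list above notes.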
Example: Figures to show four cluster plots of the shape factors f_cc and SI of the breast mass contours shown in Figure (see Section for details). Although the categories of the samples would be unknown in a practical situation, the samples are identified in the plots, with the '+' symbol for the malignant tumors and a distinct symbol for the benign masses. The categorization represents the ground truth, or true classification, of the samples based upon biopsy.
The plots in Figures to show the progression of the K-means algorithm from its initial state to the converged state, with K = 2 in this example, representing the benign and malignant categories. The only prior knowledge or assumption used is that the samples are to be split into two clusters; that is, there are two classes. Figure shows two samples selected to represent the cluster centers, marked with the diamond and asterisk symbols. The straight line indicates the decision boundary, which is the perpendicular bisector of the straight line joining the two cluster centers. The K-means algorithm converged in this case at the fifth iteration; that is, there was no change in the cluster centers after the fifth iteration. The final decision boundary results in the misclassification of four of the malignant samples as being benign. It is interesting to note that, even though the two initial cluster centers belong to the benign category, the algorithm has converged to a useful solution. See Section for examples of application of other pattern classification techniques to the same dataset.
Probabilistic Models and Statistical Decision
Pattern classification methods such as discriminant functions are dependent upon the set of training samples provided. Their success, when applied to new cases, will depend upon the accuracy of the representation of the various pattern classes by the training samples. How can we design pattern classification techniques that are independent of specific training samples, and are optimal in a broad sense?
Probability functions and probabilistic models may be developed to represent the occurrence and statistical attributes of classes of patterns. Such functions may be based upon large collections of data, historical records, or mathematical models of pattern generation. In the absence of information as above, a training step with samples of known categorization will be required to estimate the required model parameters. It is common practice to assume a Gaussian PDF to represent the distribution of the features for each class, and estimate the required mean and variance parameters from the training sets.
When PDFs are available to characterize pattern classes and their features, optimal decision functions may be designed based upon statistical functions and decision theory. The following subsections describe a few methods in this category.
Likelihood functions and statistical decision
Let $P(C_i)$ be the probability of occurrence of class $C_i$, $i = 1, 2, \ldots, M$; this is known as the a priori, prior, or unconditional probability. The a posteriori or posterior probability that an observed sample pattern $\mathbf{x}$ came from $C_i$ is expressed as $P(C_i|\mathbf{x})$. If a classifier decides that $\mathbf{x}$ comes from $C_j$ when it actually came from $C_i$, the classifier is said to incur a loss $L_{ij}$, with $L_{ii} = 0$ or a fixed operational cost, and $L_{ij} > L_{ii}$ for $j \neq i$.

Because $\mathbf{x}$ may belong to any one of the $M$ classes under consideration, the expected loss, known as the conditional average risk or loss, in assigning $\mathbf{x}$ to $C_j$ is

$$R_j(\mathbf{x}) = \sum_{i=1}^{M} L_{ij}\, P(C_i|\mathbf{x}).$$
FIGURE: Initial state of the K-means algorithm. The symbols in the cluster plot represent the 2D feature vectors $(f_{cc}, SI)$ for benign and malignant breast masses. See Figure for the contours of the masses. The cluster centers (class means) are indicated by the solid diamond and the # symbols. The straight line indicates the decision boundary between the two classes. Figure courtesy of F.J. Ayres.

FIGURE: Second iteration of the K-means algorithm. Details as in the figure above. Figure courtesy of F.J. Ayres.

FIGURE: Third iteration of the K-means algorithm. Details as in the figure above. Figure courtesy of F.J. Ayres.

FIGURE: Fourth iteration of the K-means algorithm. Details as in the figure above. Figure courtesy of F.J. Ayres.

FIGURE: Final state of the K-means algorithm, after the fifth iteration. Details as in the figure above. Figure courtesy of F.J. Ayres.
A classifier could compute $R_j(\mathbf{x})$, $j = 1, 2, \ldots, M$, for each sample $\mathbf{x}$, and then assign $\mathbf{x}$ to the class with the smallest conditional loss. Such a classifier will minimize the total expected loss over all decisions, and is called the Bayes classifier. From a statistical point of view, the Bayes classifier represents the optimal classifier.
According to Bayes rule, we have

$$P(C_i|\mathbf{x}) = \frac{P(C_i)\, p(\mathbf{x}|C_i)}{p(\mathbf{x})},$$

where $p(\mathbf{x}|C_i)$ is called the likelihood function of class $C_i$ or the state-conditional PDF of $\mathbf{x}$, and $p(\mathbf{x})$ is the PDF of $\mathbf{x}$ regardless of class membership (unconditional). [Note: $P(y)$ is used to represent the probability of occurrence of an event $y$; $p(y)$ is used to represent the PDF of a random variable $y$. Probabilities and PDFs involving multidimensional feature vectors are multivariate functions with dimension equal to that of the feature vector.] Bayes rule shows how observing the sample $\mathbf{x}$ changes the a priori probability $P(C_i)$ to the a posteriori probability $P(C_i|\mathbf{x})$; in other words, Bayes rule provides a mechanism to update the a priori probability $P(C_i)$ to the a posteriori probability $P(C_i|\mathbf{x})$ due to the observation of the sample $\mathbf{x}$. Then, we can express the expected loss as

$$R_j(\mathbf{x}) = \frac{1}{p(\mathbf{x})} \sum_{i=1}^{M} L_{ij}\, p(\mathbf{x}|C_i)\, P(C_i).$$

Because $\frac{1}{p(\mathbf{x})}$ is common for all $j$, we could modify $R_j(\mathbf{x})$ to

$$r_j(\mathbf{x}) = \sum_{i=1}^{M} L_{ij}\, p(\mathbf{x}|C_i)\, P(C_i).$$
In a two-class case, with $M = 2$, we obtain the following expressions:

$$r_1(\mathbf{x}) = L_{11}\, p(\mathbf{x}|C_1)\, P(C_1) + L_{21}\, p(\mathbf{x}|C_2)\, P(C_2),$$

$$r_2(\mathbf{x}) = L_{12}\, p(\mathbf{x}|C_1)\, P(C_1) + L_{22}\, p(\mathbf{x}|C_2)\, P(C_2).$$

We have $\mathbf{x} \in C_1$ if $r_1(\mathbf{x}) < r_2(\mathbf{x})$; that is, $\mathbf{x} \in C_1$ if

$$L_{11}\, p(\mathbf{x}|C_1)\, P(C_1) + L_{21}\, p(\mathbf{x}|C_2)\, P(C_2) < L_{12}\, p(\mathbf{x}|C_1)\, P(C_1) + L_{22}\, p(\mathbf{x}|C_2)\, P(C_2),$$

or, equivalently, $\mathbf{x} \in C_1$ if

$$(L_{21} - L_{22})\, p(\mathbf{x}|C_2)\, P(C_2) < (L_{12} - L_{11})\, p(\mathbf{x}|C_1)\, P(C_1).$$

This expression may be rewritten as: $\mathbf{x} \in C_1$ if

$$\frac{p(\mathbf{x}|C_1)}{p(\mathbf{x}|C_2)} > \frac{P(C_2)}{P(C_1)}\, \frac{L_{21} - L_{22}}{L_{12} - L_{11}}.$$

The left-hand side of the inequality above, which is a ratio of two likelihood functions, is often referred to as the likelihood ratio:

$$l_{12}(\mathbf{x}) = \frac{p(\mathbf{x}|C_1)}{p(\mathbf{x}|C_2)}.$$

Then, Bayes decision rule for $M = 2$ is:

1. Assign $\mathbf{x}$ to class $C_1$ if $l_{12}(\mathbf{x}) > \theta_{12}$, where $\theta_{12}$ is a threshold given by

$$\theta_{12} = \frac{P(C_2)}{P(C_1)}\, \frac{L_{21} - L_{22}}{L_{12} - L_{11}}.$$

2. Assign $\mathbf{x}$ to class $C_2$ if $l_{12}(\mathbf{x}) < \theta_{12}$.

3. Make an arbitrary or heuristic decision if $l_{12}(\mathbf{x}) = \theta_{12}$.
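As a numerical sketch of this two-class rule, assuming hypothetical univariate Gaussian likelihoods (all densities, priors, and losses below are invented for illustration):

```python
import math

def gaussian_pdf(x, m, s):
    """Univariate normal density with mean m and standard deviation s."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (math.sqrt(2 * math.pi) * s)

def bayes_decide(x, p1, p2, L11=0.0, L22=0.0, L12=1.0, L21=1.0,
                 P1=0.5, P2=0.5):
    """Two-class Bayes rule: compare the likelihood ratio l12(x) to theta12."""
    l12 = p1(x) / p2(x)                            # likelihood ratio
    theta = (P2 / P1) * (L21 - L22) / (L12 - L11)  # decision threshold
    return 1 if l12 > theta else 2                 # ties resolved as class 2 here

# Hypothetical class-conditional PDFs: C1 ~ N(0, 1), C2 ~ N(3, 1).
p1 = lambda x: gaussian_pdf(x, 0.0, 1.0)
p2 = lambda x: gaussian_pdf(x, 3.0, 1.0)

print(bayes_decide(0.5, p1, p2))  # sample near the C1 mean -> 1
print(bayes_decide(2.8, p1, p2))  # sample near the C2 mean -> 2
```

With equal priors and the 0-1 losses used here, $\theta_{12} = 1$, so the rule reduces to picking the class with the larger likelihood; changing the losses or priors shifts the threshold.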
The rule may be generalized to the $M$-class case as:

$$\mathbf{x} \in C_i \quad \text{if} \quad \sum_{k=1}^{M} L_{ki}\, p(\mathbf{x}|C_k)\, P(C_k) < \sum_{q=1}^{M} L_{qj}\, p(\mathbf{x}|C_q)\, P(C_q), \quad j = 1, 2, \ldots, M;\ j \neq i.$$
In most pattern classification problems, the loss is nil for correct decisions. The loss could be assumed to be equal to a certain nonzero quantity for all erroneous decisions. Then, $L_{ij} = 1 - \delta_{ij}$, where

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases}$$

and

$$r_j(\mathbf{x}) = \sum_{i=1}^{M} (1 - \delta_{ij})\, p(\mathbf{x}|C_i)\, P(C_i) = p(\mathbf{x}) - p(\mathbf{x}|C_j)\, P(C_j),$$

because

$$\sum_{i=1}^{M} p(\mathbf{x}|C_i)\, P(C_i) = p(\mathbf{x}).$$
The Bayes classifier will assign a pattern $\mathbf{x}$ to class $C_i$ if

$$p(\mathbf{x}) - p(\mathbf{x}|C_i)\, P(C_i) < p(\mathbf{x}) - p(\mathbf{x}|C_j)\, P(C_j), \quad j = 1, 2, \ldots, M;\ j \neq i;$$

that is,

$$\mathbf{x} \in C_i \quad \text{if} \quad p(\mathbf{x}|C_i)\, P(C_i) > p(\mathbf{x}|C_j)\, P(C_j), \quad j = 1, 2, \ldots, M;\ j \neq i.$$

This is nothing more than using the decision functions

$$d_i(\mathbf{x}) = p(\mathbf{x}|C_i)\, P(C_i), \quad i = 1, 2, \ldots, M,$$

where a pattern $\mathbf{x}$ is assigned to class $C_i$ if $d_i(\mathbf{x}) > d_j(\mathbf{x})$ for all $j \neq i$ for that pattern. Using Bayes rule, we get

$$d_i(\mathbf{x}) = P(C_i|\mathbf{x})\, p(\mathbf{x}), \quad i = 1, 2, \ldots, M.$$

Because $p(\mathbf{x})$ does not depend upon the class index $i$, this can be reduced to

$$d_i(\mathbf{x}) = P(C_i|\mathbf{x}), \quad i = 1, 2, \ldots, M.$$

The different decision functions given above provide alternative, yet equivalent, approaches, depending upon whether $p(\mathbf{x}|C_i)$ or $P(C_i|\mathbf{x})$ is used or available. The estimation of $p(\mathbf{x}|C_i)$ would require a training set for each class $C_i$. It is common to assume a Gaussian distribution and estimate its mean and variance using the training set.
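A minimal sketch of this practice, using made-up univariate training samples for two classes; the data and priors are assumptions for illustration only:

```python
import math

def fit_gaussian(samples):
    """Estimate the mean and variance of a 1D Gaussian from a training set."""
    n = len(samples)
    m = sum(samples) / n
    var = sum((s - m) ** 2 for s in samples) / n
    return m, var

def likelihood(x, m, var):
    """Gaussian PDF value p(x | C_i) for the fitted parameters."""
    return math.exp(-0.5 * (x - m) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Hypothetical training sets for classes C1 and C2, and assumed priors.
train = {1: [0.1, 0.2, 0.15, 0.25], 2: [0.7, 0.8, 0.75, 0.85]}
prior = {1: 0.5, 2: 0.5}
params = {c: fit_gaussian(xs) for c, xs in train.items()}

def classify(x):
    """Assign x to the class maximizing d_i(x) = p(x|C_i) P(C_i)."""
    return max(params, key=lambda c: likelihood(x, *params[c]) * prior[c])

print(classify(0.2), classify(0.9))  # -> 1 2
```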
Bayes classifier for normal patterns
The univariate normal or Gaussian PDF for a single random variable $x$ is given by

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - m}{\sigma} \right)^2 \right],$$

which is completely specified by two parameters: the mean

$$m = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx,$$

and the variance

$$\sigma^2 = E[(x - m)^2] = \int_{-\infty}^{\infty} (x - m)^2\, p(x)\, dx.$$

In the case of $M$ pattern classes and pattern vectors $\mathbf{x}$ of dimension $n$ governed by multivariate normal PDFs, we have

$$p(\mathbf{x}|C_i) = \frac{1}{(2\pi)^{n/2}\, |\mathbf{C}_i|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \mathbf{m}_i)^T \mathbf{C}_i^{-1} (\mathbf{x} - \mathbf{m}_i) \right],$$

$i = 1, 2, \ldots, M$, where each PDF is completely specified by its mean vector $\mathbf{m}_i$ and its $n \times n$ covariance matrix $\mathbf{C}_i$, with

$$\mathbf{m}_i = E_i[\mathbf{x}]$$

and

$$\mathbf{C}_i = E_i[(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T].$$

Here, $E_i[\cdot]$ denotes the expectation operator over the patterns belonging to class $C_i$.

Normal distributions occur frequently in nature, and have the advantage of analytical tractability. A multivariate normal PDF reduces to a product of univariate normal PDFs when the elements of $\mathbf{x}$ are mutually independent, in which case the covariance matrix is a diagonal matrix.
We had earlier formulated the decision functions

$$d_i(\mathbf{x}) = p(\mathbf{x}|C_i)\, P(C_i), \quad i = 1, 2, \ldots, M;$$

see Equation above. Given the exponential in the normal PDF, it is convenient to use

$$d_i(\mathbf{x}) = \ln\left[ p(\mathbf{x}|C_i)\, P(C_i) \right] = \ln p(\mathbf{x}|C_i) + \ln P(C_i),$$

which is equivalent in terms of classification performance, because the natural logarithm ln is a monotonically increasing function. Then,

$$d_i(\mathbf{x}) = \ln P(C_i) - \frac{n}{2} \ln 2\pi - \frac{1}{2} \ln |\mathbf{C}_i| - \frac{1}{2} (\mathbf{x} - \mathbf{m}_i)^T \mathbf{C}_i^{-1} (\mathbf{x} - \mathbf{m}_i),$$

$i = 1, 2, \ldots, M$. The second term does not depend upon $i$; therefore, we can simplify $d_i(\mathbf{x})$ to

$$d_i(\mathbf{x}) = \ln P(C_i) - \frac{1}{2} \ln |\mathbf{C}_i| - \frac{1}{2} (\mathbf{x} - \mathbf{m}_i)^T \mathbf{C}_i^{-1} (\mathbf{x} - \mathbf{m}_i), \quad i = 1, 2, \ldots, M.$$

The decision functions above are hyperquadrics; hence, the best that a Bayes classifier for normal patterns can do is to place a general second-order decision surface between each pair of pattern classes. In the case of true normal distributions of patterns, the decision functions as above will be optimal on an average basis: they minimize the expected loss with the simplified loss function $L_{ij} = 1 - \delta_{ij}$.

If all the covariance matrices are equal, that is, $\mathbf{C}_i = \mathbf{C}$, $i = 1, 2, \ldots, M$, we get

$$d_i(\mathbf{x}) = \ln P(C_i) + \mathbf{x}^T \mathbf{C}^{-1} \mathbf{m}_i - \frac{1}{2} \mathbf{m}_i^T \mathbf{C}^{-1} \mathbf{m}_i, \quad i = 1, 2, \ldots, M,$$

after omitting terms independent of $i$. The Bayesian classifier is now represented by a set of linear decision functions.
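The quadratic decision function above can be evaluated directly; the sketch below uses two hypothetical 2D classes whose means, covariances, and priors are invented for illustration:

```python
import numpy as np

def quadratic_discriminant(x, m, C, prior):
    """d_i(x) = ln P(C_i) - 0.5 ln|C_i| - 0.5 (x-m_i)^T C_i^{-1} (x-m_i)."""
    diff = x - m
    mahal = diff @ np.linalg.inv(C) @ diff  # squared Mahalanobis distance
    return np.log(prior) - 0.5 * np.log(np.linalg.det(C)) - 0.5 * mahal

# Hypothetical class models (e.g., benign vs malignant in a 2D feature space).
m1, C1 = np.array([0.2, 0.2]), np.array([[0.01, 0.0], [0.0, 0.01]])
m2, C2 = np.array([0.7, 0.7]), np.array([[0.04, 0.0], [0.0, 0.04]])
P1, P2 = 0.5, 0.5

def classify(x):
    """Pick the class with the larger log decision function."""
    d1 = quadratic_discriminant(x, m1, C1, P1)
    d2 = quadratic_discriminant(x, m2, C2, P2)
    return 1 if d1 > d2 else 2

print(classify(np.array([0.25, 0.2])), classify(np.array([0.8, 0.6])))  # -> 1 2
```

Because the two covariance matrices here differ, the implied boundary is a second-order curve; making them equal would reduce the discriminants to the linear form given above.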
Before one may apply the decision functions as above, it would be appropriate to verify the Gaussian nature of the PDFs of the variables on hand by conducting statistical tests. Furthermore, it would be necessary to derive or estimate the mean vector and covariance matrix for each class; sample statistics computed from a training set may serve this purpose.
Example: Figure shows plots of Gaussian PDF models applied to the shape factor $f_{cc}$ (fractional concavity) of the breast mass contours shown in Figure (see Chapter and Section for details). The two Gaussians represent the state-conditional PDFs of $f_{cc}$ for the benign and malignant categories. Also shown are the posterior probabilities that the class of a sample is benign or malignant, given an observed value of $f_{cc}$. The posterior probability functions were derived using Bayes rule, as in Equation, with the prior probabilities of the two classes being equal (0.5 each). It is seen that the two posterior probabilities are both equal to 0.5 at the point where the curves cross; beyond this point, the probability of a malignant classification is higher than that of a benign classification. Due to the use of equal prior probabilities, the transition point is the same as the point where the two Gaussian models for the state-conditional PDFs cross each other.
Figure shows the same Gaussian models for the state-conditional PDFs as in the preceding figure; however, the posterior probability functions were derived using a larger prior probability for the benign category, with the prior probability for the malignant category being correspondingly smaller (the two priors sum to unity). In this case, the probability of a malignant classification exceeds that of a benign classification only at a larger value of $f_{cc}$. The prior assumption that most of the masses encountered will be benign has pushed the decision threshold on $f_{cc}$ to a higher value.
Figure illustrates the 2D cluster plot and Gaussian PDF models for the shape factors $f_{cc}$ and $SI$ for the same dataset as described above (see Chapter and Section for details). The decision boundary indicated by the solid line is the optimal boundary under the assumption of 2D Gaussian PDFs for the two features and the two classes. Two malignant samples are misclassified by the decision boundary shown. See Section for examples of the application of other pattern classification techniques to the same dataset.

An interesting point to note from the examples above is that the Gaussian PDF models used are not capable of accommodating the prior knowledge that the shape factors are limited to the range $[0, 1]$. Other PDF models, such as the Rayleigh distribution (see Section), should be used if this aspect is important.
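The effect of the prior on the decision threshold, as in the examples above, can be illustrated numerically; the Gaussian parameters below are invented stand-ins for the benign and malignant $f_{cc}$ models, not the models of the figures:

```python
import math

def gauss(x, m, s):
    """Univariate Gaussian PDF with mean m and standard deviation s."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (math.sqrt(2 * math.pi) * s)

def posterior_benign(x, prior_benign):
    """P(benign | x) via Bayes rule with two Gaussian state-conditional PDFs."""
    pb = gauss(x, 0.2, 0.1) * prior_benign         # hypothetical benign model
    pm = gauss(x, 0.6, 0.15) * (1 - prior_benign)  # hypothetical malignant model
    return pb / (pb + pm)

def threshold(prior_benign):
    """Smallest x (on a coarse grid) at which the malignant posterior dominates."""
    x = 0.0
    while posterior_benign(x, prior_benign) > 0.5:
        x += 0.001
    return x

t_equal = threshold(0.5)   # equal priors
t_benign = threshold(0.9)  # strong prior in favor of benign
# A stronger benign prior pushes the decision threshold to a higher value.
print(t_equal < t_benign)  # -> True
```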
Logistic Regression
Logistic classification is a statistical technique based on a logistic regression model that estimates the probability of occurrence of an event.

The technique is designed for problems where patterns are to be classified into one of two classes. When the response variable is binary, theoretical and empirical considerations indicate that the response function is often curvilinear. The typical response function is shaped as a forward or backward tilted S, and is known as a sigmoidal function. The function has asymptotes at 0 and 1.

FIGURE: Plots of Gaussian state-conditional PDF models for the shape factor $f_{cc}$ for benign (dashed line) and malignant (dotted line) breast masses. See Figure for the contours of the masses. The $f_{cc}$ values of the benign and malignant samples are indicated on the horizontal axis. The posterior probability functions for the benign (solid line) and malignant (dash-dot line) classes are also shown. Equal prior probabilities of 0.5 were used for the two classes. See also the following figure. Figure courtesy of F.J. Ayres.

FIGURE: Same as in the preceding figure, but with a larger prior probability for the benign class. Figure courtesy of F.J. Ayres.

FIGURE: Plots of 2D Gaussian state-conditional PDF models for the shape factors $f_{cc}$ and $SI$ for benign masses (dash-dot line) and malignant tumors (dashed line). See Figure for the contours of the masses. Feature values for benign masses and malignant tumors are indicated by separate symbols. The benign class prototype (mean) is indicated by the solid diamond; that for the malignant class is indicated by the # symbol. The dashed and dash-dot contours indicate two constant-Mahalanobis-distance contours (level sets) each for the two Gaussian PDF models (see Equation above), for the malignant and benign classes, respectively. The solid contour indicates the decision boundary, obtained by setting the decision functions of the two classes equal. The prior probabilities for the two classes were assumed to be equal (0.5 each). Figure courtesy of F.J. Ayres.

In logistic pattern classification, an event is defined as the membership of a pattern vector in one of the two classes of concern. The method computes a variable that depends upon the given parameters and is constrained to the range $(0, 1)$, so that it may be interpreted as a probability. The probability of the pattern vector belonging to the second class is simply the difference between unity and the estimated value.
For the case of a single feature or parameter, the logistic regression model is given as

$$P(\text{event}) = \frac{\exp(b_0 + b_1 x)}{1 + \exp(b_0 + b_1 x)},$$

or, equivalently,

$$P(\text{event}) = \frac{1}{1 + \exp[-(b_0 + b_1 x)]},$$

where $b_0$ and $b_1$ are coefficients estimated from the data, and $x$ is the independent (feature) variable. The relationship between the independent variable and the estimated probability is nonlinear, and follows an S-shaped curve that closely resembles the integral of a Gaussian function. In the case of an $n$-dimensional feature vector $\mathbf{x}$, the model can be written as

$$P(\text{event}) = \frac{1}{1 + \exp(-z)},$$

where $z$ is the linear combination

$$z = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n = \mathbf{b}^T \mathbf{x};$$

that is, $z$ is the dot product of the augmented feature vector $\mathbf{x}$ with a coefficient vector $\mathbf{b}$.
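A minimal sketch of the logistic model; the coefficients below are arbitrary assumptions (in practice, $\mathbf{b}$ would be estimated from training data, for example, by maximum likelihood):

```python
import math

def logistic(x, b):
    """P(event) = 1 / (1 + exp(-z)), with z = b0 + b1*x1 + ... + bn*xn."""
    z = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

b = [-4.0, 10.0]           # hypothetical coefficients for a single feature
print(logistic([0.0], b))  # small feature value -> probability near 0
print(logistic([0.4], b))  # z = 0 -> probability exactly 0.5
print(logistic([1.0], b))  # large feature value -> probability near 1
```

The probability of membership in the other class is obtained, as stated above, as one minus the value returned.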