ABSTRACT

CONTENTS
1.1 Introduction
1.2 Is Data Mining Science?
1.3 Genesis of Data Mining
1.4 The Data Cube and Databases
1.5 Structured Query Language
1.6 Statistical Problems with Data Mining
1.7 Some DM Approaches to Dimension Reduction
1.8 Prior Distributions in Data Mining
1.9 Some New DM Applications
References

This paper presents an overview of Bayesian and frequentist issues that arise in multivariate statistical modeling involving data mining. We discuss data cubes, Structured Query Language (SQL) commands, and the acquisition of data that violate the usual i.i.d. modeling assumptions. We address problems of multivariate exploratory data analysis, the analysis of non-experimental multivariate data, general statistical problems in data mining, high-dimensional issues, graphical models, dimension reduction through conditioning, prediction, Bayesian data mining assuming variable independence, hidden Markov models, and data mining priors. We also discuss some new applications of data mining in the field of Home Security.

1.1 Introduction

This paper presents an overview of statistical and related issues associated with data mining (DM), and some newly recognized applications. We will be particularly concerned with multivariate data. We begin by defining what is meant by DM, and discuss some current and potential applications of the methodology. We also discuss the gaps that exist between the statistical and other tools that have been developed for mining data, and the problems of implementing the results in applications.

Statistical Data Mining and Knowledge Discovery