chapter  16
44 Pages

Algorithmic and Statistical Perspectives on Large-Scale Data Analysis

Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444 16.4 Internet Applications and Novel Graph Algorithms . . . . . . . . . . . . . 445

16.4.1 Motivating Internet Application . . . . . . . . . . . . . . . . . . . . . . . . . 445 16.4.2 A Formalization of and Prior Approaches to This

Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447 16.4.3 A Novel Approach to Characterizing Network Structure 452 16.4.4 Community-Identification Applications of This

Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454 16.4.5 Some General Thoughts on Statistical Issues and Graph

Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457 16.5 Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459

In recent years, motivated in large part by technological advances in both scientific and Internet domains, there has been a great deal of interest in what may be broadly termed large-scale data analysis. In this chapter, I will focus on what seems to me to be a remarkable and inevitable trend in this increasinglyimportant area. This trend has to do with the convergence of two very different perspectives or worldviews on what are appropriate or interesting or fruitful ways to view the data. At the risk of oversimplifying a large body of diverse work, I would like to draw a distinction between what I will loosely term the algorithmic perspective and the statistical perspective on large-scale data analysis problems. By the former, I mean roughly the approach that one trained in computer science might adopt. From this perspective, primary concerns include database issues, algorithmic questions such as models of data access, and the worst-case running time of algorithms for a given objective function. By the latter, I mean roughly the approach that one trained in statistics (or some application area where strong domain-specific assumptions about the data may be made) might adopt. From this perspective, primary concerns include questions such as how well the objective functions being considered conform to the phenomenon under study and whether one can make reliable predictions about the world from the data at hand.