ABSTRACT

Problem Setting and Framework
Toward Distributed Learning in Constrained, Distributed Environments
Heterogeneous Data Sources
Nonstationary Environments
Concluding Remarks
Acknowledgments
Biography
References

In sensor networks, one often wants a global view of the monitored environment based on sensory data gathered simultaneously from geographically dispersed sources. However, such settings often impose constraints, stemming from data ownership or from computational, memory, or power limitations, that prevent all the data from being gathered at a central location where standard data mining tools could be applied. In this chapter we argue that a probabilistic viewpoint can reconcile these conflicting goals and constraints, and we outline a general framework based on this viewpoint that allows efficient (semi-)supervised learning in sensor networks without being substantially affected by the domain constraints. The proposed approach has implications for the design and analysis of future large-scale, distributed sensor networks.

Data mining and pattern recognition algorithms invariably operate on centralized data, usually in the form of a single flat file. But in a sensor network, data is acquired and possibly stored in geographically distributed locations. Centralization of such data before analysis may not be desirable because of computational or bandwidth costs. In some cases, it may not even be possible due to a variety of real-life constraints including security, privacy, or proprietary nature of data/sensors and the accompanying ownership and legal issues. A fundamental issue to be addressed in such situations is how to do meaningful data mining on such distributed data while respecting the constraints on data sharing. Another closely related issue is how to quantify the loss in quality of the mined results because of the imposed restrictions. Note that restrictions will have at least one of these two flavors: (a) the amount of sharable data is restricted, for example, due to bandwidth or energy limitations; or (b) the nature of the shared information may be constrained, for example, actual values of certain attributes cannot be conveyed because of privacy restrictions.
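To make the two flavors of restriction concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the framework developed in this chapter: the names SiteSummary, summarize, drop_private, and combine are hypothetical, the readings are made-up scalar values, and a single Gaussian-style summary (count, mean, variance) stands in for whatever compact description a site is allowed to share.

```python
from dataclasses import dataclass
from statistics import fmean, pvariance


@dataclass
class SiteSummary:
    """Hypothetical message a site shares instead of its raw readings."""
    count: int
    mean: float
    variance: float  # population variance of the local readings


def summarize(readings):
    """Flavor (a): compress local readings into three numbers."""
    return SiteSummary(len(readings), fmean(readings), pvariance(readings))


def drop_private(record, private_keys):
    """Flavor (b): withhold attributes whose values may not be conveyed."""
    return {k: v for k, v in record.items() if k not in private_keys}


def combine(summaries):
    """Recover the global count, mean, and variance from summaries alone."""
    total = sum(s.count for s in summaries)
    mean = sum(s.count * s.mean for s in summaries) / total
    # Squared deviation about the global mean splits into a within-site part
    # (the local variance) and a between-site part (the shift of local means).
    sum_sq_dev = sum(
        s.count * (s.variance + (s.mean - mean) ** 2) for s in summaries
    )
    return total, mean, sum_sq_dev / total


# Illustrative (made-up) readings held at two sites; only summaries are sent.
site_a = [20.1, 20.4, 19.8, 20.0]
site_b = [25.3, 24.9, 25.1]
n, global_mean, global_var = combine([summarize(site_a), summarize(site_b)])
```

In this toy case the combined mean and variance match what centralizing the raw readings would give, so nothing is lost for these two statistics. Richer models generally cannot be recovered exactly from such small summaries, which is where the question of quantifying the loss in quality arises.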