ABSTRACT

As organizational researchers, many of us received methods training at a time when datasets were time-consuming and expensive to obtain and, as a result, petite. We coveted our carefully gleaned dozens of survey responses and celebrated our acquisition of hundreds of records from an HR database, but we rarely, if ever, had access to thousands or millions of data records. The evolution of information technology and the Internet, however, has generated the opposite problem: datasets so large and unruly that our usual ways of thinking about data analysis break down. To illustrate, the U.S. federal government has established a site called data.gov, where it publishes the raw data from thousands of agency-conducted studies. One small category with relevance to organizational research, “Labor Force, Employment, and Earnings,” contains 32 data sources; a typical source holds 150,000 records obtained through a careful, representative sampling of U.S. businesses. We might choose to ignore these data sources because the variables are questionable, because there are no multi-item scales, because we feel the dataset will not match our theory very well, or because it is simply too big a job to tackle. Even without a good match to theory or exactly the right variables, however, there might be something important to learn from this or some other large dataset.