Data Quality and Inference Errors

doi:10.1201/9780429324383-10

ABSTRACT

This chapter addresses inference and the errors associated with big data. It focuses on the accuracy of the data and the validity of the inference rather than on other data quality dimensions, such as timeliness, comparability, coherence, and relevance. Its total error framework parses the total error into bias and variance components, which may be further subdivided into subcomponents that map specific types of errors to unique components of the total mean squared error. The most common type of column error in survey data analysis is caused by inaccurate or erroneous labeling of the column data, an example of metadata error. The chapter also examines content errors and considers two types of error, variable errors and correlated errors, the latter a subcategory of systematic errors. The massiveness, high dimensionality, and accelerating pace of data, combined with the risks of variable and systematic data errors, require new, robust approaches to data analysis.
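
The split of total error into bias and variance can be illustrated with a minimal sketch. It assumes the standard identity that mean squared error equals squared bias plus variance. The function name `decompose_mse`, the estimates, and the true value are hypothetical and chosen only for illustration; they are not taken from the chapter.

```python
from statistics import fmean


def decompose_mse(estimates, true_value):
    """Return (bias, variance, mse) for repeated estimates of one quantity."""
    mean_estimate = fmean(estimates)
    # Bias: how far the estimates sit from the truth on average (systematic part).
    bias = mean_estimate - true_value
    # Variance: spread of the estimates around their own mean (variable part).
    variance = fmean((e - mean_estimate) ** 2 for e in estimates)
    # Mean squared error: average squared distance from the true value.
    mse = fmean((e - true_value) ** 2 for e in estimates)
    return bias, variance, mse


# Hypothetical estimates from four repeated measurements of a true value of 10.0.
estimates = [10.5, 11.0, 9.5, 11.0]
true_value = 10.0

bias, variance, mse = decompose_mse(estimates, true_value)
print(f"bias={bias}, variance={variance}, mse={mse}")
print("bias^2 + variance =", bias ** 2 + variance)
```

In this toy case the squared bias and the variance sum to the mean squared error, which is the sense in which each error source can be assigned to a component of the total.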