ABSTRACT

Abundant data and inexpensive computing have supercharged the traditional tools of computer science and statistics. "Big" Data promises to help us transcend our shortcomings by checking our intuitions. Fundamentally, a linear regression is concerned with finding a best-fit line. The author is going to let you in on a secret: most of a data scientist's time is spent cleaning and joining data, sometimes called data wrangling or data munging. The convenience of fitting curves rather than straight lines raises a somewhat obvious challenge to the model's fit. As we imagine increasingly curvy lines, what stops us from fitting a curve that goes right through the center of each of our data points? Judgment. A data scientist would be skeptical of such a fit precisely because it is unlikely to generalize. It's worth noting that our race and sex data don't look like our other data: they are categories, not values on a number scale.
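
The worry about a curve that passes through every data point can be made concrete with a small sketch. The data below are made up for illustration (roughly y = 2x plus noise), and the helper names `fit_line` and `interpolate` are hypothetical, not taken from the text. The sketch compares an ordinary least-squares line with a polynomial that hits every training point exactly, then checks both against one held-out point.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def interpolate(xs, ys, x):
    """Evaluate at x the Lagrange polynomial through every (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Illustrative (made-up) training data, roughly y = 2x plus noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.3, 1.6, 4.5, 5.7, 8.4, 9.8]
# One held-out point from the same rough trend.
test_x, test_y = 6.0, 12.1

a, b = fit_line(train_x, train_y)

# Worst-case error on the data each model was fitted to.
line_train_err = max(abs(a + b * x - y) for x, y in zip(train_x, train_y))
curve_train_err = max(
    abs(interpolate(train_x, train_y, x) - y) for x, y in zip(train_x, train_y)
)

# Error on the point neither model has seen.
line_test_err = abs(a + b * test_x - test_y)
curve_test_err = abs(interpolate(train_x, train_y, test_x) - test_y)

print(f"line : train error {line_train_err:.3f}, test error {line_test_err:.3f}")
print(f"curve: train error {curve_train_err:.3f}, test error {curve_test_err:.3f}")
```

On this toy data the curve has zero training error, yet it predicts roughly -11.4 at x = 6, while the line predicts roughly 12.0. A perfect fit to the points we have says little about the points we have not yet seen, which is the sense in which such a fit fails to generalize.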