ABSTRACT

Machine learning is expected to improve the handling of both structured and unstructured data. Structured data often come from surveys, epidemiological studies, or experiments such as clinical trials. This chapter discusses randomized clinical trials as the gold standard for experimentation. A data scientist is commonly expected to combine subject matter expertise, strong statistical ability, and software engineering acumen. Classical (frequentist) statistics focuses on hypothesis testing with type-I error rate control. Under this paradigm, the factors included in a model must be statistically significant, so the model's predictions are also built on the basis of statistical significance. Modern statistical learning and model selection shift the focus from error rate control to the impact of errors (a decision problem) or to prediction. The chapter discusses important concepts in data science, including internal and external validity, different types of bias, confounding, regression to the mean, multiplicity, and different data sources and structures.