ABSTRACT
Chapter 6 surveys the digital infrastructure that underpins data-intensive research. It begins with data repositories and archives: institutional or discipline-specific platforms that preserve datasets, make them discoverable, and so enable long-term access and reuse. The chapter then distinguishes data warehouses, which store structured data under a predefined schema, from data lakes, which hold raw data in their native formats. This distinction matters because social scientists increasingly encounter unstructured data that do not fit neatly into relational tables. Next, the chapter highlights the role of application programming interfaces (APIs), through which researchers can programmatically retrieve online data, from social media feeds to economic indicators, via web services. It provides examples of using APIs to access real-time data and discusses practical issues such as rate limits and data formats. Finally, the chapter addresses the rise of cloud computing and high-performance computing (HPC) in academia. As datasets grow to terabyte scale, analyses that once ran on a personal computer may require distributed computing frameworks (such as Hadoop or Spark) or cloud and HPC resources. Throughout, the chapter emphasizes that engaging with these digital platforms, from secure data repositories to scalable compute environments, has become integral to the research process, allowing even massive and complex data to be managed and analyzed effectively in the pursuit of social insights.
