Agentic AI for Data Discovery in the Social Sciences and Humanities

doi:10.4324/9781003666530-3

ABSTRACT

Recent advances in large language models have enabled the development of multi-agent systems capable of handling increasingly sophisticated and domain-specific tasks. This chapter explores CORDIAL-AI, a retrieval-augmented, API-driven project that provides natural language access to highly complex and granular UK census flow data. Layered agents parse user intent, retrieve relevant metadata, and generate executable queries that navigate extensive code lists, hierarchical and nested geographies, and multi-dimensional variables. Particular attention is given to explainability, provenance, and metadata integration. Fine-tuning via low-rank adaptation demonstrates that small open-source LLMs can approach the data retrieval performance of proprietary models when provided with sufficient synthetic and expert-annotated training corpora. Preliminary empirical benchmarks show improvements in variable selection, geographic recognition, and API-call accuracy. These findings are situated within broader methodological discussions on digital research infrastructures, emphasising the importance of critical system design, rigorous data curation, and user-centred evaluation in advancing responsible practices in data-intensive research. Systems like CORDIAL-AI make it easier to work with complex datasets, lowering technical barriers and opening new possibilities for research across the social sciences and humanities. In fields where context, interpretation, and transparency matter, such tools can help researchers ask more nuanced questions and more clearly trace how findings are produced.
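The layered flow described above (parse user intent, retrieve metadata, generate an executable query) can be illustrated with a minimal sketch. This is not the CORDIAL-AI implementation: keyword matching stands in for the LLM agents, and every name, code, and endpoint below (`VARIABLES`, `GEOGRAPHIES`, `VAR_COMMUTE`, `AREA_001`, `/flows`) is a hypothetical placeholder rather than a real census code or API route. It also assumes, for simplicity, that the first place mentioned is the origin and the second the destination.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the metadata catalogue; real code lists and
# geography hierarchies are far larger and nested.
VARIABLES = {"commuting": "VAR_COMMUTE", "migration": "VAR_MIGRATION"}
GEOGRAPHIES = {"leeds": "AREA_001", "manchester": "AREA_002"}


@dataclass
class Intent:
    topic: str
    origin: str
    destination: str


def parse_intent(question: str) -> Intent:
    """Layer 1 (stand-in for an LLM agent): extract topic and two places."""
    text = question.lower()
    topic = next((t for t in VARIABLES if t in text), None)
    # Order places by where they appear in the question.
    places = sorted((text.index(p), p) for p in GEOGRAPHIES if p in text)
    if topic is None or len(places) != 2:
        raise ValueError("could not resolve a topic and two places")
    return Intent(topic, places[0][1], places[1][1])


def retrieve_metadata(intent: Intent) -> dict:
    """Layer 2: map the parsed intent onto catalogue codes."""
    return {
        "variable": VARIABLES[intent.topic],
        "origin": GEOGRAPHIES[intent.origin],
        "destination": GEOGRAPHIES[intent.destination],
    }


def build_query(codes: dict) -> dict:
    """Layer 3: assemble an API call description (hypothetical endpoint)."""
    return {"endpoint": "/flows", "params": dict(codes)}


def answer(question: str) -> dict:
    """Run all layers and keep the intermediate steps as provenance."""
    intent = parse_intent(question)
    codes = retrieve_metadata(intent)
    return {
        "query": build_query(codes),
        "provenance": {"intent": intent, "codes": codes},
    }


if __name__ == "__main__":
    print(answer("How many people are commuting from Leeds to Manchester?"))
```

Returning the intermediate intent and codes alongside the query is one simple way to support the provenance and explainability goals mentioned above, since a user can inspect how each part of the question was resolved.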