Data accounting

doi:10.1201/9780367854690-6

ABSTRACT

Can we infer the sources of errors from the outputs of complex data analytics software? Bidirectional programming promises that we can reverse the flow of software and translate corrections of output into corrections of either the input or the data analysis. This approach allows us to achieve the holy grail of automated debugging, risk reporting, and large-scale distributed error tracking. Since the processing of risk reports and data analysis pipelines can frequently be expressed as a sequence of relational algebra operations, we propose replacing this traditional approach with a data summarisation algebra that helps determine the impact of errors. It works by defining data analysis as a necessarily complete summarisation of a dataset, possibly in multiple ways along multiple dimensions. We also present a description that communicates how complete summarisations of the input data facilitate easier debugging and more efficient development of analysis pipelines. This approach can also be described as a generalisation of axiomatic theories of accounting to data analytics, hence the name data accounting. We also propose formal properties that allow for transparent assertions about the impact of individual records on the aggregated data, and that ease debugging by making it possible to find minimal changes that alter the behaviour of the data analysis on a per-record basis.
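To make the idea of a complete summarisation concrete, the following is a minimal sketch in Python, not the formalism of the chapter itself. It assumes a hypothetical `Record` type with an `account` dimension and an integer `amount`; the names `summarise`, `combine`, `impact`, and the `UNCLASSIFIED` bucket are illustrative assumptions. The sketch shows two properties in the spirit of the abstract: every record lands in exactly one bucket, so the bucket totals always add up to the grand total, and the impact of a single record on the aggregate can be stated directly.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical input record; not taken from the chapter."""
    account: Optional[str]  # None models a record with a missing dimension
    amount: int             # integer amounts (e.g. cents) keep sums exact


def summarise(records: List[Record]) -> Dict[str, int]:
    """Sum amounts per account. Records without an account are not dropped:
    they are accounted for in an explicit 'UNCLASSIFIED' bucket."""
    totals: Dict[str, int] = defaultdict(int)
    for r in records:
        key = r.account if r.account is not None else "UNCLASSIFIED"
        totals[key] += r.amount
    return dict(totals)


def combine(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Merge two summaries by adding totals bucket by bucket."""
    return {k: a.get(k, 0) + b.get(k, 0) for k in set(a) | set(b)}


def impact(records: List[Record], index: int) -> Dict[str, int]:
    """Change in the summary attributable to records[index]:
    the summary with the record minus the summary without it."""
    full = summarise(records)
    rest = summarise(records[:index] + records[index + 1:])
    diff = {k: full.get(k, 0) - rest.get(k, 0) for k in set(full) | set(rest)}
    return {k: v for k, v in diff.items() if v != 0}


records = [
    Record("sales", 1200),
    Record("sales", -200),
    Record(None, 50),      # malformed record, still accounted for
    Record("fees", 30),
]
summary = summarise(records)

# Completeness: nothing is lost between the input and the summary.
assert sum(summary.values()) == sum(r.amount for r in records)

# Summarising the parts and combining gives the summary of the whole.
assert combine(summarise(records[:2]), summarise(records[2:])) == summary
```

In this sketch, `impact(records, 2)` reports that the malformed record only affects the `UNCLASSIFIED` bucket, which is the kind of per-record assertion that a plain filter-and-aggregate pipeline, silently discarding such rows, would not support.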