ABSTRACT

This chapter presents a single-instruction multiple-thread (SIMT)-based approach to delivering a near real-time record linkage solution for document-oriented databases. Not only structured query language (NoSQL) databases are schema-free, easily replicable, simple to use, and able to handle huge amounts of data. They are especially useful when businesses need to access and analyze massive volumes of unstructured or semi-structured data. The chapter also introduces NoSQL databases and gives definitions of document frequency and distance metrics. It provides background on hash tables, general-purpose graphics processing units (GPGPUs), and the tree data model. GPGPUs, which operate in a single-instruction multiple-thread mode, are advancing the landscape of high-performance computing. The chapter describes the process of record linkage in semi-structured data sets on a GPGPU or GPGPU-like parallel hardware. O. Hassanzadeh et al. propose the SMaSh framework, which discovers linkage points in a scalable, online manner.