ABSTRACT

Ensuring data quality is one of the primary tasks of data management. It involves data conflict detection, that is, finding the tuples in a database D that violate a given set of CFDs (conditional functional dependencies). For instance, consider the relational schema R (id, CC, AC, city, area, zip, Tel, title, salary), where id is the employee identifier (the primary key), (CC, AC, Tel) is the contact information and (city, area, zip) is the home address. Table 1 gives an example instance D0 of R, and Figure 1 gives the definitions of the CFDs on R. CFD1 states that when CC = 86, zip determines area; the tuples {ti∈D0|i∈[1,4]} of the instance D0 match the condition of CFD1. CFD3 is a traditional functional dependency stating that, for employees in the same country, title determines salary. CFD4 states that for any employee in CN (the country represented by CC = 86), if AC = 05 then city must be WH. The task in this example is to find the tuples in D0 that violate CFD1-CFD4. Let ti denote the tuple in D0 with id = i; the conflict set then consists of t2-t6. D0 satisfies CFD3, but t2-t4 violate CFD1, because the three tuples have the same zip value (zip = 410) while their area values differ (JH and JJ). Similarly, t5-t6 violate CFD2, and t4 violates CFD4, since CC = 86 and AC = 05 but city≠WH.

When D0 is a centralized database, SQL-based techniques detect such conflicts very effectively [1,2]. In a distributed system, however, a relation is often fragmented and its fragments are assigned to different nodes, so detecting the CFD conflict set over the entire distributed database often requires moving data from one node to another. Reducing the amount of data shipped and the cost of network transmission, and thereby improving system response time, is of considerable practical value. Previous research related to data conflict detection in distributed databases includes: local validation (i.e., conflict detection without moving data) and handling inconsistencies caused by updates by means of triggers [3]; discovering CFD conflicts [4]; repairing data using CFDs [5]; propagating CFDs through views; and detecting constraint conflicts in distributed database systems with the goal of minimizing communication cost [7]. However, none of these can be applied directly to CFD conflict detection in a distributed database.
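The violation checks described for CFD1 and CFD4 can be illustrated with a minimal centralized sketch in Python. This is not the detection method of the cited work. The rows below are hypothetical stand-ins rather than the contents of Table 1: only the values named in the text (zip = 410, areas JH and JJ, CC = 86, AC = 05, city WH) come from the example, and every other value, along with the helper names `matches` and `violations`, is an assumption made for illustration.

```python
from collections import defaultdict

WILDCARD = "_"  # pattern entry that matches any value


def matches(row, attrs, pattern):
    """True if the row agrees with every constant in the pattern."""
    return all(p == WILDCARD or row[a] == p for a, p in zip(attrs, pattern))


def violations(rows, lhs, lhs_pattern, rhs, rhs_pattern=WILDCARD):
    """Return the sorted ids of rows violating the CFD (lhs -> rhs, pattern)."""
    candidates = [r for r in rows if matches(r, lhs, lhs_pattern)]

    # Constant right-hand side: each matching row is checked on its own.
    if rhs_pattern != WILDCARD:
        return sorted(r["id"] for r in candidates if r[rhs] != rhs_pattern)

    # Variable right-hand side: rows that agree on lhs must agree on rhs.
    groups = defaultdict(list)
    for r in candidates:
        groups[tuple(r[a] for a in lhs)].append(r)
    bad = set()
    for group in groups.values():
        if len({r[rhs] for r in group}) > 1:
            bad.update(r["id"] for r in group)
    return sorted(bad)


# Hypothetical rows (NOT Table 1); values not named in the text are made up.
rows = [
    {"id": 1, "CC": "86", "AC": "05", "city": "WH", "area": "HS", "zip": "430"},
    {"id": 2, "CC": "86", "AC": "07", "city": "CS", "area": "JH", "zip": "410"},
    {"id": 3, "CC": "86", "AC": "07", "city": "CS", "area": "JH", "zip": "410"},
    {"id": 4, "CC": "86", "AC": "05", "city": "CS", "area": "JJ", "zip": "410"},
]

# CFD1: when CC = 86, zip determines area.
cfd1 = violations(rows, ("CC", "zip"), ("86", WILDCARD), "area")
# CFD4: when CC = 86 and AC = 05, city must be WH.
cfd4 = violations(rows, ("CC", "AC"), ("86", "05"), "city", "WH")

print(cfd1)  # [2, 3, 4]
print(cfd4)  # [4]
```

The variable-pattern check has to group tuples that share the same left-hand-side values, which is why, in a fragmented relation, tuples stored on different nodes may need to be brought together before a violation can be confirmed. The constant-pattern check can be evaluated on each tuple independently.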