A complete solution for duplication detection over uncertain data
Peng Pan, Xiaojun Cai
School of Computer Science and Technology, Shandong University, Jinan, P. R. China
As data sources grow rapidly in number and scale, uncertainty in duplication detection is becoming an increasingly prominent problem that deserves more attention. However, research on uncertainty in duplicated data is still at an early stage. In this paper, we propose a complete method for duplication detection with probabilities that is efficient and suitable for large-scale datasets. First, to cope with large-scale data, we adopt a fast canopy-based clustering algorithm to partition the records into blocks. Second, to generate the record sets that represent entities, we apply a fuzzy clustering method with two thresholds to each block. Together, these two steps balance complexity and accuracy. Finally, we assign a probability to each record in a block. Experiments show performance advantages over existing algorithms.
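To make the blocking step concrete, the following is a minimal sketch of generic canopy clustering. It is not the implementation evaluated in this paper. The token-based Jaccard distance, the function names `jaccard_distance` and `canopy_blocks`, and the `loose` and `tight` threshold values are all assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence


def jaccard_distance(a: str, b: str) -> float:
    """Cheap token-based distance in [0, 1] (an illustrative choice)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def canopy_blocks(
    records: Sequence[str],
    loose: float,
    tight: float,
    distance: Callable[[str, str], float] = jaccard_distance,
) -> List[List[int]]:
    """Group record indices into possibly overlapping blocks (canopies).

    `loose` and `tight` are hypothetical threshold names with tight <= loose.
    """
    if not 0.0 <= tight <= loose:
        raise ValueError("require 0 <= tight <= loose")

    remaining = list(range(len(records)))
    blocks: List[List[int]] = []
    while remaining:
        center = remaining[0]  # pick any remaining record as the canopy center
        block: List[int] = []
        still_remaining: List[int] = []
        for i in remaining:
            d = distance(records[center], records[i])
            if d <= loose:
                block.append(i)  # close enough to join this canopy
            if d > tight:
                still_remaining.append(i)  # may still start or join another canopy
        blocks.append(block)
        remaining = still_remaining  # the center itself always drops out (d == 0)
    return blocks


if __name__ == "__main__":
    sample = [
        "john smith jinan",
        "john smith jinan china",
        "mary jones beijing",
        "mary jones",
    ]
    print(canopy_blocks(sample, loose=0.6, tight=0.3))
```

Because a record that falls between the two thresholds stays available for later canopies, blocks may overlap. A more expensive comparison, such as the fuzzy clustering step, would then only need to run within each block.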