Entity Resolution with Heavy Indexing

    Entity resolution (ER), or deduplication, is a computationally hard problem with O(n^2) time complexity. We reformulate ER as a search problem and develop algorithms that use efficient indices. Indices can enhance algorithm scalability and facilitate distributed processing, but they require additional storage space. We study the performance of ER algorithms and the tradeoffs between index update and search, and show that significant performance gains can be obtained by using indices. We also demonstrate the strength of our algorithms in a real-world scenario: creating customer master data for an insurance company.

    Csaba István Sidló
    In Proceedings of the 2011 International Conference on Advances in Databases and Information Systems (ADBIS 2011), CEUR Workshop Proceedings
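
The idea in the abstract, treating ER as a search over an index rather than an all-pairs comparison, can be illustrated with a minimal sketch. This is not the paper's algorithm: the token-level inverted index, the Jaccard similarity, the 0.5 threshold, and the function names `resolve_pairwise` and `resolve_indexed` are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def tokens(record):
    """Split a record into a set of lowercase tokens (illustrative choice)."""
    return set(record.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def resolve_pairwise(records, threshold=0.5):
    """Baseline: compare every pair of records, O(n^2) comparisons."""
    return {
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(tokens(records[i]), tokens(records[j])) >= threshold
    }


def resolve_indexed(records, threshold=0.5):
    """ER as search: each record queries an inverted index, then is added to it."""
    index = defaultdict(set)  # token -> ids of records inserted so far
    matches = set()
    for i, record in enumerate(records):
        toks = tokens(record)
        # Search step: only records sharing at least one token are candidates.
        candidates = set()
        for t in toks:
            candidates |= index.get(t, set())
        for j in candidates:
            if jaccard(toks, tokens(records[j])) >= threshold:
                matches.add((j, i))
        # Update step: this costs extra storage, the tradeoff the abstract mentions.
        for t in toks:
            index[t].add(i)
    return matches


records = [
    "John Smith Budapest",
    "john smith budapest hungary",
    "Anna Kovacs Szeged",
    "Kovacs Anna Szeged",
    "Peter Nagy",
]
duplicates = resolve_indexed(records)
```

With a positive threshold, any matching pair must share at least one token, so the indexed version returns the same pairs as the pairwise baseline while comparing each record only against its candidates. How much work this saves depends on how selective the indexed tokens are.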