Infrastructures and Bounds for Distributed Entity Resolution

    Entity resolution (ER), deduplication or record linkage is a computationally hard problem with distributed implementations typically relying on shared memory architectures. We show simple reductions to communication complexity is the key and data streaming lower bounds to illustrate the difficulties with a distributed implementation: If the data records are split among servers, then basically all data must be transferred.

    As a key result, we demonstrate that ER can be solved using algorithms with three different distributed computing paradigms:
    • Distributed key-value stores;
    • Map-Reduce;
    • Bulk Synchronous Parallel.

    We measure our algorithms in the real-world scenario of an insurance customer master data integration procedure. We show how the algorithms can be modified for non-Boolean fuzzy merge functions and similarity indexes.

    Csaba István Sidló, András Garzó, András Molnár and András A. Benczúr.
    In Proceedings of the 9th International Workshop on Quality in Databases In conjunction with VLDB 2011 (QDB 2011).