Flexible and Efficient Distributed Resolution of Large Entities

    Entity resolution (ER) is a computationally hard problem in data integration, where database records have to be grouped according to the real-world entities they belong to. In practice, these entities may consist of only a few records from different data sources that contain typos or historical data. In other cases, they may contain significantly more records, especially when we search for entities at a higher level of a concept hierarchy than individual records.
    In this paper, we give a theoretical foundation for a variety of practically important match functions. We show that, under these formulations, ER with large entities can be solved efficiently with algorithms based on MapReduce, a distributed computing paradigm. Our algorithm can efficiently incorporate probabilistic and similarity-based record matching, enabling flexible definition of match functions. We demonstrate the usability of our model and algorithm in a real-world insurance ER scenario, where we identify household groups of client records.

    Molnár András J., Sidló Csaba I., Benczúr András A.
    T. Lukasiewicz and A. Sali (Eds.): FoIKS 2012, LNCS 7153