Project SHERLOCK @ UCI: Entity Resolution, Data Cleaning, and Data Quality Project at UC Irvine.

Overview

Welcome to the homepage of SHERLOCK @ UCI, a UC Irvine project on Entity Resolution and Data Quality!

Data quality research matters because the effectiveness of data-driven technologies, such as decision support, data exploration, analysis, and scientific discovery tools, is closely tied to the quality of the data to which they are applied. It is well recognized that the outcome of an analysis is only as good as the data on which it is performed. That is why organizations today spend a tangible share of their budgets on cleaning tasks, such as removing duplicates, correcting errors, and filling in missing values, to improve data quality before pushing data through the analysis pipeline.

Given the critical importance of the problem, many efforts, in both industry and academia, have explored systematic approaches to addressing the cleaning challenges. The work of our group focuses primarily on the entity resolution challenge that arises because objects in the real world are referred to using references or descriptions that are not always unique identifiers of the objects, leading to ambiguity.

The traditional approach to entity resolution uses features associated with a reference (or a record) to find references that co-refer. In our project we are exploring which other sources and types of information could be used, in addition to features, to better disambiguate among references. This information could be present in the data being cleaned itself or could be obtained from external data sources, including ontologies, encyclopedias, and the Web. We are also looking into ways to guide and fine-tune the data cleaning process based on the type of analysis that will be performed on the cleaned data, so that cleaning achieves higher disambiguation quality as well as better efficiency.
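As a rough illustration of the traditional feature-based approach described above, the following minimal Python sketch compares two records field by field and declares them co-referent when their average similarity passes a threshold. The record fields, the Jaccard token similarity, and the 0.5 threshold are all assumptions chosen for illustration; they are not taken from the project's software.

```python
def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def feature_similarity(rec1, rec2, fields):
    """Average per-field Jaccard similarity over the given fields."""
    scores = [jaccard(tokens(rec1[f]), tokens(rec2[f])) for f in fields]
    return sum(scores) / len(scores)


def co_refer(rec1, rec2, fields, threshold=0.5):
    """Decide whether two records refer to the same entity.

    The threshold is a hypothetical value; real systems tune it per dataset.
    """
    return feature_similarity(rec1, rec2, fields) >= threshold


# Hypothetical toy records used only for this example.
records = [
    {"name": "John Smith", "affiliation": "UC Irvine"},
    {"name": "Smith, John", "affiliation": "UC Irvine"},
    {"name": "Jane Doe", "affiliation": "MIT"},
]
FIELDS = ["name", "affiliation"]

if __name__ == "__main__":
    print(co_refer(records[0], records[1], FIELDS))
    print(co_refer(records[0], records[2], FIELDS))
```

A sketch like this relies only on the features of the two records being compared, which is exactly the limitation that motivates looking at additional sources of information.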

Past Work

In past work on this project, we pioneered a novel entity resolution methodology, which we refer to as Relationship-Based Data Cleaning (RelDC). RelDC relies on the observation that many real-world datasets are relational in nature and contain not only information about entities but also relationships among them, and that knowledge of these relationships can be used to disambiguate among representations more effectively. RelDC is a principled, domain-independent framework that exploits the entity-relationship graph of the dataset, and specifically its relationships, for high-quality entity resolution that is self-tuning and requires minimal intervention by analysts.
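To make the idea of using relationships concrete, the toy Python sketch below resolves an ambiguous reference by picking the candidate entity that is most strongly connected, in an entity-relationship graph, to the entities that appear in the reference's context. This is not the RelDC algorithm itself: the graph, the node names, and the connection-strength measure (the sum of inverse shortest-path lengths) are simplifying assumptions made only for illustration.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Breadth-first search hop count between two nodes, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None


def connection_strength(graph, candidate, context):
    """Assumed measure: sum of 1/path_length from candidate to each context entity."""
    total = 0.0
    for entity in context:
        d = shortest_path_length(graph, candidate, entity)
        if d:  # skip unreachable (None) and identical (0) nodes
            total += 1.0 / d
    return total


def resolve(graph, candidates, context):
    """Return the candidate with the strongest connection to the context."""
    return max(candidates, key=lambda c: connection_strength(graph, c, context))


# Hypothetical undirected entity-relationship graph (authors, papers, organizations).
graph = {
    "smith_uci": ["paper1", "uci"],
    "smith_mit": ["paper2", "mit"],
    "lee": ["paper1", "paper3"],
    "paper1": ["smith_uci", "lee"],
    "paper2": ["smith_mit"],
    "paper3": ["lee"],
    "uci": ["smith_uci"],
    "mit": ["smith_mit"],
}

if __name__ == "__main__":
    # A new reference "J. Smith" appears alongside the author "lee".
    print(resolve(graph, ["smith_uci", "smith_mit"], ["lee"]))
```

In this toy graph the two "Smith" candidates are indistinguishable by name alone, but only one of them has previously co-authored a paper with the context author, so the relationship evidence breaks the tie.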

Keywords

Entity Resolution, Data Cleaning, Data Cleansing, Information Quality, Data Quality.

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. 1118114. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.