Department of Computer Science
University of California, Irvine
Welcome
Welcome to the homepage of SHERLOCK, a UCI project on Entity Resolution and Data Quality!
Overview
The effectiveness of data-driven technologies as tools for decision support,
data exploration, and scientific discovery is closely tied to the
quality of the data to which they are applied. It is well recognized
that the outcome of an analysis is only as good as the data on which
the analysis is performed. That is why organizations today spend a substantial
share of their budgets on cleaning tasks, such as removing duplicates,
correcting errors, and filling in missing values, to improve data quality before
pushing data through the analysis pipeline.
Given the critical importance of the problem, many efforts, in both industry
and academia, have explored systematic approaches to addressing the cleaning
challenges. Our work focuses primarily on the entity resolution challenge
that arises because objects in the real world are referred to using references
or descriptions that are not always unique identifiers of the objects,
leading to ambiguity.
The traditional approach for entity resolution uses features associated with
a reference (or a record) to find references that co-refer. In our project we are exploring
which other sources and types of information could be used, in addition to features, to better
disambiguate among references. This information may be present in the data being cleaned itself
or may be obtained from external data sources, including ontologies, encyclopedias, and the Web.
We are also looking into ways to guide and fine-tune the data cleaning process based
on the type of analysis that will be performed on the cleaned data, so that cleaning achieves
higher disambiguation quality as well as efficiency.
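As a rough illustration of the traditional feature-based approach described above, the sketch below compares two records field by field and declares them co-referent when a weighted similarity passes a threshold. The field names, the token-level Jaccard similarity, the weights, and the threshold are all assumptions chosen for illustration; they are not taken from the project.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty values are treated as identical
    return len(ta & tb) / len(ta | tb)


def feature_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted average of per-field similarities (hypothetical scoring)."""
    total = sum(weights.values())
    score = sum(
        w * jaccard(r1.get(field, ""), r2.get(field, ""))
        for field, w in weights.items()
    )
    return score / total


def co_refer(r1: dict, r2: dict, weights: dict, threshold: float = 0.5) -> bool:
    """Decide whether two records refer to the same entity."""
    return feature_similarity(r1, r2, weights) >= threshold


# Illustrative records; names and affiliations are made up.
weights = {"name": 1.0, "affiliation": 1.0}
rec_a = {"name": "John Smith", "affiliation": "UC Irvine"}
rec_b = {"name": "J. Smith", "affiliation": "UC Irvine"}
rec_c = {"name": "Jane Doe", "affiliation": "MIT"}
```

A decision based on features alone is only as reliable as the features themselves: two distinct people who share a name and an affiliation would be merged by this sketch, which is the kind of ambiguity that additional sources of information are meant to reduce.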
As part of our project, we have pioneered a novel entity resolution methodology which we refer to as
Relationship-Based Data Cleaning (RelDC). RelDC relies upon the observation that many real-world
datasets are relational in nature
and contain not only information about entities but also relationships among them,
knowledge of which can be used to disambiguate among representations more effectively.
RelDC is a principled, domain-independent framework that
exploits the entity-relationship graph of the dataset, and in particular the
relationships it captures, to achieve high-quality entity resolution that is self-tuning and requires minimal
intervention by analysts.
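The idea of using relationships for disambiguation can be sketched in a few lines. The example below is not the RelDC algorithm; it is a minimal illustration that assumes the dataset is given as an undirected entity-relationship graph and that a candidate entity is preferred when it is more closely connected to the entities appearing in the reference's context. The inverse shortest-path distance used as the connection measure, the `max_hops` limit, and all node names are hypothetical.

```python
from collections import deque


def shortest_distance(graph, source, target, max_hops=4):
    """Breadth-first search; returns hop count, or None if not reachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for nbr in graph.get(node, ()):
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None


def connection_strength(graph, candidate, context, max_hops=4):
    """Sum of inverse distances from a candidate to the context entities."""
    strength = 0.0
    for entity in context:
        d = shortest_distance(graph, candidate, entity, max_hops)
        if d:  # skips unreachable (None) and identical (0) nodes
            strength += 1.0 / d
    return strength


def resolve(graph, candidates, context, max_hops=4):
    """Pick the candidate most strongly connected to the context."""
    return max(
        candidates,
        key=lambda c: connection_strength(graph, c, context, max_hops),
    )


# Toy graph: two authors named Smith, each linked to one paper.
graph = {
    "smith_1": ["paper_A"],
    "smith_2": ["paper_B"],
    "paper_A": ["smith_1", "chen"],
    "paper_B": ["smith_2", "lopez"],
    "chen": ["paper_A"],
    "lopez": ["paper_B"],
}
# A new reference "J. Smith" appears alongside co-author "chen".
best = resolve(graph, ["smith_1", "smith_2"], ["chen"])
```

In this toy graph the two Smith entities are indistinguishable by name, but only `smith_1` has previously co-authored with `chen`, so the relationship evidence favors that candidate.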
Keywords
Entity Resolution, Data Cleaning, Data Cleansing, Information Quality, Data Quality.
© 2012 SHERLOCK @ UCI. All Rights Reserved.