Overview
Welcome to the homepage of Sherlock@UCI: a UC Irvine project on Entity Resolution and Data Quality!
The significance of data quality research is motivated by the observation that
the effectiveness of data-driven technologies such as decision support tools,
data exploration, analysis, and scientific discovery tools is closely tied to the
quality of the data to which such technologies are applied. It is well recognized
that the outcome of the analysis is only as good as the data on which
the analysis is performed. That is why organizations today spend a tangible
share of their budgets on cleaning tasks, such as removing duplicates,
correcting errors, and filling in missing values, to improve data quality before
pushing data through the analysis pipeline.
Given the critical importance of the problem, many efforts, in both industry
and academia, have explored systematic approaches to addressing the cleaning
challenges. The work of our group focuses primarily on the entity resolution challenge
that arises because objects in the real world are referred to using references
or descriptions that are not always unique identifiers of the objects,
leading to ambiguity.
The traditional approach for entity resolution uses features associated with
a reference (or a record) to find references that co-refer. In our project we are exploring
which other sources and types of information could be used, in addition to features, to better
disambiguate among references. This information could be present in the data being cleaned itself
or can be obtained from external data sources, including ontologies, encyclopedias, and the Web.
We are also looking into ways to guide and fine-tune the data cleaning process based
on the type of analysis that will be performed on the cleaned data, so that cleaning achieves
higher disambiguation quality as well as efficiency.
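To make the traditional feature-based approach concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the project's method: the record fields, the token-level Jaccard similarity, and the 0.5 threshold are all hypothetical choices made for this example.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings (illustrative choice)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def record_similarity(r1: dict, r2: dict, fields: list) -> float:
    """Average per-field similarity; a real system would weight or learn this."""
    return sum(jaccard(r1[f], r2[f]) for f in fields) / len(fields)


def find_coreferent_pairs(records: list, fields: list, threshold: float = 0.5) -> list:
    """Return index pairs of records whose feature similarity meets the threshold."""
    pairs = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        if record_similarity(r1, r2, fields) >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical toy records, invented for this example.
records = [
    {"name": "John Smith", "affiliation": "UC Irvine"},
    {"name": "J. Smith", "affiliation": "UC Irvine"},
    {"name": "Jane Doe", "affiliation": "MIT"},
]

print(find_coreferent_pairs(records, ["name", "affiliation"]))
```

In this toy data only the first two records are flagged as co-referring. Features alone cannot tell whether "J. Smith" is John or some other Smith at the same institution, which is the kind of ambiguity that additional sources of information are meant to resolve.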
Past Work
As part of our project, we have previously pioneered an entity resolution methodology which we refer to as
Relationship-Based Data Cleaning (RelDC). RelDC relies upon the observation that many real-world
datasets are relational in nature
and contain not only information about entities but also relationships among them,
knowledge of which can be used to disambiguate among representations more effectively.
RelDC is a principled, domain-independent framework that
exploits the entity-relationship graph of the dataset, and specifically
relationships, for high-quality entity resolution that is self-tuning and requires minimal
intervention by analysts.
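The idea of using relationships to disambiguate can be illustrated with a small Python sketch. This is not the RelDC algorithm itself, whose details are not given on this page. It is a simplified illustration that assumes a hypothetical connection-strength measure (the sum of inverse shortest-path lengths) over a toy entity-relationship graph with invented node names.

```python
from collections import deque


def shortest_path_length(graph: dict, source: str, target: str):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def connection_strength(graph: dict, candidate: str, context: list) -> float:
    """Hypothetical measure: sum of 1/distance to each reachable context entity."""
    total = 0.0
    for entity in context:
        dist = shortest_path_length(graph, candidate, entity)
        if dist is not None and dist > 0:
            total += 1.0 / dist
    return total


def resolve(graph: dict, candidates: list, context: list) -> str:
    """Pick the candidate entity most strongly connected to the reference's context."""
    return max(candidates, key=lambda c: connection_strength(graph, c, context))


# Toy undirected graph: authors linked to the papers they wrote (invented data).
graph = {
    "john_smith": {"paper_1"},
    "jane_smith": {"paper_2"},
    "alice_chen": {"paper_1"},
    "bob_lee": {"paper_2"},
    "paper_1": {"john_smith", "alice_chen"},
    "paper_2": {"jane_smith", "bob_lee"},
}

# An ambiguous reference "J. Smith" appears alongside Alice Chen.
print(resolve(graph, ["john_smith", "jane_smith"], ["alice_chen"]))
```

Here the ambiguous reference resolves to `john_smith`, because he already shares a paper with the co-occurring entity while `jane_smith` has no path to her. Shortest-path distance is only one possible proxy for connection strength; a principled framework would likely consider richer path-based evidence.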
Keywords
Entity Resolution, Data Cleaning, Data Cleansing, Information Quality, Data Quality.
Acknowledgement
This material is based upon work supported by the National Science Foundation under Grant No. 1118114. Any opinions,
findings, and conclusions or recommendations expressed in this material are those of the author(s) and
do not necessarily reflect the views of the National Science Foundation.
© 2013 SHERLOCK @ UCI. All Rights Reserved.