Project SHERLOCK @ UCI: Data Cleaning Datasets, Entity Resolution Datasets.

SHERLOCK @ UCI: Entity Resolution and Data Quality Project at UC Irvine.

Useful Data Cleaning Data Sets and Entity Resolution Data Sets

arXiv hep-th: KDD Cup 2003 publication dataset, the hep-th portion of arXiv. Fully labeled; 29.5K unique papers, 13K unique authors.

CiteSeer: a collection of research publications

Cora: a citation dataset from the RIDDLE data repository

Cora: a citation dataset from Andrew McCallum's data repository

DBLP: a collection of bibliographic entries

DMOZ ontology: a large downloadable ontology

Enron Email Dataset: a dataset of Enron emails

FEBRL Database: Freely Extensible Biomedical Record Linkage

Freedb CD Dataset: information on various CDs

IMDb: a collection of movie-related entries

Leipzig DB Group Datasets: publication data and product table data

PubMed/MEDLINE: over 20 million bibliographic entries for biomedical literature; see PubMed online for details. The data must be licensed from the NIH. Full texts of some publications are available from PMC.

RIDDLE Repository: various data cleaning-related datasets

SPOKE Challenge: a collection of labeled webpages for the SPOKE Challenge

Stanford Movie Dataset: a collection of movie-related entries

UC Irvine Machine Learning Repository: a collection of various ML datasets

UIS Database Generator: generates synthetic names/addresses by adding errors into records

U.S. Census Names: frequently occurring first names and surnames from the 1990 Census

Web Disambiguation: a collection of labeled webpages used by McCallum et al. in WWW'05

WEPS Corpus: a collection of labeled webpages used by Artiles et al. in SIGIR'05

Wikilinks: Google's dataset of 11 million webpages for cross-document entity resolution (info).

Wiktionary: a downloadable free-content multilingual dictionary