Useful Data Sets for Data Cleaning and Entity Resolution
arXiv hep-th: KDD Cup 2003 publication dataset, the hep-th portion of arXiv. Fully labeled; 29.5K unique papers, 13K unique authors
CiteSeer: a collection of research publications
Cora: a citation dataset from the RIDDLE data repository
Cora: a citation dataset from Andrew McCallum's data repository
DBLP: a collection of bibliographic entries
DMOZ ontology: a large downloadable ontology
Enron Email Dataset: a dataset of Enron emails
FEBRL Database: Freely Extensible Biomedical Record Linkage
Freedb CD Dataset: information on various CDs
IMDb: a collection of movie-related entries
Leipzig DB Group Datasets: publication data and product table data
PubMed/MEDLINE: over 20 million bibliographic entries for biomedical literature; see PubMed online for details. The data must be licensed from the NIH. Full texts of some publications are available from PMC.
RIDDLE Repository: various data cleaning-related datasets
SPOKE Challenge: a collection of labeled webpages for SPOKE Challenge
Stanford Movie Dataset: a collection of movie-related entries
UC Irvine Machine Learning Repository: a collection of various ML datasets
UIS Database Generator: generates synthetic names/addresses by adding errors into records
U.S. Census Names: frequently occurring first names and surnames from the 1990 Census
Web Disambiguation: a collection of labeled webpages used by McCallum et al. in WWW'05
WEPS Corpus: a collection of labeled webpages used by Artiles et al. in SIGIR'05
Wikilinks: Google's dataset of 11 million webpages for cross-document entity resolution
Wiktionary: a downloadable free-content multilingual dictionary
© 2013 SHERLOCK @ UCI. All Rights Reserved.