University of California
SHERLOCK @ UCI: Entity Resolution and Data Quality Project at UC Irvine.

Useful Data Cleaning and Entity Resolution Datasets

  • arXiv hep-th: KDD Cup 2003 publication dataset, the hep-th portion of arXiv. Fully labeled, 29.5K unique papers, 13K unique authors
  • CiteSeer: a collection of research publications
  • Cora: a citation dataset from RIDDLE data repository
  • Cora: a citation dataset from Andrew McCallum's data repository
  • DBLP: a collection of bibliographic entries
  • DMOZ ontology: a large downloadable ontology
  • Enron Email Dataset: a dataset of Enron emails
  • FEBRL Database: Freely Extensible Biomedical Record Linkage
  • Freedb CD Dataset: information on various CDs
  • IMDb: a collection of movie-related entries
  • Leipzig DB Group Datasets: publication data and product table data
  • PubMed/MEDLINE: over 20 million bibliographic entries for biomedical literature; see PubMed online for details. The data must be licensed from the NIH. Full texts of some publications are available at PMC.
  • RIDDLE Repository: various data cleaning-related datasets
  • SPOKE Challenge: a collection of labeled webpages for SPOKE Challenge
  • Stanford Movie Dataset: a collection of movie-related entries
  • UC Irvine Machine Learning Repository: a collection of various ML datasets
  • UIS Database Generator: generates synthetic names/addresses by adding errors into records
  • U.S. Census Names: frequently occurring first names and surnames from the 1990 Census
  • Web Disambiguation: a collection of labeled webpages used by McCallum et al. in WWW'05
  • WEPS Corpus: a collection of labeled webpages used by Artiles et al. in SIGIR'05
  • Wikilinks: Google's dataset of 11 million webpages for cross-document entity resolution (info)
  • Wiktionary: a downloadable free-content multilingual dictionary

© 2013 SHERLOCK @ UCI. All Rights Reserved.