Sung-Soo Kim's Blog

Three Approaches to Scalable Data Curation


31 July 2015

Data curation is the process of turning independently created data sources (structured and semi-structured data) into unified data sets ready for analytics, using domain experts to guide the process. It involves:

  • Identifying data sources of interest (whether from inside or outside the enterprise)
  • Verifying the data (to ascertain its composition)
  • Cleaning the incoming data (for example, 99999 is not a legal zip code)
  • Transforming the data (for example, from European date format to US date format)
  • Integrating it with other data sources of interest (into a composite whole)
  • Deduplicating the resulting composite data set
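
To make the cleaning, transforming, integrating, and deduplicating steps above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the method of any tool discussed in the talk: the record fields (`name`, `zip`, `date`), the helper names, and the exact-match deduplication key are all hypothetical.

```python
import re
from datetime import datetime

# Assumption: a ZIP code is five digits. 99999 is rejected because the
# text above names it as an illegal value; a real pipeline would check
# against an authoritative ZIP list instead.
ZIP_RE = re.compile(r"\d{5}")
ILLEGAL_ZIPS = {"99999"}


def clean_zip(value):
    """Return the ZIP code if it looks plausible, otherwise None."""
    value = value.strip()
    if ZIP_RE.fullmatch(value) is None or value in ILLEGAL_ZIPS:
        return None
    return value


def european_to_us_date(value):
    """Convert a DD/MM/YYYY date string to MM/DD/YYYY."""
    return datetime.strptime(value, "%d/%m/%Y").strftime("%m/%d/%Y")


def curate(sources):
    """Clean, transform, integrate, and deduplicate records.

    `sources` is a list of record lists (one list per data source).
    Each record is a dict with hypothetical keys: name, zip, date.
    """
    seen = set()
    composite = []
    for records in sources:  # integrate all sources into one data set
        for rec in records:
            row = {
                "name": rec["name"].strip(),
                "zip": clean_zip(rec["zip"]),
                "date": european_to_us_date(rec["date"]),
            }
            # Naive deduplication: exact match on normalized fields.
            key = (row["name"].lower(), row["zip"], row["date"])
            if key not in seen:
                seen.add(key)
                composite.append(row)
    return composite


source_a = [{"name": "Acme Corp", "zip": "02139", "date": "31/07/2015"}]
source_b = [
    {"name": "acme corp ", "zip": "02139", "date": "31/07/2015"},
    {"name": "Globex", "zip": "99999", "date": "01/02/2015"},
]
curated = curate([source_a, source_b])
```

Exact-key matching like this only catches trivial duplicates. Deciding whether two differently spelled records describe the same entity is where domain experts come in, and that is the part that does not scale.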

The more data you need to curate for analytics and other business purposes, the more costly and complex curation becomes, mostly because the humans involved (domain experts or data owners) do not scale. As a result, most enterprises are “tearing their hair out” as they try to cope with data curation at scale. We call this problem “Big Data Variety.”

This talk compares three approaches to Big Data Variety:

  • ETL (Extract-Transform-Load) tools
  • Data Science tools
  • Enterprise curation tools

Two case studies, one from an information services company and one from a biopharmaceutical company, will show why the third approach is the preferred option for data curation at scale.
