What Is Data Science? - The Current Landscape
So, what is data science? Is it new, or is it just statistics or analytics rebranded? Is it real, or is it pure hype? And if it’s new and if it’s real, what does that mean?
This is an ongoing discussion, but one way to understand what’s going on in this industry is to look online and see what current discussions are taking place. This doesn’t necessarily tell us what data science is, but it at least tells us what other people think it is, or how they’re perceiving it. For example, on Quora there’s a discussion from 2010 about “What is Data Science?” and here’s Metamarkets CEO Mike Driscoll’s answer:
Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.
But data science is not merely hacking—because when hackers finish debugging their Bash one-liners and Pig scripts, few of them care about non-Euclidean distance metrics.
And data science is not merely statistics, because when statisticians finish theorizing the perfect model, few could read a tab-delimited file into R if their job depended on it.
Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools and materials, coupled with a theoretical understanding of what’s possible.
Driscoll then refers to Drew Conway’s Venn diagram of data science from 2010, shown in Figure 1-1.
He also mentions the sexy skills of data geeks from Nathan Yau’s 2009 post, “Rise of the Data Scientist”, which include:
- Statistics (traditional analysis you’re used to thinking about)
- Data munging (parsing, scraping, and formatting data)
- Visualization (graphs, tools, etc.)
But wait, is data science just a bag of tricks? Or is it the logical extension of other fields like statistics and machine learning?
For one argument, see Cosma Shalizi’s posts here and here, and Cathy’s posts here and here, which constitute an ongoing discussion of the difference between a statistician and a data scientist. Cosma basically argues that any statistics department worth its salt does all the stuff in the descriptions of data science that he sees, and therefore data science is just a rebranding and unwelcome takeover of statistics.
For a slightly different perspective, see ASA President Nancy Geller’s 2011 Amstat News article, “Don’t shun the ‘S’ word”, in which she defends statistics:
We need to tell people that Statisticians are the ones who make sense of the data deluge occurring in science, engineering, and medicine; that statistics provides methods for data analysis in all fields, from art history to zoology; that it is exciting to be a Statistician in the 21st century because of the many challenges brought about by the data explosion in all of these fields.
Though we get her point—the phrase “art history to zoology” is supposed to represent the concept of A to Z—she’s kind of shooting herself in the foot with these examples because they don’t correspond to the high-tech world where much of the data explosion is coming from. Much of the development of the field is happening in industry, not academia. That is, there are people with the job title data scientist in companies, but no professors of data science in academia. (Though this may be changing.)
Not long ago, DJ Patil described how he and Jeff Hammerbacher, then at LinkedIn and Facebook, respectively, coined the term “data scientist” in 2008. So that is when “data scientist” emerged as a job title. (Wikipedia finally gained an entry on data science in 2012.)
It makes sense to us that once the skill set required to thrive at Google (working with a team on problems that required a hybrid skill set of stats and computer science, paired with personal characteristics including curiosity and persistence) spread to other Silicon Valley tech companies, it required a new job title. Once it became a pattern, it deserved a name. And once it got a name, everyone and their mother wanted to be one. It got even worse when Harvard Business Review declared data scientist to be the “Sexiest Job of the 21st Century”.
But we can go back even further. In 2001, William Cleveland wrote a position paper about data science called “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.”
So data science existed before data scientists? Is this semantics, or does it make sense?
This all raises a few questions: can you define data science by what data scientists do? Who gets to define the field, anyway? There’s lots of buzz and hype: does the media get to define it, or should we rely on the practitioners, the self-appointed data scientists? Or is there some actual authority? Let’s leave these as open questions for now, though we will return to them throughout the book.
Rachel Schutt and Cathy O’Neil, Doing Data Science, O’Reilly Media, Inc., 2014.