What Is Data Science?

Over the past few years, there’s been a lot of hype in the media about “data science” and “Big Data.” A reasonable first reaction to all of this might be some combination of skepticism and confusion.

Big Data and Data Science Hype

Let’s get this out of the way right off the bat, because many of you are likely skeptical of data science already for many of the reasons we were. We want to address this up front to let you know: we’re right there with you. If you’re a skeptic too, it probably means you have something useful to contribute to making data science into a more legitimate field that has the power to have a positive impact on society.

So, what is eyebrow-raising about Big Data and data science? Let’s count the ways:

There’s a lack of definitions around the most basic terminology. What is “Big Data” anyway? What does “data science” mean? What is the relationship between Big Data and data science? Is data science the science of Big Data? Is data science only the stuff going on in companies like Google and Facebook and tech companies? Why do many people refer to Big Data as crossing disciplines (astronomy, finance, tech, etc.) and to data science as only taking place in tech? Just how big is big? Or is it just a relative term? These terms are so ambiguous, they’re well-nigh meaningless.
There’s a distinct lack of respect for the researchers in academia and industry labs who have been working on this kind of stuff for years, and whose work is based on decades (in some cases, centuries) of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all types. From the way the media describes it, machine learning algorithms were just invented last week and data was never “big” until Google came along. This is simply not the case. Many of the methods and techniques we’re using—and the challenges we’re facing now—are part of the evolution of everything that’s come before. This doesn’t mean that there’s not new and exciting stuff going on, but we think it’s important to show some basic respect for everything that came before.
The hype is crazy—people throw around tired phrases straight out of the height of the pre-financial crisis era like “Masters of the Universe” to describe data scientists, and that doesn’t bode well. In general, hype masks reality and increases the noise-to-signal ratio. The longer the hype goes on, the more many of us will get turned off by it, and the harder it will be to see what’s good underneath it all, if anything.
Statisticians already feel that they are studying and working on the “Science of Data.” That’s their bread and butter. Maybe you, dear reader, are not a statisitican and don’t care, but imagine that for the statistician, this feels a little bit like how identity theft might feel for you. Although we will make the case that data science is not just a rebranding of statistics or machine learning but rather a field unto itself, the media often describes data science in a way that makes it sound like as if it’s simply statistics or machine learning in the context of the tech industry.
People have said to us, “Anything that has to call itself a science isn’t.” Although there might be truth in there, that doesn’t mean that the term “data science” itself represents nothing, but of course what it represents may not be science but more of a craft.

Getting Past the Hype

Rachel’s experience going from getting a PhD in statistics to working at Google is a great example to illustrate why we thought, in spite of the aforementioned reasons to be dubious, there might be some meat in the data science sandwich. In her words:

It was clear to me pretty quickly that the stuff I was working on at Google was different than anything I had learned at school when I got my PhD in statistics. This is not to say that my degree was useless; far from it—what I’d learned in school provided a framework and way of thinking that I relied on daily, and much of the actual content provided a solid theoretical and practical foundation necessary to do my work.
But there were also many skills I had to acquire on the job at Google that I hadn’t learned in school. Of course, my experience is specific to me in the sense that I had a statistics background and picked up more computation, coding, and visualization skills, as well as domain expertise while at Google. Another person coming in as a computer scientist or a social scientist or a physicist would have different gaps and would fill them in accordingly. But what is important here is that, as individuals, we each had different strengths and gaps, yet we were able to solve problems by putting ourselves together into a data team well-suited to solve the data problems that came our way.

Here’s a reasonable response you might have to this story. It’s a general truism that, whenever you go from school to a real job, you realize there’s a gap between what you learned in school and what you do on the job. In other words, you were simply facing the difference between academic statistics and industry statistics.

We have a couple replies to this:

Sure, there’s is a difference between industry and academia. But does it really have to be that way? Why do many courses in school have to be so intrinsically out of touch with reality?
Even so, the gap doesn’t represent simply a difference between industry statistics and academic statistics. The general experience of data scientists is that, at their job, they have access to a larger body of knowledge and methodology, as well as a process, which we now define as the data science process, that has foundations in both statistics and computer science.

Around all the hype, in other words, there is a ring of truth: this is something new. But at the same time, it’s a fragile, nascent idea at real risk of being rejected prematurely. For one thing, it’s being paraded around as a magic bullet, raising unrealistic expectations that will surely be disappointed.

References

[1] Rachel Schutt and Cathy O’Neil, Doing Data Science, O’Reilly Media, Inc., 2014.

Stop Thinking, Just Do!