Stop Thinking, Just Do!

Sung-Soo Kim's Blog

A Data Science Profile


16 March 2014

Data Science Jobs

Columbia just decided to start an Institute for Data Sciences and Engineering with Bloomberg’s help. Last time we checked, there were 465 job openings for data scientists in New York City alone. That’s a lot. So even if data science isn’t a real field, it has real jobs.

And here’s one thing we noticed about most of the job descriptions: they ask data scientists to be experts in computer science, statistics, communication, data visualization, and to have extensive domain expertise. Nobody is an expert in everything, which is why it makes more sense to create teams of people who have different profiles and different expertise—together, as a team, they can specialize in all those things. We’ll talk about this more after we look at the composite set of skills in demand for today’s data scientists.

A Data Science Profile

In the class, Rachel handed out index cards and asked everyone to profile themselves (on a relative rather than absolute scale) with respect to their skill levels in the following domains:

  • Computer science
  • Math
  • Statistics
  • Machine learning
  • Domain expertise
  • Communication and presentation skills
  • Data visualization

As an example, Figure 1-2 shows Rachel’s data science profile.

[Figure 1-2: Rachel’s data science profile]

We taped the index cards to the blackboard and got to see how everyone else thought of themselves. There was quite a bit of variation, which is cool—lots of people in the class were coming from social sciences, for example.

Where is your data science profile at the moment, and where would you like it to be in a few months, or years?

As we mentioned earlier, a data science team works best when different skills (profiles) are represented across different people, because nobody is good at everything. It makes us wonder if it might be more worthwhile to define a “data science team” (as shown in Figure 1-3) than to define a data scientist.

[Figure 1-3: Data science team profile]
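
To make the team idea concrete, here is a minimal Python sketch of how individual profiles might be combined into a team profile. It is only an illustration under stated assumptions: the 0 to 5 scale, the names, the scores, and the helper functions `team_profile` and `gaps` are all made up here and do not come from the book.

```python
# Skill names follow the list on the index cards; everything else is hypothetical.
SKILLS = [
    "computer science",
    "math",
    "statistics",
    "machine learning",
    "domain expertise",
    "communication",
    "data visualization",
]


def team_profile(profiles):
    """Return, for each skill, the best relative score anyone on the team has."""
    return {
        skill: max(person.get(skill, 0) for person in profiles.values())
        for skill in SKILLS
    }


def gaps(profiles, threshold=3):
    """List the skills where nobody on the team reaches the threshold."""
    combined = team_profile(profiles)
    return [skill for skill in SKILLS if combined[skill] < threshold]


# Made-up relative scores on an assumed 0-5 scale.
team = {
    "alice": {
        "computer science": 5, "math": 3, "statistics": 2,
        "machine learning": 4, "domain expertise": 1,
        "communication": 2, "data visualization": 2,
    },
    "bob": {
        "computer science": 1, "math": 4, "statistics": 5,
        "machine learning": 2, "domain expertise": 2,
        "communication": 4, "data visualization": 3,
    },
}

print(team_profile(team))
print(gaps(team))
```

Taking the maximum per skill is just one way to read "the team covers this skill"; an average or a head count above some level would tell a different story. With the made-up scores above, neither person is strong everywhere, but together the only skill left below the threshold is domain expertise.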

Data Science with Hadoop

Apache Hadoop is quickly becoming the technology of choice for organizations investing in big data, powering their next-generation data architecture. With Hadoop serving as both a scalable data platform and computational engine, data science is re-emerging as a centerpiece of enterprise innovation, with applied data solutions such as online product recommendation, automated fraud detection and customer sentiment analysis.

In this talk Ofer will provide an overview of data science and how to take advantage of Hadoop for large scale data science projects:

  • What is data science?
  • How can techniques like classification, regression, clustering and outlier detection help your organization?
  • What questions do you ask and which problems do you go after?
  • How do you instrument and prepare your organization for applied data science with Hadoop?
  • Who do you hire to solve these problems?

You will learn how to plan, design and implement a data science project with Hadoop.


[1] Rachel Schutt and Cathy O’Neil, Doing Data Science, O’Reilly Media, Inc., 2014.
