Scales of Big Data

Big Data

It’s worth a digression into a set of questions we asked about “big data.” In our view, one reason “data science” and related buzzwords have emerged recently is the advent of new technologies and techniques for working inexpensively with very large data sets. However, we also view big data as somewhat tangential to the value data scientists bring to organizations. Our survey data [1] supports this: most data scientists rarely work with terabyte-scale or larger data. Figure 3-4 shows how often respondents worked with data of kilobyte, megabyte, gigabyte, terabyte, and petabyte scale, broken down by Skills Group (which is clearer here than Self-ID Group).

Respondents whose top Skills Group was ML/Big Data were the most likely to work with larger data sets: over half worked on TB-scale problems at least occasionally, compared with about a quarter of other respondents. Even in the ML/Big Data Skills Group, however, the vast majority rarely or never worked with PB-scale data. True big data work seems limited to a relatively small subset of data scientists.

References

[1] Harlan D. Harris, Sean Patrick Murphy, and Marck Vaisman, Analyzing the Analyzers, O’Reilly Media, Inc., 2013.
