- Title: 5 technologies that will help big data cross the chasm
- Authors: Derrick Harris
Big data has been a buzzword for years, but it’s a lot more than just buzz. There are now so many tools and technologies for creating, collecting and analyzing data that almost anything is possible if you know where to look.
We’re on the cusp of a real turning point for big data. Its applications are becoming clearer, its tools are getting easier and its architectures are maturing in a hurry. It’s no longer just about log files, clickstreams and tweets. It’s not just about Hadoop and what’s possible (or not) with MapReduce.
With each passing day, big data is becoming more about creativity — if someone can think of an application, they can probably build it. That makes the concept of big data a lot more tangible and a lot more useful to a lot more companies, and it makes the market for big data a lot more lucrative.
Here are five technologies helping spur a shift in thinking from “Why would I want to use some technology that Yahoo built? And how?” to “We have a problem that needs solving. Let’s find the right tool to solve it.”
When it comes to open source big data projects, they don’t get much hotter than Apache Spark. The data-processing framework is garnering a lot of users and a lot of supporters — including from Hadoop vendors MapR and Cloudera — because it promises to be almost everything for Hadoop deployments (arguably the foundation of most enterprise big data environments) that MapReduce wasn’t. It’s fast, it’s easy to program and it’s flexible.
Right now, Spark is getting a lot of attention as an engine for machine-learning workloads — for example, Cloudera Oryx and even Apache Mahout are porting their code bases to Spark — as well as for interactive queries and data analysis. As the project’s community grows, the list of target workloads should expand, as well.
Spark’s popularity is aided by the YARN resource manager for Hadoop and the Apache Mesos cluster-management software, both of which make it possible to run Spark, MapReduce and other processing engines on the same cluster using the same Hadoop storage layer. I wrote in 2012 about the move away from MapReduce as one of five big trends helping us rethink big data, and Spark has stepped up as the biggest part of that migration.
Cloud computing might seem like an obvious entry (we’ve been talking about the convergence of cloud computing and big data for years), but cloud offerings have advanced significantly in just the past year. There are bigger, faster and ever-cheaper raw compute options, many offering high memory capacity, solid-state drives or even GPUs. All of this makes it much easier, and much more economically feasible, to run myriad types of data-processing workloads in the cloud.
The markets for managed Hadoop, database and analytics services continue to grow. These services are quickly adding new capabilities and, as the technologies underpinning them advance, they’re becoming faster and more scalable.
Cloud providers are also targeting emerging use cases, such as stream processing, the internet of things and artificial intelligence. Amazon Web Services offers a service called Kinesis for processing data as it crosses the wire. Microsoft is previewing a service designed specifically to capture and store data streaming off of sensors. A handful of vendors, including IBM, Expect Labs and AlchemyAPI, are providing various flavors of artificial intelligence via API, meaning developers can build intelligent applications without first mastering machine learning.
We’ll talk a lot more about the future of cloud computing at our Structure conference June 18 and 19 in San Francisco. Speakers include Amazon CTO Werner Vogels, Google SVP and Technical Fellow Urs Hölzle, and Microsoft EVP Scott Guthrie. Also, Airbnb VP Mike Curtis will discuss how that company runs big data workloads in the cloud, and New York Times Chief Data Scientist Chris Wiggins will talk about the newspaper’s work in machine learning.
A lot of talk about sensors focuses on the volume and speed at which they generate data, but what’s often ignored are the strategic decisions that go into choosing the right sensors to gather the right data. If there are real-world measurements that need to be taken, or events that need to be logged, there’s probably a fairly inexpensive sensor available to do the job. Sensors are integral to smarter cars, of course, but also to everything from agriculture to hospital sanitation.
And if there’s not a usable sensor commercially available, it’s not inconceivable to build one from scratch. A team of university researchers, for example, built an inexpensive sensor that measures the wing speed of insects using a cheap laser pointer and digital recorder. It helped them capture more, better data than previous researchers, resulting in a significantly more accurate model for classifying bugs.
That type of creativity highlights what’s possible thanks to the convergence of sensors, consumer electronics, big data, and, presumably, the maker movement and 3-D printing. If more, different and better data will lead to better analysis, it’s easier than ever to collect it yourself rather than wait for someone else to do it.
Thanks to the proliferation of data in the form of photos, videos, speech and text, there’s now an incredible amount of effort going into building algorithms and systems that can help computers understand those inputs. From a big data perspective, the interesting thing about these approaches — whether they’re called deep learning, cognitive computing or some other flavor of artificial intelligence — is that they’re not yet really about analytics in the same way so many other big data projects are.
AI researchers aren’t so concerned — yet — with uncovering trends or finding the needle in the haystack as they are with automating tasks that humans can already do. The big difference, of course, is that, done right, the systems can perform tasks such as object or facial recognition, or text analysis, much faster and at a much greater scale than humans can. As they get more accurate and require less training, these systems could power everything from intelligent ad platforms to much smarter self-driving cars.
Remarkably, the techniques for doing all this stuff are being democratized at a rapid clip and will soon be accessible to a lot more people via software, open source libraries and even APIs. Google and Facebook are spending hundreds of millions of dollars advancing the state of the art in AI, but anyone brave enough to give it a whirl can get their hands on similar capabilities for very little money, if not free.
Commercial quantum computing is still a way off, but we can already see what might be possible when it arrives. According to D-Wave Systems, the company that has sold prototype versions of its quantum computer to Google, NASA and Lockheed Martin, it’s particularly good at advanced machine learning tasks and difficult optimization problems. Google is testing out computer vision algorithms that could eventually run on smartphones; Lockheed is trying to improve software validation for flight systems.
It’s powerful stuff that could help companies of all stripes solve some difficult computing and analytic tasks that today’s most-advanced systems and techniques can’t. Or, at least, quantum computing should be able to solve those problems faster and more efficiently.
Before that can happen, though, mainstream businesses will need access to quantum resources and some knowledge of how to use them. D-Wave is vowing to make the resources available via the cloud, and is working on compilers to simplify the programming aspect. There’s a lot of ground to cover before that happens, but the technology is moving fast, and quantum computer instances delivered via the Amazon Web Services or Google clouds aren’t out of the realm of possibility.