Article Source

Authors: Steven J Vaughan-Nichols

Drilling into Big Data with Apache Drill

Apache’s Drill goal is striving to do nothing less than answer queries from petabytes of data and trillions of records in less than a second.

You can’t claim that the Apache Drill programmers think small. Their design goal is for Drill to scale to 10,000 servers or more and to process petabyes of data and trillions of records in less than a second.

If this sounds impossible, or at least very improbable, consider that the NSA already seems to be doing exactly the same kind of thing. If they can do it, open-source software can do it.

In at interview at OSCon, the major open source convention in Portland, OR, Ted Dunning, the chief application architect for MapR, a big data company, and a Drill mentor and committer, explained the reason for the project. “There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data in such formats as Avro; Apache Hadoop data serialization system; JSON (JavaScript Object Notation); and Protocol Buffers Google’s data interchange format.”

As Dunning explained, big business wants fast access to big data and none of the traditional solutions, such as a relational database management system (RDBMS), MapReduce, or Hive, can deliver those speeds.

Dunning continued, “This need was identified by Google and addressed internally with a system called Dremel.” Dremel was the inspiration for Drill, which also is meant to complement such open-source big data systems as Apache Hadoop. The difference between Hadoop and Drill is that while Hadoop is designed to achieve very high throughput, it’s not designed to achieve the sub-second latency needed for interactive data analysis and exploration.

Drill’s architecture is made up of four components:

Query languages: This layer is responsible for parsing the user’s query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel andGoogle BigQuery. It will also support Full ANSI SQL:2003.
Low-latency distributed execution engine: This is Drill’s heart. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill’s execution engine is based on research in distributed execution engines such as Dremel, Dryad, Hyracks, CIEL, Stratosphere, and columnar storage.
Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON (Binary JSON,) andYAML.
Scalable data sources: This layer is responsible for supporting data sources. The initial focus is to leverage Hadoop as a data source.

The distributed execution engine is written in Java. Dunning explained, “We’re pushing the envelope in terms of high-performance execution in Java. If you optimize Java the way you would with C++, you can get extraordinary optimization. Java programmers have been coddled so they assume they don’t have to optimize.”

To further speed up the queries, a specialized transaction remote procedure call (RPC) layer is designed to work at wire speeds. Once the requested data is in RAM, it’s kept in a column format in an array using Java’s ValueVectors and this, according to Dunning, is where you really get the query speed improvements.

Jacques Nadeau, MapR’s lead developer on the Apache Drill open-source project, explained at an OSCon session that the columnar data structure is better for execution in memory because it improves cache locality and CPU pipelining. In storage, it can be used to save space thanks to the fact that it lends itself to improved data compression.

In Drill, the database schema is very late binding. Indeed, Drill doesn’t assume that the data has a schema at all, though it uses one if it’s there.

As you might have guessed by now, Drill loves to have as much RAM as possible. To avoid problems with cache coherency, Drill uses pure data flow through the memory and a lock-free architecture.

From a deployment viewpoint, Drillbits, each node’s instance of Drill, uses local RAM and data. Queries can be made from any such instance. The co-ordination, query planning and optimization, scheduling, and execution are then distributed.

Dunning is well aware of other projects to bring scalable parallel database real-time queries to Hadoop, such as Cloudera Impala. He thinks, though, that Drill’s developers realize, “We may not be the most clever guys in the woods, so we’re open to other programmers’ ideas.”

Speaking of which, Drill is more than ready to welcome programmers. In particular, Dunning said, “We’re a very welcoming, very fast moving community. We want more query optimizing types and Java programmers. We also need people who want to learn, who can write documentation, build Web front-ends and RPC layers.”

At this point, Drill is very much a work in progress. “It’s not quite production quality at this point, but by third or fourth quarter of 2013 it will become quite usable.” Specifically, Drill should be in beta by the third quarter.

So, if Drill sounds interesting to you, you can start contributing as soon as you get up to speed. To do that, there’s a weekly Google Hangout on Tuesdays at 9am Pacific time and a Twitter feed at @ApacheDrill. And, of course, there’s an Apache Drill Wiki and users’ and developers’ mailing lists.

Stop Thinking, Just Do!