Queries with Bounded Errors and Bounded Response Times on Very Large Data
BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade-off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas: (1) An adaptive optimization framework that builds and maintains a set of multi-dimensional samples from original data over time, and (2) A dynamic sample selection strategy that selects an appropriately sized sample based on a query’s accuracy and/or response time requirements. We have evaluated BlinkDB on the well-known TPC-H benchmarks, a real-world analytic workload derived from Conviva Inc. and are in the process of deploying it at Facebook Inc.
BlinkDB has been demonstrated live at VLDB 2012 on a 100 node Amazon EC2 cluster answering a range of queries on 17 TBs of data in less than 2 seconds (over 200x faster than Hive), within an error of 2-10%.
The current version of BlinkDB supports a slightly constrained set of SQL-style declarative queries and provides approximate results for standard SQL aggregate queries, specifically queries involving COUNT, AVG, SUM and PERCENTILE and is being extended to support any User-Defined Functions (UDFs). Queries involving these operations can be annotated with either an error bound, or a time constraint, based on which the system selects an appropriate sample to operate on.
We evaluated BlinkDB’s performance on a 100 node EC2 cluster using a workload of media access traces from Conviva Inc. on 17TB of data. We compared BlinkDB to distributed query execution frameworks (Hive and Spark) that operate on full-sized datasets, evaluated the error convergence properties of multi-dimensional stratified samples used by BlinkDB and demonstrated the effectiveness of our cost models and error projections in meeting the user’s accuracy/response time requirements.
BlinkDB Vs. No Sampling
To demonstrate the advantages of sampling, we ran on two subsets of Conviva data of size 7.5 TB and 2.5 TB respectively, spread across 100 machines (each with 60GB of RAM). The smaller 2.5 TB dataset was completely cached in memory while datasets larger than 6 TB in size have to be (at least partially) spilled to disk. To demonstrate the significance of sampling even for the simplest analytical queries, we ran a simple query that computed the average of user session times with a filtering predicate on the date column and a GROUP BY on the city column. We compared the response time of the full (accurate) execution of this query in Hive on Hadoop and Hive on Spark (Shark) – both with and without caching, against its (approximate) execution on BlinkDB with a 1% error bound for each city at 95% confidence. In all cases, BlinkDB significantly outperformed its counterparts (by a factor of 10 − 200x), because it is able to read far less data to compute a fairly accurate answer.
Error Convergence Properties
To demonstrate the convergence properties of multi-dimensional stratified samples used by BlinkDB, we compared a query execution on three sets of samples- the multi-dimensional stratified sample used by BlinkDB, single-dimensional stratified samples and non-stratified (i.e., uniform) samples taken over 17 TB of Conviva data. Over this data, we ran multiple queries to calculate average session time for a particular ISP’s customers in 5 US Cities and determined the latency for achieving a particular error bound with 95% confidence. Results from this experiment show that error bars from running queries over multi-dimensional stratified samples converge orders-of-magnitude faster than random sampling, and are significantly faster to converge than single-dimensional stratified samples.
To evaluate BlinkDB’s effectiveness in meeting different time/error bounds requested by the user, we picked a sample of 20 Conviva queries, ran each of them 10 times, with a time bound from 1 to 10 seconds (left graph) and with an error bound from 1% - 32% (right graph). The left graph shows the results run on the same 17 TB data set, where each bar represents the minimum, maximum and average response times of these queries. Similarly, the right graph shows results from the same set of queries, also on the 17 TB data set, demonstrating our ability to meet specified error constraints. The bars again represent the minimum, maximum and average errors across different runs of the queries.
Sameer Agarwal, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Michael Jordan, Samuel Madden, Barzan Mozafari, Ion Stoica. Knowing When You’re Wrong: Building Fast and Reliable Approximate Query Processing Systems. In ACM SIGMOD 2014, Snowbird, Utah.
Ariel Kleiner, Ameet Talwalkar, Sameer Agarwal, Ion Stoica, Michael Jordan. A General Bootstrap Performance Diagnostic. In ACM KDD 2013, Chicago, Illinois.
Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In ACM EuroSys 2013, Prague, Czech Republic (Best Paper Award).
Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Anand P. Iyer, Samuel Madden, Ion Stoica. Blink and It’s Done: Interactive Queries on Very Large Data. In PVLDB 5(12): 1902-1905, 2012, Istanbul, Turkey.
BlinkDB is being developed by Sameer Agarwal, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Aurojit Panda, Prof. Michael I. Jordan and Prof. Ion Stoica at the University of California, Berkeley in collaboration with Prof. Barzan Mozafari at the University of Michigan and Prof. Samuel Madden at the Massachusetts Institute of Technology.