Article Source

Title: Running BlinkDB Locally
Authors: Sameer Agarwal

Running BlinkDB Locally

This wiki is closely mirrored after the Shark Wiki and describes how to get BlinkDB running locally. It creates a small custom BlinkDB Hive installation on one machine and allows you to execute simple queries. The only prerequisite for this guide is that you have Java and Scala 2.9.3 installed on your machine. If you don’t have Scala 2.9.3, you can download it by running:

$ wget http://www.scala-lang.org/files/archive/scala-2.9.3.tgz
$ tar xvfz scala-2.9.3.tgz

Get the latest version of BlinkDB.

$ git clone -b alpha-0.1.0 https://github.com/sameeragarwal/blinkdb.git

BlinkDB requires the (patched) development package of BlinkDB Hive which is added as a submodule in the BlinkDB repository. Clone it from github and package it:

$ cd blinkdb
$ git submodule init
$ git submodule update
$ cd hive_blinkdb
$ ant package

ant package builds all Hive jars and put them into build/dist directory. If you are trying to build Hive on your local machine and (a) your distribution doesn’t have yum or (b) the above yum commands don’t work out of the box with your distro, then you probably want to upgrade to a newer version of ant. ant >= 1.8.2 should work. Download ant binaries at http://ant.apache.org/bindownload.cgi. You might also be able to upgrade to a newer version of ant using a package manager, however on older versions of CentOS, e.g. 6.4, yum can’t install ant 1.8 out of the box so installing ant by downloading the binary installation package is recommended.

The BlinkDB code is in the blinkdb/ directory. To setup your environment to run BlinkDB locally, you need to set HIVE_HOME and SCALA_HOME environmental variables in a file blinkdb/conf/blinkdb-env.sh to point to the folders you just downloaded. BlinkDB comes with a template file blinkdb-env.sh.template that you can copy and modify to get started:

$ cd blinkdb/conf
$ cp blinkdb-env.sh.template blinkdb-env.sh

Edit blinkdb/conf/blinkdb-env.sh and set the following for running local mode:

#!/usr/bin/env bash
export SHARK_MASTER_MEM=1g
export HIVE_DEV_HOME="/path/to/hive"
export HIVE_HOME="$HIVE_DEV_HOME/build/dist"
SPARK_JAVA_OPTS="-Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS
export SCALA_VERSION=2.9.3
export SCALA_HOME="/path/to/scala-home-2.9.3"
export SPARK_HOME="/path/to/spark"
export HADOOP_HOME="/path/to/hadoop-1.2.0"
export JAVA_HOME="/path/to/java-home-1.7_21-or-newer"

Next, package and publish Spark and BlinkDB

$ cd $SPARK_HOME
$ sbt/sbt publish-local
$ cd $BLINKDB_HOME
$ sbt/sbt package

Next, create the default Hive warehouse directory. This is where Hive will store table data for native tables.

$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod 0777 /user/hive/warehouse  # Or make your username the owner

You can now start the BlinkDB CLI:

$ ./bin/blinkdb

To verify that BlinkDB is running, you can try the following example, which creates a table with sample data:

CREATE TABLE src(key INT, value STRING);
LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src;
SELECT COUNT(1) FROM src;
CREATE TABLE src_cached AS SELECT * FROM SRC;
SELECT COUNT(1) FROM src_cached;

In addition to the BlinkDB CLI, there are several executables in blinkdb/bin:

bin/blinkdb-withdebug: Runs BlinkDB CLI with DEBUG level logs printed to the console.
bin/blinkdb-withinfo: Runs BlinkDB CLI with INFO level logs printed to the console.

Stop Thinking, Just Do!