Article Source
Explore In-Memory Data Store Tachyon
Memory is the key to fast Big Data processing. This has been realized by many, and frameworks such as Spark already leverage memory performance. As data sets continue to grow, storage is increasingly becoming a critical bottleneck in many workloads.
To address this need, we have developed Tachyon, a memory centric fault-tolerant distributed file system, which enables reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. The result of over two years of research, Tachyon achieves memory-speed and fault-tolerance by using memory aggressively and leveraging lineage information. Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed. Thus, Tachyon avoids going to disk to load datasets that are frequently read.
Tachyon is Hadoop compatible. Existing Spark and MapReduce programs can run on top of it without any code changes. Tachyon is the default off-heap option in Spark, which means that RDDs can automatically be stored inside Tachyon to make Spark more resilient and avoid GC overheads. The project is open source and is already deployed at multiple companies. In addition, Tachyon has more than 50 contributors from over 20 institutions, including Yahoo, Intel, Redhat, and Pivotal. The project is the storage layer of the Berkeley Data Analytics Stack (BDAS) and also part of the Fedora distribution.
In this chapter we first go over basic operations of Tachyon, and then run a Spark program on top of it. For more information, please visit Tachyon’s website or Github repository. We also host regular meetups in the bay area.
Prerequisites
Assumptions
- You have a laptop
- Your laptop has Java 6 or 7 installed
- Mac OS or Linux (Windows is not supported)
Launch Tachyon
Configurations
All system’s configuration is under tachyon/conf
folder. Please find
them, and see how much memory is configured on each worker node.
$ grep "TACHYON_WORKER_MEMORY_SIZE=" conf/tachyon-env.sh
export TACHYON_WORKER_MEMORY_SIZE=1GB
You can also read the through the file and try to understand those parameters. For more information on configuration, you can visit Tachyon Configuration Settings webpage.
Format the storage
Note that if you are running Linux, Tachyon will need root permission to
create and use a RAM disk. To start a superuser shell, run sudo su
and
enter your password.
Before starting Tachyon for the first time, we need to format the
system. It can be done by using tachyon
script in the tachyon/bin
folder. Please type the following command:
$ ./bin/tachyon format
Connection to localhost... Formatting Tachyon Worker @ HYMac-2.local
Removing local data under folder: /Users/haoyuan/Downloads/test/tachyon/libexec/../ramdisk/tachyonworker/
Formatting Tachyon Master @ localhost
Formatting JOURNAL_FOLDER: /Users/haoyuan/Downloads/test/tachyon/libexec/../journal/
Formatting UNDERFS_DATA_FOLDER: /Users/haoyuan/Downloads/test/tachyon/libexec/../../data/tmp/tachyon/data
Formatting UNDERFS_WORKERS_FOLDER: /Users/haoyuan/Downloads/test/tachyon/libexec/../../data/tmp/tachyon/workers
Start the system
After formatting the storage, we can try to start the system. This can
be done by using tachyon/bin/tachyon-start.sh
script.
$ ./bin/tachyon-start.sh local
Killed 0 processes
Killed 0 processes
Connection to localhost... Killed 0 processes
Starting master @ localhost
Starting worker @ HYMac-2.local
Interacting with Tachyon
In this section, we will go over three approaches to interact with Tachyon:
- Command Line Interface
- Application Programming Interface
- Web User Interface
Command Line Interface
You can interact with Tachyon using the following command:
$ ./bin/tachyon tfs
Then, it will return a list of options:
Usage: java TFsShell
[cat <path>]
[count <path>]
[ls <path>]
[lsr <path>]
[mkdir <path>]
[rm <path>]
[tail <path>]
[touch <path>]
[mv <src> <dst>]
[copyFromLocal <src> <remoteDst>]
[copyToLocal <src> <localDst>]
[fileinfo <path>]
[location <path>]
[report <path>]
[request <tachyonaddress> <dependencyId>]
[pin <path>]
[unpin <path>]
Please try to put the local file tachyon/LICENSE
into Tachyon file
system as /LICENSE using command line.
$ ./bin/tachyon tfs copyFromLocal LICENSE /LICENSE
Copied LICENSE to /LICENSE
You can also use command line interface to verify this:
$ ./bin/tachyon tfs ls /
11.40 KB 02-07-2014 23:23:44:008 In Memory /LICENSE
Now, you want to check out the content of the file:
$ ./bin/tachyon tfs cat /LICENSE
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
....
Application Programming Interface
After using command line to interact with Tachyon, you can also use its API. We have several sample applications. For example, BasicOperations.java shows how to user file create, write, and read operations.
You have put these into our script, you can simply use the following command to run this sample program. The following command runs [BasicOperations.java], and also verifies Tachyon’s installation.
$ ./bin/tachyon runTests
$ /root/tachyon/bin/tachyon runTest Basic MUST_CACHE
$ /BasicFile_MUST_CACHE has been removed
$ 2014-02-07 23:46:57,529 INFO (TachyonFS.java:connect) - Trying to connect master @ ec2-23-20-202-253.compute-1.amazonaws.com/10.91.151.150:19998
$ 2014-02-07 23:46:57,599 INFO (MasterClient.java:getUserId) - User registered at the master ec2-23-20-202-253.compute-1.amazonaws.com/10.91.151.150:19998 got UserId 6
$ 2014-02-07 23:46:57,600 INFO (TachyonFS.java:connect) - Trying to get local worker host : ip-10-91-151-150.ec2.internal
$ 2014-02-07 23:46:57,618 INFO (TachyonFS.java:connect) - Connecting local worker @ ip-10-91-151-150.ec2.internal/10.91.151.150:29998
$ 2014-02-07 23:46:57,661 INFO (CommonUtils.java:printTimeTakenMs) - createFile with fileId 3 took 133 ms.
$ 2014-02-07 23:46:57,707 INFO (TachyonFS.java:createAndGetUserTempFolder) - Folder /mnt/ramdisk/tachyonworker/users/6 was created!
$ 2014-02-07 23:46:57,714 INFO (BlockOutStream.java:<init>) - /mnt/ramdisk/tachyonworker/users/6/3221225472 was created!
$ Passed the test!
$ /root/tachyon/bin/tachyon runTest BasicRawTable MUST_CACHE
$ /BasicRawTable_MUST_CACHE has been removed
$ 2014-02-07 23:46:58,633 INFO (TachyonFS.java:connect) - Trying to connect master @ ec2-23-20-202-253.compute-1.amazonaws.com/10.91.151.150:19998
$ 2014-02-07 23:46:58,705 INFO (MasterClient.java:getUserId) - User registered at the master ec2-23-20-202-253.compute-1.amazonaws.com/10.91.151.150:19998 got UserId 8
$ 2014-02-07 23:46:58,706 INFO (TachyonFS.java:connect) - Trying to get local worker host : ip-10-91-151-150.ec2.internal
$ 2014-02-07 23:46:58,725 INFO (TachyonFS.java:connect) - Connecting local worker @ ip-10-91-151-150.ec2.internal/10.91.151.150:29998
$ 2014-02-07 23:46:58,859 INFO (TachyonFS.java:createAndGetUserTempFolder) - Folder /mnt/ramdisk/tachyonworker/users/8 was created!
$ 2014-02-07 23:46:58,866 INFO (BlockOutStream.java:<init>) - /mnt/ramdisk/tachyonworker/users/8/8589934592 was created!
$ 2014-02-07 23:46:58,904 INFO (BlockOutStream.java:<init>) - /mnt/ramdisk/tachyonworker/users/8/9663676416 was created!
$ 2014-02-07 23:46:58,914 INFO (BlockOutStream.java:<init>) - /mnt/ramdisk/tachyonworker/users/8/10737418240 was created!
$ Passed the test!
$ ...
Web User Interface
After using commands and API to interact with Tachyon, let’s take a look
at its web user interface. The URI is http://localhost:19999
.
The first page is the overview of the running system. The second page is the system configuration
If you click on the Browse File System
, it shows you all the files you
just created and copied.
You can also click a particular file or folder. e.g. /LICENSE
file,
and then you will see the detailed information about it.
Run Spark on Tachyon
Input/Output with Tachyon
In this section, we run a Spark program to interact with Tachyon. The
first one is to do a word count on /LICENSE
file. In /root/spark
folder, execute the following command to start Spark shell.
$ ./bin/spark-shell
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
var file = sc.textFile("tachyon://localhost:19998/LICENSE")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("tachyon://localhost:19998/result")
JavaRDD<String> file = spark.textFile("tachyon://localhost:19998/LICENSE");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>()
public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>()
public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer>()
public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("tachyon://localhost:19998/result");
file = sc.textFile("tachyon://localhost:19998/LICENSE")
counts = file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("tachyon://localhost:19998/result")
The results are stored in /result
folder. You can verfy the results
through Web UI or commands. Because /LICENSE
is in memory, when a new
Spark program comes up, it can load in memory data directly from
Tachyon. In the meantime, we are also working on other features to make
Tachyon further enhance Spark’s performance.
Store RDD OFF_HEAP in Tachyon
Storing RDD as OFF_HEAP storage in Tachyon has several advantages (more info):
- It allows multiple executors to share the same pool of memory in Tachyon.
- It significantly reduces garbage collection costs.
- Cached data is not lost if individual executors crash.
Please try to the following example:
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
var file = sc.textFile("tachyon://localhost:19998/LICENSE")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
counts.take(10)
counts.take(10)
You will notice the second time take(10) is much faster than the first time because that the counts RDD has been stored OFF_HEAP in Tachyon.
This brings us to the end of the Tachyon chapter of the tutorial. We encourage you to continue playing with the code and to check out the project website, Github repository, and meetup group for further information.
Bug reports and feature requests are welcomed.