# Discardable Memory and Materialized Queries

*Put your memory into its right place in the storage hierarchy for efficient queries*
Julian Hyde will present the following talks at the Hadoop Summit:

- Discardable In-Memory, Materialized Query for Hadoop (June 3rd, 11:15-11:55 am)
- Cost-based Query Optimization in Hive (June 4th, 4:35-5:15 pm)
What to do with all that
memory in a Hadoop cluster? The question is frequently heard. Should we
load all of our data into memory to process it? Unfortunately the answer
isn’t quite that simple.
The goal should be to put memory into its right place in the storage
hierarchy, alongside disk and solid-state drives (SSD). Data should
reside in the right place for how it is being used, and should be
organized appropriately for where it resides. These capabilities should
be available to all applications that use Hadoop, and should not require
a lot of configuration to make them work efficiently.
These are ambitious goals. In this article, I propose a solution, a new
kind of data set called the Discardable, In-Memory, Materialized
Query (DIMMQ), and its three key underlying concepts:

- Materialized queries,
- Memory-resident data, and
- Discardable data.
I shall explain how these concepts build on existing Hadoop facilities,
are important and useful individually, combine to exploit memory, and
balance the needs of various applications. Taken together they radically
alter the performance characteristics and manageability of a Hadoop
cluster.
Implementing DIMMQs will require changes at the query level (optimizing
queries in SQL and in other languages such as Pig and Cascading) and
also at the storage level. In a companion blog post, Sanjay Radia
proposes a Discardable Distributed Memory (DDM) for HDFS, a mechanism
intended to support DIMMQs.
By the way, don’t ask me how to pronounce DIMMQ! I won’t mind at all if
the name goes away, and the concepts surface in mundane-sounding
features like “CREATE MATERIALIZED VIEW”. The important part is the
concepts, and how they fit together.
Before I describe the concepts and vision in detail, let’s look at the
trends in hardware, workloads, and data lifecycles that are driving
this.
## Hardware trends
The trend in hardware is towards heterogeneous storage: a balance of
memory, SSD and disk.
Table 1 shows the hardware configuration of a server in a Hadoop
cluster, now versus 5 years ago. (The configurations are anecdotal, but
fairly typical; both servers cost approximately $5,000.)
The data shows a gradual move towards memory, but there are no signs
that disk is going away. We are not moving to an “all-memory
architecture”; if anything, with the arrival of SSD, the storage
hierarchy is becoming more complex, and memory will be an important part
of it.
Table 1: Hardware Trends

| Year | Cores | Memory | SSD  | Disk      | Disk : memory ratio |
|------|-------|--------|------|-----------|---------------------|
| 2009 | 4-8   | 8 GB   | None | 4 x 1 TB  | 512:1               |
| 2014 | 24    | 128 GB | 1 TB | 12 x 4 TB | 384:1               |
## Workloads and data lifecycles
Traditionally Hadoop has been used for storing data, and for batch
analytics to retrieve that data. But increasingly it is used for other
workloads, such as interactive analytics, machine learning, and
streaming. There is a continuum of desired latency, from hours to
milliseconds.
If we look at the life cycle of a particular data item, we also see a
continuum. Some data might live on disk for years until it is structured
and analyzed; other data might need to be acted upon within seconds or
even milliseconds. An analyst or machine-learning algorithm might start
using a subset of that data for analysis, creating derived or cleaned
data sets and combining with the original data.
In general, fresh data is more likely to be read and modified, and
activity drops off exponentially over time. But a subset of historic
data might become “hot” for a few minutes or days, before becoming
latent again. Clearly we would like the hot data to be in memory, but
without rewriting our application or excessive tuning.
## A pool of resources
Hadoop’s strength is that it brings all of the data in a cluster of
computers, and the resources to process it, into a pool. This yields
economies of scale. If you and I store our data in the same cluster, and
I am not accessing my data right now, you can use my slice of the
cluster’s resources to process your data faster.
Along with CPU cycles and network and disk I/O, memory is one of those
key resources. Memory can help deliver high performance on a wide
variety of workloads, but to maximize the use of memory, we need to add
it to the pool, and make it easy for tasks to share memory according to
their need, and make it easy for data to migrate between memory and
other storage media.
## What is the right model for memory?
Data management systems traditionally use memory in two ways. First, a
buffer-cache stores copies of disk blocks in memory so that if a block
is used multiple times, it only needs to be read from disk once. Second,
query-processing algorithms need operating memory. Most algorithms use
modest amounts of memory, but algorithms that work on collections (sort,
aggregation and join) require large amounts of memory for short periods
in order to operate efficiently.
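For example, consider the following aggregate-join query. This is a
minimal sketch; the emp and dept tables and their columns are
illustrative, not from the article:

```sql
-- Join and aggregation are the memory-hungry operators here: the join
-- builds a hash table (or sorts its inputs), and the aggregation
-- buffers its groups. Both need large working memory, but only for
-- the duration of the query.
SELECT d.name, SUM(e.salary) AS total_sal
FROM emp e JOIN dept d ON e.deptno = d.deptno
GROUP BY d.name;
```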
For many analytic workloads, a buffer cache is not an effective use of
memory. Consider a table that is 10x larger than available memory. A
query that sums every row in the table will read every block once. If
you run the query again, no matter how smartly you cache blocks, you
will need at least 90% of the blocks again. If all the data is accessed
uniformly, random reads will also experience a cache hit-rate of only
10%.
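To make the arithmetic concrete, here is a minimal sketch; the table
name and sizes are illustrative:

```sql
-- Suppose table t holds 1 TB of data and the buffer cache holds 100 GB.
-- This full scan reads every block of t exactly once:
SELECT SUM(x) FROM t;
-- After the scan, at most 100 GB (10%) of t's blocks can still be
-- cached, so running the same scan again must read at least 90% of the
-- blocks from disk. Uniform random reads fare no better: the chance
-- that a requested block is cached is 100 GB / 1 TB = 10%.
```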
A Discardable, In-Memory, Materialized Query (DIMMQ) allows a new mode
of memory use.
- A materialized query is a dataset whose contents are guaranteed to
be the same as executing a particular query, called the defining
query of the DIMMQ. Therefore any query that could be satisfied
using that defining query can also be satisfied using the DIMMQ,
possibly a lot faster.
- Discardable means that the system can throw it away.
- In-memory means that the contents of the dataset reside in the
memory of one or more nodes in the Hadoop cluster.
Let’s look at the advantages DIMMQs bring.
- The defining query provides a link between the DIMMQ and the
underlying dataset that the system can understand. The system can
rewrite queries to use the DIMMQ without the application explicitly
referencing it.
- Because the rewrite to use the DIMMQ is performed automatically by
the system, not by the application, it is safe for the system to
discard the DIMMQ at any time.
- Because DIMMQs are discardable, we don’t have to worry about
creating too many of them. Various sources (users,
applications, agents) continually suggest DIMMQs, and the system
continually throws out the ones that are not pulling their weight.
This dynamic process optimizes the system without requiring any part
of it to be omniscient or prescient.
## Building DIMMQs
How do we introduce these capabilities to Hadoop? The architectural
approach is compelling because we can build the core concepts
separately, evolving existing Hadoop systems such as HDFS, HCatalog,
Tez and Hive rather than replacing them.
It is tempting to make HDFS intelligent enough to recognize patterns and
to rebuild DIMMQs that it had previously discarded. But that would
introduce too much coupling into the system — in particular, HDFS would
become dependent on high-level concepts like HCatalog and a
query optimizer.
Instead, the design is elegantly stupid. The low-level system, HDFS,
stores and retrieves DIMMQ data sets, and is allowed to discard them. A
query optimizer in the high-level system (such as Hive or Pig) processes
incoming queries and rewrites them in terms of DIMMQs. A user or agent
builds DIMMQs speculatively, not directly controlling whether they are
discarded, but knowing that HDFS will discard them if they do not pull
their weight.
The core concepts — materialized queries, discardable data, and
in-memory data — are loosely coupled and can be developed and improved
independently of each other.
- Work is already underway in
[Optiq](https://github.com/julianhyde/optiq "Optiq") to support
materialized views, or more specifically, materialized
query recognition. Optiq is a query-optimization engine and is
already used in Hive as part of the
[CBO](https://cwiki.apache.org/confluence/download/attachments/27362075/CBO-2.pdf "CBO") project.
- Support for in-memory data is being planned in [JIRA
HDFS-5851](https://issues.apache.org/jira/browse/HDFS-5851 "HDFS JIRA").
- Discardable data is an extension to HDFS’s long-standing support for
replication (multiple copies of data on disk) and caching
(additional copies in memory).
- In an upcoming blog post, Sanjay Radia describes HDFS [Discardable
Distributed Memory](http://hortonworks.com/blog/ddm/ "DDM") (DDM), a
mechanism that combines in-memory data, replication and caching.
- The Stinger vectorization initiative makes memory access more
efficient by organizing data in column-oriented ranges. This reduces
memory usage and makes for more efficient use of processor cache.
Other components, such as agents that gather statistics and recommend,
build and maintain DIMMQs, can be built around the system without
affecting its core parts.
## Queries, materialized views, and caching
When a data management system such as Hadoop loads a data set into
memory for more efficient processing, it is doing something that
databases have always done: create a copy of the data, organized in a
way that is more efficient for the task at hand, and that can be added
or removed without the application’s knowledge.
B-tree indexes are perhaps the most familiar example, but there are also
hash clusters, aggregate tables, remote snapshots, and projections.
Sometimes the copy is in a different medium (memory versus disk);
sometimes the copy is organized differently (a b-tree index is sorted on
a particular key, whereas the underlying table is not sorted); and
sometimes the copy is a derived data set, for example a subset over a
given time range or a summary.
Why create these copies? If the system knows about the copies of a data
set, then it can use a copy to answer a query rather than the original
data set. Queries answered that way can be several orders of magnitude
faster, especially when the copy is in-memory and/or significantly
smaller than the original.
The major databases (Oracle, IBM DB2, Teradata, Microsoft SQL Server)
all have a feature called (with a few syntactic variations) materialized
views. A materialized view consists of a SQL query and a table that
contains the result of that query. For instance, here is how a
materialized view might be defined in Hive:
```sql
CREATE MATERIALIZED VIEW emp_summary AS
SELECT deptno, gender, COUNT(*) AS c, SUM(salary) AS s
FROM emp
GROUP BY deptno, gender;
```
A materialized view is a table, so you can query it directly:
```sql
SELECT deptno
FROM emp_summary
WHERE gender = 'M' AND c > 20;
```
More importantly, it can be used to answer queries on other tables.
Given a query on the emp table,
```sql
SELECT deptno, AVG(salary) AS average_sal
FROM emp
WHERE gender = 'F'
GROUP BY deptno;
```
the planner can rewrite it to use the emp_summary table, as follows:
```sql
SELECT deptno, SUM(s) / SUM(c) AS average_sal
FROM emp_summary
WHERE gender = 'F'
GROUP BY deptno;
```
emp_summary has done much of the work required to answer the query, so
the results come back faster. It is also significantly smaller, so the
memory budget required to keep it in cache is smaller.
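To see why, consider some illustrative numbers; the row counts below
are assumptions, not measurements:

```sql
-- If emp has 10 million rows but only 100 departments and 2 genders,
-- emp_summary has at most 100 x 2 = 200 rows. The rewritten query
-- scans 200 rows instead of 10 million, and 200 rows cost almost
-- nothing to keep in memory.
SELECT COUNT(*) FROM emp;         -- 10,000,000 (illustrative)
SELECT COUNT(*) FROM emp_summary; -- at most 200
```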
## From materialized views to DIMMQs
DIMMQs are an extension to materialized views.
First, we need to make the materialized query accessible to all
applications written in all languages, so we convert it to Optiq’s
language-independent relational algebra and store its definition in
HCatalog.
Next, we tell HDFS that the materialized query (a) should be kept in
memory, (b) can be discarded. This can be accomplished using hints on
the file that underlies the table.
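As a sketch of how such hints might look (the property names below are
hypothetical, not an existing Hive or HDFS API, though ALTER TABLE ...
SET TBLPROPERTIES is standard Hive syntax):

```sql
-- Hypothetical hints on the table backing the materialized query:
-- (a) keep its file in memory, and (b) allow HDFS to discard it.
-- The property names are illustrative only.
ALTER TABLE emp_summary SET TBLPROPERTIES (
  'storage.tier' = 'memory',
  'storage.discardable' = 'true'
);
```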
Other possible hints might tell HDFS whether to consider copying a DIMMQ
to disk before discarding it, or give estimates of the number of reads
over the next hour, day, and month, to predict the DIMMQ’s usefulness. A
materialized view that is likely to be used very soon is a good
candidate to be stored in memory; if after a few hours the reads decline
to a trickle, and the DIMMQ is much smaller than the original data, it
might be worth paging it to disk rather than discarding it.
Lastly, we need mechanisms to suggest, create and populate DIMMQs. Here
are a few:
1. Materialized views can be created explicitly, using CREATE
MATERIALIZED VIEW syntax.
2. [Apache Falcon](http://falcon.incubator.apache.org "Apache Falcon")
can perform incremental updates to materialized views.
3. The system caches query results (or intermediate results) as DIMMQs.
4. An automatic recommender (based on ideas such as [Data
Cubes](http://www.cs.sfu.ca/CourseCentral/459/han/papers/harinarayan96.pdf "Data Cubes"))
could suggest and build DIMMQs.

Respectively, these mechanisms give an administrator direct control,
keep materializations up to date, adapt the system to new workloads, and
optimize it based on past activity. We would recommend that systems use
a blend of all of them.
## Variations on a theme
There are many ways that core concepts behind DIMMQs can be used and
extended. Here are a few initial ideas, and we trust that the Hadoop
community will find many more.
Materialized queries don’t have to be in-memory. A materialized query
stored on disk would still be useful if, for example, the source dataset
rarely changes and the materialized query is much smaller.
Materialized queries don’t have to be discardable, especially if they
are on disk, where space is not generally a scarce resource. They will
typically be deleted if they are out of sync with their source data.
Materialized queries don’t have to be specified in SQL. Other languages,
such as [Pig](https://pig.apache.org/ "Pig"),
[Cascading](http://www.cascading.org/ "Cascading") and
[Scalding](https://github.com/twitter/scalding "Scalding"), and in fact
any application that uses [Tez](http://tez.incubator.apache.org/ "Tez"),
should be able to use this facility.
Materialized query recognition is just part of the problem. It would be
useful if Hadoop helped maintain the materialized query as the
underlying data changes, or if you could tell Hadoop that the
materialized query was no longer valid. We can build on [ACID
transactions](http://hortonworks.com/blog/adding-acid-to-apache-hive/ "ACID") work
already started.
Materialized queries allow a wide variety of derived data structures to
be described: summary tables, b-tree indexes (basically sorted
projections), partitioned tables and remote snapshots are a few
examples. Using the materialized query mechanism, applications can
design their own derived data structures and have them automatically
recognized by the system.
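For example, a sorted projection, which is essentially a b-tree index,
could be declared using the same hypothetical CREATE MATERIALIZED VIEW
syntax as above. A minimal sketch, assuming an empno column on emp:

```sql
-- A sorted projection of emp on empno: the essence of a b-tree index,
-- expressed as a materialized query the optimizer can recognize and
-- use for point lookups or range scans on empno.
CREATE MATERIALIZED VIEW emp_by_empno AS
SELECT empno, deptno, salary
FROM emp
ORDER BY empno;
```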
In-memory tables don’t have to be materialized queries. There are other
good reasons to support in-memory tables. In a streaming scenario, for
instance, you would write to an in-memory table first, and periodically
flush to disk.
Materialized queries can help with data aging. As data gets older, it is
accessed less frequently, and so you might wish to store it in slower
and cheaper storage, at a lower replication level, or with less coverage
by aggregate tables.
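A sketch of how a materialized query could express such an aging policy,
assuming a hypothetical events table with an event_date column:

```sql
-- Keep the hot, recent slice as a (possibly in-memory, discardable)
-- materialized query; older rows remain on disk at full fidelity.
CREATE MATERIALIZED VIEW recent_events AS
SELECT *
FROM events
WHERE event_date >= '2014-05-01';
```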
## Conclusion
Discardable In-Memory Materialized Query (DIMMQ) data sets express how
our applications use data in a way that allows Hadoop to automatically
optimize and adapt how that data is stored, in particular by storing
copies of that data in memory.
DIMMQs are superior to alternative uses of memory. Unlike a low-level
buffer cache, a DIMMQ caches high-level results, which can be much
smaller and are closer to the required result. And unlike
non-declarative materializations such as Spark RDDs, an application can
use a DIMMQ even if it doesn’t know that it exists.
DIMMQ is built from three concepts: materialized query recognition,
in-memory data sets, and discardable data sets. Together, they allow
applications to seamlessly use heterogeneous storage — disk, SSD and
memory — and quickly adapt to changing patterns of data use. And they
provide a new level of data independence that will allow the Hadoop
ecosystem to develop novel data organizations.
Materializations combined with HDFS Discardable Distributed Memory (DDM)
storage are a major advance in Hadoop architecture. They build on
Hadoop’s existing strengths and make Hadoop the place to store and
process all of your enterprise’s data.