RAMCloud
What is RAMCloud?
RAMCloud is a new class of storage for large-scale datacenter applications. It is a key-value store that keeps all data in DRAM at all times (it is not a cache like memcached). Furthermore, it takes advantage of high-speed networking such as Infiniband or 10Gb Ethernet to provide very high performance. Applications running in the same datacenter as a RAMCloud cluster can access small objects in about 5μs, which is 1000x faster than disk-based storage systems. Small writes take about 15μs. At the same time, RAMCloud storage is durable: data is automatically replicated on nonvolatile secondary storage such as disk or flash, so it is not lost when servers crash. One of RAMCloud’s unique features is that it recovers very quickly from server crashes (only 1-2 seconds) so the availability gaps after crashes are almost unnoticeable. Finally, RAMCloud is designed to scale: it can support clusters containing thousands of storage servers, with total capacities up to a few petabytes.
From a practical standpoint, RAMCloud enables a new class of applications that manipulate large data sets very intensively. Using RAMCloud, an application can combine tens of thousands of items of data in real time to provide instantaneous responses to user requests. Unlike traditional databases, RAMCloud scales to support very large applications, while still providing a high level of consistency. We believe that RAMCloud, or something like it, will become the primary storage system for structured data in cloud computing environments such as Amazon’s AWS or Microsoft’s Azure. We have built the system not as a research prototype, but as a production-quality software system, suitable for use by real applications.
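The latency figures quoted above imply a rough budget for how many dependent reads fit into a single user request. The following is a back-of-the-envelope sketch, not a measurement: the 100 ms response-time budget is a hypothetical value chosen for illustration, the reads are assumed to be strictly sequential, and the disk latency is simply the 5μs read time multiplied by the quoted 1000x factor.

```python
# Back-of-the-envelope latency budget (illustrative assumptions, not measurements).
RAMCLOUD_READ_US = 5          # small-object read latency quoted above
DISK_READ_US = 5 * 1000       # implied by the "1000x faster" comparison
BUDGET_US = 100_000           # hypothetical 100 ms response-time budget


def sequential_reads(budget_us: int, read_latency_us: int) -> int:
    """Number of back-to-back reads that fit within the budget."""
    return budget_us // read_latency_us


ramcloud_reads = sequential_reads(BUDGET_US, RAMCLOUD_READ_US)
disk_reads = sequential_reads(BUDGET_US, DISK_READ_US)
print(f"RAMCloud: {ramcloud_reads} reads, disk: {disk_reads} reads")
```

Under these assumptions the budget covers 20,000 sequential RAMCloud reads but only 20 disk reads, which is the gap behind the "tens of thousands of items" claim.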
RAMCloud is also interesting from a research standpoint. Its two most important attributes are latency and scale. The first goal is to provide the lowest possible end-to-end latency for applications accessing the system from within the same datacenter. We currently achieve latencies of around 5μs for reads and 15μs for writes, but hope to improve these in the future. In addition, the system must scale, since no single machine can store enough DRAM to meet the needs of large-scale applications. We have designed RAMCloud to support at least 10,000 storage servers; the system must automatically manage all the information across the servers, so that clients do not need to deal with any distributed systems issues. The combination of latency and scale creates a large number of interesting research issues. To date we have addressed several of these, such as how to ensure data durability without sacrificing the latency of reads and writes, how to take advantage of the scale of the system to recover very quickly after crashes, and how to manage storage in DRAM. Many more issues remain, such as whether we can provide higher-level features such as secondary indexes and multiple-object transactions without sacrificing the latency or scalability of the system. We are currently exploring several of these issues.
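The scale requirement can be illustrated with simple sizing arithmetic. This is a minimal sketch under stated assumptions: the 256 GB of usable DRAM per server and the 1 PB data set are hypothetical figures, not RAMCloud specifications, and the calculation ignores any metadata or log overhead.

```python
import math

# Illustrative sizing only: both figures below are assumptions.
DRAM_PER_SERVER_GB = 256        # hypothetical usable DRAM per storage server
DATASET_TB = 1000               # hypothetical 1 PB data set


def servers_needed(dataset_tb: float, dram_per_server_gb: float) -> int:
    """Minimum number of servers whose combined DRAM holds the data set."""
    dataset_gb = dataset_tb * 1000  # decimal units: 1 TB = 1000 GB
    return math.ceil(dataset_gb / dram_per_server_gb)


count = servers_needed(DATASET_TB, DRAM_PER_SERVER_GB)
print(count)
```

With these numbers a petabyte-scale data set needs 3,907 servers, consistent with the cluster sizes of thousands of servers described above.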
The RAMCloud project is based in the Department of Computer Science at Stanford University.
Learning About RAMCloud
General information about RAMCloud, such as talks and papers. Much of the information here is related to the research aspects of the project, as opposed to information on how to use RAMCloud.
- Introductory talk on RAMCloud by John Ousterhout, given at LinkedIn on October 12, 2011.
- The RAMCloud Storage System: draft of a comprehensive paper describing RAMCloud, including the log-structured storage mechanism, RAMCloud’s thread architecture and approach to low latency, and its crash recovery mechanisms. This draft is from October 2014.
- The Case for RAMCloud: an early position paper that discusses the motivation for RAMCloud, the new kinds of applications it may enable, and some of the research issues that will have to be addressed to create a working system. Appeared in CACM in July 2011.
- An earlier and slightly longer version of the position paper, which appeared in Operating Systems Review in December 2009.
- Fast Recovery in RAMCloud: describes RAMCloud’s mechanism for recovering crashed servers in 1-2 seconds. Appeared in SOSP in October, 2011.
- Log-Structured Memory for DRAM-based Storage: describes how RAMCloud manages the storage of objects both in DRAM and on disk. Appeared in FAST in February, 2014; won Best Paper Award.
- Toward Common Patterns for Distributed, Concurrent, Fault-Tolerant Code: HotOS 2013 workshop paper describing a rules-based approach for building “DCFT” systems.
- Articles about RAMCloud (Web and print media, written by people outside the RAMCloud group)
- RAMCloud Papers (complete listing of all papers written by the RAMCloud group)
- RAMCloud Presentations (Slides from talks about RAMCloud)
- Glossary of RAMCloud Terms
How to Deploy and Use RAMCloud
RAMCloud has now reached a level of maturity where it is suitable for production use with real applications. The links below provide information on how to set up a RAMCloud cluster and on the RAMCloud APIs for applications.
- Deciding Whether to Use RAMCloud
- Supported Platforms
- Setting Up a RAMCloud Cluster
- Creating a RAMCloud Client
- Application APIs (what features are available to applications)
- Python Bindings
- Service Locators
- Technical Support
RAMCloud Performance
Measurements of RAMCloud performance, as well as comparisons between RAMCloud and other systems.
- clusterperf benchmarks (benchmarks run on a cluster to measure basic things such as read and write latency and throughput)
- How To Run Clusterperf
- Perf benchmarks (microbenchmarks measuring various low-level operations on a single machine, such as atomic increment)
- Performance Improvement Log
- Recovery Performance Benchmark
- Latency Patterns in Infiniband (talk by Alex Modkovich, May 2012)
- RPC Latency Profile (the lifetime of a write operation, measured January 2012)
- SSD Experiments (July 2011)
- Redis vs. RAMCloud
- Older Performance Measurements
Information for RAMCloud Developers
Information for people who are working on the RAMCloud code base; it is intended primarily for the internal use of the RAMCloud team at Stanford, but may be useful to other people as well.
- General Information for Developers (how to get started as a RAMCloud developer)
- Build System Structure
- RAMCloud Tech Talks (Videos of RAMCloud developers describing the internals of various system components)
- Want to Contribute to RAMCloud? (notes for people who would like to contribute code to RAMCloud)
- Running Recoveries with recovery.py
- Coding Conventions
- Style Guide
- Documentation Guidelines
- Writing Unit Tests
- Amendments to Current Documentation and Testing Guidelines
- Software Design Philosophy – John Ousterhout’s pet peeves
- How To Measure Performance: John’s pet peeves (and ideas for a possible paper)
- RAMCloud C Style for EMACS
- Vim Settings
- Copyright Notice
- Mfence – x86 instructions for limiting instruction reordering
- Inside Concurrency Primitives
- Wireshark Plugin
- NetBeans IDE tips
- Measuring RAMCloud Performance
- Code review tool
- Git repo: see General Information for Developers
- IRC channel: #ramcloud on freenode. This is used to coordinate usage of the RAMCloud cluster. Anytime you are using the cluster you should be listening on this channel; if you don’t respond to comments on the channel, your jobs may be killed.
- Transcripts of this channel may be found here
- See rcres for coordinating usage of RAMCloud cluster.
- RAMCloud Cluster Resource manager (rcres): rcres is a shell command available on the “rcmaster” machine of the RAMCloud cluster. Any time you are using the cluster you should use rcres to lease the machines you are working on.
- Dumpstr tool for viewing reports (mostly performance data)
- Documentation, generated nightly from the source code
The RAMCloud Test Cluster
Information about the cluster we use for RAMCloud testing at Stanford. Unfortunately not all of this information is completely up to date.
- Cluster Intro – information about our cluster for newcomers
- New Contributor Checklist (how to set up access for new team members)
- Cluster Configuration – for sysadmins
- Cluster Custodian - rotating responsibility for managing the cluster and providing technical support
- Cluster Issues - central location for keeping track of problems in the cluster
- Cluster Inventory - includes notes about cluster setup and spare components
- Intel 530 Performance - recent performance issues with Intel 530 SSDs
- Cluster Tasks - (not so) recent issues with cluster machines
- Machine Evaluations
- Compiling RAMCloud on CentOS
- Tips from Charlie & Co
- Reimaging a Cluster Machine
- Installing New Software on the Cluster
- Controlling Machines Remotely via IPMI
- Updating BIOS automatically with PXE and FreeDOS
- Infiniband Tools and Debugging
- Updating Mellanox NIC Firmware (to eliminate limit on timeouts)
- Dead Machines
- New Infiniband Fabric Notes
- Mellanox HW and Infiniband Notes
- IPMI and Virtual KVM access to cluster machines (for BIOS and boot-time configuration)
New Cluster
- ATOM Cluster: Micro Modular Server Cluster – 132 ATOM servers
Design Notes
These documents were used at various points in the project to record our early ideas about various parts of the system. Most of these pages are now out of date (they typically are not updated once serious coding begins) but they may still provide useful background information as well as alternatives that we considered.
- Nanoscheduling
- Transaction Protocol (sinfonia style)
- Reliable Conditional Write
- Higher Level Data Models
- Transaction proposal: by Satoshi
- Old Design Documents
Project History, Schedules, Milestones
- Project History
- Linearizable RPC & TX progress
- RAMCloud 1.0 (used in 2012-2013 to track progress towards first usable release)
- Least Usable System - Candidates for the “next major goal” (early April 2011).
- Milestones from 2010
Ideas for Future Work
- HotOS 2015 Ideas and Votes
- Future Projects and PhD Topics
- Usability Features and Research Topics (April 2013)
- Rotation and CURIS Ideas
Related Topics
- LogCabin and Raft
- Garbage Collection Resources (April 2013)
- The Fastest Possible Datacenter Network (slides from Bill Dally talk, March 2012)
- Facebook Information
- References
- Interesting Links
Miscellaneous Topics
- Distributed Systems Reading Group
- Team Members
- Lunch Ideas
- Current Applications (applications that are using RAMCloud or considering it)
- SEDCL Retreat - Industrial Feedback
- Server Prices: sample server configurations and prices
- Memory Prices
- Interesting Statistics
- Old Miscellaneous Topics