Distributed Systems

A (hopefully) curated list of awesome material on distributed systems, inspired by other awesome lists such as awesome-python. Most links are readings on architecture rather than code.

Bootcamp

Read things here before you start.

CAP Theorem, and also a plain English explanation
Fallacies of Distributed Computing: expect things to break, everything
Distributed systems theory for the distributed engineer, most of the papers and books mentioned in this blog post reappear later in this list. Still a good breadth-first (BFS) approach to distributed systems.
FLP Impossibility Result (paper), and a blog post that is easier to follow
An Introduction to Distributed Systems, @aphyr’s excellent introduction to distributed systems

Books

Distributed Computing, by Hagit Attiya and Jennifer Welch
Impossibility Results for Distributed Computing
Distributed Algorithms, Nancy Lynch
Distributed Systems for fun and profit [Free]
Distributed Systems: Principles and Paradigms, Andrew Tanenbaum [Amazon Link]
Scalable Web Architecture and Distributed Systems [Free]
Principles of Distributed Systems [ETH Zurich University]
Making reliable distributed systems in the presence of software errors, the PhD thesis of Joe Armstrong (author of Erlang)
Designing Data-Intensive Applications [Amazon Link]

Papers

Must-read papers on distributed systems. Nearly all of Lamport’s work deserves a place here, but only a few essential papers are listed.

Time, Clocks, and the Ordering of Events in a Distributed System, Lamport’s paper, the quintessential distributed systems primer
Session Guarantees for Weakly Consistent Replicated Data, a ’94 paper that proposes session guarantees for eventually consistent systems. Many of its terms, such as monotonic reads and read your writes, are standard vocabulary in other distributed systems papers.
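
The logical clock rules from Lamport’s Time, Clocks paper listed above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the class name LamportClock and its methods tick and receive are hypothetical names chosen here. Each process increments its counter on a local event or send, and on receipt moves past the larger of its own time and the message timestamp.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock for one process (hypothetical helper, not from the paper)."""
    time: int = 0

    def tick(self) -> int:
        # Local event or message send: advance the counter by one.
        self.time += 1
        return self.time

    def receive(self, sent_time: int) -> int:
        # On receipt, move past both the local time and the sender's timestamp.
        self.time = max(self.time, sent_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
send_ts = a.tick()            # process A sends a message and stamps it
b.tick()                      # process B has two unrelated local events
b.tick()
recv_ts = b.receive(send_ts)  # B receives A's message
print(send_ts, recv_ts)
```

The receive timestamp always ends up larger than the send timestamp, which is the property that lets timestamps respect the happened-before order.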

Storage & Databases

Dynamo: Amazon’s Highly Available Key Value Store
Bigtable: A Distributed Storage System for Structured Data
The Google File System
Cassandra: A Decentralized Structured Storage System Inspired heavily by Dynamo
CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data, the placement algorithm underlying the Ceph distributed storage system. For the architecture itself, read RADOS.

Messaging systems

The Log: What every software engineer should know about real-time data’s unifying abstraction, a somewhat long read, but a brilliant treatment of logs, which are at the heart of most distributed systems
Kafka: a Distributed Messaging System for Log Processing

Distributed Consensus and Fault-Tolerance

Practical Byzantine Fault Tolerance
The Byzantine Generals Problem
The Part-Time Parliament, Lamport’s original Paxos paper, a bit difficult to understand, may require multiple passes
Paxos Made Simple, a more readable Paxos paper by Lamport himself. Shorter and easier than the original.
The Chubby Lock Service for loosely coupled distributed systems, Google’s lock service for loosely coupled distributed systems. Sort of Paxos as a Service for building other distributed systems. Primary inspiration behind other service discovery and coordination tools like ZooKeeper, etcd, Consul etc.
Paxos Made Live - An Engineering Perspective, Google’s lessons from implementing systems on top of Paxos. Demonstrates various practical issues encountered while implementing a theoretical concept.
Raft Consensus Algorithm, an alternative to Paxos for distributed consensus that is designed to be easier to understand. Do check out an interesting visualization of Raft.
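
Paxos and Raft, listed above, both depend on majority quorums: any two majorities of the same cluster share at least one node, so a later decision always overlaps with an earlier one. The sketch below is not taken from any of these papers. It is a brute-force check of that property, and the function names majority, majorities_always_intersect and tolerated_failures are hypothetical.

```python
from itertools import combinations


def majority(n: int) -> int:
    # Smallest number of nodes that is more than half of n.
    return n // 2 + 1


def majorities_always_intersect(n: int) -> bool:
    # Brute-force check: every pair of majority-sized subsets shares a node.
    quorums = [set(c) for c in combinations(range(n), majority(n))]
    return all(x & y for x, y in combinations(quorums, 2))


def tolerated_failures(n: int) -> int:
    # A majority must stay reachable, so at most n - majority(n) nodes may fail.
    return n - majority(n)


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n), majorities_always_intersect(n))
```

Under this arithmetic a 4-node cluster tolerates no more failures than a 3-node one, which is one reason odd cluster sizes are the usual choice.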

Testing, monitoring and tracing

Designing distributed systems is hard enough; testing them is even harder.

Dapper, Google’s large scale distributed-systems tracing infrastructure, this was also the basis for the design of open source projects such as Zipkin, Pinpoint and HTrace.

Programming Model

Courses

Reliable Distributed Algorithms, Part 1, KTH Sweden
Reliable Distributed Algorithms, Part 2, KTH Sweden
Cloud Computing Concepts, University of Illinois
CMU: Distributed Systems in Go Programming Language
Software Defined Networking, Georgia Tech
ETH Zurich: Distributed Systems
ETH Zurich: Distributed Systems Part 2, covers distributed control algorithms, communication models and fault tolerance, among other things. In particular, fault tolerance issues (models, consensus, agreement) and replication issues (2PC, 3PC, Paxos), which are critical to understanding distributed systems, are explained in great detail.

Blogs and other reading links

Notes on Distributed Systems for Young Bloods
High Scalability, architectures of several huge internet services, e.g. Twitter and WhatsApp
There is No Now, Problems with simultaneity in distributed systems
Turing Lecture: The Computer Science of Concurrency: The Early Years, An article by Leslie Lamport on concurrency
The Paper Trail blog, a very readable blog covering various aspects of distributed systems
aphyr, the posts in the Jepsen series are pretty awesome
All Things Distributed, the blog of Werner Vogels (Amazon CTO) on distributed systems
Distributed Systems: Take Responsibility for Failover
The C10K problem
On Designing and Deploying Internet-Scale Services
Files are hard, a blog post on filesystem consistency, pretty important to read if you are into distributed storage or databases
Distributed Systems Testing: The Lost World, a well-researched blog post on why testing distributed systems is hard, with many links to various approaches and other papers

Meta Lists

Other lists like this one

Readings in distributed systems
Distributed Systems meta list
List of required readings for Distributed Systems, part of CMU’s Engineering Distributed Systems course
The Distributed Reader
A Distributed Systems Reading List, A collection of material, mostly papers on Distributed Systems Theory as well as seminal industry papers
Distributed Systems Readings, A comprehensive list of online courses related to distributed systems

28 February 2018