Everything I know
Getting a million users is infinitely harder than scaling a system to handle a million users. Most systems could run comfortably on a Raspberry Pi.
Fault-tolerant designs treat failures as routine. In large-scale systems, the assumption is that every component will fail sooner or later, so any individual failure must be presumed imminent and, across the whole system, component failures must be expected continuously.
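One common way to treat failures as routine is to wrap calls to other components in a retry loop with exponential backoff and jitter. The sketch below is only an illustration under assumptions: `call_with_retries` and its parameters are hypothetical names, not taken from any library listed here, and it retries on every exception, whereas a real system would retry only errors it knows to be transient.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); on failure, wait and try again, up to max_attempts times.

    Hypothetical helper for illustration. `sleep` and `rng` are injectable
    so the behaviour can be tested without real waiting or randomness.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not all retry at the same moment.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)
```

A caller would pass a zero-argument function, for example `call_with_retries(lambda: fetch_profile(user_id))`, where `fetch_profile` stands in for any call that can fail. Retries are only safe when the operation is idempotent.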
Setting up containers, load balancing, and service discovery on light hardware
Ask HN: Any recommended resources to develop system thinking? (2018)
Distributed Systems in One Lesson by Tim Berglund (2017)
- Modern HTTP reverse proxy and load balancer that makes deploying microservices easy.
Hello World with Traefik
Traefik Training course resources
- Standard library for microservices written in Go.
Fear and Loathing in Lock-Free Programming (2017)
Reliable Systems Series: Model-Based Testing (2018)
Awesome Distributed Systems
- Cloud-Native API Gateway & Service Mesh.
- Distributed message broker.
- Tool for building distributed applications.
- Raft distributed consensus algorithm implemented in Rust.
- Hashicorp's Raft implementation.
In Search of an Understandable Consensus Algorithm
- Technical specifications for the libp2p networking stack.
Class materials for a distributed systems lecture series
Raft Consensus Algorithm
- Global dataset version control system (GDVCS) built on the distributed web.
- Meaningful control of data in distributed systems.
- Collection of modules for building realtime client-server networked applications.
- Framework for formally verifying distributed systems implementations in Coq.
PingCAP Talent Plan
- Series of training courses about writing distributed systems in Go and Rust.
- Build protocols, systems, and tools to improve the internet.
- Open source R&D affinity. Exploring the potential of new and existing technologies in crypto-space to encourage horizontal group collaboration.
- Web developers, facilitators, crypto-engineers. Experts in Node.js & distributed systems.
- Build highly concurrent, distributed, and resilient message-driven applications on the JVM.
- Provides reusable infrastructure for formally verifying distributed systems using the Coq proof assistant.
Practical Networked Applications in Rust, Part 1: Non-Networked Key-Value Store
- Fully Decentralized Fully Replicated Key/Value Store.
- Curated selection of artisanal consensus algorithms and hand-crafted distributed lock services.
- Tool for collecting detailed systems performance telemetry and exposing burst patterns through high-resolution telemetry.
- Distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
- Open source, distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.
- Fault tolerant, protocol-agnostic RPC system.
How To Build A Modern Distributed Compute Platform (2018)
- Resiliency tool that helps applications tolerate random instance failures.
- Python Stream Processing.
"Consistency without consensus in production systems" by Peter Bourgon (2014)
Distributed consensus reading list
- Community version of fully distributed, highly scalable and fault tolerant workflow orchestration platform for JVM.
- Helps you deploy and run Linkerd, the fully open source, ultralight service mesh.
- Runtime system for scaling irregular applications on commodity clusters.
MIT Distributed Systems course (2020)
Correctness proofs of distributed systems with Isabelle/HOL (2019)
- Cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks.
- Fast, efficient, and scalable distributed map/reduce system with DAG execution, in memory or on disk, written in pure Go; runs standalone or distributed.
Learning Distributed Systems - Cloud Native Podcast
- Distributed reliable key-value store for the most critical data of a distributed system.
- Command-line tool for operating an etcd cluster. It makes it easy to create a new cluster, add a member to, or remove a member from an existing cluster.
Learning to build distributed systems (2019)
- Toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
How to get started with infrastructure and distributed systems (2016)
Advanced Napkin Math: Estimating System Performance from First Principles (2019)
- Uber ringpop based distributed and decentralized rate limiter.
System Design lectures (2020)
- Patterns of Scalable, Reliable, and Performant Large-Scale Systems.
LeetCode System Design Questions
Grokking the System Design Interview
Amazon Builders' Library
- How Amazon builds and operates software.
Distributed Systems Wiki
- Distributed Systems Safety Research.
- Distributed RTC system written in pure Go and Flutter.
Challenges with distributed systems
Systems design for Advanced Beginners (2020)
Performance Under Load (2018)
- Distributed, fault-tolerant pipeline for runtime data.
List of distributed systems reading lists
Complexities of Capacity Management for Distributed Services (2020)
Hermes: a Fast, Fault-Tolerant and Linearizable Replication Protocol (2020)
WormSpace: A Modular Foundation for Simple, Verifiable Distributed Systems
Paxos vs Raft: Have we reached consensus on distributed consensus? (2020)
Teleforking a process onto a different computer! (2020)
Debugging Distributed Systems
Distributed systems for fun and profit
- Open source microservices orchestration engine for running mission critical code at any scale.
Why I joined Temporal
- Distribution of Temporal that runs as a single process with zero runtime dependencies.
- Model checker for implementing distributed systems.
Arvind Krishnamurthy's research
Distributed Services with Go
Fully asynchronous C implementation of the Raft consensus protocol
Notes on Distributed Systems for Young Bloods (2013)
- Rust implementation of a distributed consensus algorithm based on Leslie Lamport's Paxos.
- Network event stream processing system, in Clojure.
Collection of the papers, conference talks, articles, blog posts, interesting Twitter threads, HN/reddit comments on systems engineering
Tess Rinearson - All Together Now: An Introduction to Distributed Consensus (2019)
- Open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
- Lightweight tool for submitting Python functions for computation within a Slurm cluster.
Readings in Distributed Systems
Control theory for fun and profit (2020)
Understanding Replication in Databases and Distributed Systems (2018)
A plain English introduction to CAP theorem
Debugging Incidents in Google's Distributed Systems (2020)
- Programmable, observable, and distributed job orchestration system that allows for the scheduling, management, and unattended background execution of user-created tasks on Linux-based systems.
Verifying Strong Eventual Consistency in Distributed Systems (2017)
Patterns of Distributed Systems (2020)
Keeping CALM: When Distributed Consistency Is Easy (2020)
Distributed Systems Notes
Avoiding fallback in distributed systems
The Reactive Principles
- Design Principles for Distributed Applications.
- Framework that implements WPaxos and other Paxos protocol variants.
- Learn about network programming, concurrency, distributed systems, and more as you tackle the challenge of implementing the Raft distributed consensus algorithm.
Resources for learning distributed systems (2020)
Workload isolation using shuffle-sharding (2020)
Consensus is Harder Than It Looks (2020)
The Little Strangler
A Review of Consensus Protocols (2020)
Disel: Distributed Separation Logic
- Separation-style logic for compositional verification of distributed systems.
- Implementation of the Raft consensus algorithm on top of the act-zero actor framework.
- Application to simulate and test a Raft cluster, using raft-zero.
Building Netflix’s Distributed Tracing Infrastructure (2020)
Wikipedia's self-hosted CDN (2020)
Infinite Parallel Universes: State at the Edge (2020)
Awesome Chaos Engineering
How you could have come up with Paxos yourself (2020)
- Open source, easy-to-use and high-scale distributed tracing backend.
Principles of chaos engineering
Chaos Experimentation, an open-source framework built on top of Envoy Proxy (2021)
Testing Distributed Systems
- Curated list of resources on testing distributed systems.
Pegasus: Tolerating Skewed Workloads in Distributed Storage with In-Network Coherence Directories (2020)
Notes on Paxos (2020)
This is why distributed systems are useful (and I am building one) (2020)
Distributed Systems lecture series by Martin Kleppmann (2020)
- Distributed, fault tolerant job scheduling system for cloud native environments.
- Industrial-grade C++ implementation of the RAFT consensus algorithm.
Distributed Systems course (2020)
- Consensus library implementing the Mir consensus protocol.
Fairness in multi-tenant systems (2020)
Advanced Distributed Systems Design course
Raft implementation in Go
Load Shedding Strategies
- Demonstration of load shedding and how it can make your services more resilient in outages and come back online quicker.
A Byzantine failure in the real world (2020)
Byzantine Eventual Consistency
Interval Tree Clocks (2020)
Distributed Systems Reading List
- Decentralized shared state.
Understanding Connections & Pools (2021)
Awesome distributed transactions
Rystsov's Blog on distributed systems
- Scaling Replicated State Machines with Compartmentalization.
DistSys Reading Group
CASPaxos: Replicated State Machines without logs (2018)
Consensus: Bridging Theory and Practice
- PhD dissertation on the Raft consensus algorithm.
The Fundamental Mechanism of Scaling (2021)
- Simple, universal API for building distributed applications. Accelerating machine learning workloads.
- Framework for distributed systems verification, with fault injection. Clojure library.
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (2019)
Distributed Systems in Rust
- Training course about distributed systems in Rust.
- Raft implementation in Rust.
Implementing Raft's Leader Election in Rust (2021)
Effective Fallbacks (2020)
Ask HN: Recommended books and papers on distributed systems? (2021)
Raft implementation in Rust language
- Fast linearizability checker for testing the correctness of distributed systems.
Testing Distributed Systems for Linearizability (2017)
- Programmable Fuzzy Scheduler for Testing Distributed Systems.
Engineering Dependability and Fault Tolerance in a Distributed System (2021)
Autopilot: workload autoscaling at Google (2020)
- Byzantine-fault-tolerant protocol for synchronizing time among a group of peers, without reliance on any external time authority.
Foundational Distributed Systems Papers (2021)
Making reliable distributed systems in presence of software errors by Joe Armstrong (2003)
- Distributed chat system which can be used as chat rooms or state synchronization.
- Workbench for learning distributed systems by writing your own.
An introduction to lockless algorithms (2021)
Distributed Systems Course
Sundial: Fault-tolerant Clock Synchronization for Data Centers (2021)
Achieving reliable dual writes in distributed systems (2021)
Paxos Made Simple (2016)
- Distributed Computing for AI Made Simple.
Raft Implementation & CLI Visualization in Rust
Ask HN: Learning Distributed Systems as a Junior Engineer (2021)
The Distributed Reading List
- Library that simplifies writing distributed programs by seamlessly launching them on a variety of different platforms.
The Problem of Distributed Consensus (2021)
A robust distributed locking algorithm based on Google Cloud Storage (2021)
- Build, share, and run your distributed applications.
- Guides, Articles, Podcasts, Videos and Notes to Build Reliable Large-Scale Distributed Systems.
Building a Raft (2021)
Time, clocks, and order. (2020)
- Look at the notion of time in a distributed system, and its effects on ordering.
The Generals (2020)
- Look at the Two Generals' and Byzantine Generals' problem, two popular consensus problems.
Impossibility of Distributed Consensus with One Faulty Process (2020)
The CAP Theorem (2020)
Metastability and Distributed Systems (2021)
Distributed Systems Course (2021)
Metastable Failures in Distributed Systems (2021)
Distributed Systems Engineering Course Notes (2015)
- High performance, distributed and low latency publish-subscribe platform.
Patterns of Distributed Systems: Lamport Clock (2021)
Make your cluster SWIM (2020)
- Tool for designing complex distributed systems, allowing you to simulate data flow with customizable components.
Patterns of Distributed Systems: Follower Reads (2021)
Getting To Know Logical Clocks By Implementing Them (2021)
Paxos vs Raft: Have we reached consensus on distributed consensus? (2021)
Consistency and Consensus – How Do Paxos and Raft Work? (2021)
Summer Blog Backlog: Distributed Systems (2021)
Fanouts and Percentiles (2020)
Distributed Tracing — we’ve been doing it wrong (2019)
How To Design A Reliable Distributed Timer (2021)
- WAL-is-data engine used to store multi-raft logs.
Three Clocks are Better than One
RAMP up your distributed transactions (2021)
Errors found in distributed protocols
Python for Distributed Systems (2021)
- High-Performance Byzantine Fault Tolerant Settlement.
Distributed consensus made simple (for real this time!) (2021)
Hints and Principles for Computer System Design (2021)