Principles of Reliable Systems

Lessons from Erlang

Strange Loop 2012

Garrett Smith, CloudBees

@gar1t

Presenter Notes

Reliability

Quality Wars, Circa 1980s

Things that break suck

Things that keep going are awesome

Introducing Erlang

99.9999999% Uptime

Erlang's Roots: PLEX

Pseudo-parallel, event-driven real-time programming language
Dedicated for AXE telephone exchanges
Built in the 1970s by Göran Hemdahl at Ericsson
Effective, but expensive to use (low level, complex)

A New Language!

OS independent virtual machine
Massive fine grained concurrency
Asynchronous message passing
Reliability over performance
Functional pragmatism over purity

Concurrency So Easy...

The Principles

Isolation
Fault detection and recovery
Separation of concerns
Black box design
State management
Avoid complexity

Isolation

Isolation All Around Us

Memory
Threads
Files
Disks
CPU Cores
Network Interfaces
Networks
Racks
Data Centers

Fault Detection and Recovery

Failing

Have to be able to detect failure
"Fail fast"
Avoid defensive measures
Limit the scope of failure

Recovery

Courtesy of South Park

Unplug the Internet
Wait five seconds
Plug Internet back in

Reboot Fixes Lots of Things

Even an F1 Front Wing!

Separation of Concerns

Small, Focused, Independent

Easier to reason about
Easier to test
Isolation effect - limited scope for change

Black Box Design

Appliances FTW!

Easy to set up (just plug in?)
Start button
Minimal controls
Reboot to fix

State Management

The Thing About State

Durability -> Recovery
Replication -> Failover
Integrity -> Repair
Consistency -> Synchronization

Four Stages of State Management

Session Failover

Courtesy of Oracle

Session Punted

Avoid Complexity

Signs of Complexity

Dependencies
Nesting / Hierarchies
Resource Sharing
Lots of Code
Fear

Simple = Reliable

Step-by-Step Guide to All This

OS Process Isolation

No shared memory
Communicate via "message passing" (stdio, sockets, pipes)
Process termination (i.e. "fault") detection
Techniques
- Standard IO "servers"
- 0MQ (lightweight inter-process communication via messages)
- TCP / HTTP
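
A minimal sketch of the standard IO "server" technique, assuming a line-per-request protocol invented for this example. The names `CHILD_SOURCE`, `start_server`, and `call` are hypothetical. The parent shares no memory with the child, talks to it only through pipes, and detects the fault when the pipe closes and the child reports a non-zero exit code.

```python
import subprocess
import sys

# Hypothetical child "server": reads one request per line on stdin,
# answers on stdout, and exits non-zero when it hits a fault.
CHILD_SOURCE = r"""
import sys
for line in sys.stdin:
    request = line.strip()
    if request == "crash":
        sys.exit(3)  # simulate a fault
    print(request.upper(), flush=True)
"""


def start_server():
    """Start the child in its own OS process, connected by pipes only."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def call(server, request):
    """Send one line; return the reply line, or None if the child has died."""
    try:
        server.stdin.write(request + "\n")
        server.stdin.flush()
    except OSError:  # includes BrokenPipeError
        return None
    reply = server.stdout.readline()  # empty string means end of file
    return reply.rstrip("\n") if reply else None


def demo():
    server = start_server()
    ok = call(server, "hello")       # normal request
    failed = call(server, "crash")   # child exits instead of answering
    exit_code = server.wait(timeout=10)  # fault detection via exit status
    server.stdout.close()
    server.stdin.close()
    return ok, failed, exit_code
```

Because the fault is confined to the child process, the parent is free to decide what happens next: start a fresh child, report the error, or give up.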

Actors

No shared memory (semantically)
Queues to process messages
Inter-thread communication via queue inserts (message passing)
Direct language support: Scala, Go, Erlang
Libraries: Kilim (Java), Pykka (Python), Celluloid (Ruby), libcppa (C++)
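
The queue-based actor idea can be sketched with nothing but the Python standard library. This is an illustration, not how any of the libraries above are implemented. `CounterActor` and its message names are hypothetical. The counter is touched only by the actor's own thread, and every other thread interacts with it by inserting messages into the mailbox queue.

```python
import queue
import threading


class CounterActor:
    """Hypothetical actor: owns its state, accepts messages via a queue."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # read and written only by the actor's thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message, reply_to=None):
        """Message passing is a queue insert; the caller never sees the state."""
        self._mailbox.put((message, reply_to))

    def _run(self):
        # Messages are handled one at a time, in arrival order.
        while True:
            message, reply_to = self._mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self._count += 1
            elif message == "get" and reply_to is not None:
                reply_to.put(self._count)

    def join(self, timeout=None):
        self._thread.join(timeout)


def demo():
    actor = CounterActor()
    for _ in range(3):
        actor.send("increment")
    replies = queue.Queue()  # the caller's own mailbox for the answer
    actor.send("get", reply_to=replies)
    total = replies.get(timeout=5)
    actor.send("stop")
    actor.join(timeout=5)
    return total
```

Threads in one Python process still share memory physically, so the isolation here is a convention ("semantically" no shared memory), enforced only by never touching `_count` from outside.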

Fail Fast

Avoid defensive practices
Let exceptions propagate as far as possible
Use assertions and leave them in!
Exiting the process is not a bad idea
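
A small sketch of what this can look like in practice, using a hypothetical `apply_discount` function. The checks are written as explicit `raise` statements because Python removes `assert` statements when run with `-O`, so this is one way to "leave them in". Nothing catches the exception: bad input ends the process with a traceback and a non-zero exit status.

```python
import sys


def apply_discount(price_cents, percent):
    """Return the discounted price; refuse to continue on nonsense input."""
    # Checks that stay active in every run mode (unlike `assert` under -O).
    if not isinstance(price_cents, int) or price_cents < 0:
        raise ValueError(f"invalid price: {price_cents!r}")
    if not 0 <= percent <= 100:
        raise ValueError(f"invalid percent: {percent!r}")
    return price_cents * (100 - percent) // 100


def main(argv):
    # No try/except here: a bad argument raises, the exception propagates
    # to the top level, and the process exits with an error status.
    price_cents, percent = int(argv[1]), int(argv[2])
    print(apply_discount(price_cents, percent))


if __name__ == "__main__":
    main(sys.argv)
```

The alternative, returning a default or silently clamping the value, keeps the process alive but hides the fault from whatever is watching it.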

Process Supervision

Process monitors: runit, launchd
Standard IO "servers"
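
To make the idea concrete, here is a toy supervisor loop. It is a sketch of the restart-on-failure pattern only and does not reflect how runit or launchd work internally. The function name `supervise` and its parameters are assumptions made for this example.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts=3, delay_seconds=0.1):
    """Run `command`, restarting it after each non-zero exit.

    Returns the list of exit codes observed. Gives up once the command
    has been restarted `max_restarts` times and has failed again.
    """
    exit_codes = []
    while True:
        code = subprocess.call(command)  # blocks until the child exits
        exit_codes.append(code)
        if code == 0 or len(exit_codes) > max_restarts:
            return exit_codes
        time.sleep(delay_seconds)  # brief pause before the restart


def demo():
    # Hypothetical worker that always fails, to exercise the restart path.
    failing_worker = [sys.executable, "-c", "import sys; sys.exit(1)"]
    return supervise(failing_worker, max_restarts=2, delay_seconds=0)
```

The restart limit matters: a supervisor that restarts forever turns a persistent fault into a busy loop instead of surfacing it.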

Think Small

Narrowing the scope of an “application”
Appliance oriented development
Micro SOA
Functional-style programming (e.g. keep the average function under 4 lines)

Invest in Simplicity

If it's not obvious, work until it becomes obvious
Take small steps, doing what's clearly the next thing
Avoid building for the "future"

And In Conclusion...

Twitter FTW!

@gar1t
