Principles of Reliable Systems

Lessons from Erlang

Strange Loop 2012

Garrett Smith, CloudBees

@gar1t

Presenter Notes

Reliability

Quality Wars, Circa 1980s

Things that break suck

Things that keep going are awesome

Introducing Erlang

99.9999999% Uptime

Erlang's Roots: PLEX

  • Pseudo-parallel, event-driven real-time programming language
  • Dedicated for AXE telephone exchanges
  • Built in the 1970s by Göran Hemdahl at Ericsson
  • Effective, but expensive to use (low level, complex)

A New Language!

  • OS-independent virtual machine
  • Massive fine-grained concurrency
  • Asynchronous message passing
  • Reliability over performance
  • Functional pragmatism over purity

Concurrency So Easy...

The Principles

  • Isolation
  • Fault detection and recovery
  • Separation of concerns
  • Black box design
  • State management
  • Avoid complexity

Isolation

Isolation All Around Us

  • Memory
  • Threads
  • Files
  • Disks
  • CPU Cores
  • Network Interfaces
  • Networks
  • Racks
  • Data Centers

Fault Detection and Recovery

Failing

  • Have to be able to detect failure
  • "Fail fast"
  • Avoid defensive measures
  • Limit the scope of failure

Recovery

Courtesy of South Park

  • Unplug the Internet
  • Wait five seconds
  • Plug Internet back in

Reboot Fixes Lots of Things

Even an F1 Front Wing!

Separation of Concerns

Small, Focused, Independent

  • Easier to reason about
  • Easier to test
  • Isolation effect - limited scope for change

Black Box Design

Appliances FTW!

  • Easy to set up (just plug in?)
  • Start button
  • Minimal controls
  • Reboot to fix

State Management

The Thing About State

  • Durability -> Recovery
  • Replication -> Failover
  • Integrity -> Repair
  • Consistency -> Synchronization

Four Stages of State Management

Session Failover

Courtesy of Oracle

Session Punted

Avoid Complexity

Signs of Complexity

  • Dependencies
  • Nesting / Hierarchies
  • Resource Sharing
  • Lots of Code
  • Fear

Simple = Reliable

Step-by-Step Guide to All This

OS Processes Isolation

  • No shared memory
  • Communicate via "message passing" (stdio, sockets, pipes)
  • Process termination (i.e. "fault") detection
  • Techniques
    • Standard IO "servers"
    • 0MQ (lightweight inter-process communication via messages)
    • TCP / HTTP
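
A minimal sketch of a standard IO "server", using only Python's standard library. The child program, its one-line-per-request protocol, and the names `CHILD_SOURCE`, `start_server`, `call`, and `demo` are all assumptions made for illustration. The parent shares no memory with the child, talks to it only over pipes, and detects the fault from the closed pipe and the exit status.

```python
import subprocess
import sys

# Hypothetical child "server": one request per line on stdin, one reply per
# line on stdout. The request "crash" makes it exit with status 3.
CHILD_SOURCE = r"""
import sys
for line in sys.stdin:
    request = line.strip()
    if request == "crash":
        sys.exit(3)
    print(request.upper(), flush=True)
"""


def start_server():
    """Start the child as a separate OS process connected by pipes."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def call(server, request):
    """Send one line and read one line back; '' means the child closed its end."""
    server.stdin.write(request + "\n")
    server.stdin.flush()
    return server.stdout.readline().strip()


def demo():
    server = start_server()
    reply = call(server, "hello")
    crash_reply = call(server, "crash")  # child exits, so the read hits end of file
    exit_code = server.wait()            # fault detection: non-zero exit status
    server.stdin.close()
    server.stdout.close()
    return reply, crash_reply, exit_code
```

Calling `demo()` sends one normal request and one that kills the child. The parent keeps running and learns about the failure from the exit status.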

Actors

  • No shared memory (semantically)
  • Queues to process messages
  • Inter-thread communication via queue inserts (message passing)
  • Direct language support: Scala, Go, Erlang
  • Libraries: Kilim (Java), Pykka (Python), Celluloid (Ruby), libcppa (C++)
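
The queue-based model above can be sketched with plain threads and `queue.Queue` from Python's standard library. This does not use the API of any library named on the slide. `CounterActor`, its message names, and `demo` are hypothetical.

```python
import queue
import threading


class CounterActor:
    """Hypothetical actor: its state changes only in response to messages."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message, reply_to=None):
        """Message passing is a queue insert; the sender never touches state."""
        self._mailbox.put((message, reply_to))

    def _run(self):
        # Messages are handled one at a time, in arrival order.
        while True:
            message, reply_to = self._mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self._count += 1
            elif message == "get":
                reply_to.put(self._count)

    def join(self):
        self._thread.join()


def demo():
    actor = CounterActor()
    for _ in range(3):
        actor.send("increment")
    reply_box = queue.Queue()
    actor.send("get", reply_to=reply_box)
    total = reply_box.get(timeout=5)
    actor.send("stop")
    actor.join()
    return total
```

The threads still share an address space, so the isolation here is by convention only, which is what "semantically" means in the first bullet.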

Fail Fast

  • Avoid defensive practices
  • Let exceptions propagate as far as possible
  • Use assertions and leave them in!
  • Exiting the process is not a bad idea
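
A small sketch of the difference, using a hypothetical port parser (`parse_port_defensive` and `parse_port` are illustrative names). One Python-specific assumption: `assert` statements are stripped when the interpreter runs with `-O`, so a check that must stay in is written as an explicit `raise`.

```python
def parse_port_defensive(raw):
    """Defensive style: swallows the error and hides the bad input."""
    try:
        return int(raw)
    except ValueError:
        return 80  # silently "fixed"; the real problem surfaces much later


def parse_port(raw):
    """Fail-fast style: no try/except, bad input stops us right here."""
    port = int(raw)  # raises ValueError on garbage and lets it propagate
    if not 0 < port < 65536:
        # An assertion that stays in: a plain `assert` would vanish under -O.
        raise AssertionError(f"port out of range: {port}")
    return port
```

With `parse_port`, a bad value fails at the point where it enters the program, and the caller (or a supervisor) decides what to do next.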

Process Supervision

  • Process monitors: runit, launchd
  • Standard IO "servers"
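
A toy supervisor, sketched with Python's standard library, shows the basic loop: run the worker, watch its exit status, and restart it. It is not how runit or launchd are implemented. The worker and the `max_restarts` limit are assumptions for illustration, and the worker fails on its first two attempts only so that the run is predictable.

```python
import subprocess
import sys

# Hypothetical worker: argv[1] is the attempt number, and the worker exits
# with status 1 on attempts 1 and 2, then with status 0.
WORKER_SOURCE = "import sys; sys.exit(0 if int(sys.argv[1]) >= 3 else 1)"


def supervise(max_restarts=5):
    """Run the worker, restarting after each abnormal exit.

    Returns the list of exit codes seen, oldest first.
    """
    exit_codes = []
    for attempt in range(1, max_restarts + 2):  # first run plus the restarts
        result = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE, str(attempt)]
        )
        exit_codes.append(result.returncode)
        if result.returncode == 0:
            break  # clean exit, nothing left to supervise
    return exit_codes
```

Here `supervise()` restarts the worker twice and stops after the clean exit. A real process monitor would typically also log each restart and pause between attempts.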

Think Small

  • Narrowing the scope of an “application”
  • Appliance-oriented development
  • Micro SOA
  • Functional-style programming (e.g. keep the average function under 4 lines)

Invest in Simplicity

  • If it's not obvious, work until it becomes obvious
  • Take small steps, doing what's clearly the next thing
  • Avoid building for the "future"

And In Conclusion...

Twitter FTW!

@gar1t
