Introduction

MIT 6.824: Distributed Systems

Lectures

    1. GFS
    2. Fault Tolerance Raft (2)
    3. Zookeeper
    4. Cloud Replicated DB, Aurora
    5. Distributed Transactions
    6. Spanner
    7. Big Data: Spark

Lecture 1: Introduction

Overview

  • Infrastructure
    • Storage
    • Communication
    • Computation
  • Implementations
    • RPC
    • threads
    • concurrency

  • performance and scalability

    • double the computers, double the throughput
    • Fault Tolerance
      • Availability
      • Recoverability
      • Non-volatile Storage
      • Replication
  • Consistency

    • example: a key/value service

Google MapReduce

  • simple for a programmer to write a parallel distributed computation
    • programmer gives a map function and a reduce function
    • the MapReduce program distributes the workload to Google servers
      • the programmer does not need to worry about the distributed system itself!

  • word count is the classic example (see the sketch below)

  • no need to worry about fault-tolerance!
    • if a single worker fails, rerun it!
  • GFS - file system that is distributed

  • the intermediate output has to be regrouped by key between the map and reduce phases
    • this is the shuffle, and it is expensive
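
A minimal word-count sketch in Go of the two functions the programmer supplies (the names Map, Reduce, and KeyValue here are illustrative, not the exact lab API):

package main

import (
    "fmt"
    "strings"
)

// KeyValue is an intermediate pair emitted by Map.
type KeyValue struct {
    Key   string
    Value string
}

// Map emits (word, "1") for every word in one input split.
func Map(contents string) []KeyValue {
    var kvs []KeyValue
    for _, w := range strings.Fields(contents) {
        kvs = append(kvs, KeyValue{Key: w, Value: "1"})
    }
    return kvs
}

// Reduce receives every value emitted for one key and sums the counts.
func Reduce(key string, values []string) string {
    return fmt.Sprintf("%d", len(values))
}

func main() {
    // The framework would split the input, run Map on many workers,
    // shuffle by key, and run Reduce; here we just call them once.
    fmt.Println(Map("the quick brown fox the fox"))
    fmt.Println(Reduce("fox", []string{"1", "1"}))
}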

Lecture 3: GFS

  • GFS published in 2003, MapReduce in 2004, Hadoop/HDFS in 2006

Big Storage

  • performance -> sharding
  • sharding across many machines -> constant faults
  • faults -> need tolerance, via replication
  • replication -> potential inconsistency
  • strong consistency -> low performance
  • strong vs weak consistency

  • a failed last write can leave replicas disagreeing (stale or corrupt data)

GFS

  • big, fast
  • global
  • sharding
  • automatic recovery
  • single data center (really :O)
  • internal use
  • big sequential access (not random)
  • single master!
    • MapReduce has a single master too, but failure is so unlikely it's fine to rerun the whole job

  • master data (sketched below)
    • file name -> chunk handles
    • per chunk handle:
      • list of chunk servers (CS)
      • primary, version number (v)
      • lease expiration
    • LOG + CHECKPOINT on DISK
      • append to the log efficiently
  • READ
    • client sends the file name and offset to the master
    • master returns the chunk handle and list of chunk servers
    • client reads from a chunk server, which sends the data back

  • WRITE
    • if there is no primary, the master:
    • finds the up-to-date replicas
    • picks a primary (P) and secondaries (S)
    • increments the version #
  • problem of split brain
    • network partition
    • give the primary a lease (with a timer)
      • the master knows who holds the lease and can wait for it to expire before picking a new primary

  • the other replicas are secondaries
  • the workload is mostly appends
  • the primary asks the secondaries if they can apply the write
    • it only commits if they all promise they can
    • what if the primary crashes mid-write?
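
A rough Go sketch (my own reconstruction, not GFS's actual code) of the master metadata listed above:

package main

import "time"

// ChunkInfo is the per-chunk state the master keeps in memory.
type ChunkInfo struct {
    Servers []string  // chunk servers holding a replica
    Version uint64    // chunk version number, bumped when a new primary is designated
    Primary string    // current primary replica, empty if none
    Lease   time.Time // when the primary's lease expires
}

// Master metadata: file name -> ordered chunk handles -> per-chunk info.
type Master struct {
    Files  map[string][]uint64   // file name -> chunk handles
    Chunks map[uint64]*ChunkInfo // chunk handle -> replica/version/lease state
}

func main() {
    m := &Master{
        Files:  map[string][]uint64{"/logs/web-00": {101, 102}},
        Chunks: map[uint64]*ChunkInfo{101: {Servers: []string{"cs1", "cs2", "cs3"}, Version: 7}},
    }
    _ = m // the log/checkpoint on disk persists the parts that can't be rebuilt by asking chunk servers
}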

Lecture 6: Fault Tolerance: Raft (1)

  • RAFT
    • state machine replication correct
  • MapReduce and GFS rely on a single master to make decisions
    • simple, but single point of failure :/
  • Split Brain

  • two clients and two servers
    • what if a client crash?
    • what if a server crash?

  • split brain: C1 talks only to S1 and C2 talks only to S2, so the two copies of the data can become inconsistent
    • this is a network partition
      • solution: majority vote with odd number of servers
        • example 2 out of 3 majority
        • quorum
        • with 2f+1 servers you can tolerate f failures, because any f+1 form a majority (small sketch below)
      • Paxos
      • VSR
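
The quorum arithmetic as a tiny Go sketch (illustrative only):

package main

import "fmt"

// With n = 2f+1 servers, a majority is f+1, so any two majorities overlap
// and the system keeps going with up to f failed servers.
func majority(n int) int { return n/2 + 1 }

func main() {
    for _, n := range []int{3, 5, 7} {
        fmt.Printf("n=%d: majority=%d, tolerates %d failures\n", n, majority(n), (n-1)/2)
    }
}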

Raft

  • K/V server
    • table
    • Raft layer
      • log of operations
    • replicated
  • Clients: C1, C2
    • put(k, v)
    • get(k)

  • for an operation
    • enters the leader's K/V server, then goes down into its Raft layer
      • once a majority of the replicas have the entry (majority vote coordination)
        • it is committed and sent up to the key/value state machine

  • AE: AppendEntries RPC
    • with a leader and two other servers, only one other needs to reply (leader + one = majority)

  • Go interface (for the lab)
  • applyCh channel carries an ApplyMsg with the command and its log index
  • election timer expires -> start an election
  • leader election
    • term++, send RequestVote RPCs (sketch below)

  • choosing random election timeouts (so elections rarely collide)
  • newly elected leader, but has divergent logs
  • handling server crashes

  • e.g. S1 is missing entry 3 in its log

  • can become leader and crash
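
A rough Go sketch of the election-timeout loop only (the ideas above, not the lab's actual code; RequestVote RPCs, log handling, and the applyCh plumbing are omitted):

package main

import (
    "math/rand"
    "sync"
    "time"
)

type Raft struct {
    mu            sync.Mutex
    currentTerm   int
    state         string // "follower", "candidate", or "leader"
    lastHeartbeat time.Time
}

func (rf *Raft) ticker() {
    for {
        // Randomized timeout (e.g. 150-300ms) so followers rarely time out together.
        timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
        time.Sleep(timeout)

        rf.mu.Lock()
        if rf.state != "leader" && time.Since(rf.lastHeartbeat) >= timeout {
            // Become a candidate: bump the term, vote for self, ask peers for votes.
            rf.state = "candidate"
            rf.currentTerm++
            // sendRequestVotes(rf) // not shown in this sketch
        }
        rf.mu.Unlock()
    }
}

func main() {
    rf := &Raft{state: "follower", lastHeartbeat: time.Now()}
    go rf.ticker()
    time.Sleep(time.Second) // let the ticker run briefly
}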

Visualization

  • green client sends value to blue server
  • distributed consensus is agreeing on a value with multiple nodes
  • nodes can be in 3 states

  • follower state

  • candidate state

  • leader state
  • all nodes start in follower state

  • it sends a request vote to the other nodes
    • they reply with their vote

  • if it gets a majority it becomes leader
    • this is leader election
    • all changes now go through the leader

  • each change is written to a log (uncommitted)

  • to commit the change is replicated to followers

  • after the leader commits, it notifies the followers the entry is committed
    • they commit in the process, called log replication

Leader Election

  • first election timeout
    • amount of time follower waits until becoming candidate
    • randomized between 150ms and 300ms

  • the follower becomes a candidate, starts a new election term, and votes for itself

  • nodes reset their election timeout when they grant a vote

  • e.g. the leader crashes

  • no heartbeats are sent, so one node starts an election because its randomized timeout expired first

  • example of a split vote

  • when this occurs a revote happens and Node D is leader

Log Replication

  • done using append entries messages

  • a response is sent to the client

  • network partitions

  • the log entry stays uncommitted!

  • once the partition is healed, B steps down

Lecture 7: Fault Tolerance: Raft (2)

  • Why not longest log as leader?

  • example: vote yes only if the candidate's log is at least as up to date (higher last term, or same last term and at least as long)
  • AE Reply
  • Persistent state
    • log
    • currentTerm
    • votedFor
  • synchronous disk update when persisting (sketch below)
    • write(fd, _)
    • fsync(fd)
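
A minimal Go sketch of the write-then-fsync idea (illustrative; a production version would write a temp file and rename it atomically):

package main

import "os"

// persist writes Raft's critical state (currentTerm, votedFor, log) and fsyncs
// it so the bytes are on stable storage before we reply to any RPC.
func persist(path string, state []byte) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    if _, err := f.Write(state); err != nil {
        return err
    }
    return f.Sync() // fsync: force the data to disk before returning
}

func main() {
    _ = persist("raft-state.bin", []byte("term=3 votedFor=2 log=..."))
}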

  • storing a log
    • snapshot every once in a while for efficiency

  • linearizability
    • a single, linear execution history
    • if one operation finishes before another starts, it must come first in the order; a read must return the value of the most recent write before it (tiny example below)
  • reads and writes appear in a linearizable order
  • not linearizable - the ordering constraints form a cycle
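
A tiny made-up history illustrating the rule (mine, not the lecture's):

package main

import "fmt"

// An operation with its real-time interval and its result.
type op struct {
    name       string
    start, end int // real-time start/finish
}

func main() {
    // W(x)=1 finishes at t=5; the read starts at t=6, after the write completed,
    // so every linearization must put the write first and the read must return 1.
    history := []op{
        {"W(x)=1", 0, 5},
        {"R(x)->0", 6, 8}, // returning the old value here makes the history NOT linearizable
    }
    fmt.Println(history)
}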

Lecture 8: Zookeeper

  • The history is linearizable

  • not linearizable - cycle in the graph

  • Zookeeper layer
    • ZAB layer
  • many clients
  • more servers -> writes get slower (more replicas to coordinate)
  • allow client to send any read only request to any replica

  • log

  • guarantees
    • linearizable writes
    • cheap read
    • want fresh data
      • can send a sync operation

  • write order is guaranteed (linearizable writes)

  • read order?
    • exists("ready")

  • "if it ever changes, PLEASE send me a notification"
    • the notification is delivered at the correct point in the log relative to this client's reads! (sketch below)
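
A sketch of the "ready" pattern against a hypothetical Go client interface (zkClient and its methods are illustrative, not a real ZooKeeper library):

package main

type zkClient interface {
    Exists(path string, watch bool) (bool, error)
    GetData(path string, watch bool) ([]byte, error)
    Delete(path string) error
    Create(path string, data []byte) error
    SetData(path string, data []byte) error
}

// Writer: remove "ready", update the config znodes, then recreate "ready".
func writeConfig(zk zkClient, cfg map[string][]byte) error {
    if err := zk.Delete("/app/ready"); err != nil {
        return err
    }
    for path, data := range cfg {
        if err := zk.SetData(path, data); err != nil {
            return err
        }
    }
    return zk.Create("/app/ready", nil)
}

// Reader: only read the config after "ready" exists, leaving a watch so a
// notification arrives, ordered against this client's reads, if the writer
// deletes "ready" mid-read.
func readConfig(zk zkClient, paths []string) (map[string][]byte, error) {
    if ok, err := zk.Exists("/app/ready", true); err != nil || !ok {
        return nil, err
    }
    out := map[string][]byte{}
    for _, p := range paths {
        data, err := zk.GetData(p, false)
        if err != nil {
            return nil, err
        }
        out[p] = data
    }
    return out, nil
}

func main() {}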

Lecture 10: Cloud Replicated DB, Aurora

  • Amazon Web Services
    • Custom Backend! - Not AWS Services
    • EC2 - general server/vm
    • S3 - store snapshots

  • persist to EBS
    • can load a new EC2
  • Availability Zones

  • Transaction (XACTION)
    • crash recovery
    • writes go to the log first

  • RDS
    • replicated to another AZ

  • replicas with a quorum
    • ship just log entries
    • only need 4 out of 6 to acknowledge!
  • doing this instead of shipping whole data pages (like the mirrored RDS setup) gives ~35x the performance!

  • goal: survive the failure of a whole datacenter (AZ)
    • and tolerate transient slow replicas (?)

  • quorum replication
    • N replicas
    • read and write quorums must overlap: R + W > N
    • Aurora uses W = 4 and R = 3 out of N = 6 (worked numbers below)
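
A quick worked check of the quorum arithmetic (the numbers match the setup above; the phrasing is mine): with \(N = 6\), \(W = 4\), \(R = 3\) we have \(W + R = 7 > N\), so every read quorum overlaps every write quorum, and \(2W = 8 > N\), so any two writes overlap too. Losing a whole AZ (2 replicas) still leaves 4 servers, enough to write; losing an AZ plus one more server still leaves 3, enough to read.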

  • the database VM/server can crash
    • spin up a new one; the log says which transactions to redo
    • but there can be many, many pages to rebuild

  • copies of segments A, B, etc. live on different storage servers
    • for replacement, data is split into segments spread across ~100 storage servers
    • pick 1 replacement replica per lost segment, so repair happens in parallel
  • reads are more common than writes
  • writes must apply atomically

Lecture 12: Distributed Transactions

  • Distributed Transactions
    • Concurrency Control
    • Atomic Commit
    • both are provided inside an abstraction called a transaction

  • BEGIN_X
  • END_X
  • example of a read only transaction

  • ACID
    • Atomic - all or nothing, even across failures
    • Consistent
    • Isolated - Serializable
    • Durable

  • serial order of execution of transactions

  • Concurrency Control
    • pessimistic - locking
    • optimistic - occ

  • concurrency control with 2 phase locking

  • distributed transactions
    • want atomic all or nothing - but what if failure
    • two-phase commit!

  • transaction coordinator (TC)
    • sends PREPARE messages to everybody
    • participants reply yes/no
    • if all say yes, TC sends COMMIT
    • participants ack
    • participants hold locks and either side can crash mid-protocol
      • so both sides use a log
      • that is why we have the yes/no and ack responses (sketch below)
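
A minimal Go sketch of the coordinator's side of two-phase commit (illustrative; a real coordinator durably logs its commit/abort decision before phase 2):

package main

import "fmt"

type participant interface {
    Prepare(txid int) bool // "yes" = writes are logged and locks are held
    Commit(txid int)
    Abort(txid int)
}

func twoPhaseCommit(txid int, parts []participant) bool {
    // Phase 1: ask everyone to prepare; any "no" (or a timeout) aborts the transaction.
    for _, p := range parts {
        if !p.Prepare(txid) {
            for _, q := range parts {
                q.Abort(txid)
            }
            return false
        }
    }
    // Phase 2: everyone said yes, so the decision is COMMIT; participants ack and release locks.
    for _, p := range parts {
        p.Commit(txid)
    }
    return true
}

type demoPart struct{ name string }

func (d demoPart) Prepare(txid int) bool { fmt.Println(d.name, "prepared", txid); return true }
func (d demoPart) Commit(txid int)       { fmt.Println(d.name, "committed", txid) }
func (d demoPart) Abort(txid int)        { fmt.Println(d.name, "aborted", txid) }

func main() {
    ok := twoPhaseCommit(7, []participant{demoPart{"A"}, demoPart{"B"}})
    fmt.Println("committed:", ok)
}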

  • each participant can itself be a Raft-replicated group
  • combine with sharding?

Lecture 13: Spanner

  • globally replicated database

  • each data center

    • copies for redundancy and to be near clients
    • sharded - throughput
    • has a leader/follower
  • R/W transactions

    • BEGIN
      • x = x + 1
      • y = y + 1
    • END

  • 2 Phase Commit and 2 Phase Locking in Paxos Groups

  • running a read/write transaction
    • reads and writes go to the leader of each Paxos group involved, which replicates to the others
    • 2PC runs between the leaders of the Paxos groups

  • commits and leaders of paxos group can replicate to copies

  • simple wishes - wanted constraints
  • serializable
  • external consistency

  • a read-only transaction can't just read the latest value at each shard, rip (the values might not form a consistent snapshot)

  • Snapshot Isolation
    • imagine all clocks were perfectly synchronized

  • then every transaction could be assigned a timestamp and read from that snapshot

  • need clock synchronization
  • GPS-based time masters
  • Google has TrueTime
    • returns (earliest, latest)

  • get latest
  • start
  • commit
  • commit wait: delay until the earliest the clock could possibly read is greater than the chosen timestamp, so it is greater than the latest any earlier transaction could have chosen (sketch below)

  • the latest for the T1 timestamp is 10
    • so read after 12
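
A Go sketch of commit wait (trueTimeNow and the 5ms uncertainty are made up for illustration; this is not Google's TrueTime API):

package main

import "time"

// trueTimeNow returns an interval guaranteed to contain the true time.
func trueTimeNow() (earliest, latest time.Time) {
    eps := 5 * time.Millisecond // assumed clock uncertainty
    now := time.Now()
    return now.Add(-eps), now.Add(eps)
}

// commitWait picks the latest bound as the commit timestamp, then waits until
// the earliest bound has passed it, so any later transaction gets a larger timestamp.
func commitWait() time.Time {
    _, commitTS := trueTimeNow()
    for {
        earliest, _ := trueTimeNow()
        if earliest.After(commitTS) {
            return commitTS // safe to commit and release locks
        }
        time.Sleep(time.Millisecond)
    }
}

func main() {
    _ = commitWait()
}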

Lecture 15: Big Data: Spark

  • PageRank in Spark
    • scala
    • graph
    • simulation in parallel iteratively

  • run example

  • page1 has highest rank

val lines = spark.read.textFile("in").rdd

// MapPartitionsRDD[5]
// not actually executed! lazy lineage graph!

// get an array rep
lines.collect()

val links1 = lines.map{ s => val parts = s.split("\\s+"); (parts(0), parts(1)) }

val links2 = links1.distinct()

val links3 = links2.groupByKey()

// cache in memory so the links RDD isn't recomputed every iteration
val links4 = links3.cache()

var ranks = links4.mapValues(v => 1.0)

val jj = links4.join(ranks)

val contribs = jj.values.flatMap{ case (urls, rank) => urls.map(url => (url, rank / urls.size)) }

ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)

val output = ranks.collect()

output.foreach(tup => println(s"${tup._1} has rank: ${tup._2} ."))

  • the lineage graph

  • store in HDFS (like GFS)
  • parallel RDD
    • read
    • map
  • shuffle
    • expensive
    • distinct
    • wide dependencies

  • complete

  • failed worker
    • wide dependencies problem!

  • store on HDFS - fault tolerant storage
    • checkpoint
    • if a worker fails
    • have the lineage graph so we can reconstruct

Lecture 16: Cache Consistency: Memcached at Facebook

  • Facebook's goal is making features and selling ads
    • they only work on scaling when they need to

1.

  • Simple, single server

2.

  • Multiple Front End Servers
  • Still a Single backend database
    • Don't have to worry about things like distributed transactions

3.

  • Multiple Front End Servers, Multiple Databases Shared
    • e.g. a-f on database 1, g-p on database 2, q-z on database 3

4.

  • MEMCACHE Servers
    • get(key), put(key, value)
    • a cache for "hot" data that is hit often
  • but if a MEMCACHE server fails, the database gets overwhelmed
    • the database usually only sees a few percent of the traffic the caching layer sees!!!

5.

  • Stale data doesn't really matter
    • just some pictures from friends from last hour, etc.
  • Two datacenters
    • Try to avoid cross nation communication

Read and Write Lookaside Caching

  • on a write, delete the cached key (don't overwrite it), because the cached data is now stale (sketch below)
    • performance by:
      • partition the keys across memcache servers
        • RAM efficient (one copy of each key)
        • not efficient for hot keys
      • replicate the cache
        • good for hot keys
        • fewer TCP connections per front end
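
A minimal Go sketch of look-aside caching with delete-on-write (illustrative, not Facebook's memcache client):

package main

import "sync"

// Reads try the cache first and fill it on a miss; writes go to the database
// and then DELETE the cached key, since the cached value is now stale.
type store struct {
    mu    sync.Mutex
    db    map[string]string // stand-in for the database
    cache map[string]string // stand-in for memcached
}

func (s *store) Read(k string) string {
    s.mu.Lock()
    defer s.mu.Unlock()
    if v, ok := s.cache[k]; ok {
        return v // cache hit
    }
    v := s.db[k]   // miss: read the database...
    s.cache[k] = v // ...and fill the cache
    return v
}

func (s *store) Write(k, v string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.db[k] = v
    delete(s.cache, k) // invalidate rather than update, so stale values can't linger
}

func main() {
    s := &store{db: map[string]string{}, cache: map[string]string{}}
    s.Write("user:1", "alice")
    _ = s.Read("user:1")
}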

  • replication between datacenters

In One Region

  • sharded database
  • multiple clusters
    • FE to MC connections is \( O(N^2) \) so have to limit it and make it in clusters
    • Regional Pool Cache

Thundering Herd

  • a problem where a miss causes other front end servers to overload the database

  • instead use a lease
    • a number handed to the first client that misses
    • a write to the key invalidates the lease
    • a cache fill attempted with an invalidated lease is rejected

  • when a memcache server is down, use the gutter pool of spare memcache servers

Consistency

  • we get into a data race
    • problem - data is out of date forever!
    • use lease again
      • grant a lease (a number)

Extras

Extra 1: Scalability Harvard CS 75

  • VPS Service
    • for hosting
    • Amazon Web Services :)

Vertical Scaling

  • just throw a bigger computer at it
  • more RAM, more cores, more storage space
  • SAS and SSD drives for write heavy operations like databases

Horizontal Scaling

  • a bunch of cheaper slower machines

Load Balancer

  • load balancer has its own IP address
    • from client to load balancer to server 1
    • could do a copy of everything on each server
    • could do instead images.host.com and have a server for each part

  • could load balance with our DNS server
  • Round Robin: to server 1, to server 2, to server 3, then back to server 1
  • What if server 1 gets heavy weight users?
    • Let Load Balancer decide instead of round robin

  • Can use AWS Load Balancer,
    • Application Load Balancer (ALB), Network Load Balancer (NLB)

Sticky Sessions

  • session cookies

RAID

  • RAID 0: 2 hard drives, stripe to drive 1 then drive 2
  • RAID 1: 2 hard drives, mirror data, if one dies, you still have a copy of your data
  • RAID 5: e.g. 5 drives, stripe across 4, with 1 drive's worth of parity for redundancy
  • RAID 6: extra parity, so 2 hard drives can die
  • NFS file system, distributed file system
  • What if you trip over a power cord?

Caching

  • static page instead of dynamic page from PHP, HTML file is static
  • problem: changing the style would have to change hundreds of thousands of HTML files
  • MySQL cache: for identically executed queries

  • memcached: memory cache, stores in RAM
    • used extensively by Facebook

Data Replication

  • avoid single points of failure

  • Master-Master

  • High availability with heartbeat, Master-Master

  • Load Balancing + Replication
  • heartbeat
    • Active/Passive
    • if server dies, heartbeat ends, Passive becomes Active

  • Partitioning, plus slave servers for replication

Layers of Replication and Redundancy

  • Multiple Load Balancers
  • Multiple Backend Servers
  • Multiple Load Balancers to Databases
  • Multiple Databases
  • Cross Connected
  • Multiple Network Switches
  • In a single Datacenter
      • AWS Availability Zone
      • Then different regions, East, West, Asia, Europe
  • Do geography-based load balancing with a DNS load balancer
  • Avoid Single Points of Failure!

Security

  • terminate TCP/SSL at the Load Balancer
    • behind it, everything is plain HTTP inside the datacenter

Extra 2: Chord Algorithm Berkeley CS 162

  • Consistent Hashing
    • ring space of \(0\) to \( 2^m - 1 \)

  • take an id like ip, hash it and put it on the ring
  • in our key value store, where to store keys?
  • node 20 knows key-values for [16, 20] since the previous node is at 15

  • correctness
    • because it is a ring!
    • there is always a node that holds the keys
  • performance
    • \(O(log(n))\)

  • routing
    • start at any node which will contain where the next node is
    • worst case is \(O(n)\) - not great

  • correctness with stabilization algorithm
    • updates predecessor and successor pointers

  • a new node gets one node from DNS
    • it calls another node, which calls another node, etc.
      • the final node connects to the new node and updates its successor as the new node

  • finger table
    • a table whose entries point approximately halfway around the ring, then a quarter, an eighth, etc.
    • now instead of having to hop node by node
      • you can jump at most halfway, then another quarter, etc.
      • entry \(i\) points at the successor of \((n + 2^i) \bmod \text{ring size}\) (sketch below)
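
A small Go sketch of the finger-table targets for one node (illustrative):

package main

import "fmt"

// fingerTargets returns the ids that node n's finger table points at on a
// ring of size 2^m: entry i is the first node at or after (n + 2^i) mod 2^m,
// so each hop at least halves the remaining distance.
func fingerTargets(n, m uint64) []uint64 {
    ringSize := uint64(1) << m
    targets := make([]uint64, m)
    for i := uint64(0); i < m; i++ {
        targets[i] = (n + (uint64(1) << i)) % ringSize
    }
    return targets
}

func main() {
    // e.g. node 20 on a 2^6 = 64 slot ring
    fmt.Println(fingerTargets(20, 6)) // [21 22 24 28 36 52]
}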

  • fault tolerance: if the node fails, its successor and predecessor have a copy!

  • Dynamo used Chord-style consistent hashing, driven by SLA (service level agreement) requirements

Extra 3: Dynamo Amazon’s Highly Available Key-value Store

  • problem: reads scale with read replicas, but replicating writes adds too much overhead
    • another problem was it doesn't have auto-scaling
    • wanted to do data modeling of shopping cart
      • as a key-value operation

  • app at top
    • key value operations
    • sharded (A-F, G-K, etc)

  • problem: wanted partitioning
    • solution/technique: consistent hashing
  • prob: high availability for writes
    • sol: vector clocks with conflict resolution on read
  • prob: handling failures (temporary)
    • sol: hinted handoff
  • prob: recovering from failures (permanent)
    • sol: anti-entropy with merkle-trees
  • prob: membership and failure detection
    • sol: gossip-based protocol

  • partitioning with consistent hashing
    • instead of just choosing a server from N server with hash(k) mod N
      • have a ring [0, L]
      • hash(k) mod L
      • distribute N servers randomly among the L ring and handle all key calls from ring section of previous node to current node

  • need a quorum to consider a read or write successful

  • vector clocks (sketch below)
    • a conflict arises when replicas hold different versions of an item
    • each version is tagged with a vector of per-node counters
      • comparing vectors gives a partial order on operations; truly concurrent versions are returned to the client to reconcile
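
A small Go sketch of vector-clock comparison (illustrative, not Dynamo's code):

package main

import "fmt"

// A vector clock maps node name -> per-node update counter. Clock a "happened
// before" b if a <= b component-wise and a != b; if neither dominates, the
// versions are concurrent (a conflict to resolve on read).
type vclock map[string]int

func happenedBefore(a, b vclock) bool {
    strictlyLess := false
    for node, ca := range a {
        if ca > b[node] {
            return false
        }
    }
    for node, cb := range b {
        if cb > a[node] {
            strictlyLess = true
        }
    }
    return strictlyLess
}

func main() {
    v1 := vclock{"sx": 2}          // two writes handled by node sx
    v2 := vclock{"sx": 2, "sy": 1} // later write at sy that saw v1
    v3 := vclock{"sx": 2, "sz": 1} // concurrent write at sz that also saw v1
    fmt.Println(happenedBefore(v1, v2))                         // true: v2 descends from v1
    fmt.Println(happenedBefore(v2, v3), happenedBefore(v3, v2)) // false false: conflict
}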

  • handling failures
    • temporary: hinted handoff, hand the write to the next node on the consistent-hash ring
    • permanent: sync using a merkle tree
      • periodically exchange key ranges
      • just pass the root of the tree instead of the whole range
        • if the roots differ, pass the children and recurse until the inconsistency is found
        • only a logarithmic amount of data is exchanged, and only the root in the best case

  • membership and failure detection
    • could send heartbeats
      • could have a single node listen for them, like Zookeeper
        • but that is a single point of failure
      • instead just gossip it, and in logarithmic time the cluster learns the node is dead

Extra 4: CAP Theorem

  • A distributed system can have at most two of the three properties
    • Consistency
    • Availability
    • Partition Tolerance
  • BigTable, MongoDB, and Redis are CP
    • Spanner is CP but highly available
  • Dynamo, Cassandra are AP
  • Amazon Aurora? Physalia in AWS EBS (Looks to be CP but highly available)?

  • servers \(G_1\), \(G_2\) and client with initial value \(v_0\)

  • write to any server. Only \(G_1\) is updated in this case

Consistency

  • any read operation that begins after a write operation completes must return that value, or the result of a later write operation

  • we write to \(G_1\) but read from \(G_2\) and get a different result

  • we write to \(G_1\), which replicates to \(G_2\) and gets an acknowledgement, and then \(G_1\) acknowledges to the client
    • now the client can read from \(G_2\) and get a consistent result

Availability

  • every request received by a non-failing node in the system must result in a response
    • server cannot ignore client's requests

Partition Tolerance

  • the network is allowed to lose arbitrarily many messages sent from one node to another

Proof

  • proof by contradiction. Assume a system can be CAP

  • begin by partitioning our system.

  • with a partition, the data is inconsistent

Extra 5: Serverless, Coordination-free Distributed Computing, and the CALM Theorem

  • 4 parts
    • Serverless
    • Avoiding Coordination
    • CALM Theorem
    • Hydro

  • new platform, new language!
    • super computers with new programming paradigms!

  • how will people program the cloud!?
    • building a programming model for it is hard
    • distributed systems, consistency, and partial failure

  • serverless is now more popular than MapReduce

Serverless

  • fine grain resource usage and efficiency
    • new economy models for cloud providers and users

  • auto scaling!
  • not unbounded distributed computing

  • pay per IO
    • so you can't just batch writes to disk
  • no inbound network communication
    • makes distributed comp difficult
      • embarrassingly parallel like Map work
      • Reduce doesn't :(
        • communication heavy, like shuffles in Spark :'(

Avoiding Coordination

  • How do you embrace state
    • data gravity
    • consistency - hard :(

  • consistency over long distances is hard
    • split brain problem

  • coordination based consistency is bad!

  • make consistency as small as possible

  • coordination has really bad tail latency
    • slowdown cascades

  • instead reason about application semantics
    • rich application logic to READS and WRITES
    • formalize semantics!

CALM Theorem

  • Consistency as Logical Monotonicity
    • if a program is logically monotonic, it has a consistent, coordination-free implementation!

  • confluence as a programming property
  • only the outcomes matter, not the order messages arrive in

  • distributed deadlock detection
    • checks for cycles in the waits-for graph
    • "there exists a cycle" is monotone, so no coordination is needed
  • garbage collection
    • reference between objects on different machines
    • objects o5 and o6 are garbage
      • but machine 2 can't declare them garbage until it hears from machine 3
      • so it requires coordination

  • you can get crazy parallelism!
    • share nothing!
    • how to write in a declarative/logic language instead of an imperative one?
      • like SQL
      • maybe it'll be an internal IR for a compiler
        • like databases do!
    • have our cake and eat it too!

Hydro: Stateful Serverless and Beyond

  • Anna autoscaling multi-tier KVS
  • Cloudburst Stateful FaaS with caches
  • HydroLogic, an IR (doesn't depend on order)
  • Hydrolysis, compiler

  • Anna is meant to be like Redis and S3 in one system
  • CALM consistency of simple lattices
    • autoscaling
    • best-of-conference
    • multi-tiered!
      • can be in fast memory
      • or slow persistent disk

  • shared nothing at all scales and threads!!!
  • under contention, cache thrashing problems

  • auto scaling
    • claimed ~350x better cost-performance

  • robot motion planning

  • serverless jupyter
  • each cell is running a cloudburst lambda! :O

  • can handle all the memory pressure

  • sharing model state!

  • motion planning with lambdas

  • run compute, then share state

  • need a coordination EC2, bottleneck

  • much quicker and cheaper

Extra 6: Stellar

  • We take for granted currency that's stable
    • good investments

  • want equitable access to financial systems
  • open membership
  • issuer-enforced finality
    • security
  • cross-issuer atomicity
    • good market to trade

  • ACH
    • requires national regulations

  • can't trade across systems paypal -> venmo?

Stellar

  • Replicated State Machine
    • keep ledger safe
    • public key authorize operations
    • accounts can issue assets
  • Transaction guarantee atomicity

  • only works if everyone agrees on the ledger; otherwise a double-spend attack is possible

  • think of the internet
    • transitive connections
    • China -> Stanford
    • China -/-> Google
    • Google -> Stanford

  • Stellar Consensus Protocol SCP
  • Byzantine hypothesis
    • quorum slice - majority

  • quorum slice

  • a set can be a quorum slice without being a quorum

  • quorum tier
    • like the internet

  • citi bank sends an attack

  • ACLU doesn't agree because it has no quorum from other non-profits

  • both nodes depend on V7

  • liveness?
    • sufficiency problem
      • must be of an intact set (non-evil)

  • cascade theorem
  • federated voting
    • can get stuck
  • balloting
  • nomination

  • quorum intersecting
  • cascade until it includes all the nodes

  • crappy consensus

  • intertwined nodes cannot confirm contradictory statement
    • optimal safety

  • differ vote

  • green is in a quorum

  • will eventually cascade

  • balloting
    • federated voting can get stuck
    • invariant: all stuck and decided ballots must choose the same value

  • prepare(n, x)
  • commit(n, x)
  • synchronize with timer
    • cascades

  • 1,000 operations/ledger
  • 133 nodes, 74 validators, 17 "tier-one" by 5 organizations
  • hash signing

Extra 7: CockroachDB, Spanner, MongoDB

  • timestamp 18:44
  • Amazon
    • fork of MySQL or PostgreSQL
    • custom storage on EBS
      • transaction level on EBS
  • Redis
    • In memory
    • single thread engine

CockroachDB

  • Databases don't die
  • Open Source like Spanner
  • Distributed Database
    • Decentralized Shared Nothing
    • Log Structured Storage Architecture at individual nodes (RocksDB)
    • Concurrency Control Model: Multi-Version OCC
    • Serializable Isolation

  • multi-layer
    • RocksDB storage manager (low level)
    • Raft - replication and consensus
    • key-value api
      • for pages to fetch data on

  • hybrid clock
    • order transaction globally
  • transactions stage write intents, check for conflicts, then commit (?)

  • global keyspace
    • key -> data
  • leader
    • raft replication
  • instead of a buffer manager reading pages from local disk (your classic single-node database)
    • pages are fetched from the distributed key-value store
  • FoundationDB does the same thing (distributed database)
  • engineering! done well, fault tolerant and all the edge cases!

Spanner

  • Cloud Spanner
  • Google wrote BigTable in 2006
    • give up SQL
    • give up joins
    • column-based database
  • Adwords ran on sharded MySQL
    • needed transactions
    • Megastore

  • 2011
  • geo-replicated
  • schematized, semi-relational ?
  • log structured on disk
  • strict 2PL, MVCC, Multi-Paxos, 2PC
    • Paxos groups
    • External Consistency
      • global write_transaction synchronous replication
    • lock-free read only transactions

  • joins are slow :'(
    • have to go to another node/table
  • interleaving
    • child rows are stored with the parent row - an efficient physical denormalization that keeps the join on a single page

  • wound-wait deadlock prevention
    • don't need deadlock detection
    • ordering through unique timestamps from atomic clocks and GPS devices
  • tablets (shards)
    • paxos - elect leader in tablet group
    • 2PC for txn spanning tablets

  • uses physical wall-clock time
    • necessary for linearizability
    • the Paxos groups decide the order in which transactions commit

  • internal Google API
    • time range

  • wait long enough
    • then can commit
      • at commit + release locks

  • F1 has OCC
    • now SQL
    • Built for the ad system
  • Spanner SQL
    • for everyone else

  • good benchmarks but expensive!

MongoDB

  • shared nothing document database (2007)
    • json document
    • now supports multi-document transactions

  • don't join

  • instead embed the related data in a single document
  • effectively a pre-join

  • the json example

  • no query optimizer
    • just heuristics
  • instead generate a bunch of query plans
    • execute all of them!
      • return whichever finishes first

  • shared nothing architecture
  • master slave replication
  • auto-sharding
    • partition by hash or range
    • automatically split if a shard is too big
  • startups liked this: "oh, if I grow, it'll shard automatically"

  • originally mmap - the OS managed storage
    • this changed
    • the storage backend is now WiredTiger, but it could be replaced with RocksDB

  • avoid premature optimizations!
    • based on guesses about months in the future
  • MySQL and Postgres should be fine for most cases
  • If need to ACTUALLY scale, you have money - instead find and pay smart people to help

Extra 8: Scalability! But at What COST?

  • GraphX paper
    • lots of cores 128

  • repeatedly look at each label in the graph
    • labels flow through the edges

  • ran a single core in Rust and it was fast!

  • with better algorithms! - Even faster

  • scalability fetishised by community

  • at 16 cores it is the same performance as a single core

  • 1, 2 cores
    • very very fast! micro-seconds

  • still high performance on a single laptop!

  • when you are focused - high-performance compiled languages, caches, not the JVM

  • step 1. get 100x on your computer
  • step 2. then 100x from 100x computers
    • but that's expensive!

Extra 9: Ray: A Distributed Execution Framework for AI

  • Task Parallel and Actors Abstraction
    • compatible with TensorFlow, PyTorch etc!

  • millions of tasks per second
  • stateful computation
  • fault tolerance

Ray API

  • take python functions
    • add @ray.remote
    • returns a future
    • ray.get(id3)

  • task scheduled
    • think about tasks and futures

  • parameter server basically key-value
import ray

ray.init()

  • now a ray.remote
    • get with ray.get(ps.get_params.remote())

  • fake update

Ray Architecture

  • workers, object manager and scheduler
  • global control store

Libraries

  • Apache Arrow
    • serialization shared across languages like Python and C++

  • pip install ray

  • lineage-based fault tolerance

    • lost tasks can simply be rerun

Extra 10: Cluster Management with Borg and Kubernetes

  • Borg

  • Google Data Centers

  • Even Bigger

  • program to run
  • cell/cluster
  • cmdline args

  • 10,000 copies of hello world

  • Borg
  • BorgMaster
  • Borglet
  • all managed by Config file
  • And Blaze?

  • things can go wrong

  • a cell
    • usage - cpu/vm

  • lots of waste
  • overhead

  • bigger cells are better, less partitioning overhead

  • bucket resource requests to the next power of 2

  • some people run very little resource
  • some people have nice round numbers
  • used best fit scheduling algorithm
    • but it could be gamed

  • people overallocate

  • be more aggressive

  • yellow safety margin
    • smaller on week 2
    • then set it back to original settings

  • configs
  • monitor
  • binaries
  • you only need to focus on the app

Kubernetes

  • open source Borg

  • it schedules and allocates
    • you don't need to care about the machine

  • replication

  • Google Kubernetes Service

  • resiliency
  • efficiency
  • containers