Lecture 6: Fault Tolerance: Raft (1)

  • Raft
    • makes state machine replication correct
  • MapReduce and GFS rely on a single master to make decisions
    • simple, but a single point of failure :/
  • Split Brain

  • two clients and two servers
    • what if a client crashes?
    • what if a server crashes?

  • split brain: C1 talks only to S1 and C2 only to S2, so the data (e.g., the value 1) can become inconsistent between the two servers
    • this is a network partition
      • solution: majority vote with an odd number of servers
        • example: 2 out of 3 is a majority
        • quorum
        • with 2f+1 servers you can tolerate f failures, because the surviving f+1 still form a majority
      • Paxos
      • VSR (Viewstamped Replication)
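
A quick sketch in Go, just to make the 2f+1 arithmetic concrete (not from the lecture):

```go
package main

import "fmt"

// majority returns the quorum size for n servers; any two
// quorums of this size must overlap in at least one server.
func majority(n int) int {
	return n/2 + 1
}

func main() {
	for f := 1; f <= 3; f++ {
		n := 2*f + 1 // 2f+1 servers tolerate f failures
		fmt.Printf("n=%d: quorum=%d, tolerates %d failures\n", n, majority(n), f)
	}
}
```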

Raft

  • K/V server
    • table
    • Raft layer
      • log of operations
    • replicated
  • Clients: C1, C2
    • put(k, v)
    • get(k)

  • for an operation
    • it enters the leader server, then goes down into the Raft layer
      • once a majority of the Raft peers have accepted it
        • it is sent up to the key/value state machine and applied (see the sketch below)
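
A rough sketch of that flow in Go; the names (Op, KVServer, Start) follow the 6.824 lab conventions, but this is an illustration under those assumptions, not the lecture's code:

```go
package kvraft

import "sync"

type Op struct{ Key, Value string } // one client operation

// Raft is the interface the K/V layer needs from the Raft layer.
type Raft interface {
	// Start hands an operation to the Raft log; it returns right away,
	// and the op is applied only after a majority has replicated it.
	Start(cmd interface{}) (index int, term int, isLeader bool)
}

type KVServer struct {
	mu    sync.Mutex
	table map[string]string // the replicated key/value state
	rf    Raft              // the Raft layer underneath
}

func (kv *KVServer) Put(key, value string) {
	// Hand the op to Raft; once committed it comes back up
	// from the Raft layer and only then is kv.table updated.
	kv.rf.Start(Op{Key: key, Value: value})
}

// applyOne is called for each committed op delivered by Raft.
func (kv *KVServer) applyOne(op Op) {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	kv.table[op.Key] = op.Value // send the op "up" to the K/V state
}
```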

  • AE: AppendEntries
    • with one leader and two other servers, only one follower needs to reply: together with the leader itself, that is 2 of 3, a majority

  • Go interface
  • applyCh: Raft delivers each committed ApplyMsg, tagged with its log index (see the sketch after this list)
  • election timer -> start election
  • leader election
    • term++, send RequestVote to the other servers
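
From memory, the lab's Go interface looks roughly like this (field and method names follow the 6.824 handout as I recall it, so treat them as approximate):

```go
package raft

// ApplyMsg is what Raft sends up on the apply channel each time
// an entry is committed, tagged with its position in the log.
type ApplyMsg struct {
	CommandValid bool        // true if this carries a committed command
	Command      interface{} // the operation itself
	CommandIndex int         // the log index it was committed at
}

type Raft struct {
	applyCh chan ApplyMsg // the service reads committed ops from here
	// ... the usual Raft state (log, currentTerm, votedFor, ...)
}

// Start asks Raft to append a command to the log. It returns
// immediately; if this peer is the leader, the command later
// appears on applyCh once a majority has replicated it.
func (rf *Raft) Start(command interface{}) (index int, term int, isLeader bool) {
	// body elided in this sketch
	return 0, 0, false
}
```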

  • choosing randomized election timeouts (to avoid repeated split votes)
  • a newly elected leader may have a log that diverges from its followers' logs
  • handling server crashes

  • example: S1 is missing entry 3 in its log

  • a server can become leader and then crash; the remaining servers must elect a new leader and continue

Visualization

  • green client sends value to blue server
  • distributed consensus is agreeing on a value with multiple nodes
  • nodes can be in 3 states

  • follower state
  • candidate state
  • leader state
  • all nodes start in follower state
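
In Go the three states are naturally a small enum (a sketch, not the lecture's code):

```go
package raft

// The three node states from the visualization.
type State int

const (
	Follower  State = iota // all nodes start here
	Candidate              // election timeout fired, asking for votes
	Leader                 // won a majority, sends heartbeats
)
```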

  • when a follower times out, it becomes a candidate and sends a request vote to the other nodes
    • they reply with their vote

  • if it gets a majority it becomes leader
    • this is leader election
    • all changes now go through the leader

  • each change is first written to the leader's log (uncommitted)

  • to commit it, the change is first replicated to the followers

  • after the leader commits, it notifies the followers that the entry is committed
    • the followers then commit it too; this whole process is called log replication (see the sketch below)
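
A sketch of the leader-side commit decision (field names are illustrative; real Raft additionally checks that the entry is from the current term):

```go
package raft

type Raft struct {
	me          int
	peers       []int
	matchIndex  []int // highest log index known replicated on each peer
	commitIndex int   // highest log index known committed
}

// maybeCommit advances commitIndex once a majority holds the entry.
func (rf *Raft) maybeCommit(index int) {
	count := 1 // the leader itself has the entry
	for peer, match := range rf.matchIndex {
		if peer != rf.me && match >= index {
			count++
		}
	}
	// Committed once a majority has it; followers learn the new
	// commit index from the next AppendEntries (heartbeat).
	if count > len(rf.peers)/2 && index > rf.commitIndex {
		rf.commitIndex = index
	}
}
```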

Leader Election

  • first, the election timeout
    • the amount of time a follower waits before becoming a candidate
    • randomized between 150ms and 300ms (sketch below)
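
In Go this is one line of rand (assuming the 150–300ms range from the visualization):

```go
package raft

import (
	"math/rand"
	"time"
)

// electionTimeout picks a fresh randomized timeout in [150ms, 300ms)
// so that the servers' timers rarely expire at the same moment.
func electionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}
```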

  • when its timeout fires, a follower becomes a candidate, starts a new term, and votes for itself

  • nodes that grant a vote also reset their election timeout

  • e.g., suppose the leader crashes

  • no heartbeats arrive, so a node starts an election; one node starts it first because its randomized timeout happened to be shorter

  • example of a split vote

  • when this occurs, the candidates time out and a revote happens; in the example, Node D becomes leader (see the sketch below)
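
A sketch of one election round, including the split-vote case (illustrative names; RPC plumbing elided):

```go
package raft

// Minimal candidate state for this sketch.
type Raft struct {
	me          int
	peers       []int
	currentTerm int
	votedFor    int
	isLeader    bool
}

// requestVote stands in for the real RequestVote RPC.
func (rf *Raft) requestVote(peer int) bool { return false /* elided */ }

func (rf *Raft) runElection() {
	rf.currentTerm++    // new term
	rf.votedFor = rf.me // vote for itself
	votes := 1
	for _, peer := range rf.peers {
		if peer != rf.me && rf.requestVote(peer) {
			votes++
		}
	}
	if votes > len(rf.peers)/2 {
		rf.isLeader = true // won the election
		return
	}
	// Split vote: nobody won. Do nothing here; a fresh randomized
	// election timeout will trigger the revote.
}
```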

Log Replication

  • done using AppendEntries messages, the same messages used as heartbeats

  • once the entry commits, a response is sent to the client

  • network partitions

  • a write sent to the minority-side leader cannot reach a majority, so the log entry stays uncommitted!

  • once the partition is healed, B sees the higher term and steps down, and its uncommitted entries are rolled back (sketch below)
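
A sketch of the term check that forces the step-down (illustrative names, continuing the sketches above):

```go
package raft

type AppendEntriesArgs struct {
	Term         int // leader's term
	LeaderCommit int // leader's commitIndex
	// prevLogIndex/prevLogTerm and the entries themselves elided
}

type Raft struct {
	currentTerm int
	isLeader    bool
}

// AppendEntries: any server (including a stale leader like B) that
// sees a higher term steps down to follower; its uncommitted entries
// are later overwritten by the new leader's log.
func (rf *Raft) AppendEntries(args *AppendEntriesArgs) {
	if args.Term > rf.currentTerm {
		rf.currentTerm = args.Term
		rf.isLeader = false // step down
	}
	// ... consistency check, append new entries, advance commitIndex
}
```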