This post was originally published in the 10% Smarter newsletter. Consider subscribing if you like it!
Ever wonder how large-scale services are built at big tech companies? I’ve collected tech talks given by Facebook, Amazon, Netflix, and Google to share!
Facebook: YouTube Playlist
Amazon: YouTube Playlist
Netflix: YouTube Playlist
Google: YouTube Playlist
Distributed systems are complex and include many difficult problems to tackle. Fortunately, techniques have been developed and are widely used in production to deal with these issues. Today, we’ll discuss the relationship between consistency and consensus, and how they are used in online services all over the world.
Linearizability is the property that a data object appears as if there were only one copy of it, and all operations on that object are atomic. Linearizability also guarantees that reads and writes of a single object see the most up-to-date data. It does not, however, protect against write skew.
Linearizability is used by Apache ZooKeeper, etcd, and Google Chubby. These are lock-granting services, often used to implement distributed locks, leader election, and service discovery. They use consensus algorithms to be linearizable.
For example, one way to implement leader election is with a lock: all eligible nodes start up and try to acquire the lock, and the one that succeeds becomes the leader. The lock must be linearizable.
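As a rough sketch in Python, the pattern looks like this. The lock_service client here is hypothetical, standing in for a real coordination service client such as ZooKeeper or etcd; the key name and acquire() signature are assumptions for illustration, not any library’s actual API.

class LeaderElection:
    def __init__(self, node_id, lock_service):
        self.node_id = node_id
        self.lock_service = lock_service  # hypothetical linearizable lock client
        self.is_leader = False

    def campaign(self):
        # Every eligible node tries to acquire the same lock key.
        # Because the lock service is linearizable, exactly one
        # acquire() succeeds, and that node becomes the leader.
        self.is_leader = self.lock_service.acquire(
            key="/election/leader", owner=self.node_id
        )
        return self.is_leader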
Single-leader, Multi-leader, and Leaderless Replication are unfortunately not linearizable in general.
Single-leader Replication is potentially linearizable if reads are from synchronously updated followers. However, non-linearizable behavior can still occur due to stale replicas or split brain.
Multi-leader Replication is not linearizable because writes can be concurrent and asynchronous. This results in clients seeing different values for a single object.
Leaderless Replication is non-linearizable because clock timestamps are not guaranteed to be consistent with the actual ordering of events, due to clock skew.
What should we use instead?
The solution is Consensus Algorithms, which are linearizable because they are similar to single-leader replication but have additional measures to prevent stale replicas and split-brain.
The famous CAP Theorem is often presented as Consistency, Availability, Partition Tolerance: pick 2 out of 3. However, this framing is misleading. Network Partitions are inevitable in large-scale systems, so a more accurate statement is: choose Consistency or Availability when experiencing a Network Partition. Consistency in this case actually means Linearizability!
(Image: different databases and their chosen tradeoffs from the CAP theorem.)
The main reason for dropping linearizability is performance, not fault tolerance. If an application requires linearizability and some replicas are disconnected from other replicas due to a network problem, then these replicas become unavailable and must wait until the network problem is fixed to ensure a single value for each object.
In understanding consensus algorithms, it is also important to understand causality. With causality, an ordering of events is guaranteed, such that cause always comes before effect. A system that obeys the ordering imposed by causality is causally consistent.
If elements are in a total order, it means that they can always be compared.
With a partial order, we can sometimes compare the elements and say which is bigger or smaller, but not in other cases. For example, mathematical sets are partially ordered: you can’t compare {a, b} with {b, c}.
Linearizability and causal consistency are slightly different. Linearizability implies a total order of operations: the system behaves as if there is only a single copy of the data. Causal consistency has a partial order, not a total one: concurrent operations are incomparable, and we can’t say that one happened before the other. Linearizability is therefore stronger than causal consistency.
A good way of keeping track of causal dependencies in a database is by using sequence numbers or timestamps to order the events. The timestamp can be a logical clock that generates monotonically increasing numbers for each operation.
This becomes a total order where, if operation A causally happened before B, then the sequence number for A must be lower than that of B. Concurrent operations may be ordered arbitrarily.
In single-leader databases, the replication log defines a total order of write operations using the monotonically increasing sequence number. All followers apply the writes in that order and will always be in a causally consistent state. Doing this achieves linearizability.
A total ordering can also be achieved in Multi-Leader and Leaderless databases. If every node keeps track of its own sequence number and the maximum counter value it has seen so far, it is possible to have a total ordering. When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.
To ensure a total ordering, give each node a unique identifier and use Lamport timestamps, which are pairs of (counter, nodeId). Multiple nodes can have the same counter value, but including the node ID in the timestamp makes it unique.
For example, if you had two nodes 1 and 2, you could get an ordering like: (1,1) -> (2, 2) -> (3, 1) -> (3,2).
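A Lamport clock is only a few lines of code. Here’s a toy Python sketch (the class and method names are my own, not from any particular library). Note that Python compares tuples element by element, which gives exactly the (counter, nodeId) total order described above.

class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # Increment before each local operation or outgoing message.
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, remote_counter):
        # On any request or response, jump to the max counter seen so far.
        self.counter = max(self.counter, remote_counter)

a, b = LamportClock(1), LamportClock(2)
print(a.tick())   # (1, 1)
b.observe(1)
print(b.tick())   # (2, 2) -- causally after (1, 1)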
Total Order Broadcast, or atomic broadcast, is a broadcast protocol for ordering messages that requires two properties:
Reliable delivery: no messages can be lost.
Totally ordered delivery: messages must be delivered to every node in the same order.
Another way of looking at total order broadcast is that it is a way of creating a log. Delivering a message is like appending to the log.
Because log entries are delivered to all nodes in the same order, if there are several concurrent writes, all nodes will agree on which one came first. Choosing the first of the conflicting writes as the winner and aborting later ones ensures that all nodes agree on whether a write was committed or aborted.
Three ways to make reads linearizable are:
You can sequence reads through the log by appending a message, reading the log, and performing the actual read when the message is delivered back to you (etcd works something like this).
Fetch the position of the latest log message in a linearizable way, wait for everything up to that position to be delivered to you, and then perform the read (the idea behind ZooKeeper’s sync()).
You can make your read from a replica that is synchronously updated on writes.
Finally, we can understand and look at consensus. Consensus means getting several nodes to agree on something. It’s not an easy problem to solve, but it is fundamental for leader election and performing an atomic commit.
Two-phase commit is an algorithm to implement atomic commit, where all nodes either successfully commit, or abort.
One node is designated as the coordinator. When the application is ready to commit a transaction, the two phases are as follows:
The coordinator sends a prepare request to all the nodes participating in the transaction, for which the nodes have to respond with essentially a ‘YES’ or ‘NO’ message.
If all the participants reply ‘YES’, then the coordinator will send a commit request in the second phase for them to actually perform the commit. However, if any of the nodes reply ‘NO’, the coordinator sends an abort request to all the participants.
When a participant votes ‘YES’, it promises that it will be able to commit. Once the coordinator decides, that decision is irrevocable.
If one of the participants or the network fails during 2PC, the coordinator aborts the transaction. If any of the commit or abort requests fails, the coordinator retries it indefinitely.
If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction.
The only way 2PC can complete is by waiting for the coordinator to recover in case of failure. This is why the coordinator must write its commit or abort decision to a transaction log on disk before sending commit or abort requests to participants.
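Putting those rules together, here is a rough Python sketch of the coordinator’s logic. The participant stubs, NetworkError, and coordinator_log are hypothetical placeholders; a real implementation would fsync the log and issue the requests concurrently.

class NetworkError(Exception):
    pass

def two_phase_commit(coordinator_log, participants, txn):
    # Phase 1: every participant votes YES (True) or NO (False).
    votes = [p.prepare(txn) for p in participants]

    # Log the decision durably *before* announcing it, so a
    # recovering coordinator can finish the protocol after a crash.
    decision = "commit" if all(votes) else "abort"
    coordinator_log.append(decision)  # assumed durable (fsync'd)

    # Phase 2: broadcast the irrevocable decision, retrying forever.
    for p in participants:
        while True:
            try:
                if decision == "commit":
                    p.commit(txn)
                else:
                    p.abort(txn)
                break
            except NetworkError:
                continue  # retry indefinitely until the participant responds
    return decision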
2PC is also called a blocking atomic commit protocol, as 2PC can become stuck waiting for the coordinator to recover. Three-phase commit (3PC) is an alternative that requires a perfect failure detector.
Unfortunately, Two-Phase Commit is rarely used in practice, because distributed transactions are blocking, causing performance and operational problems.
On the other hand, Paxos and Raft are popular fault-tolerant consensus algorithms that are widely used in practice. They solve the problem of consensus, which can be used to implement atomic commit, and they do not block as long as a majority of nodes are reachable.
The properties required of a consensus algorithm are:
Uniform agreement: No two nodes decide differently.
Integrity: No node decides twice.
Validity: If a node decides a value v, then v must have been proposed by some node.
Termination: Every node that does not crash eventually decides some value.
How are Paxos and Raft implemented? Consensus algorithms are actually total order broadcast algorithms, using single-leader replication.
Both protocols define a monotonically increasing epoch number (ballot number in Paxos and term number in Raft) and make a guarantee that within each epoch, the leader is unique.
Whenever the current leader is thought to be dead, the nodes start a vote to elect a new leader. In each election round, the epoch number is incremented. If we have two leaders belonging to different epochs, the one with the higher epoch number will prevail.
A node cannot trust its own judgement. It must collect votes from a quorum of nodes. For every decision that a leader wants to make, it must send the proposed value to the other nodes and wait for a quorum of nodes to respond in favor of the proposal.
There are two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal. The quorums for those two votes must overlap.
The biggest difference from 2PC is that 2PC requires a ‘YES’ vote from every participant, whereas consensus algorithms only need a majority.
We’ve seen how Paxos and Raft can implement consensus, a powerful tool that allows several nodes in a distributed system to agree on some state. We’ve looked at how Paxos and Raft are actually total order broadcast algorithms using single-leader replication, and how this can be used to achieve linearizability. Feel free to subscribe as we look at batch processing next week, and share this article if you liked it!
This post was originally published in the 10% Smarter newsletter. Consider subscribing if you like it!
Transactions are one of the fundamental ideas in databases. They enable safe semantics for applications and developers. But what are they and how do you use them? Let’s dive in and find out.
Transactions group multiple reads and writes into one operation that rolls back everything if any step fails (all succeed or all fail). They sacrifice performance for safety guarantees.
To write a transaction in MySQL or Postgres, you can write:
BEGIN TRANSACTION;
-- Your queries
COMMIT;
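From application code, the same pattern looks like the following runnable Python sketch, using the standard library’s sqlite3 module (the accounts table is made up for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 500), (2, 500)")
conn.commit()

try:
    # Both updates happen inside one transaction: all succeed or all fail.
    conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
    conn.commit()    # atomically makes both writes durable
except Exception:
    conn.rollback()  # discard every write from the transaction
    raise

If either UPDATE fails, the rollback discards both, so money is never created or lost partway through.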
Unlike relational databases such as MySQL or Postgres, NoSQL databases often choose to forgo transactions for greater scalability.
Transactions protect against:
Database software or hardware failure in the middle of a write.
Application crashes midway through a series of operations.
Network failures that cut off the application from the database, or one database node from another.
Multiple clients writing to the database at the same time (overwriting each other).
Clients reading partially updated data.
Race conditions that cause subtle bugs.
How can transactions enable safety against these failures? Transactions follow a set of properties known as ACID that make them safe.
ACID stands for Atomicity, Consistency, Isolation, and Durability.
Atomicity refers to the ability to abort a transaction on error and have all the writes from the transaction discarded. It can then be safely retried.
Consistency means your application’s data must always be in a good state. Your application will have invariants on your data that must always hold true. Consistency is actually a property of the application, but transactions help enforce those invariants.
Isolation in ACID means serializability. Concurrently executing transactions are isolated from each other, such that each transaction seems to be the only transaction running on the entire database and the result seems like they were run serially (one after the other).
Durability means once a transaction is committed, any data that has been written will not be forgotten, even if there is a hardware failure or the database crashes.
In a single-node database, this means the data has been written to the hard drive. A write-ahead log is also used for recovery. In a replicated database, it means the data has been successfully copied to some number of nodes.
It’s important to note that perfect durability does not exist. If all the hard disks and backups are destroyed at the same time, such as in a power outage or by a bug that crashes every node, there’s nothing the database can do to save you.
Why do we care about serializability and isolation? They help mitigate concurrency issues (e.g. race conditions), which happen when one transaction reads data while another transaction modifies it, or when two transactions both modify the same data.
Let’s look at some different isolation levels.
Serializable Isolation is the strongest form of isolation. Concurrently executing transactions are isolated from each other and appear to run serially (one after the other).
Serializable isolation has a performance cost. It’s common for systems to use weaker levels of isolation.
Next is Read Committed Isolation, an isolation level that guarantees no dirty reads and no dirty writes.
Dirty Reads: When a transaction is able to read uncommitted data from another transaction.
Dirty Writes: When a transaction is able to overwrite uncommitted data from another transaction.
Most databases prevent dirty writes by using row-level locks that hold the lock until the transaction is committed or aborted. Only one transaction can hold the lock for any given object.
For dirty reads, requiring read locks does not work well: one long-running write transaction can force many read-only transactions to wait. Instead, the database remembers both the old committed value and the new value set by the transaction that currently holds the write lock.
One issue with Read Committed Isolation is read skew. Read skew means a transaction might read one object before a separate transaction commits and another object after it commits, observing an inconsistent state of the database.
For example, say Alice has $1,000 split across two bank accounts, $500 each. If she reads both balances while a transfer of $100 between them happens concurrently, she could see $500 in one account and $400 in the other, a total of only $900.
This is an issue for long running operations with transactions still happening such as backups and analytics queries.
To prevent read skew, each transaction reads from a consistent snapshot of the database (snapshot isolation). Write locks are used to prevent dirty writes. To prevent dirty reads without blocking writes, the database keeps multiple committed versions of each object, a technique known as multi-version concurrency control, or MVCC.
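To make MVCC concrete, here is a toy Python sketch (not how any particular database implements it): every write appends a new version tagged with a transaction ID, and a snapshot read ignores versions committed after the snapshot was taken.

class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_txid, value)
        self.next_txid = 1

    def write(self, key, value):
        # Each committed write appends a new version; old versions remain.
        txid = self.next_txid
        self.next_txid += 1
        self.versions.setdefault(key, []).append((txid, value))
        return txid

    def snapshot_read(self, key, snapshot_txid):
        # Only versions committed at or before the snapshot are visible.
        visible = [(t, v) for t, v in self.versions.get(key, [])
                   if t <= snapshot_txid]
        return visible[-1][1] if visible else None

store = MVCCStore()
snapshot = store.write("balance", 500)   # reader's snapshot taken here
store.write("balance", 400)              # concurrent write commits later
print(store.snapshot_read("balance", snapshot))  # 500 -- no read skew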
However, both Read Committed and Snapshot Isolation are still vulnerable to Lost Updates. A lost update occurs when two transactions each read some value from the database, modify it, and write it back (a read-modify-write cycle); one of the modifications can be lost.
To deal with lost updates, there are many strategies:
Atomic Write Operations: these are the simplest solution to lost updates. Here we use atomic operations such as:
UPDATE counters SET value = value + 1 WHERE key = 'foo';
Explicit Locking: the application explicitly locks objects that are going to be updated.
Detect Lost Updates: if a lost update is detected, the transaction can be aborted and forced to retry its read-modify-write cycle.
Compare and Set: if the current value does not match with what you previously read, the update has no effect.
UPDATE wiki_pages SET content = 'new content'
WHERE id = 1234 AND content = 'old content';
Conflict Resolution and Replication: with multi-leader or leaderless replication, compare-and-set does not apply because there is no guarantee of a single up-to-date copy of the data. Instead, the strategy is to allow concurrent writes to create several conflicting versions of a value and let the application resolve and merge the versions.
Write skew can occur if two transactions read the same objects and then update some of those objects (if they updated the same object, you would get a dirty write or lost update anomaly instead).
For example, suppose at least one doctor must be on call, and two doctors concurrently request to go off call. Under snapshot isolation, each transaction checks that at least two doctors are currently on call and then proceeds. Both transactions can commit successfully, leaving no doctor on call.
Automatically preventing write skew requires serializable isolation. The second-best option is to explicitly lock the rows that the transaction depends on:
BEGIN TRANSACTION;
SELECT * FROM doctors WHERE on_call = true AND shift_id = 1234 FOR UPDATE;
UPDATE doctors
SET on_call = false
WHERE name = 'Alice'
AND shift_id = 1234;
COMMIT;
As we’ve seen with these problems, it can often make sense to use serializable isolation. Let’s look at the ways we can achieve it.
The first option is actual serial execution: to remove concurrency problems, execute only one transaction at a time, in serial order, on a single thread.
The second option is two-phase locking (2PL). Writers acquire locks that block both writers and readers. This protects against all the race conditions discussed earlier.
The blocking is implemented by having a lock on each object used in a transaction. The lock can either be in shared mode or in exclusive mode. The lock is used as follows:
When a transaction wants to read an object, it must first acquire a shared mode lock. Multiple read-only transactions can share the lock on an object.
When a transaction wants to write an object, it must acquire an exclusive lock on that object.
If a transaction first reads and then writes to an object, it may upgrade its shared lock to an exclusive lock.
After a transaction has acquired the lock, it must continue to hold the lock until the end of the transaction.
The two phases are the first, during which locks are acquired, and the second, at the end of the transaction, when all the locks are released.
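Here is a toy Python lock manager showing the shared/exclusive rules, including the lock upgrade. It is only a sketch: real databases add deadlock detection, fair queueing, and the predicate locks discussed next.

import threading
from collections import defaultdict

class LockManager:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = defaultdict(set)  # object -> txns holding shared lock
        self._writer = {}                 # object -> txn holding exclusive lock

    def acquire_shared(self, txn, obj):
        with self._cond:
            # Block while another transaction holds the exclusive lock.
            while self._writer.get(obj) not in (None, txn):
                self._cond.wait()
            self._readers[obj].add(txn)

    def acquire_exclusive(self, txn, obj):
        with self._cond:
            # Block until no other writer or readers remain; a txn that is
            # the only reader may upgrade its shared lock to exclusive.
            while (self._writer.get(obj) not in (None, txn)
                   or self._readers[obj] - {txn}):
                self._cond.wait()
            self._writer[obj] = txn

    def release_all(self, txn):
        # Phase two: all locks are released only at the end of the txn.
        with self._cond:
            for obj in self._readers:
                self._readers[obj].discard(txn)
            for obj in [o for o, w in self._writer.items() if w == txn]:
                del self._writer[obj]
            self._cond.notify_all()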
Predicate locks lock all objects that match some search condition including objects that do not yet exist in the database, but which might be added in the future.
Two-phase locking with predicate locks prevents all forms of write skew, allowing for serializable isolation. For example, a predicate lock for a room-booking database would cover a query like:
SELECT * FROM bookings WHERE room_id = 123 AND end_time > '2018-01-01 12:00' AND start_time < '2018-01-01 13:00';
Two-phase locking has some disadvantages, though. Deadlock, where transaction A is stuck waiting for transaction B to release its lock and vice versa, can occur. Two-phase locking’s performance is also significantly worse than weak isolation: a transaction may have to wait for several others to complete before it can do anything.
Serializable Snapshot Isolation provides full serializability and has a small performance penalty compared to snapshot isolation.
Instead of holding locks, transactions are allowed to continue executing as normal until they are about to commit, at which point the database decides whether the transaction executed in a serializable manner. This approach is known as optimistic concurrency control.
Two-phase locking is called pessimistic concurrency control because if anything might possibly go wrong, it’s better to wait.
Serializable snapshot isolation is an optimistic concurrency control technique. Instead of blocking if something potentially dangerous happens, transactions continue anyway, in the hope that everything will turn out all right. The database is responsible for checking whether anything bad happened. If so, the transaction is aborted and has to be retried.
The database knows which transactions may have acted on an outdated premise and need to be aborted by:
Detecting reads of a stale MVCC object version: when the transaction commits, the database checks whether any of the writes it ignored have since been committed. If so, the transaction must be aborted.
Detecting writes that affect prior reads: SSI uses index-range locks, except that it does not block other transactions; it simply notifies them that the data they read may no longer be up to date.
This covers transactions and their properties that enable safe usage in databases and applications. Stay tuned as we explore the troubles of distributed systems next week. Feel free to share this article if you liked it!
This post was originally published in the 10% Smarter newsletter. Consider subscribing if you like it!
In past posts, we’ve covered the data modeling side of databases. This post will give an overview of internal storage and retrieval in databases. Let’s start by comparing two different storage engines:
Log-structured storage engines (log structured merge trees)
Page-oriented storage engines (b-trees)
To start, here are some definitions:
A log is an append-only sequence of records. A record is a key, value pair to store and access data.
Indexes are data structures that speed up reads by helping find a record. Well-chosen indexes speed up read queries, but every index slows down writes.
In a log-structured storage engine, we index records using an in-memory key-value store, also known as a hash map.
Additionally, we keep records in an append-only log file. Wouldn’t we then run out of disk space? We solve this by breaking the log into segment files of a certain size. Compaction takes a segment and keeps only the most recent record for each key.
We then merge segments into a new file and delete the old segment files; compaction and merging can be done at the same time, in a background thread. Each segment has its own in-memory hash map, and on reads we check the most recent segment’s hash map first, then the next-older one, and so on.
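Here is a toy Python sketch of this hash-indexed, append-only log (a single segment, loosely in the style of Bitcask). The comma-separated text format is a simplification; real engines use a binary format.

import os

class HashIndexLog:
    def __init__(self, path):
        self.index = {}              # key -> file offset of latest record
        self.f = open(path, "a+b")   # append-only segment file

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset     # newer writes shadow older records

    def get(self, key):
        self.f.seek(self.index[key])
        line = self.f.readline().decode().rstrip("\n")
        _, value = line.split(",", 1)
        return value

db = HashIndexLog("segment-0.log")
db.put("foo", "bar")
db.put("foo", "baz")   # the old record stays on disk until compaction
print(db.get("foo"))   # baz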
This is the basic Log-Structured Storage Engine design. Its properties and considerations are as follows:
Efficiency: both appending and segment merging are sequential writes, which are more efficient than random writes.
File format: binary format.
Deleting Records: we use a deletion record (tombstone) that tells the merging process to discard previous values.
Crash Recovery: if restarted, the in-memory hash maps are lost. You can recover by reading each segment, or speed up recovery by storing a snapshot of each segment’s hash map on disk.
Concurrency Control: this uses a single write thread as writes are appended to the log in a strictly sequential order, but reads can be concurrent by multiple threads as segments are immutable.
Limitations: The hash map must fit in memory and range queries are not efficient.
So far we’ve looked at a basic version of a log-structured storage engine, but a modified design known as SSTables and LSM-trees is used more in practice.
A Sorted String Table, or SSTable, is another design that requires the sequence of key-value records to be sorted by key and each key to appear only once in the merged segment file (compaction already ensures this).
The advantages of SSTables over log segments with hash indexes are:
Merging segments is simple and efficient with mergesort.
You can store fewer mapping keys in memory. The rest are easy to find with a search between indexed keys.
Since reads will scan over several key-value records in a range, we can group these records into a block and compress it before writing to disk.
Here is how the full design works:
When a write comes in, add it to an in-memory balanced tree (a red-black or AVL tree), called a memtable.
When the memtable gets bigger than some threshold (a few megabytes), write it out to disk as an SSTable. Because the tree keeps keys sorted, this is efficient.
On a read request, first try to find the key in the memtable, then in the most recent on-disk SSTable segment, then the next-older one, and so on.
From time to time, run merging and compaction in a background thread.
If the database crashes, the most recent writes in the memtable would be lost, so append every write to a separate on-disk log; once the memtable is written to an SSTable, that log can be discarded.
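Here is a toy memtable in Python. Real engines use a red-black or AVL tree; a sorted list keeps the sketch short. The important property is that flushing writes records in sorted key order, which is exactly what makes the output a valid SSTable segment.

import bisect

class Memtable:
    def __init__(self, threshold=4):
        self.keys, self.values = [], []
        self.threshold = threshold

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value           # overwrite in place
        else:
            self.keys.insert(i, key)         # keep keys sorted
            self.values.insert(i, value)

    def should_flush(self):
        return len(self.keys) >= self.threshold

    def flush(self, path):
        # Writing in key order produces a sorted SSTable segment.
        with open(path, "w") as f:
            for k, v in zip(self.keys, self.values):
                f.write(f"{k},{v}\n")
        self.keys, self.values = [], []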
Storage engines based on this design are called LSM-tree storage engines (Log-Structured Merge-Tree).
The LSM-tree algorithm can be slow when looking up keys that don’t exist in the database. Bloom filters are used to speed up these existence checks.
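A Bloom filter itself is simple: k hash probes into a bit array. A ‘no’ answer is definite (the key was never added), while a ‘yes’ may be a false positive, in which case the engine falls back to checking the SSTables. A toy sketch:

import hashlib

class BloomFilter:
    def __init__(self, size=1024, k=3):
        self.size, self.k = size, k
        self.bits = bytearray(size)

    def _probes(self, key):
        # Derive k probe positions from k salted hashes of the key.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._probes(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._probes(key))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # True
print(bf.might_contain("bob"))    # False (with high probability)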
Databases that use SSTables and LSM-Trees include:
LevelDB (Google, based on BigTable’s design)
RocksDB (Facebook, open sourced)
Cassandra
Now let’s look at page-oriented storage engines.
Page-oriented storage engines are built on B-trees, the indexing structure used in relational databases. B-trees keep key-value records sorted by key, allowing for efficient key-value lookups and range queries. The database is broken into fixed-size blocks or pages on disk, traditionally 4KB.
When searching a B-tree, you start at the root page, which contains multiple keys and references to child pages, and traverse down the tree until you find your requested record. The branching factor is the number of references to child pages.
Both updates and writes search through the tree for the leaf page containing the key. If there isn’t enough free space in the page, it is split into two half-full pages to keep the tree balanced. A B-tree with n keys always has a depth of O(log n).
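Here is a minimal sketch of the search path in Python, using a toy in-memory tree that ignores disk pages, splits, and balancing:

from bisect import bisect_left, bisect_right

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted keys within this page
        self.children = children  # child pages; None for a leaf
        self.values = values      # records, present only at leaves

def btree_search(root, key):
    node = root
    while node.children is not None:
        # Follow the child reference whose key range covers `key`.
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

left = Page(keys=["a", "c"], values=[1, 2])
right = Page(keys=["g", "m"], values=[3, 4])
root = Page(keys=["g"], children=[left, right])
print(btree_search(root, "m"))  # 4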
Write operations overwrite pages on disk, compared to LSM-trees which only append to files.
If the database crashes with only some pages written, you’ll end up with a corrupted index. We solve this with a write-ahead log (WAL, also known as the redo log).
Concurrent access by multiple threads is handled with latches (lightweight locks) on the tree’s internal data structures.
A common extension is the B+ tree, which stores data only at leaf nodes, with pointers between leaves for easy range queries.
LSM-trees are typically faster for writes, and B-trees are faster for reads. LSM-trees are slower for reads because they have to check several different data structures and SSTables at different stages of compaction.
The Advantages of LSM-trees are:
Higher write throughput due to lower write amplification. SSDs can only overwrite blocks a limited number of times before wearing out. LSM-tree writes are mostly sequential appends.
LSM-trees compress better; B-trees tend to leave disk space unused due to fragmentation.
The Disadvantages of LSM-trees are:
The background compaction process can interfere with the performance of reads and writes.
On B-trees, each key exists in exactly one place in the index, which makes it easier to offer strong transactional semantics by attaching locks directly to ranges of keys in the tree.
So far we’ve looked at indexes for log-structured and page-oriented storage engines. A primary index uniquely identifies one row in a relational table or one document in a document database. This is done by mapping the key to the key-value record (row/document).
You can also use a secondary index, which is similar but not necessarily unique. These indexes can either map each key to a list of matching rows, or make each entry unique by appending a row identifier.
Multi-Column Indexes allow you to index using multiple columns as a key. There are two ways to do it. Concatenated indexes combine the columns into a tuple like (lastname, firstname). Multi-dimensional indexes are more general and useful for geospatial data, which B-trees and LSM-trees cannot handle efficiently; this is done using an R-tree.
So far we’ve looked at storage using disks and indexing using memory. Disks have two significant advantages: they are durable, and they have a lower cost per gigabyte than RAM. However, RAM is quickly becoming more affordable.
You can indeed keep everything in memory and examples of in-memory systems are Spark, MemSQL, and in-memory caches such as Redis and Memcached.
Why do we need log-structured storage engines and page-oriented storage engines? Why couldn’t we use only one of them? Database operations have evolved to fall under two processes: transaction processing or analytics processing.
OLTP: Online Transaction Processing. Transactions are a group of reads and writes that form a logical unit and typically return a small number of records. These are necessary for bank transactions.
OLAP: Online Analytics Processing. These queries are performed by business analysts and aggregate over a huge number of records.
Initially, OLTP and OLAP were both done on the same database. However analytics workloads, such as aggregation, on a database are expensive so data warehouses are a separate specialized database for analytics usage. Data is extracted from OLTP databases and loaded into the warehouse using an Extract-Transform-Load (ETL) process.
Some examples are Amazon Redshift, Snowflake, Spark SQL, or Facebook Presto.
In most OLTP databases, storage is laid out in a row-oriented fashion. Analytics queries often query across few columns but all rows.
Column Oriented Storage databases for OLAP operations store all the values from each column together instead. If each column is stored in a separate file, a query only needs to read and parse the columns that it is interested in.
Column Oriented Storage databases use LSM-trees. All writes go to an in-memory store first, where they are added to a sorted structure and prepared for writing to disk.
Cassandra and BigTable are column-family databases. Within each column family, they store all the columns from a row together, along with a row key.
This summarizes internal storage and retrieval in databases, comparing log-structured and page-oriented storage engines and their use in OLTP and OLAP workloads.
Next week, we’ll look at the encoding and dataflow of data as it goes through databases, services, and message queues. Stay tuned for more, and share if you liked this article!
This post was originally published in the 10% Smarter newsletter. Consider subscribing if you like it!
Are you just getting started being an engineer and already feeling overwhelmed? Fear not! This newsletter aims to give a quick 10 minute summary of concepts to help you become a better software engineer every week.
I’ve decided to start off with System Design: how do you design a large software application like Instagram, YouTube, or Netflix? You might know how to program in a language like Python or Java and have some data structures and algorithms knowledge, but System Design isn’t taught much in college and is an important skill for being an effective engineer. So let’s get started!
This week I’ll be going through chapter 1 of Designing Data-Intensive Applications: Reliable, Scalable, and Maintainable Applications.
Reliability is the ability to work correctly and be resilient to faults, tolerating mistakes and surprises. A reliable system performs well even under load. The goal is to prevent faults from turning into failures.
Scalability is the ability to cope with increased load in terms of data, traffic, or complexity.
Maintainability is the ability to keep software easy to maintain. This involves making a system operable, simple, and evolvable. Anticipating future problems and designing good abstractions like APIs improve maintainability.
We can see these would be important in designing a service like Instagram, YouTube, or Netflix. Now, how would we design these systems?
Before designing a system, you must gather requirements to ensure the system is successful.
Functional requirements are what the application should do, such as allowing data to be stored, retrieved, searched, and processed. For example, Instagram lets users upload photos and follow other users to like their photos.
Nonfunctional requirements are general properties like reliability, scalability, maintainability, and security.
Once you have gathered your requirements, you can design out a basic architecture. Software engineering used to be hard, but today you can build systems with Lego-like building blocks using cloud services such as AWS! Some basic building blocks are:
Databases: to store data
Caches: to speed up reads
Search Indices: to speed up searches
Stream Processing: to send messages between processes asynchronously
Batch Processing: to periodically crunch data
Let’s go more in-depth. How can we make the system fault-tolerant, scalable, and performant?
What are common kinds of faults in systems?
Hardware faults: When hardware and entire machines fail. Hard disks have a mean time to failure (MTTF) of 10 to 50 years. On a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
Software Errors: Bugs and Systematic errors such as Runaway Processes or Cascading Failures.
Human Errors: Human mistakes are the leading cause of errors; configuration errors are the leading cause of outages.
Here is a good list of post-mortems you can check out to see fault and failure cases in action!
Now onto scalability. We can look at an example from Twitter.
Twitter’s main operations are
Users post tweets at 4.6k requests/sec on average and 12k requests/sec at peak
Reading a user’s home timeline happens at 300k requests/sec
For Twitter, simply handling 12,000 writes per second (the peak rate for posting tweets) is easy, but the fan-out is the hard part.
Two ways to implement this are
Posting a tweet inserts it into a global tweets table. When a user requests their home timeline, look up the people they follow, find the tweets by those users, and merge them with a SQL JOIN, sorted by time
Use a cache for each user’s home timeline. When a user posts a tweet, look up the people that follow that user and insert the new tweet into each of their home timeline caches.
Looking at the tradeoffs:
Approach 1 is simpler and better for users with many followers. However, the home timeline join query puts a heavy load on the database.
Approach 2 requires less load at read time: the average rate of published tweets is two orders of magnitude lower than the rate of home timeline reads, so doing more work at write time and less at read time is preferable. However, if a user has 30 million followers, a single tweet requires 30 million home timeline cache writes, which is also a lot of load.
Twitter uses a hybrid of both approaches. Most users’ tweets are fanned out to home timeline caches, while tweets from celebrities with large numbers of followers are fetched and merged in separately at read time, as sketched below.
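A minimal sketch of approach 2’s fan-out on write, with in-memory dicts standing in for Twitter’s timeline caches:

from collections import defaultdict, deque

followers = defaultdict(set)    # user -> the users who follow them
timelines = defaultdict(deque)  # user -> cached home timeline

def post_tweet(user, tweet):
    # Fan out on write: push the tweet into every follower's cache.
    for f in followers[user]:
        timelines[f].appendleft((user, tweet))

def home_timeline(user, limit=50):
    # Reads are cheap: the timeline is already materialized.
    return list(timelines[user])[:limit]

followers["alice"] = {"bob", "carol"}
post_tweet("alice", "hello world")
print(home_timeline("bob"))  # [('alice', 'hello world')]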
Twitter had dedicated whole server racks for Justin Bieber who was 3% of their traffic.
Performance is another important consideration when thinking about a system. How do you describe performance?
In a batch processing system like Spark, we care about throughput: the number of records processed per second.
In online systems we care about the service’s response time: time between a client sending a request and receiving a response.
An important distinction between response time and latency is that the response time is what the client sees: time to process the request, network and queueing delays. Latency is the duration that a request is waiting to be handled.
Additionally, we should use percentiles to describe latency. Web services receive many requests, and we want the majority of them to be fast; an average is skewed by outliers.
p99 and p999 latency mean that 99% or 99.9% of requests are handled faster than a given response-time threshold.
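Computing a percentile from response-time samples is straightforward. Here’s a small Python example using the nearest-rank method (the sample numbers are made up):

import math

def percentile(latencies_ms, p):
    ranked = sorted(latencies_ms)
    rank = math.ceil(len(ranked) * p / 100)  # nearest-rank method
    return ranked[max(rank - 1, 0)]

samples = [12, 15, 20, 22, 25, 30, 45, 80, 120, 950]
print(percentile(samples, 50))  # 25  -- the median response time
print(percentile(samples, 99))  # 950 -- the slowest 1% dominates p99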
Amazon uses p999 for their internal services and has service level agreements (SLAs) as a contract, e.g. the median (p50) response time will be less than 200ms and the p99 under 1s. A customer can get a refund if this is not met.
Finally, we want our systems to scale. There are three ways to scale with load:
Vertical Scaling: using a more powerful machine
Horizontal Scaling: distributing a load across multiple smaller machines
Elastic Systems: Automatically adding computing resources when load increases
This week, we’ve covered an overview of designing Reliable, Scalable, and Maintainable Applications! It might be a lot to think about and digest but feel free to go over it and review. Always keep learning and you’ll for sure grow as an engineer.
Next week we’ll go over Data Models and Query Languages. Ever wonder what NoSQL is?
Some recommendations in this post are linked as Amazon Associates links. A small amount from those purchases will help support me.
Cleansing is the first step to protect your skin and allow it to grow healthy. When your skin is damaged, it’s best to use a foam cleanser that will be soft on your skin. Work the foam up in your hand and use your ring finger to rub in the foam in a circular motion. Don’t rub harshly, and try to let your face air dry or gently pat it with a towel.
Your skin has a natural pH acidity level of 4-6. Soaps can be very cleansing because they have a high pH: they are very alkaline and can strip away dirt and grime. The top layer of your skin is covered by a thin acid mantle, and a damaged skin barrier can take two weeks to return to a normal pH level. Unlike normal soap, foam cleansers have a low pH. They aren’t as stripping as normal soap and cleanse your skin while still being soothing.
Why is moisturizing so important? Moisture in your skin is easily lost when it is damaged. When this happens, you can experience transepidermal water loss, which can leave your skin harsh and dry. It’s very important to moisturize, and my favorite moisturizer is CeraVe Moisturizing Cream. It’s soothing and includes ceramides, which provide skin repair and protection.
Your skin is made of three layers: the Epidermis, the Dermis, and the Hypodermis. The Epidermis is the outer layer, providing protection for your skin. The Dermis is the second layer, which provides elasticity to your skin and is made of collagen. The third, inner layer is the Hypodermis, which provides cushion to your skin and is made of fat cells.
Ceramides are an important natural ingredient found in the Epidermis layer of your skin. People with eczema, such as myself, have lower levels of ceramides, which can leave cracks and openings in the skin and cause transepidermal water loss. So it’s important for me to use a lot of moisturizer to keep my skin healthy.
After you have finished moisturizing, there are a few other ingredients you can apply to help repair your skin. First is hyaluronic acid, a natural ingredient in your body for tissue regeneration and skin repair. Hyaluronic acid will draw water into the skin, keeping it moist and healthy. It will also help heal your skin without scars if it has been cut. While it is powerful, try to find just the right Goldilocks amount, as overuse can have the opposite effect and inflame your skin. Hyaluronic acid can be found in serums such as Asterwood Naturals Hyaluronic Acid Serum.
Next is Cica, also known as Centella Asiatica. This is a compound found in plants, which use it for cellular repair. Cica is rich in nutrients, kills bacteria, and helps repair a damaged skin barrier. This makes it good for use on burns or eczema and helps make your skin glowy and fresh. You can find it in COSRX Ceramide and Centella Asiatica Cream or the Laneige CICA Sleeping Mask, a moisture cream you can apply at night so you won’t lose moisture in your sleep.
We’ve looked at the three steps to take in your skin care routine to help heal broken skin and promote strong, healthy skin. To recap: use a gentle foam cleanser with a low pH to clean your face, moisturize with a ceramide-based moisturizing cream to keep your skin hydrated, and apply repairing ingredients such as hyaluronic acid and Centella Asiatica to help promote healthy skin repair. Hopefully this keeps your skin care routine simple and easy and helps you on your journey to strong, healthy skin.
Firstly, don’t peel it off. That will pull off healthy skin as well! I know, flaky skin is annoying and looks disturbing, but I’ve found the best thing to do is just to apply lotion and moisturizer. This will soothe your skin and smooth flaky bits so they don’t stick out of your skin.
My usual process is to apply a little bit of CeraVe moisturizer, then an oil-based protectant like coconut oil or Vaseline, and finally reapply CeraVe moisturizer again. Oil-based products provide protection on top of your skin. Your skin naturally has fats called sebum that help create a barrier, but some skin does not produce enough sebum, which is why supplementing with oil-based protection is useful. However, oil-based products can stick to flaky skin and make it easier to flake off, which we don’t want! That’s why I find it useful to apply CeraVe moisturizer before and after applying oil-based protectants. CeraVe moisturizer also includes ceramides, which make up 50% of your skin and are important in keeping your skin strong and healthy. Other oil-based products I’ve heard are good include shea butter, jojoba oil, and rosehip oil, but I have not tried them yet.
Another useful tip is to take cold showers to reduce inflammation. Afterwards, dry yourself off by dabbing with the towel; don’t rub it into the peeling areas. In addition, try to avoid overwashing your face. I used to wash my face in the morning with water, but it would dry out my skin. Instead, you can dab it with a slightly damp towel.
To reduce flaky skin in the future, it is important to understand the root cause. I have eczema, which is an autoimmune disease. My skin experiences transepidermal water loss, so it can dry out easily and create flaky skin. Two other common causes of dry skin are cold weather, which can pull water out of the skin, and overexfoliating, which can wipe away healthy skin cells and natural oil protection. Also be sure you are staying well hydrated and drink lots of water!
Hopefully these tips can help you deal with flaky skin in the future. Keep calm and moisturize on!
This post comes from a YouTube video I created on how to manage your eczema. Consider subscribing if you like it!
Eczema, also known as Atopic Dermatitis, is a skin condition resulting from inflammation of the skin. It can be itchy, painful, and result in patchy, flaky skin. Luckily your boi Brian is here with some skin tips for dealing with eczema.
Firstly, Moisturize your skin. I use oil products like Vaseline and a moisturizer like CeraVe.
Next, take care of your mental health and stress. Stress can be one cause of flareups of eczema.
Third, consider consulting a dermatologist or skin specialist. I was prescribed topical steroid cream. It made my skin better, but you should not overuse it.
Fourth, use lukewarm water in the shower. My skin is sensitive to hot showers, changes in temperature, and soap. I use Cetaphil soap.
Finally, avoid irritating your skin. Look at your lifestyle and see if anything is causing irritation, like skincare products, touching your face too roughly, or diet.
As a bonus tip, be kind to yourself in the healing process. :)
Enjoy and hope to see you guys soon.