Merge pull request #114 from scalableinternetservices/highAvailability

High availability lecture notes
scalableinternetservices · Oct 10, 2024 · ab503eb
2 parents 18da4ef + 751dbef
commit ab503eb

Showing 3 changed files with 53 additions and 35 deletions.
diff --git a/schedule.md b/schedule.md
@@ -57,7 +57,7 @@ The following schedule is subject to change, and many slide links are not yet ac
 ### Topics
 
 - Scaling Web Applications - [slides](/slides/2024f/05_scaling_web_applications/index.html)
-- Architecting for High Availability
+- Architecting for High Availability [slides](/slides/2024f/06_high_availability/index.html)
 
 ### Tasks
 

diff --git a/slides/2024f/05_scaling_web_applications/index.html b/slides/2024f/05_scaling_web_applications/index.html
@@ -574,7 +574,8 @@
 
 What else could we do?
 
-??? 
+???
+
 * Leave the source IP address as the client's IP address
 * Let the server bypass the load balancer and send the response directly to the client
 * This is called Direct Server Return (DSR) or Direct Routing (DR)

diff --git a/slides/2024f/06_high_availability/index.html b/slides/2024f/06_high_availability/index.html
@@ -28,6 +28,13 @@
 Having access to these services (high availability) at any time is increasingly
 important.
 
+???
+
+As users of these services, we expect them to be available at all times.
+When they are not, it can be frustrating and even dangerous.
+When we build these services, we need to think about how customers will depend on them
+and what level of availability we need to provide.
+
 ---
 
 # Expressing Availability
@@ -47,6 +54,14 @@
 
 [Availability Table](https://sre.google/sre-book/availability-table/#tablea1)
 
+???
+
+What is considered downtime?
+* Unplanned outages
+* Are scheduled outages considered downtime? Usually not,
+  but we can architect our system to minimize downtime during
+  certain types of maintenance, like software updates
+
 ---
 
 # Measuring Availability SLIs/SLOs/SLAs
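
The availability table linked in the notes above follows from simple arithmetic: the allowed downtime is the length of the period multiplied by the unavailable fraction. The sketch below assumes a 365-day year and treats every second of outage as downtime; the function name `allowed_downtime_seconds` is hypothetical and only for illustration.

```python
# Allowed downtime for a given availability target (a sketch, not an SLA tool).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Return how many seconds of downtime fit inside the availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Print the yearly downtime budget for two through five nines.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_seconds(target) / 60
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Whether scheduled maintenance counts against this budget depends on how the service defines downtime, so the number is only an upper bound under that definition.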
@@ -81,45 +96,15 @@
 # Possible causes of failures. What if?
 
 * a server process hangs?
-
---
-
 * a server process dies?
-
---
-
 * an application server fails?
-
---
-
 * a load balancer fails?
-
---
-
 * a switch fails?
-
---
-
 * a router fails?
-
---
-
 * a connection to the Internet fails?
-
---
-
 * DNS fails?
-
---
-
 * the Internet fails?
-
---
-
 * a database fails?
-
---
-
 * an entire data center fails?
 
 ---
@@ -148,6 +133,13 @@
   dead) by having a pool of other servers to direct traffic to.
 ]
 
+???
+
+A load-balanced system can handle a server failure by directing traffic to the remaining servers.
+However, if the system is already operating at capacity, the loss of a server could cause a cascading failure.
+Part of high availability is ensuring that the system can handle the loss of a server without causing a failure.
+To do this we need to monitor the load on the servers and have a method to add capacity when needed.
+
 ---
 
 class: center, middle
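
The capacity point in the load balancing notes above can be checked with a small calculation: if the pool is already too busy, the surviving servers cannot absorb a failed server's share. This is a minimal sketch assuming identical servers and perfectly even load distribution; `survives_server_loss` and `max_safe_utilization` are hypothetical helper names, not part of any monitoring tool.

```python
def survives_server_loss(server_count: int, utilization: float,
                         failures: int = 1) -> bool:
    """Return True if the remaining servers can absorb the failed servers' load.

    utilization is the average fraction of capacity in use per server
    (0.0 to 1.0). Assumes identical servers and evenly balanced load.
    """
    remaining = server_count - failures
    if remaining <= 0:
        return False
    # Total load, measured in units of "one server's full capacity".
    total_load = server_count * utilization
    return total_load <= remaining


def max_safe_utilization(server_count: int, failures: int = 1) -> float:
    """Highest average utilization that still tolerates the given failures."""
    remaining = max(server_count - failures, 0)
    return remaining / server_count


if __name__ == "__main__":
    print(survives_server_loss(4, 0.70))  # three servers can carry 2.8 units
    print(survives_server_loss(4, 0.80))  # three servers cannot carry 3.2 units
    print(max_safe_utilization(4))
```

With four servers, the pool tolerates one failure only while average utilization stays at or below 75%, which is the kind of threshold monitoring would alert on before adding capacity.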
@@ -182,7 +174,6 @@
 
 > During a failover, what happens to the IP address?
 
-
 ---
 
 # Load Balancer Failover
@@ -200,6 +191,17 @@
 Established flows can be supported depending on how much information sharing
 occurs between the load balancers.
 
+???
+
+There are different ways to handle load balancer failure.
+One way is to run two load balancers, a primary and a failover, that exchange heartbeats to monitor each other's health.
+If they are proxy load balancers (layer 4 or layer 7), they may share session information so that if one fails, the other can take over its sessions.
+
+Layer 4 packet-rewriting load balancers may not need to share session information.
+Another way is to have a router that implements ECMP (Equal-Cost Multi-Path) routing to distribute traffic to multiple load balancers.
+In this case, the load balancers do not need to communicate with each other.
+The router will distribute traffic to the load balancers based on a hash of packet header fields, such as the source IP address.
+
 ---
 
 class: center, middle
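
The ECMP idea in the load balancer failover notes above can be illustrated in a few lines: hash a header field and use the result to choose among equal-cost next hops. This is only a sketch of the concept. Real routers do this in hardware, usually over several header fields, and the SHA-256 hash, the `pick_load_balancer` name, and the example addresses are assumptions made for illustration.

```python
import hashlib
import ipaddress


def pick_load_balancer(source_ip: str, load_balancers: list[str]) -> str:
    """Choose a load balancer deterministically from the packet's source IP."""
    if not load_balancers:
        raise ValueError("at least one load balancer is required")
    # Hash the binary form of the address so IPv4 and IPv6 both work.
    packed = ipaddress.ip_address(source_ip).packed
    digest = hashlib.sha256(packed).digest()
    # Reduce the hash to an index into the list of equal-cost next hops.
    index = int.from_bytes(digest[:8], "big") % len(load_balancers)
    return load_balancers[index]


if __name__ == "__main__":
    pool = ["lb-a", "lb-b", "lb-c"]  # hypothetical load balancer names
    for client in ("203.0.113.7", "198.51.100.42", "2001:db8::1"):
        print(client, "->", pick_load_balancer(client, pool))
```

Because the choice depends only on the packet header, the same client keeps reaching the same load balancer without the load balancers coordinating. With plain modulo hashing, changing the number of load balancers remaps many clients, which matters for established flows.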
@@ -315,6 +317,10 @@
 
 ![Hurricane Sandy Headline](hurricane_sandy_article.png)
 
+???
+
+In 2012, Hurricane Sandy took an AWS data center offline.
+
 ---
 
 class: center, middle
@@ -355,6 +361,13 @@
 
 We end up having to make a choice between performance and availability.
 
+
+???
+
+What is the performance problem with multiple A records?
+* The backend state would need to be replicated across data centers in near real-time.
+  - This is a difficult problem to solve.
+  - The CAP theorem says a distributed system can't guarantee consistency, availability, and partition tolerance all at once; keeping replicas consistent across data centers also costs performance.
 ---
 
 class: center, middle
@@ -368,9 +381,13 @@
 
 ---
 
-What if your service is failing due to excessive load?
+# Now that we have an architecture for high availability
+
+We have redundant systems in place to handle failures.
+
+> What if your service is failing due to excessive load?
 
-How can your service be highly available when under heavy load?
+> How can your service be highly available when under heavy load?
 
 --