Commit

Merge pull request #114 from scalableinternetservices/highAvailability
High availability lecture notes
zwalker authored Oct 10, 2024
2 parents 18da4ef + 751dbef commit ab503eb
Showing 3 changed files with 53 additions and 35 deletions.
2 changes: 1 addition & 1 deletion schedule.md
@@ -57,7 +57,7 @@ The following schedule is subject to change, and many slide links are not yet ac
### Topics

- Scaling Web Applications - [slides](/slides/2024f/05_scaling_web_applications/index.html)
- Architecting for High Availability
- Architecting for High Availability [slides](/slides/2024f/06_high_availability/index.html)

### Tasks

3 changes: 2 additions & 1 deletion slides/2024f/05_scaling_web_applications/index.html
@@ -574,7 +574,8 @@

What else could we do?

???
???

* Leave the source IP address as the client's IP address
* Let the server bypass the load balancer and send the response directly to the client
* This is called Direct Server Return (DSR) or Direct Routing (DR)
83 changes: 50 additions & 33 deletions slides/2024f/06_high_availability/index.html
@@ -28,6 +28,13 @@
Having access to these services (high availability) at any time is increasingly
important.

???

As users of these services, we expect them to be available at all times.
When they are not, it can be frustrating and even dangerous.
When we build these services, we need to think about how customers will depend on them
and what level of availability we need to provide.

---

# Expressing Availability
@@ -47,6 +54,14 @@

[Availability Table](https://sre.google/sre-book/availability-table/#tablea1)

???

What is considered downtime?
* Unplanned outages
* Are scheduled outages considered downtime? Usually not,
but we can architect our system to minimize downtime during
certain types of maintenance, like software updates.
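
To make availability targets concrete, the sketch below converts an availability percentage into a yearly downtime budget. It assumes availability is expressed as the percentage of time the service is up over a 365-day year; the function name `allowed_downtime_seconds` is hypothetical and chosen only for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap year


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Return the downtime budget (in seconds) for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    # The unavailable fraction of the period is the downtime budget.
    return period_seconds * (1 - availability_percent / 100)


# Print the yearly budget for a few common "nines" targets.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Each additional nine divides the downtime budget by ten, which is why each step up in availability is much harder to achieve than the last.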

---

# Measuring Availability SLIs/SLOs/SLAs
@@ -81,45 +96,15 @@
# Possible causes of failures. What if?

* a server process hangs?

--

* a server process dies?

--

* an application server fails?

--

* a load balancer fails?

--

* a switch fails?

--

* a router fails?

--

* a connection to the Internet fails?

--

* DNS fails?

--

* the Internet fails?

--

* a database fails?

--

* an entire data center fails?

---
@@ -148,6 +133,13 @@
dead) by having a pool of other servers to direct traffic to.
]

???

A load balanced system can handle a server failure by directing traffic to others.
However, if the system is already operating at capacity, the loss of a server could cause a cascading failure.
Part of high availability is ensuring that the system can handle the loss of a server without causing a failure.
To do this we need to monitor the load on the servers and have a method to add capacity when needed.
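
The headroom check described above can be sketched as follows. This is a simplified model, assuming every server has the same capacity and that load is spread evenly; the function name and the requests-per-second units are illustrative assumptions, not part of the lecture.

```python
def survives_server_loss(per_server_capacity_rps: float,
                         server_count: int,
                         current_load_rps: float,
                         failures: int = 1) -> bool:
    """Return True if the remaining servers can absorb the current load."""
    remaining = server_count - failures
    if remaining <= 0:
        return False  # no servers left to take the traffic
    # Capacity left in the pool once the failed servers are removed.
    return current_load_rps <= remaining * per_server_capacity_rps


# Four servers at 100 rps each, carrying 350 rps: losing one leaves 300 rps
# of capacity, so the survivors would be overloaded.
print(survives_server_loss(100, 4, 350))
# At 300 rps the pool can tolerate one failure.
print(survives_server_loss(100, 4, 300))
```

A monitoring system could run a check like this continuously and add capacity before the answer becomes `False`.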

---

class: center, middle
@@ -182,7 +174,6 @@

> During a failover, what happens to the IP address?


---

# Load Balancer Failover
@@ -200,6 +191,17 @@
Established flows can be supported depending on how much information sharing
occurs between the load balancers.

???

There are different ways to handle load balancer failure.
One way is to have two load balancers, one primary and one standby, that use a heartbeat to monitor each other's health.
If these are proxy load balancers (layer 4 or layer 7), they may share session information so that if one fails, the other can take over its sessions.
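
A minimal sketch of the heartbeat idea, assuming the standby records a timestamp each time it hears from the primary and takes over once no heartbeat has arrived within a timeout. The class name `HeartbeatMonitor`, its methods, and the timeout value are hypothetical; a real deployment would also need to move the service IP address and guard against both nodes becoming active at once.

```python
from typing import Optional


class HeartbeatMonitor:
    """Tracks when the peer load balancer was last heard from."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat: Optional[float] = None

    def record_heartbeat(self, now: float) -> None:
        # Called whenever a heartbeat message arrives from the peer.
        self.last_heartbeat = now

    def peer_is_alive(self, now: float) -> bool:
        # Assumption: a peer that has never been heard from counts as down.
        if self.last_heartbeat is None:
            return False
        return now - self.last_heartbeat <= self.timeout_seconds

    def should_take_over(self, now: float) -> bool:
        # The standby becomes active only when the peer looks dead.
        return not self.peer_is_alive(now)


monitor = HeartbeatMonitor(timeout_seconds=3.0)
monitor.record_heartbeat(now=10.0)
print(monitor.should_take_over(now=12.0))  # heartbeat is 2 s old
print(monitor.should_take_over(now=13.5))  # heartbeat is 3.5 s old
```

Timestamps are passed in explicitly so the logic can be exercised without waiting on a real clock.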

For a layer 4 packet re-writing load balancer, the load balancers may not need to share session information.
Another way is to have a router that implements ECMP (Equal Cost Multi-Path) routing to distribute traffic to multiple load balancers.
In this case, the load balancers do not need to communicate with each other.
The router distributes traffic to the load balancers based on a hash of packet header fields, such as the source IP address.
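
The hash-based distribution can be illustrated with a short sketch. It assumes the hash covers only the source IP address and uses SHA-256 as a stand-in; real routers compute their own hash in hardware, often over more header fields. The function name `pick_load_balancer` and the load balancer names are hypothetical.

```python
import hashlib
import ipaddress


def pick_load_balancer(source_ip: str, load_balancers: list[str]) -> str:
    """Map a source IP address to one load balancer, deterministically."""
    if not load_balancers:
        raise ValueError("at least one load balancer is required")
    # Normalize the address to its packed bytes (raises ValueError if invalid).
    packed = ipaddress.ip_address(source_ip).packed
    digest = hashlib.sha256(packed).digest()
    # Use the first 8 bytes of the digest as an integer hash value.
    index = int.from_bytes(digest[:8], "big") % len(load_balancers)
    return load_balancers[index]


balancers = ["lb-a", "lb-b", "lb-c"]  # hypothetical names
for ip in ("203.0.113.7", "203.0.113.8", "198.51.100.20"):
    print(ip, "->", pick_load_balancer(ip, balancers))
```

Because the mapping depends only on the packet header, packets from the same client keep reaching the same load balancer without any coordination between load balancers. With a simple modulo as used here, changing the number of load balancers remaps many clients, which is one reason established flows can break during a failure.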

---

class: center, middle
@@ -315,6 +317,10 @@

![Hurricane Sandy Headline](hurricane_sandy_article.png)

???

In 2012, a hurricane took an AWS data center offline.

---

class: center, middle
@@ -355,6 +361,13 @@

We end up having to make a choice between performance and availability.


???

What is the performance problem with multiple A records?
* The backend state would need to be replicated across data centers in near real-time.
  - This is a difficult problem to solve.
  - The CAP theorem says that during a network partition, a distributed system must give up either consistency or availability; keeping replicas consistent across data centers also adds latency.
---

class: center, middle
@@ -368,9 +381,13 @@

---

What if your service is failing due to excessive load?
# Now that we have an architecture for high availability

We have redundant systems in place to handle failures

> What if your service is failing due to excessive load?

How can your service be highly available when under heavy load?
> How can your service be highly available when under heavy load?

--

