rudimentary best practices for applications (#157) #180

105 changes: 97 additions & 8 deletions docs/xks/developer-guide/best-practices.md
@@ -3,18 +3,107 @@ id: best-practices
title: Best Practices
---

This page aims to collect best practices and common mistakes that can be made while using XKS.
This page aims to collect best practices and common mistakes that can be made while using XKF. It is meant as a quick summary for when you are short on time.

## Container Resources
If you follow these guidelines, your application will most likely run well in the general case.

## Probes
We segment our best practices into two parts. The first covers general philosophies you should adhere to when developing your applications.
The second is a checklist you should make sure you satisfy before bringing your application to production.

## Pod Scaling
## Development Philosophy in XKF

## Disruption Budgets
This segment provides a summary of things to consider while developing applications in XKF.

## Resources
- **Cattle not Pets**

Here are some good resources to also read on top of this page.
  In contemporary software systems it is common to say that you should treat the components of your system as cattle, not pets. In the case of Kubernetes this relates to the fact that a pod may be restarted at arbitrary times or a node can be rotated. To be cloud native means to accommodate this. This is discussed at length in other places in our documentation, but it is worthwhile to always ask: _What happens in my application if the pod is restarted?_

* [https://srcco.de/posts/web-service-on-kubernetes-production-checklist-2019.html](https://srcco.de/posts/web-service-on-kubernetes-production-checklist-2019.html)
- **Run more than one replica in all environments**

  Unless you have a special case, run your application with more than one replica, even in non-production environments. Apart from the obvious gain in availability, this helps you catch concurrency bugs earlier. For example, if your application writes to a database and you want high availability in production, configure high availability in your non-production environments as well. If your application has a bug, e.g. faulty transactional logic, it is better to expose it in non-production environments too.

- **Crash on unrecoverable errors**

  Since it is easy to create new pods in Kubernetes, it is better to crash and start again from a well-defined state when your application encounters an error it cannot recover from.

# Production Readiness checklist

Please consider the following items before bringing your application to production. There are a multitude of situations where items from this checklist do not make sense, but we strongly recommend that you at least consider every item in order to have applications that are as stable and well-behaved as possible.

## Read language specific docs

Xenit provides some language-specific documentation, e.g. [Xenit's Golang style guide](https://xenitab.github.io/docs/xenit-style-guide/golang) or [Xenit's JavaScript/TypeScript style guide](https://xenitab.github.io/docs/xenit-style-guide/javascript). These are summaries of experiences we have gathered while running production applications in XKF.

## Readiness/Liveness probes

Good probes are important if you want stable applications. They help with error reporting, but in our experience they matter most when you deploy new versions.

In short:

1. Liveness probes are too powerful and in most situations you do not want them.
2. You most likely want a basic readiness probe. This readiness probe should probably not involve other applications, as this can lead to thundering herd problems. An endpoint on your application's HTTP server that answers **200** on **/healthz** is a very good start.

Your mileage may vary, and only you can know what it means for your application to be ready to receive traffic. Consider reading [liveness probes are dangerous](https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html), as it provides a more nuanced discussion of when and where you want to use liveness/readiness probes.

## Incoming HTTP

TODO

## Outgoing HTTP

TODO

## Graceful Shutdown

Your pod can be shut down by Kubernetes. Make sure you capture SIGTERM and act reasonably on the signal. What is reasonable depends on your application, but a common case is to finish handling all ongoing HTTP requests and to cleanly close connections to databases and external services.

## Observability and Telemetry

Make sure you have set up sufficient observability tools for your application. What this means depends on context. At Xenit we have found that the following provides a good start.

### Logs

1. Log the correct amount. This usually means one info log per "thing that happened". If you log too much you will never find the log you are looking for.
2. Always add context to logs. A log should always be related to the entity that was affected. If you don't add this information the log is just noise and will never help you find issues in the future.
3. Assume things work and mostly log errors. We have found that most info logs are just a poor man's implementation of metrics and traces. The ideal log is the log that immediately tells you what is wrong with the system. If you can get away with not logging something, consider doing so.
4. Consider disabling HTTP request logging for non-error HTTP requests. We have found that it is common to add a general HTTP log on top of every application. If you have a lot of traffic this will mean a lot of logs. How useful is it really to log that someone made a successful _GET_ request to a certain endpoint?

### Metrics

Metrics are also a nuanced topic that depends on your context. However, it is usually a good idea to add the so-called **RED** metrics to your application. **RED** stands for _Rate_, _Errors_, and _Duration_. We have found that measuring the rate of incoming requests, the rate at which they fail, and the time it takes to handle them provides a very good baseline for knowing how your applications are doing.

Consider reading our [extended documentation on metrics](https://xenitab.github.io/docs/xks/developer-guide/observability).

### Traces

Add tracing to your application. We have found that modern trace tools provide pretty good configuration out of the box. You just need to add an appropriate tracing library to your application. Consider reading our [extended documentation on tracing](https://xenitab.github.io/docs/xks/developer-guide/observability).

## Pod disruption budgets
Contributor

Maybe change this to availability or something similar. I think this PDB documentation is too in-depth for an overview page.

I think you should mention PDBs as part of getting good availability in k8s, together with things like anti-affinity, and link to our other docs for more details.


Make sure you have written a pod disruption budget. Otherwise you will have problems with downtime when a node rotates. This is extensively documented [here](https://xenitab.github.io/docs/xks/developer-guide/scheduling-scaling#pod-disruption-budget).

## Network Policies

Configure network policies. No network policy is needed if your application only communicates inside its own namespace. However, if you want network traffic across namespaces, you need to configure network policies. This is documented extensively [here](https://xenitab.github.io/docs/xks/developer-guide/networking).

## Resources and Scaling

Make sure you have written reasonable resource requests and limits. It is extensively documented [here](https://xenitab.github.io/docs/xks/developer-guide/scheduling-scaling#pod-resources).

## Secret Management & External Resources

If you communicate with things outside of your namespace, e.g. databases, make sure you have checked the following.

1. Make sure no secrets are committed to either the source code repository or the GitOps repository.
2. We recommend using MSI (Managed Service Identity) to provide an identity for your pods. Documentation can be found [here](https://xenitab.github.io/docs/xks/developer-guide/cloud-iam). It's okay to get secrets from cloud provider key vault solutions as well.
3. How you load secrets is very specific to your own application. However, we have used and documented the Secrets Store CSI Driver. Documentation can be found [here](https://xenitab.github.io/docs/xks/developer-guide/secrets-management).

## Documentation

Is everything documented to a sufficient level? If the whole team quit tomorrow, could someone take over the application with as little friction as possible?

## Further reading

This documentation provides the bare minimum for making a production-ready Kubernetes application. Consider reading the following for further understanding:

1. [Production checklist](https://srcco.de/posts/web-service-on-kubernetes-production-checklist-2019.html)