From c329fa5f4f3f089ab32c6b12bc3830b2f4a8d097 Mon Sep 17 00:00:00 2001
From: Christoffer Olsson
Date: Thu, 15 Dec 2022 14:46:57 +0100
Subject: [PATCH 1/4] rudimentary best practices for applications (#157)

---
 docs/xks/developer-guide/best-practices.md | 123 +++++++++++++++++++--
 1 file changed, 115 insertions(+), 8 deletions(-)

diff --git a/docs/xks/developer-guide/best-practices.md b/docs/xks/developer-guide/best-practices.md
index 652bb927650..d0f74a46f53 100644
--- a/docs/xks/developer-guide/best-practices.md
+++ b/docs/xks/developer-guide/best-practices.md
@@ -3,18 +3,125 @@ id: best-practices
 title: Best Practices
 ---
 
-This page aims to collect best practices and common mistakes that can be made while using XKS.
+This page aims to collect best practices and common mistakes that can be made while using XKF. It's meant to be used as the quick summary when you are lacking in time.
 
-## Container Resources
+If you follow these guidlines, you will most likely have a well running application in the general case.
 
-## Probes
+We segment our best practices into two parts. The first regards general philosophies you should adhere to when developing your applications.
+The second is a checklist you should check make sure you satisfy before bringing your application to production.
 
-## Pod Scaling
+# Development Philosophy in XKF
 
-## Disruption Budgets
+This segment provides a summay of things one should consider while developing applications in XKF.
 
-## Resources
+- **Cattle not Pets**
 
-Here are some good resources to also read on top of this page.
+  In contemporary software systems it is common to say that you should treat components of your systems as cattle, not pets. In the case of Kubernetes this relates to the fact that a pod may be restarted at arbitrary times or a node can be rotated. To be cloud native means to accomodate for this. This is discussed at length in other places in our documentation, but it is worthwhile to always think. _What happens in my application if the pod is restarted?_
 
-* [https://srcco.de/posts/web-service-on-kubernetes-production-checklist-2019.html](https://srcco.de/posts/web-service-on-kubernetes-production-checklist-2019.html)
+- **Run more than one replica in all environments**
+
+  Unless you have a special case you should run your application as more than one replica, even in non production environments. Apart from the more obvious fact that you get higher availability, it will help you catch bugs related to concurrency earlier. For example, if your application writes to a database and you want high availability in your production environment, add high availability to your non production environments as well. If you have a bug, e.g. some bad transactional logic in your application it is better to expose this bug in non production environments as well.
+
+- **Crash on unrecoverable errors**
+
+  Since it is easy to create new pods in Kubernetes it is better to just crash and start from a well defined state if your application encounters errors that are not recoverable.
+
+# Production Readiness checklist
+
+Please consider the following items before bringing your application to production. There exist a multitute of situations where items from this checklist does not make sense, but we strongly recommend that you at least consider every item in order to have as stable and well behaving applications as possible.
+
+## Read language specific docs
+
+Xenit provides some language specific documentation, e.g. [Xenit's Golang style guide](https://xenitab.github.io/docs/xenit-style-guide/golang) or [Xenit's Javascript/Typescript style guide](https://xenitab.github.io/docs/xenit-style-guide/javascript). These are summaries of experiences we have gathered while running production applications in XKF.
+
+## Readiness/Liveness probes
+
+Good probes are important if you want stable applications. They help with error reporting, but in our experience they are the most important when you deploy new versions.
+
+The short summary follows,
+
+1. Liveness probes are too powerful and in most situations you do not want them.
+2. You most likely want a basic readiness probe. This readiness probe should probably not involve other applications as this can lead to thundering herd problems. An endpoint that answers with **200** on **/healthz** on your http server for your application is a very good start.
+
+Your mileage may vary and only you can know what it means for your application to be ready to receive traffic. Consider reading this https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html as it provides a more nuanced discussion when and where you want to use liveness/readiness probes.
+
+## Incoming HTTP
+
+TODO
+
+## Outgoing HTTP
+
+TODO
+
+## Database
+
+Your pod can be shut down by kubernetes. Make sure the following has been configured.
+
+1. Your application has graceful shutdown that closes connection to database.
+
+## Observability and Telemetry
+
+Make sure you have set up sufficient observability tools for your application. What this means depends on context. At Xenit we have found that the following provides a good start.
+
+### Logs
+
+1. Log the correct amount. This usually means one info log per "thing that happened". If you add too much you will never find the log you are looking for.
+2. Always add context to logs. A log should always be related to the entity that was affected. If you don't add this information the log is just noise and will never help you find issues in the future.
+3. Assume things work and mostly log errors. We have found that most info logs are just a poor man's implementation of metrics and traces. The ideal log is the log that immediately tells you what is wrong with the system. If you can get away with not logging something, consider doing so.
+4. Consider disabling HTTP request logging for non error http requests. We have found that it is common to add a general http log on top of every application. If you have a lot of traffic this will mean a lot of logs. How useful is it really to log that someone made a succesful _GET_ request to a certain endpoint?
+
+### Metrics
+
+Metrics is also a nuanced topic that depends on your context. However, it is usually a good idea to add the so called **RED** metrics to your application. **RED** stand for _Rate_, _Errors_, and _Duration_. We have found that measuring the rate of incoming requests, the rate of them erroring and the time it takes to execute them provides a very good baseline for knowing how your applications are doing.
+
+Consider reading our extended [extended documentation on metrics](https://xenitab.github.io/docs/xks/developer-guide/observability).
+
+### Traces
+
+Add tracing to your application. We have found that modern trace tools provide pretty good configuration out of the box. You just need to add an appropriate tracing library to your application. Consider reading our extended [extended documentation on tracing](https://xenitab.github.io/docs/xks/developer-guide/observability).
+
+## Pod disruption budgets
+
+Make sure you have written a pod disruption budget. This is extensively documented [here](https://xenitab.github.io/docs/xks/developer-guide/scheduling-scaling#pod-disruption-budget).
+A basic one can look as follows
+
+```
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: podinfo
+spec:
+  minAvailable: 2
+  selector:
+    matchLabels:
+      app: podinfo
+```
+
+To check if a budget exists for your application the you can use the following,
+
+```
+kubectl get PodDisruptionBudget
+```
+
+## Network Policies
+
+Configure network policies. No network policy is needed if your application only communicates inside namespace. However, if you want network traffic across namespaces, you need to configure network policies. It is documented extensively [here](https://xenitab.github.io/docs/xks/developer-guide/networking).
+
+## Resources and Scaling
+
+Make sure you have written reasonable resource requests and limits. It is extensively documented [here](https://xenitab.github.io/docs/xks/developer-guide/scheduling-scaling#pod-resources).
+
+## Secret Management
+
+1. Make sure no secrets are commited to either source code repository, nor gitops repository.
+2. Load secrets using Secret Store CSI Driver. It is documented extensively [here](https://xenitab.github.io/docs/xks/developer-guide/secrets-management).
+
+## Documentation
+
+Is everything documented to a sufficient level? If the whole team would quit tomorrow, could someone take over the application with as little friction as possible?
+
+## Further reading
+
+This documentation provides bare minimum for making a production ready kubernetes application. Consider reading the following for further understanding,
+
+1. https://srcco.de/posts/web-service-on-kubernetes-production-checklist-2019.html

From c0c39852053870e592a4cd8718a11c71ff79d7af Mon Sep 17 00:00:00 2001
From: Christoffer Olsson
Date: Wed, 18 Jan 2023 11:41:20 +0100
Subject: [PATCH 2/4] rudimentary best practices for applications after review

---
 docs/xks/developer-guide/best-practices.md | 38 ++++++----------------
 1 file changed, 10 insertions(+), 28 deletions(-)

diff --git a/docs/xks/developer-guide/best-practices.md b/docs/xks/developer-guide/best-practices.md
index d0f74a46f53..f5815d6fece 100644
--- a/docs/xks/developer-guide/best-practices.md
+++ b/docs/xks/developer-guide/best-practices.md
@@ -16,7 +16,7 @@ This segment provides a summay of things one should consider while developing ap
 
 - **Cattle not Pets**
 
-  In contemporary software systems it is common to say that you should treat components of your systems as cattle, not pets. In the case of Kubernetes this relates to the fact that a pod may be restarted at arbitrary times or a node can be rotated. To be cloud native means to accomodate for this. This is discussed at length in other places in our documentation, but it is worthwhile to always think. _What happens in my application if the pod is restarted?_
+  In contemporary software systems it is common to say that you should treat components of your systems as cattle, not pets. In the case of Kubernetes this relates to the fact that a pod may be restarted at arbitrary times or a node can be rotated. To be cloud native means to accomodate for this. This is discussed at length in other places in our documentation, but it is worthwhile to always think: _What happens in my application if the pod is restarted?_
 
 - **Run more than one replica in all environments**
 
@@ -36,7 +36,7 @@ Xenit provides some language specific documentation, e.g. [Xenit's Golang style
 
 ## Readiness/Liveness probes
 
-Good probes are important if you want stable applications. They help with error reporting, but in our experience they are the most important when you deploy new versions.
+Good probes are important if you want stable applications. They help with error reporting, but in our experience, they are the most important when you deploy new versions.
 
 The short summary follows,
 
@@ -53,11 +53,9 @@ TODO
 
 TODO
 
-## Database
+## Graceful Shutdown
 
-Your pod can be shut down by kubernetes. Make sure the following has been configured.
-
-1. Your application has graceful shutdown that closes connection to database.
+Your pod can be shut down by kubernetes. Make sure you capture sigterm and act reasonably on the signal. What reasonable is depends on you application, but a common case is to finish handling all ongoing http-requests as well as closing connections cleanly to databases and external services.
 
 ## Observability and Telemetry
 
@@ -82,26 +80,7 @@ Add tracing to your application. We have found that modern trace tools provide p
 
 ## Pod disruption budgets
 
-Make sure you have written a pod disruption budget. This is extensively documented [here](https://xenitab.github.io/docs/xks/developer-guide/scheduling-scaling#pod-disruption-budget).
-A basic one can look as follows
-
-```
-apiVersion: policy/v1
-kind: PodDisruptionBudget
-metadata:
-  name: podinfo
-spec:
-  minAvailable: 2
-  selector:
-    matchLabels:
-      app: podinfo
-```
-
-To check if a budget exists for your application the you can use the following,
-
-```
-kubectl get PodDisruptionBudget
-```
+Make sure you have written a pod disruption budget. Otherwise you will have problems with downtime when a node rotates. This is extensively documented [here](https://xenitab.github.io/docs/xks/developer-guide/scheduling-scaling#pod-disruption-budget).
 
 ## Network Policies
 
@@ -111,10 +90,13 @@ Configure network policies. No network policy is needed if your application only
 
 Make sure you have written reasonable resource requests and limits. It is extensively documented [here](https://xenitab.github.io/docs/xks/developer-guide/scheduling-scaling#pod-resources).
 
-## Secret Management
+## Secret Management & External Resources
+
+If you communicate with things outside of your namespace, i.e. databases and such make you have checked the following.
 
 1. Make sure no secrets are commited to either source code repository, nor gitops repository.
-2. Load secrets using Secret Store CSI Driver. It is documented extensively [here](https://xenitab.github.io/docs/xks/developer-guide/secrets-management).
+2. Use MSI (Managed Service Identity) to provide an identity for your pods. Documentation can be found [here](https://xenitab.github.io/docs/xks/developer-guide/cloud-iam).
+3. How you load secrets is very specific for your own application. However, we have used and documented Secret Store CSI Driver. Documentation can be found [here](https://xenitab.github.io/docs/xks/developer-guide/secrets-management).
 
 ## Documentation
 

From 442cb4b4502e8c3726c6a4674ca1181bf0c27b7e Mon Sep 17 00:00:00 2001
From: Christoffer Olsson
Date: Thu, 9 Feb 2023 11:16:28 +0100
Subject: [PATCH 3/4] Clarify that it is ok to get secrets from Cloud Provider Key Vault

---
 docs/xks/developer-guide/best-practices.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/xks/developer-guide/best-practices.md b/docs/xks/developer-guide/best-practices.md
index f5815d6fece..602b4016463 100644
--- a/docs/xks/developer-guide/best-practices.md
+++ b/docs/xks/developer-guide/best-practices.md
@@ -95,7 +95,7 @@ Make sure you have written reasonable resource requests and limits. It is extens
 If you communicate with things outside of your namespace, i.e. databases and such make you have checked the following.
 
 1. Make sure no secrets are commited to either source code repository, nor gitops repository.
-2. Use MSI (Managed Service Identity) to provide an identity for your pods. Documentation can be found [here](https://xenitab.github.io/docs/xks/developer-guide/cloud-iam).
+2. We recommend using MSI (Managed Service Identity) to provide an identity for your pods. Documentation can be found [here](https://xenitab.github.io/docs/xks/developer-guide/cloud-iam). It's okay to get secrets from Cloud provider key vault solutions as well.
 3. How you load secrets is very specific for your own application. However, we have used and documented Secret Store CSI Driver. Documentation can be found [here](https://xenitab.github.io/docs/xks/developer-guide/secrets-management).
 
 ## Documentation

From 533746c23eb908620177055fc81777ada1d198d0 Mon Sep 17 00:00:00 2001
From: Christoffer Olsson
Date: Thu, 9 Feb 2023 11:21:15 +0100
Subject: [PATCH 4/4] fix some lint errors

---
 docs/xks/developer-guide/best-practices.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/xks/developer-guide/best-practices.md b/docs/xks/developer-guide/best-practices.md
index 602b4016463..ae510a8e230 100644
--- a/docs/xks/developer-guide/best-practices.md
+++ b/docs/xks/developer-guide/best-practices.md
@@ -10,7 +10,7 @@ If you follow these guidlines, you will most likely have a well running applicat
 We segment our best practices into two parts. The first regards general philosophies you should adhere to when developing your applications.
 The second is a checklist you should check make sure you satisfy before bringing your application to production.
 
-# Development Philosophy in XKF
+## Development Philosophy in XKF
 
 This segment provides a summay of things one should consider while developing applications in XKF.
 
@@ -43,7 +43,7 @@ The short summary follows,
 1. Liveness probes are too powerful and in most situations you do not want them.
 2. You most likely want a basic readiness probe. This readiness probe should probably not involve other applications as this can lead to thundering herd problems. An endpoint that answers with **200** on **/healthz** on your http server for your application is a very good start.
 
-Your mileage may vary and only you can know what it means for your application to be ready to receive traffic. Consider reading this https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html as it provides a more nuanced discussion when and where you want to use liveness/readiness probes.
+Your mileage may vary and only you can know what it means for your application to be ready to receive traffic. Consider reading [livenes probes are dangerous](https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html) as it provides a more nuanced discussion when and where you want to use liveness/readiness probes.
 
 ## Incoming HTTP
 
@@ -106,4 +106,4 @@ Is everything documented to a sufficient level? If the whole team would quit tom
 
 This documentation provides bare minimum for making a production ready kubernetes application. Consider reading the following for further understanding,
 
-1. https://srcco.de/posts/web-service-on-kubernetes-production-checklist-2019.html
+1. [Production checklist](https://srcco.de/posts/web-service-on-kubernetes-production-checklist-2019.html)