# Keycloak.X Operator

## Motivation for a new operator

The current operator, written in Go, has served us and the community well so far, but increasing challenges are paving the road for a complete rewrite:

* The codebase is hard to maintain because of organic growth and accumulated technical debt.
* The Keycloak community is more keen on Java, and there is less Go expertise available.
* The current project needs some high-cost maintenance tasks to be performed in order to use the latest features (e.g. webhooks) and receive the latest fixes and patches, specifically:
  * upgrading the Go version from 1.13 to 1.17
  * upgrading the Operator SDK and the dependencies

  Those upgrades would require creating a completely new project, using different libraries, and moving, or in some cases rewriting, the components, e.g. the whole testsuite.
* The current approach around CRDs no longer fits the long-term vision for cloud-native Keycloak, as it is very error-prone and fragile.
* A Java operator can share business objects with the main Keycloak codebase, increasing code reuse and dramatically reducing the chances of introducing bugs in the translation process from Kubernetes resources.
* A unified codebase will make it easier to test and debug the entire system.
* A new operator will embrace, from the ground up, the new cloud-native capabilities of upcoming Keycloak releases, such as the Quarkus distribution and Store.X, making them first-class citizens and improving the overall user experience.

## Features

---
**NOTE**

The primary target of the operator is to make it easy to achieve production-grade installations of Keycloak.X.

---

### Configuring Keycloak deployment

The operator will use a CRD representing the Keycloak installation. The CRD will expose the following configuration options (see the illustrative CR sketch at the end of this section):
* custom Keycloak image; a default base image will be used if none is specified

  > **Review discussion:**
  > * I'm not sure this is something that makes sense for this operator (unless it's somehow "required" by OLM best practices, which AFAIK it is not). With descoped/cluster-wide operators (which imply multi-tenancy) I believe it makes sense for the custom image to be primarily and explicitly set on the CR.
  > * You mentioned multi-tenancy. What is our main deployment model?
  > * @pedroigor By deployment do you mean operator or operand deployment? For the operator, it'll be up to the user. Initially we'll probably support only namespaced mode (= a single deployment per namespace) while we wait for a proper descoped implementation in OLM.

* Git configuration to fetch static business objects configuration (see below)
* manual horizontal and vertical scaling
* pod and node affinity, tolerations
* SSL config, truststore

  > **Review discussion:**
  > * Just FYI - we're working on productizing Cert Manager (see https://github.com/openshift/cert-manager-operator). It is very likely that this will be the future tool for managing certificates in OpenShift. I think it would be good to have first-class support for it.
  > * Thanks! Will look into it. However, we'd also like to stay vanilla-Kubernetes friendly, so this would have to be an additional feature.
  > * From the distribution side, the operator should be able to provide TLS key material by several means; the same goes for the truststore. I'm not sure what K8s/OpenShift provides in this regard, but the TLS configuration options were designed to make it easier to leverage existing keystores or to read key material from a file, possibly mounted into the pod. For keystores, the distribution makes this really easy to set up. But for PEM files we don't have anything like that, so perhaps we could have some default file names for both the cert and its corresponding key PEM files.
  > * We support different proxy modes, some of which should not require TLS (e.g. edge). But I guess we always want a secure channel between Keycloak and the ingress/route, even if running within the cluster network. Perhaps this requirement should be more explicit? In addition to that, do we want to enable HTTP? Is this somehow needed by Grafana (or does it have a complex setup) or by any other client/service that needs to communicate with Keycloak? I'm also missing options to manage some key ingress/route settings; these are very common settings you usually want to change and control.
  > * We'll add more details around SSL. The current plan is that the operator will provide only a default Ingress setting. If you want to control and configure the Ingress, you can do it directly. You can even deploy a custom Ingress.

Since most of the configuration options will come from the Keycloak.X distribution itself, the CRD will also expose appropriate fields for passing any distribution-related options to the container, like database connection details (obviously without any credentials), SPI configuration, etc.

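To make the shape of this more concrete, a `Keycloak` CR combining the options above might look roughly like the sketch below. The CRD is not finalized; the API group, version and every field name here are assumptions for illustration only.

```yaml
# Hypothetical sketch of a Keycloak CR; the actual CRD schema is not finalized
# and every field name below is an assumption, not the final API.
apiVersion: keycloak.org/v2alpha1
kind: Keycloak
metadata:
  name: example-keycloak
spec:
  image: quay.io/example/custom-keycloak:latest   # optional; default base image used if omitted
  instances: 3                                    # manual horizontal scaling
  resources:                                      # manual vertical scaling
    requests:
      cpu: "500m"
      memory: "1Gi"
  nodeSelector:
    disktype: ssd
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "keycloak"
      effect: "NoSchedule"
  tlsSecret: example-tls-secret                   # SSL config / truststore
  serverConfiguration:                            # pass-through distribution options
    - name: db
      value: postgres
    - name: db-url-host
      value: postgres-db
```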
### Configuring Keycloak business objects using Kubernetes resources

The new operator will expose two main ways of configuring business objects in Keycloak (e.g. Realms, Roles, Clients, etc.), in addition to the built-in dynamic configuration through the REST API/UI console:
* static file-based configuration stored in Git to enable GitOps workflows
* static configuration through the Kubernetes API

  > **Review discussion:**
  > * What does it mean?
  > * See the chapter below. ;)

Static configuration will be strictly read-only, therefore two-way syncing is not going to be needed.

Static configuration is going to provide an immutable and cloud-native way of managing business objects in Keycloak that can easily be moved between environments (e.g. dev, stage, prod) in a predictable manner. This feature will leverage the new Store.X, which enables federation of the configuration from multiple sources (static/dynamic) by re-architecting the storage layer.

> **Review discussion:**
> * Can you add a link to the appropriate Store.X capabilities here? I think this is rather vague and led me to "I want to know more".

#### Static configuration through Git

The `Keycloak` CRD will enable defining a specific commit (identified by a hash) in a Git repository containing the static configuration in the form of JSON/YAML files. To update the configuration, the user will simply change the commit hash in the `Keycloak` CR and the operator will roll out the new configuration to all pods.

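The exact fields are an implementation detail that is still open; purely as a sketch (field names assumed, not final), the Git-related part of a `Keycloak` CR could look like:

```yaml
# Hypothetical sketch only; field names and structure are assumptions.
apiVersion: keycloak.org/v2alpha1
kind: Keycloak
metadata:
  name: example-keycloak
spec:
  staticConfiguration:
    git:
      repository: https://github.com/example-org/keycloak-config.git
      commit: 3f2a9c1d8e7b6a5f4c3d2e1b0a9f8e7d6c5b4a39   # pin a specific commit
      # optional authentication for private repositories
      credentialsSecret: git-credentials
```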
> **Review discussion:**
> * It would be nice to see an example config.
> * Pretty much work in progress, but this is an anticipation:
> * We don't have the CRD fully defined yet and I believe this is an implementation detail. But it really should be simple. I can imagine a simple field in the CRD containing the repo address and commit hash, plus a few optional fields for authentication.
> * Yes, it is an implementation detail, but at least for me it helps a lot to wrap my head around the concept to see an example, as unfinished as it is :)

#### Static configuration through Kubernetes API

The operator will leverage dedicated CRD(s). Initially, there will be only one `Realm` CRD, directly translated from Keycloak's [RealmRepresentation](https://github.com/keycloak/keycloak/blob/c7134fd5390d7c650b3dfd4bd2a2855157042271/core/src/main/java/org/keycloak/representations/idm/RealmRepresentation.java). A Realm includes all subresources. As a result, it is going to be possible to configure every object in Keycloak through this CR, even though for some of them (e.g. Users) it won't be recommended. To implement this, the operator will simply translate the CRs to YAML files and mount them to the Keycloak pods, again leveraging Store.X.

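As an illustration of what such a translation could look like, here is a hypothetical `Realm` CR mirroring a small subset of `RealmRepresentation`; the API group, version and exact schema are assumptions:

```yaml
# Hypothetical sketch of a Realm CR; field names mirror RealmRepresentation,
# but the actual CRD schema and API group are not finalized.
apiVersion: keycloak.org/v2alpha1
kind: Realm
metadata:
  name: example-realm
spec:
  realm: example
  enabled: true
  displayName: "Example Realm"
  clients:
    - clientId: example-app
      publicClient: true
      redirectUris:
        - "https://app.example.com/*"
  roles:
    realm:
      - name: admin
      - name: user
```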
> **Review discussion:**
> * It is quite weird that we have the Realm CRD but don't recommend configuring some key aspects like users. In practice, you are favoring managing users (and some others) through the admin console, right? IMO, we should provision realms and then rely on the administration console. However, I see some space for improvement to enable some key capabilities/configuration when provisioning realms. We would have some realm templates that users can choose from to provision the realm. Managing realms (fully) using CRs does not look user-friendly to me.
> * This is a great idea!
> * +1, I really like the idea of templates.
> * The templates functionality will basically have to come from Store.X; the operator won't be actively involved in this. To my understanding, it will indeed work as you're suggesting: a "template" will be the static config (stored either in Git or as a CR), while some config will be dynamic (modifiable through the REST API / Admin Console) thanks to the aggregation of different store types. Correct @hmlnarik @mhajas?
> * Realm templates make little sense to me. This should rather be addressed by improving the RealmRepresentation (the REST API, the static store, CRs, etc. should all share the same "representation") and by making it easier to configure things, rather than relying on templates/examples.
> * Is it intended that the link points to a specific commit rather than `main`?
> * It was current main at the time of writing. ;) It's just my habit to keep it super future-proof. I can change it to main if you really believe it makes more sense; I don't have a strong opinion about it.
> * +1 for keeping the permalink.
> * np :)

It will be possible to store credentials as Kubernetes Secrets, leveraging the [Keycloak Vault functionality](https://www.keycloak.org/docs/latest/server_admin/index.html#_vault-administration) where possible.

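As a rough illustration of the idea (the wiring is not defined by this proposal, and the Secret and key names below are made up), a credential could live in a standard Kubernetes Secret and be referenced from realm configuration via Keycloak's `${vault.<key>}` expression rather than being embedded in it:

```yaml
# Sketch of the idea, not operator-defined behaviour: a plain Kubernetes Secret
# holds the sensitive value and, assuming the operator exposes it to a Keycloak
# vault provider, realm configuration refers to it indirectly via ${vault.<key>}.
apiVersion: v1
kind: Secret
metadata:
  name: keycloak-vault-secrets
type: Opaque
stringData:
  example_ldap_credential: "s3cr3t-bind-password"   # referenced as ${vault.example_ldap_credential}
```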
It is the purpose of the upcoming Store.X initiative to provide a full-fledged static configuration backend for Keycloak, but there will be a mid-term preview that enables bulk imports at startup time leveraging the REST API.

### Keycloak versions alignment

The operator will have its release cycle aligned with Keycloak. Each operator version will support only one corresponding Keycloak version.

> **Review discussion:**
> * Surely you'll need to support at least two Keycloak versions per operator version to span the upgrade process?

### Upgrades

> **Review discussion:**
> * What about downgrades and failure scenarios? Just to give an example: you upgrade from Keycloak X to Y and Keycloak Y crashes with your setup. Now you need to revert. How do you achieve that?
> * As the operator version is tied to the Keycloak version, there's no special treatment for downgrades. An operator downgrade would need to be done through OLM. But it should not be a problem thanks to rolling upgrades. I added a few more details on that.
> * I would like to see some more detail on how upgrades will be performed, how canary-style upgrades will be achieved, and how rollback will be achieved.

In order to upgrade to a newer Keycloak version, the operator will be upgraded first to ensure full compatibility with the operand.

If a custom Keycloak image is not used, the operator will use a default base image. After the operator is upgraded, it automatically upgrades Keycloak too, using the newer base image.

In case a custom Keycloak image is used, the image will need to be rebuilt to perform the upgrade. This is not going to be the operator's responsibility, as building a custom image often requires a complex process incl. CI/CD pipelines. After the operator is upgraded, it won't manage any existing Keycloak instances until their custom image is manually rebuilt using the right Keycloak base image (aligned with the operator) and the image coordinates are updated in the CR.

Store.X will optionally allow zero-downtime rolling upgrades (a Keycloak upgrade performed pod by pod) that will ensure the Keycloak cluster remains available even when the upgrade fails on one of the pods.
A "Recreate" upgrade strategy (all pods gracefully shut down and re-created with the new image) will also be available.

### Reaugmentation process in Kubernetes

We will be leveraging Kubernetes volumes to act as "caches" for the [augmented/configured](https://quarkus.io/guides/reaugmentation) version of Keycloak.
An initial POC demonstrating the concept has been drafted here:
https://github.com/andreaTP/poc-mutable-jar

The artifacts in the Kubernetes volume will be produced by an init container, and the operation will result in a no-op in case the volume has already been populated by a compatible augmentation.

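A rough sketch of the caching idea follows (this is not the POC itself); the image, paths, marker file and `kc.sh build` invocation are assumptions used for illustration:

```yaml
# Rough sketch: an init container runs the Quarkus augmentation and stores the
# result on a shared volume; if a compatible augmentation is already present,
# the step becomes a no-op.
apiVersion: v1
kind: Pod
metadata:
  name: keycloak-example
spec:
  volumes:
    - name: augmentation-cache
      persistentVolumeClaim:
        claimName: keycloak-augmentation-cache
  initContainers:
    - name: augment
      image: quay.io/keycloak/keycloak-x:latest     # illustrative image
      command:
        - /bin/sh
        - -c
        - |
          # no-op if the volume already holds a compatible augmentation
          if [ ! -f /cache/augmented.ok ]; then
            /opt/keycloak/bin/kc.sh build
            cp -r /opt/keycloak /cache/keycloak
            touch /cache/augmented.ok
          fi
      volumeMounts:
        - name: augmentation-cache
          mountPath: /cache
  containers:
    - name: keycloak
      image: quay.io/keycloak/keycloak-x:latest
      volumeMounts:
        - name: augmentation-cache
          mountPath: /cache
```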
> **Review discussion:**
> * What happens when the reaugmentation fails for some reason? Is there a retry mechanism, messages, etc.?
> * I can imagine a "reaugmentation fault" being the same as "pod won't start", i.e. the readiness probes will be failing.

### Connecting to a database

A Postgres DB instance will have to be provisioned externally; it's not the Keycloak Operator's responsibility to manage a database. The DB credentials will be fetched from Kubernetes Secrets.

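As a sketch of how this could look (the CR field names are assumptions; only the Secret itself is standard Kubernetes):

```yaml
# Hypothetical sketch: DB credentials live in a standard Secret and the CR only
# references it; the CR field names below are assumptions, not a finalized API.
apiVersion: v1
kind: Secret
metadata:
  name: keycloak-db-secret
type: Opaque
stringData:
  username: keycloak
  password: change-me
---
apiVersion: keycloak.org/v2alpha1
kind: Keycloak
metadata:
  name: example-keycloak
spec:
  database:
    vendor: postgres
    host: postgres-db
    port: 5432
    credentialsSecret: keycloak-db-secret   # credentials are never stored in the CR
```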
> **Review discussion:**
> * Is MySQL not going to be supported any more?

In the long term, we'll add a limited integration with a Postgres Operator to leverage its backup functionality for Keycloak upgrades.

### Observability

The operator will provide CR metrics, as well as integration with Prometheus, Grafana and Alertmanager for both operator and operand metrics. This will be addressed in an upcoming design proposal.

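The concrete integration is deferred to that proposal; purely to illustrate the kind of integration meant here, scraping the operand with the Prometheus Operator could look like the following, where the Service labels and port name are assumptions:

```yaml
# Illustration only; the concrete metrics endpoints and labels are not defined yet.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: keycloak-metrics
spec:
  selector:
    matchLabels:
      app: keycloak          # assumed label on the Keycloak Service
  endpoints:
    - port: http             # assumed port name exposing /metrics
      path: /metrics
```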
### Ingresses

The operator will provide an out-of-the-box experience using an opinionated default Ingress (Route on OpenShift) configuration. This configuration will support further "manual" modification. Additionally, it will be possible to completely disable this feature.

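For example, opting out of the default Ingress could be a simple switch in the CR; the field name below is an assumption:

```yaml
# Hypothetical sketch: disabling the operator-managed default Ingress so a
# custom one can be deployed instead; the field name is an assumption.
apiVersion: keycloak.org/v2alpha1
kind: Keycloak
metadata:
  name: example-keycloak
spec:
  ingress:
    enabled: false   # the default Ingress/Route will not be created
```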
### Outgoing requests proxy settings

The operator will respect the standard `HTTP_PROXY`, `HTTPS_PROXY` and `NO_PROXY` environment variables for any outgoing requests. These variables are by default [set by OLM](https://docs.openshift.com/container-platform/4.9/operators/admin/olm-configuring-proxy-support.html). The operator will also pass them to the Keycloak pods to leverage the built-in support for them. This behaviour will be overridable.

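For illustration, the propagated variables would end up on the Keycloak container roughly as follows (the values are placeholders):

```yaml
# Illustration of the proxy variables the operator would propagate to the
# Keycloak container; the values are placeholders.
env:
  - name: HTTP_PROXY
    value: "http://proxy.example.com:3128"
  - name: HTTPS_PROXY
    value: "http://proxy.example.com:3128"
  - name: NO_PROXY
    value: ".cluster.local,.svc,10.0.0.0/8"
```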
## Codebase

The code for the new operator will be organized as a Maven sub-module in the main `keycloak/keycloak` GitHub repository.
Dependency management will automatically piggy-back on the Keycloak BOM of the Quarkus distribution, guaranteeing compliance of the used library versions.

It will use the Java Operator SDK and its Quarkus extension. This implies the usage of the Fabric8 Kubernetes client.

## Kubernetes compatibility matrix

* OpenShift >= 4.7
* Vanilla Kubernetes >= 1.20

Other Kubernetes distributions are supported only on a best-effort basis.

## Distribution

The operator deployment is going to be performed leveraging OLM, providing both a CLI approach via `Subscription` objects for managing the operator installation, and a UI in OpenShift. The operator itself is going to be distributed as a container image.

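For example, a CLI-driven installation through OLM boils down to creating a `Subscription` such as the one below; the package name, channel and catalog source are assumptions, not final coordinates:

```yaml
# Illustrative OLM Subscription; package name, channel and catalog source are assumptions.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: keycloak-operator
  namespace: keycloak
spec:
  name: keycloak-operator          # package name in the catalog
  channel: fast                    # assumed channel
  source: operatorhubio-catalog    # assumed catalog source
  sourceNamespace: olm
  installPlanApproval: Automatic
```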
> **Review discussion:**
> * OLM is not used much except in OpenShift clusters, and Helm is widely used. What about providing a Helm chart, or at least plain manifests?
> * Indeed, a dependency on OLM just to install the operator is very heavy and unrealistic for most consumers.
> * I use OLM on several vanilla K8s clusters, so it may be more common than you think. But a Helm option might be nice too.
> * Sure, using OLM is possible, but by requiring it you're adding another hurdle that makes Keycloak less attractive, and there is no documented reason why forcing the use of OLM is necessary.
> * FYI, I'm just a user. I have experienced some of the benefits of OLM when it comes to things like auto-upgrading; that's still rather painful with Helm, and having up-to-date instances is pretty important for a security-sensitive service like Keycloak. Devs' time is scarce; if all they have time for is to support one install system that works everywhere (K8s and OpenShift), even though it may be harder for some, I get that. Others can come along and implement other install methods if needed. I don't think this blocks that; this is just the description of the starting point, I think.
> * If the install method for the operator is any more complex than this, I think that counts as a bug in the operator rather than a design choice.
> * If that's true, there is no need for Helm either then. A single static manifest. Done.
> * Hello @sathieu @benlangfeld @kfox1111! Thank you for your feedback. It will (and actually already is) of course be possible to deploy the operator "manually" using kustomize. We won't be creating any artificial obstacles to using the operator this way; we actually want to make this experience as good as possible, as it is the main way of using the operator during its development. OLM is pretty much a standard now in the operator world, with more than 200 operators on OperatorHub.io, which is not exclusive to OpenShift. So while OLM is not a hard dependency for the new operator, it is strongly recommended for the best experience. We currently don't plan to support Helm.
> * The current keycloak-operator has the option to be installed either with or without OLM.
> * I think forcing the added complexity of another management interface (OLM) makes very little sense. K8s, or OpenShift for that matter, does just fine managing operator lifecycle. In my experience using K8s in production, most organizations are using CI/CD to manage application deployments. Forcing OLM in that scenario adds no value, and in fact detracts from one of the core tenets of reliability engineering: "Simplicity". I'm all for having the flexibility to use OLM for those that want/need it, but for most production environments, forcing an additional management interface (OLM) just doesn't make sense.
> * @DavidKittleSEL The new operator will use exactly the same approach as the old one: primarily distributed through OLM, with the possibility to deploy the operator manually without it.

## Migration from the old operator

There will be no direct migration path. Generic migration steps will be documented.

> **Review discussion:**
> * That's OK. But is there a guarantee both can coexist in the same cluster?
> * The two operators (the old one written in Go, the new one written in Java) most probably won't be usable at the same time. There'd be API conflicts (CRDs), and it wouldn't make much sense, as the old one supports only the WildFly distribution of Keycloak whereas the new one supports only the Quarkus distribution.
> * It would only be a problem, I think, if you were trying to reuse the CRDs with the new operator as well as the old one. Will the new one have its own API, completely separate from the old one, so the two can coexist?
> * The APIs should be separated; the new operator will use a different CRD version.

## Future considerations

### Autonomous operator

Our long-term vision includes making the operator autonomous to some extent, basically making it a [Level 5 operator](https://operatorframework.io/operator-capabilities/). It should be able to understand the operand's metrics and react to them, automatically scaling and healing the Keycloak deployment.

> **Review discussion:**
> * I'm missing the following configuration options: `owners` for distributed caches, the number of entries, etc. I'm not sure how that would look with the new store though, especially because the operator might need an Infinispan Server in place.
> * +1 for this. IMO caching/clustering, as well as the JVM environment, is an infrastructure concern and should thus be handled by the operator. Not sure how Store.X fits in here, though.
> * I believe both areas are basically dist config options and as such are configured like any other Quarkus dist option, that means env vars, e.g. Is that correct? We have this already covered by:
> * True, I forgot about this important detail. However, we are still missing some key options for clustering, like setting the number of owners or even the number of entries in the cache. Right now, if you want something different from the default, you would need to provide your own configuration file, and that means mounting a file into a pod or building a custom image. I think we need to discuss this area more and how to make it easier to set up.
> * As for JVM settings, it should be a matter of setting the `JAVA_OPTS` env var. I was just wondering if you want to provide some specific properties in the CR for the most common settings. For me, the env var works just fine.