
Add OSSM 2 vs 3 differences doc #145

Merged
merged 8 commits into openshift-service-mesh:main on Oct 25, 2024

Conversation


@longmuir commented Oct 9, 2024

First version of OpenShift Service Mesh 2 vs 3 Documentation. This is meant to be a high-level overview covering "What" has changed and a bit of "Why", with links added later to sections which may provide more detail on "How" in OSSM 3. Note that this is not meant to be exhaustive (there will be lots of changes between Istio releases as well), but to highlight the important things that OSSM 2 users should be aware of, and that may require action, when moving to 3.

When reviewing, please consider:

  • Is the information correct? Can it be made easier to understand?
  • Can the information be condensed? This should be to the point.
  • What is missing? What else is changing that users should be aware of?

Signed-off-by: Jamie Longmuir <[email protected]>
@openshift-ci openshift-ci bot added the size/L label Oct 9, 2024

While OpenShift Service Mesh 2 used a resource called `ServiceMeshControlPlane` to configure Istio, OpenShift Service Mesh 3 uses a resource called `Istio`.

The `Istio` resource contains a `spec.values` field that derives its schema from Istio’s Helm chart values. While this is a different configuration schema than the one `ServiceMeshControlPlane` uses, the fact that it is derived from Istio’s configuration means that configuration examples from the community Istio documentation can often be applied directly to Red Hat OpenShift Service Mesh’s `Istio` resource. The `spec.values` field has a similar format to the one in the `IstioOperator` resource (which is not part of OpenShift Service Mesh), with the `Istio` resource providing an additional validation schema that makes it possible to explore the resource using the OpenShift CLI command `oc explain istios.spec.values`.
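
For illustration, a minimal `Istio` resource might look like the following. This is a sketch only: the `apiVersion` and the specific values shown are assumptions that depend on the operator version installed; the point is the overall shape, with Helm-style values under `spec.values`.

```yaml
apiVersion: sailoperator.io/v1alpha1   # illustrative; confirm with `oc api-resources | grep -i istio`
kind: Istio
metadata:
  name: default
spec:
  namespace: istio-system              # control plane namespace
  values:                              # Helm-style values, as in the community Istio charts
    meshConfig:
      accessLogFile: /dev/stdout       # example: enable Envoy access logging mesh-wide
    global:
      logging:
        level: "default:info"          # example: component log levels, same format as upstream Helm values
```

The schema under `spec.values` can then be explored with `oc explain istios.spec.values`, as noted above.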

I think here it would make more sense to refer to upstream helm charts (and their values API) rather than the IstioOperator API

Author

I did mention helm charts in the first sentence, but I also mentioned IstioOperator as well because it is widely used throughout the Istio.io documentation and has a full API reference in the Istio docs. The only reference I could find for Helm values was in the installation section: https://istio.io/latest/docs/setup/install/helm/#installation-steps, and it seemed a lot more difficult to translate pages like this: https://artifacthub.io/packages/helm/istio-official/istiod?modal=values, into config YAML if you're starting with the Istio. So, at face value, it looked like IstioOperator might be a more familiar starting point for upstream users... or people just reading the docs.

That said, I'm open to how best to convey this... and if the values field in IstioOperator is substantially different, then I should probably remove the reference.


no, the values field is almost identical to ours - we have just additionally rolled up the components into the values by adding enabled: true/false fields

longmuir and others added 3 commits October 14, 2024 20:28
Co-authored-by: Sridhar Gaddam <[email protected]>
Co-authored-by: Daniel Grimm <[email protected]>
Co-authored-by: Nick Fox <[email protected]>
Signed-off-by: Jamie Longmuir <[email protected]>
Remove Gateway API reference from Gateways section.

Co-authored-by: Eoin Fennessy <[email protected]>
@longmuir longmuir requested a review from eoinfennessy October 18, 2024 19:09

## The OpenShift Service Mesh 3 operator

OpenShift Service Mesh 3 uses an operator that is maintained upstream as the sail-operator in the istio-ecosystem organization. This operator is smaller in scope and includes significant changes from the operator used in OpenShift Service Mesh 2 that was maintained as part of the Maistra.io project.

Throughout the operator docs, we call it the "Sail Operator", not "sail-operator".

Author

Thanks - will do a search and replace.


This simplification greatly reduces the footprint and complexity of OpenShift Service Mesh, while providing better, production-grade support for observability through Red Hat OpenShift Observability.

## `Istio` replaces `ServiceMeshControlPlane`

To avoid confusion, I'd change this to something like:

Suggested change
## `Istio` replaces `ServiceMeshControlPlane`
## The `Istio` resource replaces the `ServiceMeshControlPlane` resource


Okay, perhaps the title is fine as is, since it's formatted as code and "resource" is mentioned in the text below.

Author

Actually, I like your suggestion as it's more clear.


The Istio CNI node agent is used to configure traffic redirection for pods in the mesh. It runs as a DaemonSet, on every node, with elevated privileges. The Istio CNI agent has a lifecycle that is independent of Istio control planes, and must be upgraded separately.

In OpenShift Service Mesh 2, service meshes were namespace scoped by default and thus each mesh included its own instance of Istio CNI. As all OpenShift Service Mesh 3 meshes are cluster-wide, there can only be one version of Istio CNI per cluster, regardless of how many service mesh control planes are on the cluster.
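
For orientation, a sketch of what this looks like as a resource (the `apiVersion`, name, namespace, and version are illustrative assumptions; the exact fields are defined by the operator's `IstioCNI` CRD):

```yaml
apiVersion: sailoperator.io/v1alpha1   # illustrative; depends on the installed operator version
kind: IstioCNI
metadata:
  name: default                        # a single instance per cluster
spec:
  namespace: istio-cni                 # namespace for the CNI DaemonSet (example name)
  version: v1.23.0                     # example version; upgraded independently of the control plane(s)
```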

Actually, the 2.x operator deployed one instance of Istio CNI for each Istio version. Thus, two meshes configured with the same Istio version used the same Istio CNI instance (deployed in the openshift-operators namespace).

Also, the reason why there can now only be one instance of Istio CNI per cluster isn't because meshes are cluster-wide, but because we are no longer deploying Istio CNI via Multus, and Istio CNI itself does not support running multiple instances/versions. That is why IstioCNI is now a separate resource. This allows the user to have full control over which version of Istio CNI should be deployed when multiple Istios of different versions are installed. Without the IstioCNI resource, the operator would have to decide which version to install. Another reason for the introduction of the IstioCNI resource is that the Istio CNI lifecycle isn't and shouldn't be tied to the lifecycle of the control plane(s).

Author

Thanks for clarifying this - I tried to reword it accordingly. How's this?

In OpenShift Service Mesh 2, the operator deployed an Istio CNI instance for each minor version of Istio present in the cluster and pods were automatically annotated during sidecar injection, such that they picked up the correct Istio CNI. This was enabled by using the Multus CNI plugin and meant that the management of Istio CNI was mostly hidden from users.
OpenShift Service Mesh 3 no longer uses the Multus CNI plugin, and instead runs Istio CNI as a chained CNI plugin. While this simplification provides greater flexibility for network integrations, without Multus, it means that only one instance of Istio CNI may be present in the cluster at any given time and users must manage its lifecycle independent of Istio control planes.
For these reasons, the OpenShift Service Mesh 3 operator manages Istio CNI with a separate resource called IstioCNI. A single instance of this resource is shared by all Istio control planes (managed by Istio resources). The IstioCNI resource must be upgraded before individual control planes (Istio resources) are upgraded.


LGTM.


OpenShift Service Mesh 2 used the two resources `ServiceMeshMemberRoll` and `ServiceMeshMember` to indicate which namespaces were to be included in the mesh. When a mesh was created, it would only be scoped to the namespaces listed in the `ServiceMeshMemberRoll` or containing a `ServiceMeshMember` instance. This made it simple to include multiple service meshes in a cluster with each mesh tightly scoped, referred to as a “multitenant” configuration.

In OpenShift Service Mesh 2.4, a “cluster-wide” mode was introduced to allow a mesh to be cluster-scoped, with the option to limit the mesh using an Istio feature called `discoverySelectors`, which limits the scope of the mesh to a set of namespaces defined with a [label selector](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/). This is aligned with how community Istio worked, and allowed Istio to manage cluster-level resources.
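
As a rough sketch of the OSSM 3 equivalent (the label used here is just an example, and the placement of `meshConfig` under `spec.values` is assumed from the Helm-derived schema described earlier):

```yaml
apiVersion: sailoperator.io/v1alpha1   # illustrative
kind: Istio
metadata:
  name: default
spec:
  namespace: istio-system
  values:
    meshConfig:
      discoverySelectors:              # the control plane only discovers namespaces matching these selectors
        - matchLabels:
            istio-discovery: enabled   # example label; any Kubernetes label selector can be used
```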

which limits the scope of the mesh ...

This is not perfectly exact. Yes, in 2.4 we could say that the SMMR/SMM limited the scope of the mesh, but in 3.x (and 2.x cluster-wide mode), there is no single concept of "scope". Instead, scope is split into discovery and injection. In 2.x multitenant mode the SMMR affected both at the same time. In 2.x cluster-wide and 3.x, you limit discovery via discoverySelectors and injection via namespace/workload labels.

Author

"scope of the mesh" is probably an over simplification, as you're right that injection matters too (the injection doc PR is relevant here too). I can change "limits the scope" to "limits the Istio control plane's visibility", which is more precise.

In 2.x though, at least as we documented, you still needed to add an annotation or label to workloads for injection to happen. So, Discovery Selectors doesn't sound too different in practice (even though the Istiods are scoped differently)? Perhaps it was more different under the hood...


Sidecar injection in OpenShift Service Mesh 3 works the same way as it does in Istio: pod or namespace labels are used to trigger sidecar injection, and it may be necessary to include a label that indicates which control plane (revision) the workload belongs to. Note that Istio has deprecated pod annotations in favor of labels for sidecar injection.

When an `Istio` resource has the name “default”, the label `istio-injection=enabled` may be used.

This is only true when the update strategy is InPlace. If the strategy is RevisionBased, then the IstioRevision name is default-<version>, which currently prevents you from using the istio-injection=enabled label.

Author

Thanks, will update.



However, when an `Istio` resource has a name other than “default” - as required when multiple control plane instances are present and/or a canary-style control plane upgrade is in progress - it is necessary to use a label that indicates which control plane (revision) the workload(s) belong to - namely, `istio.io/rev=<IstioRevision-name>`. These labels may be applied at the workload or namespace level.
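
As an illustration (the namespace and revision names below are hypothetical; actual revisions can be listed with `oc get istiorevision`), labeling a namespace for injection might look like this:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: bookinfo                   # example application namespace
  labels:
    istio.io/rev: my-revision      # hypothetical IstioRevision name; use `istio-injection: enabled`
                                   # instead when the revision is named "default"
```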

The Istio resource can be called default even when using the revision-based update strategy. Perhaps you wanted to say IstioRevision instead of Istio here?

Author

Ugh, yes, good catch - this wasn't right. It works if I change it to IstioRevision, I think. I'll also clarify the previous paragraph and mention that you can query revisions with `oc get istiorevision`.

Related, when you have a moment, please review this PR: #152 as it goes into this topic in more depth.


In Istio, gateways are used to manage traffic entering (ingress) and exiting (egress) the mesh. While by default, OpenShift Service Mesh 2 deployed and managed an Ingress Gateway and an Egress Gateway with the service mesh control plane, configured in the `ServiceMeshControlPlane` resource, the OpenShift Service Mesh 3 operator will no longer create or manage gateways.

In OpenShift Service Mesh 3, gateways are created and managed independently of the operator and control plane using gateway injection. This provides much greater flexibility than was possible with the `ServiceMeshControlPlane` resource and ensures that gateways can be fully customized and managed as part of a GitOps pipeline. It also allows gateways to be deployed and managed alongside their applications, with the same lifecycle.
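
For context, gateway injection follows the pattern documented for community Istio: a plain `Deployment` whose pod template opts into injection using the gateway template. An abridged sketch (the name and namespace are examples):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-ingressgateway                   # example name; managed like any other workload
  namespace: my-gateways                    # example namespace
spec:
  selector:
    matchLabels:
      istio: my-ingressgateway
  template:
    metadata:
      annotations:
        inject.istio.io/templates: gateway  # request Istio's gateway injection template
      labels:
        istio: my-ingressgateway
        sidecar.istio.io/inject: "true"     # opt this pod into injection
    spec:
      containers:
        - name: istio-proxy
          image: auto                       # placeholder; replaced by the injected proxy image
```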

No mention of Gateway API?

Author

We don't support the CRDs for Gateway API on OCP yet (had to clarify that in 2.6), so it won't be a fully supported solution until OCP Ingress ships their support with the CRDs. I had it here before, and took it out... I think I will add it back, as we mostly support using it...but with caveats.


## Introducing Canary Upgrades

OpenShift Service Mesh 2 supported only one approach for upgrades - an in-place style upgrade, where the control plane was upgraded, then all gateways and workloads were restarted for the proxies could be upgraded. While this is a simple approach, it can create risk for large meshes where once the control plane was upgraded, all workloads must upgrade to the new control plane version without a simple way to roll back if something goes wrong.
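
For contrast, a canary (revision-based) upgrade path is selected through the update strategy on the `Istio` resource. A minimal sketch, with the `apiVersion` and version string as illustrative assumptions:

```yaml
apiVersion: sailoperator.io/v1alpha1   # illustrative
kind: Istio
metadata:
  name: default
spec:
  namespace: istio-system
  version: v1.23.0                     # example version
  updateStrategy:
    type: RevisionBased                # canary-style upgrades; InPlace is the alternative
```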

I think something's off here:

then all gateways and workloads were restarted for the proxies could be upgraded.

Author

"needed to be" restarted? I am meaning that the proxies needed to be bounced as part of an upgrade - is that not correct?


OpenShift Service Mesh 2 supported one form of multi-cluster: federation, which was introduced in version 2.1. In this topology, each cluster maintains its own independent control plane, with services only shared between those meshes on an as-needed basis. Communication between federated meshes is entirely through Istio gateways, meaning that there is no need for service mesh control planes to watch remote Kubernetes control planes, as is the case with Istio's multi-cluster service mesh topologies. Federation is ideal where service meshes are loosely coupled - managed by different administrative teams.

OpenShift Service Mesh 3 includes support for Istio's multi-cluster topologies, namely: Multi-Primary, Primary-Remote, and external control planes. These topologies effectively stretch a single unified service mesh across multiple clusters, which is ideal when all clusters involved are managed by the same administrative team - for example, to implement high-availability or failover use cases across a commonly managed set of applications.

No mention of plan to again introduce federation at some point? Readers could be left wondering whether/why the federation feature is going away.

Author

Agreed... unfortunately, as this is for the product doc, we can't mention features that aren't at least tech preview (and those would have a big disclaimer). We will look to tell people about federation in the release blog posts.

Signed-off-by: Jamie Longmuir <[email protected]>

The `Istio` resource contains a `spec.values` field that derives its schema from Istio’s Helm chart values. While this is a different configuration schema than the one `ServiceMeshControlPlane` uses, the fact that it is derived from Istio’s configuration means that configuration examples from the community Istio documentation can often be applied directly to Red Hat OpenShift Service Mesh’s `Istio` resource. The `spec.values` field in the `IstioOperator` resource (which is not part of OpenShift Service Mesh) has a similar format. The `Istio` resource provides an additional validation schema that makes it possible to explore the resource using the OpenShift CLI command `oc explain istios.spec.values`.

## New resource: `IstioCNI`
Author

@luksa @dgn Can you have another look at this section on the IstioCNI resource... Based on the discussion in the team call, we are still using Multus in 3.0 rather than CNI chaining, and it may not have been the reason for the IstioCNI resource. Originally, I had said it had to do with the cluster vs namespace scoping, but changed it after feedback... was it just a simplification to decouple the lifecycles for canary upgrades? Can you suggest a better explanation for "Why" we now have the IstioCNI resource?

If the reasoning is messy, we can gloss over it a bit - the aim of this page is to tell users what they need to know coming from OSSM 2, not how everything works under the hood.


In OpenShift Service Mesh 2, the operator deployed an Istio CNI instance for each minor version of Istio present in the cluster and pods were automatically annotated during sidecar injection, such that they picked up the correct Istio CNI. This was enabled by using the Multus CNI plugin and meant that the management of Istio CNI was mostly hidden from users.

OpenShift Service Mesh 3 no longer uses the Multus CNI plugin, and instead runs Istio CNI as a chained CNI plugin. While this simplification provides greater flexibility for network integrations, without Multus, it means that only one instance of Istio CNI may be present in the cluster at any given time and users must manage its lifecycle independent of Istio control planes.

we still use Multus, and the usage of Multus is also unrelated to us running only one instance (that part is true). The actual reason we only run one instance in 3.0 is because upstream did not accept our patch that would have allowed running multiple copies of CNI, with the reason given that it's not required as they have different lifecycles (which then turned into us creating a separate CRD for it)

Signed-off-by: Jamie Longmuir <[email protected]>
@yxun commented Oct 24, 2024

Hello, how should I add a doc section in this PR?
I see we have a 2-to-3 difference in the "Setting the minimum and maximum protocol versions" topic.

In short, we configured this feature in OSSM 2.x through the SMCP API [1], and in OSSM 3 we should configure it by following the upstream doc [2].

ref: [1] https://docs.openshift.com/container-platform/4.17/service_mesh/v2x/ossm-security.html#ossm-security-min-max-tls_ossm-security
[2] https://istio.io/latest/docs/tasks/security/tls-configuration/workload-min-tls-version/
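
For reference, one possible way the upstream example in [2] could map into the OSSM 3 `Istio` resource (a sketch only; the placement under `spec.values.meshConfig` is an assumption based on the Helm-derived schema):

```yaml
apiVersion: sailoperator.io/v1alpha1   # illustrative
kind: Istio
metadata:
  name: default
spec:
  namespace: istio-system
  values:
    meshConfig:
      meshMTLS:
        minProtocolVersion: TLSV1_3    # as in the upstream workload minimum TLS version task
```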

@longmuir
Author

@yxun Because this PR is large and has already been reviewed, I am going to merge it, and then you can make your PR against it. While it is possible for you to edit it directly, I would rather push the content as is and then review smaller changes. Of course, it will receive another review before it is published to the product docs, so this is not last call :).

@longmuir
Author

/retest

@nrfox commented Oct 25, 2024

/override ci/prow/scorecard


openshift-ci bot commented Oct 25, 2024

@nrfox: nrfox unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:.

In response to this:

/override ci/prow/scorecard

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@FilipB commented Oct 25, 2024

/override ci/prow/scorecard


openshift-ci bot commented Oct 25, 2024

@FilipB: Overrode contexts on behalf of FilipB: ci/prow/scorecard

In response to this:

/override ci/prow/scorecard

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@FilipB FilipB merged commit 9a174ec into openshift-service-mesh:main Oct 25, 2024
10 checks passed
@gwynnemonahan

Hey @longmuir,

General question: is the content in a particular order, like from most important changes to least important changes?

@longmuir
Author

longmuir commented Nov 1, 2024

@gwynnemonahan The ordering is roughly equivalent to the steps a user would take to adopt OpenShift Service Mesh... though it is also big/high-level changes -> smaller esoteric changes. So, starting with high level product differences -> operator differences -> installation -> scoping -> common configuration -> less common configuration.

Some sections are also closely related; for example, "Scoping of the Mesh...", "Sidecar injection..." and "Multi-control plane..." are all tied together, hence put together. For those, you need to read the earlier topics for the later ones to make sense.

The two at the end - the network policy and istioctl topics - are fairly independent. Arguably, we could even have the istioctl one earlier.

We tweaked it a few times during the eng reviews, so definitely open to suggestions on it.
