
RFC : Repository Registration for Remote Backed Storage #8623

Closed
psychbot opened this issue Jul 11, 2023 · 11 comments · Fixed by #9105 or #9802
Labels
enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes Storage:Durability Issues and PRs related to the durability framework Storage Issues and PRs relating to data and metadata storage v2.10.0

Comments

psychbot (Member) commented Jul 11, 2023

Problem Statement

OpenSearch with remote backed storage stores indexed data in a remote data store, which guarantees data durability. Today the user has to register the repository manually by calling PUT /_snapshot/remote-repository and then update the cluster-level remote repository settings, the index-level remote repository settings, or both, before the remote backed storage feature can be used.

Cluster Settings for Remote Repository -

  • cluster.remote_store.repository
  • cluster.remote_store.translog.repository

Index Settings for Remote Repository -

  • index.remote_store.segment.repository
  • index.remote_store.translog.repository

Only after the user updates these settings is indexed data backed by the remote store, which means any index created before this process will not be backed by the remote store until #7986 is built into OpenSearch, which allows migrating older indices to the remote store.

Because of this manual step, we miss backing up system indices to the remote store, since all system indices are created during cluster bootstrap.

Requirements

Functional

  • Existing repository registration functionality should continue to work exactly as it does today.
  • The repositories supplied during cluster bootstrap should be registered first, so that system indices are backed up to remote backed storage.
  • Add support for tagging a repository so that some of its fields cannot be altered and the repository cannot be deleted once it is registered as a remote store repository, e.g. restricted : false or restricted : true, or remote_store_repository : false or remote_store_repository : true.
  • The repository information will be supplied during cluster bootstrap via the yml file.

Non-Functional

  • The repository registration during bootstrap should have minimal or no impact on cluster bootstrap time.

Assumptions

  • It is the user's responsibility to keep the repository information on all nodes in sync.
  • It is the user's responsibility not to alter or delete the repository information in the yml file.

Background

OpenSearch has a plugin-based architecture which allows developers to build plugins using the interfaces provided by the core and run them as part of the OpenSearch engine. Some plugins create system indices and store information necessary for their functioning during cluster bootstrap.

Remote backed storage in its current state cannot back up these system indices, which are created during cluster bootstrap; hence we want to support supplying repositories via yml and registering them at the very start of cluster bootstrap.

[Solution 1] Cluster Settings based approach

In this solution we pass the repository information in the OpenSearch yml, and during cluster bootstrap the active cluster manager registers the repository.

Algorithm

The solution will have the following steps

  1. Supplying repository information and cluster settings - Currently we do not accept repository information via the yml file. We will allow supplying repository information via yml and use it during node bootstrap.
    Below is the format in which repository information and cluster settings will be supplied via yml:
    "repository_information":
        "my-remote-segment-store":
            "type": "s3"
            "settings": "{\"bucket\": \"my-s3-bucket\",\"base_path\": \"my/snapshot/directory\"}"
            "restricted": true
        "my-remote-translog-store":
            "type": "fs"
            "settings": "{\"location\": \"/mnt/remote\"}"
            "restricted": true
            
    "cluster.remote_store.repository": "my-remote-segment-store"
    "cluster.remote_store.translog.repository": "my-remote-translog-store"
  2. Registering the repository - We want the repository registration to happen as soon as the cluster manager is elected.
    There are two ways to achieve this -
    a. [Preferred] Cluster State Change Event - Listen to the cluster state change event, and when the cluster manager is elected, submit the task for registering the repository. The ClusterStateListener implementation will be removed once the repository is registered.
    b. Background Thread - A background thread keeps polling the local cluster state periodically, and once the cluster manager is elected the executor stops.
  3. Registration task should be submitted by one node - To achieve this, the repository registration logic will be functional only on the active cluster manager. Once the repository is registered, it will remove the ClusterStateListener implementation.
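The preferred one-shot listener in option (a) can be sketched as a small simulation. Note this is a hedged illustration: the `ClusterService` and listener classes below are Python stand-ins invented for this sketch, not OpenSearch's actual Java interfaces.

```python
class ClusterService:
    """Minimal stand-in for a cluster service: holds listeners and state."""
    def __init__(self):
        self.listeners = []
        self.state = {"elected_manager": None, "repositories": {}}

    def add_listener(self, listener):
        self.listeners.append(listener)

    def remove_listener(self, listener):
        self.listeners.remove(listener)

    def publish(self, changes):
        # Apply a state change and notify all registered listeners.
        self.state.update(changes)
        for listener in list(self.listeners):
            listener.cluster_changed(self)


class RepositoryRegistrationListener:
    """One-shot listener: registers bootstrap repositories exactly once,
    on the elected cluster manager, then deregisters itself."""
    def __init__(self, local_node, repositories_from_yml):
        self.local_node = local_node
        self.repositories = repositories_from_yml

    def cluster_changed(self, service):
        state = service.state
        if state["elected_manager"] == self.local_node:
            # Only the active cluster manager submits the registration task.
            if not state["repositories"]:
                state["repositories"].update(self.repositories)
            service.remove_listener(self)
        elif state["repositories"]:
            # Not the manager, and the repository is already registered:
            # nothing to do, so the listener removes itself (restart case).
            service.remove_listener(self)
```

The self-removal branch for non-manager nodes mirrors the restart handling described under Failure Scenarios below.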

(Repository registration sequence diagram.)

Failure Scenarios

  1. Handling Node/Process Restart - If the node is not the active cluster manager: during restart the ClusterStateListener implementation is added during bootstrap, and upon the first cluster state changed event it is removed, since this node is not the active cluster manager and the repository is already registered.
  2. Handling Node Reboot of the Active Cluster Manager (Single Node Cluster) - If the node is the active cluster manager: during restart the ClusterStateListener implementation is added during bootstrap, and upon the first cluster state changed event it checks whether the repository information is already present in the cluster state. As the information will already be present, the ClusterStateListener implementation is removed.

Migration/Upgrade Scenarios

All nodes that support remote backed storage will have a node attribute, say remote_backed_storage. Below are some of the scenarios -

  1. Remote Store Node sends a join request to a Non Remote Store cluster - The non remote store cluster manager does not have the node attribute validator, so the validators succeed and a validate-join request is sent. Since the request is from a non remote store cluster manager, the validator is skipped, and the node joins the cluster.
  2. Remote Store Node with incorrect repository information sends a join request to a Non Remote Store cluster - Same as above: the non remote store cluster manager has no node attribute validator, the validator is skipped on the validate-join request, and the node joins the cluster.
  3. Remote Store Node with incorrect repository information sends a join request to a Remote Store cluster - A node join request is sent from the data node to the cluster manager; both have the node attribute, so the validators succeed. The cluster manager then sends a validate-join request to the data node, and the validator checks whether the cluster state information matches the yml information. As the information differs, the validator fails and the node does not join the cluster.
  4. Non Remote Store Node sends a join request to a remote store cluster manager - The join request fails because the validator on the remote store cluster manager does not find the node attribute on the data node.
  5. Remote Store Node sends a join request to a remote store cluster - A node join request is sent from the data node to the cluster manager; both have the node attribute, so the validators succeed. The cluster manager then sends a validate-join request to the data node, and the validator checks whether the cluster state information matches the yml information. As the information is the same, the validator passes and the node joins the cluster.
  6. Remove Conflicting Nodes During Upgrades - During upgrades, once the new cluster manager (i.e. the remote store cluster manager node) is elected, it rejects join requests from older nodes that do not have the node attributes and whose yml information does not match the cluster state.
  7. Repository Registration During Upgrades - Repository registration will only happen once an active cluster manager that has the repository information is elected.
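The join outcomes in scenarios 1–5 above reduce to a small decision function. The sketch below is illustrative only (the function name and flags are invented for this simulation; the real checks live in OpenSearch's join validators):

```python
def can_join(manager_is_remote_store, node_is_remote_store, repo_info_matches):
    """Return True if the joining node should be admitted to the cluster."""
    if not manager_is_remote_store:
        # Scenarios 1 and 2: a non remote store cluster manager has no
        # node-attribute validator, so any node is admitted.
        return True
    if not node_is_remote_store:
        # Scenario 4: a remote store manager rejects nodes lacking the attribute.
        return False
    # Scenarios 3 and 5: both sides are remote store; the validate-join step
    # compares the node's yml-derived repository info against cluster state.
    return repo_info_matches
```

One design point this makes explicit: the repository comparison only ever runs when both the manager and the joining node carry the remote store attribute.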

Pros

  • As the registration of the repository happens when the cluster manager is elected, this works for a single node cluster as well.

Cons

  • The repository is registered only when the cluster manager node that has the repository information is elected as the active cluster manager.
  • Cluster settings are exposed to the customer and can be updated manually.

[Preferred][Solution 2] Node Attribute based approach

In this solution we pass the information via the OpenSearch yml; during node bootstrap the repository information is added to the node attributes, and during node join the node attributes are passed to the active cluster manager to register the repository and perform validation.

Algorithm

  1. Supplying repository information and cluster settings - Currently we do not accept repository information via the yml file. We will allow supplying repository information via yml in the form of node attributes and use it during cluster bootstrap. Below is the format in which repository information and cluster settings will be supplied via yml:
# Node Attributes
node.attr.remote_store.segment.repository : "my-remote-segment-store"
node.attr.remote_store.repository.my-remote-segment-store.type : "s3"
node.attr.remote_store.repository.my-remote-segment-store.settings :
    bucket : "my-s3-bucket"
    base_path : "my/snapshot/directory"
    system_repository: true
node.attr.remote_store.translog.repository : "my-remote-translog-store"
node.attr.remote_store.repository.my-remote-translog-store.type : "fs"
node.attr.remote_store.repository.my-remote-translog-store.settings :
    location : "/mnt/remote"
    system_repository: true

# Cluster Settings
"cluster.remote_store.repository": "my-remote-segment-store"
"cluster.remote_store.translog.repository": "my-remote-translog-store"
  2. Registering the repository - We want the repository registration to happen as soon as the cluster is formed/forming. When a node tries to join the cluster, it sends the repository information to the active cluster manager; the cluster manager validates the repository information against the repository information in its own node attributes and registers it if it matches, otherwise it rejects the node join request.

  3. Registration task should be submitted by one node - To achieve this, the repository registration logic will be functional only on the active cluster manager. The joining node sends the repository information in its node attributes to the active cluster manager, which validates the information and registers the repository if it is not already registered; for all subsequent node join requests, if the repository is already registered, the registration logic is a no-op.
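On the cluster manager side, the flattened `node.attr.remote_store.repository.<name>.<field>` keys from the yml above have to be folded back into per-repository metadata before registration. A minimal sketch of that grouping, assuming the key layout shown in the example (the helper itself is hypothetical, not an OpenSearch API):

```python
PREFIX = "remote_store.repository."

def repositories_from_node_attrs(attrs):
    """Group node attributes of the form
    remote_store.repository.<name>.<field> into {name: {field: value}}."""
    repos = {}
    for key, value in attrs.items():
        if not key.startswith(PREFIX):
            continue  # e.g. remote_store.segment.repository points at a name
        name, _, field = key[len(PREFIX):].partition(".")
        repos.setdefault(name, {})[field] = value
    return repos
```

For example, feeding it the segment-store attributes from the yml above would yield a `my-remote-segment-store` entry whose `type` field is `"s3"`.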
    

Failure Scenarios

  1. Handling Node/Process Restart - If the node is not the active cluster manager, then during restart the node sends a join request with the repository information in its node attributes, and since the repository is already registered this is a no-op.
  2. Handling Node Restarts of the Active Cluster Manager (Single Node Cluster) - It is not yet clear exactly how this will be handled.

Migration/Upgrade Scenarios

Below are some of the scenarios -

  1. Remote Store Node sends a join request to a Non Remote Store cluster - The non remote store cluster manager does not have the node attribute validator, so the validators succeed and a validate-join request is sent. Since the request is from a non remote store cluster manager, the validator is skipped, and the node joins the cluster.
  2. Remote Store Node with incorrect repository information sends a join request to a Non Remote Store cluster - Same as above: the validator is skipped, and the node joins the cluster.
  3. Remote Store Node with incorrect repository information sends a join request to a Remote Store cluster - A node join request is sent from the data node to the cluster manager; both have the node attribute, but validation fails because the information in the node's attributes differs from what is present on the cluster manager node, so the node does not join the cluster.
  4. Non Remote Store Node sends a join request to a remote store cluster manager - The join request fails because the validator on the remote store cluster manager does not find the node attribute on the data node.
  5. Remote Store Node sends a join request to a remote store cluster - A node join request is sent from the data node to the cluster manager; both have the node attribute, so the validators succeed. The cluster manager then sends a validate-join request to the data node, and the validator checks whether the cluster state information matches the yml information. As the information is the same, the validator passes and the node joins the cluster.
  6. Remove Conflicting Nodes During Upgrades - During upgrades, once the new cluster manager (i.e. the remote store cluster manager node) is elected, it rejects node join requests from nodes that do not have matching node attributes.
  7. Repository Registration During Upgrades - Repository registration happens when a node tries to join the cluster with all the repository information in its node attributes, and the active cluster manager registers the repository by reading it.

Pros

  • This overcomes the limitation of the first approach, where repository registration only happens when the cluster manager that has the repository information in its yml is elected. With this approach, once a remote store node joins the cluster with all the information in its node attributes, the cluster manager registers the repository during the node join.
  • No cluster settings are exposed, and the information cannot be updated manually, as it is a node-level attribute.

[Solution 3] Extended Node Attribute based approach

This approach is similar to the second approach, except that instead of storing node attributes as string-to-string key-value pairs, each value is a JSON-serialized object. A node reading the attribute has to deserialize the object to get the information stored against it.
Below is the high-level idea of how the information will be stored -

node.attrs.remote_store.repository_information : "JsonSerializedObject@1234"
node.attrs.remote_store.translog.repository_information : "JsonSerializedObject@1234"
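A sketch of how the receiving side could treat such an attribute: the value is deserialized, and a join is rejected when the payload is malformed, which is the stronger validation this approach provides. The required field names here are assumptions for illustration, not a defined schema:

```python
import json

# Hypothetical required fields for a serialized repository attribute.
REQUIRED_FIELDS = {"name", "type", "settings"}

def parse_repository_attr(raw):
    """Deserialize the JSON attribute value. Raises ValueError (meaning:
    reject the node join) on a malformed or incomplete payload."""
    try:
        info = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"rejecting node join: bad repository attribute ({e})")
    missing = REQUIRED_FIELDS - info.keys()
    if missing:
        raise ValueError(f"rejecting node join: missing fields {sorted(missing)}")
    return info
```

This also demonstrates the con listed below: any serialization mistake turns into a hard join failure rather than a recoverable mismatch.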

Pros

  • Provides a stronger validation mechanism: the data in the node attributes is serialized, and if deserialization fails the node join request is rejected.

Cons

  • If we update the object format in the future, we will have to think about backward compatibility and avoid any backward-incompatible change.
  • Even a minor mistake while serializing the repository information can cause node joins to fail, as the node attribute will be incorrect or incompatible.

FAQ

  1. What happens if there is a partial success during multiple repository registrations?
    We will add retries for the repositories that failed to register on the first attempt. If the failure is persistent, we will let the next cluster changed event kick in and handle the flow again.
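The retry idea from this answer can be sketched as follows: retry only the repositories that failed, a bounded number of times, and return whatever is still failing so the next cluster changed event can pick it up. The function and the `register_one` callback are illustrative names, not part of any real API:

```python
def register_with_retries(repositories, register_one, max_attempts=3):
    """Attempt to register every repository in `repositories`.

    `register_one(name, info)` returns True on success. Returns the set of
    repository names that could not be registered after all attempts; the
    caller leaves those for the next cluster-changed event.
    """
    pending = dict(repositories)
    for _ in range(max_attempts):
        if not pending:
            break
        # Keep only the repositories whose registration still fails.
        pending = {name: info for name, info in pending.items()
                   if not register_one(name, info)}
    return set(pending)
```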

Appendix

Migration/Upgrade Scenario

(Sequence diagram screenshots for the migration/upgrade scenarios, attached Jul 26, 2023.)
@psychbot psychbot added enhancement Enhancement or improvement to existing feature or request untriaged labels Jul 11, 2023
@gbbafna gbbafna added the RFC Issues requesting major changes label Jul 12, 2023
linuxpi (Collaborator) commented Jul 12, 2023

Can there be a race condition where a system index creation happens before the RepositoryRegistrationListener is able to register the repo?

What would be the behavior if this happens? Will the index creation fail since "cluster.remote_store.repository" is already set?

@psychbot psychbot changed the title RFC : Repository Registration during Cluster Bootstrap [RFC] Repository Registration during Cluster Bootstrap Jul 12, 2023
@psychbot psychbot changed the title [RFC] Repository Registration during Cluster Bootstrap [Draft][RFC] Repository Registration during Cluster Bootstrap Jul 12, 2023
@psychbot psychbot changed the title [Draft][RFC] Repository Registration during Cluster Bootstrap RFC : Repository Registration for Remote Backed Storage Jul 22, 2023
shwetathareja (Member) commented:

Thanks @psychbot for the proposal! Please replace "master" with "cluster manager" in the diagrams.

  1. Both Solution 1 & 2 talk about providing remote repo details in the yml file, either via cluster settings or node attributes. We should probably give the approaches better names.
  2. Today, a user can register a repo either at the cluster level or at the index level. In the index-level case there can be multiple valid repos, and plugins can potentially provide their repo details when creating their system index if needed. Then, passing a repo at the node level and using it to prevent node joins sounds contradictory. The repository doesn't seem like a node property.

On Solution 2:

Registering the repository - We want the repository registration to happen instantly when the cluster is formed/forming. When a node tries to join the cluster it will send the repository information to the active cluster manager, the cluster manager will validate the repository information against the repository information in its node attributes and register the same if it matches otherwise reject the node join request.

Can you explain exactly at what point repository will be registered in terms of operation on leader. Before first election, during election or post election. Which ClusterStateTaskExecutor would register the repo and update the state? How are you ensuring atomicity?

Remove Conflicting Nodes During Upgrades - During upgrades once the new cluster manager(i.e. remote store cluster manager node) gets elected it will trim the conflicting nodes which doesn't have the matching node attributes.

Can you add more details on how this node trimming would happen? Are you going to make the trimming logic pluggable? This also means if the cluster manager node gets updated to latest version before data nodes and becomes leader, then old data nodes can't join the cluster. This could potentially mean that shard will not be migrated to new nodes. So, this will be a breaking change.

Repository Registration During Upgrades - Repository registration will happen when a node will try to join a cluster with all the repository information in its node attribute and the active cluster manager will register the repository by reading the same.

Exactly at what point during upgrade, the repo will be registered?

Single node with these information in node attribute will lead to repository registration and it doesn't require cluster manager node with repository information to become active cluster manager.

Can you explain this Pro more as active cluster manager needs to have repository else how will it validate the joins and register the repo?

Bukhtawar (Collaborator) commented Jul 24, 2023

The repository isn't tied to a node; it can be registered to the cluster (I don't prefer the dynamic nature of repositories), but a remote-backed node joining the cluster should ensure that the repository (backing store) it refers to is already validated and registered as an explicit hard dependency, or it registers it during the process as the first remote-backed node joining the cluster (idempotent). Making this configuration at an index level makes the cluster too hard to reason about in terms of data loss semantics, so the intent is to keep this at a cluster level but also cater to the fact that we could have mixed clusters, in which case all shards on the remote-backed nodes will be considered durable.

The leader would need to ensure the remote-backed nodes have homogeneous and validated repository configurations across those nodes. The node join validation checks whether there is a repo already registered and whether it matches that of the current joining node; if not, the joining node, as part of a new RemoteNodeJoinTaskExecutor, does the job of validation and repo registration as part of the node join.

Since repo registration itself requires a cluster state update, care needs to be taken when registering the repository. We need to evaluate whether repo registration and the leader-elected state publication can be bundled together. The caveat is that failure to register the repo could result in leader failure, but it also ensures no other request, like index creation, can supersede a repo registration task.

There would be no trimming logic as such (@psychbot please correct my understanding); the join validation will simply fail for a node joining a cluster it thinks is not suited for its configuration.

psychbot (Member, Author) commented:

@linuxpi

Can there be a race condition where a system index creation happens before the RepositoryRegistrationListener is able to register the repo?

No. With the preferred approach above we should not see a race condition, as repository registration happens when the cluster manager is elected and the first node with the required repository information in its node attributes sends a node join request to the cluster manager.
Index creation can only start once the node has joined the cluster.

shwetathareja (Member) commented:

  1. I thought we already allow configuring a different repo at the index level. Would it fail if a user registers a different repo at the index level than what is provided as a node attribute? Would a single repository be enforced across all indices (including system indices) to ensure homogeneity?
  2. If there is no trimming logic, then what happens when the leader switches to a new cluster manager node that understands remote store and there are older nodes in the cluster? Would they continue to be part of the cluster? Wouldn't this break the homogeneity constraint?
  3. Instead of bundling repository registration and election state in a single cluster state update, have you evaluated registering the repo during the first index creation using a node-scope setting? The first index creation and repo registration can happen atomically. Probably you don't need join validator checks either. Once the repository is registered, the node-scope setting or attribute can't impact anything anyway.

Bukhtawar (Collaborator) commented Jul 24, 2023

Thanks @shwetathareja.
Remote store features are experimental and subject to contract changes. For GA we intend to have a common repository across all indices; we can add index-level overrides if we see a future need, but we should start simple IMO.
The way I was thinking is very similar to version upgrades, where non-remote-backed nodes can join until all shards have migrated to remote-backed nodes. There were alternate thoughts to support mixed clusters, like docrep indices on non-remote-backed nodes based on allocation deciders, if there is an explicit index-level replication type enabled. We are yet to close on the behaviour; however, the guarantees are the following:

  1. A remote backed node should have its repository validated and registered before it can join the cluster.
  2. All remote backed nodes should have the same repository information.

Index creation is not the event we want to hook repository registration on, since we need to support migration to remote backed indices, which should start as soon as we have remote backed nodes and none of the allocation constraints have been breached.
Similarly, a node hosting a shard can be taken down, its node attributes modified to reflect a remote backed node, and the node restarted to join the cluster, expecting that the shard would recover as remote backed as part of the migration process.

To differentiate between mixed clusters and migration cases, the plan is to have a cluster-level setting to auto-migrate indices to remote backed nodes and vice versa (direction of upgrade/downgrade). By default, auto-upgrade to remote backed nodes would be enabled unless there is an explicit index-level setting to keep the replication mode docrep.
Once the indices have been upgraded and there is no non-remote-backed node in the cluster, index creation will start to fail. There is another migration RFC underway that should cover these in more detail.

Repository registration is not the sole purpose of the join validator; it should also restrict a non-remote-backed node from joining once we have upgraded all indices and don't plan to downgrade or run in mixed mode (heterogeneous setup).

shwetathareja (Member) commented Jul 25, 2023

Thanks for the details @Bukhtawar.
I am trying to understand why we need to introduce a remote backed node type via attributes. How is it different from any other data node (except that it has some extra attributes)? What prevents a regular data node from hosting shards which are remote backed along with doc rep shards?

Also, once the repository is registered, in case a node joins with a different repo (for the sake of discussion), what problem can it cause? Basically, the node is going to use the repo which is registered in cluster state, and the repo in the node attribute is going to be ignored anyway. Btw, in which case would a different repo be set in the node attributes, besides a configuration error?

Repository registration can be triggered via the first index creation or when the first index is migrated to a remote backed index. Essentially, the first time a remote backed index is encountered, it always ensures the repo is registered first before proceeding further.

harishbhakuni (Contributor) commented:

@psychbot thanks for the detailed proposal. I have a couple of basic comments/doubts:

  1. We have to restrict updates to the cluster.remote_store.repository cluster setting as well, right?
  2. Can there be use cases where, for some reason, the user would want to change/update the storage behind the remote store indices?

psychbot (Member, Author) commented:

@shwetathareja

Today, user can register a repo either at cluster level or at index level. In case of index level repo, there could be multiple valid repos. And, plugins can potentially provide their repo details when creating their system index if needed. Then, having a repo being passed at node level and then using that to prevent node joins sounds contradictory. The repository doesn't seem like a node property.

Adding to what @Bukhtawar said, repository information being a node-level attribute ensures that a node joining the cluster has the same repository (referred to in the cluster-level settings/cluster state), and it is a hard dependency for the repository to be present. Also, in 2.10 we have restricted updates to these index-level settings: the index-level settings cannot be updated by the customer and will read the cluster-level settings, which themselves cannot be updated once set. This makes the repository setting fixed throughout the cluster lifecycle and a candidate for a node attribute. See #8770 #8812

Can you explain exactly at what point repository will be registered in terms of operation on leader. Before first election, during election or post election. Which ClusterStateTaskExecutor would register the repo and update the state? How are you ensuring atomicity?

The repository registration will happen only after the election, when the first node join request lands on the cluster manager node.

  1. The cluster manager's node join validator will be responsible for validating whether the joining node has the attributes required to join a remote backed storage enabled cluster.
    a. If it is the first node join request, the cluster state won't have any information regarding remote store, so the node join validator will validate that all the required information is present.
    b. For any node join request after the first one, the cluster state will have all the information regarding remote store, so the node join validator will validate that all the information present in the node attributes matches the cluster state before allowing the node to join.
  2. The JoinTaskExecutor (or a new implementation of it) will be responsible for reading the repository information.
    a. If it is the first node, the executor will update the cluster state with the information present in the node attributes.
    b. For any node apart from the first, the executor will run the validator to validate the cluster state information against the node attributes before allowing the node to join.
  3. Before updating the cluster state, it is the TaskExecutor's responsibility to check whether the repository information is already present. If the cluster state does not contain the repository information, a synchronized block will update the cluster state atomically with the repository information.
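The register-or-validate step described above can be condensed into a small sketch. This is an illustrative simulation only: a `threading.Lock` stands in for the single-threaded cluster state update task, and the class name is invented, not the actual executor:

```python
import threading

class RemoteStoreJoinValidator:
    """First remote store join registers the repositories into cluster
    state; every later join is validated against what was registered."""
    def __init__(self):
        self._lock = threading.Lock()
        self.cluster_repositories = None  # None until the first node joins

    def on_node_join(self, node_attr_repositories):
        # The lock models the atomic cluster state update in step 3.
        with self._lock:
            if self.cluster_repositories is None:
                # First node: write its repository info into cluster state.
                self.cluster_repositories = dict(node_attr_repositories)
                return True
            # Subsequent nodes: attributes must match cluster state exactly.
            return self.cluster_repositories == node_attr_repositories
```

Note the idempotency: a matching node rejoining after a restart is simply re-validated, with no second registration.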

Can you add more details on how this node trimming would happen? Are you going to make the trimming logic pluggable? This also means if the cluster manager node gets updated to latest version before data nodes and becomes leader, then old data nodes can't join the cluster. This could potentially mean that shard will not be migrated to new nodes. So, this will be a breaking change.

If the cluster manager changes during an upgrade, all nodes will send a join request to the new cluster manager, so the node join validator will be executed and we won't need this trimming logic. I earlier thought that no join request would be sent to the new cluster manager and hence added the trimming part; I will remove it.

Exactly at what point during upgrade, the repo will be registered?

This applies when remote store is enabled on a 2.10 cluster. At that point we will need to create a new set of data nodes that will join the cluster, and the cluster manager will register the repository when the first remote store node from the new set of data nodes sends a join request with all the required attributes.

Can you explain this Pro more as active cluster manager needs to have repository else how will it validate the joins and register the repo?

In the second approach - While enabling remote store on a 2.10 cluster, the repository gets registered when a remote store node from the new set of nodes sends a join request with all the attributes to the current active cluster manager.
In the first approach - The repository only gets registered when the cluster manager of the new set of nodes is elected as cluster manager. The registration logic depends on the yml file present locally on the active cluster manager, so repository registration requires a cluster manager from the new set of nodes to be elected. I will update the wording, as it is somewhat misleading.

psychbot (Member, Author) commented:

@harishbhakuni21
The cluster level setting has always been static and cannot be updated.
https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/indices/IndicesService.java#L263

We should not, and do not, allow updates to the repository information in any case, as that can lead to catastrophic outcomes. Once the repository is registered and passed into the cluster-level settings, it should remain the same throughout the cluster lifecycle.

@anasalkouz anasalkouz added the Storage:Durability Issues and PRs related to the durability framework label Jul 26, 2023
anasalkouz (Member) commented:

@sachinpkale @psychbot
Is this capability a blocker for remote storage? Do we need to track it for 2.10?
