Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decoupled mode is not being honored in cacheless mode after redirect #2146

Open
2 tasks
JustinKyleJames opened this issue Nov 8, 2023 · 17 comments
Open
2 tasks

Comments

@JustinKyleJames
Copy link
Contributor

JustinKyleJames commented Nov 8, 2023

  • 4-2-stable
  • main

When a client is connected to one server and the S3 resource is attached to another server and in decoupled mode, an iput does not honor decoupled mode. It instead uses the path in iRODS for the key.

The object can still be read from any of the servers since the key that is used in S3 is stored in the catalog.

I tested the following scenarios:

  • Client connected to provider, resource attached to provider, iput wrote the file correctly with the numeric decoupled key.
  • Client connected to consumer, resource attached to provider, iput wrote the file incorrectly using the iRODS path as the key.
  • Client connected to provider, resource attached to consumer, iput wrote the file incorrectly using the iRODS path as the key.
  • Client connected to consumer, resource attached to consumer, iput wrote the file correctly with the numeric decoupled key.
@korydraughn korydraughn changed the title Decoupled mode is not be honored in cacheless mode after redirect Decoupled mode is not being honored in cacheless mode after redirect Nov 8, 2023
@trel trel added the bug label Nov 8, 2023
@JustinKyleJames
Copy link
Contributor Author

It appears that the problem is caused by the L1desc entry not following a redirect. The data id is used for decoupled mode and it is taken from the L1desc table.

@JustinKyleJames
Copy link
Contributor Author

This is not an issue with archive mode as the data object id is read directly from the irods::file_object class since the object has already been written to cache and therefore the database. This is not yet available in cacheless mode so we had to use the L1desc table.

@fj-morales
Copy link

Hi,

Is there any update regarding this issue? Is there any plan to pick it up in a coming release?

TIA for your prompt response.

@JustinKyleJames
Copy link
Contributor Author

Hi,

Is there any update regarding this issue? Is there any plan to pick it up in a coming release?

TIA for your prompt response.

The fix will be in the next release of the plugin. We should be able to fix it in the next few weeks.

@fj-morales
Copy link

Hi, thanks for your response. Would that fix work with irods 4.12? Or only 4.3.x?

@korydraughn
Copy link
Contributor

korydraughn commented Nov 28, 2023

The fix would apply to 4.2 and 4.3 assuming it doesn't require any changes to the iRODS server.

If changes to the server are needed, then only 4.3 will receive the fix.

JustinKyleJames added a commit to JustinKyleJames/irods_resource_plugin_s3 that referenced this issue Jan 10, 2024
JustinKyleJames added a commit to JustinKyleJames/irods_resource_plugin_s3 that referenced this issue Jan 10, 2024
JustinKyleJames added a commit to JustinKyleJames/irods_resource_plugin_s3 that referenced this issue Jan 11, 2024
@JustinKyleJames
Copy link
Contributor Author

The commits have been merged to both main and 4-2-stable. Closing this issue.

@JustinKyleJames
Copy link
Contributor Author

Reopening this issue because the fix did not work on 4.2.12. When I tested it I accidentally had the resource set to detached mode so the redirect did not happen as intended.

It appears that the issue is that unlike 4.3.1, in 4.2.12 at the time s3_file_create() is called the entry has not yet been written to the r_data_main table. We rely on that for the S3 resource to determine the object id which is used for the S3 object key in decoupled mode.

JustinKyleJames added a commit to JustinKyleJames/irods_resource_plugin_s3 that referenced this issue Jan 18, 2024
@JustinKyleJames
Copy link
Contributor Author

I have unchecked both the main and 4-2-stable checkmarks as the solution here does not work for files that use parallel writes.

Here is a brief description on how this currently works. Let's say the client is connected to server A and the S3 resource is attached to server B.

  1. s3_resolve_resc_hier_operation is run on server A.
  2. The db_reg_data_obj_op operation is called on the catalog provider to create the data object entry. After this there is a record in the database for the replica.
  3. The request is redirected to server B.
  4. On server B, in s3_file_create_operation the physical path is paved over. This allows the plugin to write to the correct S3 key.
  5. If the file is a small file, s3_notify_operation is called on server A. The plugin uses this to pave over the physical path within server A's agent memory.
  6. If the file is a large file, the notification operation is not called on server A. The plugin has no opportunity to pave over the physical path within server A's agent memory.
  7. The file is written to S3 (from server B).
  8. At completion, server B executes rs_data_object_finalize and sends the physical path it has to the catalog provider.
  9. The catalog provider updates the physical path in the database.

For this to work correctly, the physical path needs to be repaved within the agent memory on server A before step 8 is run. In the case of large files, the only plugin operation run on server A is the initial s3_resolve_resc_hier_operation. This is too early to pave over the physical path. There is no other opportunity to pave over this path.

At this point I don't see a solution that only involves changes to the S3 plugin.

Some options to move forward:

  1. Update the core code for 4.3.2 to make sure the plugin's notify operation gets called on server A. This would mean that there would be no fix for 4.2.12.
  2. Come up with a post put rule / microservice that would update the path as necessary. This rule would have to to be installed if you plan on using a resource in decoupled mode that is cacheless_attached.
  3. A combination of these two with the rule needed for older code and the core changes for 4.3.2+.

@trel
Copy link
Member

trel commented Jan 19, 2024

I like option 3 on first glance.

@korydraughn
Copy link
Contributor

Is all of this because decoupled mode needs the data id to create a valid physical path / prefix?

@trel
Copy link
Member

trel commented Jan 22, 2024

Is all of this because decoupled mode needs the data id to create a valid physical path / prefix?

Right... because we decided to use 'reversed data_id' as the prefix?

What if... we did something else? Design goals include... 'stable', and ... what else?

@korydraughn
Copy link
Contributor

Exactly what I was getting at. There doesn't appear to be a reason to use the reversed data id.

Perhaps the scheme can be controlled by the admin (with a sane default if the admin doesn't care).

@JustinKyleJames
Copy link
Contributor Author

Exactly what I was getting at. There doesn't appear to be a reason to use the reversed data id.

Perhaps the scheme can be controlled by the admin (with a sane default if the admin doesn't care).

The thing is that the object key has to be predictable (so that both servers agree on the key) and guaranteed to be unique.

On the first point, if it is not predictable then the key would have to somehow be shared between the two computers.

The obvious solution would be to use a hash of the original iRODS path. I thought that I looked into that and found that it wasn't possible but don't remember why. When I get a chance I will look into it again.

@korydraughn
Copy link
Contributor

There's several pieces of information included in the API requests. Perhaps there's a combination of things that are always available, that when used together in some capacity, lead to a unique value.

@fj-morales
Copy link

Hello. Are there any more planned activities targetting this issue?

@JustinKyleJames
Copy link
Contributor Author

JustinKyleJames commented Jun 26, 2024

Hello. Are there any more planned activities targetting this issue?

Not at the moment. It is a difficult problem to solve because the plugin isn't given enough information so it might require a server code change. I'll circle back around to this in the next few days to see if there is something else I can come up with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

4 participants