Implement locking for dataset creation in Globus uploader #446
Comments
The other option is to add to Clowder the option to do a get_or_create_dataset(). We should be able to use: https://stackoverflow.com/a/16362833
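For reference, a minimal sketch of what such a get-or-create could look like in pymongo, using find_one_and_update with upsert as in the Stack Overflow answer above. The collection, database, and field names here are hypothetical, not Clowder's actual schema:

```python
# Sketch only: assumes a hypothetical "datasets" collection keyed by name.
from pymongo import MongoClient, ReturnDocument

db = MongoClient()["clowder"]  # hypothetical database name

def get_or_create_dataset(name):
    # Atomically return the existing document or insert a new one.
    # NOTE: per the Mongo docs, this is only race-free if a unique
    # index exists on the queried field (see the discussion below).
    return db.datasets.find_one_and_update(
        {"name": name},
        {"$setOnInsert": {"name": name}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
```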
Implementing this in Clowder would obviously be the best option, if this is an acceptable approach.
As noted on Slack, the Mongo docs (and the SO post above) indicate that this isn't a reliable mechanism unless there's a unique index (https://docs.mongodb.com/manual/reference/method/db.collection.findAndModify/#upsert-and-unique-index). I think we run into the same problem: with 20+ uploaders running, the dataset creation collision usually happens within a very short window (~ms). We're seeing this problem on < 1% of data right now, so I'd rather go with a solution that absolutely fixes the problem. Max and I discussed sherlock backed by Redis, but after reading an exchange about the algorithm I'm not convinced it's any better. I'm now looking at the python-etcd client, which (along with Zookeeper) seems to have a better approach. If this fails, we can discuss the Mongo approach above.
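For comparison, a minimal sketch of guarding dataset creation with the python-etcd client's Lock; the connection details, lock key, and TTL are illustrative assumptions, not a tested configuration:

```python
# Sketch only: serializes dataset creation across uploaders via etcd.
# Host/port, lock key, and TTL below are illustrative assumptions.
import etcd

client = etcd.Client(host="127.0.0.1", port=4001)

def create_dataset_locked(name):
    lock = etcd.Lock(client, "dataset-create-" + name)
    # Block until the lock is acquired; the TTL releases it even if
    # this uploader process dies mid-creation.
    lock.acquire(blocking=True, lock_ttl=60)
    try:
        return get_or_create_dataset(name)  # e.g. the sketch above
    finally:
        lock.release()
```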
Per discussion with @robkooper and @max-zilla, we will move ahead with creating a new endpoint in Clowder to lock the collection using an intent exclusive (IX) lock in combination with the findAndModify call above. See also https://docs.mongodb.com/manual/faq/concurrency/#which-operations-lock-the-database. Assigning to @max-zilla
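If the new endpoint relies on findAndModify with upsert, the Mongo docs linked above indicate that a unique index on the queried fields is what actually prevents duplicate inserts: concurrent upserts then fail with a duplicate-key error rather than creating a second document. A hedged pymongo sketch, again with hypothetical names:

```python
# Sketch only: database, collection, and field names are hypothetical.
import pymongo
from pymongo import MongoClient

db = MongoClient()["clowder"]

# With this index, concurrent upserts on the same name raise
# DuplicateKeyError instead of silently inserting a duplicate dataset.
db.datasets.create_index([("name", pymongo.ASCENDING)], unique=True)
```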
After further discussion, back on me to try the etcd approach. |
Waiting on Nebula.
@craig-willis any updates here? |
@craig-willis hasn't revisited this since running into etcd timeout issues.
Because Clowder allows multiple datasets to have the same name, we've frequently run into a problem with duplicate datasets during the Globus upload process. This results in downstream problems, such as extractors not triggering because of incomplete data if files are split across datasets.
We've discussed implementing a method in the Clowder API (getOrCreateDataset or similar) that would return the ID of an existing dataset or create it if it didn't exist, but have had pushback from the Clowder team since it would require locking in Mongo.
An alternative is to implement locking in the uploader itself, either via Postgres or via a locking package such as https://github.com/vaidik/sherlock.
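For illustration, a minimal sketch of uploader-side locking with sherlock over a Redis backend; the lock name, expiry, timeout, and default Redis connection are assumptions, not a vetted setup:

```python
# Sketch only: uploader-side locking with sherlock backed by Redis.
# Lock name, expiry, timeout, and connection details are illustrative.
import sherlock
from sherlock import Lock

sherlock.configure(
    backend=sherlock.backends.REDIS,
    expire=60,   # seconds before a stale lock auto-expires
    timeout=10,  # seconds to wait if another uploader holds the lock
)

def create_dataset_locked(name):
    # Only one uploader at a time enters this block per dataset name.
    with Lock("dataset-create-" + name):
        return get_or_create_dataset(name)  # hypothetical helper
```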
Completion criteria: