Implement locking for dataset creation in Globus uploader #446

Open
1 of 2 tasks
craig-willis opened this issue May 17, 2018 · 9 comments

@craig-willis (Contributor) commented May 17, 2018

Because Clowder allows multiple datasets to have the same name, we've frequently run into a problem with duplicate datasets during the Globus upload process. This results in downstream problems, such as extractors not triggering because of incomplete data if files are split across datasets.

We've discussed implementing a method in the Clowder API -- getOrCreateDataset or similar -- that would return the existing dataset ID or create one if it didn't exist, but we've had pushback from the Clowder team since it would require locking in Mongo.

An alternative is to implement locking in the uploader itself, either via Postgres or via another package such as https://github.com/vaidik/sherlock.
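
For reference, a minimal sketch of what the uploader-side option could look like with sherlock backed by Redis; the lock name, expire/timeout values, and the `find_dataset`/`create_dataset` helpers are placeholders, not existing uploader code:

```python
import redis
import sherlock
from sherlock import Lock

# Assumed Redis backend; expire/timeout values are illustrative only.
sherlock.configure(
    backend=sherlock.backends.REDIS,
    client=redis.StrictRedis(host="localhost"),
    expire=60,    # seconds before a stale lock is dropped
    timeout=30,   # seconds to wait when acquiring
)

def get_or_create_dataset(clowder, dataset_name):
    """Serialize dataset creation across uploader processes.

    `clowder.find_dataset` and `clowder.create_dataset` stand in for the
    Clowder API calls the uploader already makes.
    """
    with Lock("globus-dataset-%s" % dataset_name):
        dataset_id = clowder.find_dataset(dataset_name)
        if dataset_id is None:
            dataset_id = clowder.create_dataset(dataset_name)
        return dataset_id
```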

Completion criteria:

  • Implement distributed locking mechanism in uploader
  • Update documentation

@robkooper (Member) commented:

The other option is to add a get_or_create_dataset() endpoint to Clowder. We should be able to use: https://stackoverflow.com/a/16362833
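
For illustration, the SO answer boils down to findAndModify with upsert. A pymongo sketch of the semantics (Clowder itself is a Scala/Play application, and the database, collection, and field names here are assumptions):

```python
from pymongo import MongoClient, ReturnDocument

# Illustrative only: connection string, db/collection, and fields are assumed.
datasets = MongoClient("mongodb://localhost:27017")["clowder"]["datasets"]

def get_or_create_dataset(name, space_id):
    # findAndModify with upsert: returns the matching document if it exists,
    # otherwise inserts one, in a single server-side operation.
    return datasets.find_one_and_update(
        {"name": name, "spaces": space_id},
        {"$setOnInsert": {"name": name, "spaces": space_id}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
```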


@craig-willis (Contributor, Author) commented:

Implementing this in Clowder would obviously be the best option, if this is an acceptable approach.


@craig-willis (Contributor, Author) commented:

As noted on Slack, the Mongo docs (and the SO post above) indicate that this isn't a reliable mechanism unless there's a unique index (https://docs.mongodb.com/manual/reference/method/db.collection.findAndModify/#upsert-and-unique-index). I think we run into the same problem --- with 20+ uploaders running, the dataset creation collision usually happens within a very short window (~ms). We're seeing this problem on < 1% of data right now, so I'd rather go with a solution that definitively fixes the problem.
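
For reference, the guard the Mongo docs call for would be a unique index on the fields used in the upsert query, something like the sketch below (field names assumed; note that Clowder currently allows duplicate dataset names, so this would be a schema-level change):

```python
# Without a unique index, two concurrent upserts that match no existing
# document can both insert, which is exactly the duplicate-dataset race above.
datasets.create_index([("name", 1), ("spaces", 1)], unique=True)
```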

Max and I discussed sherlock backed by Redis, but after reading an exchange about the algorithm I'm not convinced it's any better. I'm now looking at the python-etcd client, which seems (along with Zookeeper) to have a better approach. If this fails, we can discuss the Mongo approach above.
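
Roughly what the python-etcd route might look like; the lock name, TTL, and `create_fn` hook are placeholders, not existing uploader code:

```python
import etcd

client = etcd.Client(host="127.0.0.1", port=2379)  # assumed etcd endpoint

def create_dataset_locked(dataset_name, create_fn):
    """Hold a distributed etcd lock while checking/creating the dataset."""
    lock = etcd.Lock(client, "globus-dataset-%s" % dataset_name)
    # lock_ttl guards against a crashed uploader holding the lock forever;
    # the 60-second value is illustrative only.
    lock.acquire(blocking=True, lock_ttl=60)
    try:
        return create_fn(dataset_name)
    finally:
        lock.release()
```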

craig-willis self-assigned this May 24, 2018

@craig-willis (Contributor, Author) commented:

Per discussion with @robkooper and @max-zilla, we will move ahead with creating a new endpoint in Clowder that locks the collection using an intent exclusive (IX) lock in combination with the findAndModify call above. See also https://docs.mongodb.com/manual/faq/concurrency/#which-operations-lock-the-database.
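
Purely as a sketch of what the uploader side might call once such an endpoint exists; the route and payload below are hypothetical, since they weren't specified in this issue:

```python
import requests

# Hypothetical route and payload for the proposed get-or-create endpoint.
def get_or_create_dataset(clowder_url, api_key, name, space_id):
    resp = requests.post(
        "%s/api/datasets/getOrCreate" % clowder_url,
        params={"key": api_key},
        json={"name": name, "space": space_id},
    )
    resp.raise_for_status()
    return resp.json()["id"]
```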

Assigning to @max-zilla


@craig-willis (Contributor, Author) commented:

After further discussion, back on me to try the etcd approach.


@max-zilla (Contributor) commented:

Waiting on Nebula.


@max-zilla (Contributor) commented:

@craig-willis any updates here?


@craig-willis (Contributor, Author) commented:

#491


@max-zilla (Contributor) commented:

@craig-willis hasn't revisited this after the etcd timeout issues.
