Replace SDB with S3 in AWS job store #964

Open
hannes-ucsc opened this issue Jun 10, 2016 · 19 comments · May be fixed by #3569

Comments

@hannes-ucsc
Member

hannes-ucsc commented Jun 10, 2016

Quoting an AWS support professional in case 1767267511:

I would recommend seeing if you would consider DynamoDB to replace your SimpleDB solution. DynamoDB is essentially the successor of SimpleDB, which is slowly being pulled out from active development. In fact, we're no longer offering that to new customers at this point.

If Toil should run on newly opened AWS accounts, we need to phase out SimpleDB.

I propose that we create a new, second implementation of the AWS job store that uses DynamoDB. The new implementation should be accessible under the aws job store locator, while the old one becomes aws_old.

The reason I didn't use DynamoDB in the first place was the payment model, which is based on a flat rate as a function of a configurable ("provisioned" in Amazon lingo) request volume. Toil would have to set that request volume to a user-specified value (with a sensible default) before a workflow starts and make sure that it resets it to the lowest possible value on exit.
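
To make the raise-then-reset idea concrete, here is a minimal sketch, not a proposed implementation. It assumes `client` behaves like `boto3.client("dynamodb")` and uses its `update_table` call; the helper names, the table name, and the default of 25 capacity units are made up for illustration.

```python
from contextlib import contextmanager

# Lowest provisioned capacity DynamoDB accepts for a table, in capacity units.
MIN_CAPACITY = 1


def set_capacity(client, table_name, read_units, write_units):
    """Set a table's provisioned read/write capacity."""
    client.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )


@contextmanager
def provisioned_capacity(client, table_name, read_units=25, write_units=25):
    """Raise capacity for the duration of a workflow, then lower it again.

    The default of 25 units is an arbitrary placeholder, not a recommendation.
    """
    set_capacity(client, table_name, read_units, write_units)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, so a failed workflow
        # does not keep paying for the higher rate.
        set_capacity(client, table_name, MIN_CAPACITY, MIN_CAPACITY)
```

A workflow would run inside `with provisioned_capacity(client, "toil-jobs", 50, 50): ...`. A real version would also have to wait for the table to become active again after each update, respect DynamoDB's limits on how often capacity can be lowered, and accept that a hard kill of the leader skips the reset entirely.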

┆Issue is synchronized with this Jira Story
┆friendlyId: TOIL-350

@hannes-ucsc hannes-ucsc self-assigned this Jun 10, 2016
@hannes-ucsc hannes-ucsc added this to the Sprint 04 (3.3.0) milestone Jun 21, 2016
@hannes-ucsc hannes-ucsc modified the milestones: Sprint 05 (3.4.0), Sprint 04 (3.3.0) Jun 28, 2016
@hannes-ucsc hannes-ucsc removed the ready label Jun 28, 2016
@hannes-ucsc hannes-ucsc modified the milestones: Sprint 05 (skipped), Sprint 06 (3.5.0) Jul 5, 2016
@hannes-ucsc hannes-ucsc removed this from the Sprint 06 (3.5.0) milestone Jul 29, 2016
@cket cket added the discuss label Dec 1, 2016
@cket
Contributor

cket commented Dec 1, 2016

We should keep an eye on this. If they start deprecating SDB, we need to start on a Dynamo job store replacement.

@abatilo

abatilo commented Jan 31, 2021

Any chance that this can be revisited?

@DailyDreaming
Member

@abatilo s3 is now strongly consistent, and so this issue is now about replacing SDB with s3. This will probably be worked on relatively soon actually (sometime in the next few months).

@abatilo

abatilo commented Jan 31, 2021

Would it be a big lift? I would be curious to know if I could help.

@DailyDreaming
Member

@abatilo Medium sized, I would guess? It still needs to be explored.

Most of the work would involve removing the current SDB functionality, identifying everything it's shuttling back and forth (primarily items with job attributes, representing jobs to be processed), and then remapping those operations to fetch files from and put files into S3. Jobs would map to job files in S3: the presence of a file signifies a job yet to be run, and a job that has finished should no longer have a file. Most of the changes will be in the https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py file.
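
As a rough illustration of the job-to-file mapping, the sketch below writes one object per job. It assumes `client` behaves like `boto3.client("s3")`; the `jobs/` key prefix, the function names, and the JSON serialization are assumptions made for the example, not Toil's actual layout or format.

```python
import json

JOB_PREFIX = "jobs/"  # hypothetical key prefix, not Toil's actual layout


def job_key(job_id):
    """Map a job store ID to the S3 key of its job file."""
    return f"{JOB_PREFIX}{job_id}"


def save_job(client, bucket, job_id, attributes):
    """Create or overwrite the job file; its presence means "still to run"."""
    client.put_object(
        Bucket=bucket,
        Key=job_key(job_id),
        Body=json.dumps(attributes).encode("utf-8"),
    )
```

Because S3 overwrites are atomic per object, updating a job would be the same `put_object` call with new attributes.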

Some examples:

Loading a job currently uses a jobstore id to key the attributes for a job out of sdb: https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py

This would need to be changed to using the jobstore id to fetch a bucket file by bucket name (aws jobstore name) and key.
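
A loader along those lines might look like the following sketch. It assumes `client` behaves like `boto3.client("s3")`, that job files live under a hypothetical `jobs/` prefix, and that they are stored as JSON; `NoSuchJob` is a stand-in for whatever "job not found" error the job store raises.

```python
import json


class NoSuchJob(Exception):
    """Hypothetical stand-in for the job store's "job not found" error."""


def load_job(client, bucket, job_id):
    """Fetch and decode the job file for `job_id` from the job store bucket."""
    key = f"jobs/{job_id}"  # hypothetical key layout
    try:
        response = client.get_object(Bucket=bucket, Key=key)
    except client.exceptions.NoSuchKey:
        # A missing object means the job finished or never existed.
        raise NoSuchJob(job_id) from None
    return json.loads(response["Body"].read())
```

Translating the S3 error at this one point would keep callers independent of the storage backend.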

Same with deleting a job: https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py or listing jobs: https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py#L322
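
Deleting and listing could be sketched as below, again assuming `client` behaves like `boto3.client("s3")` and that job files sit under a hypothetical `jobs/` prefix. The function names are illustrative only.

```python
JOB_PREFIX = "jobs/"  # hypothetical key prefix


def delete_job(client, bucket, job_id):
    """Delete a job file. S3 treats deleting a missing key as success."""
    client.delete_object(Bucket=bucket, Key=f"{JOB_PREFIX}{job_id}")


def list_job_ids(client, bucket):
    """Yield the ID of every job that still has a job file."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=JOB_PREFIX):
        # "Contents" is absent from a page when nothing matches.
        for entry in page.get("Contents", []):
            yield entry["Key"][len(JOB_PREFIX):]
```

The paginator matters here because a single listing call returns at most 1,000 keys.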

There are also some odd spots where not finding a job needs to be handled in an SDB-specific way, for example:

self.processRemovedJob(issuedJob, resultStatus)

If you want to tackle this, or a portion of it, we'd be happy to have the help and I'd be glad to review code progress on this as well.

@DailyDreaming
Member

@abatilo We have sprint planning tomorrow and I'm going to propose putting this into the upcoming sprint.

@abatilo

abatilo commented Feb 10, 2021

That's awesome. Thank you

@abatilo

abatilo commented Feb 17, 2021

@DailyDreaming Could we still consider DynamoDB? S3 has throughput limits which might become problematic.

@DailyDreaming
Member

@abatilo Yes, that's certainly still a possibility. What kind of limits concern you? The first search hit indicates 3,500 requests per second to PUT data and 5,500 requests per second to GET data on S3. I'm not sure we're going to hit those limits, though it does look like DynamoDB has higher limits.
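
Those S3 request rates apply per key prefix rather than per bucket, so one possible mitigation is to spread job files across several prefixes. The sketch below is only an illustration of that idea: the `jobs/NN/` layout, the function name, and the shard count of 16 are assumptions, and whether it helps in practice depends on how S3 partitions the bucket.

```python
import hashlib


def sharded_job_key(job_id, shards=16):
    """Build a job key under a stable, hash-derived prefix.

    The same job ID always maps to the same prefix, so a job file can be
    found again without any lookup table.
    """
    digest = hashlib.sha256(job_id.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % shards
    return f"jobs/{shard:02d}/{job_id}"
```

All keys still share the common `jobs/` prefix, so listing every job with a single prefix query keeps working.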

@abatilo

abatilo commented Feb 27, 2021

Members of my informatics team have expressed to me that with the current usage of S3, we've had pipelines fail due to hitting S3 limits. I haven't had time to dig in yet but that's why I wanted to bring it up here.

@DailyDreaming
Member

I see. The database is there more to enforce strong consistency, so I'd have to investigate how much the request rate will increase (which I suspect would mostly come from sending HEAD requests to check whether a file exists, rather than checking the DB).
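
For reference, an existence check of that kind could look like this sketch. It assumes `client` behaves like `boto3.client("s3")` and that job files live under a hypothetical `jobs/` prefix; each call costs one HEAD request against the S3 request rate.

```python
def job_exists(client, bucket, job_id):
    """Return True if the job file exists, using a single HEAD request."""
    try:
        client.head_object(Bucket=bucket, Key=f"jobs/{job_id}")
    except client.exceptions.ClientError as error:
        # HEAD responses carry no body, so a missing key is reported
        # with the bare HTTP status "404" as the error code.
        if error.response["Error"]["Code"] == "404":
            return False
        raise
    return True
```

Any other error, such as a permissions failure, is re-raised rather than being mistaken for a missing job.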

@unito-bot unito-bot assigned DailyDreaming and unassigned w-gao Mar 8, 2021
@unito-bot unito-bot changed the title from "Replace SDB with DynamoDB in AWS job store" to "Replace SDB with S3 in AWS job store" Jan 13, 2022
@unito-bot

➤ Adam Novak commented:

Since S3 is strongly consistent now, we’re planning to just use that and not DynamoDB.

@stain

stain commented Jan 19, 2022

Will it be possible to use other S3 backends than AWS?

@adamnovak adamnovak linked a pull request Mar 30, 2022 that will close this issue
19 tasks
@Guigzai

Guigzai commented Feb 1, 2024

Hello,

It would be interesting to get rid of the Amazon dependency so that Toil can be used on on-premises Kubernetes platforms, and therefore to replace SDB with something other than an Amazon solution like DynamoDB.

Would it be possible to consider solutions like Redis, etc.?

Regards

@unito-bot

➤ Adam Novak commented:

Lon is making a cool control flow diagram for this.

@davidjsherman
Contributor

We've been following this issue for a long time, hoping that using a strongly consistent S3 backend as mentioned by @unito-bot would be adopted.

Specifically, we'd like to use Ceph's S3-compatible object storage, which guarantees strong consistency. Ceph is a common cluster storage solution for on-premises Kubernetes, since the Rook operator does the heavy lifting of deploying it.

@adamnovak
Member

We have Ceph now at UCSC, and using Ceph directly (instead of through the shared filesystem) might be interesting.

@davidjsherman
Contributor

What could we (at Inria) do to contribute?

@stxue1
Contributor

stxue1 commented Jun 19, 2024

Lon will probably be the one who would work on this, though it will be a while before this is added to the sprint. We don't have many internal people using the AWS implementation so we haven't had much spare development time for this.

We have a vague idea of implementing jobstore plugins similar to batchsystem plugins, so any ideas or recommendations there would be helpful.

Community contributions are of course always welcome. Unfortunately I'm unsure where those contributions could go, as this is Lon's task and I'm unsure of its current progress. If you want, you could ping him and ask where contributions for this could go.
