AWSJobStores are unrestartable after ~2M files (~10GB metadata) #1809

Closed
joelarmstrong opened this issue Aug 6, 2017 · 1 comment

joelarmstrong commented Aug 6, 2017

SimpleDB has a limit of ~10GB of storage per domain. Unfortunately, that is reached pretty easily within ~1-3M jobs. You end up getting a message like this, indicating that the statsAndLogging thread failed:

Exception in thread Thread-24:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/toil/statsAndLogging.py", line 134, in statsAndLoggingAggregator
    if jobStore.readStatsAndLogging(callback) == 0:
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/jobStore.py", line 561, in readStatsAndLogging
    info.save()
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/jobStore.py", line 966, in save
    expected_value=expected)
  File "/usr/local/lib/python2.7/dist-packages/boto/sdb/domain.py", line 94, in put_attributes
    replace, expected_value)
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/utils.py", line 347, in _put_attributes_using_post
    return self.get_status('PutAttributes', params, verb='POST')
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1227, in get_status
    raise self.ResponseError(response.status, response.reason, body)
SDBResponseError: SDBResponseError: 409 Conflict
<?xml version="1.0"?>
<Response><Errors><Error><Code>NumberDomainBytesExceeded</Code><Message>Too many bytes in this Domain.</Message><BoxUsage>0.0000219907</BoxUsage></Error></Errors><RequestID>bdd2461b-9d81-ae59-49c7-8df85109081d</RequestID></Response>

After that, making any further progress in the workflow is impossible (even after restart).
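
For reference, how close a job store's SimpleDB domains are to that cap can be checked directly with boto 2 (the same library the traceback goes through). This is only a diagnostic sketch, not part of Toil; the region, and the assumption that every domain in the account belongs to the job store, are placeholders, and the metadata field names should be double-checked against your boto version.

```python
# Diagnostic sketch only (not part of Toil): report how close each SimpleDB
# domain is to the ~10 GB per-domain cap behind NumberDomainBytesExceeded.
# The region is a placeholder; point this at the region your AWSJobStore uses.
import boto.sdb

TEN_GB = 10 * 1024 ** 3

conn = boto.sdb.connect_to_region('us-west-2')
for domain in conn.get_all_domains():
    meta = domain.get_metadata()  # boto 2 DomainMetaData
    # Field names as exposed by boto 2's DomainMetaData object.
    used = meta.item_names_size + meta.attr_names_size + meta.attr_values_size
    print('%s: %d items, roughly %.1f%% of the 10 GB limit' %
          (domain.name, meta.item_count, 100.0 * used / TEN_GB))
```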

The root of the problem, IMO, is that Toil doesn't currently have a great way of cleaning up files. Way, way too many files just sit around completely orphaned. cleanup=True sort of works, but it requires users to a) know about it and b) trace the file's flow through the workflow to be sure its lifetime is OK.
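
For context, cleanup=True here presumably refers to the writeGlobalFile() argument. A minimal illustration (not from the issue) of what opting in looks like:

```python
# Minimal illustration of the cleanup=True opt-in on writeGlobalFile():
# the file's lifetime is then tied to the job graph instead of the file
# persisting in the job store indefinitely.
from toil.job import Job

def produce(job):
    path = job.fileStore.getLocalTempFile()
    with open(path, 'w') as f:
        f.write('intermediate data\n')
    # cleanup=True asks Toil to delete the global file once the job and its
    # successors have completed; the default (False) leaves it behind.
    return job.fileStore.writeGlobalFile(path, cleanup=True)

root = Job.wrapJobFn(produce)
```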

I have a script that removes all orphaned files from an AWSJobStore, fixing the immediate problem:

https://gist.github.com/joelarmstrong/75d565d7cc864ebd71a56e53b67b3358

But I think there are some good ways to address this properly, without having to run my hacky script. This wouldn't be an improvement only for massive workflows: even tiny workflows take up far too much space, and people definitely notice when they suddenly have 5 GB less free space on their dev machine.

  • Easy win: I think stats/logging files are stored forever by default. We probably shouldn't do that.
  • Automated cleanup. I believe cleanup=True isn't the default because fileIDs might be passed back as promises. In practice it's pretty easy to detect which fileIDs are passed back as promises (my script does that). We could try, by default, to delete every fileID created by a job except those passed back (even inside a nested Promise) within the job's return value; a rough sketch of that walk follows after this list. I'm hesitant to put yet another feature into Toil, but I think this would ultimately have a positive impact on reliability and user experience.
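
A minimal sketch of that walk, assuming nothing about Toil internals: recursively traverse a job's return value and collect anything that looks like a file ID. is_file_id() is a hypothetical predicate; a real implementation might check isinstance against Toil's file ID type or match the job store's ID format.

```python
# Rough sketch (not Toil code): collect every file ID that escapes a job via
# its return value, however deeply it is nested in containers.
def collect_file_ids(value, is_file_id, found=None):
    if found is None:
        found = set()
    if is_file_id(value):
        found.add(value)
    elif isinstance(value, dict):
        for key, val in value.items():
            collect_file_ids(key, is_file_id, found)
            collect_file_ids(val, is_file_id, found)
    elif isinstance(value, (list, tuple, set, frozenset)):
        for item in value:
            collect_file_ids(item, is_file_id, found)
    return found

# Everything a job wrote minus what escapes via its return value could then
# be deleted as soon as the job succeeds:
# deletable = files_written_by_job - collect_file_ids(return_value, is_file_id)
```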

Maybe we can discuss Monday.

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-190

@ejacox added the discuss label Aug 16, 2017
@DailyDreaming

This should be resolved by #964 and I'm closing this in favor of it.
