AWSJobStores are unrestartable after ~2M files (~10GB metadata) #1809

Closed
joelarmstrong opened this issue Aug 6, 2017 · 1 comment

joelarmstrong commented Aug 6, 2017

SimpleDB has a limit of ~10GB of storage per domain. Unfortunately, that is reached pretty easily within ~1-3M jobs. You end up getting a message like this, indicating that the statsAndLogging thread failed:

Exception in thread Thread-24:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/toil/statsAndLogging.py", line 134, in statsAndLoggingAggregator
    if jobStore.readStatsAndLogging(callback) == 0:
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/jobStore.py", line 561, in readStatsAndLogging
    info.save()
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/jobStore.py", line 966, in save
    expected_value=expected)
  File "/usr/local/lib/python2.7/dist-packages/boto/sdb/domain.py", line 94, in put_attributes
    replace, expected_value)
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/utils.py", line 347, in _put_attributes_using_post
    return self.get_status('PutAttributes', params, verb='POST')
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1227, in get_status
    raise self.ResponseError(response.status, response.reason, body)
SDBResponseError: SDBResponseError: 409 Conflict
<?xml version="1.0"?>
<Response><Errors><Error><Code>NumberDomainBytesExceeded</Code><Message>Too many bytes in this Domain.</Message><BoxUsage>0.0000219907</BoxUsage></Error></Errors><RequestID>bdd2461b-9d81-ae59-49c7-8df85109081d</RequestID></Response>

After that, making any further progress in the workflow is impossible (even after restart).
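
For reference, how close a job store's SimpleDB domains are to that cap can be checked directly with boto 2 (the same library the traceback goes through). This is only a diagnostic sketch, not part of Toil; the region, and the assumption that every domain in the account belongs to the job store, are placeholders, and the metadata field names should be double-checked against your boto version.

```python
# Diagnostic sketch only (not part of Toil): report how close each SimpleDB
# domain is to the ~10 GB per-domain cap behind NumberDomainBytesExceeded.
# The region is a placeholder; point this at the region your AWSJobStore uses.
import boto.sdb

TEN_GB = 10 * 1024 ** 3

conn = boto.sdb.connect_to_region('us-west-2')
for domain in conn.get_all_domains():
    meta = domain.get_metadata()  # boto 2 DomainMetaData
    # Field names as exposed by boto 2's DomainMetaData object.
    used = meta.item_names_size + meta.attr_names_size + meta.attr_values_size
    print('%s: %d items, roughly %.1f%% of the 10 GB limit' %
          (domain.name, meta.item_count, 100.0 * used / TEN_GB))
```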

The root of the problem, IMO, is that Toil doesn't currently have a great way of cleaning up files. Way, way too many files just sit around completely orphaned. cleanup=True sort of works, but it requires users to a) know about it and b) trace the file's flow through the workflow to be sure its lifetime is OK.
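
For context, cleanup=True here presumably refers to the writeGlobalFile() argument. A minimal illustration (not from the issue) of what opting in looks like:

```python
# Minimal illustration of the cleanup=True opt-in on writeGlobalFile():
# the file's lifetime is then tied to the job graph instead of the file
# persisting in the job store indefinitely.
from toil.job import Job

def produce(job):
    path = job.fileStore.getLocalTempFile()
    with open(path, 'w') as f:
        f.write('intermediate data\n')
    # cleanup=True asks Toil to delete the global file once the job and its
    # successors have completed; the default (False) leaves it behind.
    return job.fileStore.writeGlobalFile(path, cleanup=True)

root = Job.wrapJobFn(produce)
```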

I have a script that removes all orphaned files from an AWSJobStore, fixing the immediate problem:

https://gist.github.com/joelarmstrong/75d565d7cc864ebd71a56e53b67b3358

But I think there are some good ways to address this properly, without having to run my hacky script. This wouldn't be an improvement only for massive workflows: even tiny workflows take up far too much space, and people definitely notice when they suddenly have 5 GB less free space on their dev machine.

  • Easy win: I think stats/logging files are stored forever by default. We probably shouldn't do that.
  • Automated cleanup. I believe cleanup=True isn't the default because fileIDs might be passed back as promises. In practice it's pretty easy to detect which fileIDs are passed back as promises (my script does that). We could try, by default, to delete every fileID created by a job except those passed back (even inside a nested Promise) within the job's return value; a rough sketch of that walk follows after this list. I'm hesitant to put yet another feature into Toil, but I think this would ultimately have a positive impact on reliability and user experience.
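
A minimal sketch of that walk, assuming nothing about Toil internals: recursively traverse a job's return value and collect anything that looks like a file ID. is_file_id() is a hypothetical predicate; a real implementation might check isinstance against Toil's file ID type or match the job store's ID format.

```python
# Rough sketch (not Toil code): collect every file ID that escapes a job via
# its return value, however deeply it is nested in containers.
def collect_file_ids(value, is_file_id, found=None):
    if found is None:
        found = set()
    if is_file_id(value):
        found.add(value)
    elif isinstance(value, dict):
        for key, val in value.items():
            collect_file_ids(key, is_file_id, found)
            collect_file_ids(val, is_file_id, found)
    elif isinstance(value, (list, tuple, set, frozenset)):
        for item in value:
            collect_file_ids(item, is_file_id, found)
    return found

# Everything a job wrote minus what escapes via its return value could then
# be deleted as soon as the job succeeds:
# deletable = files_written_by_job - collect_file_ids(return_value, is_file_id)
```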

Maybe we can discuss Monday.

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-190

@ejacox added the discuss label Aug 16, 2017
@DailyDreaming

This should be resolved by #964 and I'm closing this in favor of it.
