SimpleDB has a limit of ~10GB of storage per domain. Unfortunately, that is reached pretty easily within ~1-3M jobs. You end up getting a message like this, indicating that the statsAndLogging thread failed:
```
Exception in thread Thread-24:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/toil/statsAndLogging.py", line 134, in statsAndLoggingAggregator
    if jobStore.readStatsAndLogging(callback) == 0:
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/jobStore.py", line 561, in readStatsAndLogging
    info.save()
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/jobStore.py", line 966, in save
    expected_value=expected)
  File "/usr/local/lib/python2.7/dist-packages/boto/sdb/domain.py", line 94, in put_attributes
    replace, expected_value)
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/utils.py", line 347, in _put_attributes_using_post
    return self.get_status('PutAttributes', params, verb='POST')
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1227, in get_status
    raise self.ResponseError(response.status, response.reason, body)
SDBResponseError: SDBResponseError: 409 Conflict
<?xml version="1.0"?>
<Response><Errors><Error><Code>NumberDomainBytesExceeded</Code><Message>Too many bytes in this Domain.</Message><BoxUsage>0.0000219907</BoxUsage></Error></Errors><RequestID>bdd2461b-9d81-ae59-49c7-8df85109081d</RequestID></Response>
```
After that, making any further progress in the workflow is impossible (even after restart).
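For scale, a quick back-of-the-envelope check (assuming the ~10 GB domain limit and that metadata accumulates roughly linearly with job count) suggests each job leaves on the order of kilobytes of SimpleDB metadata behind:

```python
# Rough SimpleDB metadata per job implied by hitting the ~10 GB domain
# limit somewhere between 1M and 3M jobs. Pure arithmetic, not a
# measurement of Toil's actual per-job footprint.
DOMAIN_LIMIT_BYTES = 10 * 1024**3

for jobs in (1_000_000, 3_000_000):
    per_job = DOMAIN_LIMIT_BYTES // jobs
    print(f"{jobs} jobs -> ~{per_job} bytes of metadata per job")
```

That works out to very roughly 3.5–10 KB per job, which is consistent with job metadata and stats/logging entries never being cleaned up.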
The root of the problem, IMO, is that Toil doesn't currently have a great way of cleaning up files. Way, way too many files just sit around completely orphaned. `cleanup=True` sort of works, but it requires users to a) know about it and b) trace the file flow to be sure the file's lifetime is OK.
I have a script that removes all orphaned files from an AWSJobStore, fixing the immediate problem:
https://gist.github.com/joelarmstrong/75d565d7cc864ebd71a56e53b67b3358
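The core idea behind such a script can be sketched as a set difference: enumerate every file the job store knows about, enumerate every file still reachable from a live job, and delete the rest. The function and ID values below are hypothetical stand-ins, not the real job-store API:

```python
def find_orphaned_files(all_file_ids, referenced_file_ids):
    """Return the file IDs that no live job references any more.

    Both arguments are iterables of opaque file ID strings; the names
    here are illustrative, not Toil's actual interface.
    """
    return set(all_file_ids) - set(referenced_file_ids)


# Example with dummy IDs: f1 and f3 are orphaned and could be deleted.
orphans = find_orphaned_files(
    all_file_ids=["f1", "f2", "f3", "f4"],
    referenced_file_ids=["f2", "f4"],
)
```

The hard part in practice is computing `referenced_file_ids` correctly, which is exactly the promise-tracing problem described below.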
But I think there are some good ways to address this properly, without having to run my hacky script. This wouldn't be an improvement only for massive workflows. Even tiny workflows take up far too much space, and people definitely notice that they suddenly have 5GB less free space on their dev machine.
1. **Easy win:** I think stats/logging files are stored forever by default. We probably shouldn't do that.
2. **Automated cleanup:** I believe `cleanup=True` isn't the default because fileIDs might be passed back as promises. Well, it's realistically pretty easy to detect which fileIDs are passed as promises (my script does that). We could try, by default, to delete all fileIDs created by a job except those passed back (even as a nested Promise) within the job's return value. I'm hesitant to put yet another feature into Toil, but I think this would ultimately have a positive impact on reliability and user experience.
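Detecting which fileIDs escape via the return value could look roughly like the sketch below. `FileID` here is a hypothetical marker type standing in for whatever Toil actually uses to tag file IDs, and the traversal only covers common container types:

```python
class FileID(str):
    """Hypothetical stand-in for a job store file ID."""


def collect_file_ids(value):
    """Recursively walk a job's return value and yield every FileID it
    contains, however deeply nested in dicts, lists, tuples, or sets."""
    if isinstance(value, FileID):
        yield value
    elif isinstance(value, dict):
        for v in value.values():
            yield from collect_file_ids(v)
    elif isinstance(value, (list, tuple, set)):
        for v in value:
            yield from collect_file_ids(v)


# Files created by the job but absent from its return value would be
# candidates for automatic deletion once the job finishes.
created = {FileID("a"), FileID("b"), FileID("c")}
escaped = set(collect_file_ids({"out": [FileID("a"), (FileID("b"),)]}))
deletable = created - escaped
```

In the real system this traversal would have to run over the promised return value before it is fulfilled, so the actual implementation would need to hook into Toil's Promise machinery rather than plain containers.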
Maybe we can discuss Monday.
┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-190