Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange behavior when EE2 throws an exception from finish_job #43

Open
MrCreosote opened this issue Mar 8, 2021 · 1 comment
Open
Labels
bug Something isn't working

Comments

@MrCreosote
Copy link
Member

I recently introduced a bug such that EE2 would throw an exception from the finish_job method because catalog_utils.catalog is not an extant field:
https://github.com/kbase/execution_engine2/blob/aecddbf3357eba07c2da92a1538387a2779e6ea1/lib/execution_engine2/sdk/EE2Status.py#L549

However, the job runner, when running an app in the narrative, seemed to ignore the exception - it didn't print a stack trace, at least. Instead it logged the following:

Job is done
1615180489.855946:ERROR:jr:WARNING: Module DataFileUtil was already used once for this job. Using cached version: url: https://github.com/kbaseapps/DataFileUtil commit: b640d1ce826f46d1bff497a2f475a4c7de016a18 version: 0.1.4 release: release
INFO:root:Running docker container for image: dockerhub-ci.kbase.us/kbase:datafileutil.b640d1ce826f46d1bff497a2f475a4c7de016a18
INFO:root:About to run 34f56f72-7fcd-11eb-849a-02ac072d09e7 dockerhub-ci.kbase.us/kbase:datafileutil.b640d1ce826f46d1bff497a2f475a4c7de016a18
INFO:root:Container id is b450fe482fc4f5315485edb4aa127b6deff68c644002064bc30230132ec880f3
1615180494.837040:ERROR:jr:WARNING: Module DataFileUtil was already used once for this job. Using cached version: url: https://github.com/kbaseapps/DataFileUtil commit: b640d1ce826f46d1bff497a2f475a4c7de016a18 version: 0.1.4 release: release
INFO:root:Running docker container for image: dockerhub-ci.kbase.us/kbase:datafileutil.b640d1ce826f46d1bff497a2f475a4c7de016a18
INFO:root:About to run 37eeefaa-7fcd-11eb-849a-02ac072d09e7 dockerhub-ci.kbase.us/kbase:datafileutil.b640d1ce826f46d1bff497a2f475a4c7de016a18
INFO:root:Container id is 93eb0680402df0c88d3e2066f64efb77a08df1fcde02701ed59ee21767fae635
1615180499.823481:ERROR:jr:WARNING: Module DataFileUtil was already used once for this job. Using cached version: url: https://github.com/kbaseapps/DataFileUtil commit: b640d1ce826f46d1bff497a2f475a4c7de016a18 version: 0.1.4 release: release
INFO:root:Running docker container for image: dockerhub-ci.kbase.us/kbase:datafileutil.b640d1ce826f46d1bff497a2f475a4c7de016a18
INFO:root:About to run 3ae7dc9e-7fcd-11eb-849a-02ac072d09e7 dockerhub-ci.kbase.us/kbase:datafileutil.b640d1ce826f46d1bff497a2f475a4c7de016a18
INFO:root:Container id is 36ea0f819dd8fcb5a2534e5cafd6775db56841afab80d199cd0ef3413100862e
INFO:sanic.root:Stopping worker [11342]
INFO:sanic.root:Server Stopped
INFO:jr:Logger initialized for 6045b269172cf13980dabe9a
The monitor jobrunner script killed your job. It is possible the node got overloaded with too many jobs! Memory={'total': 33729970176, 'available': 31053275136, 'percent': 7.9, 'used': 2393518080, 'free': 8088121344, 'active': 2393497600, 'inactive': 19168612352, 'buffers': 136212480, 'cached': 23112118272, 'shared': 1810432, 'slab': 3632816128} CPU=26.1 

After fixing the bug, the logs ended after the Job is done line.

I would expect the job runner to print the exception and some sort of severe warning message that it couldn't complete the job properly.

@MrCreosote MrCreosote added the bug Something isn't working label Mar 8, 2021
MrCreosote added a commit to kbase/execution_engine2 that referenced this issue Mar 9, 2021
EE2status attempting to access get_catalog_utils().catalog caused
finish_job() to throw an exception and consequently the job runner to freak out:

kbase/JobRunner#43

Accessed the catalog directly and added a unit test that covers the code
in question.
@MrCreosote
Copy link
Member Author

I'm guessing this line needs a try/except and logging: https://github.com/kbase/JobRunner/blob/master/JobRunner/JobRunner.py#L293

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant