At high log volume and small-ish CPU/memory reservations, our PXM worker command (Python, https://github.com/mapbox/pxm/blob/dev/cloudformation/pxm.template.js#L82) is (I believe) breaking the pipe to its parent worker. Logging less and increasing the stack's reservations helps, but I'm interested in solving the root problem.
When I say that we break the pipe, what I mean is that the PXM worker gets a 120 exit code from its child process at https://github.com/mapbox/ecs-watchbot/blob/master/lib/worker.js#L18. Python only sets this 120 code when it fails to flush stderr and stdout just before exiting. In the jobs where we see 120 exit codes, we also see log messages missing where we would expect them: the messages our worker command sends to stderr never make it through, and then the final flush of the streams fails. I do not understand at all how Python ends up with its stderr in a busted state.
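For reference, as far as I can tell you can provoke that 120 status locally with something like the sketch below. This is only demonstrating the CPython behavior (a failed flush of the std streams during interpreter shutdown), not a claim about what PXM is doing to its own streams:

```python
# Minimal sketch: leave data buffered on stdout, then break the underlying
# file descriptor so the interpreter's exit-time flush fails.
# Run with stdout redirected (e.g. `python3 repro.py > /dev/null; echo $?`)
# so stdout is block-buffered rather than line-buffered.
import os
import sys

sys.stdout.write("buffered output that never gets flushed")

# Close fd 1 out from under the io stack. The flush during interpreter
# shutdown then raises, finalization reports failure, and the process
# should exit with status 120 instead of 0.
os.close(sys.stdout.fileno())
```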
I've tried to reproduce this problem outside of a deployed ecs-watchbot stack, but the watchbot logger can keep up with logs as fast as a single-threaded Python program can write them. I might be able to break the logger by running everything in a resource-limited container and logging from multiple threads (as we do in PXM), but before I do that I'm wondering whether there are known limits to the watchbot logger and whether this is something we should try to fix at all.
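The stress test I have in mind would look roughly like this, run inside a CPU/memory-constrained container behind the watchbot logger. The thread count and message size here are made up for illustration, not PXM's actual values:

```python
# Rough sketch: several threads hammering stderr as fast as they can, to see
# whether the downstream logger falls behind under resource pressure.
import sys
import threading

MESSAGE = ("x" * 512) + "\n"   # arbitrary payload size
THREADS = 8                    # arbitrary thread count
LINES_PER_THREAD = 200_000

def spam():
    for _ in range(LINES_PER_THREAD):
        sys.stderr.write(MESSAGE)

threads = [threading.Thread(target=spam) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```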
@rclark I hope you're doing well. Don't sweat this one, just letting you know your favorite pathological users have found another limit to computers 😄
We've run up against similar challenges before when running things as Node.js child processes. However, in this case it sounds like we're probably doing that piece all right, but there's a pretty thick pile of Node code between your process and the container's stdout.
There's a lot of room here for Python writing faster than Node can consume to end up as a backup on the Python side. It is kind of interesting that you say this happens during a pre-shutdown flush -- that sounds like there's backed-up data waiting to be flushed, but I wonder if the pipeline in Node.js is getting shut down prematurely? Or maybe there's a limit to how long Python is willing to wait to flush, and Node.js is still backed up when time runs out?
There's even the possibility that the way your container's logging is hooked up to CloudWatch Logs introduces some degree of slowdown. There are a lot of places here where I really don't know the implementation specifics.
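To make the "backup" idea a little more concrete, here's a generic pipe-backpressure sketch (plain OS behavior, nothing watchbot-specific): a writer whose reader stops consuming ends up blocked once the kernel pipe buffer fills, and if the reader then goes away entirely, later writes and the exit-time flush fail instead of blocking.

```python
# Generic illustration of pipe backpressure: the parent never reads the
# child's stdout, so the child blocks once the pipe buffer (~64 KB on Linux)
# is full.
import subprocess
import sys
import time

# Child that writes to stdout as fast as it can; we deliberately never read the pipe.
child = subprocess.Popen(
    [sys.executable, "-u", "-c",
     "import sys\n"
     "while True:\n"
     "    sys.stdout.write('y' * 1024)\n"],
    stdout=subprocess.PIPE,
)

time.sleep(2)  # plenty of time for the pipe buffer to fill
# The child is now blocked inside write(2), waiting on a reader that never drains it.
print("child alive and blocked on a full pipe:", child.poll() is None)
child.kill()
```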