runtime: fpTracebackPartialExpand SIGSEGV on amd64 with deep inlining #69629
What's odd is this ought to repro it if my theory is right, but it does not: https://go.dev/play/p/SGNoe38oMwP
CC @golang/runtime
Here's the diff of the generated assembly when inlining of that specific function is enabled/disabled, and here's the inlining decision:
There's another layer of inlining above though:
and another layer on top of that:
and another layer on top of that:
I'll get an answer to you tomorrow regarding running that. Tested https://go.dev/play/p/NMzamFE6hvu really quickly and it doesn't crash either. Unfortunate! Unfortunately, if inlining in further upstream calls is implicated, I'm not able to give you the source to anything above the /lz4 node in the package tree.
Disabling that individual inlining decision within the package causes it to start crashing with a different traceback, still from the same inlined location, because I need to play whack-a-mole with pgohash across each place it could possibly be inlined. I'm going to use
setting
Thanks. It would also help if you could disassemble the function containing the crash PC (
Oof, SP is completely bogus? Also, by the way, do you know what is up with the formatting of your stack traces? Some frames are missing filename/line, and some of the frames don't make much sense:
How did
Unfortunately, this is being mangled by our Lambda stderr/stdout log agent, because it's executing within AWS Lambda. It is also unfortunately theoretically possible that there are multiple crashes in close proximity that are resulting in interleaving of multiple dumps, although I don't think that's the case in any of the dumps I sent you, because I only see one
For extra measure,
Ah, that's much better. That is the dereference of the FP.
I spent half an hour sorting the stacks from the randomly ordered output by fp and sp on my latest build, because of that damn lack of microsecond precision in Lambda-generated log entries; you're welcome ;) I have now flagged off, using pgohash, what I believe to be all of the relevant inlining, and it's now breaking at a different spot, which I think means I have to add more to my pgohash and keep playing whack-a-mole.
Relevantly:
I think this is more in line with what you were expecting to see; I've manually checked everything after unscrambling, I think.
I'll try again to remove the inlining at
No luck with successively trying to remove inlining; it keeps crashing at new places where it trips over inlining again. It appears that any inlining at or upstream of compactFileReader.readRecord() results in the saveblockevent / fpTracebackPartialExpand path being unable to handle traversing the stack upwards. Perhaps some kind of interaction with generics or with devirtualisation? I'll work that theory in the Go playground with a more minimal example; it is painful to have to wait 20 minutes per build cycle and attempt to repro in our staging environment.
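For readers following along, a rough sketch of what frame-pointer unwinding amounts to on amd64 may help explain why this path segfaults when a frame pointer has been corrupted. This is illustrative code only, not the runtime's actual fpTracebackPartialExpand implementation (which also expands inlined frames):

```go
package main

import (
	"fmt"
	"unsafe"
)

// walkFrames mimics the core of frame-pointer unwinding on amd64: each frame
// saves the caller's frame pointer at [fp], and the return PC sits at [fp+8],
// so the unwinder loads both and hops to the next frame. If anything in the
// call chain has left BP pointing at garbage, these loads fault, which is the
// SIGSEGV reported in this issue.
func walkFrames(fp uintptr, max int) []uintptr {
	var pcs []uintptr
	for i := 0; i < max && fp != 0; i++ {
		pcs = append(pcs, *(*uintptr)(unsafe.Pointer(fp + 8))) // return PC
		fp = *(*uintptr)(unsafe.Pointer(fp))                   // saved caller BP
	}
	return pcs
}

func main() {
	// Not called with a real frame pointer here: obtaining one requires
	// assembly or runtime internals. Shown only to make the layout concrete.
	fmt.Println(walkFrames(0, 0))
}
```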
Apologies if I missed it, but how have you concluded that inlining/PGO is where things are going wrong? Do you see crashes in builds without PGO? Also, does the crash output include a register dump? It'd look something like this:
I ask because, based on #69629 (comment), the register
Correct, disabling PGO causes the bug to not appear. Funnily enough, a different PGO file that does not trigger optimisations in lib/retriever/cstorage/lz4 (because the package was added after that PGO file was captured) also does not trigger the bug.
Nope, it does not appear to :( Now that we know the problem seems to be confined to only the currently running goroutine, I'm tempted to pare back our dumps to show only the current goroutine to increase the likelihood we get the register dump output.
No, it cannot include a register dump. This is the failure path inside (thus the message
Indeed. We do reach the
If we had a core dump we could definitely get all the registers, plus inspect the state of the stack we're failing to unwind. I've had some luck with that on a previous frame pointer issue. You can set
I've tried reproducing this as well. No luck yet, but I did uncover a different (maybe not related) bug (#69747). It's tricky to construct a CPU profile for PGO which influences inlining for specific functions, at least for a toy example.
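As a side note on the core-dump route: a minimal sketch, assuming the usual GOTRACEBACK mechanism is what was being suggested above, is to raise the traceback level so the runtime aborts on a fatal error and the OS can write a core file:

```go
package main

import "runtime/debug"

func main() {
	// Roughly equivalent to running with GOTRACEBACK=crash: on a fatal
	// error the runtime dumps all goroutines and then aborts, letting the
	// OS write a core file (subject to ulimit -c and core_pattern).
	debug.SetTraceback("crash")

	// ... rest of the program ...
}
```

The resulting core can then be opened with `dlv core` (or viewcore) to inspect registers and the state of the stack being unwound.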
Unfortunately, getting a core dump off of a running Lambda is going to be nigh impossible. However, one option we are open to is modifying our Go runtime during the build process; we have a toolchain for applying patches to
Oh cool, if you can patch the runtime maybe we can do even better with debug logging. Something like this should capture some useful info: debuglog.patch. If you apply that and build the program with
That should at least definitively tell us how far the unwinding gets when it crashes.
Heisenbug: with PGO enabled, -tags debuglog, and the runtime patch applied, it no longer crashes. I am attempting to get the bug to happen again; will keep you posted.
Got it to repro, with the debugging on. Some additional code we'd landed in the main branch had caused PGO to make different decisions, causing it not to repro any more. I apologise for the fact that multiple of these are all mingled together, but here you go. Filtering on goid should give you what you need.
Yay! Thank you for making that modification and collecting all this data! Grabbing one example:
From other stuff shared in this issue, it seems to consistently crash at dereferencing
All the other cases in the file look like that. It's interesting to me that all the crashes in that file dereference the exact same address... Here's how I read this example, cross-referencing the full stack trace from this comment:
One more example:
Different call sequence, but once again things seem to go bad going up to
I'm wondering how
Yup, readRecord and toIndex are safe for me to share. I'll get those to you; what's the best email address to use for you? It's likely going to sprawl bigger than we want in a GH issue. Unfortunately
You can send the info to [email protected]
Thank you @nsrip-dd. The problem was not in fact the Go compiler, but was instead a latent bug in https://github.com/dgryski/go-metro/blob/master/metro_amd64.s per dgryski/go-metro#9 |
Thanks to @lizthegrey for gathering so much info to help debug this. Some more detail in case anybody's wondering how we tracked that down: I looked up the 0x30bc5b29 value that shows up all over this issue. It's a constant used for the metrohash algorithm. I hadn't seen any obvious issues in the code/compiler output so far, so I was beginning to suspect bad custom assembly. I looked to see if any Go versions of metrohash use assembly for amd64. Turns out that one does! As for why it was an issue after enabling PGO but not before, given something like this sequence of calls:
If
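To make that concrete, here is a hedged sketch using hypothetical stand-ins: hashAsm is a placeholder for the hand-written assembly leaf (the real culprit was metro_amd64.s), while readRecord and toIndex echo the method names discussed earlier in this thread. The comments describe my reading of the mechanism, not verified compiler output:

```go
package main

import "fmt"

// hashAsm stands in for a hand-written assembly leaf that uses BP as a
// scratch register without saving and restoring it.
//
//go:noinline
func hashAsm(b []byte) uint64 {
	var h uint64
	for _, c := range b {
		h = h*31 + uint64(c)
	}
	return h
}

// Without PGO, toIndex is an ordinary call: its prologue pushes BP and its
// epilogue restores it, so a BP clobbered inside hashAsm is repaired as soon
// as toIndex returns. With PGO-driven inlining, toIndex's body is folded into
// readRecord, that push/pop pair disappears, and a bogus BP (e.g. the
// metrohash constant 0x30bc5b29) can survive long enough for the runtime's
// frame-pointer unwinder, e.g. in saveblockevent, to chase it and fault.
// This is a reading of the mechanism, not verified compiler output.
func toIndex(b []byte) uint64 { return hashAsm(b) }

func readRecord(b []byte) uint64 { return toIndex(b) }

func main() {
	fmt.Println(readRecord([]byte("example")))
}
```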
Thanks for tracking this down and following up here. |
Change https://go.dev/cl/640075 mentions this issue: |
I just wanted to say thank you to @lizthegrey and @nsrip-dd for the amazing debugging on this issue! |
Go version
go version go1.23.1 linux/amd64
Output of go env in your module/workspace:
What did you do?
Compiled using PGO, which inlined this (otherwise not normally inlinable) method:
into this method:
and for completeness, the struct:
and also our runtime profiling config:
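As a hypothetical stand-in for that config (the real values aren't reproduced in this issue), a block-profiler setup of the kind that sends the runtime down the saveblockevent path would look something like:

```go
package main

import "runtime"

func init() {
	// Block profiling must be enabled for the runtime to record blocking
	// events at all; that recording path (saveblockevent, and in go1.23
	// fpTracebackPartialExpand) is where this crash occurs. The rate below
	// aims for roughly one sample per 10ms of blocked time and is purely
	// illustrative.
	runtime.SetBlockProfileRate(10_000_000)
}

func main() {}
```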
What did you see happen?
Rarely, when a block longer than a second occurs reading from the channel:
The problem appears to be at https://github.com/golang/go/blob/go1.23.1/src/runtime/mprof.go#L592, when unwinding the inlined calls on the stack.
We cannot repro this behaviour on arm64; it only repros on amd64. We also do not believe it would repro on go1.22, based on what we are seeing in the code that changed from 1.22 to 1.23 (https://go-review.googlesource.com/c/go/+/533258 and 87abb4a are both new additions), but we've already upgraded to 1.23 semantics in our full codebase and don't want to attempt to undo them all.
What did you expect to see?
No crash.