Firmware panic in network code #2530
Here's the kernel dump and the firmware binaries. Note that the kernel dumped had a
Could you post the code/related snippets for everything involving the handling of
Thanks for the quick reply. For running it's literally just

```python
@kernel
def run_seq_kernel(self):
    run_rtios(self._scan_artiq_seqs[self._dax_scan_index])
```

where `run_rtios` generates the RTIO output from the uploaded data. For the static data version, the array is assigned on the host. For the rpc version, I have seen corruptions, and realizing that the allocation was on the stack, I moved the allocation out to the uppermost level of the code. Hence the setting of `self._scan_artiq_seqs` happens in

```python
@kernel
def _dax_control_flow_run_kernel(self):
    if self.is_seq_scan:
        self._scan_artiq_seqs = self.get_rtio_data()
    try:
        DaxScan._dax_control_flow_run_kernel(self)
    finally:
        if self.is_seq_scan:
            self._scan_artiq_seqs = self._old_scan_artiq_seqs
```

This is the kernel entry point and all the usage is within it. I believe there is no other use of the array in our code.
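For illustration, a minimal sketch of what the rpc upload path described above could look like on ARTIQ 6 (the class name, data, and array length are assumptions; only the `get_rtio_data` name and the `TList(TArray(TInt32))` annotation come from the report):

```python
from artiq.experiment import *
import numpy as np


class RpcUploadSketch(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    # Host-side RPC: runs on the PC; the return annotation matches the
    # one from the report (a list of int32 arrays).
    def get_rtio_data(self) -> TList(TArray(TInt32)):
        seq = np.arange(1000, dtype=np.int32)  # placeholder data
        return [seq.copy() for _ in range(11)]

    @kernel
    def run(self):
        self.core.reset()
        # Fetch over RPC at the top level of the kernel, as described above,
        # so the returned list is not allocated deep in the call tree.
        seqs = self.get_rtio_data()
        print(len(seqs))
```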
And yes, the kind of call pattern in #1394 was exactly what I noticed and fixed by moving the rpc to the top level. Looking at the LLVM IR, it does seem like the
Just to be sure I understand correctly (since that wasn't clear to me from the initial post), for the "non-RPC" case you assign the array on the host? Also, when you say that the experiment runs these 20k times, does that mean that the crash happens even without ever returning from the kernel? I'm asking because attribute writeback is another fraught issue with lists (as you pointed out).
More or less; more specifically, the host assigns the data to the attribute.
I have seen this happening on the literal first run after rebooting kasli a few times yesterday, so returning shouldn't have an effect there, I hope. The kernel does return after running this 20k times, and at least for today it usually takes ~10 runs to trigger.
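For comparison, a minimal sketch of the non-RPC ("static data") path described in the report, where the host fills an attribute and the kernel only reads it (names and data are assumptions):

```python
from artiq.experiment import *
import numpy as np


class StaticUploadSketch(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    def prepare(self):
        # Host side: fill the attribute before the kernel starts.
        seq = np.arange(1000, dtype=np.int32)  # placeholder data
        self.scan_seqs = [seq.copy() for _ in range(11)]

    @kernel
    def run(self):
        self.core.reset()
        # Kernel side: the attribute contents are uploaded together with
        # the kernel and are only read here, never reassigned.
        print(len(self.scan_seqs))
```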
If your experiment doesn't actually require any hardware readback, you might be interested in a branch I have that adds an emulation layer to run kernels on the host by compiling them to x86_64 Linux instead, where you can then use Valgrind/GDB/… to investigate memory corruption issues in the RPC code as usual. The more hardware syscalls (e.g. rtio_input/…) you need to mock out to get your kernel to compile, the more annoying this would be to use, though.
We do have some input elsewhere in the kernel that measures PMT counts and does an async rpc call to save them. It roughly looks like `self.histogram.append([self._detection.count(c) for c in channels])` where
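As a rough sketch of this kind of readback path (class and method names, the flag placement, and the values are assumptions, not the actual experiment code), an async RPC could be declared like this in ARTIQ:

```python
from artiq.experiment import *


class AsyncHistogramSketch(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.histogram = []

    # Async RPC: the kernel queues the call and continues without waiting
    # for the host to finish appending.
    @rpc(flags={"async": True})
    def save_counts(self, counts):
        self.histogram.append(counts)

    @kernel
    def run(self):
        self.core.reset()
        # In the real experiment the counts come from PMT detection
        # (rtio input); fixed values stand in for them here.
        self.save_counts([3, 5, 2])
```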
Oh, forgot to mention that I have not observed a crash if I disabled the
Hmm, could this point to a hardware issue (e.g. power supply stability)? Or maybe the crash is related to the RTIO analyzer, if you have that enabled (which would cause changes to the network activity based on RTIO activity)? Does enabling TRACE-level logging reveal anything/change the behaviour? These are just random suggestions for debugging, of course. I don't have any particularly good guesses at this stage (without time and access to a reproduction case), nor do we seem to have seen this particular crash in Oxford based on the log archives.
I was thinking about it and it's entirely possible that the correlation with
How do I do this? I still feel like this is either a race/re-entrancy issue or a memory corruption. A few conceptual questions about the overall structure:
Thanks.
The firmware is loaded in from flash, which should only be written to in very limited circumstances (e.g. when uploading the startup/idle kernels).
On the comms CPU, all the dynamic allocations should be served from the heap. The heap allocator will panic (and print a memory map) when it runs out of free space. IIRC only the kernel CPU
OK thanks I'll try this. We are logging via UART.
So I assume that even if the firmware code is corrupted in memory, it shouldn't persist across a reboot. This would be consistent with what we saw, since we did try to reflash all the firmware and it didn't make a difference.
I see. The part I was specifically referring to was https://github.com/m-labs/artiq/blob/release-6/artiq/firmware/runtime/session.rs#L161, which seems to be talking about a pointer the kernel receives, so I thought it was allocated somewhere else. Or is it a pointer the kernel sends to the firmware and then receives back? We've also noticed that disabling the rpc call and just using static data reduces the frequency of the panic. It has also happened many times for us in other experiments that make extensive use of rpc (the PMT monitor experiment), even without the aforementioned experiment.
Oh, apologies, the stack guard page was only added with the switch to RISC-V on ARTIQ 7 and newer. I don't recall whether it was possible to just corrupt memory when exceeding the allotted stack in the old version. If I were in your shoes, I'd consider upgrading before sinking more time into debugging memory corruption issues, as the PMP support on the RISC-V cores helps, and some bugs related to RPC codegen might already be fixed now. (Of course, we ran with or1k-based firmware here in Oxford for a long time, but anecdotally, unexplained corruption issues seemed more frequent there.)
Upgrading artiq won't be trivial for us at the moment but could happen in a year or so, I hope. In the meantime, I got a trace that I assume is related to the issue,
The dashboard shows a connection reset by peer error at 17:24:05, which is similar to/the same as the one I got previously when the firmware panicked. This was also triggered by switching back to using rpc to upload the data. The difference, though, is that the firmware did not panic and I can run another experiment without any problem. Also, as soon as I enabled the more verbose logging, I could see these network errors popping up. I have not tried restarting kasli to see if these errors persist.
Oh, and the rpc's should all be calls to a function that was defined like
that's significantly shorter than all the previous ones…
Is there a way I can figure out the content/reason for those dropped Ethernet packets without a separate hardware monitor?
These are likely just broadcast packets that are part of the regular background chatter on the network. The easiest way to monitor them (apart from modifying smoltcp to print them out) would be to insert a PC with two network cards between the Kasli and the rest of the network, such that you can monitor everything using Wireshark. As mentioned above, though, personally I'd really recommend spending the time on an ARTIQ upgrade instead. Even the big or1k/RISC-V switch was painless for us here, and any work spent on this is not "wasted" in the same way that hunting down a bug in an out-of-date version is. Of course, your circumstances might differ.
Unfortunately I'm not the one who decides what version to use, and I'm the only one having this issue... I'm fully aware that debugging the issue on an older version of the firmware would likely attract little/no interest from anyone else, but with 80% of my time running the experiment spent on restarting artiq, I don't have a much better choice.
Upgraded to artiq-7 and just got another crash with this backtrace,
It seems that this might be #2123 though
Looking at the backtraces for some of the crashes I've seen before on artiq 6, I did find a few of,
The call stack looks very similar even though the exact error location was different.
Bug Report
One-Line Summary
Frequent and reproducible firmware panic when running the experiment.
Issue Details
Steps to Reproduce
The experiment involves uploading an int32 array to the kernel from the host and generating output from the data. I have not successfully reduced the kernel, but here's the part that generates the output from the data
as well as one of the data sets that fairly repeatably triggers the issue.
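The actual snippet and data are not reproduced in this rendering of the issue. Purely as a generic illustration of "generate RTIO output from an int32 array" (the device, timing, method names, and data below are all assumptions, not the reporter's code), a reduced kernel might look like:

```python
from artiq.experiment import *


class SequencePlaybackSketch(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl0")  # assumed output channel

    @kernel
    def play_seq(self, seq):
        # Walk the int32 data and emit a pulse for each nonzero entry.
        for i in range(len(seq)):
            delay(1 * us)
            if seq[i] != 0:
                self.ttl0.pulse(500 * ns)

    @kernel
    def run(self):
        self.core.reset()
        seq = [0, 1, 0, 1, 1]  # stands in for the attached data
        self.play_seq(seq)
```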
The data is uploaded (repeated 11 times in a list) either by returning it from an rpc call with return type `TList(TArray(TInt32))`, or by assigning it to a field of the experiment object and accessing the field in the kernel. Both methods can trigger the panic, though the rpc one may be triggering it somewhat more frequently. Each experiment runs through these ~20k times, and this can happen anywhere between the 1st experiment and the 20th.
Expected Behavior
The kernel runs.
Actual (undesired) Behavior
The kernel panics. The actual errors vary a lot, but the most frequent one is
The values 31 and 19 are very stable.
Decoded backtrace.
A related question is how much memory (on artiq 6) is available for the different kinds of data: static data in the kernel, rpc data per packet, rpc return data that's allocated on the kernel stack, etc. I cannot easily reproduce this without uploading the data buffer, so I want to know what the limits are and whether any of them can be exceeded without an explicit error message.
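As a back-of-envelope check on the upload size (the per-array length below is an assumption; the real one is in the attached data):

```python
# Rough payload size of the uploaded list of int32 arrays.
arrays = 11                   # the data is repeated 11 times in a list
entries_per_array = 10_000    # assumed length, not the real figure
bytes_per_int32 = 4
payload_bytes = arrays * entries_per_array * bytes_per_int32
print(payload_bytes)          # 440000 bytes, ~430 KiB before RPC framing overhead
```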
Your System (omit irrelevant parts)