Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Firmware panic in network code #2530

Open
yuyichao opened this issue Aug 2, 2024 · 21 comments
Open

Firmware panic in network code #2530

yuyichao opened this issue Aug 2, 2024 · 21 comments

Comments

@yuyichao
Copy link

yuyichao commented Aug 2, 2024

Bug Report

One-Line Summary

Frequent and reproducible firmware panic when running the experiment.

Issue Details

Steps to Reproduce

The experiment involves uploading a int32 array to the kernel from the host and generating output from the data. I have not successfully reduced the kernel but here's the part that generate the output from the data

@kernel
def _run_rtios(rtios, cur_mu):
    pending_data = False
    target = np.int32(0)
    for v in rtios:
        if pending_data:
            rtio_output(target, np.int32(v))
            pending_data = False
        elif v < 0:
            cur_mu -= np.int64(v)
            at_mu(cur_mu)
        else:
            target = np.int32(v)
            pending_data = True
    return cur_mu
as well as one of the data that fairly repeatably trigger the issue.
dummy_data = [     22529,   67110664,      24065,   67110664,         -8,
     22528,  234881024,      24064,  234881024,       -152,
     22529,   67116808,      24065,   67116808,         -8,
     22528, 1073676288,      24064, 1073676288,       -536,
     22529,   67116810,      24065,   67116810,         -8,
     22528,  536870912,      24064,  183904959,       -536,
     22529,  100665096,      24065,   67110664,         -8,
     22528,  234881024,       -152,      22529,  100671240,
        -8,      22528,  536870912,       -536,      22529,
 100671242,         -8,      22528,  901943132,      -1040,
     21248,          1,       -100,      21248,          0,
      -279,      20736,          0,      22016,          1,
     -2000,      23040,          1,      20992,          0,
     23552,          1,      22784,          1,         -8,
     22784,          0,      -7992,      20992,          1,
     -2000,      20992,          0,      -8000,      20992,
         1,      -2000,      20992,          0,      -8000,
     20992,          1,      -2000,      20992,          0,
     -8000,      20992,          1,      -2000,      20992,
         0,      -8000,      20992,          1,      -2000,
     20992,          0,      -8000,      20992,          1,
     -2000,      20992,          0,      -8000,      20992,
         1,      -2000,      20992,          0,      -8000,
     20992,          1,      -2000,      20992,          0,
     -8000,      20992,          1,      -2000,      20992,
         0,      -8000,      20992,          1,      -2000,
     20992,          0,      -8000,      20992,          1,
     -2000,      20992,          0,      -8000,      20992,
         1,      -2000,      20992,          0,      -8000,
     20992,          1,      -2000,      20992,          0,
     -8000,      20992,          1,      -2000,      20992,
         0,      -8000,      20992,          1,      -2000,
     20992,          0,      -8000,      20992,          1,
     -2000,      20992,          0,      -8000,      20992,
         1,      -2000,      20992,          0,      -8000,
     20992,          1,      -2000,      20992,          0,
     -8000,      20992,          1,      -2000,      20992,
         0,      -8000,      20992,          1,      -2000,
     20992,          0,      -8000,      20992,          1,
     -2000,      20992,          0,      -8000,      20992,
         1,      -2000,      20992,          0,      -8000,
     20992,          1,      -2000,      20992,          0,
     -8000,      20992,          1,      -2000,      20992,
         0,      -8000,      20992,          1,      -2000,
     20992,          0,      -8000,      20992,          1,
     -2000,      20992,          0,      -8000,      20992,
         1,      -2000,      20992,          0,      -8000,
     20992,          1,      -2000,      20992,          0,
     -8000,      20992,          1,      -2000,      20992,
         0,      -8000,      20992,          1,      -2000,
     20992,          0,     -25000,      20992,          1,
     -2000,      20992,          0,     -25000,      20992,
         1,      -2000,      20992,          0,     -25000,
     20992,          1,      -2000,      20992,          0,
    -25000,      20992,          1,      -2000,      20992,
         0,     -25000,      20992,          1,      -2000,
     20992,          0,     -25000,      20992,          1,
     -2000,      20992,          0,     -25000,      20992,
         1,      -2000,      20992,          0,     -25000,
     20992,          1,      -2000,      20992,          0,
    -25000,      20992,          1,      -2000,      20992,
         0,     -25000,      20992,          1,      -2000,
     20992,          0,     -25000,      20992,          1,
     -2000,      20992,          0,     -25000,      20992,
         1,      -2000,      20992,          0,     -25000,
     20992,          1,      -2000,      20992,          0,
    -25000,      20992,          1,      -2000,      20992,
         0,     -25000,      20992,          1,      -2000,
     20992,          0,     -25000,      20992,          1,
     -2000,      20992,          0,     -25000,      20992,
         1,      -2000,      20992,          0,     -25000,
     20992,          1,      -2000,      20992,          0,
    -25000,      20992,          1,      -2000,      20992,
         0,     -25000,      20992,          1,      -2000,
     20992,          0,     -25000,      20992,          1,
     -2000,      20992,          0,     -25000,      20992,
         1,      -2000,      20992,          0,     -25000,
     20992,          1,      -2000,      20992,          0,
    -25000,      20992,          1,      -2000,      20992,
         0,     -25000,      20992,          1,      -2000,
     20992,          0,     -25000,      20992,          1,
     -2000,      20992,          0,     -25000,      20992,
         1,      -2000,      20992,          0,     -25000,
     20992,          1,      -2000,      20992,          0,
    -25000,      20992,          1,      -2000,      20992,
         0,     -25000,      20992,          1,      -7000,
     20736,          1,      23040,          0,      20992,
         0,      23552,          0,      -7000,      24576,
         0,      24320,          1,         -8,      24320,
         0,      24064,  234881024,       -152,      24065,
  67116808,         -8,      24064, 1073698134,       -536,
     24065,   67116810,         -8,      24064,  183904959,
      -536,      24065,   67110664,       -252,      24576,
         1,     -39000,      24576,          0,         -4,
     24320,          1,         -8,      24320,          0,
     24064,  234881024,       -152,      24065,   67116808,
        -8,      24064, 1073676289,       -536,      24065,
  67116810,         -8,      24064,  183904959,       -536,
     24065,   67110664,       -248,      24576,          1,
    -39000,      24576,          0,      24320,          1,
        -8,      24320,          0,      24064,  234881024,
      -152,      24065,   67116808,         -8,      24064,
         1,       -536,      24065,   67116810,         -8,
     24064,  183904959,       -788,      24576,          1,
    -39000,      24576,          0,         -4,      24320,
         1,         -8,      24320,          0]

The data is uploaded (repeated 11 times in a list) either by returning it from a rpc call with return type TList(TArray(TInt32)) or by assigning it to a field of the experiment object and accessing the field in the kernel. Both method can trigger the panic though the rpc one may be triggering it somewhat more frequently. Each experiment runs through these ~20k times and this can happen anywhere between the 1st experiment and the 20th.

Expected Behavior

The kernel runs.

Actual (undesired) Behavior

The kernel panics. The actual error vary a lot but the most frequent one is

panic at src/libcore/slice/mod.rs:1965:5: index 31 out of range for slice of length 19

The value 31 and 19 are very stable.

Decoded backtrace.

unwind_backtrace::backtrace at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/libunwind_backtrace/lib.rs:42
 (inlined by) rust_begin_unwind at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/main.rs:329
core::panicking::panic_fmt at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/panicking.rs:92
<core::ops::range::Range<usize> as core::slice::SliceIndex<[T]>>::index_mut at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/slice/mod.rs:2142
 (inlined by) <core::ops::range::RangeTo<usize> as core::slice::SliceIndex<[T]>>::index_mut at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/slice/mod.rs:2181
 (inlined by) core::slice::<impl core::ops::index::IndexMut<I> for [T]>::index_mut at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/slice/mod.rs:1958
 (inlined by) smoltcp::wire::tcp::Repr::emit at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/wire/tcp.rs:913
 (inlined by) smoltcp::iface::ethernet::InterfaceInner::dispatch::{{closure}} at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:1497
 (inlined by) smoltcp::iface::ethernet::InterfaceInner::dispatch_ip::{{closure}} at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:1681
 (inlined by) smoltcp::iface::ethernet::InterfaceInner::dispatch_ethernet::{{closure}} at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:1516
 (inlined by) <smoltcp::phy::tracer::TxToken<Tx, P> as smoltcp::phy::TxToken>::consume::{{closure}} at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/phy/tracer.rs:99
 (inlined by) <board_misoc::ethmac::EthernetTxSlot as smoltcp::phy::TxToken>::consume at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/libboard_misoc/ethmac.rs:120
 (inlined by) <smoltcp::phy::tracer::TxToken<Tx, P> as smoltcp::phy::TxToken>::consume at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/phy/tracer.rs:98
 (inlined by) smoltcp::iface::ethernet::InterfaceInner::dispatch_ethernet at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:1511
 (inlined by) smoltcp::iface::ethernet::InterfaceInner::dispatch_ip at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:1668
 (inlined by) smoltcp::iface::ethernet::InterfaceInner::dispatch at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:1477
<smoltcp::iface::ethernet::Interface<'b, 'c, 'e, DeviceT>>::socket_egress::{{closure}} at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:595
 (inlined by) smoltcp::socket::tcp::TcpSocket::dispatch at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/socket/tcp.rs:1599
 (inlined by) <smoltcp::iface::ethernet::Interface<'b, 'c, 'e, DeviceT>>::socket_egress at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:625
 (inlined by) <smoltcp::iface::ethernet::Interface<'b, 'c, 'e, DeviceT>>::poll at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:495
 (inlined by) runtime::startup at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/main.rs:221
core::ops::function::FnOnce::call_once at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/ops/function.rs:223
 (inlined by) logger_artiq::BufferLogger::register at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/liblogger_artiq/lib.rs:67
 (inlined by) main at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/main.rs:255

A related question is how much memory (on artiq 6) is available for different types of memory? Static data in kernel, rpc data per-package, rpc returned data that's allocated on the kernel stack. etc. I cannot easily reproduce this without uploading data buffer so I want to know what are the limit and if any of the limits can be exceeded without an explicit error message.

Your System (omit irrelevant parts)

  • Operating System: Ubuntu 22.04
  • ARTIQ version: 6
  • Version of the gateware and runtime loaded in the core device: 6.7684.9a3f6f85;gold_master
  • Hardware involved: A kasli master connected to a kasli satellite. The master is the one thats panicing. After the master panics, the satalite sometimes needs to be restart as well but that could just be that the master hang up on a communication mid-way.
@yuyichao
Copy link
Author

yuyichao commented Aug 2, 2024

fw_and_kernel.zip

Here's the kernel dump and the firmware binaries. Note that the kernel dumped had a if target == 20992: condition on the rtio_output. This was added for debugging and was sufficient to trigger the issue.

@dnadlinger
Copy link
Collaborator

Could you post the code/related snippets for everything involving the handling of rtios? There are some related bugs where the compiler accepts invalid code (see #1394 and related). I've seen "non-local" crashes also in network or logging code due to memory corruption.

@yuyichao
Copy link
Author

yuyichao commented Aug 2, 2024

Thanks for the quick reply.

For running it's literally just

    @kernel
    def run_seq_kernel(self):
        run_rtios(self._scan_artiq_seqs[self._dax_scan_index])

where self._dax_scan_index is a int32 and self._scan_artiq_seq is the list of numpy int32 array.

For the static data version, self._scan_artiq_seqs is assigned on the host. (I believe it was not set to kernel invariant when I tested it this time but I've seen less reproducible crashes before when I did set it to be kernel invariant.)

For the rpc version, I have seen corruptions and realizing that the allocation was on the stack so I moved the allocation out to the upper most level of the code. Hence, the setting of the self._scan_artiq_seqs look like this,

    @kernel
    def _dax_control_flow_run_kernel(self):
        if self.is_seq_scan:
            self._scan_artiq_seqs = self.get_rtio_data()
        try:
            DaxScan._dax_control_flow_run_kernel(self)
        finally:
            if self.is_seq_scan:
                self._scan_artiq_seqs = self._old_scan_artiq_seqs

This is the kernel entry point and all the usage are within DaxScan._dax_control_flow_run_kernel(self). self._scan_artiq_seqs and self._old_scan_artiq_seqs both started as [] outside the kernel. The last assignment makes sure that neither the kernel finalization code or my code would try to access the rpc returned data that would be invalid upon the function return.

I believe there is not other use of the array in our code.

@yuyichao
Copy link
Author

yuyichao commented Aug 2, 2024

And yes, the kind of call pattern in #1394 was exactly what I noticed and fixed by moving the rpc to top level. Looking at the LLVM ir, it does seem like the rpc_recv is called on alloca'd buffer in a loop to receive the array which matches what I expect.

@dnadlinger
Copy link
Collaborator

Just to be sure I understand correctly (since that wasn't clear to me from the initial post), for the "non-RPC" case you assign rtios to say self.rtios statically (on the host before the kernel is compiled) and then just call _run_rtios(self.rtios, …)?

Also, when you that the experiment runs these 20k times, does that mean that the crash happens even without ever returning from the kernel? I'm asking because attribute writeback is another fraught issue with lists (as you pointed out).

@yuyichao
Copy link
Author

yuyichao commented Aug 2, 2024

Just to be sure I understand correctly (since that wasn't clear to me from the initial post), for the "non-RPC" case you assign rtios to say self.rtios statically (on the host before the kernel is compiled) and then just call _run_rtios(self.rtios, …)?

More or less, more specifically, the host assigned to self._scan_artiq_seqs and the assignment to it in the _dax_control_flow_run_kernel above is removed.

Also, when you that the experiment runs these 20k times, does that mean that the crash happens even without ever returning from the kernel? I'm asking because attribute writeback is another fraught issue with lists (as you pointed out).

I have seen this happening on the literal first run after rebooting kasli a few times yesterday so returning shouldn't have an effect there I hope. The kernel does return after running this 20k times and at least for today it usually takes ~ 10 runs to trigger.

@dnadlinger
Copy link
Collaborator

dnadlinger commented Aug 2, 2024

If your experiment doesn't actually require any hardware readback, you might be interested in a branch I have that adds an emulation layer to run kernels on the host by compiling them to x86_64 Linux instead, where you then can use Valgrind/GDB/… to investigate memory corruption issues in the RPC code as usual. The more hardware syscalls (e.g. rtio_input/…) you need to mock out to get your kernel to compile, the more annoying this would be to use, though.

@yuyichao
Copy link
Author

yuyichao commented Aug 2, 2024

We do have some input elsewhere in the kernel that measure PMT counts and does an async rpc call to save them.

It roughly looks like

        self.histogram.append([self._detection.count(c) for c in channels])

where append is an async rpc call. I'm not sure how important this part is to repduce the issue, since I have not been able to reproduce it if I only include a loop on run_rtios as the kernel...

@yuyichao
Copy link
Author

yuyichao commented Aug 2, 2024

Oh, forgot to mention that I have not observed a crash if I disabled the rtio_output (or use a condition to disable most of the outputs) from run_rtios. As I mentioned the issue normally is triggered within 20 runs and I've done at least ~100 runs with reduced/disabled output without any error.

@dnadlinger
Copy link
Collaborator

Hmm, could this point to a hardware issue (e.g. power supply stability)? Or maybe the crash is related to the RTIO analyzer, if you have that enabled? (which would cause changes to the network activity based on RTIO activity)

Does enabling TRACE-level logging reveal anything/change the behaviour?

These are just random suggestions for debugging, of course. I don't have any particular good guesses at this stage (without time and access to a reproduction case), nor do we seem to have seen this particular crash in Oxford based on the log archives.

@yuyichao
Copy link
Author

yuyichao commented Aug 3, 2024

I was thinking about it and it's entirely possible that the correlation with rtio_output is pure timing related. When there's no output, my kernel was only calling at_mu basically and it's just writing to a global variable. The rtio_output call probably significantly affected the timing of the call including a few calls to start and stop ttl counting as well as rtio input followed by rpc calls to transfer the data back to the host.

Does enabling TRACE-level logging reveal anything/change the behaviour?

How do I do this?

I still feel like this is either a race/re-entrance issue or a memory corruption. A few conceptual question about the overall structure,

  1. Is the firmware code memory read-only? Could the kernel code corrupt code memory by means of, e.g. buffer overflow?
    I feel like corrupted code memory should cause a crash way more reliable than this but maybe there's a super cold code path that's corrupted. I believe we've reflashed the firmware but we can try again.
  2. Are the allocation sizes checked? The ones I can think about here are the stack/heap for both the network stack/rpc, and the kernel.
    The reason I'm asking this is that I've tried to see how much data I can return from a rpc call. I could transfer a single int32 array up to ~50M elements just fine. (I checked that the sum of the data is the same on the kernel and nothing seems to be crashing/panicing afterwards). When I tried to transfer 100M elements though, I got a invalid pointer exception at 0x38287a40 (from kernel::validate) suggesting a stack-type allocation starting from around 0x4ffffe40 (probably 0x4fffffff). I'm a little surprised that this overflown pointer could actually reach the kernel through the mailbox, even though nothing fundamental seems to have been overwriten. Additionally, trying to send two 100M elements arrays only moves this pointer by 8 bytes and sending two 50M elements array does not trigger this issue. This makes me somewhat worried that maybe I'm overflowing some buffers without knowing it.
  3. Related, the check in kernel.rs checks for KERNELCPU_EXEC_ADDRESS and not KERNELCPU_PAYLOAD_ADDRESS. I might be wrong but it seems that the kernel code is loaded at KERNELCPU_PAYLOAD_ADDRESS and if so, does this mean a carefully chosen rpc return value size could corrupt kernel code?
  4. Is there a description I can find somewhere about all these address ranges? I could probably figure out a way to get the kernel stack/frame/instruction pointer but not for most of the other address used by the firmware.

Thanks.

@dnadlinger
Copy link
Collaborator

dnadlinger commented Aug 8, 2024

How do I do this?

artiq_coremgmt log set_level and then also artiq_coremgmt log set_uart_level if monitoring via UART (which I tend to do, and is probably a good idea when debugging panics).

Is the firmware code memory read-only?

The firmware is loaded in from flash, which should only be written to in very limited circumstances (e.g. when uploading the startup/idle kernels).

Are the allocation sizes checked?

On the comms CPU, all the dynamic allocations should be served from the heap. The heap allocator will panic (and print a memory map) when it runs out of free space. IIRC only the kernel CPU allocas on the stack. On the kernel CPU side, this should be checked with a guard page (see code around STACK_GUARD_BASE/STACK_GUARD_SIZE in fault handler). Maybe somebody else can chime in about the comms CPU side.

@yuyichao
Copy link
Author

yuyichao commented Aug 9, 2024

artiq_coremgmt log set_level and then also artiq_coremgmt log set_uart_level if monitoring via UART (which I tend to do, and is probably a good idea when debugging panics).

OK thanks I'll try this. We are logging via UART.

The firmware is loaded in from flash, which should only be written to in very limited circumstances (e.g. when uploading the startup/idle kernels).

So I assume that even if the firmware code is corrupted in memory, it shouldn't persist across reboot. This would be consistent with what we saw since we did try to reflash all firmware and it didn't make a difference.

On the comms CPU, all the dynamic allocations should be served from the heap. The heap allocator will panic (and print a memory map) when it runs out of free space. IIRC only the kernel CPU allocas on the stack. On the kernel CPU side, this should be checked with a guard page (see code around STACK_GUARD_BASE/STACK_GUARD_SIZE in fault handler). Maybe somebody else can chime in about the comms CPU side.

I see. The part I specifically refers to was https://github.com/m-labs/artiq/blob/release-6/artiq/firmware/runtime/session.rs#L161, which seems to be talking about a pointer the kernel receives so I thought it was allocated somewhere else. Or is it a pointer the kernel send to the firmware and then received? Is 0x4ffffe40 around the highest address for the kernel stack?

We've also noticed that disabling the rpc call and just use static data reduces the frequency of the panic happening. It has also happened many time for us in other experiments that makes extensive use of rpc (the PMT monitor experiment) without the aforementioned experiments.

@dnadlinger
Copy link
Collaborator

dnadlinger commented Aug 9, 2024

Oh, apologies, the stack guard page was only added with the switch to RISC-V on ARTIQ 7 and newer. I don't recall whether it was possible to just corrupt memory when exceeding the allotted stack in the old version. If I were in your shoes, I'd consider upgrading before sinking more time into debugging memory corruption issues, as the PMP support on the RISC-V cores helps, and some bugs related to RPC codegen might already be fixed now. (Of course, we ran with or1k-based firmware here in Oxford for a long time, but annectodally, unexplained corruption issues seemed more frequent there.)

@yuyichao
Copy link
Author

Upgrading artiq won't be trivial for us at the moment but could happen in a year or so I hope. In the mean time, I got a trace that I assume is related to the issue,

[2024-08-19 17:24:04] [186391.447563s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.453188s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.460196s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.473418s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.479043s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.486051s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.499273s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.504898s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.511906s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.525041s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.530665s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.537673s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 1, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.550895s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.556520s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.563527s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.576750s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.582375s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.589382s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.602605s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.608230s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.615237s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.628459s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.634084s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.641092s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.654314s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.659939s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.666946s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.680082s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.685707s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.692714s] DEBUG(runtime::session): [0, 0, 0, 5, 79, 0, 0, 0, 134, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.701458s] DEBUG(runtime): ethernet mac: rx dropped: 3361378
[2024-08-19 17:24:04] [186391.706934s] TRACE(runtime::moninj): moninj->host MonitorStatus { channel: 82, probe: 0, value: 0 }
[2024-08-19 17:24:04] [186391.715710s] TRACE(runtime::moninj): moninj->host MonitorStatus { channel: 86, probe: 0, value: 1 }
[2024-08-19 17:24:04] [186391.724759s] TRACE(runtime::moninj): moninj->host MonitorStatus { channel: 90, probe: 0, value: 0 }
[2024-08-19 17:24:04] [186391.733808s] TRACE(runtime::moninj): moninj->host MonitorStatus { channel: 92, probe: 0, value: 0 }
[2024-08-19 17:24:04] [186391.743883s]  INFO(runtime::session): no connection, starting idle kernel
[2024-08-19 17:24:04] [186391.781101s] DEBUG(runtime::session): comm->kern LoadRequest(...)
[2024-08-19 17:24:04] [186391.787091s] DEBUG(runtime::session): comm<-kern LoadReply(Ok(()))
[2024-08-19 17:24:04] [186391.793107s] DEBUG(runtime::session): comm<-kern RtioInitRequest
[2024-08-19 17:24:04] [186391.798046s]  INFO(runtime::kern_hwreq): resetting RTIO
[2024-08-19 17:24:05] [186392.014488s] DEBUG(runtime): ethernet mac: rx dropped: 3361380
[2024-08-19 17:24:05] [186392.025847s] DEBUG(runtime): ethernet mac: rx dropped: 3361382
[2024-08-19 17:24:05] [186392.030908s] DEBUG(runtime): ethernet mac: rx dropped: 3361384
[2024-08-19 17:24:05] [186392.250652s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:05] [186392.381189s] DEBUG(runtime): ethernet mac: rx dropped: 3361387
[2024-08-19 17:24:05] [186392.560313s] DEBUG(runtime): ethernet mac: rx dropped: 3361388
[2024-08-19 17:24:05] [186392.739541s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:05] [186392.749505s] DEBUG(runtime): ethernet mac: rx dropped: 3361390
[2024-08-19 17:24:06] [186393.262891s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:06] [186393.442988s] DEBUG(runtime): ethernet mac: rx dropped: 3361392
[2024-08-19 17:24:06] [186393.484065s] DEBUG(runtime): ethernet mac: rx dropped: 3361393
[2024-08-19 17:24:07] [186394.056741s] DEBUG(runtime): ethernet mac: rx dropped: 3361395
[2024-08-19 17:24:07] [186394.061779s] DEBUG(runtime): ethernet mac: rx dropped: 3361397
[2024-08-19 17:24:07] [186394.220737s] DEBUG(runtime): ethernet mac: rx dropped: 3361398
[2024-08-19 17:24:07] [186394.261993s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:07] [186394.351301s] DEBUG(runtime): ethernet mac: rx dropped: 3361400
[2024-08-19 17:24:08] [186394.871530s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:08] [186394.958543s] DEBUG(runtime): ethernet mac: rx dropped: 3361402
[2024-08-19 17:24:08] [186395.025193s] DEBUG(runtime): ethernet mac: rx dropped: 3361418
[2024-08-19 17:24:08] [186395.030547s] DEBUG(runtime): ethernet mac: rx dropped: 3361434
[2024-08-19 17:24:08] [186395.036008s] DEBUG(runtime): ethernet mac: rx dropped: 3361437
[2024-08-19 17:24:08] [186395.262112s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:08] [186395.323511s] DEBUG(runtime): ethernet mac: rx dropped: 3361438
[2024-08-19 17:24:08] [186395.457569s] DEBUG(runtime): ethernet mac: rx dropped: 3361440
[2024-08-19 17:24:08] [186395.692156s] DEBUG(runtime): ethernet mac: rx dropped: 3361441
[2024-08-19 17:24:08] [186395.704066s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:08] [186395.783088s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:09] [186396.130303s] DEBUG(runtime): ethernet mac: rx dropped: 3361443
[2024-08-19 17:24:09] [186396.135346s] DEBUG(runtime): ethernet mac: rx dropped: 3361447
[2024-08-19 17:24:09] [186396.265620s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:09] [186396.415945s] DEBUG(runtime): ethernet mac: rx dropped: 3361449
[2024-08-19 17:24:09] [186396.579860s] DEBUG(runtime): ethernet mac: rx dropped: 3361450
[2024-08-19 17:24:09] [186396.665844s] DEBUG(runtime): ethernet mac: rx dropped: 3361458
[2024-08-19 17:24:09] [186396.670794s] DEBUG(runtime): ethernet mac: rx dropped: 3361469
[2024-08-19 17:24:09] [186396.676664s] DEBUG(runtime): ethernet mac: rx dropped: 3361485
[2024-08-19 17:24:09] [186396.682610s] DEBUG(runtime): ethernet mac: rx dropped: 3361491
[2024-08-19 17:24:09] [186396.688371s] DEBUG(runtime): ethernet mac: rx dropped: 3361493
[2024-08-19 17:24:09] [186396.694143s] DEBUG(runtime): ethernet mac: rx dropped: 3361497
[2024-08-19 17:24:09] [186396.750400s] DEBUG(runtime): ethernet mac: rx dropped: 3361499
[2024-08-19 17:24:10] [186397.135122s] DEBUG(runtime): ethernet mac: rx dropped: 3361501
[2024-08-19 17:24:10] [186397.153003s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:10] [186397.269216s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:10] [186397.471916s] DEBUG(runtime): ethernet mac: rx dropped: 3361503
[2024-08-19 17:24:11] [186397.873078s] DEBUG(runtime): ethernet mac: rx dropped: 3361505
[2024-08-19 17:24:11] [186397.999247s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:11] [186398.043252s] DEBUG(runtime): ethernet mac: rx dropped: 3361507
[2024-08-19 17:24:11] [186398.177074s] DEBUG(runtime): ethernet mac: rx dropped: 3361509
[2024-08-19 17:24:11] [186398.182097s] DEBUG(runtime): ethernet mac: rx dropped: 3361511
[2024-08-19 17:24:11] [186398.277497s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:11] [186398.441406s] DEBUG(runtime): ethernet mac: rx dropped: 3361512
[2024-08-19 17:24:11] [186398.446348s] DEBUG(runtime): ethernet mac: rx dropped: 3361516
[2024-08-19 17:24:11] [186398.537037s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:11] [186398.559164s] DEBUG(runtime): ethernet mac: rx dropped: 3361518
[2024-08-19 17:24:11] [186398.739636s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:11] [186398.744469s] DEBUG(runtime): ethernet mac: rx dropped: 3361520
[2024-08-19 17:24:12] [186398.946034s] DEBUG(runtime): ethernet mac: rx dropped: 3361521
[2024-08-19 17:24:12] [186399.276128s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:12] [186399.290595s] DEBUG(runtime): ethernet mac: rx dropped: 3361522
[2024-08-19 17:24:12] [186399.485839s] DEBUG(runtime): ethernet mac: rx dropped: 3361524
[2024-08-19 17:24:13] [186400.023088s] DEBUG(runtime): ethernet mac: rx dropped: 3361526
[2024-08-19 17:24:13] [186400.261228s] DEBUG(runtime): ethernet mac: rx dropped: 3361528
[2024-08-19 17:24:13] [186400.266298s] DEBUG(runtime): ethernet mac: rx dropped: 3361530
[2024-08-19 17:24:13] [186400.285947s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:13] [186400.371844s] DEBUG(runtime): ethernet mac: rx dropped: 3361531
[2024-08-19 17:24:13] [186400.749750s] DEBUG(runtime): ethernet mac: rx dropped: 3361532
[2024-08-19 17:24:13] [186400.771122s] DEBUG(runtime): ethernet mac: rx dropped: 3361533
[2024-08-19 17:24:14] [186401.062476s] DEBUG(runtime): ethernet mac: rx dropped: 3361547
[2024-08-19 17:24:14] [186401.067460s] DEBUG(runtime): ethernet mac: rx dropped: 3361565
[2024-08-19 17:24:14] [186401.073385s] DEBUG(runtime): ethernet mac: rx dropped: 3361572
[2024-08-19 17:24:14] [186401.079105s] DEBUG(runtime): ethernet mac: rx dropped: 3361573
[2024-08-19 17:24:14] [186401.283993s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:14] [186401.299478s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:14] [186401.490921s] DEBUG(runtime): ethernet mac: rx dropped: 3361575
[2024-08-19 17:24:14] [186401.499787s] DEBUG(runtime): ethernet mac: rx dropped: 3361577
[2024-08-19 17:24:14] [186401.747178s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:15] [186402.223281s] DEBUG(runtime): ethernet mac: rx dropped: 3361579
[2024-08-19 17:24:15] [186402.297437s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:15] [186402.337746s] DEBUG(runtime): ethernet mac: rx dropped: 3361581
[2024-08-19 17:24:15] [186402.342792s] DEBUG(runtime): ethernet mac: rx dropped: 3361583
[2024-08-19 17:24:15] [186402.407814s] DEBUG(runtime): ethernet mac: rx dropped: 3361584
[2024-08-19 17:24:15] [186402.724380s] DEBUG(runtime): ethernet mac: rx dropped: 3361604
[2024-08-19 17:24:15] [186402.729764s] DEBUG(runtime): ethernet mac: rx dropped: 3361617
[2024-08-19 17:24:15] [186402.735183s] DEBUG(runtime): ethernet mac: rx dropped: 3361623
[2024-08-19 17:24:16] [186402.954917s] DEBUG(runtime): ethernet mac: rx dropped: 3361625
[2024-08-19 17:24:16] [186403.290909s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:16] [186403.496201s] DEBUG(runtime): ethernet mac: rx dropped: 3361626
[2024-08-19 17:24:16] [186403.676772s] DEBUG(runtime): ethernet mac: rx dropped: 3361628
[2024-08-19 17:24:17] [186404.054814s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:17] [186404.070318s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:17] [186404.294415s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:17] [186404.399270s] DEBUG(runtime): ethernet mac: rx dropped: 3361630
[2024-08-19 17:24:17] [186404.404316s] DEBUG(runtime): ethernet mac: rx dropped: 3361634
[2024-08-19 17:24:17] [186404.410175s] DEBUG(runtime): ethernet mac: rx dropped: 3361639
[2024-08-19 17:24:17] [186404.769059s] DEBUG(runtime): network error: malformed packet

The dashboard shows a connection reset by peer error at 17:24:05, which is similar to/same as the one I got previously when the firmware paniced. This was also triggered by switching back to using rpc to upload the data.

The difference though, is that the firmware did not panic and I can run another experiment without any problem. Also as soon as I've enabled the more verbose logging, I can see these network errors popping up. I have not tried restarting kasli to see if these error persists.

@yuyichao
Copy link
Author

Oh, and the rpc's should all be a call to an function that was defined like def append(self, data) with the data being a list of 3 integers roughly [0, 20, 0] I don't know what's the significance of the last rpc data showing

[2024-08-19 17:24:04] [186391.692714s] DEBUG(runtime::session): [0, 0, 0, 5, 79, 0, 0, 0, 134, 0, 0, 0, 0, 1, 110]

that's significantly shorter than all the previous ones....

@yuyichao
Copy link
Author

I there a way I can figure out the content/reason for those dropped ethernet packages without separate hardware monitor?

@dnadlinger
Copy link
Collaborator

These are likely just broadcast packets that are part of the regular background chatter on the network. The easiest way to monitor them (apart from modifying smoltcp to print them out) would be to insert a PC with two network cards between the Kasli and the rest of the network, such that you can monitor everything using Wireshark.

As mentioned above, though, personally I'd really recommend spending the time on an ARTIQ upgrade instead. Even the big or1k/RISC-V switch was painless for us here, and any work spent on this is not "wasted" in the same way that hunting down a bug in an out-of-date version is. Of course, your circumstances might differ.

@yuyichao
Copy link
Author

Unfortunately I'm not the one who decide what version to use and I'm the only one having this issue... I'm fully aware that debugging the issue on an older version of the firmware would likely have little/no interest from everywhere else but with 80% of my time running the experiment spent on restarting artiq I don't have a much better choice.

@yuyichao
Copy link
Author

Upgraded to artiq-7 and just got another crash with this backtrace,

[2024-10-23 12:25:35] panic at runtime/rtio_mgt.rs:57:40: called `Result::unwrap()` on an `Err` value: Interrupted
[2024-10-23 12:25:35] backtrace for software version 7.8208.38c72fd;gold_master:
0x4003cae0: unwind_backtrace::backtrace at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/libunwind_backtrace/lib.rs:42
 (inlined by) rust_begin_unwind at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/main.rs:360
0x40008b90: core::panicking::panic_fmt at /nix/store/lp3wfd7jqiyljq5q43zkglcval0dyhv5-rust-1.51.0-nightly-2021-01-28-c0b64d97b/lib/rustlib/src/rust/library/core/src/panicking.rs:92
0x400084b4: core::result::unwrap_failed at /nix/store/lp3wfd7jqiyljq5q43zkglcval0dyhv5-rust-1.51.0-nightly-2021-01-28-c0b64d97b/lib/rustlib/src/rust/library/core/src/result.rs:1284
0x400267bc: core::result::Result<T,E>::unwrap at /nix/store/lp3wfd7jqiyljq5q43zkglcval0dyhv5-rust-1.51.0-nightly-2021-01-28-c0b64d97b/lib/rustlib/src/rust/library/core/src/result.rs:1037
 (inlined by) runtime::rtio_mgt::drtio::aux_transact at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/rtio_mgt.rs:57
0x40021460: runtime::rtio_mgt::drtio::reset at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/rtio_mgt.rs:316
 (inlined by) runtime::rtio_mgt::reset at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/rtio_mgt.rs:383
 (inlined by) runtime::kern_hwreq::process_kern_hwreq at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/kern_hwreq.rs:245
 (inlined by) runtime::session::process_kern_message::{{closure}} at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/session.rs:350
 (inlined by) runtime::session::kern_recv_notrace at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/session.rs:167
 (inlined by) runtime::session::process_kern_message at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/session.rs:334
0x400236f4: runtime::session::flash_kernel_worker at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/session.rs:548
0x4002d9b4: runtime::session::thread::{{closure}} at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/session.rs:666
 (inlined by) runtime::sched::Thread::new::{{closure}} at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/sched.rs:68
 (inlined by) fringe::generator::Generator<Input,Output,Stack>::unsafe_new::generator_wrapper at /tmp/nix-build-artiq-board-kasli-gold_master.drv-0/cargo-vendor-dir/fringe-1.2.1/src/generator.rs:123
0x4000ba10: fringe::arch::imp::init::trampoline_2 at /tmp/nix-build-artiq-board-kasli-gold_master.drv-0/cargo-vendor-dir/fringe-1.2.1/src/arch/riscv32.rs:91
0x4000ba00: fringe::arch::imp::init::trampoline_1 at /tmp/nix-build-artiq-board-kasli-gold_master.drv-0/cargo-vendor-dir/fringe-1.2.1/src/arch/riscv32.rs:56
0x400384e0: fringe::arch::imp::swap at /tmp/nix-build-artiq-board-kasli-gold_master.drv-0/cargo-vendor-dir/fringe-1.2.1/src/arch/riscv32.rs:210
 (inlined by) fringe::generator::Generator<Input,Output,Stack>::resume at /tmp/nix-build-artiq-board-kasli-gold_master.drv-0/cargo-vendor-dir/fringe-1.2.1/src/generator.rs:157
 (inlined by) runtime::sched::Scheduler::run at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/sched.rs:159
 (inlined by) runtime::startup at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/main.rs:213
0x40005ac8: board_misoc::arch::boot::start_user at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/libboard_misoc/riscv32/boot.rs:23
0x40002454: runtime::main::{{closure}} at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/main.rs:257
0x400005c4: logger_artiq::BufferLogger::register at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/liblogger_artiq/lib.rs:67
0x4003bc60: ?? at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/main.rs:256

It seems that this might be #2123 though

@yuyichao
Copy link
Author

Looking at the backtrace for some of the crashes I've seen before on artiq 6, I did find a few of,

[2024-07-31 20:41:39] panic at src/libcore/result.rs:945:5: called `Result::unwrap()` on an `Err` value: Interrupted
[2024-07-31 20:41:39] backtrace for software version 6.7684.9a3f6f85;gold_master:
0x40031554: unwind_backtrace::backtrace::h6343ca921fbdb659 at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/libunwind_backtrace/lib.rs:42
 (inlined by) rust_begin_unwind at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/main.rs:329
0x40023a54: core::panicking::panic_fmt::h7960887f32d6fb90 at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/panicking.rs:92
0x40046cf4: core::result::unwrap_failed::h6d381b79cd7d9cfe at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/macros.rs:26
0x400244a4: _$LT$core..result..Result$LT$T$C$$u20$E$GT$$GT$::unwrap::h66d533d2edf31a0e at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/result.rs:782
 (inlined by) runtime::rtio_mgt::drtio::recv_aux_timeout::hd44cf32d1f413b05 at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/rtio_mgt.rs:48
0x400242fc: runtime::rtio_mgt::drtio::aux_transact::h7d3f04c90af8ef9a at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/rtio_mgt.rs:56
0x40041164: runtime::rtio_mgt::drtio::reset::hdcf8efa2625e8f2d at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/rtio_mgt.rs:307
 (inlined by) runtime::rtio_mgt::reset::h82c8a54532ee9111 at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/rtio_mgt.rs:365
 (inlined by) runtime::kern_hwreq::process_kern_hwreq::h9b6f79c92c0aaf87 at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/kern_hwreq.rs:223
 (inlined by) runtime::session::process_kern_message::_$u7b$$u7b$closure$u7d$$u7d$::h3ec47ffcd4d9b57a at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/session.rs:346
 (inlined by) runtime::session::kern_recv_notrace::h867666726f00443d at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/session.rs:165
 (inlined by) runtime::session::process_kern_message::he2beabed0033436a at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/session.rs:330
0x40045ea4: runtime::session::flash_kernel_worker::haae8c34773cc2e1c at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/session.rs:544
0x4003669c: runtime::session::thread::_$u7b$$u7b$closure$u7d$$u7d$::h7d91e19584b9d77b at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/session.rs:662
 (inlined by) runtime::sched::Thread::new::_$u7b$$u7b$closure$u7d$$u7d$::h4ff3fc4ac11fc7e5 at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/sched.rs:67
 (inlined by) _$LT$fringe..generator..Generator$LT$Input$C$$u20$Output$C$$u20$Stack$GT$$GT$::unsafe_new::generator_wrapper::h2cf61a3bced688ac at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/fringe/src/generator.rs:118
0x40036500: core::ops::function::FnOnce::call_once::h9c0b441e0efb4ad6 at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/ops/function.rs:223
 (inlined by) runtime::sched::Thread::new::_$u7b$$u7b$closure$u7d$$u7d$::ha24ec7529130396c at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/sched.rs:67
 (inlined by) _$LT$fringe..generator..Generator$LT$Input$C$$u20$Output$C$$u20$Stack$GT$$GT$::unsafe_new::generator_wrapper::hce0c78d2ad111f4a at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/fringe/src/generator.rs:118

The call stack looks very similar even though the exact error location was different.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants