BUG: kernel NULL pointer dereference, address: 0000000000000008 #32

koo5 · 2021-11-16T18:12:51Z

lis 16 19:01:52 r6 systemd[1]: Finished Suspend.
lis 16 19:01:52 r6 systemd[1]: systemd-suspend.service: Consumed 2.774s CPU time.
lis 16 19:01:52 r6 systemd[1]: Stopped target Sleep.
lis 16 19:01:52 r6 systemd[1]: Reached target Suspend.
lis 16 19:01:52 r6 systemd-logind[4628]: Operation 'sleep' finished.
lis 16 19:01:52 r6 NetworkManager[4886]: <info>  [1637085712.5737] manager: sleep: wake requested (sleeping: yes  enabled: yes)
lis 16 19:01:52 r6 upowerd[4632]: TI:19:01:52        Acquired inhibitor lock (7)
lis 16 19:01:52 r6 upowerd[4632]: TI:19:01:52        Woke up from sleep; about to refresh devices
lis 16 19:01:52 r6 upowerd[4632]: TI:19:01:52        Polling will be resumed
lis 16 19:01:52 r6 ModemManager[4736]: <info>  [sleep-monitor] system is resuming
lis 16 19:01:52 r6 NetworkManager[4886]: <info>  [1637085712.5738] device (enp6s0): state change: unavailable -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
lis 16 19:01:52 r6 NetworkManager[4886]: <info>  [1637085712.6037] device (enp6s0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
lis 16 19:01:52 r6 systemd[1]: Starting NVIDIA system resume actions...
lis 16 19:01:52 r6 systemd[1]: Stopped target Suspend.
lis 16 19:01:52 r6 suspend[10954]: nvidia-resume.service
lis 16 19:01:52 r6 logger[10954]: <13>Nov 16 19:01:52 suspend: nvidia-resume.service
lis 16 19:01:52 r6 systemd[1]: nvidia-resume.service: Deactivated successfully.
lis 16 19:01:52 r6 systemd[1]: Finished NVIDIA system resume actions.
lis 16 19:01:52 r6 NetworkManager[4886]: <info>  [1637085712.6306] device (enp5s0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
lis 16 19:01:53 r6 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
lis 16 19:01:53 r6 kernel: #PF: supervisor read access in kernel mode
lis 16 19:01:53 r6 kernel: #PF: error_code(0x0000) - not-present page
lis 16 19:01:53 r6 kernel: PGD 0 P4D 0 
lis 16 19:01:53 r6 kernel: Oops: 0000 [#1] SMP NOPTI
lis 16 19:01:53 r6 kernel: CPU: 2 PID: 4886 Comm: NetworkManager Tainted: P           OE     5.13.0-21-generic #21-Ubuntu
lis 16 19:01:53 r6 kernel: Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 4021 08/09/2021
lis 16 19:01:53 r6 kernel: RIP: 0010:aq_ring_rx_fill+0xd5/0x210 [atlantic]
lis 16 19:01:53 r6 kernel: Code: 24 31 d2 83 c0 01 41 3b 44 24 28 48 0f 43 c2 41 89 44 24 24 41 83 ef 01 0f 84 fc 00 00 00 48 8d 1c 40 48 c1 e3 04 49 03 1c 24 <48> 8b 43 08 48 c7 43 28 00 08 00 00 48 85 c0 75 81 49 8b 44 24 10
lis 16 19:01:53 r6 kernel: RSP: 0018:ffffb92282827368 EFLAGS: 00010246
lis 16 19:01:53 r6 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
lis 16 19:01:53 r6 kernel: RDX: 0000000000000000 RSI: 0000000000006100 RDI: ffff98cecec0e798
lis 16 19:01:53 r6 kernel: RBP: ffffb922828273a0 R08: 0000000000000000 R09: 0000000000008000
lis 16 19:01:53 r6 kernel: R10: 00000000ffffffff R11: fffffb56040428c0 R12: ffff98cecec0e798
lis 16 19:01:53 r6 kernel: R13: 0000000000001000 R14: 0000000000000000 R15: 00000000ffffffff
lis 16 19:01:53 r6 kernel: FS:  00007f54971ba140(0000) GS:ffff98dd4ea80000(0000) knlGS:0000000000000000
lis 16 19:01:53 r6 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
lis 16 19:01:53 r6 kernel: CR2: 0000000000000008 CR3: 0000000122536000 CR4: 0000000000750ee0
lis 16 19:01:53 r6 kernel: PKRU: 55555554
lis 16 19:01:53 r6 kernel: Call Trace:
lis 16 19:01:53 r6 kernel:  ? hw_atl_b0_hw_ring_rx_init+0x10e/0x120 [atlantic]
lis 16 19:01:53 r6 kernel:  aq_vec_init+0x91/0xf0 [atlantic]
lis 16 19:01:53 r6 kernel:  aq_nic_init+0xfd/0x1e0 [atlantic]
lis 16 19:01:53 r6 kernel:  aq_ndev_open+0x1c/0x70 [atlantic]
lis 16 19:01:53 r6 kernel:  __dev_open+0xef/0x1a0
lis 16 19:01:53 r6 kernel:  __dev_change_flags+0x1a3/0x210
lis 16 19:01:53 r6 kernel:  ? __snmp6_fill_stats64.constprop.0+0xe7/0x110
lis 16 19:01:53 r6 kernel:  dev_change_flags+0x26/0x60
lis 16 19:01:53 r6 kernel:  do_setlink+0x284/0xc40
lis 16 19:01:53 r6 kernel:  ? __nla_validate_parse+0x4c/0x1a0
lis 16 19:01:53 r6 kernel:  __rtnl_newlink+0x605/0xa00
lis 16 19:01:53 r6 kernel:  ? security_sock_rcv_skb+0x2f/0x50
lis 16 19:01:53 r6 kernel:  ? skb_queue_tail+0x43/0x50
lis 16 19:01:53 r6 kernel:  ? sock_def_readable+0x4b/0x80
lis 16 19:01:53 r6 kernel:  ? __netlink_sendskb+0x62/0x80
lis 16 19:01:53 r6 kernel:  ? netlink_unicast+0x2f3/0x330
lis 16 19:01:53 r6 kernel:  ? rtnl_getlink+0x392/0x410
lis 16 19:01:53 r6 kernel:  ? kmem_cache_alloc_trace+0xfb/0x240
lis 16 19:01:53 r6 kernel:  rtnl_newlink+0x49/0x70
lis 16 19:01:53 r6 kernel:  rtnetlink_rcv_msg+0x139/0x370
lis 16 19:01:53 r6 kernel:  ? slab_pre_alloc_hook.constprop.0+0x96/0xe0
lis 16 19:01:53 r6 kernel:  ? rtnl_calcit.isra.0+0x130/0x130
lis 16 19:01:53 r6 kernel:  netlink_rcv_skb+0x55/0x100
lis 16 19:01:53 r6 kernel:  rtnetlink_rcv+0x15/0x20
lis 16 19:01:53 r6 kernel:  netlink_unicast+0x21d/0x330
lis 16 19:01:53 r6 kernel:  netlink_sendmsg+0x246/0x490
lis 16 19:01:53 r6 kernel:  sock_sendmsg+0x65/0x70
lis 16 19:01:53 r6 kernel:  ____sys_sendmsg+0x24e/0x290
lis 16 19:01:53 r6 kernel:  ? import_iovec+0x31/0x40
lis 16 19:01:53 r6 kernel:  ? sendmsg_copy_msghdr+0x7b/0xa0
lis 16 19:01:53 r6 kernel:  ___sys_sendmsg+0x81/0xc0
lis 16 19:01:53 r6 kernel:  ? kvfree+0x2a/0x30
lis 16 19:01:53 r6 kernel:  ? kvfree+0x2a/0x30
lis 16 19:01:53 r6 kernel:  ? __fget_files+0x5f/0x90
lis 16 19:01:53 r6 kernel:  ? __fget_light+0x32/0x80
lis 16 19:01:53 r6 kernel:  __sys_sendmsg+0x62/0xb0
lis 16 19:01:53 r6 kernel:  __x64_sys_sendmsg+0x1d/0x20
lis 16 19:01:53 r6 kernel:  do_syscall_64+0x61/0xb0
lis 16 19:01:53 r6 kernel:  ? exit_to_user_mode_loop+0xec/0x160
lis 16 19:01:53 r6 kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
lis 16 19:01:53 r6 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
lis 16 19:01:53 r6 kernel:  ? __x64_sys_close+0x11/0x40
lis 16 19:01:53 r6 kernel:  ? do_syscall_64+0x6e/0xb0
lis 16 19:01:53 r6 kernel:  ? do_syscall_64+0x6e/0xb0
lis 16 19:01:53 r6 kernel:  ? sysvec_call_function+0x4e/0x90
lis 16 19:01:53 r6 kernel:  ? asm_sysvec_call_function+0xa/0x20
lis 16 19:01:53 r6 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
lis 16 19:01:53 r6 kernel: RIP: 0033:0x7f54981d03fd
lis 16 19:01:53 r6 kernel: Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 fa a4 f6 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 3e a5 f6 ff 48
lis 16 19:01:53 r6 kernel: RSP: 002b:00007fff2ac06630 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
lis 16 19:01:53 r6 kernel: RAX: ffffffffffffffda RBX: 00000000000000b0 RCX: 00007f54981d03fd
lis 16 19:01:53 r6 kernel: RDX: 0000000000000000 RSI: 00007fff2ac06670 RDI: 000000000000000c
lis 16 19:01:53 r6 kernel: RBP: 00005648aeb6d030 R08: 0000000000000000 R09: 0000000000000000
lis 16 19:01:53 r6 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
lis 16 19:01:53 r6 kernel: R13: 00007fff2ac067c0 R14: 00007fff2ac067bc R15: 0000000000000000
lis 16 19:01:53 r6 kernel: Modules linked in: atlantic nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CHECKSUM xt_MASQUERADE nf_nat_tftp ip6t_rpfilter nft_chain_nat nf_nat nf_conntrack_tftp xt_CT bridge stp llc ip_set binfmt_misc ip6t_REJECT nf_reject_ipv6 xt_hl snd_hda_codec_realtek intel_rapl_msr ip6_tables snd_hda_codec_generic intel_rapl_common ledtrig_audio ip6t_rt edac_mce_amd snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm_amd snd_usb_audio snd_hda_codec snd_usbmidi_lib snd_hda_core mc kvm snd_hwdep snd_pcm rapl snd_seq_midi snd_seq_midi_event snd_rawmidi eeepc_wmi ipt_REJECT snd_seq nf_reject_ipv4 wmi_bmof mxm_wmi snd_seq_device xt_LOG snd_timer efi_pstore nf_log_syslog zenpower(OE) k10temp ccp snd nft_limit joydev input_leds soundcore xt_limit mac_hid xt_addrtype nvidia_uvm(POE) xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nft_counter sch_fq_codel nf_tables nct6775 hwmon_vid nfnetlink msr parport_pc ppdev lp parport sunrpc ip_tables
lis 16 19:01:53 r6 kernel:  x_tables autofs4 btrfs blake2b_generic zstd_compress dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) mfd_aaeon asus_wmi sparse_keymap video drm_kms_helper crct10dif_pclmul syscopyarea crc32_pclmul sysfillrect ghash_clmulni_intel sysimgblt aesni_intel fb_sys_fops cec crypto_simd rc_core cryptd igb ahci dca macsec libahci drm nvme i2c_algo_bit xhci_pci i2c_piix4 xhci_pci_renesas nvme_core wmi [last unloaded: atlantic]
lis 16 19:01:53 r6 kernel: CR2: 0000000000000008
lis 16 19:01:53 r6 kernel: ---[ end trace 9cf65ec46006a312 ]---
lis 16 19:01:53 r6 kernel: RIP: 0010:aq_ring_rx_fill+0xd5/0x210 [atlantic]
lis 16 19:01:53 r6 kernel: Code: 24 31 d2 83 c0 01 41 3b 44 24 28 48 0f 43 c2 41 89 44 24 24 41 83 ef 01 0f 84 fc 00 00 00 48 8d 1c 40 48 c1 e3 04 49 03 1c 24 <48> 8b 43 08 48 c7 43 28 00 08 00 00 48 85 c0 75 81 49 8b 44 24 10
lis 16 19:01:53 r6 kernel: RSP: 0018:ffffb92282827368 EFLAGS: 00010246
lis 16 19:01:53 r6 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
lis 16 19:01:53 r6 kernel: RDX: 0000000000000000 RSI: 0000000000006100 RDI: ffff98cecec0e798
lis 16 19:01:53 r6 kernel: RBP: ffffb922828273a0 R08: 0000000000000000 R09: 0000000000008000
lis 16 19:01:53 r6 kernel: R10: 00000000ffffffff R11: fffffb56040428c0 R12: ffff98cecec0e798
lis 16 19:01:53 r6 kernel: R13: 0000000000001000 R14: 0000000000000000 R15: 00000000ffffffff
lis 16 19:01:53 r6 kernel: FS:  00007f54971ba140(0000) GS:ffff98dd4ea80000(0000) knlGS:0000000000000000
lis 16 19:01:53 r6 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
lis 16 19:01:53 r6 kernel: CR2: 0000000000000008 CR3: 0000000122536000 CR4: 0000000000750ee0
lis 16 19:01:53 r6 kernel: PKRU: 55555554

The text was updated successfully, but these errors were encountered:

koo5 · 2021-11-16T18:42:41Z

Linux r6 5.13.0-21-generic #21-Ubuntu SMP Tue Oct 19 08:59:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

cail · 2021-11-22T16:56:22Z

Could you please validate latest 5.13.x ? there was a similar problem which was fixed/backported in there.

koo5 · 2021-11-22T18:52:31Z

thanks, but not sure where to find a 5.13 newer than 5.13.21?
latest 5.13.19 here: https://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D
i tried v5.16-rc2 but there's a different crash right after boot:
https://gist.github.com/koo5/4c6558a860475f8c4a1cac2f77a5a570

cail · 2021-11-22T19:46:18Z

So you mean 5.13.19 also has this problem??

koo5 · 2021-11-23T01:22:27Z

yes, the same as 5.13.21.

eitch · 2022-02-01T22:18:39Z

I seem to have the same issue:

$ uname -a
Linux eitchtower 5.16.4-051604-generic #202201310852-Ubuntu SMP PREEMPT Mon Jan 31 09:02:19 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Feb  1 23:13:13 mypc kernel: [    8.091191] RIP: 0010:rcu_note_context_switch+0x462/0x4a0
Feb  1 23:13:13 mypc kernel: [    8.091191] Code: 08 48 89 86 40 08 00 00 4c 89 86 48 08 00 00 49 89 10 e9 84 fe ff ff 48 c7 c7 e0 29 3c 9b c6 05 de 73 fe 01 01 e8 81 0e ab 00 <0f> 0b e9 ec fb ff ff 0f 0b e9 7e fd ff ff 0f 0b e9 60 fd ff ff 0f
Feb  1 23:13:13 mypc kernel: [    8.091192] RSP: 0018:ffffad63c25bee18 EFLAGS: 00010086
Feb  1 23:13:13 mypc kernel: [    8.091193] RAX: 0000000000000000 RBX: ffff97006ea31f00 RCX: 0000000000000027
Feb  1 23:13:13 mypc kernel: [    8.091194] RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff97006ea20988
Feb  1 23:13:13 mypc kernel: [    8.091195] RBP: ffffad63c25bee38 R08: ffff97006ea20980 R09: ffffad63c25bec00
Feb  1 23:13:13 mypc kernel: [    8.091196] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
Feb  1 23:13:13 mypc kernel: [    8.091196] R13: ffff96f18ce74c80 R14: ffff96f18ce74c80 R15: 0000000000000000
Feb  1 23:13:13 mypc kernel: [    8.091197] FS:  00007f8425114140(0000) GS:ffff97006ea00000(0000) knlGS:0000000000000000
Feb  1 23:13:13 mypc kernel: [    8.091198] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  1 23:13:13 mypc kernel: [    8.091199] CR2: 00007fa0f0df9710 CR3: 0000000109a5c000 CR4: 0000000000750ef0
Feb  1 23:13:13 mypc kernel: [    8.091199] PKRU: 55555554
Feb  1 23:13:13 mypc kernel: [    8.091200] Call Trace:
Feb  1 23:13:13 mypc kernel: [    8.091200]  <TASK>
Feb  1 23:13:13 mypc kernel: [    8.091201]  __schedule+0xbb/0x7f0
Feb  1 23:13:13 mypc kernel: [    8.091203]  ? ttwu_do_activate+0xb6/0x510
Feb  1 23:13:13 mypc kernel: [    8.091206]  schedule+0x55/0xd0
Feb  1 23:13:13 mypc kernel: [    8.091207]  schedule_timeout+0x200/0x290
Feb  1 23:13:13 mypc kernel: [    8.091208]  wait_for_completion+0x8b/0xf0
Feb  1 23:13:13 mypc kernel: [    8.091209]  virt_efi_query_variable_info+0x146/0x160
Feb  1 23:13:13 mypc kernel: [    8.091212]  efi_query_variable_store+0x53/0x190
Feb  1 23:13:13 mypc kernel: [    8.091214]  ? vsnprintf+0x71/0x4e0
Feb  1 23:13:13 mypc kernel: [    8.091216]  efivar_entry_set_safe+0xbf/0x210
Feb  1 23:13:13 mypc kernel: [    8.091219]  efi_pstore_write+0x125/0x1a0 [efi_pstore]
Feb  1 23:13:13 mypc kernel: [    8.091221]  pstore_dump+0x11c/0x300
Feb  1 23:13:13 mypc kernel: [    8.091223]  ? vprintk_emit+0xf9/0x230
Feb  1 23:13:13 mypc kernel: [    8.091225]  kmsg_dump+0x4e/0x70
Feb  1 23:13:13 mypc kernel: [    8.091227]  oops_exit+0x27/0x30
Feb  1 23:13:13 mypc kernel: [    8.091228]  oops_end+0x68/0xd0
Feb  1 23:13:13 mypc kernel: [    8.091230]  die+0x43/0x50
Feb  1 23:13:13 mypc kernel: [    8.091231]  do_trap+0xd7/0x100
Feb  1 23:13:13 mypc kernel: [    8.091232]  do_error_trap+0x77/0xa0
Feb  1 23:13:13 mypc kernel: [    8.091233]  ? aq_nic_start+0x363/0x370 [atlantic]
Feb  1 23:13:13 mypc kernel: [    8.091237]  exc_invalid_op+0x53/0x70
Feb  1 23:13:13 mypc kernel: [    8.091238]  ? aq_nic_start+0x363/0x370 [atlantic]
Feb  1 23:13:13 mypc kernel: [    8.091241]  asm_exc_invalid_op+0x12/0x20
Feb  1 23:13:13 mypc kernel: [    8.091243] RIP: 0010:aq_nic_start+0x363/0x370 [atlantic]
Feb  1 23:13:13 mypc kernel: [    8.091246] Code: cc 50 c0 e8 4f fa a1 d9 41 89 c4 85 c0 0f 88 57 fd ff ff 41 8b 8d 94 01 00 00 b8 01 00 00 00 d3 e0 41 09 85 f8 05 00 00 eb 84 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5
Feb  1 23:13:13 mypc kernel: [    8.091247] RSP: 0018:ffffad63c25bf468 EFLAGS: 00010202
Feb  1 23:13:13 mypc kernel: [    8.091247] RAX: 0000000000000008 RBX: 0000000000000008 RCX: ffff96f1a15b6000
Feb  1 23:13:13 mypc kernel: [    8.091248] RDX: 0000000000000010 RSI: ffffad63c18e5be8 RDI: ffff96f19fbbe428
Feb  1 23:13:13 mypc kernel: [    8.091249] RBP: ffffad63c25bf480 R08: 0000000020000800 R09: 0000000000000000
Feb  1 23:13:13 mypc kernel: [    8.091249] R10: 0000000000000000 R11: 000000000000003f R12: 0000000000000000
Feb  1 23:13:13 mypc kernel: [    8.091250] R13: ffff96f1a15b6980 R14: 0000000000000001 R15: 0000000000000000
Feb  1 23:13:13 mypc kernel: [    8.091251]  ? aq_nic_start+0x9c/0x370 [atlantic]
Feb  1 23:13:13 mypc kernel: [    8.091254]  aq_ndev_open+0x49/0x70 [atlantic]
Feb  1 23:13:13 mypc kernel: [    8.091257]  __dev_open+0xef/0x180
Feb  1 23:13:13 mypc kernel: [    8.091259]  __dev_change_flags+0x188/0x1e0
Feb  1 23:13:13 mypc kernel: [    8.091260]  dev_change_flags+0x26/0x60
Feb  1 23:13:13 mypc kernel: [    8.091261]  do_setlink+0x908/0xc40
Feb  1 23:13:13 mypc kernel: [    8.091262]  ? __nla_validate_parse+0x4c/0x1a0
Feb  1 23:13:13 mypc kernel: [    8.091264]  __rtnl_newlink+0x597/0xa10
Feb  1 23:13:13 mypc kernel: [    8.091265]  ? __nla_put+0x20/0x30
Feb  1 23:13:13 mypc kernel: [    8.091266]  ? pskb_expand_head+0x75/0x310
Feb  1 23:13:13 mypc kernel: [    8.091268]  ? security_sock_rcv_skb+0x2f/0x50
Feb  1 23:13:13 mypc kernel: [    8.091270]  ? netlink_deliver_tap+0x49/0x240
Feb  1 23:13:13 mypc kernel: [    8.091271]  ? sk_filter_trim_cap+0xbf/0x220
Feb  1 23:13:13 mypc kernel: [    8.091272]  ? skb_queue_tail+0x48/0x50
Feb  1 23:13:13 mypc kernel: [    8.091273]  ? sock_def_readable+0x50/0x80
Feb  1 23:13:13 mypc kernel: [    8.091275]  ? __netlink_sendskb+0x42/0x50
Feb  1 23:13:13 mypc kernel: [    8.091276]  ? netlink_unicast+0x213/0x250
Feb  1 23:13:13 mypc kernel: [    8.091277]  ? rtnl_getlink+0x376/0x400
Feb  1 23:13:13 mypc kernel: [    8.091280]  ? security_capable+0x3d/0x60
Feb  1 23:13:13 mypc kernel: [    8.091281]  ? kmem_cache_alloc_trace+0x193/0x470
Feb  1 23:13:13 mypc kernel: [    8.091283]  rtnl_newlink+0x49/0x70
Feb  1 23:13:13 mypc kernel: [    8.091284]  rtnetlink_rcv_msg+0x160/0x3b0
Feb  1 23:13:13 mypc kernel: [    8.091286]  ? __mark_inode_dirty+0x105/0x350
Feb  1 23:13:13 mypc kernel: [    8.091288]  ? rtnl_calcit.isra.0+0x130/0x130
Feb  1 23:13:13 mypc kernel: [    8.091290]  netlink_rcv_skb+0x55/0x100
Feb  1 23:13:13 mypc kernel: [    8.091291]  rtnetlink_rcv+0x15/0x20
Feb  1 23:13:13 mypc kernel: [    8.091292]  netlink_unicast+0x1a8/0x250
Feb  1 23:13:13 mypc kernel: [    8.091294]  netlink_sendmsg+0x245/0x490
Feb  1 23:13:13 mypc kernel: [    8.091295]  ? _copy_from_user+0x2b/0x60
Feb  1 23:13:13 mypc kernel: [    8.091296]  sock_sendmsg+0x65/0x70
Feb  1 23:13:13 mypc kernel: [    8.091297]  ____sys_sendmsg+0x218/0x290
Feb  1 23:13:13 mypc kernel: [    8.091298]  ? copy_msghdr_from_user+0x5c/0x90
Feb  1 23:13:13 mypc kernel: [    8.091299]  ? addrconf_sysctl_forward+0x111/0x280
Feb  1 23:13:13 mypc kernel: [    8.091301]  ? _copy_from_iter+0x92/0x590
Feb  1 23:13:13 mypc kernel: [    8.091303]  ___sys_sendmsg+0x81/0xc0
Feb  1 23:13:13 mypc kernel: [    8.091304]  ? mntput_no_expire+0x53/0x260
Feb  1 23:13:13 mypc kernel: [    8.091305]  ? __call_rcu+0xac/0x270
Feb  1 23:13:13 mypc kernel: [    8.091306]  ? call_rcu+0xe/0x10
Feb  1 23:13:13 mypc kernel: [    8.091308]  ? __fget_light+0xc5/0x100
Feb  1 23:13:13 mypc kernel: [    8.091310]  __sys_sendmsg+0x62/0xb0
Feb  1 23:13:13 mypc kernel: [    8.091311]  __x64_sys_sendmsg+0x1f/0x30
Feb  1 23:13:13 mypc kernel: [    8.091312]  do_syscall_64+0x5c/0xc0
Feb  1 23:13:13 mypc kernel: [    8.091313]  ? do_syscall_64+0x69/0xc0
Feb  1 23:13:13 mypc kernel: [    8.091314]  ? asm_exc_page_fault+0x8/0x30
Feb  1 23:13:13 mypc kernel: [    8.091315]  entry_SYSCALL_64_after_hwframe+0x44/0xae
Feb  1 23:13:13 mypc kernel: [    8.091317] RIP: 0033:0x7f842612a3fd
Feb  1 23:13:13 mypc kernel: [    8.091318] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 fa a4 f6 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 3e a5 f6 ff 48
Feb  1 23:13:13 mypc kernel: [    8.091318] RSP: 002b:00007ffe82ca9510 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
Feb  1 23:13:13 mypc kernel: [    8.091319] RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f842612a3fd
Feb  1 23:13:13 mypc kernel: [    8.091320] RDX: 0000000000000000 RSI: 00007ffe82ca9550 RDI: 000000000000000c
Feb  1 23:13:13 mypc kernel: [    8.091320] RBP: 000055bf626d4030 R08: 0000000000000000 R09: 0000000000000000
Feb  1 23:13:13 mypc kernel: [    8.091321] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
Feb  1 23:13:13 mypc kernel: [    8.091322] R13: 00007ffe82ca96a0 R14: 00007ffe82ca969c R15: 0000000000000000
Feb  1 23:13:13 mypc kernel: [    8.091323]  </TASK>
Feb  1 23:13:13 mypc kernel: [    8.091323] ---[ end trace 96a2807216b1f271 ]---
Feb  1 23:13:13 mypc kernel: [    8.239615] RIP: 0010:aq_nic_start+0x363/0x370 [atlantic]
Feb  1 23:13:13 mypc kernel: [    8.239622] Code: cc 50 c0 e8 4f fa a1 d9 41 89 c4 85 c0 0f 88 57 fd ff ff 41 8b 8d 94 01 00 00 b8 01 00 00 00 d3 e0 41 09 85 f8 05 00 00 eb 84 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5
Feb  1 23:13:13 mypc kernel: [    8.239624] RSP: 0018:ffffad63c25bf468 EFLAGS: 00010202
Feb  1 23:13:13 mypc kernel: [    8.239625] RAX: 0000000000000008 RBX: 0000000000000008 RCX: ffff96f1a15b6000
Feb  1 23:13:13 mypc kernel: [    8.239626] RDX: 0000000000000010 RSI: ffffad63c18e5be8 RDI: ffff96f19fbbe428
Feb  1 23:13:13 mypc kernel: [    8.239627] RBP: ffffad63c25bf480 R08: 0000000020000800 R09: 0000000000000000
Feb  1 23:13:13 mypc kernel: [    8.239627] R10: 0000000000000000 R11: 000000000000003f R12: 0000000000000000
Feb  1 23:13:13 mypc kernel: [    8.239628] R13: ffff96f1a15b6980 R14: 0000000000000001 R15: 0000000000000000
Feb  1 23:13:13 mypc kernel: [    8.239629] FS:  00007f8425114140(0000) GS:ffff97006eec0000(0000) knlGS:0000000000000000
Feb  1 23:13:13 mypc kernel: [    8.239630] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  1 23:13:13 mypc kernel: [    8.239631] CR2: 00007f97ec008008 CR3: 0000000109a5c000 CR4: 0000000000750ee0
Feb  1 23:13:13 mypc kernel: [    8.239631] PKRU: 55555554

cail · 2022-02-03T10:35:03Z

The problem is floating between the kernels, still no root cause found.

labre · 2022-04-07T21:49:49Z

The problem is floating between the kernels, still no root cause found.

Are you sure, it’s inconsistently present? I get consistently a null deref in aq_ring_rx_fill for the latest 3 stable/longterm kernel releases and mainline¹ when hibernating with the atlantic network interface up. I’ve found a line to be consistently referenced in vmcore dumps of null pointer dereferences.

@koo5: I see resume log messages in your log. Was that resume from hibernation and was the device turned off by that hibernation or stayed it on?

For me the hibernation immediately resumes without switching the PC off. After that “resume” the null pointer deref occurs. There is no indication from kernel messages, why the system resumes immediately after writing the image. I captured vmcores for the latter three of ¹5.10.109, 5.15.32, 5.17.1, 5.18-rc1 with crashkernel and panic_on_oops and tracked it down to this line in 5.18-rc1:

buff->flags = 0U; aq_ring.c#L545

dmesg-5.18-rc1.log
crash-5.18-rc1.log
aq_ring-5.18-rc1_objectDumped.txt
kernel-config-5.18-rc1-x86_64.txt

The other logs from Redhat’s crash utility look similar, even with the same offset. I could not retrieve the vmcore dump from 5.10.109 because crashkernel could not reinitialize USB and SATA when booting. Note, kernel is tainted by an amdgpu warning. Probably some module/builtin incompatibility. Needed it builtin, because crashkernel would leave me blind otherwise.

In contrary to that, here are dmesg logs of one hibernation with the oldest AQC107 supporting longterm kernels:
dmesg-4.19.237-x86_64.log
dmesg-5.4.188-x86_64.log

Bisecting is not easily possible unfortunately, because all the latest, but unsupported minor versions between 5.4 and 5.10 do not compile with my toolchain. Tried also older gcc versions and -Wno-error, but to no avail.

I hope you can just build the kernel and reproduce the issue, because I can’t provide the core dumps. They contain my LUKS key in kmem and since it is distributed all over it 1, I’d rather not try to find and scramble it. If you need any more info like other gdb/crash outputs or the other used kernel configs (basically an olddefconfig of the one provided), leave me a note.

OS and build environment

Portage 3.0.30 (python 3.9.9-final-0, default/linux/amd64/17.1/hardened, gcc-11.2.1, glibc-2.34-r10, 5.17.1-gentoo-r1-x86_64 x86_64)

System uname: Linux-5.17.1-gentoo-r1-x86_64-x86_64-AMD_Ryzen_Threadripper_3990X_64-Core_Processor-with-glibc2.34
KiB Mem: 131811644 total, 123581920 free
KiB Swap: 268435452 total, 268435452 free
Timestamp of repository gentoo: Tue, 05 Apr 2022 20:30:01 +0000 ← last update
Head commit of repository gentoo: f3363e64a5c00c663f0935feff6b7f10e8f1d5f9
Timestamp of repository xwing: Tue, 05 Apr 2022 20:00:05 +0000
sh bash 5.1_p16
ld GNU ld (Gentoo 2.37_p1 p2) 2.37
distcc 3.4 x86_64-pc-linux-gnu [disabled]
app-misc/pax-utils: 1.3.3::gentoo
app-shells/bash: 5.1_p16::gentoo
dev-java/java-config: 2.3.1::gentoo
dev-lang/perl: 5.34.0-r6::gentoo
dev-lang/python: 3.8.13::gentoo, 3.9.9-r1::gentoo, 3.10.2_p1::gentoo
dev-lang/rust: 1.58.1::gentoo
dev-lang/rust-bin: 1.58.1::gentoo
dev-util/cmake: 3.22.2::gentoo
dev-util/meson: 0.61.4-r2::gentoo
sys-apps/baselayout: 2.7-r3::gentoo
sys-apps/openrc: 0.44.10::gentoo
sys-apps/sandbox: 2.29::gentoo
sys-devel/autoconf: 2.13-r1::gentoo, 2.71-r1::gentoo
sys-devel/automake: 1.16.4::gentoo
sys-devel/binutils: 2.37_p1-r2::gentoo
sys-devel/binutils-config: 5.4.1::gentoo
sys-devel/clang: 13.0.1::gentoo
sys-devel/gcc: 9.4.0::gentoo, 10.3.1_p20211126::gentoo, 11.2.1_p20220115::gentoo
sys-devel/gcc-config: 2.5-r1::gentoo
sys-devel/libtool: 2.4.6-r6::gentoo
sys-devel/lld: 13.0.1::gentoo
sys-devel/llvm: 13.0.1::gentoo
sys-devel/make: 4.3::gentoo
sys-kernel/linux-headers: 5.15-r3::gentoo (virtual/os-headers)
sys-libs/glibc: 2.34-r10::gentoo

From 6c4cd8210835489da84bf4ee5049945dc0f2c986 Mon Sep 17 00:00:00 2001 Date: Mon, 18 Apr 2022 00:20:01 +0200 This will reset deeply on freeze and thaw instead of suspend and resume and prevent null pointer dereferences of the uninitialized ring 0 buffer while thawing. The impact is an indefinitely hanging kernel. You can't switch consoles after this and the only possible user interaction is SysRq. BUG: kernel NULL pointer dereference RIP: 0010:aq_ring_rx_fill+0xcf/0x210 [atlantic] aq_vec_init+0x85/0xe0 [atlantic] aq_nic_init+0xf7/0x1d0 [atlantic] atl_resume_common+0x4f/0x100 [atlantic] pci_pm_thaw+0x42/0xa0 resolves in aq_ring.o to ``` 0000000000000ae0 <aq_ring_rx_fill>: { /* ... */ baf: 48 8b 43 08 mov 0x8(%rbx),%rax buff->flags = 0U; /* buff is NULL */ ``` The bug has been present since the introduction of the new pm code in 8aaa112 ("net: atlantic: refactoring pm logic") and was hidden until 8ce8427 ("net: atlantic: changes for multi-TC support"), which refactored the aq_vec_{free,alloc} functions into aq_vec_{,ring}_{free,alloc}, but is technically not wrong. The original functions just always reinitialized the buffers on S3/S4. If the interface is down before freezing, the bug does not occur. It does not matter, whether the initrd contains and loads the module before thawing. So the fix is to invert the boolean parameter deep in all pm function calls, which was clearly intended to be set like that. First report was on Github [1], which you have to guess from the resume logs in the posted dmesg snippet. Recently I posted one on Bugzilla [2], since I did not have an AQC device so far. #regzbot introduced: 8ce8427 #regzbot from: koo5 <[email protected]> #regzbot monitor: Aquantia/AQtion#32 Fixes: 8aaa112 ("net: atlantic: refactoring pm logic") Link: Aquantia/AQtion#32 [1] Link: https://bugzilla.kernel.org/show_bug.cgi?id=215798 [2] Cc: [email protected] Reported-by: koo5 <[email protected]> Signed-off-by: Manuel Ullmann <[email protected]>

labre · 2022-04-18T12:34:30Z

I’ve posted a patch for this, since I found the root cause. The deep parameter for pm functions in aq_pci_func.c was inverted for all calls, so suspend/resume did a deep reset, while freeze/thaw did a normal reset.

This was present since the introduction of the new pm API usage in 5.4-rc5 with 8aaa112a57c1 ("net: atlantic: refactoring pm logic"), however it had no effect until aq_vec_{free,alloc} were split in 8ce84271697a("net: atlantic: changes for multi-TC support"), leading to ring buffers not being reinitialized on non-deep pm calls. The latter commit is found by bisect.

@koo5: Thank you for reporting this; I mentioned you in the Reported-by tag.

https://lore.kernel.org/netdev/[email protected]/T/#u

This will reset deeply on freeze and thaw instead of suspend and resume and prevent null pointer dereferences of the uninitialized ring 0 buffer while thawing. The impact is an indefinitely hanging kernel. You can't switch consoles after this and the only possible user interaction is SysRq. BUG: kernel NULL pointer dereference RIP: 0010:aq_ring_rx_fill+0xcf/0x210 [atlantic] aq_vec_init+0x85/0xe0 [atlantic] aq_nic_init+0xf7/0x1d0 [atlantic] atl_resume_common+0x4f/0x100 [atlantic] pci_pm_thaw+0x42/0xa0 resolves in aq_ring.o to ``` 0000000000000ae0 <aq_ring_rx_fill>: { /* ... */ baf: 48 8b 43 08 mov 0x8(%rbx),%rax buff->flags = 0U; /* buff is NULL */ ``` The bug has been present since the introduction of the new pm code in 8aaa112 ("net: atlantic: refactoring pm logic") and was hidden until 8ce8427 ("net: atlantic: changes for multi-TC support"), which refactored the aq_vec_{free,alloc} functions into aq_vec_{,ring}_{free,alloc}, but is technically not wrong. The original functions just always reinitialized the buffers on S3/S4. If the interface is down before freezing, the bug does not occur. It does not matter, whether the initrd contains and loads the module before thawing. So the fix is to invert the boolean parameter deep in all pm function calls, which was clearly intended to be set like that. First report was on Github [1], which you have to guess from the resume logs in the posted dmesg snippet. Recently I posted one on Bugzilla [2], since I did not have an AQC device so far. #regzbot introduced: 8ce8427 #regzbot from: koo5 <[email protected]> #regzbot monitor: Aquantia/AQtion#32 Fixes: 8aaa112 ("net: atlantic: refactoring pm logic") Link: Aquantia/AQtion#32 [1] Link: https://bugzilla.kernel.org/show_bug.cgi?id=215798 [2] Cc: [email protected] Reported-by: koo5 <[email protected]> Signed-off-by: Manuel Ullmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>

labre · 2022-04-18T14:14:20Z

Hm, taking a closer look at koo5’s reported trace, this is a null dereference on resume, so it could not be related, unless the deep reset on suspend/resume causes null derefs for other devices. Getting some retesting from the reporters might help to clear this up.

@koo5, @eitch: Could you please retest any of the patches, applied against the kernel version, indicated by filename? Thanks in advance.

5.10.111-invert-deep-par.patch.txt
5.15.34-invert-deep-par.patch.txt
5.17.3-invert-deep-par.patch.txt
5.18-rc2_invert-deep-par.patch.txt

commit cbe6c3a upstream. This will reset deeply on freeze and thaw instead of suspend and resume and prevent null pointer dereferences of the uninitialized ring 0 buffer while thawing. The impact is an indefinitely hanging kernel. You can't switch consoles after this and the only possible user interaction is SysRq. BUG: kernel NULL pointer dereference RIP: 0010:aq_ring_rx_fill+0xcf/0x210 [atlantic] aq_vec_init+0x85/0xe0 [atlantic] aq_nic_init+0xf7/0x1d0 [atlantic] atl_resume_common+0x4f/0x100 [atlantic] pci_pm_thaw+0x42/0xa0 resolves in aq_ring.o to ``` 0000000000000ae0 <aq_ring_rx_fill>: { /* ... */ baf: 48 8b 43 08 mov 0x8(%rbx),%rax buff->flags = 0U; /* buff is NULL */ ``` The bug has been present since the introduction of the new pm code in 8aaa112 ("net: atlantic: refactoring pm logic") and was hidden until 8ce8427 ("net: atlantic: changes for multi-TC support"), which refactored the aq_vec_{free,alloc} functions into aq_vec_{,ring}_{free,alloc}, but is technically not wrong. The original functions just always reinitialized the buffers on S3/S4. If the interface is down before freezing, the bug does not occur. It does not matter, whether the initrd contains and loads the module before thawing. So the fix is to invert the boolean parameter deep in all pm function calls, which was clearly intended to be set like that. First report was on Github [1], which you have to guess from the resume logs in the posted dmesg snippet. Recently I posted one on Bugzilla [2], since I did not have an AQC device so far. #regzbot introduced: 8ce8427 #regzbot from: koo5 <[email protected]> #regzbot monitor: Aquantia/AQtion#32 Fixes: 8aaa112 ("net: atlantic: refactoring pm logic") Link: Aquantia/AQtion#32 [1] Link: https://bugzilla.kernel.org/show_bug.cgi?id=215798 [2] Cc: [email protected] Reported-by: koo5 <[email protected]> Signed-off-by: Manuel Ullmann <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

koo5 · 2022-04-28T21:33:02Z

Much appreciated. Sorry i did not get to testing your patches yet. But:

Was that resume from hibernation and was the device turned off by that hibernation or stayed it on?

it was from an attempted suspend, the suspend never works, my machine resumes immediately.

labre · 2022-04-29T11:40:48Z

Much appreciated. Sorry i did not get to testing your patches yet. But:

Never mind. This is patched now in stable as of 5.17.5, 5.15.36 and 5.10.113. Could you try any of those kernel versions?

Was that resume from hibernation and was the device turned off by that hibernation or stayed it on?

it was from an attempted suspend, the suspend never works, my machine resumes immediately.

Oh, I see. I’ve also seen that behaviour with hibernation shifting between null deref during hibernation (causing immediate thaw) and after hibernation while bisecting, so Igor’s impression might be valid. I could not reproduce the kernel oops before hibernation with the latest stable patches before my patch though, e.g. 5.10.110. Which kernel version do you use?

I never had issues with suspend in 5.15.32 (dist-kernel), any 5.16 version (can’t recall them) and 5.17. Since I was investigating hibernation, I did not try suspend in older kernel releases, but I can give them a try. If you can still reproduce this, knowing the reproducing kernel version would be much appreciated. :)

Side note:
I could not bisect with the whole tree, so other bug sources were not excluded. Any version between 5.4 and 5.10 booted into a panic before the console was initialized. Neither have an USB debug port nor a serial port on my mainboard, so I just applied the commit state of atlantic on top of v5.10.110 and limited bisect to commits changing it.

commit cbe6c3a upstream. This will reset deeply on freeze and thaw instead of suspend and resume and prevent null pointer dereferences of the uninitialized ring 0 buffer while thawing. The impact is an indefinitely hanging kernel. You can't switch consoles after this and the only possible user interaction is SysRq. BUG: kernel NULL pointer dereference RIP: 0010:aq_ring_rx_fill+0xcf/0x210 [atlantic] aq_vec_init+0x85/0xe0 [atlantic] aq_nic_init+0xf7/0x1d0 [atlantic] atl_resume_common+0x4f/0x100 [atlantic] pci_pm_thaw+0x42/0xa0 resolves in aq_ring.o to ``` 0000000000000ae0 <aq_ring_rx_fill>: { /* ... */ baf: 48 8b 43 08 mov 0x8(%rbx),%rax buff->flags = 0U; /* buff is NULL */ ``` The bug has been present since the introduction of the new pm code in 8aaa112 ("net: atlantic: refactoring pm logic") and was hidden until 8ce8427 ("net: atlantic: changes for multi-TC support"), which refactored the aq_vec_{free,alloc} functions into aq_vec_{,ring}_{free,alloc}, but is technically not wrong. The original functions just always reinitialized the buffers on S3/S4. If the interface is down before freezing, the bug does not occur. It does not matter, whether the initrd contains and loads the module before thawing. So the fix is to invert the boolean parameter deep in all pm function calls, which was clearly intended to be set like that. First report was on Github [1], which you have to guess from the resume logs in the posted dmesg snippet. Recently I posted one on Bugzilla [2], since I did not have an AQC device so far. #regzbot introduced: 8ce8427 #regzbot from: koo5 <[email protected]> #regzbot monitor: Aquantia/AQtion#32 Fixes: 8aaa112 ("net: atlantic: refactoring pm logic") Link: Aquantia/AQtion#32 [1] Link: https://bugzilla.kernel.org/show_bug.cgi?id=215798 [2] Cc: [email protected] Reported-by: koo5 <[email protected]> Signed-off-by: Manuel Ullmann <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

amarshall · 2022-05-06T15:26:39Z

@labre Hi, the change associated with this appears to have caused a regression. See https://bugzilla.kernel.org/show_bug.cgi?id=215949

labre · 2022-05-06T17:25:30Z

@amarshall Thanks for your bug report and I’m sorry for the regression. I screwed up suspend testing by kexec-rebooting in an unpatched kernel. This is handled in
https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

commit cbe6c3a8f8f4315b96e46e1a1c70393c06d95a4c upstream. This will reset deeply on freeze and thaw instead of suspend and resume and prevent null pointer dereferences of the uninitialized ring 0 buffer while thawing. The impact is an indefinitely hanging kernel. You can't switch consoles after this and the only possible user interaction is SysRq. BUG: kernel NULL pointer dereference RIP: 0010:aq_ring_rx_fill+0xcf/0x210 [atlantic] aq_vec_init+0x85/0xe0 [atlantic] aq_nic_init+0xf7/0x1d0 [atlantic] atl_resume_common+0x4f/0x100 [atlantic] pci_pm_thaw+0x42/0xa0 resolves in aq_ring.o to ``` 0000000000000ae0 <aq_ring_rx_fill>: { /* ... */ baf: 48 8b 43 08 mov 0x8(%rbx),%rax buff->flags = 0U; /* buff is NULL */ ``` The bug has been present since the introduction of the new pm code in 8aaa112 ("net: atlantic: refactoring pm logic") and was hidden until 8ce8427 ("net: atlantic: changes for multi-TC support"), which refactored the aq_vec_{free,alloc} functions into aq_vec_{,ring}_{free,alloc}, but is technically not wrong. The original functions just always reinitialized the buffers on S3/S4. If the interface is down before freezing, the bug does not occur. It does not matter, whether the initrd contains and loads the module before thawing. So the fix is to invert the boolean parameter deep in all pm function calls, which was clearly intended to be set like that. First report was on Github [1], which you have to guess from the resume logs in the posted dmesg snippet. Recently I posted one on Bugzilla [2], since I did not have an AQC device so far. #regzbot introduced: 8ce8427 #regzbot from: koo5 <[email protected]> #regzbot monitor: Aquantia/AQtion#32 Fixes: 8aaa112 ("net: atlantic: refactoring pm logic") Link: Aquantia/AQtion#32 [1] Link: https://bugzilla.kernel.org/show_bug.cgi?id=215798 [2] Cc: [email protected] Reported-by: koo5 <[email protected]> Signed-off-by: Manuel Ullmann <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

BugLink: https://bugs.launchpad.net/bugs/1972905 commit cbe6c3a upstream. This will reset deeply on freeze and thaw instead of suspend and resume and prevent null pointer dereferences of the uninitialized ring 0 buffer while thawing. The impact is an indefinitely hanging kernel. You can't switch consoles after this and the only possible user interaction is SysRq. BUG: kernel NULL pointer dereference RIP: 0010:aq_ring_rx_fill+0xcf/0x210 [atlantic] aq_vec_init+0x85/0xe0 [atlantic] aq_nic_init+0xf7/0x1d0 [atlantic] atl_resume_common+0x4f/0x100 [atlantic] pci_pm_thaw+0x42/0xa0 resolves in aq_ring.o to ``` 0000000000000ae0 <aq_ring_rx_fill>: { /* ... */ baf: 48 8b 43 08 mov 0x8(%rbx),%rax buff->flags = 0U; /* buff is NULL */ ``` The bug has been present since the introduction of the new pm code in 8aaa112 ("net: atlantic: refactoring pm logic") and was hidden until 8ce8427 ("net: atlantic: changes for multi-TC support"), which refactored the aq_vec_{free,alloc} functions into aq_vec_{,ring}_{free,alloc}, but is technically not wrong. The original functions just always reinitialized the buffers on S3/S4. If the interface is down before freezing, the bug does not occur. It does not matter, whether the initrd contains and loads the module before thawing. So the fix is to invert the boolean parameter deep in all pm function calls, which was clearly intended to be set like that. First report was on Github [1], which you have to guess from the resume logs in the posted dmesg snippet. Recently I posted one on Bugzilla [2], since I did not have an AQC device so far. #regzbot introduced: 8ce8427 #regzbot from: koo5 <[email protected]> #regzbot monitor: Aquantia/AQtion#32 Fixes: 8aaa112 ("net: atlantic: refactoring pm logic") Link: Aquantia/AQtion#32 [1] Link: https://bugzilla.kernel.org/show_bug.cgi?id=215798 [2] Cc: [email protected] Reported-by: koo5 <[email protected]> Signed-off-by: Manuel Ullmann <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Kamal Mostafa <[email protected]> Signed-off-by: Kleber Sacilotto de Souza <[email protected]>

stable inclusion from stable-v5.10.113 commit ba2716da233618c3f361b8ece818a0e91de7a8f0 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5ISAH Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ba2716da233618c3f361b8ece818a0e91de7a8f0 -------------------------------- commit cbe6c3a upstream. This will reset deeply on freeze and thaw instead of suspend and resume and prevent null pointer dereferences of the uninitialized ring 0 buffer while thawing. The impact is an indefinitely hanging kernel. You can't switch consoles after this and the only possible user interaction is SysRq. BUG: kernel NULL pointer dereference RIP: 0010:aq_ring_rx_fill+0xcf/0x210 [atlantic] aq_vec_init+0x85/0xe0 [atlantic] aq_nic_init+0xf7/0x1d0 [atlantic] atl_resume_common+0x4f/0x100 [atlantic] pci_pm_thaw+0x42/0xa0 resolves in aq_ring.o to ``` 0000000000000ae0 <aq_ring_rx_fill>: { /* ... */ baf: 48 8b 43 08 mov 0x8(%rbx),%rax buff->flags = 0U; /* buff is NULL */ ``` The bug has been present since the introduction of the new pm code in 8aaa112 ("net: atlantic: refactoring pm logic") and was hidden until 8ce8427 ("net: atlantic: changes for multi-TC support"), which refactored the aq_vec_{free,alloc} functions into aq_vec_{,ring}_{free,alloc}, but is technically not wrong. The original functions just always reinitialized the buffers on S3/S4. If the interface is down before freezing, the bug does not occur. It does not matter, whether the initrd contains and loads the module before thawing. So the fix is to invert the boolean parameter deep in all pm function calls, which was clearly intended to be set like that. First report was on Github [1], which you have to guess from the resume logs in the posted dmesg snippet. Recently I posted one on Bugzilla [2], since I did not have an AQC device so far. #regzbot introduced: 8ce8427 #regzbot from: koo5 <[email protected]> #regzbot monitor: Aquantia/AQtion#32 Fixes: 8aaa112 ("net: atlantic: refactoring pm logic") Link: Aquantia/AQtion#32 [1] Link: https://bugzilla.kernel.org/show_bug.cgi?id=215798 [2] Cc: [email protected] Reported-by: koo5 <[email protected]> Signed-off-by: Manuel Ullmann <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Zheng Zengkai <[email protected]> Acked-by: Xie XiuQi <[email protected]>

This will reset deeply on freeze and thaw instead of suspend and resume and prevent null pointer dereferences of the uninitialized ring 0 buffer while thawing. The impact is an indefinitely hanging kernel. You can't switch consoles after this and the only possible user interaction is SysRq. BUG: kernel NULL pointer dereference RIP: 0010:aq_ring_rx_fill+0xcf/0x210 [atlantic] aq_vec_init+0x85/0xe0 [atlantic] aq_nic_init+0xf7/0x1d0 [atlantic] atl_resume_common+0x4f/0x100 [atlantic] pci_pm_thaw+0x42/0xa0 resolves in aq_ring.o to ``` 0000000000000ae0 <aq_ring_rx_fill>: { /* ... */ baf: 48 8b 43 08 mov 0x8(%rbx),%rax buff->flags = 0U; /* buff is NULL */ ``` The bug has been present since the introduction of the new pm code in 8aaa112 ("net: atlantic: refactoring pm logic") and was hidden until 8ce8427 ("net: atlantic: changes for multi-TC support"), which refactored the aq_vec_{free,alloc} functions into aq_vec_{,ring}_{free,alloc}, but is technically not wrong. The original functions just always reinitialized the buffers on S3/S4. If the interface is down before freezing, the bug does not occur. It does not matter, whether the initrd contains and loads the module before thawing. So the fix is to invert the boolean parameter deep in all pm function calls, which was clearly intended to be set like that. First report was on Github [1], which you have to guess from the resume logs in the posted dmesg snippet. Recently I posted one on Bugzilla [2], since I did not have an AQC device so far. #regzbot introduced: 8ce8427 #regzbot from: koo5 <[email protected]> #regzbot monitor: Aquantia/AQtion#32 Fixes: 8aaa112 ("net: atlantic: refactoring pm logic") Link: Aquantia/AQtion#32 [1] Link: https://bugzilla.kernel.org/show_bug.cgi?id=215798 [2] Cc: [email protected] Reported-by: koo5 <[email protected]> Signed-off-by: Manuel Ullmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>

koo5 · 2023-06-25T00:12:57Z

yup, not getting any errors in 5.15.0-75

jira LE-1907 Rebuild_History Non-Buildable kernel-4.18.0-372.32.1.el8_6 commit-author Manuel Ullmann <[email protected]> commit cbe6c3a This will reset deeply on freeze and thaw instead of suspend and resume and prevent null pointer dereferences of the uninitialized ring 0 buffer while thawing. The impact is an indefinitely hanging kernel. You can't switch consoles after this and the only possible user interaction is SysRq. BUG: kernel NULL pointer dereference RIP: 0010:aq_ring_rx_fill+0xcf/0x210 [atlantic] aq_vec_init+0x85/0xe0 [atlantic] aq_nic_init+0xf7/0x1d0 [atlantic] atl_resume_common+0x4f/0x100 [atlantic] pci_pm_thaw+0x42/0xa0 resolves in aq_ring.o to ``` 0000000000000ae0 <aq_ring_rx_fill>: { /* ... */ baf: 48 8b 43 08 mov 0x8(%rbx),%rax buff->flags = 0U; /* buff is NULL */ ``` The bug has been present since the introduction of the new pm code in 8aaa112 ("net: atlantic: refactoring pm logic") and was hidden until 8ce8427 ("net: atlantic: changes for multi-TC support"), which refactored the aq_vec_{free,alloc} functions into aq_vec_{,ring}_{free,alloc}, but is technically not wrong. The original functions just always reinitialized the buffers on S3/S4. If the interface is down before freezing, the bug does not occur. It does not matter, whether the initrd contains and loads the module before thawing. So the fix is to invert the boolean parameter deep in all pm function calls, which was clearly intended to be set like that. First report was on Github [1], which you have to guess from the resume logs in the posted dmesg snippet. Recently I posted one on Bugzilla [2], since I did not have an AQC device so far. #regzbot introduced: 8ce8427 #regzbot from: koo5 <[email protected]> #regzbot monitor: Aquantia/AQtion#32 Fixes: 8aaa112 ("net: atlantic: refactoring pm logic") Link: Aquantia/AQtion#32 [1] Link: https://bugzilla.kernel.org/show_bug.cgi?id=215798 [2] Cc: [email protected] Reported-by: koo5 <[email protected]> Signed-off-by: Manuel Ullmann <[email protected]> Signed-off-by: David S. Miller <[email protected]> (cherry picked from commit cbe6c3a) Signed-off-by: Jonathan Maple <[email protected]>

jira LE-1907 Rebuild_History Non-Buildable kernel-4.18.0-448.el8 commit-author Manuel Ullmann <[email protected]> commit cbe6c3a This will reset deeply on freeze and thaw instead of suspend and resume and prevent null pointer dereferences of the uninitialized ring 0 buffer while thawing. The impact is an indefinitely hanging kernel. You can't switch consoles after this and the only possible user interaction is SysRq. BUG: kernel NULL pointer dereference RIP: 0010:aq_ring_rx_fill+0xcf/0x210 [atlantic] aq_vec_init+0x85/0xe0 [atlantic] aq_nic_init+0xf7/0x1d0 [atlantic] atl_resume_common+0x4f/0x100 [atlantic] pci_pm_thaw+0x42/0xa0 resolves in aq_ring.o to ``` 0000000000000ae0 <aq_ring_rx_fill>: { /* ... */ baf: 48 8b 43 08 mov 0x8(%rbx),%rax buff->flags = 0U; /* buff is NULL */ ``` The bug has been present since the introduction of the new pm code in 8aaa112 ("net: atlantic: refactoring pm logic") and was hidden until 8ce8427 ("net: atlantic: changes for multi-TC support"), which refactored the aq_vec_{free,alloc} functions into aq_vec_{,ring}_{free,alloc}, but is technically not wrong. The original functions just always reinitialized the buffers on S3/S4. If the interface is down before freezing, the bug does not occur. It does not matter, whether the initrd contains and loads the module before thawing. So the fix is to invert the boolean parameter deep in all pm function calls, which was clearly intended to be set like that. First report was on Github [1], which you have to guess from the resume logs in the posted dmesg snippet. Recently I posted one on Bugzilla [2], since I did not have an AQC device so far. #regzbot introduced: 8ce8427 #regzbot from: koo5 <[email protected]> #regzbot monitor: Aquantia/AQtion#32 Fixes: 8aaa112 ("net: atlantic: refactoring pm logic") Link: Aquantia/AQtion#32 [1] Link: https://bugzilla.kernel.org/show_bug.cgi?id=215798 [2] Cc: [email protected] Reported-by: koo5 <[email protected]> Signed-off-by: Manuel Ullmann <[email protected]> Signed-off-by: David S. Miller <[email protected]> (cherry picked from commit cbe6c3a) Signed-off-by: Jonathan Maple <[email protected]>

cail mentioned this issue Feb 3, 2022

atlantic driver fails to load on Kernel 5.16.4 for XG-C100F SFP+ #35

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: kernel NULL pointer dereference, address: 0000000000000008 #32

BUG: kernel NULL pointer dereference, address: 0000000000000008 #32

koo5 commented Nov 16, 2021

koo5 commented Nov 16, 2021

cail commented Nov 22, 2021

koo5 commented Nov 22, 2021

cail commented Nov 22, 2021

koo5 commented Nov 23, 2021

eitch commented Feb 1, 2022

cail commented Feb 3, 2022

labre commented Apr 7, 2022

labre commented Apr 18, 2022

labre commented Apr 18, 2022

koo5 commented Apr 28, 2022

labre commented Apr 29, 2022 •

edited

Loading

amarshall commented May 6, 2022

labre commented May 6, 2022

koo5 commented Jun 25, 2023

BUG: kernel NULL pointer dereference, address: 0000000000000008 #32

BUG: kernel NULL pointer dereference, address: 0000000000000008 #32

Comments

koo5 commented Nov 16, 2021

koo5 commented Nov 16, 2021

cail commented Nov 22, 2021

koo5 commented Nov 22, 2021

cail commented Nov 22, 2021

koo5 commented Nov 23, 2021

eitch commented Feb 1, 2022

cail commented Feb 3, 2022

labre commented Apr 7, 2022

labre commented Apr 18, 2022

labre commented Apr 18, 2022

koo5 commented Apr 28, 2022

labre commented Apr 29, 2022 • edited Loading

amarshall commented May 6, 2022

labre commented May 6, 2022

koo5 commented Jun 25, 2023

labre commented Apr 29, 2022 •

edited

Loading