[Bug]: Panic with XDP in driver mode on c7gn #328

Closed
3 tasks done
efagerho opened this issue Nov 5, 2024 · 4 comments
Labels
bug (Report errors or unexpected behavior), Linux ENA driver, triage (Determine the priority and severity)

Comments

@efagerho

efagerho commented Nov 5, 2024

Preliminary Actions

Driver Type

Linux kernel driver for Elastic Network Adapter (ENA)

Driver Tag/Commit

6.1.112-122.189.amzn2023.aarch64

Custom Code

No

OS Platform and Distribution

6.1.112-122.189.amzn2023.aarch64

Bug description

The kernel panics when an XDP program attached in driver mode returns XDP_TX for a packet. This only happens when the NIC is bombarded with traffic.

Reproduction steps

1. Bombard the machine with traffic.
2. Run a test XDP program in driver mode that returns XDP_TX for a packet.
3. The kernel panics immediately.

I created the following test bed to reproduce the problem: https://github.com/efagerho/udp-router

Expected Behavior

Does not panic

Actual Behavior

Panics

Additional Data

No response

Relevant log output

[  345.008097] list_add double add: new=ffff0003c55a4c90, prev=ffff0003c55a4c90, next=ffff00075c379e40.
[  345.008896] ------------[ cut here ]------------
[  345.009292] kernel BUG at lib/list_debug.c:33!
[  345.009687] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[  345.010204] Modules linked in: nls_ascii nls_cp437 sunrpc vfat fat aes_ce_blk aes_ce_cipher ghash_ce sm4_ce_cipher sm4 sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 sha1_ce ena button sch_fq_codel dm_mod fuse loop configfs dax dmi_sysfs sha2_ce sha256_arm64 efivarfs
[  345.012098] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Not tainted 6.1.112-122.189.amzn2023.aarch64 #1
[  345.012878] Hardware name: Amazon EC2 c7gn.2xlarge/, BIOS 1.0 11/1/2018
[  345.013416] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  345.013995] pc : __list_add_valid+0x70/0xc0
[  345.014359] lr : __list_add_valid+0x70/0xc0
[  345.014722] sp : ffff80000ac63ea0
[  345.015016] x29: ffff80000ac63ea0 x28: ffff0003c0850000 x27: 0000000000000000
[  345.015610] x26: ffff800009578008 x25: ffff0003c78c1000 x24: ffff800009578008
[  345.016201] x23: ffff80000989ec58 x22: 0000000000000039 x21: ffff0003c55a4c90
[  345.016797] x20: ffff0003c55a4c90 x19: ffff00075c379e40 x18: ffff80008ac63b17
[  345.017390] x17: 3039633461353563 x16: 3330303066666666 x15: 3d76657270202c30
[  345.017976] x14: 3963346135356333 x13: 2e30346539373363 x12: 3537303030666666
[  345.018568] x11: 663d7478656e202c x10: ffff800009e59450 x9 : ffff8000086caedc
[  345.019166] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : 0000000000000000
[  345.019763] x5 : 0000000000000008 x4 : 0000000000000040 x3 : 0000000000000000
[  345.020361] x2 : 0000000000000000 x1 : ffff0003c0850000 x0 : 0000000000000058
[  345.020958] Call trace:
[  345.021180]  __list_add_valid+0x70/0xc0
[  345.021513]  __napi_schedule_irqoff+0x88/0xdc
[  345.021891]  ena_intr_msix_io+0x60/0x74 [ena]
[  345.022290]  __handle_irq_event_percpu+0x5c/0x1e0
[  345.022688]  handle_irq_event+0x50/0x138
[  345.023023]  handle_fasteoi_irq+0xc4/0x258
[  345.023370]  generic_handle_domain_irq+0x34/0x4c
[  345.023761]  __gic_handle_irq_from_irqson.isra.0+0x114/0x1ec
[  345.024248]  gic_handle_irq+0x2c/0xa0
[  345.024571]  call_on_irq_stack+0x24/0x30
[  345.024904]  do_interrupt_handler+0x88/0x8c
[  345.025260]  el1_interrupt+0x48/0xac
[  345.025570]  el1h_64_irq_handler+0x18/0x24
[  345.025925]  el1h_64_irq+0x78/0x7c
[  345.026224]  arch_cpu_idle+0x18/0x58
[  345.026540]  default_idle_call+0x50/0x114
[  345.026888]  cpuidle_idle_call+0x160/0x184
[  345.027242]  do_idle+0xb8/0x13c
[  345.027526]  cpu_startup_entry+0x3c/0x44
[  345.027874]  secondary_start_kernel+0xf0/0x158
[  345.028270] Code: aa0103e3 aa0003e1 91158080 9414292b (d4210000) 
[  345.028785] SMP: stopping secondary CPUs
[  345.029916] Starting crashdump kernel...
[  345.030256] Bye!
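
For readers unfamiliar with the first log line: when list debugging is enabled, the kernel validates every list insertion, and the "double add" message is printed when the node being inserted is already one of its would-be neighbours. In the log, `new` and `prev` are the same address, which suggests the NAPI instance's list node was already at the tail of the poll list when the interrupt handler tried to schedule it again. Below is a minimal user-space sketch of that kind of check. It is a simplified model and not the kernel source; `list_add_is_valid` and `list_add_tail_checked` are hypothetical names, and the local `struct list_head` is a stand-in for the kernel type.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's doubly linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty circular list points at itself in both directions. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Model of the debug check: returns false when inserting new_node
 * between prev and next would corrupt the list. */
static bool list_add_is_valid(const struct list_head *new_node,
                              const struct list_head *prev,
                              const struct list_head *next)
{
    if (prev == NULL || next == NULL)
        return false;
    if (next->prev != prev || prev->next != next)
        return false;   /* the two neighbours are not adjacent */
    if (new_node == prev || new_node == next)
        return false;   /* the "list_add double add" case */
    return true;
}

/* Append at the tail, refusing the insert when the check fails. */
static bool list_add_tail_checked(struct list_head *new_node,
                                  struct list_head *head)
{
    struct list_head *prev = head->prev;

    if (!list_add_is_valid(new_node, prev, head))
        return false;
    new_node->next = head;
    new_node->prev = prev;
    prev->next = new_node;
    head->prev = new_node;
    return true;
}
```

In this model, appending the same node twice in a row fails on the second call because the node is already the tail, so `new_node == prev`, which mirrors the `new=... prev=...` pair in the log above.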

Contact Details

[email protected]

@efagerho efagerho added bug Report errors or unexpected behavior triage Determine the priority and severity labels Nov 5, 2024
@davidarinzon
Contributor

Hi @efagerho

Thanks for raising this.
We recently identified this issue and plan to publish a release with the fix soon.
In the meantime, you can use the attached patch to resolve the issue.
Please let us know if it works for you.
0001-Bug-Fix-Don-t-complete-napi-in-XDP-if-budget-is-cons.patch
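
For context, the patch file name points at the NAPI polling contract: a poll callback that uses up its whole budget is expected to return the budget without completing NAPI, because the networking core then re-queues the instance itself. If a driver completes NAPI and still reports the full budget, the core can re-queue the instance while a later interrupt schedules it again, which would be consistent with the double add in the log. The following is a user-space sketch of that rule under those assumptions. It is not the ENA driver code and not the contents of the patch; `napi_model`, `poll_buggy`, `poll_fixed`, and `run_round` are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the NAPI state that matters here. */
struct napi_model {
    bool scheduled;   /* models the "already scheduled" state bit */
    int  queue_count; /* how many times the instance is on the poll list */
};

/* Models an interrupt handler scheduling NAPI: queue the instance
 * only if it is not already marked as scheduled. */
static void irq_schedule(struct napi_model *n)
{
    if (!n->scheduled) {
        n->scheduled = true;
        n->queue_count++;
    }
}

/* Models NAPI completion: clear the bit so an IRQ may schedule again. */
static void complete(struct napi_model *n)
{
    n->scheduled = false;
}

/* Buggy poll: completes even when the whole budget was consumed. */
static int poll_buggy(struct napi_model *n, int pending, int budget)
{
    int done = pending < budget ? pending : budget;

    complete(n);
    return done;
}

/* Fixed poll: completes only when the budget was not consumed. */
static int poll_fixed(struct napi_model *n, int pending, int budget)
{
    int done = pending < budget ? pending : budget;

    if (done < budget)
        complete(n);
    return done;
}

/* Models one round of the core's poll loop followed by an interrupt.
 * Returns the resulting queue count; a value above 1 is a double add. */
static int run_round(struct napi_model *n,
                     int (*poll)(struct napi_model *, int, int),
                     int pending, int budget)
{
    int done;

    n->queue_count--;       /* core takes the instance off the list */
    done = poll(n, pending, budget);
    if (done == budget)
        n->queue_count++;   /* core re-queues it for another poll */
    irq_schedule(n);        /* an interrupt arrives afterwards */
    return n->queue_count;
}
```

With 100 pending packets and a budget of 64, the buggy variant ends the round queued twice, while the fixed variant stays queued once because the instance is still marked as scheduled when the interrupt arrives.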

@derlaft

derlaft commented Nov 6, 2024

Does this look like the same issue? We also experienced a kernel panic on our EC2 instances under heavy XDP load.

[ 6926.134643] list_add double add: new=ffff00047e724c90, prev=ffff00047e724c90, next=ffff0012e751d000.
[ 6926.135367] kernel BUG at lib/list_debug.c:33!
[ 6926.135706] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 6926.136158] Modules linked in: cls_bpf sch_ingress ipvlan 8021q garp stp mrp llc veth xt_state xt_connmark nf_conntrack_netlink xt_nat xt_statistic xt_MASQUERADE xt_mark xt_addrtype ipt_REJECT nf_reject_ipv4 nf_tables nfnetlink ip6table_filter ip6table_nat iptable_nat nf_nat ip6table_mangle xt_conntrack xt_comment iptable_mangle iptable_filter squashfs loop overlay aes_ce_blk aes_ce_cipher ghash_ce sha2_ce ena sha256_arm64 sha1_ce button sch_fq_codel nf_conntrack drm nf_defrag_ipv6 nf_defrag_ipv4 i2c_core fuse backlight configfs bpf_preload efivarfs dmi_sysfs
[ 6926.139687] CPU: 8 PID: 55 Comm: ksoftirqd/8 Not tainted 6.1.112 #1
[ 6926.140156] Hardware name: Amazon EC2 m6g.4xlarge/, BIOS 1.0 11/1/2018
[ 6926.140636] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 6926.141145] pc : __list_add_valid+0x70/0xc0
[ 6926.141471] lr : __list_add_valid+0x70/0xc0
[ 6926.141790] sp : ffff80000b3abea0
[ 6926.142048] x29: ffff80000b3abea0 x28: ffff0003c0f1a140 x27: ffff0012e751c838
[ 6926.142577] x26: ffff800009c04008 x25: ffff0003e64af800 x24: ffff800009c04008
[ 6926.143106] x23: ffff800009f7ec58 x22: 000000000000004e x21: ffff00047e724c90
[ 6926.143635] x20: ffff00047e724c90 x19: ffff0012e751d000 x18: ffffffffffffffff
[ 6926.144159] x17: 3039633432376537 x16: 3430303066666666 x15: 3d76657270202c30
[ 6926.144687] x14: 3963343237653734 x13: 656c6f736e6f6320 x12: 7265766f20676e69
[ 6926.145215] x11: 663d7478656e202c x10: ffff80000a5397d0 x9 : ffff80000812d9b0
[ 6926.145745] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
[ 6926.146269] x5 : ffff0012e750b450 x4 : ffff80000b3abd00 x3 : 0000000000000000
[ 6926.146795] x2 : 0000000000000000 x1 : ffff0003c0f1a140 x0 : 0000000000000058
[ 6926.147315] Call trace:
[ 6926.147510]  __list_add_valid+0x70/0xc0
[ 6926.147804]  __napi_schedule_irqoff+0x88/0xe0
[ 6926.148139]  ena_intr_msix_io+0x60/0x80 [ena]
[ 6926.148480]  __handle_irq_event_percpu+0x60/0x1e0
[ 6926.148847]  handle_irq_event+0x50/0x130
[ 6926.149146]  handle_fasteoi_irq+0xc4/0x260
[ 6926.149459]  generic_handle_domain_irq+0x34/0x50
[ 6926.149811]  __gic_handle_irq_from_irqson.isra.0+0x114/0x1f0
[ 6926.150234]  gic_handle_irq+0x2c/0x98
[ 6926.150517]  call_on_irq_stack+0x24/0x30
[ 6926.150815]  do_interrupt_handler+0x88/0x90
[ 6926.151128]  el1_interrupt+0x48/0xb0
[ 6926.151402]  el1h_64_irq_handler+0x18/0x30
[ 6926.151712]  el1h_64_irq+0x78/0x7c
[ 6926.151978]  rcu_cblist_dequeue+0x24/0x40
[ 6926.152281]  rcu_core+0x17c/0x1e0
[ 6926.152539]  rcu_core_si+0x18/0x30
[ 6926.152801]  handle_softirqs+0x120/0x310
[ 6926.153098]  run_ksoftirqd+0x6c/0xa0
[ 6926.153381]  smpboot_thread_fn+0x14c/0x190
[ 6926.153691]  kthread+0xd0/0xe0
[ 6926.153935] Code: aa0003e1 f0006cc0 91118000 9417e2a4 (d4210000) 
[ 6926.154381] ---[ end trace 0000000000000000 ]---
[ 6926.237971] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
[ 6926.238938] SMP: stopping secondary CPUs
[ 6926.239253] Kernel Offset: disabled
[ 6926.239565] CPU features: 0x080000,12070084,66007a0b
[ 6926.239940] Memory Limit: none
[ 6926.337343] Rebooting in 10 seconds..

@efagerho
Author

efagerho commented Nov 6, 2024

Looks like exactly the same issue.

@davidarinzon
Contributor

Hi @efagerho

The issue has been resolved, and the suggested patch has been merged as part of https://github.com/amzn/amzn-drivers/releases/tag/ena_linux_2.13.1

Please let us know if it works for you


3 participants