-
Notifications
You must be signed in to change notification settings - Fork 6.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug fixes in arc smp initializaiton and tests #22362
bug fixes in arc smp initializaiton and tests #22362
Conversation
All checks are passing now. checkpatch (informational only, not a failure)
Tip: The bot edits this comment instead of posting a new one, so you can check the comment's history to see earlier messages. |
bb08ce2
to
3790f21
Compare
tests/kernel/queue/prj.conf
Outdated
@@ -1,3 +1,4 @@ | |||
CONFIG_ZTEST=y | |||
CONFIG_IRQ_OFFLOAD=y | |||
CONFIG_TEST_USERSPACE=y | |||
CONFIG_MP_NUM_CPUS=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to explain, in detail, why you are disabling multi-core on this test. "kernel/queue requires single core" is not sufficient. especially since this passes on other arches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not needded after adding z_impl_z_test_1cpu_start. I will reverse it
3790f21
to
0e55d01
Compare
arch/arc/core/arc_smp.c
Outdated
|
||
old_thread = _current; | ||
|
||
/* Null out the switch handle, see wait_for_switch() above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this comment is copy & pasted from kswap.h or so, there is no "wait_for_switch() above".
Do we need this all this "switch_handle-ing" in the architecture specific code? No way to use the generic code, or call generic (inline) functions for this? (I'm a bit concerned with maintainability).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's copy from kswap.h. For ARC, thread switch in interrupt context cannot reuse the arch_switch, so we have to add handling for interrupt case. For ARM, pending_sv exception is the only entry of thread switch. This make things easier
@@ -49,10 +49,12 @@ | |||
|
|||
#define TICKLESS (IS_ENABLED(CONFIG_TICKLESS_KERNEL)) | |||
|
|||
#define SMP_TIMER_DRIVER (CONFIG_SMP && CONFIG_MP_NUM_CPUS > 1) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See earlier, insn't CONFIG_SMP enough? SMP with 1 CPU does not really make sense.
2319609
to
3d00514
Compare
The CI failure is not caused by changes in this PR, it's an old issue caused by openocd driver. |
604f5f2
to
0664242
Compare
@ruuddw @abrodkin to fix smp related issues, i just overhaul the thread switch logic in the epilogue of irq and exception handle. codes are simpler, easier to maintain and understand. My sanitycheck results based on nsim are almost ok. Could you review the new changes? @abrodkin could you run some smp tests on HSDK with PR? I don't have hsdk at hand. What's more, @andyross , the new changes can fix the issue #20671 ? |
0664242
to
7ab68a8
Compare
@vonhust so I took a liberty and tried the SMP test ( Do you think there's some other test I should try to run (BTW I also tried my And for the record that's how I executed tests on the board (unfortunately OpenOCD requires quite some work to be usable with ARC SMP targets, see zephyrproject-rtos/openocd#23):
Execution 1:
Execution 2:
|
* be cleared. | ||
*/ | ||
lr r0, [_ARC_V2_AUX_IRQ_ACT] | ||
bclr r0, r0, 31 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not fully sure here. The original code popped the IRQ_ACT and SEC_STAT/IRM below from the stack - and I believe the new code still pushes them to the stack on isr entry. Could you please double check if pushes and pops are balanced? A better desciption of the full stack frames for switching would help as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind, the *_switch_from_irq had the pop instructions, not the original rirq_return_from_coop ( I think).
@vonhust The restructuring makes things cleaner (use more common Zephyr code) and avoid some code duplication (new assembler macro's factoring our irq/rirq common code). |
e2c81c1
to
c2b8b9e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
#ifdef CONFIG_STACK_SENTINEL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we no longer check the stack sentinel in firqs?
@abrodkin the way I find best to smoke this stuff out is through There were some SMP-related races in the core kernel that Andy fixed recently. This seem stable nowadays...but we only have qemu_x86_64 to test with on our side. The frequency of the failures you are seeing (10%) suggests to me this may be an ARC issue.
This suggests to me that this may be a problem upon thread exit. I imagine a child thread in the test case crashed, but after it had done its job. |
@abrodkin Oh another thing ... for ARC I think a custom This might need to be addressed first before you can catch this with On x86_64 we issue a special output call which causes QEMU to instantly exit. Maybe some similar capability exists in nSIM? |
Well I do use sanitychecker script these days quite a lot indeed. But I wanted to focus more on a particular topic - SMP. Because I think we have some other tests failing for HSDK still. Moreover we're almost done with bring-up of automated sanity checker on real ARC boards including HSDK (though in UP mode only as OpenOCD for ARC turned to not support SMP, see zephyrproject-rtos/openocd#23 which we're about to start working on as well).
Very well might be the case and needs a deeper look obviously.
But that's not good still, right?
Well not really. Instead we need to add support of ARConnect Inter-Core Debug Unit and configure it so that whenever any core is halted all the rest get halted too. That BTW helps for interactive debugging as well so needs to be done anyways. See #23212 filed just now for tracking.
May I have a pointer to that one? In light of our work on QEMU port fot ARC that might be interesting to check if that could be re-used easily for us. P.S. I'm on FTO officially and so might not reply immediately being far from the laptop :) |
Problem is, most of the SMP bugs we have seen lately are not caught by the SMP tests. It's hard to design tests for some of these race conditions because they only break when specific events line up at the right time on multiple CPUs. Best we can do is run all tests and see what shakes out. This is something that keeps me up at night.
Nope. Definitely need to look into this. The test only passes because of arch_system_halt() not doing the right thing.
arch_system_halt() is what gets invoked any time a fatal error occurs. This is what you want to implement trust me. The default implementation is __weak and doesn't work right with SMP. You implement your own in the arch code. If what it does is talk to the inter-core debug unit, great.
grep for arch_system_halt() in arch/x86 |
for smp target, there is a case where just one core is running, then: * during init, the master core will run, others cores will halt/sleep * use timer driver for single core Signed-off-by: Wayne Ren <[email protected]>
hsdk and nsim_hs_smp are multicore targetes, enable smp by default Signed-off-by: Wayne Ren <[email protected]>
the arc issue is fixed, remove the filter Signed-off-by: Wayne Ren <[email protected]>
after applying commit 3235451, arc's arch_switch also need's corresponding fix. Signed-off-by: Wayne Ren <[email protected]>
…eption overhaul the thread switch code in epilogue of irq and exception handling: * add z_arch_get_next_switch_handle to call z_get_next_switch_handle, let the scheduler to decide the switch thread. This will also cover the case of SMP. * put lots of common codes in macros for thread switch to improve the maintainablity, readability. * clean up of some lables to make codes easier to understand Signed-off-by: Wayne Ren <[email protected]>
* update comments to match latest codes * add extra comments for some assembly, macros * use macro to replace duplcated codes * remove unused codes, lables, symobols Signed-off-by: Wayne Ren <[email protected]>
fix the trace of isr enter and exit Signed-off-by: Wayne Ren <[email protected]>
the old codes just work for single core, we need to consider the case of SMP. In SMP, it's not easy to get current thread of current cpu in assembly, so we'd better do it in C. Signed-off-by: Wayne Ren <[email protected]>
To make a fake exception, we should set not clear AE bit Signed-off-by: Wayne Ren <[email protected]>
@ruuddw @abrodkin I already added more comments to the assembly and macros, and did further clean up for better maintenance. If there are still un-clarified places, let met know, i will address them. |
current openocd tracing does not support SMP Signed-off-by: Wayne Ren <[email protected]>
This PR is designed to fix issue #21469. It fixes bugs related to SMP:
Because sanitycheck test does not support mdb which is required by ARC SMP, I mannually run some critical tests and samples, the resultes are ok.
Fixes #21469