Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug fixes in arc smp initializaiton and tests #22362

Conversation

vonhust
Copy link

@vonhust vonhust commented Jan 31, 2020

This PR is designed to fix issue #21469. It fixes bugs related to SMP:

  • initialization
  • timer driver
  • test cases
  • board configuration

Because sanitycheck test does not support mdb which is required by ARC SMP, I mannually run some critical tests and samples, the resultes are ok.

Fixes #21469

@zephyrbot zephyrbot added area: Timer Timer area: Boards area: ARC ARC Architecture area: Tests Issues related to a particular existing or missing test area: Kernel labels Jan 31, 2020
@zephyrbot
Copy link
Collaborator

zephyrbot commented Jan 31, 2020

All checks are passing now.

checkpatch (informational only, not a failure)

-:1311: WARNING:LINE_SPACING: Missing a blank line after declarations
#1311: FILE: arch/arc/include/swap_macros.h:415:
+	and r3, r3, 0x80000000
+	push_s r3

-:1359: WARNING:LINE_SPACING: Missing a blank line after declarations
#1359: FILE: arch/arc/include/swap_macros.h:463:
+	bset r0, r0, _ARC_V2_SEC_STAT_IRM_BIT
+	sflag r0

-:1386: WARNING:LINE_SPACING: Missing a blank line after declarations
#1386: FILE: arch/arc/include/swap_macros.h:490:
+	mov r0, sp
+	bl z_arch_get_next_switch_handle

- total: 0 errors, 3 warnings, 1395 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

Your patch has style problems, please review.

NOTE: Ignored message types: AVOID_EXTERNS BRACES CONFIG_EXPERIMENTAL CONST_STRUCT DATE_TIME FILE_PATH_CHANGES MINMAX NETWORKING_BLOCK_COMMENT_STYLE PRINTK_WITHOUT_KERN_LEVEL SPLIT_STRING VOLATILE

NOTE: If any of the errors are false positives, please report
      them to the maintainers.

Tip: The bot edits this comment instead of posting a new one, so you can check the comment's history to see earlier messages.

@@ -1,3 +1,4 @@
CONFIG_ZTEST=y
CONFIG_IRQ_OFFLOAD=y
CONFIG_TEST_USERSPACE=y
CONFIG_MP_NUM_CPUS=1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to explain, in detail, why you are disabling multi-core on this test. "kernel/queue requires single core" is not sufficient. especially since this passes on other arches.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not needded after adding z_impl_z_test_1cpu_start. I will reverse it


old_thread = _current;

/* Null out the switch handle, see wait_for_switch() above.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this comment is copy & pasted from kswap.h or so, there is no "wait_for_switch() above".
Do we need this all this "switch_handle-ing" in the architecture specific code? No way to use the generic code, or call generic (inline) functions for this? (I'm a bit concerned with maintainability).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's copy from kswap.h. For ARC, thread switch in interrupt context cannot reuse the arch_switch, so we have to add handling for interrupt case. For ARM, pending_sv exception is the only entry of thread switch. This make things easier

@@ -49,10 +49,12 @@

#define TICKLESS (IS_ENABLED(CONFIG_TICKLESS_KERNEL))

#define SMP_TIMER_DRIVER (CONFIG_SMP && CONFIG_MP_NUM_CPUS > 1)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See earlier, insn't CONFIG_SMP enough? SMP with 1 CPU does not really make sense.

@vonhust vonhust force-pushed the topic-test-smp branch 2 times, most recently from 2319609 to 3d00514 Compare February 4, 2020 15:36
@vonhust vonhust added this to the v2.2.0 milestone Feb 5, 2020
arch/arc/core/reset.S Show resolved Hide resolved
@vonhust
Copy link
Author

vonhust commented Feb 9, 2020

The CI failure is not caused by changes in this PR, it's an old issue caused by openocd driver.

@vonhust vonhust force-pushed the topic-test-smp branch 2 times, most recently from 604f5f2 to 0664242 Compare February 15, 2020 09:12
@vonhust
Copy link
Author

vonhust commented Feb 15, 2020

@ruuddw @abrodkin to fix smp related issues, i just overhaul the thread switch logic in the epilogue of irq and exception handle. codes are simpler, easier to maintain and understand. My sanitycheck results based on nsim are almost ok. Could you review the new changes? @abrodkin could you run some smp tests on HSDK with PR? I don't have hsdk at hand.

What's more, @andyross , the new changes can fix the issue #20671 ?

@abrodkin
Copy link
Collaborator

@vonhust so I took a liberty and tried the SMP test (cmake -DBOARD=hsdk ../tests/kernel/smp/) and even though mostly it works from time to time (like every 10th execution) ends-up having machine check exception, see 2 examples below. What's funny in the end it's said "Test suite smp succeeded" :)

Do you think there's some other test I should try to run (BTW I also tried my samples/smp/pi but there were no problems so far)?

And for the record that's how I executed tests on the board (unfortunately OpenOCD requires quite some work to be usable with ARC SMP targets, see zephyrproject-rtos/openocd#23):

mdb -pset=1 -psetname=core0 -digilent -nogoifmain -prop=dig_device=HSDK zephyr/zephyr.elf
mdb -pset=2 -psetname=core1 -digilent -prop=download=2 -nogoifmain -prop=dig_device=HSDK zephyr/zephyr.elf
mdb -pset=3 -psetname=core2 -digilent -prop=download=2 -nogoifmain -prop=dig_device=HSDK zephyr/zephyr.elf
mdb -pset=4 -psetname=core3 -digilent -prop=download=2 -nogoifmain -prop=dig_device=HSDK zephyr/zephyr.elf
mdb -multifiles=core0,core1,core2,core3 -run -cl

Execution 1:

*** Booting Zephyr OS build v2.2.0-rc1-139-g7ab68a87a83d  ***
Running test suite smp
===================================================================
starting test - test_smp_coop_threads
PASS - test_smp_coop_threads
===================================================================
starting test - test_cpu_id_threads
PASS - test_cpu_id_threads
===================================================================
starting test - test_coop_resched_threads
PASS - test_coop_resched_threads
===================================================================
starting test - test_preempt_resched_threads
PASS - test_preempt_resched_threads
===================================================================
starting test - test_yield_threads
PASS - test_yield_threads
===================================================================
starting test - test_sleep_threads
PASS - test_sleep_threads
===================================================================
starting test - test_wakeup_threads
E: ***** Exception vector: 0x3, cause code: 0x0, parameter 0x0
E: Address 0x80084196
E: Faulting instruction address = 0x80084196
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1
E: Current thread: 0x90006b38 (unknown)
E: Halting system
PASS - test_wakeup_threads
===================================================================
Test suite smp succeeded
===================================================================
PROJECT EXECUTION SUCCESSFUL

Execution 2:

*** Booting Zephyr OS build v2.2.0-rc1-139-g7ab68a87a83d  ***
Running test suite smp
===================================================================
starting test - test_smp_coop_threads
PASS - test_smp_coop_threads
===================================================================
starting test - test_cpu_id_threads
PASS - test_cpu_id_threads
===================================================================
starting test - test_coop_resched_threads
PASS - test_coop_resched_threads
===================================================================
starting test - test_preempt_resched_threads
PASS - test_preempt_resched_threads
===================================================================
starting test - test_yield_threads
PASS - test_yield_threads
===================================================================
starting test - test_sleep_threads
PASS - test_sleep_threads
===================================================================
starting test - test_wakeup_threads
E: ***** Exception vector: 0x3, cause code: 0x0, parameter 0x0
E: Address 0xfffffffc
E: Faulting instruction address = 0xfffffffc
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 2
E: Current thread: 0x90006b38 (unknown)
E: Halting system
PASS - test_wakeup_threads
===================================================================
Test suite smp succeeded
===================================================================
PROJECT EXECUTION SUCCESSFUL

@andrewboie andrewboie dismissed their stale review February 17, 2020 21:26

addressed

* be cleared.
*/
lr r0, [_ARC_V2_AUX_IRQ_ACT]
bclr r0, r0, 31
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fully sure here. The original code popped the IRQ_ACT and SEC_STAT/IRM below from the stack - and I believe the new code still pushes them to the stack on isr entry. Could you please double check if pushes and pops are balanced? A better desciption of the full stack frames for switching would help as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, the *_switch_from_irq had the pop instructions, not the original rirq_return_from_coop ( I think).

@ruuddw
Copy link
Member

ruuddw commented Feb 19, 2020

@vonhust The restructuring makes things cleaner (use more common Zephyr code) and avoid some code duplication (new assembler macro's factoring our irq/rirq common code).
Would be good to better document the assumptions and use of registers by the macros to improve maintainability, or even add them more explicit as parameters (arguments, scratch register ok to use).

@vonhust vonhust force-pushed the topic-test-smp branch 2 times, most recently from e2c81c1 to c2b8b9e Compare March 2, 2020 05:59
Copy link
Contributor

@andrewboie andrewboie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM, but I would defer to @abrodkin and @ruuddw opinions on the asm code.


#ifdef CONFIG_STACK_SENTINEL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we no longer check the stack sentinel in firqs?

@andrewboie andrewboie closed this Mar 3, 2020
@andrewboie andrewboie reopened this Mar 3, 2020
@andrewboie
Copy link
Contributor

andrewboie commented Mar 3, 2020

@vonhust so I took a liberty and tried the SMP test (cmake -DBOARD=hsdk ../tests/kernel/smp/) and even though mostly it works from time to time (like every 10th execution) ends-up having machine check exception, see 2 examples below.

@abrodkin the way I find best to smoke this stuff out is through sanitycheck and just run all tests. So run sanitycheck -p hsdk and see what falls out, repeatedly if need be. sanitycheck looks for nsimdrv in $PATH and uses it if found to run tests in the emulator.

There were some SMP-related races in the core kernel that Andy fixed recently. This seem stable nowadays...but we only have qemu_x86_64 to test with on our side. The frequency of the failures you are seeing (10%) suggests to me this may be an ARC issue.

What's funny in the end it's said "Test suite smp succeeded" :)

This suggests to me that this may be a problem upon thread exit. I imagine a child thread in the test case crashed, but after it had done its job.

@andrewboie andrewboie requested a review from abrodkin March 3, 2020 08:01
@andrewboie
Copy link
Contributor

andrewboie commented Mar 3, 2020

@abrodkin Oh another thing ... for ARC I think a custom arch_system_halt() is needed. The default arch_system_halt() in kernel/fatal.c falls over on SMP as it just halts one core. We haven't figured out a portable way to address this yet.

This might need to be addressed first before you can catch this with sanitycheck since as you say the suite passes even though there was a crash.

On x86_64 we issue a special output call which causes QEMU to instantly exit. Maybe some similar capability exists in nSIM?

@abrodkin
Copy link
Collaborator

abrodkin commented Mar 3, 2020

@vonhust so I took a liberty and tried the SMP test (cmake -DBOARD=hsdk ../tests/kernel/smp/) and even though mostly it works from time to time (like every 10th execution) ends-up having machine check exception, see 2 examples below.

@abrodkin the way I find best to smoke this stuff out is through sanitycheck and just run all tests. So run sanitycheck -p hsdk and see what falls out, repeatedly if need be. sanitycheck looks for nsimdrv in $PATH and uses it if found to run tests in the emulator.

Well I do use sanitychecker script these days quite a lot indeed. But I wanted to focus more on a particular topic - SMP. Because I think we have some other tests failing for HSDK still. Moreover we're almost done with bring-up of automated sanity checker on real ARC boards including HSDK (though in UP mode only as OpenOCD for ARC turned to not support SMP, see zephyrproject-rtos/openocd#23 which we're about to start working on as well).

There were some SMP-related races in the core kernel that Andy fixed recently. This seem stable nowadays...but we only have qemu_x86_64 to test with on our side. The frequency of the failures you are seeing (10%) suggests to me this may be an ARC issue.

Very well might be the case and needs a deeper look obviously.

What's funny in the end it's said "Test suite smp succeeded" :)

This suggests to me that this may be a problem upon thread exit. I imagine a child thread in the test case crashed, but after it had done its job.

But that's not good still, right?

@abrodkin Oh another thing ... for ARC I think a custom arch_system_halt() is needed. The default arch_system_halt() in kernel/fatal.c falls over on SMP as it just halts one core. We haven't figured out a portable way to address this yet.

Well not really. Instead we need to add support of ARConnect Inter-Core Debug Unit and configure it so that whenever any core is halted all the rest get halted too. That BTW helps for interactive debugging as well so needs to be done anyways. See #23212 filed just now for tracking.

This might need to be addressed first before you can catch this with sanitycheck since as you say the suite passes even though there was a crash.

On x86_64 we issue a special output call which causes QEMU to instantly exit. Maybe some similar capability exists in nSIM?

May I have a pointer to that one? In light of our work on QEMU port fot ARC that might be interesting to check if that could be re-used easily for us.

P.S. I'm on FTO officially and so might not reply immediately being far from the laptop :)

@andrewboie
Copy link
Contributor

Well I do use sanitychecker script these days quite a lot indeed. But I wanted to focus more on a particular topic - SMP. Because I think we have some other tests failing for HSDK still. Moreover we're almost done with bring-up of automated sanity checker on real ARC boards including HSDK (though in UP mode only as OpenOCD for ARC turned to not support SMP, see zephyrproject-rtos/openocd#23 which we're about to start working on as well).

Problem is, most of the SMP bugs we have seen lately are not caught by the SMP tests. It's hard to design tests for some of these race conditions because they only break when specific events line up at the right time on multiple CPUs. Best we can do is run all tests and see what shakes out. This is something that keeps me up at night.

But that's not good still, right?

Nope. Definitely need to look into this. The test only passes because of arch_system_halt() not doing the right thing.

Well not really. Instead we need to add support of ARConnect Inter-Core Debug Unit and configure it so that whenever any core is halted all the rest get halted too. That BTW helps for interactive debugging as well so needs to be done anyways. See #23212 filed just now for tracking.

arch_system_halt() is what gets invoked any time a fatal error occurs. This is what you want to implement trust me. The default implementation is __weak and doesn't work right with SMP. You implement your own in the arch code. If what it does is talk to the inter-core debug unit, great.

May I have a pointer to that one? In light of our work on QEMU port fot ARC that might be interesting to check if that could be re-used easily for us.

grep for arch_system_halt() in arch/x86

@jhedberg jhedberg modified the milestones: v2.2.0, v2.3.0 Mar 10, 2020
Wayne Ren added 9 commits March 12, 2020 15:47
for smp target, there is a case where just one core is running, then:
* during init, the master core will run, others cores will halt/sleep
* use timer driver for single core

Signed-off-by: Wayne Ren <[email protected]>
hsdk and nsim_hs_smp are multicore targetes, enable smp
by default

Signed-off-by: Wayne Ren <[email protected]>
the arc issue is fixed, remove the filter

Signed-off-by: Wayne Ren <[email protected]>
after applying commit 3235451, arc's arch_switch also need's
corresponding fix.

Signed-off-by: Wayne Ren <[email protected]>
…eption

overhaul the thread switch code in epilogue of irq and
exception handling:
* add z_arch_get_next_switch_handle to call z_get_next_switch_handle,
  let the scheduler to decide the switch thread. This will also cover
  the case of SMP.

* put lots of common codes in macros for thread switch to improve
  the maintainablity, readability.

* clean up of some lables to make codes easier to understand

Signed-off-by: Wayne Ren <[email protected]>
* update comments to match latest codes
* add extra comments for some assembly, macros
* use macro to replace duplcated codes
* remove unused codes, lables, symobols

Signed-off-by: Wayne Ren <[email protected]>
fix the trace of isr enter and exit

Signed-off-by: Wayne Ren <[email protected]>
the old codes just work for single core, we need to consider
the case of SMP.

In SMP, it's not easy to get current thread of current cpu in
assembly, so we'd better do it in C.

Signed-off-by: Wayne Ren <[email protected]>
To make a fake exception, we should set not clear AE bit

Signed-off-by: Wayne Ren <[email protected]>
@vonhust
Copy link
Author

vonhust commented Mar 12, 2020

@vonhust The restructuring makes things cleaner (use more common Zephyr code) and avoid some code duplication (new assembler macro's factoring our irq/rirq common code).
Would be good to better document the assumptions and use of registers by the macros to improve maintainability, or even add them more explicit as parameters (arguments, scratch register ok to use).

@ruuddw @abrodkin I already added more comments to the assembly and macros, and did further clean up for better maintenance. If there are still un-clarified places, let met know, i will address them.

current openocd tracing does not support SMP

Signed-off-by: Wayne Ren <[email protected]>
@nashif nashif merged commit 649c678 into zephyrproject-rtos:master Mar 12, 2020
@vonhust vonhust deleted the topic-test-smp branch March 16, 2020 09:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: ARC ARC Architecture area: Boards area: Kernel area: Tests Issues related to a particular existing or missing test area: Timer Timer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ARC SMP is mostly untested in sanitycheck
8 participants