Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix regresion from https://github.com/apache/nuttx/pull/14881 #15073

Merged
merged 2 commits into from
Dec 9, 2024

Conversation

hujun260
Copy link
Contributor

@hujun260 hujun260 commented Dec 6, 2024

Summary

fix regresion from #14881
reason:
svc call may trigger hardfalt

Background
The origin of this issue is our desire to eliminate the function of storing
"regs" in g_current_regs and instead utilize (*running_task)->xcp.regs for storage.
The benefits of this approach include faster storage speed and
avoiding multiple accesses to g_current_regs during context switching,
thus ensuring that whether returning from an interrupt or an exception,
we consistently use this_task()->xcp.regs

Issue Encountered
However, when storing registers, we must ensure that (running_task)->xcp.regs is invalid
so that it can be safely overwritten.
According to the existing logic, the only scenario where (running_task)->xcp.regs
is valid is during restore_context. We must accurately identify this scenario.
Initially, we used the condition (running_task)==NULL for this purpose, but we deemed
this approach unsatisfactory as it did not align well with the actual logic.
(running_task) should not be NULL. Consequently, we adopted other arch-specific methods for judgment,
but due to special logic in some arch, the judgment was not accurate, leading to this issue.

Solution:
For armv6-m, we haven't found a more suitable solution, so we are sticking with (*running_task)==NULL.
For armv7-m/armv8-m, by removing support for primask, we can achieve accurate judgment.
PRIMASK is a design in armv6-m, that's why arm introduce BASEPRI from armv7-m. It's wrong to provide this option for armv7-m/armv8-m arch.

Impact

armv6-m
armv7-m
armv8-m

Testing

ci

@github-actions github-actions bot added Arch: arm Issues related to ARM (32-bit) architecture Size: S The size of the change in this PR is small labels Dec 6, 2024
@nuttxpr
Copy link

nuttxpr commented Dec 6, 2024

[Experimental Bot, please feedback here]

No, this PR does not fully meet the NuttX requirements. While it provides a link to the related PR and identifies affected architectures, it lacks crucial details. Here's a breakdown:

  • Insufficient Summary: It states what is fixed (a regression), but not how. It briefly mentions a hard fault possibility, but doesn't explain the root cause of the regression or the solution implemented. The functional part of the code being changed isn't specified.
  • Incomplete Impact: While it lists architectures, it doesn't use the YES/NO format and lacks descriptions. All other impact categories are missing. Does this change require documentation updates? Are there compatibility concerns?
  • Inadequate Testing: Stating "ci" isn't sufficient. The requirements ask for specific build host details and target details. Critically, it's missing "before" and "after" testing logs, which are essential to demonstrate the fix's effectiveness.

The PR needs substantial improvement in all three sections (Summary, Impact, and Testing) before it can be considered complete.

@jasonbu
Copy link
Contributor

jasonbu commented Dec 6, 2024

also influence this issue #14514

@hujun260 hujun260 marked this pull request as draft December 6, 2024 07:18
@hujun260 hujun260 marked this pull request as ready for review December 6, 2024 07:59
@github-actions github-actions bot added Size: XS The size of the change in this PR is very small and removed Size: S The size of the change in this PR is small labels Dec 6, 2024
@xiaoxiang781216 xiaoxiang781216 linked an issue Dec 6, 2024 that may be closed by this pull request
1 task
@raiden00pl
Copy link
Member

I confirm that this PR fixes this issue on nrf52 #14514

@raiden00pl raiden00pl linked an issue Dec 6, 2024 that may be closed by this pull request
1 task
arch/arm/Kconfig Outdated Show resolved Hide resolved
Copy link
Member

@raiden00pl raiden00pl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the consequences of this change? Does it have any disadvantages compared to PRIMASK? This is not a trivial change, so it's worth discuss it more as we basically force users to use BASEPRI here.

@xiaoxiang781216
Copy link
Contributor

xiaoxiang781216 commented Dec 6, 2024

What are the consequences of this change?

syscall doesn't go through hardfault, which make the syscall process is simple and fast, also avoid confusing the developer with JTAG debugging.

Does it have any disadvantages compared to PRIMASK?

I don't see any bad side effect.

This is not a trivial change, so it's worth discuss it more as we basically force users to use BASEPRI here.

PRIMASK is a bad design in armv6-m, that's why arm introduce BASEPRI from armv7-m. It's wrong to provide this option for armv7-m/armv8-m arch.

@raiden00pl
Copy link
Member

@hujun260 what about armv6-m ? this arch doesn't support BASEPRI, so this architecture is still broken.

@raiden00pl
Copy link
Member

PRIMASK is a bad design in armv6-m, that's why arm introduce BASEPRI from armv7-m. It's wrong to provide this option for armv7-m/armv8-m arch.

I think the timeline is different. armv7-m was introduced before armv6-m. The second one is a cut version of the first one and if ARM decided to get rid of BASEPRI they must have had a reason for this (but of course they could have made a design mistake here)

I don't see any bad side effect.

I don't see any flaws in this solution either and it can simplify a lot of things. But people use NuttX in different ways. A change like this should be more widely communicated in the community, maybe someone sees drawbacks.

@hujun260
Copy link
Contributor Author

hujun260 commented Dec 6, 2024

@hujun260 what about armv6-m ? this arch doesn't support BASEPRI, so this architecture is still broken.

It has been restored to the old implementation.

@github-actions github-actions bot added Size: S The size of the change in this PR is small and removed Size: XS The size of the change in this PR is very small labels Dec 6, 2024
@github-actions github-actions bot added Board: arm Size: L The size of the change in this PR is large and removed Size: S The size of the change in this PR is small labels Dec 6, 2024
@hujun260 hujun260 force-pushed the apache_9 branch 2 times, most recently from 103c9ea to a36b0bd Compare December 6, 2024 13:11
arch/arm/src/armv8-m/arm_tcbinfo.c Show resolved Hide resolved
arch/arm/src/armv7-m/arm_tcbinfo.c Show resolved Hide resolved
arch/arm/Kconfig Outdated Show resolved Hide resolved
@acassis
Copy link
Contributor

acassis commented Dec 6, 2024

@hujun260 Please include a complete explanation in the commit log message. I saw @xiaoxiang781216 try to explain a little bit about it, but it is important to explain what was modified initially that caused the issue (and why it was modified) and then how the new modification fixes the issue. It could help us understand it better and help other people in the future when looking this commit.

@hujun260
Copy link
Contributor Author

hujun260 commented Dec 8, 2024

was not accurate, leading to this issue.

Solution: For armv6-m, we haven't found a more suitable solution, so we are sticking with (*running_task)==NULL. For armv7-m/armv8-m, by remo

done

@acassis
Copy link
Contributor

acassis commented Dec 8, 2024

@hujun260 please extend the explanation from @xiaoxiang781216 and include it in the commit log message

reason:
svc call may trigger hardfalt

Signed-off-by: hujun5 <[email protected]>
@xiaoxiang781216
Copy link
Contributor

@hujun260 please extend the explanation from @xiaoxiang781216 and include it in the commit log message

aleady add to the second commit

@anchao
Copy link
Contributor

anchao commented Dec 9, 2024

@hujun260 @xiaoxiang781216 Could we remove the word "bad" from the commit message? I don't think this is a bad design, right? These are two interrupt masking mechanisms provided by the CPU. Similar to cortex-a/r, the CPU also provides cpsr and gic mechanisms, but the application scenarios are different.

reason:
svc call may trigger hardfault

Background
    The origin of this issue is our desire to eliminate the function of storing
"regs" in g_current_regs and instead utilize (*running_task)->xcp.regs for storage.
The benefits of this approach include faster storage speed and
avoiding multiple accesses to g_current_regs during context switching,
thus ensuring that whether returning from an interrupt or an exception,
we consistently use this_task()->xcp.regs

Issue Encountered
    However, when storing registers, we must ensure that (running_task)->xcp.regs is invalid
so that it can be safely overwritten.
According to the existing logic, the only scenario where (running_task)->xcp.regs
is valid is during restore_context. We must accurately identify this scenario.
Initially, we used the condition (running_task)==NULL for this purpose, but we deemed
this approach unsatisfactory as it did not align well with the actual logic.
(running_task) should not be NULL. Consequently, we adopted other arch-specific methods for judgment,
but due to special logic in some arch, the judgment was not accurate, leading to this issue.

Solution:
    For armv6-m, we haven't found a more suitable solution, so we are sticking with (*running_task)==NULL.
    For armv7-m/armv8-m, by removing support for primask, we can achieve accurate judgment.

    PRIMASK is a design in armv6-m, that's why arm introduce BASEPRI from armv7-m.
It's wrong to provide this option for armv7-m/armv8-m arch.

Signed-off-by: hujun5 <[email protected]>
@hujun260
Copy link
Contributor Author

hujun260 commented Dec 9, 2024

@hujun260 @xiaoxiang781216 Could we remove the word "bad" from the commit message? I don't think this is a bad design, right? These are two interrupt masking mechanisms provided by the CPU. Similar to cortex-a/r, the CPU also provides cpsr and gic mechanisms, but the application scenarios are different.

ok

@xiaoxiang781216 xiaoxiang781216 merged commit 0e1b432 into apache:master Dec 9, 2024
25 checks passed
@hartmannathan
Copy link
Contributor

hartmannathan commented Dec 9, 2024

EDIT: The following comment, which I made earlier, is incorrect. This PR removes support for PRIMASK, not BASEPRI. PRIMASK is not needed because BASEPRI is a superset of the functionality. This PR does not break High Priority, Zero Latency Interrupts! (Leaving the earlier text in place below...)

Sorry, I missed this PR before it was merged, but this PR breaks High Priority, Zero Latency Interrupts!

See the help text for ARMV7M_USEBASEPRI, which was removed in this PR:

config ARMV7M_USEBASEPRI
	bool "Use BASEPRI Register"
	default ARCH_HIPRI_INTERRUPT
	depends on ARCH_CORTEXM3 || ARCH_CORTEXM4 || ARCH_CORTEXM7
	---help---
		Use the BASEPRI register to enable and disable interrupts. By
		default, the PRIMASK register is used for this purpose. This
		usually results in hardfaults when supervisor calls are made.
		Though, these hardfaults are properly handled by the RTOS, the
		hardfaults can confuse some debuggers. With the BASEPRI
		register, these hardfaults, will be avoided. For more details see
		https://cwiki.apache.org/confluence/display/NUTTX/ARMv7-M+Hardfaults%2C+SVCALL%2C+and+Debuggers
		WARNING:  If CONFIG_ARCH_HIPRI_INTERRUPT is selected, then you
		MUST select CONFIG_ARMV7M_USEBASEPRI.  The Kconfig dependencies
		here will permit to select an invalid configuration because it
		cannot enforce that requirement.  If you create this invalid
		configuration, you will encounter some problems that may be
		very difficult to debug.

Also see https://nuttx.apache.org/docs/latest/guides/zerolatencyinterrupts.html

Important note: If this feature is removed from NuttX, then NuttX is completely useless to me and my employer! We depend on the Zero Latency Interrupts for critical parts of our firmware!

@hartmannathan
Copy link
Contributor

Please see Issue #15100

@hartmannathan
Copy link
Contributor

What are the consequences of this change? Does it have any disadvantages compared to PRIMASK? This is not a trivial change, so it's worth discuss it more as we basically force users to use BASEPRI here.

@raiden00pl the consequences of removing CONFIG_ARMV7M_USEBASEPRI is that high priority zero latency interrupts cannot work anymore. Unfortunately I think several PRs will need to be reverted. It seems the issue starts with PR-14881 which attempts to optimize storing
"regs" in g_current_regs and instead utilize (*running_task)->xcp.regs for storage, but unfortunately it seems this leads to hardfault regression; unfortunately, if the Zero Latency Interrupts are broken, I cannot use NuttX anymore. It is a complete showstopper. I recommend to revert the PRs that led to this issue. I have filed a bug #15100 to track this issue.

@hujun260
Copy link
Contributor Author

Important note: If this feature is removed from NuttX, then NuttX is completely useless to me and my employer! We depend on the Zero Latency Interrupts for critical parts of our firmware!

We haven't removed this feature; we've just set it to be enabled by default.

@hartmannathan
Copy link
Contributor

Important note: If this feature is removed from NuttX, then NuttX is completely useless to me and my employer! We depend on the Zero Latency Interrupts for critical parts of our firmware!

We haven't removed this feature; we've just set it to be enabled by default.

Thank you for clarifying. I misunderstood the changes earlier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arch: arm Issues related to ARM (32-bit) architecture Board: arm Size: L The size of the change in this PR is large
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Crash Introduced when calling app from nsh [BUG] hardware timers broken for ARM devices
8 participants