-
Notifications
You must be signed in to change notification settings - Fork 6.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
arch/xtensa: Add a call0 ABI variant #62117
Conversation
See docs in the PR itself, but I'll paste here for visibility: Zephyr with the Xtensa CALL0 ABIThe Xtensa register window mechanism is a poor fit for memory Naively, the kernel would have to write out the entire 64-entry Instead, for userspace apps, Zephyr builds using the "call0" ABI, ToolchainExisting Xtensa toolchains support a Unfortunately that doesn't extend to binary artifacts. In particular Cadence toolchains have an automatic multilib scheme and will select a But for now, you have to have a separate toolchain (or at least a #!/bin/sh
set -ex
TC=$1
if [ -z "$TC" ]; then
TC=sample_controller
fi
# Grab source (these are small)
git clone https://github.com/zephyrproject-rtos/sdk-ng
git clone https://github.com/crosstool-ng/crosstool-ng
# Build ct-ng itself
cd crosstool-ng
./bootstrap
./configure --enable-local
make -j$(nproc)
# Configure for the desired toolchain
ln -s ../sdk-ng/overlays
cp ../sdk-ng/configs/xtensa-${TC}_zephyr-elf.config .config
grep -v CT_TARGET_CFLAGS .config > asdf
mv asdf .config
echo CT_TARGET_CFLAGS="-mabi=call0" >> .config
echo CT_LIBC_NONE=y >> .config
./ct-ng olddefconfig
./ct-ng build.$(nproc)
echo "##"
echo "## Set these values to enable this toolchain:"
echo "##"
echo export CROSS_COMPILE=$HOME/x-tools/xtensa-${TC}_zephyr-elf/bin/xtensa-${TC}_zephyr-elf-
echo export ZEPHYR_TOOLCHAIN_VARIANT=cross-compile Note you don't really need to use the toolchain that was built, it's Longer term we need proper integration into the SDK, obviously. Also as a quick start: it's possible to build Zephyr in such a way as Bugs and Missing FeaturesNo support in the SDK. See above about toolchains. The existing syscall implementation is a mock, because this happened Likewise the MMU integration is a blindly-copied stub and unexercised. There's no C implementation of the Zephyr "call syscalls" API, just For simplicity, this code is written to save context to the arch Right now the Zephyr HAL integration doesn't build when the call0 flag The ARCH_EXCEPT() handling is a stub and doesn't actually do anything. Backtrace logging is likewise specific to the older frame format and FPU support isn't enabled. Likewise this isn't any different (except A good security audit is needed to make sure all the holes are I realized when writing this that our existing handling for NMI |
The context follows discussion from #61303 so be sure to check there also. Basically: it's real and it works. The code is comparatively tiny vs. the maybe-too-clever-for-its-own-good asm2 integration (which I still really like, but really isn't a good fit for privilege separation). I'm very happy on the whole. The libgcc issue is going to be a pain, and I'm really hoping someone with more SDK fu and toolchain interest can help with that? @jcmvbkbc 's name is all over the GNU toolchain patches for the architecture, so he surely has suggestions on how to make this work best. The only missing feature still causing any test failures I can find is ARCH_EXCEPT(), which I should have patched up RSN (it's just a code motion thing). The other missing stuff is all in the nice-to-have category and IMHO shouldn't block merge or further work. This was all done against sample_controller, with a few tries for validation on my mt8195 at various stages. It should ("should") work everywhere but in practice there will be burps. First up on my list is to get it working on dc233c to validate the MMU handling, though that will need to wait a few days while I catch up on my day job tasks which have been woefully neglected while I work on fun stuff. |
(added a selection of Xtensa stakeholders, please pull in anyone I've forgotten) Also it's worth pointing out that at least a little thought has been given to a re-unification path, since I don't think people will want to give up on windows for platforms where they provide value, and no one wants to maintain two completely separate context layers in the long term. Most of the code isn't involved in the "register-twirl on interrupt" trick at all. We can partition things out into paths like "pre-call context save" (which can be e.g. save-all-64, or twirl, etc...), "do the call" (which can be a simple ABI call or the cross stack trick), "post-call-switch", "post-call-no-switch", etc... The state needed to discriminate that is really just the RING bits of PS, so the overhead isn't awful even if getting it to generate optimally might be a pain. Definitely not for the first version but maybe at some point. |
arch/xtensa/core/xtensa-win0.S
Outdated
*/ | ||
movi a1, ~(PS_RING_MASK | PS_UM_MASK | PS_INTLEVEL_MASK) | ||
and a0, a0, a1 | ||
addi a0, a0, 1 /* INTLEVEL = 1 */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm realizing while re-reading this post-submit that this is actually wrong. I blindly copied the need to adjust intlevel from the L1 handler, but in fact syscalls (after setup) are thread contexts and should be running at an interrupt level of zero, obviously. My mock rig didn't catch that, but obviously the test suite will start flagging errors if you're not able to receive L1 interrupts from syscalls.
Should work to just remove that one line and adjust all the comments talking about this.
FYI, I did a quick tinkering with sdk-ng trying to build with multilib. However, the usually way of specifying |
You got farther than I did. I discovered those MULTILIB_ argument hooks via google, saw that they weren't touched by the xtensa stuff, and gave up hoping quite frankly that you'd know how to fix it. :) But we also have Max's attention here who's (I assume) the maintainer, so maybe there's an even better resource. @jcmvbkbc ? |
Not at the moment, unfortunately. In linux it was solved by getting rid of the libgcc dependency and adding all required helper sources to the kernel. |
Oh hey, well look at that, there they are. That's really clever. Unfortunately we're likely to run into licensing issues with importing GPL code in Zephyr (but I'm really not the expert on this stuff); any idea if the source is available in Apache-compatible form anywhere? But the code in the Linux tree is a TON simpler (and smaller -- looks like most of libgcc is never used by xtensa compiler output?) than the original libgcc, so if nothing else it would be a good guide for us building our own artifacts for the SDK without building a whole compiler to throw away. |
I only know of GPL-licensed version of this code. OTOH LLVM should have something equivalent, perhaps under different license. |
This comment was marked as outdated.
This comment was marked as outdated.
New ABI, much simpler, slightly bigger/slower code, but MUCH cleaner syscall entry. Signed-off-by: Andy Ross <[email protected]>
This register alias was originally introduced to allow A0 to be used as a scratch register when handling exceptions from MOVSP instructions. (It replaced some upstream code from Cadence that hard-coded EXCSAVE1). Now the MMU code is now using too, and for exactly the same purpose. Calling it "ALLOCA" is only confusing. Rename it to make it clear what it's doing. Signed-off-by: Andy Ross <[email protected]>
This file doesn't need the asm2 header, and the preprocessor logic around whether to include the backtrace header is needless (all it does is declare functions). Signed-off-by: Andy Ross <[email protected]>
New context layer for building Zephyr with the call0 ABI. The system call overhead of having to spill 64 GPRs, and the general complexity of the existing asm2 layer, is prohibitive when trying to build a userspace kernel entry path for Xtensa. Note: it still uses the windowed registers, but as a quick-switch register set for optimized interrupt entry. This is not a mechanism for building Zephyr on hardware that actually lacks windows (though with some work it could be). This is much (much) leaner in terms of entry/exit overhead, is significantly smaller, and should scale better with changes needed for userspace. There is a cost in terms of function call code size and execution speed, but in practice it seems manageable (I'm measuring ~4% with a no-frame-pointer build, Max Filippov reports more like ~13% in Linux builds -- both numbers are well under the kind of overhead we see enabling userspace at all). Signed-off-by: Andy Ross <[email protected]>
The backtrace mechanism is specific to the windowed call ABI currently, so doesn't work with call0. This was hidden previously because backtraces were disabled by default at the kconfig level for the old ("sample_controller") qemu_xtensa, but are on now with dc233c for essentially arbitrary reasons. There was some protection against this already, extend it to the case where "we have windows, but aren't using them". Signed-off-by: Andy Ross <[email protected]>
There was a system call handler added as part of the call0 ABI integration. Wire it up to the standard Zephyr userspace system call API. Signed-off-by: Andy Ross <[email protected]>
The linkage for the "after vector" RAM region was specifying files explicitly, which breaks when other files (or new files) want to add code to these regions. In particular call0 needs to use iram to store local jump targets for the vector entry. There's really no good spec on this. Conventionally xtensa platforms have always put the .iram and iram0 sections (and their derived sections like literals) immediately after the vector table, but I doubt there's a promise to that extent anywhere. Just try to adhere to what convention there is. (I note there's also a ".iram1" section referenced here that I'm not moving because I don't know what's in it) Signed-off-by: Andy Ross <[email protected]>
The job of this file is almost exclusively page table management. MMU hardware interaction is in there but a small part, and they don't interact except in a few small APIs. Rename it in advance of doing work to the hardware side. Signed-off-by: Andy Ross <[email protected]>
This is a reworked MMU layer, sitting cleanly below the page table handling in the OS, intended to integrate with call0 ABI handling and to be more performant and efficient. Notable differences from the original work: + Significantly smaller code and simpler API (just three functions to be called from the OS/userspace/ptable layer). + Big README-MMU document containing my learnings over the process, so hopefully fewer people need to go through this in the future. + No TLB flushing needed. Clean separation of ASIDs, just requires that the upper levels match the ASID to the L1 page table page consistently. + Validation layer to check page table caching attributes. Caching the PTEs is poison to SMP, as it requires a data cache flush on every CPU for every change of a page table anywhere. Even on single-core systems, having one mapping be write-through and the other writeback (which is what we had!) can break things. Hopefully this catches all the mistakes. + Vector mapping is done with a 4k page and not a 4M page, leading to much more flexibility with hardware memory layout. The original scheme required that the 4M region containing vecbase be mapped virtually to a location other than the hardware address, which makes confusing linkage with call0 and difficult initialization constraints where the exception vectors run at different addresses before and after MMU setup (effectively forcing them to be PIC code). + More provably correct initialization, all MMU changes happen in a single asm block with no memory accesses which would generate a refill. + Requires CONFIG_XTENSA_CALL0_ABI. This isn't strictly an issue of behavior: the MMU code doesn't care about reguster use per se. But with the register-windowed context management code in the older asm2, it's impossible to provide security across memory domains. Just rip the bandaid off. Signed-off-by: Andy Ross <[email protected]>
We were trying to initialize the MMU extremely early (out of crt1.S), which was forcing some odd gymnastics with splitting up the initialization paths. Just don't do that. The MMU boots with a pefectly good mapping for physical memory. Use that until we're set up and about to launch the post-arch initialization in z_cstart(). Just in general, trying to run compiled C code before z_cstart() calls kernel_arch_init() is asking for trouble. We do it only in extreme situations, and we don't need it here. Signed-off-by: Andy Ross <[email protected]>
Another spin, now with full MMU support for dc233c. This is intended to go under a rework of the userspace PR (which I
|
* | ||
* + System calls have their own optimized handler | ||
* + MMU TLB exceptions are handled directly, as they must not | ||
* re-enable interrupts at any point before returning to the trapping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what you mean here ? syscall handler is expected to run with interruptions enabled
https://github.com/zephyrproject-rtos/zephyr/blob/main/tests/kernel/mem_protect/userspace/src/main.c#L934
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the typography is confusing? Two items in a bulleted list:
- System calls have their own optimizer handler
- TLB exceptions don't unmask EXCM
Syscalls absolutely do unmask interrupts, obviously.
s32i a0, a10, ___thread_t_arch_OFFSET + __xtensa_win0_ctx_t_user_ps_OFFSET | ||
|
||
/* Further cook PS to set RING=0 and UM=0, then set our sp and unmask */ | ||
movi a1, ~(PS_RING_MASK | PS_UM_MASK) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you clearing up UM
? As far as I could see kernel threads and user threads are using user vector mode, the exception is only here ...
In other hand, you have to unmask EXCM, since syscall handlers expects it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just performance, really. The user vector has more work to do than the kernel vector: it needs to check for syscall entry. Obviously system call handlers won't be making syscalls of their own, so there's no point.
And the code already removed EXCM from this the PS value in a0 a few lines above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and actually in this version of the PR, all threads are initialized with a PS of zero, so all threads are kernel mode by default. Shimming the userspace stuff on top as we speak, at which point we'll be consistent between RING and UM (i.e. you either have RING=3/UM=1 or RING=0/UM=0, and no other states)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok, just noticed that thread ctx
ps
is not set :P
l32i a1, a1, 0 | ||
#endif | ||
wsr a0, PS | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assuming that you have to re-enable interruptions / exceptions before jump into the handler
you can be switched after setting PS
and syscall parameters are not saved in this context stack ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interrupted userspace a0/a1/pc/ps values (the ones we need to preserve) are already stashed in the thread struct at this point. All a context switch arriving now has to do is save our kernel space GPRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that happens only with interrutptions ritght ? inside must_switch
I remember having seen case where a syscall triggers a context switch without an interruption happens and AFAIU xtensa_switch
does not save them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do a cooperative context switchin inside a syscall too (consider e.g. k_sleep()). But again system calls aren't interrupts to the userspace code, we can rely on userspace having preserved its own state too. In this particular case, the system call is just a regular ABI call: we're responsible for preserving the return value in a0 and the stack pointer, and a12-a15 (which are already preserved in kernel space by the ABI call we make to the handler). And we restore the PC and PS values upon returning too. So we only need four registers of state.
(Actually writing this I realize that it's 3: the return PC is just the address of the RET instruction in the stub that's going to jump to a0, we don't really need to store both and could just restore to the value in a0. Will add an optimization note)
once I built the toolchain, I had to add
it seems to be happening in Removing some printks the code runs a little bit further but it stucks with:
It seems I am missing something to reproduce the sample, just after that I noticed that the build it generates a lot of warnings:
What is missing in the toolchain ? |
Yeah, that's what you get if you have a windowed toolchain (strictly the libgcc.a in the toolchain) building against a call0 Zephyr, or vice versa. It will crash any time it hits the ENTRY instruction at the start of a libgcc instruction (because PS.WOE=0). And indeed printk() is one of the big offenders, as it uses 64 bit math by default. If you look way upthread, there are some kconfigs I found that suppress enough 64 bit-isms to make the test run, and that probably works still. Another option is to use a cadence toolchain, as those are multilib and don't have this problem. Finally remember @dcpleung had patches to the SDK working which build a multilib gcc; I never got to trying them myself but for sure that's a better solution than the hand-built one in my instructions above. Maybe he could just hand you a copy? |
Yep, I know it. The point is that I follow the whole instructions to get sdk-ng and crosstool and to build them. I thought with that we had a proper libgcc ...
One of the things you pointed in the other pr is not having a way to test in CI, that is where we are here ... |
Can you check that the relevant CFLAGS kconfig is in the ct-ng .config file? It can get dropped if you reconfigure it via menuconfig or whatever, IIRC. You can also check the disassembly of libgcc.a, if you see any CALLXn/ENTRY/RETW instructions, it's a windowed library.
I know. :) This is still a DNM PR for sure, pending a toolchain release. I swear I'm not trying to monopolize this or turn it into a fight. We need one codebase that meets all requirements (though to be fair this one has/will-have some genuine advantages in performance/size). Clearly neither is there yet, the intent is to turn this one into a merged PR with the userspace support from the other. But I'm very happy to go in the other direction too. |
As for toolchain, grab zephyrproject-rtos/sdk-ng#701 and make sure do a submodule update to grab the GCC patch. |
regs->ptevaddr = CONFIG_XTENSA_MMU_PTE_BASE + user_asid * 0x400000; | ||
|
||
/* The ptables code doesn't add the mapping for the l1 page itself */ | ||
l1_page[l1_idx(regs->ptevaddr)] = (uint32_t) l1_page; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manipulating the page table here does not look nice if the goal is to split the mmu / page table logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the cache bits may not match as well ...
__ASSERT_NO_MSG(user_asid != 1 && user_asid < CONFIG_XTENSA_MMU_ASID_MAX); | ||
|
||
/* We don't use ring 1/2, ring 0 ASID must be 1 */ | ||
regs->rasid = (user_asid << 24) | 0x000001; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know each ring in RASID must be different, we can't leave rings 1/2 with the same value
This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time. |
New context layer for building Zephyr with the call0 ABI. The system call overhead of having to spill 64 GPRs, and the general complexity of the existing asm2 layer, is prohibitive when trying to build a userspace kernel entry path for Xtensa.
Note: it still uses the windowed registers, but as a quick-switch register set for optimized interrupt entry. This is not a mechanism for building Zephyr on hardware that actually lacks windows (though with some work it could be).
This is much (much) leaner in terms of entry/exit overhead, is significantly smaller, and should scale better with changes needed for userspace. There is a cost in terms of function call code size and execution speed, but in practice it seems manageable (I'm measuring ~4% with a no-frame-pointer build, Max Filippov reports more like ~13% in Linux builds -- both numbers are well under the kind of overhead we see enabling userspace at all).