Rework hypervisor concept #47

Open
sevenautumns opened this issue Feb 16, 2023 · 8 comments

@sevenautumns (Collaborator) commented Feb 16, 2023

As it stands, it is not clear whether or not the current concept is ARINC 653 compliant.
The current concept may also not be extendable towards ARINC 653 Part 1.

Issues:

  • Are the responsibilities distributed correctly between the hypervisor and the partitions?
  • The current error system is not easy to use; it should be compatible with the ARINC 653 Health Monitor
  • Currently, calls from the partition to the hypervisor do not allow for answers from the hypervisor (no syscall/RPC behaviour)

Possible Solution:

Error System / Health Monitor

  • Remove the level from errors (errors only have a type)
  • Use a state machine for all possible states of a partition (as far as the hypervisor is concerned)
    • Init
    • Running
      • Cold/Warm start
      • Normal
      • Idle
    • Transition to
    • Paused (when not within its scheduling window)
    • Restart
    • Error
  • The Error state handles the error according to the Health Monitor table (a state-machine sketch in code follows this list)
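
A minimal Rust sketch of how such a state machine could look. Type, variant, and error names are illustrative only, and the truncated "Transition to" entry above is left out:

```rust
/// Partition states as the hypervisor sees them (names mirror the list above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OperatingMode {
    ColdStart,
    WarmStart,
    Normal,
    Idle,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PartitionState {
    Init,
    Running(OperatingMode),
    /// Outside of its scheduling window.
    Paused,
    Restart,
    /// Handled according to the Health Monitor table; errors only carry a type, no level.
    Error(ErrorKind),
}

/// Illustrative error types (no levels attached).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErrorKind {
    ApplicationError,
    DeadlineMissed,
    IllegalRequest,
}

impl PartitionState {
    /// Example transition: a paused partition resumes when its scheduling window starts.
    fn on_window_start(self) -> Self {
        match self {
            PartitionState::Paused => PartitionState::Running(OperatingMode::Normal),
            other => other,
        }
    }
}
```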

"Systemcall" from partition to hypervisor

Use ptrace(2) together with PTRACE_SYSEMU (which is already used to realize User-mode Linux) to trap partition processes on system calls, replacing each call with the desired behaviour inside the hypervisor.
Theoretically, non-existent system call IDs could be used to identify APEX functions when using ptrace(2).
When clone(2) is used to spawn the main process of a partition, PTRACE_TRACEME can be called to allow ptrace.
The hypervisor can wait for the partitions' SIGTRAP with sigtimedwait (see sigwaitinfo(2)), utilizing a timeout.
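
A minimal sketch of how this trapping could look in Rust, assuming the nix and libc crates; nix's ptrace::sysemu wrapper is only available on x86/x86_64 gnu targets and depends on the crate version, and the register names are x86_64-specific. The APEX call id and its return value are made up for illustration:

```rust
// Cargo.toml (assumed): nix = "0.26" with the "ptrace", "process", "signal" features; libc = "0.2".
use nix::sys::ptrace;
use nix::sys::signal::{raise, Signal};
use nix::sys::wait::{waitpid, WaitStatus};
use nix::unistd::{fork, ForkResult};

// Made-up APEX call id, chosen well outside the range of real Linux syscall numbers.
const APEX_GET_PARTITION_STATUS: u64 = 10_000;

fn main() -> nix::Result<()> {
    match unsafe { fork() }? {
        ForkResult::Child => {
            // Partition side: allow tracing, then stop so the hypervisor takes control
            // before any "APEX call" is issued.
            ptrace::traceme()?;
            raise(Signal::SIGSTOP)?;
            // The "APEX call": a syscall id the kernel does not know. With PTRACE_SYSEMU
            // it is never executed by the kernel; the hypervisor supplies the return value.
            let ret = unsafe { libc::syscall(APEX_GET_PARTITION_STATUS as libc::c_long) };
            println!("partition: APEX call returned {ret}");
            std::process::exit(0)
        }
        ForkResult::Parent { child } => {
            waitpid(child, None)?; // initial SIGSTOP
            ptrace::sysemu(child, None)?; // run to the next syscall entry without executing it
            if let WaitStatus::Stopped(pid, Signal::SIGTRAP) = waitpid(child, None)? {
                let mut regs = ptrace::getregs(pid)?;
                if regs.orig_rax == APEX_GET_PARTITION_STATUS {
                    regs.rax = 42; // the emulated "result" of the APEX call
                    ptrace::setregs(pid, regs)?;
                }
                // Let the partition run freely again. A real hypervisor would keep
                // trapping here and also has to deal with ordinary Linux syscalls.
                ptrace::cont(pid, None)?;
            }
            waitpid(child, None)?; // reap the partition process
            Ok(())
        }
    }
}
```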

Hypervisor Main Loop

  • Maintain "EventList"
    • Candidate: SkipMap
    • Events:
      • Start Partition
      • Stop Partition
      • Start Process
      • Process Deadline (Revisit ARINC deadline actions)
      • Health event (For having a central handling authority)
  • Wait for SIGCHLD
    • Utilize signalfd(2)
    • Wait with a poller on SIGCHLD or until the timeout elapses (timeout = remaining time until the next event in the "EventList")
  • On either SIGCHLD or timeout expiry (see the loop skeleton sketched after this list)
    • Check whether a newer event has become due (for example a new "Start Process" or "Health event")
    • Give every active partition a chance to check its processes for a SIGTRAP
      • Spawn a handler thread for serving the caught syscall
        • Use rayon::ThreadPool
        • TODO: remember which SIGTRAP'ed processes have already been served
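
A rough skeleton of this loop, using crossbeam-skiplist's SkipMap as the "EventList". The Event variants and timings are placeholders, `wait_for_sigchld_or_timeout` stands in for the signalfd(2)/poll combination, and the partition/SIGTRAP handling is only hinted at:

```rust
// Cargo.toml (assumed): crossbeam-skiplist = "0.1"
use crossbeam_skiplist::SkipMap;
use std::time::{Duration, Instant};

#[allow(dead_code)]
#[derive(Debug)]
enum Event {
    StartPartition(usize),
    StopPartition(usize),
    StartProcess(usize),
    ProcessDeadline(usize),
    HealthEvent(usize),
}

/// Placeholder: the real loop would poll a signalfd(2) for SIGCHLD with this timeout.
fn wait_for_sigchld_or_timeout(timeout: Duration) -> bool {
    std::thread::sleep(timeout);
    false // pretend the timeout elapsed
}

fn main() {
    // "EventList", ordered by due time; the sequence number keeps keys unique
    // when several events fall on the same instant.
    let events: SkipMap<(Instant, u64), Event> = SkipMap::new();
    let now = Instant::now();
    events.insert((now + Duration::from_millis(50), 0), Event::StartPartition(1));
    events.insert((now + Duration::from_millis(80), 1), Event::StartProcess(1));

    for _ in 0..3 {
        // Timeout = remaining time until the next event (or an idle default).
        let timeout = events
            .front()
            .map(|e| e.key().0.saturating_duration_since(Instant::now()))
            .unwrap_or(Duration::from_millis(100));

        let got_sigchld = wait_for_sigchld_or_timeout(timeout);

        // Handle every event that has become due in the meantime.
        while let Some(next) = events.front() {
            if next.key().0 > Instant::now() {
                break;
            }
            let due = events.pop_front().unwrap();
            println!("handling {:?}", due.value());
        }

        if got_sigchld {
            // Give every active partition a chance to check its processes for a SIGTRAP
            // and hand caught syscalls to a worker pool (e.g. rayon::ThreadPool).
        }
    }
}
```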

TODO

  • Check if we can actually use non-existent system call IDs
  • Check if we can return custom data when emulating APEX system calls
@cvengler (Member)

I'm curious about how we should deal with fork(2)s in traced processes... Ignore them? I mean, the reason why we continue to stick to cgroups is (if I understood it correctly) to be able to handle multiple processes in one partition.

@sevenautumns (Collaborator, Author)

> I'm curious about how we should deal with fork(2)s in traced processes... Ignore them? I mean, the reason why we continue to stick to cgroups is (if I understood it correctly) to be able to handle multiple processes in one partition.

We cannot really spawn new processes on behalf of a partition. This is why we should allow a partition to fork, with us intercepting the fork. Should the partition fork when it is not allowed to, we can take an action according to the Health Monitor table. Through the interception, we know the process ID and can put it into its own cgroup (a rough sketch of this interception follows below).
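
A rough sketch of such an interception, assuming the nix ptrace wrappers, cgroup v2, and a hypothetical per-partition cgroup path; `trace_forks` is a helper sketch rather than a complete program, and the Health Monitor check is only marked as a comment:

```rust
// Cargo.toml (assumed): nix = "0.26" with the "ptrace", "process", "signal" features.
use nix::sys::ptrace::{self, Options};
use nix::sys::signal::Signal;
use nix::sys::wait::{wait, WaitStatus};
use nix::unistd::Pid;
use std::fs::OpenOptions;
use std::io::Write;

/// Move a process into a per-partition cgroup by writing its PID into cgroup.procs.
/// The path is hypothetical and depends on how the hypervisor lays out its cgroups.
fn move_into_partition_cgroup(pid: Pid) -> std::io::Result<()> {
    let mut procs = OpenOptions::new()
        .write(true)
        .open("/sys/fs/cgroup/hypervisor/partition1/cgroup.procs")?;
    write!(procs, "{}", pid)
}

/// Follow fork()s of an already-traced partition main process.
fn trace_forks(main_process: Pid) -> nix::Result<()> {
    // Stop the tracee on fork/vfork/clone and attach us to the new child automatically.
    ptrace::setoptions(
        main_process,
        Options::PTRACE_O_TRACEFORK | Options::PTRACE_O_TRACEVFORK | Options::PTRACE_O_TRACECLONE,
    )?;
    ptrace::cont(main_process, None)?;

    loop {
        match wait()? {
            // Fork event: the new PID is delivered as the ptrace event message.
            // Here the Health Monitor table would decide whether the fork is allowed at all.
            WaitStatus::PtraceEvent(pid, _, _) => {
                let child = Pid::from_raw(ptrace::getevent(pid)? as i32);
                let _ = move_into_partition_cgroup(child);
                ptrace::cont(pid, None)?;
            }
            // The new child starts out stopped; just let it run.
            WaitStatus::Stopped(pid, Signal::SIGSTOP) => ptrace::cont(pid, None)?,
            WaitStatus::Stopped(pid, sig) => ptrace::cont(pid, Some(sig))?,
            WaitStatus::Exited(pid, _) if pid == main_process => break,
            _ => {}
        }
    }
    Ok(())
}
```

With PTRACE_O_TRACEFORK/VFORK/CLONE the kernel attaches the tracer to the new child automatically, so the hypervisor learns the new PID before the child runs any code.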

@cvengler (Member)

The more I think about the possibility of using ptrace, the uglier it gets. Although I enjoyed ptrace at first, it seems like a hack to me now, especially after I dug a bit deeper into the material.

Why ptrace is terrible

  • ptrace is slow! Not just a little bit slower, but significantly. I have written a very small C program that just prints all numbers from 0 to $10^7 - 1$. Running this natively (while redirecting all I/O to /dev/null) takes $5.473s$, while running the same program under strace(1), with the same I/O redirection, takes $135.371s$. That is more than 24x slower, or in other words: the runtime grows by roughly 2400%! Totally unacceptable, especially in a context in which time frames matter. (A minimal reproduction sketch follows this list.)
  • ptrace is unportable! The entire ptrace API is centered around architecture-specific behavior. For example, we have to read from and poke into the native CPU registers in order to intercept system calls. While the new PTRACE_GET_SYSCALL_INFO solves the portability issue for fetching the syscalls, it does not solve the issue of poking into the registers. You may argue that this is not really important, because x86_64 is the prevalent architecture on modern-day desktop hardware, but I still consider architecture-specific development to be very ugly. IMO, the only valid reasons for this kind of architecture-specific code are working at an extremely low level or squeezing out the last bit of performance, neither of which applies in our case.
  • ptrace is centered around single processes, not partitions! How should we continue if a process inside a partition decides to fork(2)? Does the PID namespace cause any potential problems here?
  • ptrace probably adds much more code complexity, especially since we could re-use other parts of the code (more on this later).
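
For reproduction, a sketch of such a benchmark program in Rust (not the original C program; absolute numbers will differ with the I/O buffering and libc used):

```rust
// Print all numbers from 0 to 10^7 - 1, once natively and once under strace, e.g.:
//
//   time ./bench > /dev/null
//   time strace -o /dev/null ./bench > /dev/null
use std::io::Write;

fn main() {
    // Rust's stdout is line buffered, so every number ends up as its own write(2),
    // which is exactly the per-syscall overhead the ptrace measurement is about.
    let stdout = std::io::stdout();
    let mut out = stdout.lock();
    for i in 0..10_000_000u32 {
        writeln!(out, "{i}").unwrap();
    }
}
```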

What are the alternatives

My idea would be that the parent passes a socketpair (or pipe) down to the child, through which the child sends syscalls as a fixed data structure. The requests would be executed in sequential order, with the parent sending back a fixed response data structure. If possible, we might use stdin and stdout for this, as they have fixed file descriptors. Alternatively, we could inherit a socketpair with a welcome message in its buffer. The child process would then (at its startup) try to read that welcome message from all file descriptors found inside /proc/$$/fd, in order to determine the appropriate fd early on. Maybe we could re-use some things from @dadada's PR.
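
A small sketch of that request/response round trip over a socketpair. Here both ends live in one process and a thread plays the partition; in the real design the partition end would be inherited across fork/exec, and the message layout and call id are made up for illustration:

```rust
use std::os::unix::net::UnixDatagram;
use std::thread;

const APEX_GET_PARTITION_STATUS: u32 = 1;

// Fixed 8-byte request: [call id | argument]; fixed 8-byte response: [status | value].
fn encode(a: u32, b: u32) -> [u8; 8] {
    let mut buf = [0u8; 8];
    buf[..4].copy_from_slice(&a.to_le_bytes());
    buf[4..].copy_from_slice(&b.to_le_bytes());
    buf
}
fn decode(buf: &[u8; 8]) -> (u32, u32) {
    (
        u32::from_le_bytes(buf[..4].try_into().unwrap()),
        u32::from_le_bytes(buf[4..].try_into().unwrap()),
    )
}

fn main() -> std::io::Result<()> {
    let (hypervisor_end, partition_end) = UnixDatagram::pair()?;

    // "Partition": send one APEX request and block on the answer.
    let partition = thread::spawn(move || {
        partition_end.send(&encode(APEX_GET_PARTITION_STATUS, 0)).unwrap();
        let mut buf = [0u8; 8];
        partition_end.recv(&mut buf).unwrap();
        println!("partition: response = {:?}", decode(&buf));
    });

    // "Hypervisor": serve requests strictly in order.
    let mut buf = [0u8; 8];
    hypervisor_end.recv(&mut buf)?;
    let (call, _arg) = decode(&buf);
    let response = match call {
        APEX_GET_PARTITION_STATUS => encode(0, 42), // status OK, made-up value
        _ => encode(1, 0),                          // unknown call
    };
    hypervisor_end.send(&response)?;

    partition.join().unwrap();
    Ok(())
}
```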

Some questions:

  • Are datagrams (SOCK_DGRAM) guaranteed to be delivered, and in order, when operating in the AF_UNIX domain?

Some resources:

@sevenautumns (Collaborator, Author)

@emilengler could you do a simple performance analysis of your pipe idea, as well?

@cvengler (Member)

> @emilengler could you do a simple performance analysis of your pipe idea, as well?

Sure, but I will probably only be able to do so the week after next, if that's okay. I want to use up my overtime hours in order to study for my exams.

@cvengler (Member)

Okay, my benchmarks are effectively done. I'll do some adjustments tomorrow and publish the code afterwards.

Emitting $10^7$ syscalls takes 2:55 minutes through ptrace and 1:07 minutes with my approach. In both cases, the syscall result is printed to stdout. Removing the final stdout write reduces the ptrace approach to 1:30 minutes, which is only slightly slower than my approach. However, ptrace scales linearly with every invoked syscall, whereas my approach only scales linearly with custom syscalls, not with regular Linux syscalls. Because of that, I opt for my approach. However, I will try to do some adjustments tomorrow and give you the code for reproduction.

@cvengler (Member)

Done. I have published the code in this semi-public repository.

The benchmark results are as follows:

Name      Time
ptrace    4m47.608s
sockets   2m7.185s

The sockets approach truly wins.

@cvengler (Member)

Update:

The current solution will probably be centered around ptrace(2) with the hypervisor being the monitoring process. A combination of signalfd(2) and waitpid(2) will be used to get event notifications whenever a child process changes state. Here is a small example.
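
The linked example is not reproduced here. As an illustration of the described signalfd(2)/waitpid(2) combination, assuming the nix crate, a minimal blocking variant could look like the following (the `sleep` child stands in for a partition process; in the hypervisor, the signalfd would be one entry in the poll-based main loop rather than being read in a blocking fashion):

```rust
// Cargo.toml (assumed): nix = "0.26" with the "signal", "signalfd", "process" features.
use nix::sys::signal::{SigSet, Signal};
use nix::sys::signalfd::{SfdFlags, SignalFd};
use nix::sys::wait::{waitpid, WaitPidFlag, WaitStatus};
use nix::unistd::Pid;
use std::process::Command;

fn main() -> nix::Result<()> {
    // Block SIGCHLD so it is only ever delivered through the signalfd.
    let mut mask = SigSet::empty();
    mask.add(Signal::SIGCHLD);
    mask.thread_block()?;
    let mut sfd = SignalFd::with_flags(&mask, SfdFlags::SFD_CLOEXEC)?;

    // Stand-in for a partition process.
    let _child = Command::new("sleep").arg("1").spawn().expect("spawn sleep");

    // Blocks until a SIGCHLD arrives.
    while sfd.read_signal()?.is_some() {
        // Reap/inspect every child that changed state, without blocking.
        loop {
            match waitpid(None::<Pid>, Some(WaitPidFlag::WNOHANG)) {
                Ok(WaitStatus::StillAlive) => break,
                Ok(status) => println!("child changed state: {status:?}"),
                Err(_) => return Ok(()), // no children left
            }
        }
    }
    Ok(())
}
```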
