Rework hypervisor concept #47

Open
sevenautumns opened this issue Feb 16, 2023 · 8 comments

@sevenautumns (Collaborator) commented Feb 16, 2023

As it stands, it is not clear whether or not the current concept is ARINC 653 compliant.
The current concept may also not be extendable towards ARINC 653 Part 1.

Issues:

  • Are the responsibilities distributed correctly between the hypervisor and the partitions?
  • The current error system is not easy to use; it should be compatible with the ARINC 653 Health Monitor
  • Currently, calls from the partition to the hypervisor do not allow for answers from the hypervisor (no syscall/RPC behaviour)

Possible Solution:

Error System / Health Monitor

  • Remove the level from errors (errors only have a type)
  • Use a state machine for all possible states of a partition (as far as the hypervisor is concerned)
    • Init
    • Running
      • Cold/Warm start
      • Normal
      • Idle
    • Transition to
    • Paused (when not within its scheduling window)
    • Restart
    • Error
  • The Error state handles the error according to the Health Monitor table (a state-machine sketch in code follows this list)
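
A minimal Rust sketch of how such a state machine could look. Type, variant, and error names are illustrative only, and the truncated "Transition to" entry above is left out:

```rust
/// Partition states as the hypervisor sees them (names mirror the list above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OperatingMode {
    ColdStart,
    WarmStart,
    Normal,
    Idle,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PartitionState {
    Init,
    Running(OperatingMode),
    /// Outside of its scheduling window.
    Paused,
    Restart,
    /// Handled according to the Health Monitor table; errors only carry a type, no level.
    Error(ErrorKind),
}

/// Illustrative error types (no levels attached).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErrorKind {
    ApplicationError,
    DeadlineMissed,
    IllegalRequest,
}

impl PartitionState {
    /// Example transition: a paused partition resumes when its scheduling window starts.
    fn on_window_start(self) -> Self {
        match self {
            PartitionState::Paused => PartitionState::Running(OperatingMode::Normal),
            other => other,
        }
    }
}
```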

"Systemcall" from partition to hypervisor

Use ptrace(2) together with PTRACE_SYSEMU (which is already used to realize User-mode Linux) to trap partition processes on system calls, replacing each call with the desired behaviour inside the hypervisor.
Theoretically, non-existent system call IDs could be used to identify APEX functions when using ptrace(2).
When clone(2) is used to spawn the main process of a partition, PTRACE_TRACEME can be called to allow ptrace.
The hypervisor can wait for the partitions' SIGTRAP with sigtimedwait (see sigwaitinfo(2)), utilizing a timeout.
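
A minimal sketch of how this trapping could look in Rust, assuming the nix and libc crates; nix's ptrace::sysemu wrapper is only available on x86/x86_64 gnu targets and depends on the crate version, and the register names are x86_64-specific. The APEX call id and its return value are made up for illustration:

```rust
// Cargo.toml (assumed): nix = "0.26" with the "ptrace", "process", "signal" features; libc = "0.2".
use nix::sys::ptrace;
use nix::sys::signal::{raise, Signal};
use nix::sys::wait::{waitpid, WaitStatus};
use nix::unistd::{fork, ForkResult};

// Made-up APEX call id, chosen well outside the range of real Linux syscall numbers.
const APEX_GET_PARTITION_STATUS: u64 = 10_000;

fn main() -> nix::Result<()> {
    match unsafe { fork() }? {
        ForkResult::Child => {
            // Partition side: allow tracing, then stop so the hypervisor takes control
            // before any "APEX call" is issued.
            ptrace::traceme()?;
            raise(Signal::SIGSTOP)?;
            // The "APEX call": a syscall id the kernel does not know. With PTRACE_SYSEMU
            // it is never executed by the kernel; the hypervisor supplies the return value.
            let ret = unsafe { libc::syscall(APEX_GET_PARTITION_STATUS as libc::c_long) };
            println!("partition: APEX call returned {ret}");
            std::process::exit(0)
        }
        ForkResult::Parent { child } => {
            waitpid(child, None)?; // initial SIGSTOP
            ptrace::sysemu(child, None)?; // run to the next syscall entry without executing it
            if let WaitStatus::Stopped(pid, Signal::SIGTRAP) = waitpid(child, None)? {
                let mut regs = ptrace::getregs(pid)?;
                if regs.orig_rax == APEX_GET_PARTITION_STATUS {
                    regs.rax = 42; // the emulated "result" of the APEX call
                    ptrace::setregs(pid, regs)?;
                }
                // Let the partition run freely again. A real hypervisor would keep
                // trapping here and also has to deal with ordinary Linux syscalls.
                ptrace::cont(pid, None)?;
            }
            waitpid(child, None)?; // reap the partition process
            Ok(())
        }
    }
}
```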

Hypervisor Main Loop

  • Maintain "EventList"
    • Candidate: SkipMap
    • Events:
      • Start Partition
      • Stop Partition
      • Start Process
      • Process Deadline (Revisit ARINC deadline actions)
      • Health event (For having a central handling authority)
  • Wait for SIGCHLD
    • Utilize signalfd(2)
    • Wait with a poller on SIGCHLD or until the timeout elapses (timeout = remaining time until the next event in the "EventList")
  • On either SIGCHLD or timeout expiry (see the loop skeleton sketched after this list)
    • Check whether a newer event has become due (for example a new "Start Process" or "Health event")
    • Give every active partition a chance to check its processes for a SIGTRAP
      • Spawn a handler thread for serving the caught syscall
        • Use rayon::ThreadPool
        • TODO: remember which SIGTRAP'ed processes have already been served
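
A rough skeleton of this loop, using crossbeam-skiplist's SkipMap as the "EventList". The Event variants and timings are placeholders, `wait_for_sigchld_or_timeout` stands in for the signalfd(2)/poll combination, and the partition/SIGTRAP handling is only hinted at:

```rust
// Cargo.toml (assumed): crossbeam-skiplist = "0.1"
use crossbeam_skiplist::SkipMap;
use std::time::{Duration, Instant};

#[allow(dead_code)]
#[derive(Debug)]
enum Event {
    StartPartition(usize),
    StopPartition(usize),
    StartProcess(usize),
    ProcessDeadline(usize),
    HealthEvent(usize),
}

/// Placeholder: the real loop would poll a signalfd(2) for SIGCHLD with this timeout.
fn wait_for_sigchld_or_timeout(timeout: Duration) -> bool {
    std::thread::sleep(timeout);
    false // pretend the timeout elapsed
}

fn main() {
    // "EventList", ordered by due time; the sequence number keeps keys unique
    // when several events fall on the same instant.
    let events: SkipMap<(Instant, u64), Event> = SkipMap::new();
    let now = Instant::now();
    events.insert((now + Duration::from_millis(50), 0), Event::StartPartition(1));
    events.insert((now + Duration::from_millis(80), 1), Event::StartProcess(1));

    for _ in 0..3 {
        // Timeout = remaining time until the next event (or an idle default).
        let timeout = events
            .front()
            .map(|e| e.key().0.saturating_duration_since(Instant::now()))
            .unwrap_or(Duration::from_millis(100));

        let got_sigchld = wait_for_sigchld_or_timeout(timeout);

        // Handle every event that has become due in the meantime.
        while let Some(next) = events.front() {
            if next.key().0 > Instant::now() {
                break;
            }
            let due = events.pop_front().unwrap();
            println!("handling {:?}", due.value());
        }

        if got_sigchld {
            // Give every active partition a chance to check its processes for a SIGTRAP
            // and hand caught syscalls to a worker pool (e.g. rayon::ThreadPool).
        }
    }
}
```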

TODO

  • Check if we can actually use non-existent system call IDs
  • Check if we can return custom data when emulating APEX system calls
@cvengler (Member)

I'm curious about how we should deal with fork(2)s in traced processes... Ignore them? I mean, the reason why we continue to stick to cgroups is (if I understood it correctly) to be able to handle multiple processes in one partition.

@sevenautumns (Collaborator, Author)

> I'm curious about how we should deal with fork(2)s in traced processes... Ignore them? I mean, the reason why we continue to stick to cgroups is (if I understood it correctly) to be able to handle multiple processes in one partition.

We cannot really spawn new processes on behalf of a partition. This is why we should allow a partition to fork, with us intercepting the fork. Should the partition fork when it is not allowed to, we can take an action according to the Health Monitor table. Through the interception, we know the process ID and can put it into its own cgroup (a rough sketch of this interception follows below).
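
A rough sketch of such an interception, assuming the nix ptrace wrappers, cgroup v2, and a hypothetical per-partition cgroup path; `trace_forks` is a helper sketch rather than a complete program, and the Health Monitor check is only marked as a comment:

```rust
// Cargo.toml (assumed): nix = "0.26" with the "ptrace", "process", "signal" features.
use nix::sys::ptrace::{self, Options};
use nix::sys::signal::Signal;
use nix::sys::wait::{wait, WaitStatus};
use nix::unistd::Pid;
use std::fs::OpenOptions;
use std::io::Write;

/// Move a process into a per-partition cgroup by writing its PID into cgroup.procs.
/// The path is hypothetical and depends on how the hypervisor lays out its cgroups.
fn move_into_partition_cgroup(pid: Pid) -> std::io::Result<()> {
    let mut procs = OpenOptions::new()
        .write(true)
        .open("/sys/fs/cgroup/hypervisor/partition1/cgroup.procs")?;
    write!(procs, "{}", pid)
}

/// Follow fork()s of an already-traced partition main process.
fn trace_forks(main_process: Pid) -> nix::Result<()> {
    // Stop the tracee on fork/vfork/clone and attach us to the new child automatically.
    ptrace::setoptions(
        main_process,
        Options::PTRACE_O_TRACEFORK | Options::PTRACE_O_TRACEVFORK | Options::PTRACE_O_TRACECLONE,
    )?;
    ptrace::cont(main_process, None)?;

    loop {
        match wait()? {
            // Fork event: the new PID is delivered as the ptrace event message.
            // Here the Health Monitor table would decide whether the fork is allowed at all.
            WaitStatus::PtraceEvent(pid, _, _) => {
                let child = Pid::from_raw(ptrace::getevent(pid)? as i32);
                let _ = move_into_partition_cgroup(child);
                ptrace::cont(pid, None)?;
            }
            // The new child starts out stopped; just let it run.
            WaitStatus::Stopped(pid, Signal::SIGSTOP) => ptrace::cont(pid, None)?,
            WaitStatus::Stopped(pid, sig) => ptrace::cont(pid, Some(sig))?,
            WaitStatus::Exited(pid, _) if pid == main_process => break,
            _ => {}
        }
    }
    Ok(())
}
```

With PTRACE_O_TRACEFORK/VFORK/CLONE the kernel attaches the tracer to the new child automatically, so the hypervisor learns the new PID before the child runs any code.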

@cvengler (Member)

The more I think about the possibility of using ptrace, the uglier it gets. Although I enjoyed ptrace at first, it seems like a hack to me now, especially after I dug a bit deeper into the material.

Why ptrace is terrible

  • ptrace is slow! Not just a little bit slower, but significantly. I have written a very small C program that just prints all numbers from 0 to $10^7 - 1$. Running this natively (while redirecting all I/O to /dev/null) takes $5.473s$, while running the same program under strace(1), with the same I/O redirection, takes $135.371s$. That is more than 24x slower, or in other words: the runtime grows by roughly 2400%! Totally unacceptable, especially in a context in which time frames matter. (A minimal reproduction sketch follows this list.)
  • ptrace is unportable! The entire ptrace API is centered around architecture-specific behavior. For example, we have to read from and poke into the native CPU registers in order to intercept system calls. While the new PTRACE_GET_SYSCALL_INFO solves the portability issue for fetching the syscalls, it does not solve the issue of poking into the registers. You may argue that this is not really important, because x86_64 is the prevalent architecture on modern-day desktop hardware, but I still consider architecture-specific development to be very ugly. IMO, the only valid reasons for this kind of architecture-specific code are working at an extremely low level or squeezing out the last bit of performance, neither of which applies in our case.
  • ptrace is centered around single processes, not partitions! How should we continue if a process inside a partition decides to fork(2)? Does the PID namespace cause any potential problems here?
  • ptrace probably adds much more code complexity, especially since we could re-use other parts of the code (more on this later).
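
For reproduction, a sketch of such a benchmark program in Rust (not the original C program; absolute numbers will differ with the I/O buffering and libc used):

```rust
// Print all numbers from 0 to 10^7 - 1, once natively and once under strace, e.g.:
//
//   time ./bench > /dev/null
//   time strace -o /dev/null ./bench > /dev/null
use std::io::Write;

fn main() {
    // Rust's stdout is line buffered, so every number ends up as its own write(2),
    // which is exactly the per-syscall overhead the ptrace measurement is about.
    let stdout = std::io::stdout();
    let mut out = stdout.lock();
    for i in 0..10_000_000u32 {
        writeln!(out, "{i}").unwrap();
    }
}
```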

What are the alternatives

My idea would be that the parent passes a socketpair (or pipe) down to the child, through which the child sends syscalls as a fixed data structure. The requests would be executed in sequential order, with the parent sending back a fixed response data structure. If possible, we might use stdin and stdout for this, as they have fixed file descriptors. Alternatively, we could inherit a socketpair with a welcome message in its buffer. The child process would then (at its startup) try to read that welcome message from all file descriptors found inside /proc/$$/fd, in order to determine the appropriate fd early on. Maybe we could re-use some things from @dadada's PR.
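
A small sketch of that request/response round trip over a socketpair. Here both ends live in one process and a thread plays the partition; in the real design the partition end would be inherited across fork/exec, and the message layout and call id are made up for illustration:

```rust
use std::os::unix::net::UnixDatagram;
use std::thread;

const APEX_GET_PARTITION_STATUS: u32 = 1;

// Fixed 8-byte request: [call id | argument]; fixed 8-byte response: [status | value].
fn encode(a: u32, b: u32) -> [u8; 8] {
    let mut buf = [0u8; 8];
    buf[..4].copy_from_slice(&a.to_le_bytes());
    buf[4..].copy_from_slice(&b.to_le_bytes());
    buf
}
fn decode(buf: &[u8; 8]) -> (u32, u32) {
    (
        u32::from_le_bytes(buf[..4].try_into().unwrap()),
        u32::from_le_bytes(buf[4..].try_into().unwrap()),
    )
}

fn main() -> std::io::Result<()> {
    let (hypervisor_end, partition_end) = UnixDatagram::pair()?;

    // "Partition": send one APEX request and block on the answer.
    let partition = thread::spawn(move || {
        partition_end.send(&encode(APEX_GET_PARTITION_STATUS, 0)).unwrap();
        let mut buf = [0u8; 8];
        partition_end.recv(&mut buf).unwrap();
        println!("partition: response = {:?}", decode(&buf));
    });

    // "Hypervisor": serve requests strictly in order.
    let mut buf = [0u8; 8];
    hypervisor_end.recv(&mut buf)?;
    let (call, _arg) = decode(&buf);
    let response = match call {
        APEX_GET_PARTITION_STATUS => encode(0, 42), // status OK, made-up value
        _ => encode(1, 0),                          // unknown call
    };
    hypervisor_end.send(&response)?;

    partition.join().unwrap();
    Ok(())
}
```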

Some questions:

  • Are datagrams (SOCK_DGRAM) guaranteed to be delivered, and in order, when operating in the AF_UNIX domain?

Some resources:

@sevenautumns (Collaborator, Author)

@emilengler could you do a simple performance analysis of your pipe idea, as well?

@cvengler (Member)

> @emilengler could you do a simple performance analysis of your pipe idea, as well?

Sure, but I will probably only be able to do so the week after next, if that's okay. I want to use up my overtime hours in order to study for my exams.

@cvengler (Member)

Okay, my benchmarks are effectively done. I'll do some adjustments tomorrow and publish the code afterwards.

Emitting $10^7$ syscalls takes 2:55 minutes through ptrace and 1:07 minutes with my approach. In both cases, the syscall result is printed to stdout. Removing the final stdout write reduces the ptrace approach to 1:30 minutes, which is only slightly slower than my approach. However, ptrace scales linearly with every invoked syscall, whereas my approach only scales linearly with custom syscalls, not with regular Linux syscalls. Because of that, I opt for my approach. However, I will try to do some adjustments tomorrow and give you the code for reproduction.

@cvengler (Member)

Done. I have published the code in this semi-public repository.

The benchmark results are as follows:

Name      Time
ptrace    4m47.608s
sockets   2m7.185s

The sockets approach truly wins.

@cvengler (Member)

Update:

The current solution will probably be centered around ptrace(2) with the hypervisor being the monitoring process. A combination of signalfd(2) and waitpid(2) will be used to get event notifications whenever a child process changes state. Here is a small example.
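
The linked example is not reproduced here. As an illustration of the described signalfd(2)/waitpid(2) combination, assuming the nix crate, a minimal blocking variant could look like the following (the `sleep` child stands in for a partition process; in the hypervisor, the signalfd would be one entry in the poll-based main loop rather than being read in a blocking fashion):

```rust
// Cargo.toml (assumed): nix = "0.26" with the "signal", "signalfd", "process" features.
use nix::sys::signal::{SigSet, Signal};
use nix::sys::signalfd::{SfdFlags, SignalFd};
use nix::sys::wait::{waitpid, WaitPidFlag, WaitStatus};
use nix::unistd::Pid;
use std::process::Command;

fn main() -> nix::Result<()> {
    // Block SIGCHLD so it is only ever delivered through the signalfd.
    let mut mask = SigSet::empty();
    mask.add(Signal::SIGCHLD);
    mask.thread_block()?;
    let mut sfd = SignalFd::with_flags(&mask, SfdFlags::SFD_CLOEXEC)?;

    // Stand-in for a partition process.
    let _child = Command::new("sleep").arg("1").spawn().expect("spawn sleep");

    // Blocks until a SIGCHLD arrives.
    while sfd.read_signal()?.is_some() {
        // Reap/inspect every child that changed state, without blocking.
        loop {
            match waitpid(None::<Pid>, Some(WaitPidFlag::WNOHANG)) {
                Ok(WaitStatus::StillAlive) => break,
                Ok(status) => println!("child changed state: {status:?}"),
                Err(_) => return Ok(()), // no children left
            }
        }
    }
    Ok(())
}
```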
