NetBSD's Linux system call (syscall) emulation provides a nearly seamless way to run Linux binaries, but traditionally it has been hard to answer the question "will it work with program X?" This Google Summer of Code project aims to put a dent in that issue by taking a more systematic approach to syscall implementation: real-world programs are used to gauge which syscalls are worth implementing, but not to decide when a syscall is done. Additionally, a comprehensive test suite (the Linux Test Project) was ported, and support was added for testing emulation using NetBSD's test suite (ATF(7)).
A full diff of the changes to the main source tree can be found here, and a full diff of the main pkgsrc tree can be found here.
The following table summarizes the status of the deliverables from the original proposal as of 25 August 2023.
Deliverable | Status |
---|---|
Implement getrandom(2) | Merged |
Implement waitid(2) | Merged |
Implement epoll_create(2), epoll_create1(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll_pwait2(2) | Merged (also) and then partially removed |
Implement memfd_create(2) | Merged (also, also) |
Implement inotify_init(2), inotify_init1(2), inotify_add_watch(2), inotify_rm_watch(2) | Merged (also, also, also) |
Implement readahead(2) | Merged |
Implement newfstatat(2) | Merged |
Implement statx(2) | Merged |
Implement close_range(2) | Merged |
Implement ioprio_set(2) | Not feasible |
Package the Linux Test Project | Done, not yet merged |
Document system call versioning (extra) | Merged (rendered) |
Add support for Linux emulation testing in ATF(7) (extra) | Merged |
As expected, many of the implementation plans from the original proposal turned out to be flawed. This section outlines how the syscalls were actually implemented, and some of the limitations of the implementations.
memfd_create(2) was implemented directly in terms of uvm(9) operations; in particular, the backing is provided by a uvm_object created by uao_create(9). Since it was convenient, we also decided to make it a native NetBSD syscall. As was pointed out, memfd_create(2) does not currently have any limits that can be imposed from the outside.
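To make the interface concrete, here is a minimal userland sketch of how a program might use memfd_create(2). It is not code from the NetBSD tree: the helper name `memfd_roundtrip` is made up for this example, and the `_GNU_SOURCE` define is an assumption about a glibc-based Linux userland.

```c
#define _GNU_SOURCE	/* assumption: glibc needs this to declare memfd_create() */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write msg into a fresh anonymous memory file,
 * read it back into out (capacity outsz), and return the number of
 * bytes read, or -1 on failure.
 */
static ssize_t
memfd_roundtrip(const char *msg, char *out, size_t outsz)
{
	size_t len;
	ssize_t n = -1;
	int fd;

	assert(msg != NULL && out != NULL);

	len = strlen(msg);
	if (len > outsz)
		return -1;

	/* The name is only a debugging label; it need not be unique. */
	fd = memfd_create("example", MFD_CLOEXEC);
	if (fd == -1)
		return -1;

	/* The descriptor behaves like a regular file backed by memory. */
	if (write(fd, msg, len) == (ssize_t)len)
		n = pread(fd, out, outsz, 0);

	close(fd);
	return n;
}
```

Nothing is ever written to disk here, and the memory is released once the last descriptor and mapping go away.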
The epoll_*(2) syscalls were implemented by directly porting FreeBSD's Linux compatibility version. It is implemented as argument translation over kqueue(2)'s EVFILT_READ and EVFILT_WRITE, and so it necessitated versioning kqueue(2) to more closely match FreeBSD (which is why I also wrote a man page for syscall versioning). Unfortunately, this design suffers from the limitation that an epoll file descriptor under Linux emulation will not survive a fork(2). After some initial discussion I decided to also add native NetBSD stubs to allow for better testing, but this proved to be controversial. Despite the fork(2) limitation, the epoll implementation is sufficient to allow a large swath of programs (e.g. Go programs) to run.
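For readers unfamiliar with the interface being emulated, the following is a small sketch of the Linux-side calls a program typically makes. It is an illustration only, not part of the port; the helper name `epoll_pipe_events` is hypothetical.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: register the read end of a pipe for EPOLLIN,
 * make it readable, and return the event mask that epoll_wait(2)
 * reports (0 if nothing became ready in time, -1 on error).
 */
static int
epoll_pipe_events(void)
{
	struct epoll_event ev = { 0 }, got = { 0 };
	int pfd[2], epfd, n, result = -1;

	if (pipe(pfd) == -1)
		return -1;
	epfd = epoll_create1(EPOLL_CLOEXEC);
	if (epfd == -1)
		goto out_pipe;

	/* Ask to be told when the read end has data available. */
	ev.events = EPOLLIN;
	ev.data.fd = pfd[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
		goto out_epoll;
	if (write(pfd[1], "x", 1) != 1)
		goto out_epoll;

	/* Wait up to one second for a single event. */
	n = epoll_wait(epfd, &got, 1, 1000);
	assert(n <= 1);	/* never more than maxevents */
	if (n == 1 && got.data.fd == pfd[0])
		result = (int)got.events;
	else if (n == 0)
		result = 0;

out_epoll:
	close(epfd);
out_pipe:
	close(pfd[0]);
	close(pfd[1]);
	return result;
}
```

The EPOLLIN registration in this sketch is the kind of request that the text above describes as being translated onto kqueue(2)'s EVFILT_READ.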
The inotify_*(2) syscalls were also implemented in terms of kqueue(2). The main challenge with inotify is that it preserves the exact ordering of events, which kqueue(2) does not. To accomplish this, the implementation hooks into the event callbacks of kqueue(2), but uses its own queue. Since kqueue(2) attaches to file descriptors, which are a scarce resource, there are some events which this implementation will not generate (reading from files inside a watched directory). Additionally, moves cannot always be correlated, so in some cases a rename may be reported as a delete and a create, which is fine for its purpose as a compatibility shim. Finally, as a bit of a hack, some operations that could have gone through kevent1() had to be done by hand, because filterops::f_touch could not be used due to the locking situation in the kqueue(2) subsystem (see kqueue_register).
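The ordering guarantee is easiest to see from the caller's side. The sketch below is an illustration written for this post, not code from the port: it watches a temporary directory, creates and then deletes a file, and records the order in which the events come back. The helper name `inotify_create_delete` and the `/tmp` location are assumptions, and the alignment attribute is a GCC/Clang extension.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create and then delete a file inside a fresh
 * temporary directory while watching it, storing the masks of the
 * first two events in the order they were queued. Returns the number
 * of events stored, or -1 on error.
 */
static int
inotify_create_delete(uint32_t masks[2])
{
	char dir[] = "/tmp/inotify-demo-XXXXXX";
	char path[sizeof(dir) + 8];
	/* Events are read as raw bytes, so align the buffer for the struct. */
	char buf[4096]
	    __attribute__((aligned(__alignof__(struct inotify_event))));
	const struct inotify_event *ev;
	ssize_t len;
	char *p;
	int ifd, fd, count = -1;

	assert(masks != NULL);

	if (mkdtemp(dir) == NULL)
		return -1;
	snprintf(path, sizeof(path), "%s/file", dir);

	ifd = inotify_init1(IN_CLOEXEC);
	if (ifd == -1)
		goto out_dir;
	if (inotify_add_watch(ifd, dir, IN_CREATE | IN_DELETE) == -1)
		goto out_fd;

	fd = open(path, O_CREAT | O_WRONLY, 0600);
	if (fd == -1)
		goto out_fd;
	close(fd);
	if (unlink(path) == -1)
		goto out_fd;

	/* Both events are queued by now, so one read returns them. */
	len = read(ifd, buf, sizeof(buf));
	if (len <= 0)
		goto out_fd;

	/* Each record is a fixed header followed by ev->len name bytes. */
	count = 0;
	for (p = buf; p < buf + len && count < 2; ) {
		ev = (const struct inotify_event *)p;
		masks[count++] = ev->mask;
		p += sizeof(*ev) + ev->len;
	}

out_fd:
	close(ifd);
out_dir:
	rmdir(dir);
	return count;
}
```

A program relying on inotify expects the create event to arrive before the delete event, which is the behavior the separate queue described above is meant to preserve.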
getrandom(2), waitid(2), readahead(2), and close_range(2) have direct analogues in NetBSD, and so the implementation consists of translating arguments and calling the respective NetBSD functions.
statx(2) and newfstatat(2) already existed: statx(2) had a bug, and newfstatat(2) was present under the name fstatat64(2) (the name changes based on the Linux architecture, but the functionality is otherwise the same). Besides fixing the bug, all that was necessary was to add the correct stubs to the relevant syscalls.master file.
NetBSD does not currently have an I/O scheduler, so ioprio_set(2) could not feasibly be implemented in the time available (adding an I/O scheduler is an entire project in itself).
Nebula version 1.6.1 generally works; however, the fact that TUN devices function differently on Linux limits its usefulness to just acting as a lighthouse and/or a relay.
Syncthing version 1.23.7 works and can reliably sync files. It does, however, emit a single warning on startup because of the non-existence of ioprio_set(2).