Skip to content

Commit

Permalink
Automatic merge of 'master' into merge-test (2024-06-19 21:38)
Browse files Browse the repository at this point in the history
  • Loading branch information
mpe committed Jun 19, 2024
2 parents e2b06d7 + 92e5605 commit 03cb18e
Show file tree
Hide file tree
Showing 61 changed files with 675 additions and 499 deletions.
6 changes: 0 additions & 6 deletions Documentation/admin-guide/kernel-parameters.txt
Original file line number Diff line number Diff line change
Expand Up @@ -2192,12 +2192,6 @@
Format: 0 | 1
Default set by CONFIG_INIT_ON_FREE_DEFAULT_ON.

init_mlocked_on_free= [MM] Fill freed userspace memory with zeroes if
it was mlock'ed and not explicitly munlock'ed
afterwards.
Format: 0 | 1
Default set by CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON

init_pkru= [X86] Specify the default memory protection keys rights
register contents for all processes. 0x55555554 by
default (disallow access to all but pkey 0). Can
Expand Down
1 change: 1 addition & 0 deletions Documentation/userspace-api/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ Security-related interfaces
seccomp_filter
landlock
lsm
mfd_noexec
spec_ctrl
tee

Expand Down
86 changes: 86 additions & 0 deletions Documentation/userspace-api/mfd_noexec.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
.. SPDX-License-Identifier: GPL-2.0
==================================
Introduction of non-executable mfd
==================================
:Author:
Daniel Verkamp <[email protected]>
Jeff Xu <[email protected]>

:Contributor:
Aleksa Sarai <[email protected]>

Since Linux introduced the memfd feature, memfds have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting
it differently.

However, in a secure-by-default system, such as ChromeOS, (where all
executables should come from the rootfs, which is protected by verified
boot), this executable nature of memfd opens a door for NoExec bypass
and enables “confused deputy attack”. E.g, in VRP bug [1]: cros_vm
process created a memfd to share the content with an external process,
however the memfd is overwritten and used for executing arbitrary code
and root escalation. [2] lists more VRP of this kind.

On the other hand, executable memfd has its legit use: runc uses memfd’s
seal and executable feature to copy the contents of the binary then
execute them. For such a system, we need a solution to differentiate runc's
use of executable memfds and an attacker's [3].

To address those above:
- Let memfd_create() set X bit at creation time.
- Let memfd be sealed for modifying X bit when NX is set.
- Add a new pid namespace sysctl: vm.memfd_noexec to help applications in
migrating and enforcing non-executable MFD.

User API
========
``int memfd_create(const char *name, unsigned int flags)``

``MFD_NOEXEC_SEAL``
When MFD_NOEXEC_SEAL bit is set in the ``flags``, memfd is created
with NX. F_SEAL_EXEC is set and the memfd can't be modified to
add X later. MFD_ALLOW_SEALING is also implied.
This is the most common case for the application to use memfd.

``MFD_EXEC``
When MFD_EXEC bit is set in the ``flags``, memfd is created with X.

Note:
``MFD_NOEXEC_SEAL`` implies ``MFD_ALLOW_SEALING``. In case that
an app doesn't want sealing, it can add F_SEAL_SEAL after creation.


Sysctl:
========
``pid namespaced sysctl vm.memfd_noexec``

The new pid namespaced sysctl vm.memfd_noexec has 3 values:

- 0: MEMFD_NOEXEC_SCOPE_EXEC
memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
MFD_EXEC was set.

- 1: MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL
memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
MFD_NOEXEC_SEAL was set.

- 2: MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED
memfd_create() without MFD_NOEXEC_SEAL will be rejected.

The sysctl allows finer control of memfd_create for old software that
doesn't set the executable bit; for example, a container with
vm.memfd_noexec=1 means the old software will create non-executable memfd
by default while new software can create executable memfd by setting
MFD_EXEC.

The value of vm.memfd_noexec is passed to child namespace at creation
time. In addition, the setting is hierarchical, i.e. during memfd_create,
we will search from current ns to root ns and use the most restrictive
setting.

[1] https://crbug.com/1305267

[2] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20memfd%20escalation&can=1

[3] https://lwn.net/Articles/781013/
21 changes: 15 additions & 6 deletions Documentation/virt/hyperv/clocks.rst
Original file line number Diff line number Diff line change
Expand Up @@ -62,12 +62,21 @@ shared page with scale and offset values into user space. User
space code performs the same algorithm of reading the TSC and
applying the scale and offset to get the constant 10 MHz clock.

Linux clockevents are based on Hyper-V synthetic timer 0. While
Hyper-V offers 4 synthetic timers for each CPU, Linux only uses
timer 0. Interrupts from stimer0 are recorded on the "HVS" line in
/proc/interrupts. Clockevents based on the virtualized PIT and
local APIC timer also work, but the Hyper-V synthetic timer is
preferred.
Linux clockevents are based on Hyper-V synthetic timer 0 (stimer0).
While Hyper-V offers 4 synthetic timers for each CPU, Linux only uses
timer 0. In older versions of Hyper-V, an interrupt from stimer0
results in a VMBus control message that is demultiplexed by
vmbus_isr() as described in the Documentation/virt/hyperv/vmbus.rst
documentation. In newer versions of Hyper-V, stimer0 interrupts can
be mapped to an architectural interrupt, which is referred to as
"Direct Mode". Linux prefers to use Direct Mode when available. Since
x86/x64 doesn't support per-CPU interrupts, Direct Mode statically
allocates an x86 interrupt vector (HYPERV_STIMER0_VECTOR) across all CPUs
and explicitly codes it to call the stimer0 interrupt handler. Hence
interrupts from stimer0 are recorded on the "HVS" line in /proc/interrupts
rather than being associated with a Linux IRQ. Clockevents based on the
virtualized PIT and local APIC timer also work, but Hyper-V stimer0
is preferred.

The driver for the Hyper-V synthetic system clock and timers is
drivers/clocksource/hyperv_timer.c.
22 changes: 11 additions & 11 deletions Documentation/virt/hyperv/overview.rst
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@ Linux guests communicate with Hyper-V in four different ways:
arm64, these synthetic registers must be accessed using explicit
hypercalls.

* VMbus: VMbus is a higher-level software construct that is built on
* VMBus: VMBus is a higher-level software construct that is built on
the other 3 mechanisms. It is a message passing interface between
the Hyper-V host and the Linux guest. It uses memory that is shared
between Hyper-V and the guest, along with various signaling
Expand All @@ -54,8 +54,8 @@ x86/x64 architecture only.

.. _Hyper-V Top Level Functional Spec (TLFS): https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/tlfs

VMbus is not documented. This documentation provides a high-level
overview of VMbus and how it works, but the details can be discerned
VMBus is not documented. This documentation provides a high-level
overview of VMBus and how it works, but the details can be discerned
only from the code.

Sharing Memory
Expand All @@ -74,7 +74,7 @@ follows:
physical address space. How Hyper-V is told about the GPA or list
of GPAs varies. In some cases, a single GPA is written to a
synthetic register. In other cases, a GPA or list of GPAs is sent
in a VMbus message.
in a VMBus message.

* Hyper-V translates the GPAs into "real" physical memory addresses,
and creates a virtual mapping that it can use to access the memory.
Expand Down Expand Up @@ -133,9 +133,9 @@ only the CPUs actually present in the VM, so Linux does not report
any hot-add CPUs.

A Linux guest CPU may be taken offline using the normal Linux
mechanisms, provided no VMbus channel interrupts are assigned to
the CPU. See the section on VMbus Interrupts for more details
on how VMbus channel interrupts can be re-assigned to permit
mechanisms, provided no VMBus channel interrupts are assigned to
the CPU. See the section on VMBus Interrupts for more details
on how VMBus channel interrupts can be re-assigned to permit
taking a CPU offline.

32-bit and 64-bit
Expand Down Expand Up @@ -169,14 +169,14 @@ and functionality. Hyper-V indicates feature/function availability
via flags in synthetic MSRs that Hyper-V provides to the guest,
and the guest code tests these flags.

VMbus has its own protocol version that is negotiated during the
initial VMbus connection from the guest to Hyper-V. This version
VMBus has its own protocol version that is negotiated during the
initial VMBus connection from the guest to Hyper-V. This version
number is also output to dmesg during boot. This version number
is checked in a few places in the code to determine if specific
functionality is present.

Furthermore, each synthetic device on VMbus also has a protocol
version that is separate from the VMbus protocol version. Device
Furthermore, each synthetic device on VMBus also has a protocol
version that is separate from the VMBus protocol version. Device
drivers for these synthetic devices typically negotiate the device
protocol version, and may test that protocol version to determine
if specific device functionality is present.
Expand Down
Loading

0 comments on commit 03cb18e

Please sign in to comment.