test with 1GiB huge pages

evanj committed Jan 13, 2023
1 parent 496bec9 commit e0cf875

Showing 5 changed files with 257 additions and 28 deletions.
29 changes: 29 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

5 changes: 3 additions & 2 deletions Cargo.toml
@@ -12,11 +12,12 @@ debug = true
#rustflags = ["-C", "target-cpu=native"]

[dependencies]
argh = "0.1.9"
humanunits = {git="https://github.com/evanj/humanunits"}
lazy_static = "1"
memory-stats = "1"
nix = {version="0", features=["mman"]}
rand = "0"
rand_xoshiro = "0"
lazy_static = "1"
regex = "1"
argh = "0.1.9"
strum = { version = "0", features = ["derive"] }
3 changes: 2 additions & 1 deletion Makefile
@@ -15,7 +15,8 @@ all: aligned_alloc_demo
-D clippy::pedantic \
-A clippy::cast_precision_loss \
-A clippy::cast-sign-loss \
-    -A clippy::cast-possible-truncation
+    -A clippy::cast-possible-truncation \
+    -A clippy::too-many-lines

clang-format -i '-style={BasedOnStyle: Google, ColumnLimit: 100}' *.c

53 changes: 40 additions & 13 deletions README.md
@@ -1,40 +1,65 @@
# Huge Page Demo

-This is a demonstration of using huge pages on Linux to get better performance. It allocates a 4 GiB chunk using a Vec (which calls libc's malloc), then uses mmap to get a 2 MiB-aligned region. It then uses `madvise(..., MADV_HUGEPAGE)` to mark the region for huge pages, then touches the entire region to fault it in to memory. Finally, it runs a random-access benchmark. This is probably the "best case" scenario for huge pages.
+This is a demonstration of using huge pages on Linux to get better performance. It allocates a 4 GiB chunk using a Vec (which calls libc's malloc), then uses mmap to get a 2 MiB-aligned region. It then uses `madvise(..., MADV_HUGEPAGE)` to mark the region for huge pages, then touches the entire region to fault it in to memory. Finally, it runs a random-access benchmark. This is probably the "best case" scenario for huge pages. I also test allocating an explicit huge page region with `mmap(..., MAP_HUGETLB | MAP_HUGE_1GB)`.
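
For reference, the core of the transparent huge page path looks roughly like this. This is a minimal sketch using the raw `libc` crate directly (the demo itself goes through the `nix` wrappers, and also over-allocates and trims to guarantee 2 MiB alignment, which is skipped here):

```rust
use std::ptr;

const SIZE: usize = 4 << 30; // 4 GiB

fn main() {
    unsafe {
        // Reserve an anonymous, read/write region. This is only guaranteed to
        // be page-aligned; the real demo aligns it to 2 MiB.
        let addr = libc::mmap(
            ptr::null_mut(),
            SIZE,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        );
        assert_ne!(addr, libc::MAP_FAILED);

        // Mark the region as a candidate for transparent huge pages.
        assert_eq!(libc::madvise(addr, SIZE, libc::MADV_HUGEPAGE), 0);

        // Touch every 4 kiB page to fault the region in before benchmarking.
        let bytes = addr.cast::<u8>();
        for offset in (0..SIZE).step_by(4096) {
            bytes.add(offset).write(0);
        }

        libc::munmap(addr, SIZE);
    }
}
```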

On a "11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz", the huge page version is about 2.9X faster. On an older "Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz" (AWS m5d.4xlarge), the huge page version is about 2X faster. This seems to suggest that programs that make random accesses to large amounts of memory will benefit from huge pages.
On a "11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz", the transparent 2MiB huge page version is about 2.9X faster, and the 1GiB huge page version is 3.1X faster (8% more than 2MiB pages). On an older "Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz" (AWS m5d.4xlarge), the transparent 2MiB huge page version is about 2X faster, and I did not test the GiB huge pages. This seems to suggest that programs that make random accesses to large amounts of memory will benefit from huge pages. The benefit from the gigabyte huge pages is minimal, so probably not worth the pain of having to manually configure them.

As of 2022-01-10, the Linux kernel only supports a single size of transparent huge pages. The size is reported as `Hugepagesize` in `/proc/meminfo`. On x86_64, this will be 2 MiB. For Arm (aarch64), most recent Linux distributions also default to 4 kiB base pages with 2 MiB huge pages. Red Hat used to use 64 kiB pages, but [RHEL 9 changed it to 4 kiB around 2021-07](https://bugzilla.redhat.com/show_bug.cgi?id=1978730).

-When running as root, it is possible to check if a specific address is a huge page. It is also possible to get the amount of memory allocated for a specific range as huge pages by examining the `AnonHugePages` line in `/proc/self/smaps`. The `thp_` statistics in `/proc/vmstat` can also tell you if this worked: check `thp_fault_alloc` and `thp_fault_fallback` before and after the allocation. See [the Monitoring usage section in the kernel's transhuge.txt for details](https://www.kernel.org/doc/Documentation/vm/transhuge.txt).
+When running as root, it is possible to check if a specific address is a huge page. It is also possible to get the amount of memory allocated for a specific range as huge pages by examining the `AnonHugePages` line in `/proc/self/smaps`. The `thp_` statistics in `/proc/vmstat` can also tell you if this worked: check `thp_fault_alloc` and `thp_fault_fallback` before and after the allocation. Sometimes the kernel will not be able to find huge pages. This program only checks the first page, so it cannot tell if the huge page allocation failed for the rest of the region. See [the Monitoring usage section in the kernel's transhuge.txt for details](https://www.kernel.org/doc/Documentation/vm/transhuge.txt).
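
For example, one way to total the `AnonHugePages` lines across all mappings is a sketch like the following (the helper function is mine, not part of the demo, and it only works on Linux):

```rust
use std::fs;

// Sum the AnonHugePages fields in /proc/self/smaps. The values are in kiB.
fn anon_huge_pages_kib() -> u64 {
    let smaps = fs::read_to_string("/proc/self/smaps").expect("/proc is Linux-only");
    smaps
        .lines()
        .filter(|line| line.starts_with("AnonHugePages:"))
        .filter_map(|line| line.split_whitespace().nth(1))
        .filter_map(|kib| kib.parse::<u64>().ok())
        .sum()
}

fn main() {
    println!("AnonHugePages total: {} kiB", anon_huge_pages_kib());
}
```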

-This demo compiles and runs on Mac OS X, but won't use huge pages.
-
-For more details, see [Reliably allocating huge pages in Linux](https://mazzo.li/posts/check-huge-page.html), which I more or less copied.
+### Testing GiB huge pages
+
+To allocate 1 GiB pages, you must first reserve them by running:
+
+```
+echo 4 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
+```
+
+On a machine that has been up for a while, this command will "succeed", but reading the file back with cat shows the value does not change, and calling mmap will fail with `ENOMEM`. I needed to test this shortly after boot (presumably before physical memory was fragmented) to get it to work.
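
A minimal sketch of the explicit 1 GiB allocation, again using the raw `libc` crate rather than the `nix` wrappers the demo uses (`MAP_HUGE_1GB` is defined inline in case the libc crate version in use does not export it):

```rust
use std::ptr;

// log2(1 GiB) = 30, shifted by MAP_HUGE_SHIFT (26). Recent libc versions also
// export this constant as libc::MAP_HUGE_1GB.
const MAP_HUGE_1GB: i32 = 30 << 26;
const SIZE: usize = 4 << 30; // 4 GiB = four 1 GiB pages

fn main() {
    unsafe {
        let addr = libc::mmap(
            ptr::null_mut(),
            SIZE,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_HUGETLB | MAP_HUGE_1GB,
            -1,
            0,
        );
        if addr == libc::MAP_FAILED {
            // Expect ENOMEM if no free 1 GiB pages are available.
            panic!("mmap failed: {}", std::io::Error::last_os_error());
        }
        libc::munmap(addr, SIZE);
    }
}
```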

+This demo compiles and runs on Mac OS X, but won't use huge pages.


## Results

-From a system where `/proc/cpuinfo` reports "11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz", using `perf stat -e dTLB-load-misses,iTLB-load-misses,page-faults`:
+From a system where `/proc/cpuinfo` reports "11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz", using `perf stat -e dTLB-load-misses,iTLB-load-misses,page-faults,dtlb_load_misses.walk_completed,dtlb_load_misses.stlb_hit`:

### Vec

```
200000000 accesses in 6.421793881s; 31143945.7 accesses/sec
-199,681,103 dTLB-load-misses
-4,316 iTLB-load-misses
-1,048,700 page-faults
+199,687,753 dTLB-load-misses
+4,432 iTLB-load-misses
+1,048,699 page-faults
+199,687,753 dtlb_load_misses.walk_completed
+5,801,701 dtlb_load_misses.stlb_hit
```

-### Huge Page mmap
+### Transparent 2MiB Huge Page mmap

```
200000000 accesses in 2.193096392s; 91195262.0 accesses/sec
-123,624,814 dTLB-load-misses
-1,854 iTLB-load-misses
-2,196 page-faults
+112,933,198 dTLB-load-misses
+2,431 iTLB-load-misses
+2,197 page-faults
+112,933,198 dtlb_load_misses.walk_completed
+84,037,596 dtlb_load_misses.stlb_hit
```

+### 1GiB Huge Page mmap MAP_HUGETLB
+
+```
+200000000 accesses in 2.01655466s; 99179062.2 accesses/sec
+908 dTLB-load-misses
+647 iTLB-load-misses
+127 page-faults
+908 dtlb_load_misses.walk_completed
+9,781 dtlb_load_misses.stlb_hit
+```


@@ -52,3 +77,5 @@ X86-64 supports 2MiB and 1GiB huge pages.
Newer Arm CPUs support a wide range of huge page sizes: https://github.com/lgeek/arm_tlb_huge_pages

Google's TCMalloc/Temeraire is a huge-page-aware allocator. They found it improved requests-per-second performance of user code by about 7% fleet-wide. https://www.usenix.org/conference/osdi21/presentation/hunter

+For a C version, see [Reliably allocating huge pages in Linux](https://mazzo.li/posts/check-huge-page.html), which I used to develop this version.