Benchmark AmpereOne A192-32X #43
The CPU was self-reporting 297 W of power draw. I don't have my wall power measurement set up yet, but will soon. The power strip showed 8 A at 120 V for the full rack (roughly 960 W total), so probably somewhere between 500-600 W for this system.
2.146 Tflops!
That score would put this system at rank #460 on the June 2006 Top500 list (see https://hpl-calculator.sourceforge.net/hpl-calculations.php and https://www.top500.org/lists/top500/list/2006/06/?page=5). It roughly matches the Saguaro cluster at Arizona State University, which consumed around 80 kilowatts of power in 2006.
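Taking the power numbers above at face value (roughly 550 W at the wall here versus about 80 kW for Saguaro), that works out to something like

$$\frac{80\,000\ \text{W}}{\sim 550\ \text{W}} \approx 145\times$$

better energy efficiency for roughly the same HPL score.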
Hi Jeff, just a note: we are working on tweaking the performance even further.
@rbapat-ampere - Sure! Makes sense if the NUMA layout means memory access is better allocated that way. It seems highly dependent on the core layout: on most single-socket systems, just setting Qs to the core count helps, but when we hit 192 cores, things behave funny :) I'm going to use the following values in my hpl_dat_opts (there's a worked example of the Ns math right after the config below):
# sqrt((Memory in GB * 1024 * 1024 * 1024 * Node count) / 8) * 0.9
#Ns: "{{ (((((ram_in_gb | int) * 1024 * 1024 * 1024 * (nodecount | int)) / 8) | root) * 0.90) | int }}"
Ns: 203788
NBs: 256
# (P * Q) should be roughly equivalent to total core count, with Qs higher.
# If running on a single system, Ps should be 1 and Qs should be core count.
Ps: 12
Qs: 16
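For reference, here's a minimal standalone sketch of that Ns sizing rule (illustrative only, not playbook code; the function name and defaults are just for this example). The Ns of 203788 above corresponds to plugging in a bit less than the full 512 GB.

```python
# HPL factors an N x N matrix of 8-byte doubles, so N is roughly
# sqrt(total_bytes / 8), scaled back to ~90% to leave headroom for the
# OS and MPI buffers.
import math

def suggested_ns(ram_in_gb: int, node_count: int = 1, fraction: float = 0.90) -> int:
    total_bytes = ram_in_gb * 1024**3 * node_count
    return int(math.sqrt(total_bytes / 8) * fraction)

print(suggested_ns(512))   # -> 235929 if all 512 GB were usable
assert 12 * 16 == 192      # Ps * Qs should roughly equal the total core count
```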
2,156 Gflops at 685 W, for 3.14 Gflops/W: average power is closer to 685 W on this run, with the above settings. The HPL.dat I used reflects the hpl_dat_opts posted above.
@rbapat-ampere - Is it possible you're using a more optimized BLIS library, e.g. https://github.com/AmpereComputing/HPL-on-Ampere-Altra?
@geerlingguy
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 239872 256 12 16 3095.07 2.9729e+03
HPL_pdgesv() start time Fri Oct 25 10:23:52 2024
HPL_pdgesv() end time Fri Oct 25 11:15:27 2024
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.50966052e-03 ...... PASSED
================================================================================
Can you share your build and run procedure?
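For context, HPL marks a run PASSED when that scaled residual comes in under the threshold set in HPL.dat (16.0 in the stock input file), so this run clears it by a wide margin:

$$\frac{\lVert Ax-b\rVert_\infty}{\varepsilon\,\left(\lVert A\rVert_\infty \lVert x\rVert_\infty + \lVert b\rVert_\infty\right) N} = 1.51\times10^{-3} \ll 16.0$$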
@rbapat-ampere / @joespeed - I noticed your Ns of 239872; I will do another run with that updated value.

The entire build procedure is done automatically in an isolated build directory (all assets stored under hpl_root, /opt/top500 here), so the build boils down to running the Ansible playbook. The playbook makes it easy for me to swap out libraries, re-run the tests dozens of times without having to touch anything, and verify that my test runs are exactly the same across systems and over time (as I also control the versions of the libraries I install). In my previous runs I've used flame/blis, but I will switch to OpenBLAS for this one. My config:
---
hpl_root: /opt/top500
mpich_version: "4.2.3"
linear_algebra_library: openblas # 'atlas', 'openblas', or 'blis'
linear_algebra_blis_version: master # only used for blis
linear_algebra_openblas_version: develop # only used for openblas
ssh_user: ubuntu
ssh_user_home: /home/ubuntu
hpl_dat_opts:
Ns: 239872
NBs: 256
Ps: 12
Qs: 16
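To make the mapping concrete, here's a rough sketch of how those hpl_dat_opts land in a stock-format HPL.dat (this is not the playbook's actual template, just an illustration; everything not listed keeps HPL's defaults):

```python
# Emit the handful of HPL.dat lines that hpl_dat_opts controls, in the stock
# HPL.dat layout; the remaining algorithm-tuning lines keep HPL's defaults.
opts = {"Ns": 239872, "NBs": 256, "Ps": 12, "Qs": 16}

hpl_dat = "\n".join([
    "HPLinpack benchmark input file",
    "Generated from hpl_dat_opts",
    "HPL.out      output file name (if any)",
    "6            device out (6=stdout,7=stderr,file)",
    "1            # of problems sizes (N)",
    f"{opts['Ns']}       Ns",
    "1            # of NBs",
    f"{opts['NBs']}          NBs",
    "0            PMAP process mapping (0=Row-,1=Column-major)",
    "1            # of process grids (P x Q)",
    f"{opts['Ps']}           Ps",
    f"{opts['Qs']}           Qs",
    "16.0         threshold",
    # ...the rest of the stock HPL.dat (PFACTs, NBMINs, BCASTs, etc.) unchanged...
])
print(hpl_dat)
```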
@geerlingguy
During my runs, it's at 100% from the get-go all the way to the end.
@rbapat-ampere - To match your setup, you would need to change the library configuration above.

I just re-ran the benchmark and it errored out. Checking dmesg, I found a bunch of errors, and at the end, when I got the error from HPL, I saw the same sort of thing in the log. Indeed, as I'm running the test, I'm seeing thousands of these messages every second... It seems like one of the RAM modules may be going bad.
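Not something from this thread, but this is the kind of quick watcher I'd use to catch those kernel messages while HPL runs (assumes util-linux dmesg is available and readable on the system):

```python
# Poll the kernel log once a minute and flag anything that looks like a
# memory/CPU hardware error (EDAC, MCE) while the benchmark is running.
import subprocess
import time

PATTERNS = ("EDAC", "Hardware Error", "mce:")

def scan_dmesg() -> list[str]:
    out = subprocess.run(["dmesg", "--level=err,warn"],
                         capture_output=True, text=True).stdout
    return [line for line in out.splitlines()
            if any(p in line for p in PATTERNS)]

while True:
    hits = scan_dmesg()
    if hits:
        print(f"{len(hits)} suspicious kernel log lines; latest: {hits[-1]}")
    time.sleep(60)
```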
After re-seating, the RAM module is not throwing any errors visible in dmesg. However, I'm now seeing the OOM killer terminating the benchmark process.
Aww, this time it ran for 30 minutes before bailing with OOM...
Going to give it one more go, and then maybe back off the problem size a bit.

...maybe OpenBLAS handles memory allocation differently? We'll see if a second run works or if it errors out at the same point. Note that power usage is dramatically reduced, more like 430 W. I can't get a result yet, though, so we'll see what that means for overall efficiency!
Even at a reduced Ns, it's still running out of memory.

@rbapat-ampere - When you're doing your runs, how long do they take? And have you monitored RAM consumption over time to see if/when it fills up the 512 GB of system memory? I'm re-running with an adjusted value now.

Second question: do you know what version of OpenBLAS you're running, @rbapat-ampere? I'm on develop from earlier this morning, so maybe there are some arm64 changes that broke it...
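On the RAM-monitoring question, this is the sort of thing I have in mind (an illustrative sketch, not part of the playbook): sample MemAvailable from /proc/meminfo during the run to see when memory actually fills up.

```python
# Log available memory once a minute so you can see whether (and when) the
# 512 GB fills up before the OOM killer fires.
import time

def mem_available_gb() -> float:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024**2  # value is in kB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

while True:
    print(f"{time.strftime('%H:%M:%S')}  {mem_available_gb():.1f} GB available")
    time.sleep(60)
```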
End result with that run:
That test failed with more of the same errors, so I'm switching gears and will run the scripts sent over from @rbapat-ampere to compile the versions of OpenMPI 5.0.5, BLIS, and HPL he has running. I believe a repo with that configuration will be forthcoming...
If I attempt to modify my config to use the Ampere build directly, it doesn't work out of the box. I'm happy enough accepting the result from the Ampere build, though. To be more complete, I should probably note the runtime options used by each run in these issues; I don't always do that, but if there are platform-specific optimizations (beyond the standard config), they're worth recording.
Going to run once more tonight, to see if I can pass 3 Tflops (got 2.996 lol).
I may do another run later, downclocked to 2.6 GHz or some other lower clock, to see if efficiency can scale up a bit and what the sweet spot is for the AmpereOne.
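One simple way to cap the clock for a run like that (a sketch assuming the cpufreq sysfs interface is exposed and the script runs as root):

```python
# Cap every core's maximum frequency at 2.6 GHz before a lower-clock
# efficiency run; cpufreq expects the value in kHz.
import glob

CAP_KHZ = 2_600_000  # 2.6 GHz

for path in glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq"):
    with open(path, "w") as f:
        f.write(str(CAP_KHZ))
```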
Closing this issue for now as I have the results; I will reopen if we get any other optimizations back from Ampere!
(Just to be complete: after getting the extra stick of RAM, I re-ran the benchmark and got about the same 3 Tflops result.)
I have a system Ampere / Supermicro sent over for testing; it's the 192-core CPU, with 512 GB of DDR5 RAM at 5200 MT/s.
See: geerlingguy/sbc-reviews#52