
Benchmark AmpereOne A192-32X #43
Closed · geerlingguy opened this issue Oct 16, 2024 · 29 comments

geerlingguy (Owner) commented Oct 16, 2024

I have a system Ampere / Supermicro sent over for testing: it's the 192-core CPU, with 512 GB of DDR5 RAM at 5200 MT/s.

See: geerlingguy/sbc-reviews#52

geerlingguy commented Oct 16, 2024

The CPU was self-reporting 297W power draw; I don't have my wall power measurement set up yet, but will soon. The power strip showed 8A at 120V for the full rack (about 960W total), so the full system is probably somewhere between 500-600W.

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  203788
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :     192
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      203788   256     1   192            2628.74             2.1464e+03
HPL_pdgesv() start time Wed Oct 16 17:08:19 2024

HPL_pdgesv() end time   Wed Oct 16 17:52:08 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.21811736e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

2.146 Tflops!

geerlingguy commented:

That score would put this system at rank #460 on the June 2006 Top500 list (see https://hpl-calculator.sourceforge.net/hpl-calculations.php and https://www.top500.org/lists/top500/list/2006/06/?page=5).

It roughly matches the Saguaro cluster at Arizona State University, which consumed around 80 kilowatts of power in 2006, more than 100 times what this single node draws at the wall for similar throughput.

geerlingguy commented Oct 25, 2024

Ran the test again today with my ThirdReality Smart Outlet for power monitoring; it reads within 10W of Supermicro's BMC power reporting. See image and full results below, but here's the summary:

2141 Gflops at 724 W for 2.96 Gflops/W
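
(That efficiency figure is just the HPL result divided by average wall power; a quick sanity check in Python, nothing from the playbook, function name just illustrative:)

def gflops_per_watt(gflops: float, watts: float) -> float:
    # Efficiency metric used throughout this issue: HPL result / average wall power.
    return gflops / watts

print(round(gflops_per_watt(2141, 724), 2))  # 2.96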

Screenshot 2024-10-25 at 3 06 21 PM
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  203788
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :     192
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      203788   256     1   192            2635.11             2.1412e+03
HPL_pdgesv() start time Fri Oct 25 19:20:11 2024

HPL_pdgesv() end time   Fri Oct 25 20:04:06 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.21811736e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Ampere is going to provide instructions for a more optimized HPL run since AmpereOne is a newer architecture and could use a few optimizations to sneak in a bit more efficiency.

Note also that this chassis likely draws a bit more power than the other Ampere systems I've tested. I'm measuring wall power, so keep that in mind! The CPU package reports around 260W on average.

rbapat-ampere commented:

Hi Jeff,
Can you rerun the workload using the attached HPL.dat input file? Our testing shows 3.0 TFLOP/s.
HPL.zip

Note: We are working on tweaking the performance even further.

geerlingguy commented:

@rbapat-ampere - Sure! That makes sense if the NUMA layout means memory access is better allocated that way. It seems highly dependent on the core layout; on most single-socket systems, just setting Qs to the core count helps, but at 192 cores things behave funny :)
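
For 192 ranks there are only a handful of factor pairs, and 12 × 16 is the nearest-square grid. A quick way to enumerate the candidates (my own Python sketch, not anything from the playbook):

# List (P, Q) factor pairs for a rank count, nearest-square first;
# HPL generally prefers P <= Q and a squarish grid.
def grids(n_ranks: int):
    pairs = [(p, n_ranks // p) for p in range(1, int(n_ranks**0.5) + 1)
             if n_ranks % p == 0]
    return sorted(pairs, key=lambda pq: pq[1] - pq[0])

print(grids(192)[:3])  # [(12, 16), (8, 24), (6, 32)]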

I'm going to use the following values in my config.yml to override the defaults (I had been using 1/192 for Ps/Qs):

hpl_dat_opts:
  # sqrt((Memory in GB * 1024 * 1024 * 1024 * Node count) / 8) * 0.9
  #Ns: "{{ (((((ram_in_gb | int) * 1024 * 1024 * 1024 * (nodecount | int)) / 8) | root) * 0.90) | int }}"
  Ns: 203788
  NBs: 256
  # (P * Q) should be roughly equivalent to total core count, with Qs higher.
  # If running on a single system, Ps should be 1 and Qs should be core count.
  Ps: 12
  Qs: 16 
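
For reference, the commented-out Ns formula above boils down to "size N so the N × N matrix of doubles fills RAM, then back off 10%". A minimal sketch of that rule (plain Python, not the playbook's Jinja):

import math

# N = sqrt(RAM_bytes * node_count / 8 bytes per double) * 0.9
def hpl_n(ram_gib: float, node_count: int = 1, scale: float = 0.9) -> int:
    ram_bytes = ram_gib * 1024**3 * node_count
    return int(math.sqrt(ram_bytes / 8) * scale)

print(hpl_n(512))  # 235929; the hardcoded 203788 is more conservative (~60% of RAM)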

geerlingguy commented Oct 25, 2024

2,156 Gflops at 685 W, for 3.14 Gflops/W

Average power is closer to 685W on this run, with the above settings:

Screenshot 2024-10-25 at 5 39 09 PM

Final result:

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  203788
NB     :     256
PMAP   : Row-major process mapping
P      :      12
Q      :      16
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      203788   256    12    16            2617.29             2.1557e+03
HPL_pdgesv() start time Fri Oct 25 22:05:40 2024

HPL_pdgesv() end time   Fri Oct 25 22:49:17 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.71446644e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

The contents of the HPL.dat I used:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
203788         Ns
1            # of NBs
256           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
12            Ps
16            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB

@rbapat-ampere - Is it possible you're using a more optimized BLIS library, e.g. https://github.com/AmpereComputing/HPL-on-Ampere-Altra ?

rbapat-ampere commented Oct 27, 2024

@geerlingguy
Nothing of that sort. I used OpenBLAS.

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      239872   256    12    16            3095.07             2.9729e+03
HPL_pdgesv() start time Fri Oct 25 10:23:52 2024

HPL_pdgesv() end time   Fri Oct 25 11:15:27 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.50966052e-03 ...... PASSED
================================================================================

Can you share your build and run procedure?

geerlingguy commented Oct 28, 2024

@rbapat-ampere / @joespeed - I noticed your N was set to 239872 in the run above (the HPL.zip you posted had N = 203788).

I will do another run with that updated value.

The entire build procedure is done automatically in an isolated build directory (all assets stored in /opt/top500) using this project's Ansible playbook: https://github.com/geerlingguy/top500-benchmark

The build instructions are:

  1. Ensure you have Ansible installed locally (e.g. pip install ansible)
  2. Clone this repository anywhere on the local volume
  3. cp example.hosts.ini hosts.ini && cp example.config.yml config.yml
  4. Modify the config.yml file and set the exact values you require (my entire config.yml is pasted below)
  5. Run ansible-playbook main.yml --tags "setup,benchmark"

The Ansible playbook makes it easy for me to swap out libraries, re-run the tests dozens of times without having to touch anything, and verify that my test runs are exactly the same across systems and time (as I also control the versions of libraries I install).

In my previous runs I've used flame/blis, but I will switch to openblas. The mpich version I'm running right now is 4.2.2, but I see 4.2.3 was just released, so I will switch to that as well. I'm starting a new run momentarily.

---
hpl_root: /opt/top500

mpich_version: "4.2.3"
linear_algebra_library: openblas  # 'atlas', 'openblas', or 'blis'
linear_algebra_blis_version: master  # only used for blis
linear_algebra_openblas_version: develop  # only used for openblas

ssh_user: ubuntu
ssh_user_home: /home/ubuntu

hpl_dat_opts:
  Ns: 239872
  NBs: 256
  Ps: 12
  Qs: 16

rbapat-ampere commented Oct 28, 2024

@geerlingguy
I ran HPL using your top500 Ansible playbook. A few things I noticed during runtime:

  1. While checking htop, I noticed that the average all-core utilization did not reach 100%.
    Screenshot 2024-10-28 000826
    Can you check htop while running and confirm your average all-core utilization?

During my runs, it's 100% from the get-go all the way to the end.

geerlingguy commented:

@rbapat-ampere - To match my setup, you would need to change the config.yml values to the ones I pasted in my comment above (I forgot to include them initially, but they are there now).

I just re-ran the benchmark and it errored out. Checking dmesg, I found a bunch of errors like:

[  511.304636] EDAC MC0: 1 CE single-symbol chipkill ECC on P0_Node0_Channel5_Dimm0 DIMMF1 (node:0 card:5 module:16 rank:0 bank_group:3 bank_address:3 device:7 row:32680 column:1312 DIMM location: P0_Node0_Channel5_Dimm0 DIMMF1 page:0x3fd437 offset:0x4180 grain:1 syndrome:0x0 - APEI location: node:0 card:5 module:16 rank:0 bank_group:3 bank_address:3 device:7 row:32680 column:1312 DIMM location: P0_Node0_Channel5_Dimm0 DIMMF1 status(0x0000000000000400): Storage error in DRAM memory)

And at the end, as I got an error from HPL, I saw this in dmesg:

[  536.679810] EDAC MC0: 1 CE single-symbol chipkill ECC on P0_Node0_Channel5_Dimm0 DIMMF1 (node:0 card:5 module:16 rank:0 bank_group:3 bank_address:3 device:7 row:1492 column:1248 DIMM location: P0_Node0_Channel5_Dimm0 DIMMF1 page:0x2ea37 offset:0x3080 grain:1 syndrome:0x0 - APEI location: node:0 card:5 module:16 rank:0 bank_group:3 bank_address:3 device:7 row:1492 column:1248 DIMM location: P0_Node0_Channel5_Dimm0 DIMMF1 status(0x0000000000000400): Storage error in DRAM memory)
[  539.061145] {139}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[  539.061148] {139}[Hardware Error]: It has been corrected by h/w and requires no further action
[  539.061149] {139}[Hardware Error]: event severity: corrected
[  539.061150] {139}[Hardware Error]:  Error 0, type: corrected
[  539.061151] {139}[Hardware Error]:   section_type: memory error
[  539.061152] {139}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[  539.061154] {139}[Hardware Error]:   physical_address: 0x00000001b0b63bc0
[  539.061156] {139}[Hardware Error]:   node:0 card:5 module:16 rank:0 bank_group:3 bank_address:3 device:7 row:865 column:240 
[  539.061157] {139}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
[  539.061158] {139}[Hardware Error]:   DIMM location: P0_Node0_Channel5_Dimm0 DIMMF1 
[  539.061159] {139}[Hardware Error]:  Error 1, type: corrected
[  539.061160] {139}[Hardware Error]:   section type: unknown, 2826cc9f-448c-4c2b-86b6-a95394b7ef33
[  539.061161] {139}[Hardware Error]:   section length: 0x30
[  539.061163] {139}[Hardware Error]:   00000000: 00003001 00000015 1b0b63c0 00000000  .0.......c......
[  539.061165] {139}[Hardware Error]:   00000010: 00000000 00000000 00000000 00410002  ..............A.
[  539.061167] {139}[Hardware Error]:   00000020: 001d0000 00000002 0d600080 006e0000  ..........`...n.
[  539.061170] EDAC MC0: 1 CE single-symbol chipkill ECC on P0_Node0_Channel5_Dimm0 DIMMF1 (node:0 card:5 module:16 rank:0 bank_group:3 bank_address:3 device:7 row:865 column:240 DIMM location: P0_Node0_Channel5_Dimm0 DIMMF1 page:0x1b0b6 offset:0x3bc0 grain:1 syndrome:0x0 - APEI location: node:0 card:5 module:16 rank:0 bank_group:3 bank_address:3 device:7 row:865 column:240 DIMM location: P0_Node0_Channel5_Dimm0 DIMMF1 status(0x0000000000000400): Storage error in DRAM memory)

Indeed, as I'm running the test, I'm seeing thousands of these messages every second...

[  669.816722] EDAC MC0: 1 CE single-symbol chipkill ECC on P0_Node0_Channel5_Dimm0 DIMMF1 (node:0 card:5 module:16 rank:0 bank_group:3 bank_address:3 device:7 row:43119 column:1776 DIMM location: P0_Node0_Channel5_Dimm0 DIMMF1 page:0x5437b7 offset:0xb8c0 grain:1 syndrome:0x0 - APEI location: node:0 card:5 module:16 rank:0 bank_group:3 bank_address:3 device:7 row:43119 column:1776 DIMM location: P0_Node0_Channel5_Dimm0 DIMMF1 status(0x0000000000000400): Storage error in DRAM memory)

It seems like the DIMM at P0_Node0_Channel5_Dimm0 (DIMMF1) might be a bad stick of RAM (which would also account for a slower result...).
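
(If rasdaemon is installed, something like ras-mc-ctl --error-count should tally corrected errors per DIMM, which would be an easier way to confirm it's all one stick than eyeballing dmesg.)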

geerlingguy commented Oct 28, 2024

Indeed, looking in the BMC, I see a bunch of health warnings:

Screenshot 2024-10-28 at 11 12 27 AM

I'm going to try reseating the RAM to see if that helps.

Screenshot 2024-10-28 at 11 21 20 AM

geerlingguy commented:

After re-seating, the RAM module is not throwing any errors visible in dmesg.

However, with the settings changed to the above config.yml, I'm now seeing the OOM killer take out the hpl processes. I've switched N back to 203788, and am letting it run again...

geerlingguy commented:

Aww, this time it ran for 30 minutes before bailing with OOM...

    ===================================================================================
    =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
    =   PID 41523 RUNNING AT 10.0.2.21
    =   EXIT CODE: 9
    =   CLEANING UP REMAINING PROCESSES
    =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
    ===================================================================================

Going to give it one more go, and then maybe back off Ns a bit more. All 8 sticks of RAM are accounted for, 512 GB total system RAM:

ubuntu@ubuntu:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           510Gi       397Gi       131Gi       6.5Mi       450Mi       113Gi
Swap:          8.0Gi       166Mi       7.8Gi

ubuntu@ubuntu:~$ sudo dmidecode -t memory | grep Serial
	Serial Number: 80AD012324466511C5
	Serial Number: 80AD012324466511CE
	Serial Number: 80AD01241097E76174
	Serial Number: 80AD01232446651180
	Serial Number: 80AD01232446651279
	Serial Number: 80AD01232446651347
	Serial Number: 80AD012327879F620A
	Serial Number: 80AD012327879F624D

...maybe OpenBLAS handles memory allocation differently? We'll see if a second run works or errors out at the same point. Note that power usage is dramatically reduced, more like 430W. I can't get a result yet, though, so we'll see what that means for overall efficiency!

geerlingguy commented Oct 28, 2024

After 30 minutes, it looks like free memory does run out, and swap gets touched... so it's probably a good idea to limit the RAM usage a little further.

Screenshot 2024-10-28 at 12 59 01 PM

I let the second benchmark run for a couple hours, and it was still not complete:

Screenshot 2024-10-28 at 2 27 59 PM

So I've changed from using RAM * 0.75 to RAM * 0.70 for the N calculation. We'll see if that can complete sooner!
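(Since the N × N matrix's memory footprint scales with N², trimming the memory fraction from 0.75 to 0.70 only shrinks N by a factor of sqrt(0.70/0.75), about 3.4%, so the headroom gained per step is modest.)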

[Edit: It did not, in fact... after 10 minutes or so, RAM fills up again. Trying 0.50 just to see if I can get a run without filling up system memory]

geerlingguy commented Oct 28, 2024

Even at 0.25x the system RAM, it seems like the RAM fills up around 30 minutes into the test. Might switch back to blis and see if the behavior improves again...

@rbapat-ampere - When you're doing your runs, how long do they take? And have you monitored RAM consumption over time to see if/when it fills up with 512 GB of system memory?

Re-running with blis at master instead of openblas at devel, it's completing now—using 690W of power. OpenBLAS runs were averaging around 390-420W of power, but would fill up the system memory until processes were OOM-killed every time :/

Second question: do you know which version of OpenBLAS you're running, @rbapat-ampere? I'm on devel from earlier this morning, so maybe some arm64 changes broke it...

geerlingguy commented:

End result with blis again:

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  203788
NB     :     256
PMAP   : Row-major process mapping
P      :      12
Q      :      16
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      203788   256    12    16            2624.94             2.1495e+03
HPL_pdgesv() start time Mon Oct 28 20:35:19 2024

HPL_pdgesv() end time   Mon Oct 28 21:19:04 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.71446644e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

geerlingguy commented Oct 28, 2024

I tested again with just 0.05 of the RAM (so N = 52133), and this time it averaged around 450W of power. (Interestingly, the CPUs maxed out at 100% the whole time, versus the 700W runs where the CPUs hovered between 90-100%.)

It took 20 minutes, but all 512 GB of RAM still filled up... is there a memory leak in OpenBLAS? Or is it expected to consume all RAM even when N is really small? Strange behavior.

Screenshot 2024-10-28 at 4 41 19 PM

Still filled up memory. Tried again with a tiny N allocation, and it still did the same... Going to re-run everything, recompiling OpenBLAS at a pinned version:

linear_algebra_openblas_version: "v0.3.28"
ubuntu@ubuntu:/opt/top500/tmp/openblas-build$ git status
HEAD detached at v0.3.28
nothing to commit, working tree clean

geerlingguy commented Oct 28, 2024

On v0.3.28, I'm able to get a run to complete with N set laughably low. Setting it to 166501, it seems like it may be working better, using up 99% of RAM but not thrashing... We'll see.

After about 40 minutes, RAM utilization hit 100% and the CPUs dropped to 80-100% utilization for about 2 minutes, but now it's back to 99% RAM with the CPUs pegged at 100% (though system power was averaging 465W, and is now down to 450W):

Screenshot 2024-10-28 at 6 46 14 PM

[Edit: And now it's doing the same thing again, but only for about 1.5 minutes.]

Screenshot 2024-10-28 at 6 52 20 PM

geerlingguy commented:

That test failed with more:

[26986.597698] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/[email protected]/init.scope,task=systemd,pid=529208,uid=1000
[26986.597752] Out of memory: Killed process 529208 (systemd) total-vm:22080kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:384kB oom_score_adj:100
[26986.640893] xhpl invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

So I'm switching gears and will run the scripts @rbapat-ampere sent over to compile the versions of OpenMPI 5.0.5, BLIS, and HPL he has running. I believe a repo with that configuration will be forthcoming...

geerlingguy commented:

Finally got a good run in, with the Ampere-supplied script.

2,996 Gflops at 725W, for 4.13 Gflops/W

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  239872 
NB     :     256 
PMAP   : Row-major process mapping
P      :      12 
Q      :      16 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      239872   256    12    16            3071.39             2.9958e+03
HPL_pdgesv() start time Tue Oct 29 01:17:45 2024

HPL_pdgesv() end time   Tue Oct 29 02:08:56 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.56457026e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
Screenshot 2024-10-28 at 9 35 24 PM

geerlingguy commented:

If I attempt to modify my config to use the configure options in the Ampere-supplied repo, I wind up with errors compiling HPL. Another major difference I've noticed is the HPL Makefile, Make.AmpereOne; it differs a bit from the one I generate in the task that runs sh make_generic...

I'm happy enough accepting the result from the Ampere build, though. To be more complete, I should probably note the runtime options used for each run in these issues. I don't always do that, but platform-specific optimizations (beyond auto and make_generic) can lead to a substantial difference in performance.

geerlingguy commented:

Going to run once more tonight, to see if I can pass 3 Tflops (got 2.996 lol).

geerlingguy commented Oct 29, 2024

Also, to give it a little more of a leg up, I may do one more run with the 25 Gbps network detached and all the NVMe drives pulled from the front. I'm guessing each drive sucks down 2-5W at idle...

System was idling around 233W this morning before I unplugged those parts. Now, after rebooting and waiting another 10 or so minutes, it's idling around 196W:

Screenshot 2024-10-29 at 9 41 15 AM

geerlingguy commented:

Today's run, without NVMe and without the 25G networking connected, was rock solid at 692W (see graph below), for:

3,027 Gflops at 692W, or 4.37 Gflops/W

Screenshot 2024-10-29 at 10 38 40 AM
Screenshot 2024-10-29 at 10 37 20 AM

geerlingguy commented:

I may do another run later, downclocked to 2.6 GHz or some lower clock, to see if efficiency scales up a bit and where the sweet spot is for the AmpereOne.

geerlingguy commented Oct 30, 2024

Setting the frequency to 2.6 GHz on all cores to simulate the more efficient A192-26X SKU:

sudo cpupower frequency-set -u 2600000
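
(cpupower interprets bare frequency values as kHz, so 2600000 here sets a 2.6 GHz ceiling on the max clock.)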

2,755 Gflops at 612W, for 4.50 Gflops/W

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  239872 
NB     :     256 
PMAP   : Row-major process mapping
P      :      12 
Q      :      16 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      239872   256    12    16            3340.34             2.7546e+03
HPL_pdgesv() start time Wed Oct 30 01:52:32 2024

HPL_pdgesv() end time   Wed Oct 30 02:48:12 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.58219282e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
Screenshot 2024-10-29 at 9 55 52 PM

geerlingguy commented:

Closing this issue for now as I have the results—I will reopen if we get any other optimizations back from Ampere!

geerlingguy reopened this Oct 30, 2024

geerlingguy commented:

One more run this morning, with the 6x NVMe drives unplugged, for max efficiency score...

2,745.1 Gflops at 570 W, for 4.82 Gflops/W

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  239872 
NB     :     256 
PMAP   : Row-major process mapping
P      :      12 
Q      :      16 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      239872   256    12    16            3351.97             2.7451e+03
HPL_pdgesv() start time Wed Oct 30 14:23:21 2024

HPL_pdgesv() end time   Wed Oct 30 15:19:13 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.55559315e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
Screenshot 2024-10-30 at 10 32 49 AM

geerlingguy commented:

(Just to be complete: after getting the extra stick of RAM, I re-ran the benchmark and got about the same 3 Tflops result.)
