Skip to content

Latest commit

 

History

History
539 lines (468 loc) · 30.1 KB

bug3177.md

File metadata and controls

539 lines (468 loc) · 30.1 KB

from [https://scicomp.stackexchange.com/questions/11827/flop-counts-for-lapack-symmetric-eigenvalue-routines-dsyev-dsyevd-dsyevx-and-d]:

First of all, yes, these are all based on an initial tridiagonalization (often quoted to be 43n3 flops). DSYEV is just an easier to use version of DSYEVX, so let's ignore it for now. The basic breakdown is such:

  • DSYEVX: Tridiagonal implicitly shifted QR (bulge chasing)
  • DSYEVD: Divide and Conquer (stitching by rank 1 modifications)
  • DSYEVR: Use the Multiple Relatively Robust Representations (MRRR) algorithm (symmetric dqds)

The flop count of each of these algorithms is highly dependent on the eigenvalue distribution. For example, if your matrix has clusters of closely packed eigenvalues that are nearly degenerate, then these algorithms tend to have reduced order of convergence. At best you have O(n2) flops in the tridiagonal eigenproblem, and at worst O(n3) if the sizes of your clusters go as O(n). See this working note for a more detailed analysis.

Since you want eigenvectors, let's look at the flop count for their extraction. Implicitly shifted QR continually updates the orthogonal matrices used for reduction, so it is always runs in O(n3) when eigenvectors are desired (basically, each QR bulge sweep takes O(n2), and you need to O(n) sweeps usually).

For D&C and MRRR, the textbook descriptions say to compute eigenvalues first, then extract eigenvectors with inverse iteration. This requires O(n) work per eigenvector since usually 1-2 iterations is enough, so it would be O(n2) work to get all the eigenvectors after the eigenvalues have been found. However, in practice, Lapack does something more sophisticated.

I'm not familiar with D&C, but look at the source, it looks like it keeps the eigenvectors updated as it goes, so it's runtime is likely somewhere between O(n2) and O(n3). For MRRR, the failure of inverse iteration has been analyzed in this paper. In practice, the eigenvectors must be updated throughout the algorithm and reorthogonalized when clusters are detected. A few of the details of this process can be controlled in the interface to DSYEVR, and like D&C, the practical runtime is somewhere between quadratic and cubic.

graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_magma_gpu 8000
 nb =          100
 info =            0
Time taken by dsyev for matrix size     8000 was     409.82 seconds
    -51.6506244    -51.4058072    -51.3475218    -51.2443889    -51.1703626    -51.1030575    -51.0072332    -50.9789340    -50.9090230    -50.8573262
     50.9106148     50.9503807     51.0075738     51.1221930     51.2002598     51.2702891     51.3218429     51.4007310     51.6350470   4000.6517809

real	0m9,112s
user	6m39,464s
sys	0m11,204s
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_mkl 8000
 info =            0
Time taken by dsyev for matrix size     8000 was     176.24 seconds
    -51.6506244    -51.4058072    -51.3475218    -51.2443889    -51.1703626    -51.1030575    -51.0072332    -50.9789340    -50.9090230    -50.8573262
     50.9106148     50.9503807     51.0075738     51.1221930     51.2002598     51.2702891     51.3218429     51.4007310     51.6350470   4000.6517809

real	2m57,398s
user	2m56,736s
sys	0m0,208s
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_magma 8000
 info =            0
Time taken by dsyev for matrix size     8000 was     189.90 seconds
    -51.6506244    -51.4058072    -51.3475218    -51.2443889    -51.1703626    -51.1030575    -51.0072332    -50.9789340    -50.9090230    -50.8573262
     50.9106148     50.9503807     51.0075738     51.1221930     51.2002598     51.2702891     51.3218429     51.4007310     51.6350470   4000.6517809

real	0m23,362s
user	3m9,148s
sys	0m1,496s
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_magma_gpu 16000
 nb =          100
 info =            0
Time taken by dsyev for matrix size    16000 was    1828.30 seconds
    -72.8550325    -72.6979438    -72.6765172    -72.6194803    -72.5662783    -72.4850298    -72.4639951    -72.3427485    -72.3181432    -72.2927798
     72.3422654     72.4097912     72.4202021     72.5078816     72.5726288     72.6068457     72.6619977     72.8739593     72.9735306   7999.9111633

real	0m35,040s
user	29m43,724s
sys	0m47,976s
>>> ( 4.0/3.0 * 16000**3 )/1.e9/(35.040)
155.8599695585997
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_magma_gpu 32000
 nb =          100
 info =            0
Time taken by dsyev for matrix size    32000 was    9012.72 seconds
   -103.1265024   -103.0976923   -103.0580529   -102.9909914   -102.9341979   -102.8812811   -102.8219491   -102.7303616   -102.6743930   -102.6622158
    102.6781966    102.6999579    102.7533367    102.8068101    102.8417293    102.9259807    102.9992714    103.0677603    103.2170313  15999.6382274

real	2m42,285s
user	146m26,744s
sys	4m6,176s
>>> ( 4.0/3.0 * 32000**3 )/1.e9/(162.285)
269.2218422322868

n = 8000 => nops = 4/3 * 8000**3 = 682666666666.6666

method gflops
mkl (1 core) 3.8
magma 8 cores 29.2
magma 1 p100 74.9

28/06/2021

>>> 177.398/23.362
7.5934423422652175
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_magma 16000
 info =            0
Time taken by dsyev for matrix size    16000 was    2179.51 seconds
    -72.8550325    -72.6979438    -72.6765172    -72.6194803    -72.5662783    -72.4850298    -72.4639951    -72.3427485    -72.3181432    -72.2927798
     72.3422654     72.4097912     72.4202021     72.5078816     72.5726288     72.6068457     72.6619977     72.8739593     72.9735306   7999.9111633

real	4m40,192s
user	36m14,516s
sys	0m8,284s

graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_magma_gpu 16000
 nb =          100
 info =            0
Time taken by dsyev for matrix size    16000 was    1837.43 seconds
    -72.8550325    -72.6979438    -72.6765172    -72.6194803    -72.5662783    -72.4850298    -72.4639951    -72.3427485    -72.3181432    -72.2927798
     72.3422654     72.4097912     72.4202021     72.5078816     72.5726288     72.6068457     72.6619977     72.8739593     72.9735306   7999.9111633

real	0m38,862s
user	29m53,972s
sys	0m46,992s
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     14071      C   ./diag_magma_gpu                            4301MiB |
+-----------------------------------------------------------------------------+
Mon Jun 28 11:00:55 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.152.00   Driver Version: 418.152.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:25:00.0 Off |                    0 |
| N/A   36C    P0    70W / 250W |   4311MiB / 16280MiB |     57%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:5B:00.0 Off |                    0 |
| N/A   29C    P0    26W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:9B:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  On   | 00000000:C8:00.0 Off |                    0 |
| N/A   31C    P0    25W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ env | grep NUM_THR
MKL_NUM_THREADS=72
MAGMA_NUM_THREADS=72
OMP_NUM_THREADS=72
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_mkl_omp 16000
 info =            0
Time taken by dsyev for matrix size    16000 was    2187.96 seconds
    -72.8550325    -72.6979438    -72.6765172    -72.6194803    -72.5662783    -72.4850298    -72.4639951    -72.3427485    -72.3181432    -72.2927798
     72.3422654     72.4097912     72.4202021     72.5078816     72.5726288     72.6068457     72.6619977     72.8739593     72.9735306   7999.9111633

real	4m37,511s
user	36m22,320s
sys	0m8,896s
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ export OMP_NUM_THREADS=1
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ export MAGMA_NUM_THREADS=1
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ export MKL_NUM_THREADS=1
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_magma_gpu 16000
 nb =          100
 info =            0
Time taken by dsyev for matrix size    16000 was      30.13 seconds
    -72.8550325    -72.6979438    -72.6765172    -72.6194803    -72.5662783    -72.4850298    -72.4639951    -72.3427485    -72.3181432    -72.2927798
     72.3422654     72.4097912     72.4202021     72.5078816     72.5726288     72.6068457     72.6619977     72.8739593     72.9735306   7999.9111633

real	0m33,676s
user	0m26,136s
sys	0m7,380s
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_magma_gpu 32000
 nb =          100
 info =            0
Time taken by dsyev for matrix size    32000 was     153.27 seconds
   -103.1265024   -103.0976923   -103.0580529   -102.9909914   -102.9341979   -102.8812811   -102.8219491   -102.7303616   -102.6743930   -102.6622158
    102.6781966    102.6999579    102.7533367    102.8068101    102.8417293    102.9259807    102.9992714    103.0677603    103.2170313  15999.6382274

real	2m54,601s
user	2m13,468s
sys	0m40,424s
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     30518      C   ./diag_magma_gpu                           16021MiB |
+-----------------------------------------------------------------------------+
Mon Jun 28 13:27:50 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.152.00   Driver Version: 418.152.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:25:00.0 Off |                    0 |
| N/A   38C    P0   134W / 250W |  16031MiB / 16280MiB |     73%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:5B:00.0 Off |                    0 |
| N/A   29C    P0    26W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:9B:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  On   | 00000000:C8:00.0 Off |                    0 |
| N/A   32C    P0    25W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_magma_gpu 8000
 nb =          100
 info =            0
Time taken by dsyev for matrix size     8000 was       7.58 seconds
    -51.6506244    -51.4058072    -51.3475218    -51.2443889    -51.1703626    -51.1030575    -51.0072332    -50.9789340    -50.9090230    -50.8573262
     50.9106148     50.9503807     51.0075738     51.1221930     51.2002598     51.2702891     51.3218429     51.4007310     51.6350470   4000.6517809

real	0m8,701s
user	0m6,356s
sys	0m2,064s

graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_magma_gpu 12000
 nb =          100
 info =            0
Time taken by dsyev for matrix size    12000 was      16.46 seconds
    -63.0966139    -62.9934115    -62.9068927    -62.8081064    -62.7892967    -62.7294508    -62.6248293    -62.5964066    -62.5652698    -62.5251800
     62.5216052     62.5726116     62.6076412     62.7159574     62.8125527     62.8662591     62.9496596     63.0661715     63.1963897   6000.2081462

real	0m18,357s
user	0m14,040s
sys	0m4,208s
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ export MAGMA_NUM_THREADS=8
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_mkl 32000
^Cforrtl: error (69): process interrupted (SIGINT)
Image              PC                Routine            Line        Source             
diag_mkl           0000000000407B6B  Unknown               Unknown  Unknown
libpthread-2.24.s  00007FCD9D69B0E0  Unknown               Unknown  Unknown
libmkl_avx512.so   00007FCBB113D58F  mkl_blas_avx512_x     Unknown  Unknown
libmkl_core.so     00007FCD9FF6FD56  mkl_blas_xdsyr2       Unknown  Unknown
libmkl_core.so     00007FCDA06F6132  mkl_lapack_dsytd2     Unknown  Unknown
libmkl_core.so     00007FCDA0AD062E  mkl_lapack_xdsytr     Unknown  Unknown
libmkl_core.so     00007FCDA06E8011  mkl_lapack_dsyev      Unknown  Unknown
libmkl_intel_lp64  00007FCD9F5BD7BC  mkl_lapack__dsyev     Unknown  Unknown
diag_mkl           0000000000405497  Unknown               Unknown  Unknown
diag_mkl           00000000004049A2  Unknown               Unknown  Unknown
libc-2.24.so       00007FCD9D30B2E1  __libc_start_main     Unknown  Unknown
diag_mkl           000000000040486A  Unknown               Unknown  Unknown

real	11m20,586s
user	11m15,500s
sys	0m3,396s
graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ time ./diag_mkl_omp 32000
 info =            0
Time taken by dsyev for matrix size    32000 was   19106.00 seconds
   -103.1265024   -103.0976923   -103.0580529   -102.9909914   -102.9341979   -102.8812811   -102.8219491   -102.7303616   -102.6743930   -102.6622158
    102.6781966    102.6999579    102.7533367    102.8068101    102.8417293    102.9259807    102.9992714    103.0677603    103.2170313  15999.6382274

real	40m25,284s
user	317m41,348s
sys	1m4,800s

09/07/2021

graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ for c in 1 2 3 4 5 6 7 8 10 12 16; do echo "num cores : $c";time OMP_NUM_THREADS=$c MKL_NUM_THREADS=$c ./diag_mkl_omp 6000; done

graffy@physix90:/opt/ipr/cluster/work.local/graffy/bug3177$ for c in 1 2 4 8 12 16 18 36 72; do echo "num cores : $c";time OMP_NUM_THREADS=$c MKL_NUM_THREADS=$c ./diag_mkl_omp 12000; done

19/08/2021

graffy@graffy-ws2:~/work/tickets/bug3177$ rsync --rsh=ssh -va ./ [email protected]:/mnt/workl/graffy/bug3177/
graffy@physix12:/mnt/workl/graffy/bug3177$ module load compilers/ifort/latest
graffy@physix12:/mnt/workl/graffy/bug3177$ for c in 1 2 4 8 12 16 18 24 32; do echo "num cores : $c";time OMP_NUM_THREADS=$c MKL_NUM_THREADS=$c ./diag_mkl_omp 12000; done
date >> ./toto.log; for c in {1..32}; do echo "num cores : $c" >> ./toto.log ; OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 ./diag_mkl_omp 6000 >> ./toto.log & done; wait ; date >> ./toto.log

jeudi 19 août 2021, 20:06:56 (UTC+0200)
jeudi 19 août 2021, 20:13:39 (UTC+0200)

graffy@physix12:/mnt/workl/graffy/bug3177$ strings ./toto.log | grep taken
Time taken by dsyev for matrix size     6000 was     263.56 seconds
Time taken by dsyev for matrix size     6000 was     278.55 seconds
Time taken by dsyev for matrix size     6000 was     279.00 seconds
Time taken by dsyev for matrix size     6000 was     280.90 seconds
Time taken by dsyev for matrix size     6000 was     283.78 seconds
Time taken by dsyev for matrix size     6000 was     289.54 seconds
Time taken by dsyev for matrix size     6000 was     294.37 seconds
Time taken by dsyev for matrix size     6000 was     298.77 seconds
Time taken by dsyev for matrix size     6000 was     300.43 seconds
Time taken by dsyev for matrix size     6000 was     302.72 seconds
Time taken by dsyev for matrix size     6000 was     308.66 seconds
Time taken by dsyev for matrix size     6000 was     310.77 seconds
Time taken by dsyev for matrix size     6000 was     312.39 seconds
Time taken by dsyev for matrix size     6000 was     313.05 seconds
Time taken by dsyev for matrix size     6000 was     319.08 seconds
Time taken by dsyev for matrix size     6000 was     322.58 seconds
Time taken by dsyev for matrix size     6000 was     382.66 seconds
Time taken by dsyev for matrix size     6000 was     386.16 seconds
Time taken by dsyev for matrix size     6000 was     387.94 seconds
Time taken by dsyev for matrix size     6000 was     388.16 seconds
Time taken by dsyev for matrix size     6000 was     390.04 seconds
Time taken by dsyev for matrix size     6000 was     393.09 seconds
Time taken by dsyev for matrix size     6000 was     394.48 seconds
Time taken by dsyev for matrix size     6000 was     394.61 seconds
Time taken by dsyev for matrix size     6000 was     394.26 seconds
Time taken by dsyev for matrix size     6000 was     395.16 seconds
Time taken by dsyev for matrix size     6000 was     396.60 seconds
Time taken by dsyev for matrix size     6000 was     397.20 seconds
Time taken by dsyev for matrix size     6000 was     398.25 seconds
Time taken by dsyev for matrix size     6000 was     399.21 seconds
Time taken by dsyev for matrix size     6000 was     399.24 seconds
Time taken by dsyev for matrix size     6000 was     401.76 seconds
graffy@physix12:/mnt/workl/graffy/bug3177$ for c in 1 2 4 8 12 16 18 24 32; do echo "num cores : $c";time OMP_NUM_THREADS=$c MKL_NUM_THREADS=$c ./diag_mkl_omp 6000; done

waw ... diagonalising a 6000x6000 matrix takes only 122s on one core of physix12, but if you attempt tu launch 32 of them simultaneously, they take between 263s and 401s...! The only explanation I see is that they're fighting for cache.. let's try smaller matrices

graffy@physix12:/mnt/workl/graffy/bug3177$ date >> ./toto.log; for c in {1..16}; do echo "num cores : $c" >> ./toto.log ; OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 ./diag_mkl_omp 6000 >> ./toto.log & done; wait ; date >> ./toto.log


    jeudi 19 août 2021, 20:26:12 (UTC+0200)
    jeudi 19 août 2021, 20:30:32 (UTC+0200)

    graffy@physix12:/mnt/workl/graffy/bug3177$ strings ./toto.log | grep taken
    Time taken by dsyev for matrix size     6000 was     189.82 seconds
    Time taken by dsyev for matrix size     6000 was     190.97 seconds
    Time taken by dsyev for matrix size     6000 was     193.70 seconds
    Time taken by dsyev for matrix size     6000 was     196.05 seconds
    Time taken by dsyev for matrix size     6000 was     196.68 seconds
    Time taken by dsyev for matrix size     6000 was     196.94 seconds
    Time taken by dsyev for matrix size     6000 was     197.21 seconds
    Time taken by dsyev for matrix size     6000 was     200.16 seconds
    Time taken by dsyev for matrix size     6000 was     251.46 seconds
    Time taken by dsyev for matrix size     6000 was     251.74 seconds
    Time taken by dsyev for matrix size     6000 was     251.94 seconds
    Time taken by dsyev for matrix size     6000 was     253.90 seconds
    Time taken by dsyev for matrix size     6000 was     253.90 seconds
    Time taken by dsyev for matrix size     6000 was     255.48 seconds
    Time taken by dsyev for matrix size     6000 was     256.46 seconds
    Time taken by dsyev for matrix size     6000 was     259.31 seconds

graffy@physix12:/mnt/workl/graffy/bug3177$ mv toto.log [email protected]

graffy@physix12:/mnt/workl/graffy/bug3177$ date > ./toto.log; for c in {1..32}; do echo "num cores : $c" >> ./toto.log ; OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 ./diag_mkl_omp 4000 >> ./toto.log & done; wait ; date >> ./toto.log

    jeudi 19 août 2021, 20:34:47 (UTC+0200)
    jeudi 19 août 2021, 20:37:17 (UTC+0200)

graffy@physix12:/mnt/workl/graffy/bug3177$ strings ./toto.log | grep taken
    Time taken by dsyev for matrix size     4000 was      60.19 seconds
    Time taken by dsyev for matrix size     4000 was      61.14 seconds
    Time taken by dsyev for matrix size     4000 was      70.28 seconds
    Time taken by dsyev for matrix size     4000 was      70.73 seconds
    Time taken by dsyev for matrix size     4000 was      70.95 seconds
    Time taken by dsyev for matrix size     4000 was      71.16 seconds
    Time taken by dsyev for matrix size     4000 was      71.27 seconds
    Time taken by dsyev for matrix size     4000 was      71.59 seconds
    Time taken by dsyev for matrix size     4000 was      81.06 seconds
    Time taken by dsyev for matrix size     4000 was      83.59 seconds
    Time taken by dsyev for matrix size     4000 was      84.70 seconds
    Time taken by dsyev for matrix size     4000 was      86.68 seconds
    Time taken by dsyev for matrix size     4000 was      91.74 seconds
    Time taken by dsyev for matrix size     4000 was      91.82 seconds
    Time taken by dsyev for matrix size     4000 was     135.89 seconds
    Time taken by dsyev for matrix size     4000 was     142.83 seconds
    Time taken by dsyev for matrix size     4000 was     143.08 seconds
    Time taken by dsyev for matrix size     4000 was     144.80 seconds
    Time taken by dsyev for matrix size     4000 was     145.04 seconds
    Time taken by dsyev for matrix size     4000 was     145.25 seconds
    Time taken by dsyev for matrix size     4000 was     145.77 seconds
    Time taken by dsyev for matrix size     4000 was     145.80 seconds
    Time taken by dsyev for matrix size     4000 was     145.93 seconds
    Time taken by dsyev for matrix size     4000 was     146.92 seconds
    Time taken by dsyev for matrix size     4000 was     147.27 seconds
    Time taken by dsyev for matrix size     4000 was     147.59 seconds
    Time taken by dsyev for matrix size     4000 was     148.72 seconds
    Time taken by dsyev for matrix size     4000 was     148.86 seconds
    Time taken by dsyev for matrix size     4000 was     148.99 seconds
    Time taken by dsyev for matrix size     4000 was     149.07 seconds
    Time taken by dsyev for matrix size     4000 was     149.23 seconds
    Time taken by dsyev for matrix size     4000 was     149.41 seconds

    graffy@physix12:/mnt/workl/graffy/bug3177$ mv toto.log [email protected]
graffy@physix90:/mnt/workl/graffy/bug3177$ date >> ./toto.log; for c in {1..70}; do echo "num cores : $c" >> ./toto.log ; OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 ./diag_mkl_omp 8000 >> ./toto.log & done; wait ; date >> ./toto.log

jeudi 19 août 2021, 20:01:48 (UTC+0200)
jeudi 19 août 2021, 20:28:29 (UTC+0200)

graffy@physix90:/mnt/workl/graffy/bug3177$ strings size8000\@72cores.log | grep taken
Time taken by dsyev for matrix size     8000 was    1377.04 seconds
Time taken by dsyev for matrix size     8000 was    1385.54 seconds
Time taken by dsyev for matrix size     8000 was    1386.46 seconds
Time taken by dsyev for matrix size     8000 was    1386.28 seconds
Time taken by dsyev for matrix size     8000 was    1392.95 seconds
Time taken by dsyev for matrix size     8000 was    1393.57 seconds
Time taken by dsyev for matrix size     8000 was    1408.21 seconds
Time taken by dsyev for matrix size     8000 was    1413.81 seconds
Time taken by dsyev for matrix size     8000 was    1427.93 seconds
Time taken by dsyev for matrix size     8000 was    1429.53 seconds
Time taken by dsyev for matrix size     8000 was    1434.31 seconds
Time taken by dsyev for matrix size     8000 was    1434.73 seconds
Time taken by dsyev for matrix size     8000 was    1435.41 seconds
Time taken by dsyev for matrix size     8000 was    1434.86 seconds
Time taken by dsyev for matrix size     8000 was    1449.81 seconds
Time taken by dsyev for matrix size     8000 was    1449.94 seconds
Time taken by dsyev for matrix size     8000 was    1458.62 seconds
Time taken by dsyev for matrix size     8000 was    1464.88 seconds
Time taken by dsyev for matrix size     8000 was    1465.58 seconds
Time taken by dsyev for matrix size     8000 was    1474.32 seconds
Time taken by dsyev for matrix size     8000 was    1474.93 seconds
Time taken by dsyev for matrix size     8000 was    1475.24 seconds
Time taken by dsyev for matrix size     8000 was    1479.48 seconds
Time taken by dsyev for matrix size     8000 was    1480.93 seconds
Time taken by dsyev for matrix size     8000 was    1484.08 seconds
Time taken by dsyev for matrix size     8000 was    1486.02 seconds
Time taken by dsyev for matrix size     8000 was    1487.28 seconds
Time taken by dsyev for matrix size     8000 was    1484.03 seconds
Time taken by dsyev for matrix size     8000 was    1491.19 seconds
Time taken by dsyev for matrix size     8000 was    1490.94 seconds
Time taken by dsyev for matrix size     8000 was    1491.42 seconds
Time taken by dsyev for matrix size     8000 was    1493.18 seconds
Time taken by dsyev for matrix size     8000 was    1492.59 seconds
Time taken by dsyev for matrix size     8000 was    1494.33 seconds
Time taken by dsyev for matrix size     8000 was    1498.14 seconds
Time taken by dsyev for matrix size     8000 was    1499.87 seconds
Time taken by dsyev for matrix size     8000 was    1498.84 seconds
Time taken by dsyev for matrix size     8000 was    1503.38 seconds
Time taken by dsyev for matrix size     8000 was    1511.98 seconds
Time taken by dsyev for matrix size     8000 was    1511.79 seconds
Time taken by dsyev for matrix size     8000 was    1514.42 seconds
Time taken by dsyev for matrix size     8000 was    1515.51 seconds
Time taken by dsyev for matrix size     8000 was    1518.11 seconds
Time taken by dsyev for matrix size     8000 was    1524.22 seconds
Time taken by dsyev for matrix size     8000 was    1525.70 seconds
Time taken by dsyev for matrix size     8000 was    1528.15 seconds
Time taken by dsyev for matrix size     8000 was    1529.68 seconds
Time taken by dsyev for matrix size     8000 was    1529.47 seconds
Time taken by dsyev for matrix size     8000 was    1532.36 seconds
Time taken by dsyev for matrix size     8000 was    1532.66 seconds
Time taken by dsyev for matrix size     8000 was    1537.51 seconds
Time taken by dsyev for matrix size     8000 was    1540.12 seconds
Time taken by dsyev for matrix size     8000 was    1541.18 seconds
Time taken by dsyev for matrix size     8000 was    1543.94 seconds
Time taken by dsyev for matrix size     8000 was    1540.04 seconds
Time taken by dsyev for matrix size     8000 was    1543.16 seconds
Time taken by dsyev for matrix size     8000 was    1549.16 seconds
Time taken by dsyev for matrix size     8000 was    1547.96 seconds
Time taken by dsyev for matrix size     8000 was    1551.22 seconds
Time taken by dsyev for matrix size     8000 was    1548.58 seconds
Time taken by dsyev for matrix size     8000 was    1554.20 seconds
Time taken by dsyev for matrix size     8000 was    1555.50 seconds
Time taken by dsyev for matrix size     8000 was    1556.40 seconds
Time taken by dsyev for matrix size     8000 was    1555.19 seconds
Time taken by dsyev for matrix size     8000 was    1557.36 seconds
Time taken by dsyev for matrix size     8000 was    1557.69 seconds
Time taken by dsyev for matrix size     8000 was    1564.10 seconds
Time taken by dsyev for matrix size     8000 was    1564.82 seconds
Time taken by dsyev for matrix size     8000 was    1567.04 seconds
Time taken by dsyev for matrix size     8000 was    1576.38 seconds

20/08/2021

  • with aviel, realized that the matrices are not the same, and that could make some diagonalizations faster than others. So I's added a fixed seed to ensure all computations are the same

  • to see the effect of the cache on performance, I'll do computations on smaller matrices that will fit into cache memory (1Mb per core for xeon 6248r).

    • to do that I've added a loop argument to diag
graffy@physix90:/mnt/workl/graffy/bug3177$ export LOG_FILE=./run$(date).log; date > "$LOG_FILE"; for c in {1..2}; do echo "num cores : $c" >> "$LOG_FILE" ; OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 ./diag_mkl_omp 128 10000 >> "$LOG_FILE" & done; wait ; date >> "$LOG_FILE"
graffy@physix12:/mnt/workl/graffy/bug3177$ ./cohab_batch.bash 32 physix12 10
# machine_id matrix_size num_loops num_processes duration_min(s) duration_max(s)
physix12 128 10000 1 23.33 23.33
physix12 128 10000 2 23.08 23.08
graffy@physix90:/mnt/workl/graffy/bug3177$ ./cohab_batch.bash 72 physix90 10
# machine_id matrix_size num_loops num_processes duration_min(s) duration_max(s)
physix90 128 10000 1 12.38 12.38
physix90 128 10000 2 12.38 12.45

24/08/2021

harvest results on physix12

graffy@graffy-ws2:~/work/tickets/bug3177$ rsync --rsh=ssh -va [email protected]:/mnt/workl/graffy/bug3177/cohab* ./cohab/physix12/
graffy@graffy-ws2:~/work/tickets/bug3177$ rsync --rsh=ssh -va [email protected]:/mnt/workl/graffy/bug3177/physix12* ./cohab/physix12/

harvest results on physix90

graffy@graffy-ws2:~/work/tickets/bug3177$ rsync --rsh=ssh -va [email protected]:/mnt/workl/graffy/bug3177/cohab* ./cohab/physix90/
graffy@graffy-ws2:~/work/tickets/bug3177$ rsync --rsh=ssh -va [email protected]:/mnt/workl/graffy/bug3177/physix90* ./cohab/physix90/

test xeon 6248r on physix96

now that some of the xeon 6248r are available, I can launch benchmarks on one of them (as 6248r is the cpu that we are planning to purchgase if we choose intel)

graffy@physix96:/mnt/workl/graffy/bug3177$ ./cohab_batch.bash 48 physix96 10