Skip to content

Releases: ashvardanian/SimSIMD

Release v6.3.0

24 Jan 15:36
Compare
Choose a tag to compare

Release: v6.3.0 [skip ci]

Minor

  • Add: simsimd_flush_denormals (63af257)

Patch

  • Make: Faster cibuildwheel releases (e54e939)
  • Make: Fix CI instance label (60048b4)
  • Make: Use newer Python for cibuildwheel (e922019)
  • Make: Skip 32-bit Windows Python images (372480a)
  • Make: Use newer image for Arm CI (a63f55f)
  • Make: Patch pyproject.toml (b3e35a9)
  • Make: Skip PyPy builds (1fe7faa)
  • Make: test-command Windows compatibility (dea5b71)
  • Make: Skip armv7l PyPi builds (9227c86)
  • Make: Skip missing ARM builds (e1de8ee)
  • Make: Use default images (4d9e6e7)
  • Make: Separate PyPi CI jobs (b690f7b)
  • Improve: Naming CI steps (d883b9d)
  • Make: Try newer cibuildwheel==2.22 (77898ac)
  • Improve: Test flushing denormalized values (9685346)
  • Improve: Code styling (adbbd9c)
  • Improve: Prefer Asm in _simsimd_flush_denormals_x86 (7c54935)
  • Docs: Spelling (114ff7d)

Release v6.2.3

21 Dec 18:53
Compare
Choose a tag to compare

Release: v6.2.3 [skip ci]

Patch

  • Docs: Navigating the codebase (c375e3b)
  • Docs: FMA ports details (7c87779)

Release v6.2.2

17 Dec 20:36
Compare
Choose a tag to compare

Release: v6.2.2 [skip ci]

Patch

Release v6.2.1

27 Nov 12:44
Compare
Choose a tag to compare

Release: v6.2.1 [skip ci]

Patch

  • Fix: -Wvla warnings (671be9f)
  • Docs: Harley-Seal plans for binary kernels (45dbe6e)
  • Docs: Cleaner benchmarks table (a39419c)

v6.2: Complex Bilinear Forms for Physics ⚛️

26 Nov 15:19
Compare
Choose a tag to compare
  • Technical highlight of the release: AVX-512 masking is extremely handy in implementing unrolled BLAS Level 2 operations for small inputs, resulting in up to 5x faster kernels than OpenBLAS.
  • Semantic highlight of the release: Bilinear forms now support complex numbers as inputs, extending the kernels' applicability to Computational Physics.

Bilinear Forms are essential in Scientific Computing. Some of the most computationally intensive cases arise in Quantum systems and their simulations, as discussed on r/Quantum. This PR adds support for complex inputs to make it more broadly applicable.

$$\text{BilinearForm}(a, b, M) = a^T M b$$

In Python, you can execute this by consecutively calling 2 NumPy functions. Ideally, reusing a buffer for the intermediate results:

ndim = 128

import numpy as np
dtype = np.float32
temporary_vector = np.empty((ndim, ), dtype=dtype)

first_quantum_state = np.random.randn(ndim).astype(dtype)
second_quantum_state = np.random.randn(ndim).astype(dtype)
interaction_matrix = np.random.randn(ndim, ndim).astype(dtype)

np.matmul(first_quantum_state, interaction_matrix, out=temporary_vector)
result: float = np.inner(temporary_vector, second_quantum_state)

With SimSIMD, the last 2 lines are fused:

import simsimd as simd
simd.bilinear(first_quantum_state, second_quantum_state, interaction_matrix)

For 128-dimensional np.float32, the latency of 2.11 μs with NumPy went down to 1.31 μs. For smaller 16-dimensional np.float32, the latency of 1.31 μs with NumPy went down to 202 ns. As always, the gap is wider for low-precision np.float16 representations: 2.68 μs with NumPy vs 313 ns with NumPy.

Small Matrices and AVX-512

In the past, developers were used to providing separate precompiled kernels for every reasonable matrix size when dealing with small matrices. That negatively affects the binary size and makes CPU L1i instruction caches ineffective. With AVX-512, however, for different matrix sizes, we can reuse the same single-instruction vectorized loops with just a single additional BZHI instruction precomputing the load masks.

Avoiding Data Dependency

A common approach in dot products is to use a single register to accumulate dot products. That VFMADD132PS instruction:

  • AMD Zen 4 has a latency of 4 cycles and can execute on ports 0 and 1.
  • Intel Skylake-X has a latency of 4 cycles and can execute on ports 0 and 5.

Assuming it can run on 2 ports simultaneously, even on modern hardware, introducing data dependency between consecutive statements is inefficient. In future generations, we may be able to compute this on more ports, so to "futureproof" the solution, I use 4 intermediaries.

Avoiding Horizontal Reductions

When computing $a \dot X \dot b$, we may prefer to evaluate $X \dot b$ first due to the associativity of matrix multiplication. On tiny inputs, the operation may be bottlenecked by computing horizontal reductions for every one of the rows in $X$. Instead, we use more serial loads and broadcasts but only perform one horizontal accumulation in the end, assuming all of the needed intermediaries fit into a single register (or a few if we minimize the data dependency).

Intel Sapphire Rapids Benchmarks

Running on recent Intel Sapphire Rapids CPUs, one can expect the following performance metrics for 128-dimensional Bilinear Forms for SimSIMD and OpenBLAS:

-----------------------------------------------------------------------------------------------------------------
Benchmark                                                       Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------
bilinear_f64_blas<128d>/min_time:10.000/threads:1            3584 ns         3584 ns      3906234 abs_delta=3.8576a bytes=571.503M/s pairs=279.054k/s relative_error=1.45341f
bilinear_f64c_blas<128d>/min_time:10.000/threads:1           7605 ns         7604 ns      1856665 abs_delta=3.90906a bytes=538.656M/s pairs=131.508k/s relative_error=3.10503f
bilinear_f32_blas<128d>/min_time:10.000/threads:1            1818 ns         1818 ns      7621072 abs_delta=743.294p bytes=563.325M/s pairs=550.122k/s relative_error=301.396n
bilinear_f32c_blas<128d>/min_time:10.000/threads:1           3607 ns         3606 ns      3886483 abs_delta=958.531p bytes=567.864M/s pairs=277.278k/s relative_error=1.4445u
bilinear_f16_haswell<128d>/min_time:10.000/threads:1         1324 ns         1324 ns     10597225 abs_delta=1.31674n bytes=386.742M/s pairs=755.355k/s relative_error=851.968n
bilinear_bf16_haswell<128d>/min_time:10.000/threads:1        1305 ns         1305 ns     10752131 abs_delta=1.33001n bytes=392.464M/s pairs=766.532k/s relative_error=561.046n
bilinear_bf16_genoa<128d>/min_time:10.000/threads:1           862 ns          862 ns     16241596 abs_delta=1.40284n bytes=593.885M/s pairs=1.15993M/s relative_error=849.533n
bilinear_bf16c_genoa<128d>/min_time:10.000/threads:1         2610 ns         2610 ns      5355435 abs_delta=351.596p bytes=392.313M/s pairs=383.118k/s relative_error=243.698n
bilinear_f16_sapphire<128d>/min_time:10.000/threads:1         875 ns          875 ns     16038203 abs_delta=10.5652u bytes=584.951M/s pairs=1.14248M/s relative_error=9.42998m
bilinear_f16c_sapphire<128d>/min_time:10.000/threads:1       2159 ns         2159 ns      6449575 abs_delta=4.43296u bytes=474.398M/s pairs=463.28k/s relative_error=3.98057m
bilinear_f64_skylake<128d>/min_time:10.000/threads:1         3483 ns         3483 ns      4019657 abs_delta=4.3853a bytes=587.96M/s pairs=287.09k/s relative_error=3.02046f
bilinear_f64c_skylake<128d>/min_time:10.000/threads:1        7178 ns         7178 ns      1949803 abs_delta=3.45547a bytes=570.624M/s pairs=139.313k/s relative_error=4.07708f
bilinear_f32_skylake<128d>/min_time:10.000/threads:1        1783 ns         1783 ns      7848896 abs_delta=2.45041n bytes=574.255M/s pairs=560.796k/s relative_error=811.561n
bilinear_f32c_skylake<128d>/min_time:10.000/threads:1       3504 ns         3504 ns      3976879 abs_delta=1.94251n bytes=584.494M/s pairs=285.397k/s relative_error=2.99757u
bilinear_f64_serial<128d>/min_time:10.000/threads:1          5528 ns         5528 ns      2529904 abs_delta=0 bytes=370.459M/s pairs=180.888k/s relative_error=0
bilinear_f64c_serial<128d>/min_time:10.000/threads:1        12324 ns        12324 ns      1140788 abs_delta=0 bytes=332.371M/s pairs=81.1453k/s relative_error=0
bilinear_f32_serial<128d>/min_time:10.000/threads:1          5299 ns         5298 ns      2649614 abs_delta=1.69242n bytes=193.264M/s pairs=188.734k/s relative_error=776.834n
bilinear_f32c_serial<128d>/min_time:10.000/threads:1        10217 ns        10216 ns      1370535 abs_delta=1.89398n bytes=200.461M/s pairs=97.8816k/s relative_error=3.25219u
bilinear_f16_serial<128d>/min_time:10.000/threads:1         42372 ns        42371 ns       330369 abs_delta=1.93284n bytes=12.0838M/s pairs=23.6011k/s relative_error=1.51289u
bilinear_f16c_serial<128d>/min_time:10.000/threads:1        46101 ns        46100 ns       303997 abs_delta=1.77214n bytes=22.2124M/s pairs=21.6918k/s relative_error=1.5494u
bilinear_bf16_serial<128d>/min_time:10.000/threads:1        85325 ns        85324 ns       163256 abs_delta=1.34067n bytes=6.00066M/s pairs=11.72k/s relative_error=527.801n
bilinear_bf16c_serial<128d>/min_time:10.000/threads:1      178970 ns       178967 ns        78235 abs_delta=1.46323n bytes=5.72174M/s pairs=5.58764k/s relative_error=1004.88n

Highlights:

  • Single- and double-precision kernels are only about 5% faster than BLAS due to removed temporary buffer stores.
  • Both bf16 and f16 kernels provide linear speedups proportional to the number of bits in the data type.

On low-dimensional inputs, the performance gap is larger:

---------------------------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------
bilinear_f64_blas<8d>/min_time:10.000/threads:1            42.7 ns         42.7 ns    328247670 abs_delta=15.9107a bytes=3.00004G/s pairs=23.4378M/s relative_error=550.946a
bilinear_f64c_blas<8d>/min_time:10.000/threads:1           57.4 ns         57.4 ns    243896993 abs_delta=21.3452a bytes=4.46378G/s pairs=17.4366M/s relative_error=514.643a
bilinear_f32_blas<8d>/min_time:10.000/threads:1            32.2 ns         32.2 ns    434784869 abs_delta=6.73645n bytes=3.97757G/s pairs=31.0747M/s relative_error=235.395n
bilinear_f32c_blas<8d>/min_time:10.000/threads:1           50.6 ns         50.6 ns    276504577 abs_delta=7.97379n bytes=2.52823G/s pairs=19.7518M/s relative_error=251.204n
bilinear_f16_haswell<8d>/min_time:10.000/threads:1         13.7 ns         13.7 ns   1000000000 abs_delta=6.06053n bytes=9.35133G/s pairs=73.0573M/s relative_error=139.096n
bilinear_bf16_haswell<8d>/min_time:10.000/threads:1        13.0 ns         13.0 ns   1000000000 abs_delta=5.03892n bytes=9.84787G/s pairs=76.9365M/s relative_error=114.101n
bilinear_bf16_genoa<8d>/min_time:10.000/threads:1          12.6 ns         12.6 ns   1000000000 abs_delta=5.63947n bytes=10.1297G/s pairs=79.1384M/s relative_error=166.305n
bilinear_bf16c_genoa<8d>/min_time:10.000/threads:1         69.0 ns         69.0 ns    203022389 abs_delta=1.61581n bytes=1.85573G/s pairs=14.4979M/s relative_error=60.9203n
bilinear_f16_sapphire<8d>/min_time:10.000/threads:1        8.52 ns         8.52 ns   1000000000 abs_delta=51.4863u bytes=...
Read more

v6.1.2: Avoid AVX-512BW on Skylake-X

24 Nov 19:19
Compare
Choose a tag to compare

Previous AVX-512 implementation of complex products used an extra ZMM register for swap_adjacent_vec. Moreover, they used the vpshufb instruction available only with the Ice Lake capability and newer. The replacement uses the _mm512_permute_ps and its double-precision variant.

v6.1.1: Expose Set Intersections to Rust SDK

20 Nov 15:07
Compare
Choose a tag to compare

v6.1.0: Add canonical L2 to Rust SDK

20 Nov 07:26
Compare
Choose a tag to compare
Release: v6.1.0 [skip ci]

Release v6.0.7

19 Nov 11:20
Compare
Choose a tag to compare

Release: v6.0.7 [skip ci]

Release v6.0.6

18 Nov 22:59
Compare
Choose a tag to compare

Release: v6.0.6 [skip ci]