
Native SynchronizedObject implementation allocates too much #412

Open
haitaka opened this issue Mar 21, 2024 · 2 comments
haitaka commented Mar 21, 2024

The native SynchronizedObject implementation allocates an instance of LockState on every lock() and unlock() call.

I've compared the performance of SynchronizedObject and a simple non-allocating spin-lock in the uncontended case. The benchmarks can be found here. In my case, the reported metrics were as follows (higher is better):

macosArm64 summary:
Benchmark                                Mode  Cnt         Score        Error    Units
AtomicFUBenchmark.benchmarkUncontended  thrpt    5  25905005.328 ±  96180.326  ops/sec
NoalLockBenchmark.benchmarkUncontended  thrpt    5  65159902.404 ± 126124.691  ops/sec
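
For reference, such a non-allocating spin-lock can be as small as this (a minimal sketch using atomicfu's atomic API; the name SpinLock is mine, and this is not the exact benchmark code):

    import kotlinx.atomicfu.atomic

    // The whole lock state is a single Int, so lock()/unlock() allocate
    // nothing, unlike SynchronizedObject, which allocates a LockState
    // instance per operation.
    class SpinLock {
        private val locked = atomic(0)

        fun lock() {
            // Busy-wait until we win the CAS from 0 (free) to 1 (held).
            while (!locked.compareAndSet(0, 1)) { /* spin */ }
        }

        fun unlock() {
            locked.value = 0
        }
    }

Note that such a lock is not reentrant, which is exactly the constraint discussed further down the thread.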

Compose Multiplatform heavily utilizes atomicfu's SynchronizedObject under the hood, and the LockState instances are its most frequent source of heap allocations.
Certain Compose Multiplatform benchmarks show a noticeable improvement in missed-frame ratio (about 1/3 better) if the SynchronizedObject implementation is replaced with a non-allocating spin-lock.

qwwdfsad self-assigned this Mar 21, 2024
haitaka pushed a commit to haitaka/kotlinx-atomicfu that referenced this issue Apr 8, 2024
…tion (Kotlin#412)

    * Pack "thin" lock state in a single ptr-word
    * Spin when "fat" state is required (it's harder to pack in a single word)
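
The packing idea from that commit message might look roughly like this (a hypothetical encoding for illustration only; the actual layout in the commit may differ):

    // Hypothetical word layout, tagged by the lowest bit:
    //   thin state (tag = 1): owner thread id in the high bits,
    //                         reentrancy count in bits 1..16
    //   fat state  (tag = 0): a reinterpreted pointer to a native lock
    //                         (aligned pointers have a zero low bit)
    private const val THIN_TAG = 1L

    fun packThin(ownerTid: Long, nesting: Int): Long =
        (ownerTid shl 17) or (nesting.toLong() shl 1) or THIN_TAG

    fun isThin(word: Long): Boolean = (word and 1L) == THIN_TAG
    fun thinOwner(word: Long): Long = word ushr 17
    fun thinNesting(word: Long): Int = ((word ushr 1) and 0xFFFF).toInt()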
qwwdfsad (Member) commented:

Our findings so far (copying my internal response here for the record):

====

I'll give a short overview of what is wrong with synchronized objects and why it is hard to match what a plain spin-lock achieves:

  • The lock has to be reentrant (and this is a really tough constraint to dance around, e.g. it immediately rules out simple algorithms like Benaphores)
  • The underlying platform (K/N) has no notion of threads or thread parking; the best we can rely on is .def-based POSIX primitives. Even permit-based parking would have to be implemented from scratch, probably using futexes or something like that
  • To switch between a thin reentrant lock and a POSIX lock without permit-based parking, we have to juggle a primitive (the POSIX thread id) that we cannot box and a potentially relocatable object (the POSIX lock), which is a tough thing to do (see the sketch after this list)
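
For context, the ".def-based POSIX primitives" route looks roughly like this in Kotlin/Native (a sketch only: it assumes the target exposes PTHREAD_MUTEX_RECURSIVE, and the class name is mine). The mutex is placed in natively allocated memory precisely because, as noted above, the lock object itself must not be relocated:

    import kotlinx.cinterop.*
    import platform.posix.*

    @OptIn(ExperimentalForeignApi::class)
    class PosixReentrantLock {
        // Native (non-moving) storage for the mutex.
        private val arena = Arena()
        private val mutex = arena.alloc<pthread_mutex_t>()

        init {
            memScoped {
                val attr = alloc<pthread_mutexattr_t>()
                pthread_mutexattr_init(attr.ptr)
                // Reentrancy is delegated to the OS lock.
                pthread_mutexattr_settype(attr.ptr, PTHREAD_MUTEX_RECURSIVE)
                pthread_mutex_init(mutex.ptr, attr.ptr)
                pthread_mutexattr_destroy(attr.ptr)
            }
        }

        fun lock() { pthread_mutex_lock(mutex.ptr) }
        fun unlock() { pthread_mutex_unlock(mutex.ptr) }
        fun dispose() { pthread_mutex_destroy(mutex.ptr); arena.clear() }
    }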

Options I see I can pursue:

  • Look into the platform's POSIX implementation. Chances are high that on some platforms, things like pthread_mutex_lock do not step into the kernel when there is no contention
  • Do some bit-twiddling with hard assumptions to pack "either a reinterpreted pointer to a POSIX lock, or the ID of the thread that holds the lock" into one word. An honorable mention goes to the authors of pthread_t, who decided to define it not only as an arbitrary-size integer without any constraints (e.g. it can be negative), but potentially as a struct as well
  • Do something in between, like trying to abuse @ThreadLocal as the thread-id storage, and see whether it yields any decent performance (a rough sketch follows below)
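
The @ThreadLocal idea from the last option could be sketched like this (hypothetical code: the names ReentrantSpinLock, currentThreadId, etc. are mine, and it spins instead of parking):

    import kotlin.native.concurrent.ThreadLocal
    import kotlinx.atomicfu.atomic

    // A process-unique id per thread, kept in a @ThreadLocal so the lock
    // never has to box or compare raw pthread_t values.
    private val nextThreadId = atomic(1L)

    @ThreadLocal
    private var myThreadId = 0L

    private fun currentThreadId(): Long {
        if (myThreadId == 0L) myThreadId = nextThreadId.getAndIncrement()
        return myThreadId
    }

    // A reentrant spin-lock keyed on that id.
    class ReentrantSpinLock {
        private val owner = atomic(0L)  // id of the owning thread, 0 if free
        private var nesting = 0         // touched only by the owner

        fun lock() {
            val me = currentThreadId()
            if (owner.value == me) { nesting++; return }  // reentrant acquire
            while (!owner.compareAndSet(0L, me)) { /* spin */ }
            nesting = 1
        }

        fun unlock() {
            if (--nesting == 0) owner.value = 0L
        }
    }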

qwwdfsad (Member) commented:

So while we cannot reasonably fix this one yet, the outcomes are:

  • Compose folks are recommended to use SynchronizedObject more carefully. For their specific benchmark (access to a copy-on-write thread-local map), @ThreadLocal or a spin-lock is recommended
  • The current locks are unstable and have non-trivial performance characteristics. Even though they are undocumented, they are being used, so we should warn users about that
