Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory leak when using OpenSSL and threads #1942

Closed
wants to merge 12 commits into from

Conversation

ashman-p
Copy link
Contributor

'run_tests' memory leak tests fails when build options enables OQS_USE_PTHREADS and OQS_USE_OPENSSL.

valgrind tests/test_sig SPHINCS+-SHA2-256s-simple
==158054== Memcheck, a memory error detector
==158054== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==158054== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==158054== Command: tests/test_sig SPHINCS+-SHA2-256s-simple
==158054== 
Testing signature algorithms using liboqs version 0.10.2-dev
Configuration info
==================
Target platform:  x86_64-Linux-5.4.0-126-generic
Compiler:         gcc (9.4.0)
Compile options:  [-Wa,--noexecstack;-O3;-fomit-frame-pointer;-fdata-sections;-ffunction-sections;-Wl,--gc-sections;-Wbad-function-cast]
OQS version:      0.10.2-dev
Git commit:       f47817d8cb0cd621dada5276f30f99934f3132a9
OpenSSL enabled:  Yes (OpenSSL 3.3.3.8.3.0-dev )
AES:              NI
SHA-2:            OpenSSL
SHA-3:            OpenSSL
OQS build flags:  OQS_DIST_BUILD OQS_LIBJADE_BUILD OQS_OPT_TARGET=generic CMAKE_BUILD_TYPE=Release 
CPU exts active:  AES AVX AVX2 BMI1 BMI2 PCLMULQDQ POPCNT SSE SSE2 SSE3
================================================================================
Sample computation for signature SPHINCS+-SHA2-256s-simple
================================================================================
verification passes as expected
==158054== 
==158054== HEAP SUMMARY:
==158054==     in use at exit: 240 bytes in 1 blocks
==158054==   total heap usage: 7,654 allocs, 7,653 frees, 844,712 bytes allocated
==158054== 
==158054== LEAK SUMMARY:
**==158054==    definitely lost: 240 bytes in 1 blocks**
==158054==    indirectly lost: 0 bytes in 0 blocks
==158054==      possibly lost: 0 bytes in 0 blocks
==158054==    still reachable: 0 bytes in 0 blocks
==158054==         suppressed: 0 bytes in 0 blocks
==158054== Rerun with --leak-check=full to see details of leaked memory
==158054== 
==158054== For lists of detected and suppressed errors, rerun with: -s
==158054== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Fixes #1941.

  • Does this PR change the input/output behaviour of a cryptographic algorithm (i.e., does it change known answer test values)? (If so, a version bump will be required from x.y.z to x.(y+1).0.)
  • Does this PR change the list of algorithms available -- either adding, removing, or renaming? Does this PR otherwise change an API? (If so, PRs in fully supported downstream projects dependent on these, i.e., oqs-provider will also need to be ready for review and merge by the time this is merged.)

@ashman-p ashman-p requested a review from dstebila as a code owner September 29, 2024 05:17
@ashman-p ashman-p self-assigned this Sep 29, 2024
@ashman-p ashman-p marked this pull request as draft September 30, 2024 03:48
@ashman-p ashman-p marked this pull request as ready for review September 30, 2024 03:50
@ashman-p ashman-p marked this pull request as draft September 30, 2024 03:50
@ashman-p ashman-p marked this pull request as ready for review September 30, 2024 05:59
@dstebila
Copy link
Member

This is present in the 0.11.0 release, I guess? Is it critical to fix?

Do we have a CI job that tests this configuration?

@SWilson4
Copy link
Member

This is present in the 0.11.0 release, I guess? Is it critical to fix?

Do we have a CI job that tests this configuration?

For what it's worth, OQS_USE_PTHREADS is not intended to be exposed to the user as a config variable; it's set automatically based on the availability of pthreads and not documented in CONFIGURE.md. I would guess that all of our Linux and macOS CI systems have pthreads, so we are probably not running any tests without pthreads enabled—which makes me wonder why we haven't previously detected this leak. I'm hoping that maybe this leak only arises when OQS_USE_PTHREADS is set manually instead of being autoset by CMake.

I'm not able to reproduce the leak myself—@ashman-p could you let me know what Linux environment you're running so I can test various configs out in Docker images?

@ashman-p
Copy link
Contributor Author

I'm not able to reproduce the leak myself—@ashman-p could you let me know what Linux environment you're running so I can test various configs out in Docker images?

@SWilson4 , Thanks for checking on this issue.
I just have a generic linux setup.
Linux nashley-dev 5.4.0-126-generic #142-Ubuntu SMP Fri Aug 26 12:12:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

liboqs Build

cmake -DBUILD_SHARED_LIBS=ON -DOQS_USE_OPENSSL=ON -DCMAKE_BUILD_TYPE=Debug -GNinja ..
-- The C compiler identification is GNU 9.4.0
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Performing Test CC_SUPPORTS_WA_NOEXECSTACK
-- Performing Test CC_SUPPORTS_WA_NOEXECSTACK - Success
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Algorithms filtered for KEM_ml_kem_512;KEM_ml_kem_768;KEM_ml_kem_1024;KEM_ml_kem_512_ipd;KEM_ml_kem_768_ipd;KEM_ml_kem_1024_ipd;SIG_ml_dsa_44;SIG_ml_dsa_65;SIG_ml_dsa_87;SIG_sphincs_sha2_128f_simple;SIG_sphincs_sha2_128s_simple;SIG_sphincs_sha2_256f_simple;SIG_sphincs_sha2_256s_simple;SIG_sphincs_shake_128f_simple;SIG_sphincs_shake_128s_simple;SIG_sphincs_shake_256f_simple
-- Found OpenSSL: /home/ubuntu/openssl3.3/ossl-install/lib/libcrypto.so (Required is at least version "1.1.1")
-- Looking for aligned_alloc
-- Looking for aligned_alloc - found
-- Looking for posix_memalign
-- Looking for posix_memalign - found
-- Looking for memalign
-- Looking for memalign - found
-- Looking for explicit_bzero
-- Looking for explicit_bzero - found
-- Looking for explicit_memset
-- Looking for explicit_memset - not found
-- Looking for memset_s
-- Looking for memset_s - not found
-- Found Doxygen: /usr/bin/doxygen (found version "1.8.17") found components: doxygen dot
-- Configuring done
-- Generating done

@ashman-p
Copy link
Contributor Author

ashman-p commented Oct 1, 2024

@SWilson4, the problem occurs when use of OQS_USE_SHA3_OPENSSL=ON and threading is active. The only significance of the use of threads is that the problem went away when threading was disabled.

@SWilson4
Copy link
Member

SWilson4 commented Oct 2, 2024

OK, I've managed to reproduce the leak, but only when building against OpenSSL >= 3.3.2. In particular, the leak does not occur when building against OpenSSL 3.3.1 with the same configuration.

It seems to me that this might actually be an OpenSSL bug rather than a liboqs bug, especially as the fix proposed here should (?) be unnecessary: per the OpenSSL docs:

As of version 1.1.0 OpenSSL will automatically allocate all resources that it needs so no explicit initialisation is required. Similarly it will also automatically deinitialise as required.

Looping in @baentsch @romen @beldmit @levitte: any knowledge of a regression going from 3.3.1 to 3.3.2?

I will try to isolate the exact commit which introduces the leak.

@beldmit
Copy link
Contributor

beldmit commented Oct 2, 2024

At first glance nothing suspicious

@SWilson4
Copy link
Member

SWilson4 commented Oct 2, 2024

I will try to isolate the exact commit which introduces the leak.

It seems that this leak was introduced in openssl/openssl@83efcfd. When building against the parent of that commit (openssl/openssl@a13df68) the leak does not occur.

@SWilson4
Copy link
Member

SWilson4 commented Oct 2, 2024

^ fyi @beldmit

@beldmit
Copy link
Contributor

beldmit commented Oct 2, 2024

@nhorman could you please take a look?

@nhorman
Copy link

nhorman commented Oct 2, 2024

FWIW, i don't see anything expressly wrong with the changes, its fine to call OPENSSL_init_crypto in your application rather than having the library do it as needed, but I'm having a hard time understanding:

  1. Where exactly the leak is
  2. How manually calling OpenSSL_init_crypto fixes it

I'll try reproduce here, but since you already have it set up, can you re-run valgrind with --leak-check=full to show the stack trace of where the leaked memory is being allocated?

@SWilson4
Copy link
Member

SWilson4 commented Oct 2, 2024

Sure, here's the valgrind output:

==462229== 
==462229== HEAP SUMMARY:
==462229==     in use at exit: 240 bytes in 1 blocks
==462229==   total heap usage: 6,841 allocs, 6,840 frees, 845,217 bytes allocated
==462229== 
==462229== 240 bytes in 1 blocks are definitely lost in loss record 1 of 1
==462229==    at 0x4846828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==462229==    by 0x4AEC3C3: CRYPTO_malloc (mem.c:202)
==462229==    by 0x4AEC44D: CRYPTO_zalloc (mem.c:222)
==462229==    by 0x4B06246: ossl_rcu_read_lock (threads_pthread.c:410)
==462229==    by 0x49D7FB8: module_find (conf_mod.c:403)
==462229==    by 0x49D7A9D: module_run (conf_mod.c:259)
==462229==    by 0x49D782F: CONF_modules_load (conf_mod.c:166)
==462229==    by 0x49D796A: CONF_modules_load_file_ex (conf_mod.c:215)
==462229==    by 0x49D8CB9: ossl_config_int (conf_sap.c:68)
==462229==    by 0x4AEAF13: ossl_init_config (init.c:258)
==462229==    by 0x4AEAEF5: ossl_init_config_ossl_ (init.c:256)
==462229==    by 0x4F31EC2: __pthread_once_slow (pthread_once.c:116)
==462229== 
==462229== LEAK SUMMARY:
==462229==    definitely lost: 240 bytes in 1 blocks
==462229==    indirectly lost: 0 bytes in 0 blocks
==462229==      possibly lost: 0 bytes in 0 blocks
==462229==    still reachable: 0 bytes in 0 blocks
==462229==         suppressed: 0 bytes in 0 blocks
==462229== 
==462229== For lists of detected and suppressed errors, rerun with: -s
==462229== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

OpenSSL is built with

cd ~/openssl
git checkout 83efcfdfa1
./config --debug --prefix=`pwd`/../.localopenssl_83efcfdfa1_debug && make -j 4 && make install_sw install_ssldirs

liboqs is built with

cd ~/liboqs
git checkout 7f4c89b2
mkdir build && cd build
cmake -GNinja -DOPENSSL_ROOT_DIR=../.localopenssl_83efcfdfa1_debug -DOQS_MINIMAL_BUILD="SIG_ml_dsa_65" -DOQS_USE_SHA3_OPENSSL=ON -DCMAKE_BUILD_TYPE=Debug ..
ninja

The triggering command is

valgrind --leak-check=full ~/liboqs/build/tests/test_sig ML-DSA-65

@nhorman
Copy link

nhorman commented Oct 2, 2024

Thank you. Thats interesting, I just tried to recreate with the head of the master branch, and didn't encounter the leak. Given that you reported the issue was introduced with commit 83efcfd (which is a backport of the original commit to the 3.3 branch, that makes me think we fixed something in master that we didn't backport. Let me see if I can find it

@SWilson4
Copy link
Member

SWilson4 commented Oct 2, 2024

Thank you. Thats interesting, I just tried to recreate with the head of the master branch, and didn't encounter the leak. Given that you reported the issue was introduced with commit 83efcfd (which is a backport of the original commit to the 3.3 branch, that makes me think we fixed something in master that we didn't backport. Let me see if I can find it

Weird—I'm also getting the leak on the latest master (openssl/openssl@c262cc0).

@nhorman
Copy link

nhorman commented Oct 2, 2024

hmm, what version of valgrind are you using? I'm on valgrind-3.23.0

@nhorman
Copy link

nhorman commented Oct 2, 2024

also, what does your openssl.conf look like? Its possible I'm not loading any modules, and as such not triggering the allocation thats leaking

@nhorman
Copy link

nhorman commented Oct 2, 2024

In fact, I'm sure its my config, I just ran it through gdb and never hit a breakpoint in ossl_rcu_read_lock

@nhorman
Copy link

nhorman commented Oct 2, 2024

There we go, if I load the oqs provider (duh, should have thought of that), I see the leak. Now to figure out why

@SWilson4
Copy link
Member

SWilson4 commented Oct 2, 2024

I'm on valgrind-3.22.0, and I haven't made any manual changes to openssl.cnf (including to load any provider).

I'm running everything in a Docker container on x86_64 / Linux; I'll attach a Dockerfile to duplicate my setup in case it's helpful to you.

@SWilson4
Copy link
Member

SWilson4 commented Oct 2, 2024

FROM openquantumsafe/ci-ubuntu-latest:latest

WORKDIR /root

RUN git clone --depth=1 --branch=0.11.0 https://github.com/open-quantum-safe/liboqs.git liboqs
RUN git clone --branch=master https://github.com/openssl/openssl.git openssl

WORKDIR /root/openssl

RUN ./config --debug --prefix=/root/.localopenssl && make -j && make install_sw install_ssldirs

WORKDIR /root/liboqs

RUN mkdir build

WORKDIR ./build

RUN cmake -GNinja -DOPENSSL_ROOT_DIR=../.localopenssl -DOQS_MINIMAL_BUILD="SIG_ml_dsa_65" -DOQS_USE_SHA3_OPENSSL=ON -DCMAKE_BUILD_TYPE=Debug ..
RUN ninja

@nhorman
Copy link

nhorman commented Oct 2, 2024

Ok, think I found the problem. openssl/openssl@83efcfd added a use of ossl_init_thread_start to the code base, for which there is only a handful of users. ossl_init_thread_start registers a handler function to clean up thread local data, not when the thread exits explicitly, but rather when the library context against which the thread allocates resources is deleted. I'm not 100% sure why we do this, but its how it works.

Normally its not a problem. Library contexts get deleted in OSSL_LIB_CTX_free, that calls context_deinit->ossl_ctx_thread_stop->init_thread_stop, and that calls all the handlers registered via ossl_init_thread_start, and everything gets cleaned up.

However, in liboqs, you implicitly call the OpenSSL_cleanup routine via the registered atexit() handler, which again, is ok, but, it means that in test_sig.c (and likely your other tests), you join and exit the thread prior to OpenSSL getting cleaned up. It appears whats happening is that, on exit, the C library is cleaning up its thread local data list (I feel like I investigated this before), and, not having an cleanup handler registered (recalling from above that we register the local data handler with libcrypto, not with libc), and so it just NULL's the pointer in the thread local data table in the c library. As a result when OpenSSL_cleanup is called (either explicitly or implicitly), we run through the libcrypto registered cleanup handlers, the handler for rcu gets called (ossl_rcu_free_local_data), and when we call CRYPTO_THREAD_get_local to fetch the local data to free (which maps to pthread_getspecific) it returns NULL, as libc already expunged that pointer. As a result the local data leaks.

This API could use some reworking, as It seems to me that it relies on implementation details in libc that aren't guaranteed (I think in the past those pointers stuck around until after all the exit handlers ran, but I need to look further). The documentation here:
https://docs.openssl.org/3.0/man3/OPENSSL_init_crypto/#description
Is sort of vague on the subject, indicating that the cleanup should be done on thread exit, but thats not really true, as its actually done on behalf of the thread when the library context is cleaned.

For the time being, as any fix here would be an ABI break, I think its something we need to live with. The good news(?) is that theres an API to manage this, and theres precedent for it in our tests. OPENSSL_thread_stop is used to explicitly clean up a threads local data prior to exiting, which can be used here. You can see it used in situations similar/identical to yours in drbgtest.c and threadstest.h in which thread wrappers like thread_run() call it prior to exiting for just this reason.

You will likely need to do this for your various thread enabled tests, but this patch:

diff --git a/tests/test_sig.c b/tests/test_sig.c
index e94d3034..bd5e4728 100644
--- a/tests/test_sig.c
+++ b/tests/test_sig.c
@@ -9,6 +9,7 @@
 #include <string.h>
 
 #include <oqs/oqs.h>
+#include <openssl/crypto.h>
 
 #if OQS_USE_PTHREADS
 #include <pthread.h>
@@ -183,6 +184,7 @@ struct thread_data {
 void *test_wrapper(void *arg) {
        struct thread_data *td = arg;
        td->rc = sig_test_correctness(td->alg_name);
+       OPENSSL_thread_stop();
        return NULL;
 }
 #endif

which I tested here, resolves the leak for me

@ashman-p
Copy link
Contributor Author

ashman-p commented Oct 4, 2024

@SWilson4, I can complete the code changes if you haven't already done them. But can you work out the documentation?

@SWilson4
Copy link
Member

SWilson4 commented Oct 4, 2024

@SWilson4, I can complete the code changes if you haven't already done them. But can you work out the documentation?

Sounds like a plan, thanks @ashman-p.

@ashman-p ashman-p marked this pull request as draft October 5, 2024 01:43
@ashman-p
Copy link
Contributor Author

ashman-p commented Oct 8, 2024

@baentsch, @nhorman, can you give me any pointers on resolving the errors with 'dlopen' builds? Undefined ref to OPENSSL_thread_stop when -DOQS_DLOPEN_OPENSSL=ON is used?
Thanks.

@baentsch
Copy link
Member

baentsch commented Oct 8, 2024

@baentsch, @nhorman, can you give me any pointers on resolving the errors with 'dlopen' builds? Undefined ref to OPENSSL_thread_stop when -DOQS_DLOPEN_OPENSSL=ON is used? Thanks.

Tagging @ueno as the author of that option.

@ueno
Copy link
Contributor

ueno commented Oct 8, 2024

@baentsch, @nhorman, can you give me any pointers on resolving the errors with 'dlopen' builds? Undefined ref to OPENSSL_thread_stop when -DOQS_DLOPEN_OPENSSL=ON is used? Thanks.

With dlopen builds, OpenSSL functions are supposed to be called in a form OSSL_FUNC(function)(args), after including src/common/ossl_helpers.h. The problem, however, is that the header file is not exposed to the tests. My suggestion (haven't tried it yet) is to create a thin wrapper around OPENSSL_thread_stop, say OQS_thread_stop in src/common/common.c, expose it as an internal API (i.e., without OQS_API), and use it throughout the tests.

@baentsch
Copy link
Member

baentsch commented Oct 8, 2024

say OQS_thread_stop in src/common/common.c, expose it as an internal API [...] and use it throughout the tests

OK -- that might solve the issue for "internal use" (tests), but isn't the proposal to expose OQS_thread_stop externally (see above)?

tests/test_sig_stfl.c Outdated Show resolved Hide resolved
tests/test_sig.c Outdated Show resolved Hide resolved
tests/test_kem.c Outdated Show resolved Hide resolved
ashman-p and others added 12 commits October 21, 2024 23:45
Signed-off-by: Norman Ashley <[email protected]>
Signed-off-by: Steen Rasmussen <[email protected]>
Co-authored-by: Steen Rasmussen <[email protected]>
Signed-off-by: Norman Ashley <[email protected]>
Remove OPENSSL build-time macro from helper functions.

Signed-off-by: Norman Ashley <[email protected]>
Use of OQS_DLOPEN_OPENSSL needs a new API change to expose an OQS function that calls  OPENSSL_thread_stop

Signed-off-by: Norman Ashley <[email protected]>
Use of OQS_DLOPEN_OPENSSL needs a new API change to expose an OQS function that calls  OPENSSL_thread_stop

Signed-off-by: Norman Ashley <[email protected]>
Use of OQS_DLOPEN_OPENSSL needs a new API change to expose an OQS function that calls  OPENSSL_thread_stop

Signed-off-by: Norman Ashley <[email protected]>
@ashman-p
Copy link
Contributor Author

This last commit isolates the memory leak fix when dlopen is used. A separate will be created to track the addition of a new OQS API needed to address this. I hope this is a reasonable and agreeable approach as discussed earlier in a status meeting.

@ashman-p ashman-p marked this pull request as ready for review October 22, 2024 04:16
@SWilson4
Copy link
Member

It looks like some unrelated changes to CBOM-related files snuck into the diff. Possibly a rebase gone wrong @ashman-p?

@ashman-p
Copy link
Contributor Author

It looks like some unrelated changes to CBOM-related files snuck into the diff. Possibly a rebase gone wrong @ashman-p?

Yeah, sorry about that. I am going to punt this PR and start over. New PR is
here.

@ashman-p
Copy link
Contributor Author

Starting over...

@ashman-p ashman-p closed this Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'ninja run_tests' fails due to memory leak
9 participants