Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nng creates too many threads #1572

Closed
Jasper-Bekkers opened this issue Feb 1, 2022 · 23 comments
Closed

nng creates too many threads #1572

Jasper-Bekkers opened this issue Feb 1, 2022 · 23 comments

Comments

@Jasper-Bekkers
Copy link

Describe the bug
I'm running a 32-core/64-thread threadripper machine, and it looks like in total nng creates 151 threads.

While this could be the wanted case for some apps, for the application that I'm writing it's all around way too much, and clutters up the profiling tools quite a bit.

Thread name Thread count
nng:iocp 64
nng:ipc:dial 1
nng:resolver 4
nng:task 16
nng:reap2 1
nng:timer 1
nng:aio:expire 64

Expected behavior
Ideally, thread creation is left to the application instead of being done by the nng framework, that way I can control scheduling of work and fit it into my async execution runtime, instead of having nng take up resource (and arguably more important) screen real estate in the profiler.

In our application we're trying to severely limit our number of threads and schedule work on our own.

Actual Behavior
See above

To Reproduce
Simply run nng on a high core-count machine.

** Environment Details **

  • NNG version: we're using the nng-rs crate, I think it's a bit behind on latest nng
  • Operating system and version: Windows
  • Compiler and language used: rustc
  • Shared or static library: unknown (and probably irrelevant)

Additional context
I found an additional issue on limiting the amount of threads being spun up in #769 this seems to require compile time toggles? Ideally this is something that can be set at runtime (we have a few different use-cases all sharing the same codebase - one case where run the service inproc and want a low thread count, one where we run the service as a daemon and we don't really care about thread count).

@JaylinYu
Copy link
Contributor

JaylinYu commented Feb 6, 2022

This is expected behavior. But I agree with you that nng should have a way to set the num of threads at runtime ideally.
Hi @gdamore, please have a look at "var_num_taskq_threads" https://github.com/nanomq/nng/blob/master/src/core/taskq.c, I will prepare a PR if you are ok with my way.

@gdamore
Copy link
Contributor

gdamore commented Feb 6, 2022

The iocp and expire threads are created one per core to provide good scalability.

Having said that, I'm totally amenable to reducing this. Note that these threads should generally be sitting around idle unless you have extreme I/O pressure, in which case you actually probably want to have this many threads (although you wind up having a trade-off of application processing and NNG I/O processing.)

@gdamore
Copy link
Contributor

gdamore commented Feb 6, 2022

@JaylinYu those tunables won't really help this -- because the threads involved aren't taskq threads. I'm also pretty opposed to using globals to tune that -- if we need tuning, then local variables (parameters on the taskq) should be set.

We need to ensure that at least 2 threads are allocated for any taskq, as it's likely that deadlocks will occur if only one thread is allocated. I recommend 4 as a bare minimum actually.

@gdamore
Copy link
Contributor

gdamore commented Feb 6, 2022

I think the IOCP and expire threads should have a similar default limiter as the tasks... I think we max this to 16. Arguably these could be even smaller -- especially the expire thread can safely be reduced all the way down to 1. (It used to be one, but we started creating a lot more of them to improve scalability by reducing contention on locks -- each thread has it's own queue, and having one per thread tends to ensure we don't contend on the expiration queue locks, which were one of the larger areas of lock contention.

@Jasper-Bekkers
Copy link
Author

Full disclosure: Since filing this issue I've moved away from nng because of this particular issue.

Note that these threads should generally be sitting around idle unless you have extreme I/O pressure, in which case you actually probably want to have this many threads (although you wind up having a trade-off of application processing and NNG I/O processing.)

I highly doubt this is true, over subscribing a machine like that is almost never useful and just introduces overhead.

@gdamore
Copy link
Contributor

gdamore commented Feb 8, 2022

Idle threads not runnable threads don't hurt anything at all apart from the stack space they consume.

Where things get really ugly is when you have a large number of runnable threads. Then stuff falls apart.

There are unfortunately also some limitations in userland thread frameworks that wind up using way too many resources when allocating a thread (thread stacks) which is why you can't create 10,000 threads in user space harmlessly. (On modern kernels like illumos you can, but only in kernel space. 10,000 threads in that situations uses like 80MB of RAM for stack space, at 8K per thread.)

@gdamore
Copy link
Contributor

gdamore commented Feb 8, 2022

Btw, the thread count we allocate is based on the CPU core count of the machine.

@gdamore
Copy link
Contributor

gdamore commented Apr 16, 2022

Just a few more notes for anyone looking at this:

The 64 expiration threads are almost always idle. They represent a timer thread, and the reason we have so many (one timer per CPU) is to reduce pressure on the central lock used for expirations (timers). We found that in some workloads the expiration queue was tragically contended, and this was a simple solution.

We can probably look at alternative approaches to use fewer threads, since likely the expirations have some reasonable tolerances with respect to timing out.

The IOCP thread list is this way on Windows, as it represents the actual threads doing I/O. Conceivably this could better be done with a Windows thread pool. Note that on other platforms we don't have nearly this many threads, but that comes at a scalability cost as well.

The task threads are the ones doing real work, and likely running user callbacks.

It might be worth looking for ways for user supplied thread(s) to be able to chew on this instead of fixed allocation threads. This could ultimately be used for some kind of cooperative multitasking, although running user callbacks is always a bit dodgy since we have no long those will take to complete.

@JochenBaier
Copy link

The high amount of thread is also a blocker for me. Basically i want to use the Request/Reply
pattern (https://nanomsg.org/gettingstarted/nng/reqrep.html) with only 2 nodes.

On my 6 core machine (Windows 10) this creates 64 threads…

Patching the limits in source would also be fine for me.

@wsxjbupt
Copy link

Is there a way to close redundant threads without modifying the source code?
We only use nng for thread-to-thread communication through a simple P2P approach.
The number of threads is also a big obstacle in our system.

@gdamore

@bionicbeagle
Copy link

I've been evaluating nng for use in some tooling and this issue is a blocker for us as well. Much like @Jasper-Bekkers we often use machines with high core counts and the number of threads is a problem for debuggability and performance profiling since the threads cause lots of clutter as well as making the library needlessly resource intensive regardless of whether threads do actual work or not. In the case of IOCP especially there's typically no real need to have a huge number of threads as long as you only do I/O related work in the completion handlers.

We are currently consuming nng via vcpkg and therefore using build-time defines to reconfigure nng is not really a great option. Ideally there would be an API to allow application-specific tuning. I'm currently not familiar enough with the project to propose what that would look like but might take a stab at it if there is no attractive alternative.

@bionicbeagle
Copy link

bionicbeagle commented Nov 30, 2022

I made a quick-and-dirty local patch to enable initialization-time tweaking of thread counts. It's still a bit rough around the edges but works well enough for me to evaluate.

It adds the following API functions:

void nng_set_ncpu_max(int);  // limit the number of CPUs reported
void nng_set_pool_thread_limit_min(int); // configure minimum number of threads in thread pools
void nng_set_pool_thread_limit_max(int); // configure maximum number of threads in thread pools
void nng_set_resolve_thread_max(int); // configure number of resolve threads

patch is here

I wasn't sure what would be a good naming scheme for these as I didn't see any similar function to mimic, feedback is welcomed! (I already know nng_set_resolve_thread_max is a misnomer as it sets the actual count)

@gdamore
Copy link
Contributor

gdamore commented Apr 20, 2023

So initially Im going to add compile time defaults that are sensible. I may add runtime for these as well.

@gdamore gdamore changed the title nng creates a massive amount of threads on high core count machines nng creates too many expire threads Apr 20, 2023
@gdamore
Copy link
Contributor

gdamore commented Apr 24, 2023

I'm using 1619 as the EXPIRE_THREADS tunable, and a PR is out... (I'm waiting to merge that as I've uncovered some other worse bugs that I want to ensure are fixed first.)

I expect this will merge.

I'll revisit the poller for Windows to do something similar for it as well. Longer term we need to look at ways to improve the parallelism in the pollers anyway -- in nanomq they are running into scalability issues beyond 64 cores with the single epoll() thread.

@gedalia
Copy link

gedalia commented Jun 23, 2023

did a fix ever merge?

@ayizhi
Copy link

ayizhi commented Aug 22, 2023

Hope these threads num settings functions apply as soon as possible, sincerely, respect

@xinxinxiangying258
Copy link

Describe the bug I'm running a 32-core/64-thread threadripper machine, and it looks like in total nng creates 151 threads.

While this could be the wanted case for some apps, for the application that I'm writing it's all around way too much, and clutters up the profiling tools quite a bit.

Thread name Thread count
nng:iocp 64
nng:ipc:dial 1
nng:resolver 4
nng:task 16
nng:reap2 1
nng:timer 1
nng:aio:expire 64
Expected behavior Ideally, thread creation is left to the application instead of being done by the nng framework, that way I can control scheduling of work and fit it into my async execution runtime, instead of having nng take up resource (and arguably more important) screen real estate in the profiler.

In our application we're trying to severely limit our number of threads and schedule work on our own.

Actual Behavior See above

To Reproduce Simply run nng on a high core-count machine.

** Environment Details **

  • NNG version: we're using the nng-rs crate, I think it's a bit behind on latest nng
  • Operating system and version: Windows
  • Compiler and language used: rustc
  • Shared or static library: unknown (and probably irrelevant)

Additional context I found an additional issue on limiting the amount of threads being spun up in #769 this seems to require compile time toggles? Ideally this is something that can be set at runtime (we have a few different use-cases all sharing the same codebase - one case where run the service inproc and want a low thread count, one where we run the service as a daemon and we don't really care about thread count).

Could you tell me how to find nng threads on Windows? I use Process Explorer but cannot find nng threads named as your table shows. THX.

@gdamore
Copy link
Contributor

gdamore commented Sep 14, 2023

I've not merged anything... as the patch was not submitted in the form of a PR, I will need some time to look it over and get it into such form so that it can run in our CI.

@gdamore
Copy link
Contributor

gdamore commented Sep 14, 2023

Btw, there are compile time fixes for these that * did* merge. Runtime is a different matter. Probably it will still need to be set up by the application before calling nng_init() or other functions, because the initial threads have to be configured at startup.

@gdamore
Copy link
Contributor

gdamore commented Nov 26, 2023

I intend at some point to refactor the expiration subsystem entirely -- which will probably let me greatly reduce the needs for lots of threads, by reducing the lock pressure caused by expiration.

I am probably not going to configuring the expiration threads a runtime tunable.

@gdamore
Copy link
Contributor

gdamore commented Dec 18, 2023

In head, we got read of the nni_timer thread. It's only thread, but still an improvement, if only a small one.

@gdamore
Copy link
Contributor

gdamore commented Dec 30, 2023

I've decided to apply a sane limit for Windows by default. 64 was simply too big.

#1749

This limits to 8 by default. Note that POSIX pollers are already single threaded right now, so this isn't necessarily a big deal (although we need to make them parallel too at some point.)

Its compile time tunable as well, but 8 is a lot nicer to work around than 64. It has to be at least 2 right now. (I probably could have allowed just one of these, but I'm paranoid about going too low.)

@gdamore
Copy link
Contributor

gdamore commented Jan 1, 2024

I think I am going to provide an API for this, that applications can call at start up time. Stay tuned.

@gdamore gdamore changed the title nng creates too many expire threads nng creates too many threads Jan 1, 2024
gdamore added a commit that referenced this issue Jan 1, 2024
This further limits some of the thread counts, but principally it
offers a new runtime facility, nng_init_set_parameter(), which can
be used to set certain runtime parameters on the number of threads,
provided it is called before the rest of application start up.

This facility is quite intentionally "undocumented", at least for now,
as we want to limit our commitment to it.  Still this should be helpful
for applications that need to reduce the number of threads that are
created.
gdamore added a commit that referenced this issue Jan 1, 2024
This further limits some of the thread counts, but principally it
offers a new runtime facility, nng_init_set_parameter(), which can
be used to set certain runtime parameters on the number of threads,
provided it is called before the rest of application start up.

This facility is quite intentionally "undocumented", at least for now,
as we want to limit our commitment to it.  Still this should be helpful
for applications that need to reduce the number of threads that are
created.
gdamore added a commit that referenced this issue Jan 1, 2024
This further limits some of the thread counts, but principally it
offers a new runtime facility, nng_init_set_parameter(), which can
be used to set certain runtime parameters on the number of threads,
provided it is called before the rest of application start up.

This facility is quite intentionally "undocumented", at least for now,
as we want to limit our commitment to it.  Still this should be helpful
for applications that need to reduce the number of threads that are
created.
gdamore added a commit that referenced this issue Jan 1, 2024
This further limits some of the thread counts, but principally it
offers a new runtime facility, nng_init_set_parameter(), which can
be used to set certain runtime parameters on the number of threads,
provided it is called before the rest of application start up.

This facility is quite intentionally "undocumented", at least for now,
as we want to limit our commitment to it.  Still this should be helpful
for applications that need to reduce the number of threads that are
created.
@gdamore gdamore closed this as completed in a9e98e5 Jan 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants