
linux: use threads for console IO #6770

Closed · wants to merge 2 commits from the linux-threaded-io branch

Conversation

@nefelim4ag (Contributor) commented Dec 26, 2024:

This is a port of the changes from #6514.

> Interesting. It sounds like it could benefit all "Linux mcu" instances.

This adds threaded console IO with a simplified ring buffer.

There are no functional changes, but it does introduce some latency in end-to-end messaging.
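For context, a simplified single-producer/single-consumer ring buffer of the kind used here could look like the following (a minimal sketch, not the actual ringbuf.c from this PR):

#include <stdatomic.h>
#include <stdint.h>

#define RB_SIZE 4096 // power of two, so indexes can be masked

struct ringbuf {
    uint8_t data[RB_SIZE];
    _Atomic uint32_t head; // written only by the producer (reader thread)
    _Atomic uint32_t tail; // written only by the consumer (main thread)
};

static int
rb_push(struct ringbuf *rb, uint8_t b)
{
    uint32_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);
    if (head - tail >= RB_SIZE)
        return -1; // full
    rb->data[head & (RB_SIZE - 1)] = b;
    atomic_store_explicit(&rb->head, head + 1, memory_order_release);
    return 0;
}

static int
rb_pop(struct ringbuf *rb, uint8_t *b)
{
    uint32_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&rb->head, memory_order_acquire);
    if (head == tail)
        return -1; // empty
    *b = rb->data[tail & (RB_SIZE - 1)];
    atomic_store_explicit(&rb->tail, tail + 1, memory_order_release);
    return 0;
}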

DELAY {clock + 20*freq} get_uptime
FLOOD 1000000 0.0 debug_nop
get_uptime

Tested on an RPi5 with the frequency fixed at 2 GHz (higher numbers below are better).

Before:
1000000 * 50000000 / (1656965298 - 1652508544) = 11218927.497

After (main sleep 500us): 
1000000 * 50000000 / (3923071643 - 3886878513) = 1381477.645

After (main sleep 100us):
1000000 * 50000000 / (1505470128 - 1495177143) = 4857677.340

Updated (with adjusted sleep & SIGUSR1):
1000000 * 50000000 / (1574801575 - 1570190161) = 10842661.274

I tried to use signaling to wake up the main thread when it sleeps and the reader receives something. The same applies to the writer, which sleeps and is woken by a signal.
But it seems like any additional syscalls or signals in the main thread introduce pauses, which lead to a "Rescheduled timer in the past" error.

So, there are pauses.
In the perfect scenario, I think, the main thread could use clock_nanosleep() to sleep until the next SIGALRM or until a signal from the reader arrives, all to save CPU cycles.


Mostly mitigated.
The sleep times here are arbitrary values, mostly a tradeoff between CPU usage and latency.
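A sketch of the wake-up idea described above, assuming the handler is installed without SA_RESTART so a sleeping nanosleep() fails with EINTR (names here are illustrative, not the PR's actual code):

#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static pthread_t main_tid;

static void
sigusr1_handler(int sig)
{
    (void)sig; // nothing to do - delivery alone interrupts nanosleep()
}

static void
wakeup_setup(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigusr1_handler;
    sigemptyset(&sa.sa_mask);
    // no SA_RESTART, so the main thread's nanosleep() returns EINTR
    sigaction(SIGUSR1, &sa, NULL);
    main_tid = pthread_self();
}

// Called by the reader thread after it queues received bytes
static void
wake_main_thread(void)
{
    pthread_kill(main_tid, SIGUSR1);
}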

@nefelim4ag force-pushed the linux-threaded-io branch 2 times, most recently from 3d74ee8 to a92d859 on December 27, 2024 00:13
@KevinOConnor (Collaborator) commented:

Thanks for working on this.

Does this change improve overall IO performance (eg, steps per second) and if so by how much? It seems odd to me that replacing ppoll() with nanosleep() would improve performance. It seems like it could increase scheduling jitter by 100+us.

Same comment as before on ringbuf.c - it needs to have a copyright statement if it is locally developed, or it should be moved to lib/ if it is an external file. For what it is worth though, I'm not sure why a ringbuf is needed. If there is a background thread for the console, it can read the buffer, parse the messages, signal the main thread when commands are ready to run, collect any output from the main thread, and write that output back to the console. The PRU code already uses a separate thread for console reading (the code is in src/pru/pru0.c, but admittedly the PRU is so goofy it's hard to follow that code).
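A rough sketch of that structure might look like the following (every helper name here is hypothetical, just to show the shape of the loop, not Klipper's actual API):

#include <stdint.h>
#include <unistd.h>

extern int console_fd;
extern int parse_messages(uint8_t *buf, int len); // returns leftover bytes
extern int commands_pending(void);
extern void notify_main_thread(void);             // condvar or signal
extern void flush_pending_output(int fd);

static void *
console_thread(void *arg)
{
    (void)arg;
    uint8_t buf[4096];
    int len = 0;
    for (;;) {
        // Block on the console fd, so the main thread never issues IO
        int ret = read(console_fd, buf + len, sizeof(buf) - len);
        if (ret > 0)
            // Parse complete message blocks; keep any partial tail
            len = parse_messages(buf, len + ret);
        if (commands_pending())
            notify_main_thread();
        // Write back output collected from the main thread
        flush_pending_output(console_fd);
    }
    return NULL;
}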

Cheers,
-Kevin

@nefelim4ag (Contributor, author) commented Jan 1, 2025:

> Does this change improve overall IO performance (eg, steps per second) and if so by how much? It seems odd to me that replacing ppoll() with nanosleep() would improve performance. It seems like it could increase scheduling jitter by 100+us.

I don't think so. The original PR replaced the whole GPIO logic with mmap/DMA access, which makes it platform-specific but also provides huge performance benefits (no more syscalls for GPIO access).
On the other hand, the author said that read/write calls can sometimes produce jitter.
Originally he just isolated one core to run only the MCU thread on it; any syscall makes it slower, so there is no sleep or waiting at all (a busy loop at 100% CPU usage).

I tried to import the idea of separating out the IO threads: if the main thread does not do IO, it should be a bit faster.
Nanosleep here is just a tradeoff; it is cancellable by a signal, like ppoll(), so I expected it to perform in a similar manner.
The same goes for clock_nanosleep() until the expected signal time - it should behave similarly here. But I wasn't successful at making it work correctly with everything.


I tested the step rate with the same setup, the same RPi5 fixed at 2 GHz (so it is not too fast - there are a lot of people with slower hosts).

allocate_oids count=3
config_stepper oid=0 step_pin=gpio17 dir_pin=gpio18 invert_step=0 step_pulse_ticks=5
config_stepper oid=1 step_pin=gpio27 dir_pin=gpio22 invert_step=0 step_pulse_ticks=5
config_stepper oid=2 step_pin=gpio23 dir_pin=gpio24 invert_step=0 step_pulse_ticks=5

finalize_config crc=0

SET start_clock {clock+freq}
SET ticks 214

reset_step_clock oid=0 clock={start_clock}
set_next_step_dir oid=0 dir=0
queue_step oid=0 interval={ticks} count=60000 add=0
set_next_step_dir oid=0 dir=1
queue_step oid=0 interval=3000 count=1 add=0

reset_step_clock oid=1 clock={start_clock}
set_next_step_dir oid=1 dir=0
queue_step oid=1 interval={ticks} count=60000 add=0
set_next_step_dir oid=1 dir=1
queue_step oid=1 interval=3000 count=1 add=0

reset_step_clock oid=2 clock={start_clock}
set_next_step_dir oid=2 dir=0
queue_step oid=2 interval={ticks} count=60000 add=0
set_next_step_dir oid=2 dir=1
queue_step oid=2 interval=3000 count=1 add=0

ECHO Test result is: {"%.0fK" % (3. * freq / ticks / 1000.)}

714K originally (210 ticks), 701K with the patches (214 ticks).
Unfortunately, it's a bit slower. Reducing the nanosleep from 1 ms to 50 µs does not affect the results, so maybe it is scheduling jitter caused by nanosleep itself; just dropping it (to get busy looping) makes it perform worse.

> I'm not sure why a ringbuf is needed. If there is a background thread for the console

It was just a simple, handy solution; I have no strong opinion here and don't think it really matters.
I will take a look at the PRU code - from a high-level view it mostly looks the same as here, we just offload decoding/encoding and use shared memory. I think it is possible to do the same in Linux.


If the goal is to improve the performance of the Linux MCU, I think there are only two solutions:

Timers on the Linux MCU are not the same as on bare-metal MCUs, where they perform the actual work in interrupt context.
Currently, they do the work in irq_poll(), in the same thread, and must wait for any previously issued work or syscall to finish.
In other words, on an STM32 we expect the step to happen regardless of whatever is currently running at task level; here that is not true.
If an I2C transaction consumes 100 µs, the step will be late by at least 100 µs.
At the step rate above, each syscall should take around 214 / 50,000,000 s = 4.28 µs.

Below are my speculations.
Because I did not see high user CPU usage, only system CPU usage, I think it is just that the ioctl() performed on every GPIO access forces the thread to park and then reschedule, so there is a large time overhead on exactly the thing we are trying to benchmark.


BTW, there is a post about nanosleep(1000) and scheduling timing granularity. I didn't verify it, but it also shows about 4 µs for rescheduling:
https://fritshoogland.wordpress.com/2018/03/13/linux-timing-and-scheduling-granularity/

So maybe there are fundamental kernel limits on frequency here.

With nanosleep(1000):

 3)   2.481 us    |  schedule();
 3)   2.463 us    |  schedule();
 3)   2.537 us    |  schedule();
 3)   2.871 us    |  schedule();

gpio calls:

1)   9.611 us    |  gpio_ioctl();
1)   9.666 us    |  gpio_ioctl();
....
1)   0.277 us    |  cortex_a76_erratum_1463225_svc_handler();
1)               |  do_el0_svc() {
1)               |    el0_svc_common.constprop.0() {
1)               |      invoke_syscall() {
1)               |        __arm64_sys_ioctl() {
1)               |          __fdget() {
1)   0.259 us    |            __fget_light();
1)   0.740 us    |          }
1)   0.259 us    |          security_file_ioctl();
1)   0.259 us    |          do_vfs_ioctl();
1)               |          linehandle_ioctl() {
1)   0.259 us    |            down_read();
1)               |            linehandle_ioctl_unlocked() {
1)               |              gpiod_set_array_value_complex() {
1)               |                gpio_chip_set_multiple() {
1)               |                  rp1_gpio_set() {
1)   0.259 us    |                    gpiochip_get_data();
1)   0.741 us    |                  }
1)   1.481 us    |                }
1)   1.982 us    |              }
1)   2.481 us    |            }
1)   0.259 us    |            up_read();
1)   3.981 us    |          }
1)   6.407 us    |        }
1)   6.889 us    |      }
1)   7.389 us    |    }
1)   7.889 us    |  }

@KevinOConnor (Collaborator) commented:

> Nanosleep here is just a tradeoff; it is cancellable by a signal, like ppoll(), so I expected it to perform in a similar manner.

There is a subtle race with nanosleep that does not occur with ppoll() - if the signal occurs just after TimerInfo.must_wake_timers is checked but before the code starts nanosleep() then the signal will not wake up the nanosleep. That's the purpose of ppoll() - it is specifically designed not to have that race. I don't think it would be safe to lose that atomicity.
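A sketch of the race for reference (TimerInfo.must_wake_timers as in the current code; everything else simplified and hypothetical):

#include <poll.h>
#include <signal.h>
#include <time.h>

static void
sleep_racy(struct timespec *timeout)
{
    // Racy: a signal delivered between the check and the sleep is
    // consumed by the handler, and nothing interrupts the nanosleep()
    if (!TimerInfo.must_wake_timers)
        nanosleep(timeout, NULL); // sleeps the full timeout
}

static void
sleep_atomic(struct pollfd *fds, nfds_t nfds, struct timespec *timeout)
{
    // ppoll() closes that window: the wakeup signal stays blocked
    // except during the call itself, so a signal that is already
    // pending aborts ppoll() immediately with EINTR
    sigset_t unblocked;
    sigemptyset(&unblocked);
    ppoll(fds, nfds, timeout, &unblocked);
}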

I'm not sure if the goal was to remove jitter from timer_disable_signals(); ppoll(); timer_enable_signals(); or if the goal was to remove jitter from read(). If it's the former, it's certainly a bit of work - I'd suggest looking at removing signals entirely and using pthread_cond_wait(), or possibly running timers in a separate thread from tasks and converting irq_disable() calls to pthread_mutex_lock() (which I'd guess is a lot of work). If it's the latter (read() jitter) then there might be other solutions (such as moving read() to a thread and using pthread_cond_timedwait() in the main thread).
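A minimal sketch of the "read() in a thread" variant mentioned above (names and structure are my own, not from the PR):

#include <errno.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int data_ready;

// Reader thread: after a blocking read() completes, flag the data and
// wake the main thread
static void
reader_post(void)
{
    pthread_mutex_lock(&lock);
    data_ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

// Main thread: sleep until data arrives or the next timer deadline
static void
main_wait(const struct timespec *deadline)
{
    pthread_mutex_lock(&lock);
    while (!data_ready) {
        // Note: the deadline is interpreted on CLOCK_REALTIME by
        // default, which is what makes this a dead end (see the
        // follow-up comment below)
        if (pthread_cond_timedwait(&cond, &lock, deadline) == ETIMEDOUT)
            break;
    }
    data_ready = 0;
    pthread_mutex_unlock(&lock);
}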

Actually, thinking about it further, one might be able to also get rid of the timer_disable_signals(); ppoll(); timer_enable_signals(); with a clever use of pthread_cond_timedwait().

Whatever the source of the issue, it'll be hard to make changes here if we can't benchmark an improvement. It might be necessary to implement memory mapped gpio first so that the jitter from syscalls becomes more apparent in tests. Not sure.
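If anyone wants to experiment, a rough sketch of memory-mapped GPIO on the older BCM283x Raspberry Pis might look like this (illustrative only; the Pi 5's RP1 has a different register layout):

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

// BCM283x register offsets (in 32-bit words) from the GPIO base
#define GPSET0 (0x1C / 4) // write 1<<pin here to drive the pin high
#define GPCLR0 (0x28 / 4) // write 1<<pin here to drive the pin low

static volatile uint32_t *gpio_regs;

static int
gpio_map(void)
{
    int fd = open("/dev/gpiomem", O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;
    gpio_regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);
    return gpio_regs == MAP_FAILED ? -1 : 0;
}

// After mapping, a step pulse is two plain stores - no ioctl() at all
static inline void gpio_out_set(int pin) { gpio_regs[GPSET0] = 1u << pin; }
static inline void gpio_out_clr(int pin) { gpio_regs[GPCLR0] = 1u << pin; }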

Cheers,
-Kevin

@KevinOConnor (Collaborator) commented:

FYI, I tried prototyping code using pthread_cond_timedwait(), but unfortunately that call works on normal system times while the mcu code uses CLOCK_MONOTONIC times. So, it doesn't work.

Oh well.
-Kevin

@nefelim4ag (Contributor, author) commented:

I reimplemented the PRU-like logic with threads, but to my surprise it performed even worse. I expected no real difference from plain threaded IO.

It looks like the round-trip time just degrades, so console.py can't initialize within 2 seconds and throws a timeout error (the identify command does return the config, just too slowly; the default retry of 10 ms doesn't work well here either).

And that has nothing to do with scheduling jitter, so I give up on that approach.


> FYI, I tried prototyping code using pthread_cond_timedwait(), but unfortunately that call works on normal system times while the mcu code uses CLOCK_MONOTONIC times. So, it doesn't work.

I tried to move the timers to a separate thread and drop the system timer, and it works (with some locking), but it does not provide any benefit (the step rate is lower, ~667K) and is buggy (e.g. it looks like shutdown() called from timers breaks the locks).

My slightly naive solution:

static void
timer_dispatch(void)
{
    uint32_t repeat_count = TIMER_REPEAT_COUNT, next;
    for (;;) {
        // Run the next software timer (the lock guards the timer list
        // shared with the main thread)
        timer_lock();
        next = sched_timer_dispatch();
        timer_unlock();
...
    }

    // Sleep till the next timer, waking up early by ns_correction to
    // compensate for the measured scheduler latency
    TimerInfo.next_wake = timespec_from_time(next);
    TimerInfo.next_wake_counter = next;
    struct timespec corrected_wake = TimerInfo.next_wake;
    if (corrected_wake.tv_nsec >= ns_correction) {
        corrected_wake.tv_nsec -= ns_correction;
    } else {
        corrected_wake.tv_sec -= 1;
        corrected_wake.tv_nsec = 1000000000 + corrected_wake.tv_nsec
                                 - ns_correction;
    }
    // clock_nanosleep() returns the error number directly (no errno);
    // EINTR means a signal woke us before the deadline
    int ret = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
                              &corrected_wake, NULL);
    if (ret == EINTR)
        return;

    // Measure how late the wakeup actually was to refine ns_correction
    struct timespec ts = timespec_read();
    correction_calc(TimerInfo.next_wake, ts);
}
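For reference, timer_lock()/timer_unlock() above are assumed here to be thin pthread mutex wrappers guarding the shared timer list, something like:

static pthread_mutex_t timer_mutex = PTHREAD_MUTEX_INITIALIZER;

static void
timer_lock(void)
{
    pthread_mutex_lock(&timer_mutex);
}

static void
timer_unlock(void)
{
    pthread_mutex_unlock(&timer_mutex);
}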

That was another interesting dive into the MCU code, but it really looks like it is hard to improve further.

Anyway, thanks for your time.

@nefelim4ag closed this on Jan 3, 2025.
@KevinOConnor (Collaborator) commented:

Ah, if you're curious, I uploaded what I attempted yesterday (again, it's broken code): https://github.com/KevinOConnor/klipper-dev/tree/work-linux-20250102

-Kevin
