IRQ triggering stops after a while #45
Comments
Hmm... I've been testing the waitIRQ version heavily the past week and I haven't seen this behavior yet. Do you just wait an extended period of time to make it happen? 100,000 interrupts is approximately 100 seconds? Or do you use a rate other than 1kHz? When you say stuck, do you just miss an interrupt or two, or does mk lock up and need to be killed? The zynq version has been running for hours continuously without any symptoms of hard hanging except on shutdown. Occasionally, 1 time in 10 or 15, on shutdown the rtapi will begin spamming timeout and no connection errors and require manual intervention. That's the only error I've run into with waitirq. What should I do to try and duplicate? As for GPIO, I will try to post zturn today or tomorrow, and I can help get a custom config that will expose some timing gpio pins for you. I've been doing u-boot debugging to get NFS booting on the microzed, but I think I can find a little time to finish a basic zturn design.
Well, with posix it happens within minutes, yes. I now have it running for 20 hrs with an RT thread and it did not happen.

Stuck means: the waitirq function just sits in the read().

The shutdown hang is something else I still need to address - it seems the device read is not interruptible. 'timeout' typically means rtapi_app went away or crashed - the command thread is separate, so it should run whatever is happening in HAL threads.

To duplicate: adapt the config= line in the above config fragment and run it, then observe the
On 6/10/2016 7:24 AM, Michael Haberler wrote:
Where's the interrupt code? I'm either looking in the wrong file(s) or Usually, a problem like this means an interrupt was incorrectly cleared
I'm not sure if something like this can happen in the IRQ handling for a Similar deadlocks are also possible with improper handling of interrupt Charles Steinkuehler |
IRQ code is here: https://github.com/machinekit/machinekit/blob/master/src/hal/drivers/mesa-hostmot2/hm2_soc_ol.c#L615-L643

That scenario above could well be it!
I was thinking about this while driving, and I guess an easy way out would be to have a timeout condition on the blocking read. If the timeout occurs, we check whether the flags need to be reset, then choose either to block until the next interrupt by restarting the read, or to finish the function normally and let other hal calls get their share of the CPU. From what I've read about kernel latency to interrupts, the double interrupt described above is a real possibility, and the only fix for that is a preemptible kernel. If posix is a requirement though, the timeout on read will at least keep the bits flowing even if they're serviced irregularly. The other option is simply slowing down the interrupt rate, since the hardware cannot support the current rate with the latency included.
This might touch on the same issue. Is there any point in experimenting with level- vs edge-triggered IRQs?
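A minimal sketch of the timeout idea above, assuming the device fd supports poll() (the generic UIO framework does). The helper name `wait_irq_timeout` and its return convention are made up for illustration and are not taken from hm2_soc_ol.c:

```c
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper, not the hm2_soc_ol implementation.
 * Returns 1 if an interrupt count was read, 0 on timeout, -1 on error. */
int wait_irq_timeout(int fd, int timeout_ms, uint32_t *count)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    /* Wait for the fd to become readable; restart if a signal interrupts us. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;   /* timeout: caller can inspect/reset IRQ flags, then retry */

    /* A UIO device hands back a 4-byte interrupt count per read(). */
    if (read(fd, count, sizeof(*count)) != (ssize_t)sizeof(*count))
        return -1;
    return 1;
}
```

On a 0 return the thread function could check the IRQ flags and either call the helper again or return so other functs on the thread get to run. This only keeps the thread alive; it does not remove the suspected race itself.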
semi-related: the shutdown hangs if employing a read() on a device file (or some other fd for that matter; I am experiencing the same with eventfd(2)) - the scenario is:
there are two ways out of this:
I'll explore these in turn and see how I fare - I think 1. is less intrusive on API use.

I already verified that closing the device file does not cancel the read :-/

update: in fact a pthread_cancel() might do, as read() is on the list of cancellation points
well luckily it seems the pthread_cancel() does the job of terminating the read() (src/rtapi/rt-preempt.c): @dkhughes - mind trying this patch and see if this gets your shutdown hang sorted?
The more I think of it, this patch is seriously needed: ANY thread (rt or posix) doing not just HAL work but any form of blocking system call will otherwise be subject to this shutdown hang.
previously, a HAL thread doing a blocking system call (read, poll etc) would fail to terminate on hal_delete_thread() as pthread_join() alone does not terminate any pending system calls. The pthread_cancel() achieves this effect. In theory this should remove any RT shutdown hangs when using posix/rt-preempt. see also the discussion at: machinekit/mksocfpga#45 (comment)
On the socfpga, the above patch reliably removes the hang on exiting 'halrun -I irqtest.hal'.

On an amd64, this triggers an obscure pthread_cancel() bug causing a segv in the terminating thread.
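The cancel-then-join effect described here can be reproduced outside of rtapi. The following is a standalone sketch, not the code from src/rtapi/rt-preempt.c: a pipe that is never written to stands in for the device file, and the names `blocking_reader` and `cancel_blocked_thread` are invented for the example.

```c
#include <pthread.h>
#include <unistd.h>

/* Stand-in for a thread function that blocks in read() forever. */
static void *blocking_reader(void *arg)
{
    int fd = *(int *)arg;
    char buf[4];

    /* read() is a cancellation point; nothing is ever written to the pipe. */
    while (read(fd, buf, sizeof(buf)) >= 0) {
    }
    return NULL;
}

/* Returns 1 if the thread ended through cancellation, 0 if it ended
 * some other way, -1 on setup failure. */
int cancel_blocked_thread(void)
{
    int p[2];
    pthread_t tid;
    void *result = NULL;

    if (pipe(p) != 0)
        return -1;
    if (pthread_create(&tid, NULL, blocking_reader, &p[0]) != 0)
        return -1;

    /* pthread_join() alone would wait forever; the cancel request is acted
     * on at the read(), whether the thread is blocked there already or is
     * about to enter it. */
    pthread_cancel(tid);
    pthread_join(tid, &result);

    close(p[0]);
    close(p[1]);
    return result == PTHREAD_CANCELED;
}
```

This relies on the default deferred cancellation type; a thread that disabled cancellation would still hang in the join.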
I think I'm on the trail of this one - very subtle.

On 'unload ', the thread functs and pins of this comp are unlinked from the threads, then the comp is unloaded - the theory being that thereafter comp code and data cannot be referenced anymore.

In the case of a comp doing a blocking call, e.g. read(), the thread is blocked within this read even after the functs and pins are unlinked and the comp unloaded.

A later delthread (implicit in shutdown) cancels the system call originating in a - by now unloaded - comp (really a shared library), meaning the code and data segments of this comp are invalid, causing the crash on return from the system call.

I guess the resolution is to extend the rather obscure 'halcmd unload all' to delete all threads before any comp is unloaded. In the legacy code, threads were exported by the motion and threads components only, and an unload of those implicitly deleted the threads.

To round out my monologue ;) yes, the above worked, and a patch is coming which covers the 'unload all' and 'halcmd shutdown' cases.

It does NOT cover the case where a comp using blocking system calls is loaded, a thread has been started calling this funct, and the user removes the comp with 'unload comp' - this still results in an rtapi crash. The HAL data structures just do not support expressing this kind of referential integrity relation easily.

I do see an alternative, more proper fix through shared library reference counting: if a shared library (component) was already loaded with dlopen(), another dlopen() just increases the reference count on the shared library handle; dlclose() decrements the refcount and unmaps the shared library when the refcount drops to zero. I wonder if this is worth the trouble for now.
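To illustrate the dlopen()/dlclose() reference counting mentioned above, here is a small sketch that is independent of HAL. The function name `library_survives_first_dlclose` is hypothetical, and how a thread would take and drop the extra reference inside rtapi is left open:

```c
#include <dlfcn.h>
#include <stddef.h>

/* Opens the same library twice, closes one handle, then checks that a
 * symbol can still be resolved through the remaining handle.
 * Returns 1 if it can, 0 if not, -1 if the library could not be opened. */
int library_survives_first_dlclose(const char *lib, const char *symbol)
{
    void *first = dlopen(lib, RTLD_NOW);    /* takes one reference */
    if (first == NULL)
        return -1;

    void *second = dlopen(lib, RTLD_NOW);   /* takes a second reference */
    if (second == NULL) {
        dlclose(first);
        return -1;
    }

    dlclose(first);   /* drops one reference; the library stays mapped */
    int ok = dlsym(second, symbol) != NULL;

    dlclose(second);  /* drops the reference taken by the second dlopen() */
    return ok;
}
```

On a glibc system, `library_survives_first_dlclose("libm.so.6", "cos")` is one way to try it. Applied to this issue, the idea would be that a thread blocked inside a comp holds its own reference, so an 'unload comp' alone could not unmap the code underneath it.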
@mhaberler Impressive detective work. So, to make sure I understand what is happening - the thread is stuck in the blocking function call, and the component is deleted out from underneath it?
yes, exactly - a scenario which cannot happen with the current nonblocking thread functs
This terminates threads which are blocked in a thread function (read, poll) so the underlying component/shared library can be safely dlclose()'d. see also: machinekit/mksocfpga#45 (comment)
@dkhughes - machinekit/machinekit#962 should make the exit hang go away.

One can still crash rtapi by an explicit 'unloadrt hm2_soc_ol' while running, but maybe we'll find a fix for that downstream. For now it is 'a restriction', as it's easily avoided by preceding the unloadrt with 'delthread all' (that is pretty much what the patch does for 'unload all').

update: merged
@mhaberler I've used the patches for a day or so now and the exit hangs have disappeared, great work! I have been looking for side effects to the change but my tests haven't shown anything yet.
@dkhughes great to hear, thanks!

Re the actual topic of this issue: next stop is rebuilding uio_pdrv_genirq from source, assuming the solution will be at that level after all. Linus might have had a case ;)
not sure where to record this - maybe should move to mk/mk; more of an observation at this point and not yet critical
given the following config:
the IRQ count freezes at some point, visible by:
If I drop the 'posix' flag, making it an RT thread, this currently does not happen (letting it run overnight now. update: 2 days later, no IRQ lost when using an RT thread)
gut feeling: there is a race condition in acknowledging interrupts - if posix scheduling happens too late for the current cycle and the pending IRQ is not acked before a second IRQ is posted, things get stuck after 50,000-100,000 interrupts
if this is the case - I'd need gpio pins hooked to an LSA to nail that condition - a possible fix would be to decouple IRQ acknowledgement from continuing the waitirq function: ack the IRQ right away in the driver, and keep userland out of the IRQ ack business altogether
this might mean a custom driver (based on uio_pdrv_genirq) but it would at the same time get rid of the second system call (write) in waitirq(), something I would like to do anyway, and given we have machinekit-dkms in place it might not be an undue burden
I'd be grateful for a hint how to take some GPIO pins away from the stock GPIO driver (or hm2 for that matter, not sure how that could be done) so they could be used as scoping pins from HAL (i.e. directly manipulated in hal, not through some driver thread function)