[illumos] switch waker impl back to being pipe-based #1824
Conversation
(Once this PR lands it would be great to get a release out. Thanks!) |
I made this change on purpose since illumos supports eventfd (according to the manual). I would prefer to debug what is causing this test to fail, because the eventfd implementation is more efficient than the pipe-based one. But before we focus on eventfd vs pipe, can you confirm the pipe implementation works by running with |
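To make the efficiency comparison above concrete: an eventfd is a single file descriptor whose wake-up is one 8-byte write that bumps a kernel counter, while a pipe needs two descriptors and leaves a byte behind that must be drained later. Below is a rough, Linux-oriented sketch using the `libc` crate; it is illustrative only, not mio's implementation, and eventfd availability and flag details vary by platform (including illumos).

```rust
use std::io;

// One wake-up via an eventfd: add 1 to the kernel counter with a single
// 8-byte write. Only one file descriptor is involved.
fn wake_eventfd(efd: libc::c_int) -> io::Result<()> {
    let one: u64 = 1;
    let n = unsafe { libc::write(efd, &one as *const u64 as *const libc::c_void, 8) };
    if n == 8 { Ok(()) } else { Err(io::Error::last_os_error()) }
}

// One wake-up via a pipe: push one byte into the pipe buffer. The reader
// must drain it later, and two file descriptors are needed per waker.
fn wake_pipe(write_fd: libc::c_int) -> io::Result<()> {
    let byte = [1u8];
    let n = unsafe { libc::write(write_fd, byte.as_ptr() as *const libc::c_void, 1) };
    if n == 1 { Ok(()) } else { Err(io::Error::last_os_error()) }
}

fn main() -> io::Result<()> {
    // Hypothetical setup: one eventfd (Linux-only constants) and one pipe.
    let efd = unsafe { libc::eventfd(0, libc::EFD_NONBLOCK | libc::EFD_CLOEXEC) };
    if efd < 0 {
        return Err(io::Error::last_os_error());
    }
    let mut fds = [0 as libc::c_int; 2];
    if unsafe { libc::pipe(fds.as_mut_ptr()) } != 0 {
        return Err(io::Error::last_os_error());
    }

    wake_eventfd(efd)?;
    wake_pipe(fds[1])?;
    Ok(())
}
```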
Thanks. Was there any validation involved as part of the change? I'm asking because the test fails quite consistently for me. (Not intending to blame here! The lack of publicly-available illumos CI is definitely a major factor here, and at Oxide we're considering ways to address that.)
To be clear I don't currently think the bug is in mio -- I think it's in illumos's eventfd impl. I read through
Yes, on master:
Thank you! |
I've also confirmed that switching back to the pipe impl makes the Tokio test suite pass as well. |
In tokio-rs@1133ed0, it looks like the waker implementation on illumos (perhaps inadvertently) got switched from being pipe-based to being eventfd-based. Unfortunately, the eventfd impl seems to break the `waker_multiple_wakeups_different_thread` test on illumos:

```
thread 'waker_multiple_wakeups_different_thread' panicked at tests/waker.rs:157:5:
assertion failed: !events.is_empty()
```

While we separately figure out what might be going wrong in the eventfd impl, it's worth switching back to the old pipe-based impl, which is known to work and which does fix this test. See oxidecomputer/helios#169.
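For context, the failing test exercises mio's public `Waker` API: wake a `Poll` from another thread and check that a readiness event is delivered. A simplified approximation of that flow (not the actual test -- the token value, timeout, and thread structure here are arbitrary), assuming `mio` with its `os-poll` feature enabled:

```rust
use std::sync::Arc;
use std::thread;
use std::time::Duration;

use mio::{Events, Poll, Token, Waker};

const WAKE: Token = Token(10);

fn main() -> std::io::Result<()> {
    let mut poll = Poll::new()?;
    let waker = Arc::new(Waker::new(poll.registry(), WAKE)?);
    let mut events = Events::with_capacity(8);

    // Wake the poll loop twice from another thread, as the failing test does.
    let w = Arc::clone(&waker);
    let handle = thread::spawn(move || {
        w.wake().expect("first wake");
        w.wake().expect("second wake");
    });

    poll.poll(&mut events, Some(Duration::from_secs(5)))?;

    // The test's core assertion: at least one readiness event must arrive.
    assert!(!events.is_empty(), "expected a wake-up event");
    for event in events.iter() {
        assert_eq!(event.token(), WAKE);
    }

    handle.join().unwrap();
    Ok(())
}
```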
(Force-pushed from 95ce468 to c2a8968.)
@sunshowers is there any (public) issue/ticket/etc. we can refer to and keep track of, so we can go back to eventfd? |
Yes, we'll convert oxidecomputer/helios#169 over to that so feel free to refer to that. |
I meant in the illumos bug tracker (if a public one exists) |
There is a public bug tracker, but there isn't a bug on it yet. We're planning to do some investigation later this week once we have a better idea, and file a bug then. I'm happy to comment on this PR once that is done. In the meantime the Helios issue I linked is authoritative. But I don't think this PR should block on that bug existing. This PR merely puts things back to a known-working state, using a portable POSIX implementation. Another thing to note is that if there is truly a bug in the illumos eventfd implementation, and if there is no possible workaround to that within mio, then mio will have to stay with the pipe implementation for many years. That's because not everyone runs an up-to-date illumos. Hope that helps, thanks! |
Except it's not. I will not implement a workaround for a suspected OS bug without linking to a bug in the OS's bug tracker. @tokio-rs/illumos have any of you hit this bug before? Any known bugs we can link to? |
G'day! I'm a member of the illumos core team. I actually hit the regression here myself when upgrading a project from tokio 1.38.X to 1.39.X the other day. My colleague, @sunshowers, picked it up and has done a lot of leg work to isolate the change that introduced the regression and get fixes out there, as this basically prevents any software that uses tokio from working on illumos today.

We don't have an OS bug to point to right now because we don't currently know that there is an OS bug -- though we will obviously continue to investigate that in the background! What has become clear in the last couple of days is that the original pipe-based mechanism works reliably on illumos systems (we have years, now, of production experience using it) and the switch to using eventfd has broken basically any software that uses tokio for us. Even this simple program no longer reliably works, when built with tokio 1.39.X:

```rust
use std::time::{Duration, Instant};

#[tokio::main]
async fn main() {
    let now = Instant::now();
    let res = tokio::time::timeout(Duration::from_secs(5), work()).await;
    let dur = Instant::now().checked_duration_since(now).unwrap().as_millis();
    println!("{res:?} after {dur} msec");
}

async fn work() {
    tokio::time::sleep(Duration::from_secs(1)).await;
}
```

The program either hangs forever, ostensibly due to a missed wakeup, or it runs for ~5 seconds instead of the expected 1 second. Downgrading tokio (and thus mio) makes the program work again.

We'd like to get this behavioural change, which as far as I can tell wasn't tested at all at the time, reverted promptly and a new release out there. That will allow tokio-based software to start working for illumos users once again. If we do eventually find an OS bug in our eventfd implementation, we still need to revert the change here: there is no crisp way to know at runtime if you're on a version of the OS that has that hypothetical fix or not, and depending on their risk profiles and processes, some users can take a long time to upgrade their base OS bits. Thanks for looking at this! Please let us know how we can help to get this one over the line. |
G'day again! It's been about 24 hours and I just wanted to check in and see if we could get this landed today, and a new release of mio cut. It's a critical defect that prevents tokio 1.39.X from working on illumos systems, and we'd really like to get that sorted out! Please let us know if there's anything left to do before this can be integrated and released. Thanks! |
@jclulow I work on Mio in my spare time; I'm not a help desk. I do not appreciate those kinds of hurry-up comments. |
Maybe we can apply the same trick we did to the pipe-based waker to the eventfd implementation. Looking at `mio/src/sys/unix/waker/pipe.rs`, lines 40 to 44 (at 4a5114e): can we apply that to `mio/src/sys/unix/waker/eventfd.rs`, line 44 (at 4a5114e)? Can either of you try that and see if it works? |
Totally appreciate that you're busy! That's why we've picked the lowest risk change here (reverting to the behaviour we know has worked for years in production) and done all the testing already. We can, and will, investigate eventfd and other options in the future, obviously, including your suggested path above. To get the quickest possible resolution for a critical defect, though, with the lowest likelihood of us coming back to you with another urgent PR in a few days, we need to go back to the original behaviour first. We're also looking at providing illumos CI resources in the future as well, so that these sorts of behavioural changes can be tested prior to integration next time. Plus, I just want to say that we're always happy to be pinged for a review on any changes that you need to make to mio which affect illumos -- we currently do that for, e.g., the libc crate, which has been a fruitful collaboration! Can we please get this change integrated as-is today and a new release cut? If there are other tokio maintainers we should pull in here, we're happy to do that too! Just let us know. Thanks! |
Since neither of you wanted to write the code, I did: #1826. https://www.illumos.org/issues/16700 is a very clear indication to me that this isn't limited to eventfd, as we've also seen this in pipes. The workaround added in #1826 is copied from the pipe implementation, which also wouldn't work without it. |
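The thread doesn't reproduce the code in question, but the general shape of the "trick" being discussed -- draining the waker's readable end before writing to it again, and tolerating a full buffer on write -- might look roughly like the sketch below. This is an illustration over a `UnixStream` pair, not mio's actual pipe or eventfd code; the `SketchWaker`, `drain`, and `wake` names are made up for this example.

```rust
use std::io::{self, ErrorKind, Read, Write};
use std::os::unix::net::UnixStream;

// Hypothetical waker built on a socket pair, for illustration only.
struct SketchWaker {
    sender: UnixStream,
    receiver: UnixStream,
}

impl SketchWaker {
    fn new() -> io::Result<SketchWaker> {
        let (sender, receiver) = UnixStream::pair()?;
        sender.set_nonblocking(true)?;
        receiver.set_nonblocking(true)?;
        Ok(SketchWaker { sender, receiver })
    }

    // Drain whatever previous wake-ups left behind, so the next write
    // produces a fresh readability edge for an edge-triggered poller.
    fn drain(&self) {
        let mut buf = [0u8; 64];
        // Stop on WouldBlock (already empty) or any other error.
        while let Ok(n) = (&self.receiver).read(&mut buf) {
            if n == 0 {
                break;
            }
        }
    }

    fn wake(&self) -> io::Result<()> {
        // The "trick": empty the readable side before writing again.
        self.drain();
        match (&self.sender).write(&[1]) {
            Ok(_) => Ok(()),
            // Buffer full: the poller will be woken by what is already there.
            Err(ref err) if err.kind() == ErrorKind::WouldBlock => Ok(()),
            // Interrupted by a signal: retry.
            Err(ref err) if err.kind() == ErrorKind::Interrupted => self.wake(),
            Err(err) => Err(err),
        }
    }
}

fn main() -> io::Result<()> {
    let waker = SketchWaker::new()?;
    waker.wake()?;
    waker.wake()?; // a second wake before any drain is still fine
    Ok(())
}
```

The point of draining first is that the poller sees a fresh readiness transition on the next write; per the comment above, the pipe implementation apparently needed the same treatment to work reliably on illumos.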
Hi there! In the spirit of moving forward and being mindful of everyone's time, I'd like to provide an update on our progress: We've identified this as an illumos bug. There's an open bug report with a proposed fix, which you can find here. While a fix is in progress, it's important to note that it won't immediately resolve the problem for all systems, particularly those running older illumos versions. To ensure mio functions correctly on illumos, we'll need to maintain a workaround for the foreseeable future. We've developed a potential workaround, which is available on the Helios issue I mentioned earlier. However, we're still in the process of thoroughly testing this solution. As you can imagine, we want to be as confident in this new approach as we are with our current implementation, which has a proven track record in production environments. Another aspect we're exploring is the performance of eventfd on illumos. While eventfd offers performance benefits on Linux, we haven't yet confirmed similar improvements on illumos. We're committed to rigorously testing this to ensure it provides tangible benefits. We're happy to continue validating eventfd, but this process will require some time. This includes testing and profiling the eventfd implementation on illumos. If our testing doesn't show significant performance improvements over pipes on illumos, we may need to reconsider the switch. Given these factors, we believe it would be prudent to temporarily revert to a pipe-based implementation while we continue our validation of eventfd. This approach would allow us to address the critical defect in mio without delay, especially considering:
We believe this strategy will help minimize disruption for all parties involved. We appreciate your understanding as we work through this challenge. Thank you! |
Closing as solved by #1826. |