detect and act on network change #1133
Conversation
Force-pushed from aebafe7 to 3f905f0
warn!("greenlight network change detector died."); | ||
} | ||
|
||
let res = fallback().await; |
Do we need to wait at all to allow time for the GRPC clients to reconnect?
It looks like the connection is immediately reusable.
You can test this with the sdk-cli by:
- connecting the sdk
- changing to another network (like mobile hotspot)
- calling send_payment or lnurl_pay with or without --use_trampoline
It will hang for a long time (until Blockstream/greenlight#548 is included) and then continue with the fallback successfully.
Force-pushed from 3f905f0 to 791985c
LGTM, with a nit
When there is a network change, tonic behaves as follows:
- After the keepalive timeout, it reconnects automatically.
- Before the keepalive timeout, any grpc call will time out; after the timeout, the connection is reestablished.

This commit adds a mechanism to reconnect all grpc clients after one of the clients detects a network change. Initially the attempt was to retry only based on a keepalive timeout error, but a network change affects all grpc clients, so subsequent requests to other grpc endpoints would still fail with a timeout. Of course those grpc clients could also add a retry-on-timeout, but since it is known at this point that the grpc clients are temporarily dead, they are reconnected immediately. This ensures subsequent calls to other endpoints won't add additional time by waiting for a timeout.

tokio was bumped to 1.41 to allow cloning watch::Sender.
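To make the mechanism concrete, here is a minimal sketch of such a broadcast, assuming a tokio::sync::watch channel; the names `NetworkChangeBroadcast`, `notify`, and `reconnect_on_change` are hypothetical illustrations, not the PR's actual API:

```rust
use tokio::sync::watch;

// Hypothetical holder of the sender side. Deriving Clone requires
// watch::Sender: Clone, which is what the tokio 1.41 bump enables.
#[derive(Clone)]
pub struct NetworkChangeBroadcast {
    tx: watch::Sender<u64>,
}

impl NetworkChangeBroadcast {
    pub fn new() -> Self {
        let (tx, _rx) = watch::channel(0u64);
        Self { tx }
    }

    /// Called by whichever client detects the network change.
    pub fn notify(&self) {
        // Bump a generation counter so every subscriber wakes up.
        self.tx.send_modify(|generation| *generation += 1);
    }

    pub fn subscribe(&self) -> watch::Receiver<u64> {
        self.tx.subscribe()
    }
}

/// Each grpc client holder would run a task like this and rebuild its
/// connection whenever a change is signalled.
pub async fn reconnect_on_change(mut rx: watch::Receiver<u64>) {
    while rx.changed().await.is_ok() {
        // Reconnect this grpc client here (e.g. rebuild the tonic
        // channel); omitted in this sketch.
    }
}
```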
Force-pushed from 791985c to 806261f
Updated to wrap most grpc calls in with_connection_fallback.
LGTM, with one comment to check whether we can pass only the called function to with_connection_fallback.
```rust
pub async fn with_connection_fallback<T, M, F>(
    main: M,
    fallback: impl FnOnce() -> F,
```
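For context, a hedged sketch of how a helper with this signature might be implemented; the return type, trait bounds, and the `looks_like_broken_connection` helper are assumptions for illustration, not the PR's actual code:

```rust
use std::future::Future;

// Sketch only: run `main`, and if it fails in a way that suggests the
// connection died, build and await the fallback future instead.
pub async fn with_connection_fallback<T, M, F>(
    main: M,
    fallback: impl FnOnce() -> F,
) -> Result<T, tonic::Status>
where
    M: Future<Output = Result<T, tonic::Status>>,
    F: Future<Output = Result<T, tonic::Status>>,
{
    match main.await {
        Ok(t) => Ok(t),
        Err(status) => {
            if looks_like_broken_connection(&status) {
                fallback().await
            } else {
                Err(status)
            }
        }
    }
}

// Hypothetical placeholder for the broken-connection detection; the macro
// discussed below matches on tonic transport error strings for this purpose.
fn looks_like_broken_connection(status: &tonic::Status) -> bool {
    use std::error::Error;
    status
        .source()
        .map(|source| source.to_string().contains("transport error"))
        .unwrap_or(false)
}
```

The caller passes the first attempt as `main` and a closure producing the retry future as `fallback`, so the retry is only constructed (and the client only re-borrowed or cloned) when it is actually needed.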
I wonder if we need the "main" future, or whether we can use the fallback for the first attempt as well.
I tried this; if you want, you can try it yourself too. I tried passing a client into the function along with a Fn that operates on the client, but I got stuck on lifetimes. It would be the cleaner way, though.
I think we can use a macro here, so it can be used like this:

```rust
retry! {
    client.lsp_list(request.clone()).await
}
```

I will give it a try.
It seems that this macro should do it:
```rust
#[macro_export]
macro_rules! retry_grpc {
    ($f:expr) => {{
        use std::error::Error;
        use log::debug;
        const BROKEN_CONNECTION_STRINGS: [&str; 3] = [
            "http2 error: keep-alive timed out",
            "connection error: address not available",
            "connection error: timed out",
        ];
        let res = $f;
        let result = match res {
            Ok(t) => Ok(t),
            Err(status) => {
                // Default to returning the original error.
                let mut returned = Err(status.clone());
                if let Some(source) = status.source() {
                    if let Some(error) = source.downcast_ref::<tonic::transport::Error>() {
                        if error.to_string() == "transport error" {
                            if let Some(source) = error.source() {
                                if BROKEN_CONNECTION_STRINGS.contains(&source.to_string().as_str()) {
                                    debug!("retry_grpc: initial call failed due to broken connection. Retrying.");
                                    // Retry the call once on a broken connection.
                                    returned = $f
                                }
                            }
                        }
                    }
                }
                returned
            },
        };
        result
    }};
}
```
Then it can be used as:

```rust
retry_grpc! {
    client.lsp_list(request.clone()).await
}
```
Payment was still getting stuck. Added more wrappers.
@roeierez Found another way to make the compiler happy by passing a single function; what do you think?
A macro makes for a little less code duplication. The macro returns a single awaitable future, so the result of the macro can be used in join! calls. I was unable to do this without consuming the grpc client object, so the grpc client object has to be cloned if it's used again later.
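A hedged sketch of what such a macro could look like; the name `with_connection_retry` is hypothetical, and unlike the real code this sketch retries unconditionally instead of checking for a broken-connection error first:

```rust
// Hypothetical macro: wraps a client call expression in an async block,
// yielding one future that can be passed to tokio::join!. Note `$call`
// is evaluated a second time on retry.
#[macro_export]
macro_rules! with_connection_retry {
    ($call:expr) => {
        async {
            match $call.await {
                Ok(response) => Ok(response),
                Err(status) => {
                    // A real implementation would only retry when the
                    // error source indicates a broken connection.
                    log::debug!("grpc call failed, retrying once: {}", status);
                    $call.await
                }
            }
        }
    };
}
```

Because the macro yields a future rather than awaiting in place, several wrapped calls can be combined in a single join!; and since the wrapped expression may run twice, a client that is needed elsewhere afterwards has to be cloned before being moved into the macro invocation.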
Force-pushed from 41f8860 to 2857569
Now the retry logic is wrapped in a macro. Posting the commit message here: A macro makes for a little less code duplication. The macro returns a single awaitable future, so the result of the macro can be used in join! calls. I was unable to do this without consuming the grpc client object, so the grpc client object has to be cloned if it's used again later.
LGTM
When there is a network change, tonic behaves as follows:
- After the keepalive timeout, it reconnects automatically.
- Before the keepalive timeout, any grpc call will time out; after the timeout, the connection is reestablished.

This commit adds a mechanism to reconnect all grpc clients after one of the clients detects a network change. Initially the attempt was to retry only based on a keepalive timeout error, but a network change affects all grpc clients, so subsequent requests to other grpc endpoints would still fail with a timeout. Of course those grpc clients could also add a retry-on-timeout, but since it is known at this point that the grpc clients are temporarily dead, they are reconnected immediately. This ensures subsequent calls to other endpoints won't add additional time by waiting for a timeout.

Currently this PR wraps the trampoline_pay, pay, and invoice calls in the greenlight client.

tokio was bumped to allow cloning watch::Sender.