
detect and act on network change #1133

Merged
JssDWt merged 11 commits into main from jssdwt-detect-network-change on Nov 30, 2024
Conversation

@JssDWt (Contributor) commented Nov 29, 2024

When there is a network change, tonic behaves as follows:

  • After the keepalive timeout, it reconnects automatically.
  • Before the keepalive timeout, any grpc call times out. After the timeout, the connection is reestablished.

This commit adds a mechanism to reconnect all grpc clients after one of them detects a network change. Initially the attempt was to retry only on a keepalive timeout error, but a network change affects all grpc clients, so subsequent requests to other grpc endpoints would still fail with a timeout. Of course those clients could each add their own retry-on-timeout, but since at this point the grpc clients are known to be temporarily dead, reconnecting them all immediately ensures subsequent calls to other endpoints don't lose additional time waiting for a timeout.

Currently this PR wraps the trampoline_pay, pay, and invoice calls in the greenlight client.

tokio was bumped to 1.41 to allow cloning watch::Sender.
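
As a rough sketch of the pattern described above (names and structure are my assumptions, not the PR's actual code), a shared tokio watch channel lets whichever client detects the change signal every other client to reconnect:

use tokio::sync::watch;

#[tokio::main]
async fn main() {
    // One shared channel: every grpc client holds a clone of the sender
    // and its own receiver.
    let (tx, _rx) = watch::channel(());

    // A client task reconnects whenever a network change is signalled.
    let mut client_rx = tx.subscribe();
    let client_task = tokio::spawn(async move {
        while client_rx.changed().await.is_ok() {
            // Reconnect this grpc client here.
        }
    });

    // Cloning watch::Sender is what required the tokio bump; the client
    // that detects the network change notifies all the others.
    let detector_tx = tx.clone();
    let _ = detector_tx.send(());

    // Dropping all senders lets the client task exit.
    drop(detector_tx);
    drop(tx);
    let _ = client_task.await;
}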

@JssDWt JssDWt requested review from roeierez and dangeross November 29, 2024 12:21
@JssDWt JssDWt force-pushed the jssdwt-detect-network-change branch 2 times, most recently from aebafe7 to 3f905f0 on November 29, 2024 12:28
warn!("greenlight network change detector died.");
}

let res = fallback().await;

Collaborator commented:
Do we need to wait at all to allow time for the GRPC clients to reconnect?

@JssDWt (Author) replied Nov 29, 2024:
It looks like the connection is immediately reusable.

You can test this with the sdk-cli by:

  • connecting the sdk
  • changing to another network (like mobile hotspot)
  • calling send_payment or lnurl_pay with or without --use_trampoline

It will hang for a long time (until Blockstream/greenlight#548 is included) and then continue with the fallback successfully.

@JssDWt JssDWt force-pushed the jssdwt-detect-network-change branch from 3f905f0 to 791985c on November 29, 2024 12:42

@dangeross (Collaborator) left a comment:
LGTM, with a nit

Review thread on libs/sdk-core/src/greenlight/node_api.rs (outdated, resolved)
@JssDWt JssDWt force-pushed the jssdwt-detect-network-change branch from 791985c to 806261f on November 29, 2024 13:41

@JssDWt (Author) commented Nov 29, 2024:

Updated to wrap most grpc calls in with_connection_fallback.
The notification pattern that recreated the other grpc clients when one hung is now gone.

@JssDWt JssDWt requested review from dangeross and roeierez November 29, 2024 21:14

@roeierez (Member) left a comment:
LGTM, with one comment: check whether we can pass only the called function to with_connection_fallback.


pub async fn with_connection_fallback<T, M, F>(
    main: M,
    fallback: impl FnOnce() -> F,

Member commented:
I wonder if we need the "main" future or we can use the fallback also for the first attempt.

@JssDWt (Author) replied:

I tried this; you can try it yourself too.
I passed a client into the function along with a Fn that operates on the client, but got stuck on lifetimes. It would be the cleaner way, though.
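
For context, a hypothetical sketch (not from the PR) of the naive shape that runs into this: the future type F is a plain type parameter, so it cannot borrow from the &mut client the closure receives, and call sites like |c, r| c.some_call(r) are rejected by the borrow checker:

use std::future::Future;

// Compiles as a definition, but call sites whose returned future borrows
// the client fail: F cannot be tied to the lifetime of `&mut C`.
pub async fn with_retry_naive<C, RQ, RS, F>(
    client: &mut C,
    req: RQ,
    f: impl Fn(&mut C, RQ) -> F,
) -> Result<RS, tonic::Status>
where
    RQ: Clone,
    F: Future<Output = Result<RS, tonic::Status>>,
{
    match f(&mut *client, req.clone()).await {
        // Naive unconditional retry on any error, for illustration only.
        Err(_status) => f(&mut *client, req).await,
        ok => ok,
    }
}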

Member replied:
I think we can use a macro here so we can use it like this:

retry! {
 client.lsp_list(request.clone()).await
}

I will give it a try

Member replied:
It seems that this macro should do it:

#[macro_export]
macro_rules! retry_grpc {
    ($f:expr) => {{
        use log::debug;
        use std::error::Error;

        // Error strings that indicate the underlying connection is broken
        // and a single immediate retry is worthwhile.
        const BROKEN_CONNECTION_STRINGS: [&str; 3] = [
            "http2 error: keep-alive timed out",
            "connection error: address not available",
            "connection error: timed out",
        ];

        let res = $f;
        match res {
            Ok(t) => Ok(t),
            Err(status) => {
                let mut returned = Err(status.clone());
                if let Some(source) = status.source() {
                    if let Some(error) = source.downcast_ref::<tonic::transport::Error>() {
                        if error.to_string() == "transport error" {
                            if let Some(source) = error.source() {
                                if BROKEN_CONNECTION_STRINGS.contains(&source.to_string().as_str()) {
                                    debug!("retry_grpc: initial call failed due to broken connection. Retrying.");
                                    // $f expands again here, so the call is re-evaluated.
                                    returned = $f;
                                }
                            }
                        }
                    }
                }
                returned
            }
        }
    }};
}

Then it can be used as:

retry_grpc! {
 client.lsp_list(request.clone()).await
}

@JssDWt (Author) commented Nov 29, 2024:

Payment was still getting stuck. Added more wrappers.

@JssDWt (Author) commented Nov 30, 2024:

@roeierez I found another way to make the compiler happy by passing a single function. What do you think?

pub async fn with_connection_retry<C, RQ, RS>(
    client: &mut C,
    req: RQ,
    f: impl for<'c> Fn(
        &'c mut C,
        RQ,
    ) -> Pin<Box<dyn Future<Output = Result<tonic::Response<RS>, tonic::Status>> + Send + 'c>>,
) -> Result<tonic::Response<RS>, tonic::Status>
where
    RQ: std::fmt::Debug + Clone,
    RS: std::fmt::Debug,
{
    let res = f(client, req.clone()).await;
    ...
    f(client, req).await
}

let chain_api_servers = with_connection_retry(
    &mut client,
    ChainApiServersRequest {},
    |client, req| Box::pin(client.chain_api_servers(req)),
)
.await
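
For illustration, the elided `...` amounts to a guarded single retry. A sketch, assuming a hypothetical helper is_broken_connection that mirrors the error check from the retry_grpc draft above (not the merged code):

use std::error::Error;

// Hypothetical helper: true when the status wraps a tonic transport
// error whose source matches one of the broken-connection strings
// from the retry_grpc draft.
fn is_broken_connection(status: &tonic::Status) -> bool {
    const BROKEN_CONNECTION_STRINGS: [&str; 3] = [
        "http2 error: keep-alive timed out",
        "connection error: address not available",
        "connection error: timed out",
    ];
    status
        .source()
        .and_then(|s| s.downcast_ref::<tonic::transport::Error>())
        .filter(|e| e.to_string() == "transport error")
        .and_then(|e| e.source())
        .map(|s| BROKEN_CONNECTION_STRINGS.contains(&s.to_string().as_str()))
        .unwrap_or(false)
}

// The body of with_connection_retry could then reduce to:
//
//     match f(client, req.clone()).await {
//         Err(status) if is_broken_connection(&status) => f(client, req).await,
//         other => other,
//     }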

@JssDWt JssDWt force-pushed the jssdwt-detect-network-change branch from 41f8860 to 2857569 on November 30, 2024 12:33
@JssDWt JssDWt requested a review from roeierez November 30, 2024 12:35

@JssDWt (Author) commented Nov 30, 2024:

Now the retry logic is wrapped in a macro. Posting the commit message here:

A macro makes for a little less code duplication. The macro returns a
single awaitable future, so the result of the macro can be used in join!
calls. I was unable to do this without consuming the grpc client object,
so the grpc client object has to be cloned if it's used again later.
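
A minimal, self-contained sketch of that design point (the macro below is a toy stand-in, not the PR's macro): because the wrapper expands to one plain future, its results compose with tokio::join!, and a real grpc client would be cloned per wrapped call since the macro consumes it:

// Toy stand-in: the real macro also retries on broken connections;
// this only shows the "expands to one awaitable future" shape.
macro_rules! with_connection_retry {
    ($f:expr) => {
        async { $f.await }
    };
}

async fn fake_call(x: u32) -> Result<u32, &'static str> {
    Ok(x)
}

#[tokio::main]
async fn main() {
    // The futures produced by the macro can be polled concurrently with
    // join!. With real tonic clients, each call would get its own clone
    // of the client, because the macro consumes the client expression.
    let (a, b) = tokio::join!(
        with_connection_retry!(fake_call(1)),
        with_connection_retry!(fake_call(2)),
    );
    println!("{a:?} {b:?}");
}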

@roeierez (Member) left a comment:
LGTM

@JssDWt JssDWt merged commit 2857569 into main Nov 30, 2024
9 checks passed
@JssDWt JssDWt mentioned this pull request Dec 2, 2024