
detect and act on network change #1133

Merged
JssDWt merged 11 commits into main from jssdwt-detect-network-change on Nov 30, 2024
Conversation

@JssDWt (Contributor) commented Nov 29, 2024

When there is a network change, tonic behaves as follows:

  • After the keepalive timeout, it reconnects automatically.
  • Before the keepalive timeout, any grpc call times out. After the timeout, the connection is reestablished.

This commit adds a mechanism to reconnect all grpc clients after one of them detects a network change. Initially the attempt was to retry only on a keepalive timeout error, but a network change affects all grpc clients, so subsequent requests to other grpc endpoints would still fail with a timeout. Of course those clients could each add their own retry-on-timeout, but since at this point the grpc clients are known to be temporarily dead, reconnecting them all immediately ensures subsequent calls to other endpoints don't lose additional time waiting for a timeout.

Currently this PR wraps the trampoline_pay, pay, and invoice calls in the greenlight client.

tokio was bumped to 1.41 to allow cloning watch::Sender.
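
As a rough sketch of the pattern described above (names and structure are my assumptions, not the PR's actual code), a shared tokio watch channel lets whichever client detects the change signal every other client to reconnect:

use tokio::sync::watch;

#[tokio::main]
async fn main() {
    // One shared channel: every grpc client holds a clone of the sender
    // and its own receiver.
    let (tx, _rx) = watch::channel(());

    // A client task reconnects whenever a network change is signalled.
    let mut client_rx = tx.subscribe();
    let client_task = tokio::spawn(async move {
        while client_rx.changed().await.is_ok() {
            // Reconnect this grpc client here.
        }
    });

    // Cloning watch::Sender is what required the tokio bump; the client
    // that detects the network change notifies all the others.
    let detector_tx = tx.clone();
    let _ = detector_tx.send(());

    // Dropping all senders lets the client task exit.
    drop(detector_tx);
    drop(tx);
    let _ = client_task.await;
}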

@JssDWt JssDWt requested review from roeierez and dangeross November 29, 2024 12:21
@JssDWt JssDWt force-pushed the jssdwt-detect-network-change branch 2 times, most recently from aebafe7 to 3f905f0 on November 29, 2024 12:28
warn!("greenlight network change detector died.");
}

let res = fallback().await;

Collaborator commented:
Do we need to wait at all to allow time for the GRPC clients to reconnect?

@JssDWt (Author) replied Nov 29, 2024:
It looks like the connection is immediately reusable.

You can test this with the sdk-cli by:

  • connecting the sdk
  • changing to another network (like mobile hotspot)
  • calling send_payment or lnurl_pay with or without --use_trampoline

It will hang for a long time (until Blockstream/greenlight#548 is included) and then continue with the fallback successfully.

@JssDWt JssDWt force-pushed the jssdwt-detect-network-change branch from 3f905f0 to 791985c on November 29, 2024 12:42

@dangeross (Collaborator) left a comment:
LGTM, with a nit

Review thread on libs/sdk-core/src/greenlight/node_api.rs (outdated, resolved)
@JssDWt JssDWt force-pushed the jssdwt-detect-network-change branch from 791985c to 806261f on November 29, 2024 13:41

@JssDWt (Author) commented Nov 29, 2024:

Updated to wrap most grpc calls in with_connection_fallback.
The notification pattern that recreated the other grpc clients when one hung is now gone.

@JssDWt JssDWt requested review from dangeross and roeierez November 29, 2024 21:14

@roeierez (Member) left a comment:
LGTM, with one comment: check whether we can pass only the called function to with_connection_fallback.


pub async fn with_connection_fallback<T, M, F>(
    main: M,
    fallback: impl FnOnce() -> F,

Member commented:
I wonder if we need the "main" future or we can use the fallback also for the first attempt.

@JssDWt (Author) replied:

I tried this; you can try it yourself too.
I passed a client into the function along with a Fn that operates on the client, but got stuck on lifetimes. It would be the cleaner way, though.
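
For context, a hypothetical sketch (not from the PR) of the naive shape that runs into this: the future type F is a plain type parameter, so it cannot borrow from the &mut client the closure receives, and call sites like |c, r| c.some_call(r) are rejected by the borrow checker:

use std::future::Future;

// Compiles as a definition, but call sites whose returned future borrows
// the client fail: F cannot be tied to the lifetime of `&mut C`.
pub async fn with_retry_naive<C, RQ, RS, F>(
    client: &mut C,
    req: RQ,
    f: impl Fn(&mut C, RQ) -> F,
) -> Result<RS, tonic::Status>
where
    RQ: Clone,
    F: Future<Output = Result<RS, tonic::Status>>,
{
    match f(&mut *client, req.clone()).await {
        // Naive unconditional retry on any error, for illustration only.
        Err(_status) => f(&mut *client, req).await,
        ok => ok,
    }
}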

Member replied:
I think we can use a macro here so we can use it like this:

retry! {
 client.lsp_list(request.clone()).await
}

I will give it a try

Member replied:
It seems that this macro should do it:

#[macro_export]
macro_rules! retry_grpc {
    ($f:expr) => {{
        use log::debug;
        use std::error::Error;

        // Error strings that indicate the underlying connection is broken
        // and a single immediate retry is worthwhile.
        const BROKEN_CONNECTION_STRINGS: [&str; 3] = [
            "http2 error: keep-alive timed out",
            "connection error: address not available",
            "connection error: timed out",
        ];

        let res = $f;
        match res {
            Ok(t) => Ok(t),
            Err(status) => {
                let mut returned = Err(status.clone());
                if let Some(source) = status.source() {
                    if let Some(error) = source.downcast_ref::<tonic::transport::Error>() {
                        if error.to_string() == "transport error" {
                            if let Some(source) = error.source() {
                                if BROKEN_CONNECTION_STRINGS.contains(&source.to_string().as_str()) {
                                    debug!("retry_grpc: initial call failed due to broken connection. Retrying.");
                                    // $f expands again here, so the call is re-evaluated.
                                    returned = $f;
                                }
                            }
                        }
                    }
                }
                returned
            }
        }
    }};
}

Then it can be used as:

retry_grpc! {
 client.lsp_list(request.clone()).await
}

@JssDWt (Author) commented Nov 29, 2024:

Payment was still getting stuck. Added more wrappers.

@JssDWt (Author) commented Nov 30, 2024:

@roeierez I found another way to make the compiler happy by passing a single function. What do you think?

pub async fn with_connection_retry<C, RQ, RS>(
    client: &mut C,
    req: RQ,
    f: impl for<'c> Fn(
        &'c mut C,
        RQ,
    ) -> Pin<Box<dyn Future<Output = Result<tonic::Response<RS>, tonic::Status>> + Send + 'c>>,
) -> Result<tonic::Response<RS>, tonic::Status>
where
    RQ: std::fmt::Debug + Clone,
    RS: std::fmt::Debug,
{
    let res = f(client, req.clone()).await;
    ...
    f(client, req).await
}

let chain_api_servers = with_connection_retry(
    &mut client,
    ChainApiServersRequest {},
    |client, req| Box::pin(client.chain_api_servers(req)),
)
.await
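
For illustration, the elided `...` amounts to a guarded single retry. A sketch, assuming a hypothetical helper is_broken_connection that mirrors the error check from the retry_grpc draft above (not the merged code):

use std::error::Error;

// Hypothetical helper: true when the status wraps a tonic transport
// error whose source matches one of the broken-connection strings
// from the retry_grpc draft.
fn is_broken_connection(status: &tonic::Status) -> bool {
    const BROKEN_CONNECTION_STRINGS: [&str; 3] = [
        "http2 error: keep-alive timed out",
        "connection error: address not available",
        "connection error: timed out",
    ];
    status
        .source()
        .and_then(|s| s.downcast_ref::<tonic::transport::Error>())
        .filter(|e| e.to_string() == "transport error")
        .and_then(|e| e.source())
        .map(|s| BROKEN_CONNECTION_STRINGS.contains(&s.to_string().as_str()))
        .unwrap_or(false)
}

// The body of with_connection_retry could then reduce to:
//
//     match f(client, req.clone()).await {
//         Err(status) if is_broken_connection(&status) => f(client, req).await,
//         other => other,
//     }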

@JssDWt JssDWt force-pushed the jssdwt-detect-network-change branch from 41f8860 to 2857569 on November 30, 2024 12:33
@JssDWt JssDWt requested a review from roeierez November 30, 2024 12:35

@JssDWt (Author) commented Nov 30, 2024:

Now the retry logic is wrapped in a macro. Posting the commit message here:

A macro makes for a little less code duplication. The macro returns a
single awaitable future, so the result of the macro can be used in join!
calls. I was unable to do this without consuming the grpc client object,
so the grpc client object has to be cloned if it's used again later.
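
A minimal, self-contained sketch of that design point (the macro below is a toy stand-in, not the PR's macro): because the wrapper expands to one plain future, its results compose with tokio::join!, and a real grpc client would be cloned per wrapped call since the macro consumes it:

// Toy stand-in: the real macro also retries on broken connections;
// this only shows the "expands to one awaitable future" shape.
macro_rules! with_connection_retry {
    ($f:expr) => {
        async { $f.await }
    };
}

async fn fake_call(x: u32) -> Result<u32, &'static str> {
    Ok(x)
}

#[tokio::main]
async fn main() {
    // The futures produced by the macro can be polled concurrently with
    // join!. With real tonic clients, each call would get its own clone
    // of the client, because the macro consumes the client expression.
    let (a, b) = tokio::join!(
        with_connection_retry!(fake_call(1)),
        with_connection_retry!(fake_call(2)),
    );
    println!("{a:?} {b:?}");
}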

@roeierez (Member) left a comment:
LGTM

@JssDWt JssDWt merged commit 2857569 into main Nov 30, 2024
9 checks passed
@JssDWt JssDWt mentioned this pull request Dec 2, 2024