Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: blockchain::esplora::bdk_blockchain_tests::test_sync_stop_gap_20 occationally fails #650

Open
evanlinjin opened this issue Jul 2, 2022 · 18 comments
Labels
bug Something isn't working

Comments

@evanlinjin
Copy link
Member

Describe the bug

CI: blockchain::esplora::bdk_blockchain_tests::test_sync_stop_gap_20 occationally fails.

The command that fails

$ cargo test --no-default-features --features test-esplora,use-esplora-ureq,verify esplora::bdk_blockchain_tests

Occurences

@evanlinjin evanlinjin added the bug Something isn't working label Jul 2, 2022
@rajarshimaitra
Copy link
Contributor

I have been witnessing few occasional failures in blockchain tests in my local too.. Seems like some kind of resource blocking thingy happening somewhere?? I was seeing "resource unavailable" errors few days ago when running blockchian tests all together.. Which I can't seem to reproduce anymore.. Could be something related??

From the error message it seems something got timed out while waiting for some stuffs..

@evanlinjin
Copy link
Member Author

evanlinjin commented Jul 3, 2022

I have been witnessing few occasional failures in blockchain tests in my local too.. Seems like some kind of resource blocking thingy happening somewhere?? I was seeing "resource unavailable" errors few days ago when running blockchian tests all together.. Which I can't seem to reproduce anymore.. Could be something related??

From the error message it seems something got timed out while waiting for some stuffs..

Same here. Would it make sense to implement some retry logic/exponential backoff?

We can keep track of the errors and where they occur here.

@rajarshimaitra
Copy link
Contributor

Same here. Would it make sense to implement some retry logic/exponential backoff?

We do have an exponential pole at electrum calls

fn exponential_backoff_poll<T, F>(mut poll: F) -> T
where
F: FnMut() -> Option<T>,
{
let mut delay = Duration::from_millis(64);
loop {
match poll() {
Some(data) => break data,
None if delay.as_millis() < 512 => delay = delay.mul_f32(2.0),
None => {}
}
std::thread::sleep(delay);
}
}

To me it seems for some reason the electrum calls isn't returning and some resources are not being free??

The error seems to disappear in single test runs.. Only when I run all the blockchian tests together..

@afilini
Copy link
Member

afilini commented Jul 4, 2022

This looks like a bug in electrsd:

[2022-07-02T18:30:55Z ERROR electrsd] early exit with: ExitStatus(unix_wait_status(134))

Maybe it doesn't handle too much load well. I'll see if I can somehow attach a debugger and reproduce.

@evanlinjin
Copy link
Member Author

This is a one-off failure, not sure if it is related.

failures:

---- blockchain::esplora::bdk_blockchain_tests::test_double_spend stdout ----
thread 'blockchain::esplora::bdk_blockchain_tests::test_double_spend' panicked at 'called `Result::unwrap()` on an `Err` value: JSON(Error("expected value", line: 1, column: 1))', src/testutils/blockchain_tests.rs:65:56
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    blockchain::esplora::bdk_blockchain_tests::test_double_spend

@afilini
Copy link
Member

afilini commented Jul 4, 2022

Maybe it's related to this? bitcoindevkit/rust-electrum-client#67

@evanlinjin
Copy link
Member Author

failures:

---- blockchain::esplora::bdk_blockchain_tests::test_sync_address_index_should_not_decrement stdout ----
thread 'blockchain::esplora::bdk_blockchain_tests::test_sync_address_index_should_not_decrement' panicked at 'called `Result::unwrap()` on an `Err` value: EarlyExit(ExitStatus(unix_wait_status(134)))', src/testutils/blockchain_tests.rs:36:75
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    blockchain::esplora::bdk_blockchain_tests::test_sync_address_index_should_not_decrement

test result: FAILED. 256 passed; 1 failed; 5 ignored; 0 measured; 0 filtered out; finished in 6.41s

@evanlinjin
Copy link
Member Author

 failures:
error: test failed, to rerun pass '--lib'
---- blockchain::esplora::bdk_blockchain_tests::test_tx_chain stdout ----
thread 'blockchain::esplora::bdk_blockchain_tests::test_tx_chain' panicked at 'called `Result::unwrap()` on an `Err` value: EarlyExit(ExitStatus(unix_wait_status(134)))', src/testutils/blockchain_tests.rs:36:75
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
    blockchain::esplora::bdk_blockchain_tests::test_tx_chain

@evanlinjin
Copy link
Member Author

---- blockchain::electrum::test::bdk_blockchain_tests::test_sync_bump_fee_basic stdout ----
thread 'blockchain::electrum::test::bdk_blockchain_tests::test_sync_bump_fee_basic' panicked at 'called `Result::unwrap()` on an `Err` value: IOError(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })', src/testutils/blockchain_tests.rs:65:56
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@evanlinjin
Copy link
Member Author

This one is the craziest yet: https://gist.github.com/evanlinjin/3079251115e64edc355f46d0d63edb1e

@rajarshimaitra
Copy link
Contributor

Yes.. I have been seeing this one in my local..

@evanlinjin
Copy link
Member Author

Yes.. I have been seeing this one in my local..

I mean if the error states "Resource TEMPORARILY unavailable", I think we can safely wrap the calls with exponential back-off logic. I did see a function in the tests for exponential back-off, so maybe we can reuse that.

@afilini
Copy link
Member

afilini commented Jul 14, 2022

My guess: the tcp send buffer gets filled up because core can't process all those requests at once. Increasing the number of threads would probably help.

Source https://linux.die.net/man/2/send:

When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in nonblocking I/O mode. In nonblocking mode it would fail with the error EAGAIN or EWOULDBLOCK in this case

afilini added a commit to afilini/bdk that referenced this issue Aug 9, 2022
Whenever the transport errors with "WouldBlock" we wait a bit and retry
the call.

Related to bitcoindevkit#650, should at least fix the errors with RPC
afilini added a commit to afilini/bdk that referenced this issue Aug 10, 2022
Whenever the transport errors with "WouldBlock" we wait a bit and retry
the call.

Related to bitcoindevkit#650, should at least fix the errors with RPC
evanlinjin pushed a commit to evanlinjin/bdk that referenced this issue Sep 1, 2022
Whenever the transport errors with "WouldBlock" we wait a bit and retry
the call.

Related to bitcoindevkit#650, should at least fix the errors with RPC
@notmandatory
Copy link
Member

I ran into this issue testing bitcoindevkit/bdk-cli#102 with the new regtest-bitcoin feature which starts up a local bitcoind and connects to it via the RPC client, but the error only seemed to happen on my MacOS system, the CI tests all pass on Ubuntu Linux. While debugging I noticed in the bitcoind logs a list of RPC requests to "Update existing descriptor: raw(...)" seems to be the default 100 address look ahead, so I think the requests are all getting successfully delivered but the RPC client connection disconnects always after 5 seconds and throws the SocketError "Resource temporarily unavailable". I'll try making the bitcoind server responded faster (as @afilini suggested with more threads), and separately see if I can make the RPC client timeout longer and report back.

@notmandatory
Copy link
Member

I re-tried my test with more rpcthreads set on bitcoind with conf set to via "-rpcthread=", I tired 20 and 50, didn't seem to make a difference. I haven't find a way yet to change the RPC call socket timeout.

@notmandatory
Copy link
Member

notmandatory commented Sep 10, 2022

Probably best to disregard my above RPC problems since the title of this issue is really about esplora. I'll open a new issue for my RPC macos problem.

@notmandatory
Copy link
Member

Could this issue be related to the default number for file descriptors running out? This is mentioned for core here: JSON-RPC-interface limitations.

@rajarshimaitra
Copy link
Contributor

rajarshimaitra commented Mar 26, 2023

As per https://docs.oracle.com/cd/E19476-01/821-0505/file-descriptor-requirements.html, the default limit for file descriptors per process is 1024. I am not fully sure, but I think we are not hitting that many RPC requests.

Another pointer from LDK team was, it should just redo the call again if it hits the error. lightningdevkit/rust-lightning#2033 (comment)

I will try with this on the new RPC example and report back..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Todo
Development

Successfully merging a pull request may close this issue.

4 participants