Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unusually high network activity from CKB Light Client. #168

Closed
jordanmack opened this issue Dec 23, 2023 · 24 comments
Closed

Unusually high network activity from CKB Light Client. #168

jordanmack opened this issue Dec 23, 2023 · 24 comments

Comments

@jordanmack
Copy link

jordanmack commented Dec 23, 2023

I run ten testnet CKB light clients on a virtual machine for testing purposes. Today I received a notification from my ISP that I was approaching my usage limit. My router reports several hundred GB of transmission in the last few days, which is very unusual. I traced the increase in network activity to the virtual machine with the light clients. The current version of the light clients running is v0.3.0, which was installed on November 1, 2023.

Below is a screenshot of the host node activity charts for the month. Starting on December 13th, there was an increase in network activity which continued to increase until hitting nearly 3MB/s on December 19th. This network activity is persistent and amounts to hundreds of GB over a few days.
20231222135445

Here are a few minutes of internet network activity for my network.
20231222135247

Looking at a ps of the clients on the host machine. One node has significantly more CPU time than the others. However, there is nothing unusual about the screen output. (This was done after restarting the virtual machine and all the light clients when high network activity was noticed again.)
20231222140417
20231222140558

After killing the client with high CPU time, the bandwidth has dropped significantly. The other 9 light client nodes are still running.
20231222141430

I have turned off this virtual machine for the time being. If you would like me to debug further, please let me know how I should do so.

@jordanmack jordanmack changed the title Persistently high network activity from CKB Light Client. Unusually high network activity from CKB Light Client. Dec 23, 2023
@quake
Copy link
Member

quake commented Dec 23, 2023

are there any difference in the scripts setting for these 10 light client? could you help to call the get_scripts rpc and post result here?

@jordanmack
Copy link
Author

are there any difference in the scripts setting for these 10 light client? could you help to call the get_scripts rpc and post result here?

I do not see any difference. They all return an empty result.

{
	"jsonrpc": "2.0",
	"result": [],
	"id": 42
}

@quake
Copy link
Member

quake commented Dec 23, 2023

could you upgrade to v0.3.3? I suspect it might be related to this bug fix: #163

@jordanmack
Copy link
Author

jordanmack commented Dec 26, 2023

could you upgrade to v0.3.3? I suspect it might be related to this bug fix: #163

I upgraded to v0.3.3 by replacing the binary without deleting the configuration or data files. I no longer see high bandwidth issues. However, I see that one node has very significantly higher CPU usage than the others. This is node 9, where as before it was node 1 with high CPU and high bandwidth.

20231225222220

Output from the console is below.

log.txt

@quake
Copy link
Member

quake commented Dec 26, 2023

are there any diff between config/testnet9.toml and others? could you upload one log file of the normal ones? thanks.

@jordanmack
Copy link
Author

The only differences are the ports used.

testnet8-config.txt
testnet9-config.txt

Here is the console output from client 8, which is running normally.

testnet8-log.txt

@yangby-cryptape
Copy link
Collaborator

yangby-cryptape commented Dec 26, 2023

Was the tip header of node 9 correct? (RPC method: get_tip_header)

Base on current logs, I guess there are abnormal or malicious CKB nodes, they send unexpected responses to the light client.
The light client should ban it, but doesn't.
(This feature is unimplemented now.)

So, I want to confirm whether the light client work properly.


Why I think the remote peer is abnormal or malicious.

Light client won't send such many requests at one time.
In fact, at most 1 request every 10 seconds.
But the node sends 367 responses at "2023-12-25T23:39:00Z".

👉 Or, was there a network storm?


The log message misses a "not".

The warning message should be "is not continuous with last filter block number", the "not" is missed by typo.

I will fix that later.

@jordanmack
Copy link
Author

jordanmack commented Dec 26, 2023

I left all clients running overnight. The elevated CPU activity has stopped now.

20231226104744

The current block on client 9 from get_tip_header is 11,737,540, which is behind the current testnet at 11,739,978.

Below is the most recent console output.

testnet9-log.txt

My router reports that there was a normal amount internet activity during the night. However, I noticed something interesting in the virtual machine dashboard. There was high network activity for several hours, but this time it was only local. I suspect that this is because I have local testnet full node (v0.110.0) which is configured as the sole boot node in the light client configuration.

20231226105614

The clients started in an off state, and you can see above when I turned the clients back on based on the memory usage. Looking at the network traffic, you can see approximately 18 hours of high network activity before stopping by itself. It was receiving 1.6 MBps continuously during this period. This chart is MBps (megabytes per second), not Mbps (megabits per second). This is approximately 100 GB for the 18 hour period.

Here is the output of get_tip_header for all clients. All are caught up except client 9.

{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d08e641","dao":"0x78e44f925a31474a8f2b7252a9c52700528e94370a8ed305006e98e6681ece08","epoch":"0x7080023001ebe","extra_hash":"0x138888b369a0da4f8ac80bab46c2c72948932db8674da901b7e85c749cb5b90b","hash":"0x30e553a752dbb6258d7d141de2a5e014df8ee6a35fdf41a4060c31d74028f8ce","nonce":"0x5facfe88d6e00103809525ad07500319","number":"0xb319c4","parent_hash":"0xaf6a66074c42f16675f9641f69988c0a9514520515275725a3a0f5d27f1b6af1","proposals_hash":"0x13068f7a21a7c4a32c60aa68295ec15d0c0a11165cd0efd5bc9c4b5d30bd9dfd","timestamp":"0x18ca654f1d7","transactions_root":"0x27d8c4cfa7b043f8ce9ed41097376db997e99148779f1f675b127a85a678aba6","version":"0x0"},"id":1}

@yangby-cryptape
Copy link
Collaborator

In the log file "testnet9-log.txt", it print the following log message every 8 seconds:

block number (11737540) in storage is not less than the block number (11737540) which requires proving

So, those log messages mean that remote CKB node (peer SessionId(23)) kept block 11737540 as its tip for hours (at least 2023-12-26T16:45:21Z to 2023-12-26T19:01:15Z).
The light client worked well, but it didn't connect to any peers other than that peer SessionId(23).

@yangby-cryptape
Copy link
Collaborator

yangby-cryptape commented Dec 27, 2023

Could you provide the results of:

  • get_peers for both light client 9 and your sole boot node.
  • sync_state for your sole boot node.

@jordanmack
Copy link
Author

jordanmack commented Dec 27, 2023

get_peers for client 9:

{
	"jsonrpc":"2.0",
	"result":[
		{
			"addresses":[
				{
					"address":"/ip4/192.168.0.73/tcp/18115/p2p/QmQnnZeQnpsABhQEz5hGMoV4jkuDR3cRr6STriTEmBzKFu",
					"score":"0x64"
				}
			],
			"connected_duration":"0x325293e",
			"node_id":"QmQnnZeQnpsABhQEz5hGMoV4jkuDR3cRr6STriTEmBzKFu",
			"protocols":[
				{
					"id":"0x64",
					"version":"2"
				},
				{
					"id":"0x0",
					"version":"2"
				},
				{
					"id":"0x2",
					"version":"2"
				},
				{
					"id":"0x78",
					"version":"2"
				},
				{
					"id":"0x1",
					"version":"2.1"
				},
				{
					"id":"0x79",
					"version":"2"
				},
				{
					"id":"0x4",
					"version":"2"
				}
			],
			"sync_state":{
				"proved_best_known_header":null,
				"requested_best_known_header":null
			},
			"version":"0.110.0 (0679b11 2023-05-16)"
		}
	],
	"id":1
}

All the other clients have long lists of clients for get_peers.

sync_state for local boot node:

{
	"jsonrpc": "2.0",
	"result": {
		"best_known_block_number": "0xb33386",
		"best_known_block_timestamp": "0x18ca97dadf9",
		"fast_time": "0x3e8",
		"ibd": false,
		"inflight_blocks_count": "0x0",
		"low_time": "0x5dc",
		"normal_time": "0x4e2",
		"orphan_blocks_count": "0x0"
	},
	"id": 42
}

@quake
Copy link
Member

quake commented Dec 27, 2023

It looks like your local bootnode is not broadcasting other full node addresses to client 9.
This is weird, all other clients are receiving addresses and are able to connect to more nodes as normal.

Could you upgrade your local bootnode to latest version, then reboot client 9 and try again?

@jordanmack
Copy link
Author

It looks like your local bootnode is not broadcasting other full node addresses to client 9. This is weird, all other clients are receiving addresses and are able to connect to more nodes as normal.

Does any of this explain why something used 100GB of bandwidth?

Could you upgrade your local bootnode to latest version, then reboot client 9 and try again?

Will try next.

@jordanmack
Copy link
Author

The full node and ckb light client 9 are now at the newest available versions.

Client 9 is using more CPU cycles and high bandwidth again. Network activity on the virtual machine is 3.5MB/s. Internet traffic is just over 1.5MB/s. It is probable that it is sustaining 1.5MB/s to the local full node and 1.5MB/s to the external full node.

20231227193813
20231227194541
20231227194554

Here is the current screen output from client 9.

client9-log.txt

Here is the result of get_peers. There is a second peer now.

client9-get-peers.txt

Here is the result of get_tip_header. It was at the tip for the entire time I was tracking.

{
	"jsonrpc": "2.0",
	"result": {
		"compact_target": "0x1d0892c8",
		"dao": "0x9655c8eedbc64f4a20fbaa85c6c62700b165db4ca063d50500c18d82bf7ece08",
		"epoch": "0x70803f7001ec7",
		"extra_hash": "0x9ed419567e868465abbe943ef8a1859764092a89f1f62646569ce69af17b18b4",
		"hash": "0x8e9ef98d1de9eee0fb2a2d8271f29766d5380929898b94ae93c749d372f23983",
		"nonce": "0x33aeecb665c83f6cdb9627e306eb7b49",
		"number": "0xb35ce0",
		"parent_hash": "0x9c16173c7f423767b930ef9d2574272a791a95bd93a7ceccf7c27933d78e7579",
		"proposals_hash": "0x70bb2879cf3a8f386e6d892b2f14b54a9337b9f8418b31ff59a44fccc7f5ae3d",
		"timestamp": "0x18cae80427a",
		"transactions_root": "0x8225a143c7b4939296009344a4c5c7bb4f79e0f66bef6e30dfb2af2dad24041e",
		"version": "0x0"
	},
	"id": 42
}

After stopping client 9, network activity ceases.

@quake
Copy link
Member

quake commented Dec 28, 2023

could you zip the client 9 data folder and upload? I'd like to try to reproduce it in my local environment.

@jordanmack
Copy link
Author

could you zip the client 9 data folder and upload? I'd like to try to reproduce it in my local environment.

data9.tar.gz

@quake
Copy link
Member

quake commented Dec 28, 2023

we have reproduced this abnormal network usage, working on bug fix, will update later, thanks.

@yangby-cryptape
Copy link
Collaborator

yangby-cryptape commented Dec 28, 2023

There are 3 sub-issues for this issue:

  1. Light client won't discover enough nodes.

    Light client require 50% outbound nodes (default max_outbound_peers is 8) to finalize a checkpoint.

    max_outbound_peers = 8

    The reason for this issue is still unsure, but it maybe related to fix: keep peer store address unique ckb#4133.

  2. Light client will sync all block filter hashes after the finalized checkpoint.

    Then, light client will sync all block filter hashes from all connected peers when no enough connected peers, because the first issue which lead to finalized-checkpoint = 0.

    So, it causes a huge bandwidth and CPU cost.

    Ref: fix: don't sync block filter hashes if finalized checkpoint is too less than the last proved header #169

  3. Light clients receive a fixed state from some CKB nodes.

    I checked some CKB nodes which sent the same tip header during hours, their tip headers are correct.

    The state which light client received is the state when light client connected to the CKB node.

    The reason for this issue is still unknown.

@quake
Copy link
Member

quake commented Dec 29, 2023

@jordanmack could you update your light clients' config bootnodes section to the public nodes:
https://github.com/nervosnetwork/ckb-light-client/blob/develop/config/testnet.toml#L18-L43

I think it may resolve this cpu abnormal usage as a temp solution.

@yangby-cryptape
Copy link
Collaborator

yangby-cryptape commented Dec 29, 2023

I add some rules to ban a peer:

Light clients could not distinguish a peer whether it just has a stale state or it is malicious, so just ban it, then find a new peer to instead of it.

p.s. After the few minutes, the ban status will be removed.

@quake
Copy link
Member

quake commented Jan 2, 2024

you may upgrade light client to 0.3.4 now, it will resolve this cpu and network abnormal usage issue:
https://github.com/nervosnetwork/ckb-light-client/releases/tag/v0.3.4

@jordanmack
Copy link
Author

So far everything is running fine. Nothing except for a few long fork errors from two clients on startup.

If I shutdown a client and delete the data/network/peer_store folder, will that reset the known peers and force it to use the boot nodes again?

@jordanmack
Copy link
Author

One other question, do light clients create connections with other light clients for any reason, or do they only connect to full nodes?

@quake
Copy link
Member

quake commented Jan 9, 2024

If I shutdown a client and delete the data/network/peer_store folder, will that reset the known peers and force it to use the boot nodes again?

yes

One other question, do light clients create connections with other light clients for any reason, or do they only connect to full nodes?

only connect to full nodes

@quake quake closed this as completed Jan 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants