Use multiple CellDb instances to read from celldb concurrently #1363

Open. Wants to merge 5 commits into base: master

@fatcat22 commented Nov 5, 2024

Background

When running a TON liteserver that we built ourselves, we noticed that as the request rate increases, some requests (such as GetAccountState) become slower and many fail with timeout errors.

After investigating, we found that most of the processing time for a single GetAccountState request is spent waiting for CellDb::load_cell to be scheduled. Below are the timing statistics, collected with instrumentation we added, for one specific GetAccountState request:

ton-node-1  | [ 2][t27][2024-10-30 06:04:20.001725500][query_stat.cpp:105][!litequery]        query stat counter:1205. perform_getAccountState schedule cost: 1463946μs
ton-node-1  | ValidatorManagerImpl::get_block_data_for_litequery schedule cost: 19310μs
ton-node-1  | LiteQuery::request_mc_block_data cost: 9μs. 
ton-node-1  | ValidatorManagerImpl::get_block_state_for_litequery schedule cost: 19319μs
ton-node-1  | LiteQuery::request_mc_block_state cost: 0μs. 
ton-node-1  | LiteQuery::request_mc_block_data_state cost: 30μs. 
ton-node-1  | ValidatorManagerImpl::get_block_data_from_db schedule cost: 30753μs
ton-node-1  | ValidatorManagerImpl::get_block_handle cost: 1μs. 
ton-node-1  | ValidatorManagerImpl::get_block_handle_for_litequery cost: 1μs. 
ton-node-1  | ValidatorManagerImpl::get_block_data_for_litequery cost: 3μs. 
ton-node-1  | ValidatorManagerImpl::get_shard_state_from_db schedule cost: 30886μs
ton-node-1  | ValidatorManagerImpl::get_block_handle cost: 1μs. 
ton-node-1  | ValidatorManagerImpl::get_block_handle_for_litequery cost: 1μs. 
ton-node-1  | ValidatorManagerImpl::get_block_state_for_litequery cost: 1μs. 
ton-node-1  | RootDb::get_block_data schedule cost: 9480μs
ton-node-1  | RootDb::get_block_state schedule cost: 9547μs
ton-node-1  | ValidatorManagerImpl::get_shard_state_from_db cost: 0μs. 
ton-node-1  | ArchiveManager::get_file schedule cost: 2197μs
ton-node-1  | RootDb::get_block_data cost: 0μs. 
ton-node-1  | CellDb::load_cell schedule cost: 1400909μs
ton-node-1  | RootDb::get_block_state cost: 4μs. 
ton-node-1  | ArchiveSlice::get_file schedule cost: 568μs
ton-node-1  | ArchiveManager::get_file cost: 1μs. 
ton-node-1  | PackageReader reader schedule cost: 20μs
ton-node-1  | ArchiveSlice::get_file cost: 8μs. 
ton-node-1  | LiteQuery::got_mc_block_data schedule cost: 56μs
ton-node-1  | PackageReader::start_up cost: 181μs. 
ton-node-1  | LiteQuery::got_mc_block_data cost: 0μs. 
ton-node-1  | LiteQuery::got_mc_block_state schedule cost: 1516μs
ton-node-1  | CellDb::load_cell cost: 1477μs. 
ton-node-1  | LiteQuery::finish_query cost: 6μs. 

We can see that perform_getAccountState took 1463946μs in total, of which 1400909μs (about 96%) was the scheduling delay of CellDb::load_cell. In other words, almost all of the time was spent waiting for CellDb::load_cell to be scheduled, while the call itself ran for only 1477μs.
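To make the breakdown explicit, here is a minimal Python sketch that extracts the "schedule cost" entries from a few of the log lines above and reports which stage dominates. The regular expression and the helper name schedule_costs are illustrative assumptions and not part of the TON codebase. Treating the perform_getAccountState figure as the end-to-end total follows the reading given above.

```python
import re

# Matches e.g. "CellDb::load_cell schedule cost: 1400909μs".
# Accepts both the Greek mu and the micro sign, since logs may use either.
SCHEDULE_RE = re.compile(r"(\S+) schedule cost: (\d+)[μµ]s")


def schedule_costs(log_text):
    """Return {stage name: scheduling delay in microseconds}."""
    return {m.group(1): int(m.group(2)) for m in SCHEDULE_RE.finditer(log_text)}


# A few lines copied from the log above (the first line is shortened).
LOG = """\
[!litequery] query stat counter:1205. perform_getAccountState schedule cost: 1463946μs
ton-node-1  | ValidatorManagerImpl::get_shard_state_from_db schedule cost: 30886μs
ton-node-1  | RootDb::get_block_state schedule cost: 9547μs
ton-node-1  | CellDb::load_cell schedule cost: 1400909μs
ton-node-1  | CellDb::load_cell cost: 1477μs.
"""

costs = schedule_costs(LOG)
total_us = costs.pop("perform_getAccountState")  # end-to-end figure for the request
worst = max(costs, key=costs.get)                # stage with the largest delay
share = 100 * costs[worst] / total_us
print(f"{worst}: {costs[worst]}μs of {total_us}μs ({share:.1f}%)")
```

Run against the full log, the same approach lists every stage, which makes it easy to confirm that no other scheduling delay comes close to that of CellDb::load_cell.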

Fix

As we know, tasks sent to the same actor ID are executed one at a time. Because there is only one CellDb actor, all load_cell operations are queued and executed sequentially. Under load, too many load_cell operations are waiting in that queue, which is why the CellDb::load_cell scheduling delay is so large.

So the solution is clear: we increase the number of CellDb objects so that CellDb::load_cell calls can execute concurrently.
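The queueing effect described above can be illustrated with a small deterministic model. This is only a sketch under stated assumptions: the Actor class is a toy, not the TON actor framework; round-robin dispatch is an assumed policy (this description does not say how requests are distributed across the CellDb objects); and the model ignores any contention inside the underlying storage, so real gains may be smaller.

```python
class Actor:
    """Toy actor: handles one message at a time, in arrival order."""

    def __init__(self):
        self.busy_until_us = 0  # simulated time at which the mailbox drains

    def send(self, now_us, service_us):
        """Enqueue one task; return how long it waits before it starts."""
        start_us = max(now_us, self.busy_until_us)
        self.busy_until_us = start_us + service_us
        return start_us - now_us


def schedule_delays(num_actors, num_requests, service_us):
    """Scheduling delay of each request when all arrive at time 0."""
    actors = [Actor() for _ in range(num_actors)]
    delays = []
    for i in range(num_requests):
        actor = actors[i % num_actors]  # round-robin dispatch (assumed policy)
        delays.append(actor.send(0, service_us))
    return delays


# Hypothetical load: 1000 simultaneous load_cell calls of 1477 μs each.
# 1477 μs is the CellDb::load_cell cost from the log; 1000 is a made-up count.
for n in (1, 4, 8):
    delays = schedule_delays(n, 1000, 1477)
    print(f"{n} actor(s): worst scheduling delay {max(delays)} μs")
```

With a single actor the worst-case delay grows linearly with the queue length. With N actors each queue is roughly N times shorter, which is the effect this change relies on.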

Result

Below are our test results. The test sends 5000 GetAccountState requests simultaneously, records the response time of each request, and then computes the number of timeout errors and the average response time.

before optimization:

request count: 5000. failed count: 3292. failed rate: 65.84%. avg cost: 3804ms

after optimization:

request count: 5000. failed count: 0. failed rate: 0.00%. avg cost: 1021ms

We can see that after enabling concurrent CellDb::load_cell calls, the failure rate dropped from 65.84% to 0% and the average response time dropped from about 3.8 seconds to about 1 second.
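For anyone who wants to reproduce a comparable measurement, the test method described above can be sketched roughly as follows. This is not the client we used. The function names, the thread pool size, the timeout value, and the choice to include failed requests in the average are all assumptions, and fake_request is a stand-in for a real GetAccountState call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def format_report(count, failed, avg_ms):
    """Format a summary in the same shape as the result lines above."""
    rate = 100 * failed / count
    return (f"request count: {count}. failed count: {failed}. "
            f"failed rate: {rate:.2f}%. avg cost: {avg_ms:.0f}ms")


def run_load_test(send_request, count, timeout_s, workers=64):
    """Call send_request(i) for i in range(count) from a thread pool.

    A request counts as failed if it raises or takes longer than timeout_s.
    Returns (failed_count, average_latency_ms over all requests).
    """
    def timed_call(i):
        start = time.monotonic()
        try:
            send_request(i)
            ok = True
        except Exception:
            ok = False
        elapsed_s = time.monotonic() - start
        return ok and elapsed_s <= timeout_s, elapsed_s * 1000

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(timed_call, range(count)))

    failed = sum(1 for ok, _ in results if not ok)
    avg_ms = sum(ms for _, ms in results) / count
    return failed, avg_ms


def fake_request(i):
    """Stand-in for a real GetAccountState call; every 4th request fails."""
    if i % 4 == 0:
        raise TimeoutError("simulated timeout")


failed, avg_ms = run_load_test(fake_request, count=100, timeout_s=1.0)
print(format_report(100, failed, avg_ms))
```

A thread pool only approximates "simultaneous" requests. A client that opens all connections up front would put more pressure on the CellDb queue.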

@EmelyanenkoK (Member)

Nice analysis, we will check
