Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(consensus): reuse db and prevent InsufficientPeers error #2242

Closed
wants to merge 2 commits into from

Conversation

asmaastarkware
Copy link
Contributor

@asmaastarkware asmaastarkware commented Jul 22, 2024

This change is Reviewable

Copy link

codecov bot commented Jul 22, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.

Project coverage is 66.34%. Comparing base (2db883b) to head (647e080).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
crates/papyrus_node/src/main.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2242      +/-   ##
==========================================
+ Coverage   66.31%   66.34%   +0.02%     
==========================================
  Files         139      139              
  Lines       18341    18387      +46     
  Branches    18341    18387      +46     
==========================================
+ Hits        12163    12199      +36     
- Misses       4883     4891       +8     
- Partials     1295     1297       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@asmaastarkware asmaastarkware force-pushed the asmaa/consensus_use_same_db branch from 5f948dd to a76a409 Compare July 23, 2024 06:33
@asmaastarkware asmaastarkware changed the title chore(consensus): use previous db and prevent InsufficientPeers error chore(consensus): reuse db and prevent InsufficientPeers error Jul 23, 2024
Copy link
Contributor

@matan-starkware matan-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @asmaastarkware)


crates/sequencing/papyrus_consensus/run_consensus.py line 139 at r1 (raw file):

    # 1. The bootnode (validator 1) is started first for network peering.
    # 2. Validators 2+ are started next to join the network through the bootnode.
    # 3. Validator 0, which is the proposer, is started last so the validators don't miss the proposals.

Why did you remove this?

Code quote:

    # Validators are started in a specific order to ensure proper network formation:
    # 1. The bootnode (validator 1) is started first for network peering.
    # 2. Validators 2+ are started next to join the network through the bootnode.
    # 3. Validator 0, which is the proposer, is started last so the validators don't miss the proposals.

crates/sequencing/papyrus_consensus/run_consensus.py line 149 at r1 (raw file):

    return nodes

def acquire_lock(data_dir):

Have you tested this both succeeding and failing?

Code quote:

def acquire_lock(data_dir):

crates/sequencing/papyrus_consensus/run_consensus.py line 165 at r1 (raw file):

    logs_dir = tempfile.mkdtemp()

    if db_dir:

Suggestion:

    if db_dir is not None:

crates/sequencing/papyrus_consensus/run_consensus.py line 166 at r1 (raw file):

    if db_dir:
        assert os.path.exists(db_dir), f"The specified directory {db_dir} does not exist."

listdir triggers FileNotFoundError which I think is descriptive enough.


crates/sequencing/papyrus_consensus/run_consensus.py line 167 at r1 (raw file):

    if db_dir:
        assert os.path.exists(db_dir), f"The specified directory {db_dir} does not exist."
        data_dirs = [d for d in os.listdir(db_dir) if os.path.isdir(os.path.join(db_dir, d))]

Just create here as a set.

Suggestion:

        data_dirs = {d for d in os.listdir(db_dir) if os.path.isdir(os.path.join(db_dir, d))}

crates/sequencing/papyrus_consensus/run_consensus.py line 172 at r1 (raw file):

        assert (
            len(data_dirs) == num_validators
        ), f"The specified directory {db_dir} must contain exactly {num_validators} validator directories."

This is not needed based on the expected_dirs check below.


crates/sequencing/papyrus_consensus/run_consensus.py line 176 at r1 (raw file):

        # Ensure the directories are named data0, data1, ..., data{num_validators - 1}
        expected_dirs = {f"data{i}" for i in range(num_validators)}
        actual_dirs = set(data_dirs)

Create as set above


crates/sequencing/papyrus_consensus/run_consensus.py line 180 at r1 (raw file):

        assert (
            expected_dirs == actual_dirs
        ), f"The directories in {db_dir} must be named {', '.join(expected_dirs)}."

We want to know that there are enough db folders. We don't care if there are also other folders.

Suggestion:

        assert (
            expected_dirs.issubset(actual_dirs)
        ), f"The directories in {db_dir} must be named {', '.join(expected_dirs)}."

crates/sequencing/papyrus_consensus/run_consensus.py line 187 at r1 (raw file):

    # Acquire lock on the db_dir
    lock_fd = acquire_lock(db_dir)

Using enter and exit to enable a with statement is a more pythonic way. Means you don't need to put the close explicitly here.

Suggestion:

with LockFile(db_dif) as lockfile:

crates/sequencing/papyrus_consensus/run_consensus.py line 209 at r1 (raw file):

    parser.add_argument("--base_layer_node_url", required=True)
    parser.add_argument("--num_validators", type=int, required=True)
    parser.add_argument("--db_dir", required=False, default=None)

Let's add a description:

Directory with existing DBs that this simulation can reuse.

Code quote:

    parser.add_argument("--db_dir", required=False, default=None)

crates/sequencing/papyrus_consensus/src/lib.rs line 116 at r2 (raw file):

{
    // Add a short delay to allow peers to connect and avoid "InsufficientPeers" error
    tokio::time::sleep(std::time::Duration::from_secs(5)).await;

Let's make this part of the consensus config.

Code quote:

    // Add a short delay to allow peers to connect and avoid "InsufficientPeers" error
    tokio::time::sleep(std::time::Duration::from_secs(5)).await;

@asmaastarkware asmaastarkware force-pushed the asmaa/consensus_use_same_db branch from a76a409 to b97d984 Compare July 23, 2024 10:42
Copy link
Contributor Author

@asmaastarkware asmaastarkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 6 files reviewed, 11 unresolved discussions (waiting on @matan-starkware)


crates/sequencing/papyrus_consensus/run_consensus.py line 139 at r1 (raw file):

Previously, matan-starkware wrote…

Why did you remove this?

Done.


crates/sequencing/papyrus_consensus/run_consensus.py line 149 at r1 (raw file):

Previously, matan-starkware wrote…

Have you tested this both succeeding and failing?

Yes, I have run simulations with the following cases:

  1. without db_dir.
  2. reuse an existing db_dir with a single simulation.
  3. used an invalid db_dir that either doesn't exist or doesn't have the required data dirs.
  4. ran multiple simulations simultaneously using the same db_dir.

crates/sequencing/papyrus_consensus/run_consensus.py line 165 at r1 (raw file):

    logs_dir = tempfile.mkdtemp()

    if db_dir:

Done.


crates/sequencing/papyrus_consensus/run_consensus.py line 166 at r1 (raw file):

Previously, matan-starkware wrote…

listdir triggers FileNotFoundError which I think is descriptive enough.

Done.


crates/sequencing/papyrus_consensus/run_consensus.py line 167 at r1 (raw file):

Previously, matan-starkware wrote…

Just create here as a set.

Done.


crates/sequencing/papyrus_consensus/run_consensus.py line 172 at r1 (raw file):

Previously, matan-starkware wrote…

This is not needed based on the expected_dirs check below.

Done.


crates/sequencing/papyrus_consensus/run_consensus.py line 176 at r1 (raw file):

Previously, matan-starkware wrote…

Create as set above

Done.


crates/sequencing/papyrus_consensus/run_consensus.py line 180 at r1 (raw file):

Previously, matan-starkware wrote…

We want to know that there are enough db folders. We don't care if there are also other folders.

Done.


crates/sequencing/papyrus_consensus/run_consensus.py line 187 at r1 (raw file):

Previously, matan-starkware wrote…

Using enter and exit to enable a with statement is a more pythonic way. Means you don't need to put the close explicitly here.

Done.


crates/sequencing/papyrus_consensus/run_consensus.py line 209 at r1 (raw file):

Previously, matan-starkware wrote…

Let's add a description:

Directory with existing DBs that this simulation can reuse.

Done.


crates/sequencing/papyrus_consensus/src/lib.rs line 116 at r2 (raw file):

Previously, matan-starkware wrote…

Let's make this part of the consensus config.

Done.

@matan-starkware matan-starkware self-requested a review July 23, 2024 10:49
Copy link
Contributor

@matan-starkware matan-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r3, 5 of 5 files at r4, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @asmaastarkware)


crates/sequencing/papyrus_consensus/run_consensus.py line 149 at r1 (raw file):

Previously, asmaastarkware (asmaa-starkware) wrote…

Yes, I have run simulations with the following cases:

  1. without db_dir.
  2. reuse an existing db_dir with a single simulation.
  3. used an invalid db_dir that either doesn't exist or doesn't have the required data dirs.
  4. ran multiple simulations simultaneously using the same db_dir.

Great.

Have you tried running multiple simulations simultaneously correctly (different db_dir)?


crates/sequencing/papyrus_consensus/run_consensus.py line 38 at r3 (raw file):

        command = f"curl -s -X GET http://localhost:{port}/monitoring/metrics | grep -oP 'papyrus_consensus_height \\K\\d+'"
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        # returns the most recently decided height, or None if node is not ready or consensus has not yet reached any height.

This line is too long.

Code quote:

        # returns the most recently decided height, or None if node is not ready or consensus has not yet reached any height.

crates/sequencing/papyrus_consensus/run_consensus.py line 186 at r3 (raw file):

        actual_dirs = {d for d in os.listdir(db_dir) if os.path.isdir(os.path.join(db_dir, d))}

        # Ensure the directories are named data0, data1, ..., data{num_validators - 1}

Let's just delete this and the empty line above. I think it's obvious enough from he code, and we can see the else right here where we originally created them.


crates/sequencing/papyrus_consensus/run_consensus.py line 203 at r3 (raw file):

        print(
            f"Output files will be stored in: {logs_dir} and data files will be stored in: {db_dir}"
        )

I think it's easier to read these on 2 separate lines (see how you feel actually running the script).

Suggestion:

        print(f"DB files will be stored in: {db_dir}")
        print(f"Logs will be stored in: {logs_dir}")

crates/sequencing/papyrus_consensus/src/config.rs line 27 at r4 (raw file):

    pub num_validators: u64,
    /// The delay in seconds before starting consensus to prevent InsufficientPeers error.
    pub consensus_delay: u64,

Can you put a Duration in a config?

Suggestion:

Duration

crates/sequencing/papyrus_consensus/src/config.rs line 61 at r4 (raw file):

                &self.consensus_delay,
                "The delay in seconds before starting consensus to prevent 'InsufficientPeers' \
                 error.",

Suggestion:

Delay (seconds) before starting consensus to give time for network peering.

crates/sequencing/papyrus_consensus/src/lib.rs line 109 at r4 (raw file):

    start_height: BlockNumber,
    validator_id: ValidatorId,
    consensus_delay: u64,

Suggestion:

Duration

@asmaastarkware asmaastarkware force-pushed the asmaa/consensus_use_same_db branch from b97d984 to 1d18575 Compare July 23, 2024 11:59
Copy link
Contributor Author

@asmaastarkware asmaastarkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @matan-starkware)


crates/sequencing/papyrus_consensus/run_consensus.py line 149 at r1 (raw file):

Previously, matan-starkware wrote…

Great.

Have you tried running multiple simulations simultaneously correctly (different db_dir)?

done


crates/sequencing/papyrus_consensus/run_consensus.py line 38 at r3 (raw file):

Previously, matan-starkware wrote…

This line is too long.

Done.


crates/sequencing/papyrus_consensus/run_consensus.py line 186 at r3 (raw file):

Previously, matan-starkware wrote…

Let's just delete this and the empty line above. I think it's obvious enough from he code, and we can see the else right here where we originally created them.

Done.


crates/sequencing/papyrus_consensus/run_consensus.py line 203 at r3 (raw file):

Previously, matan-starkware wrote…

I think it's easier to read these on 2 separate lines (see how you feel actually running the script).

Done.


crates/sequencing/papyrus_consensus/src/config.rs line 27 at r4 (raw file):

Previously, matan-starkware wrote…

Can you put a Duration in a config?

Done.


crates/sequencing/papyrus_consensus/src/config.rs line 61 at r4 (raw file):

                &self.consensus_delay,
                "The delay in seconds before starting consensus to prevent 'InsufficientPeers' \
                 error.",

Done.


crates/sequencing/papyrus_consensus/src/lib.rs line 109 at r4 (raw file):

    start_height: BlockNumber,
    validator_id: ValidatorId,
    consensus_delay: u64,

Done.

@asmaastarkware asmaastarkware force-pushed the asmaa/consensus_use_same_db branch from 1d18575 to 3841618 Compare July 23, 2024 12:02
@asmaastarkware asmaastarkware force-pushed the asmaa/consensus_use_same_db branch from 3841618 to 647e080 Compare July 23, 2024 12:05
Copy link
Contributor

@matan-starkware matan-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @asmaastarkware)

Copy link

There hasn't been any activity on this pull request recently, and in order to prioritize active work, it has been marked as stale.
This PR will be closed and locked in 7 days if no further activity occurs.
Thank you for your contributions!

@github-actions github-actions bot added the stale No activity for quite some time. label Aug 23, 2024
@github-actions github-actions bot closed this Aug 31, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Sep 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
stale No activity for quite some time.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants