Add replay verify on archive related workflows #15272

Merged
merged 2 commits into from
Nov 28, 2024
66 changes: 66 additions & 0 deletions .github/workflows/provision-replay-verify-archive-disks.yaml
@@ -0,0 +1,66 @@
name: "provision-replay-verify-archive-disks"
on:
  # Allow triggering manually
  workflow_dispatch:
    inputs:
      NETWORK:
        required: true
        type: choice
        description: The network to provision storage for. If not specified, it will provision snapshots for both testnet and mainnet.
        options: [testnet, mainnet, all]
        default: all
  pull_request:
    paths:
      - ".github/workflows/provision-replay-verify-archive-disks.yaml"
      - ".github/workflows/workflow-run-replay-verify-archive-storage-provision.yaml"
  schedule:
    - cron: "0 22 * * 1,3,5" # This runs every Mon,Wed,Fri

permissions:
  contents: read
  id-token: write # required for GCP Workload Identity Federation, which we use to log in to Google Artifact Registry
  issues: read
  pull-requests: read

# cancel redundant builds
concurrency:
  # cancel redundant builds on PRs (only on PR, not on branches)
  group: ${{ github.workflow }}-${{ (github.event_name == 'pull_request' && github.ref) || github.sha }}
  cancel-in-progress: true

jobs:
  determine-test-metadata:
    runs-on: ubuntu-latest
    steps:
      # checkout the repo first, so check-aptos-core can use it and cancel the workflow if necessary
      - uses: actions/checkout@v4
      - uses: ./.github/actions/check-aptos-core
        with:
          cancel-workflow: ${{ github.event_name == 'schedule' }} # Cancel the workflow if it is scheduled on a fork

      - name: Debug
        run: |
          echo "Event name: ${{ github.event_name }}"
          echo "Network: ${{ inputs.NETWORK }}"
  provision-testnet:
    if: |
      github.event_name == 'schedule' ||
      github.event_name == 'push' ||
      github.event_name == 'workflow_dispatch' && (inputs.NETWORK == 'testnet' || inputs.NETWORK == 'all')
    needs: determine-test-metadata
    uses: ./.github/workflows/workflow-run-replay-verify-archive-storage-provision.yaml
    secrets: inherit
    with:
      NETWORK: testnet

  provision-mainnet:
    if: |
      github.event_name == 'schedule' ||
      github.event_name == 'push' ||
      github.event_name == 'workflow_dispatch' && (inputs.NETWORK == 'mainnet' || inputs.NETWORK == 'all')
    needs: determine-test-metadata
    uses: ./.github/workflows/workflow-run-replay-verify-archive-storage-provision.yaml
    secrets: inherit
    with:
      NETWORK: mainnet
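
The job-level `if:` expressions above combine `||` and `&&`, so each one reads as "scheduled, or pushed, or manually dispatched for this network or for all". As a rough illustration only (not part of this PR), the Python sketch below mirrors that gating. The function name `should_provision` is hypothetical, and the sketch assumes that `&&` binds tighter than `||` and that each job is meant to match its own network name.

```python
def should_provision(job_network: str, event_name: str, network_input: str = "") -> bool:
    """Mirror the job-level `if:` expression for one provisioning job.

    job_network is the network the job provisions ("testnet" or "mainnet");
    network_input is the NETWORK choice of a manual dispatch.
    """
    if event_name in ("schedule", "push"):
        return True
    return event_name == "workflow_dispatch" and network_input in (job_network, "all")


# Show which provisioning jobs would run for a few trigger combinations.
for event, choice in [
    ("schedule", ""),
    ("workflow_dispatch", "testnet"),
    ("workflow_dispatch", "all"),
    ("pull_request", ""),
]:
    jobs = [n for n in ("testnet", "mainnet") if should_provision(n, event, choice)]
    print(event, choice or "-", jobs)
```

Under these assumptions a `pull_request` event starts neither provisioning job; only `determine-test-metadata` runs.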
85 changes: 85 additions & 0 deletions .github/workflows/replay-verify-on-archive.yaml
@@ -0,0 +1,85 @@
# This defines a workflow to replay transactions on the given chain with the latest aptos node software.
# In order to trigger it go to the Actions Tab of the Repo, click "replay-verify-on-archive" and then "Run Workflow".
#
# On PR, a single test case will run. On workflow_dispatch, you may specify the NETWORK to verify.

name: "replay-verify-on-archive"
on:
  # Allow triggering manually
  workflow_dispatch:
    inputs:
      NETWORK:
        required: true
        type: choice
        options: [testnet, mainnet, all]
        default: all
        description: The chain name to test. If not specified, it will test both testnet and mainnet.
      IMAGE_TAG:
        required: false
        type: string
        description: The image tag of the feature branch to test. If not specified, it will use the latest commit on the current branch.
      START_VERSION:
        required: false
        type: string
        description: Optional version to start replaying. If not specified, replay-verify will determine the start version itself.
      END_VERSION:
        required: false
        type: string
        description: Optional version to end replaying. If not specified, replay-verify will determine the end version itself.
  pull_request:
    paths:
      - ".github/workflows/replay-verify-on-archive.yaml"
      - ".github/workflows/workflow-run-replay-verify-on-archive.yaml"
  schedule:
    - cron: "0 22 * * 0,2,4" # The main branch cadence. This runs every Sun,Tues,Thurs

permissions:
  contents: read
  id-token: write # required for GCP Workload Identity Federation, which we use to log in to Google Artifact Registry
  issues: read
  pull-requests: read

# cancel redundant builds
concurrency:
  # cancel redundant builds on PRs (only on PR, not on branches)
  group: ${{ github.workflow }}-${{ (github.event_name == 'pull_request' && github.ref) || github.sha }}
  cancel-in-progress: true

jobs:
  determine-test-metadata:
    runs-on: ubuntu-latest-32-core
    steps:
      # checkout the repo first, so check-aptos-core can use it and cancel the workflow if necessary
      - uses: actions/checkout@v4
      - uses: ./.github/actions/check-aptos-core
        with:
          cancel-workflow: ${{ github.event_name == 'schedule' }} # Cancel the workflow if it is scheduled on a fork

  replay-testnet:
    if: |
      github.event_name == 'schedule' ||
      github.event_name == 'push' ||
      github.event_name == 'workflow_dispatch' && (inputs.NETWORK == 'testnet' || inputs.NETWORK == 'all')
    needs: determine-test-metadata
    uses: ./.github/workflows/workflow-run-replay-verify-on-archive.yaml
    secrets: inherit
    with:
      NETWORK: "testnet"
      IMAGE_TAG: ${{ inputs.IMAGE_TAG }}
      START_VERSION: ${{ inputs.START_VERSION }}
      END_VERSION: ${{ inputs.END_VERSION }}

  replay-mainnet:
    if: |
      github.event_name == 'schedule' ||
      github.event_name == 'push' ||
      github.event_name == 'pull_request' ||
      github.event_name == 'workflow_dispatch' && (inputs.NETWORK == 'mainnet' || inputs.NETWORK == 'all')
    needs: determine-test-metadata
    uses: ./.github/workflows/workflow-run-replay-verify-on-archive.yaml
    secrets: inherit
    with:
      NETWORK: "mainnet"
      IMAGE_TAG: ${{ inputs.IMAGE_TAG }}
      START_VERSION: ${{ inputs.START_VERSION }}
      END_VERSION: ${{ inputs.END_VERSION }}
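
The two scheduled workflows in this PR use the cron expressions `0 22 * * 1,3,5` (disk provisioning) and `0 22 * * 0,2,4` (replay verify). As a small illustration, the Python sketch below decodes the day-of-week field of each expression to show that the two schedules never fall on the same day. The helper `cron_days` is hypothetical and only handles a comma-separated list of weekday numbers, which is all these two expressions use.

```python
DAY_NAMES = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


def cron_days(expr: str) -> list[str]:
    """Return weekday names for a five-field cron expression whose
    day-of-week field is a comma-separated list of numbers (0 = Sunday)."""
    _minute, _hour, _day_of_month, _month, day_of_week = expr.split()
    return [DAY_NAMES[int(d) % 7] for d in day_of_week.split(",")]


provision_days = cron_days("0 22 * * 1,3,5")
replay_days = cron_days("0 22 * * 0,2,4")

print("provision:", provision_days)
print("replay:   ", replay_days)
print("overlap:  ", sorted(set(provision_days) & set(replay_days)))
```

GitHub evaluates `schedule` cron expressions in UTC, so both workflows start at 22:00 UTC on their respective days.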
61 changes: 61 additions & 0 deletions .github/workflows/workflow-run-replay-verify-archive-storage-provision.yaml
@@ -0,0 +1,61 @@
name: "*run archive storage provision workflow"

on:
  # This allows the workflow to be triggered from another workflow
  workflow_call:
    inputs:
      NETWORK:
        required: true
        type: string
        description: The network to provision storage for.
  workflow_dispatch:
    inputs:
      NETWORK:
        description: The network to provision storage for.
        type: string
        required: true
jobs:
  provision:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event.inputs.BRANCH || 'add_replay_verify_workflow' }}

      # Authenticate to Google Cloud; the project is aptos-ci
      - name: Authenticate to Google Cloud
        id: auth
        uses: "google-github-actions/auth@v2"
        with:
          workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
          service_account: ${{ secrets.GCP_SERVICE_ACCOUNT_EMAIL }}
          export_environment_variables: false
          create_credentials_file: true

      # This is required since we need to switch from aptos-ci to aptos-devinfra-0
      - name: Setup Credentials
        run: |
          echo "GOOGLE_APPLICATION_CREDENTIALS=${{ steps.auth.outputs.credentials_file_path }}" >> $GITHUB_ENV
          echo "CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE=${{ steps.auth.outputs.credentials_file_path }}" >> $GITHUB_ENV
          echo "GOOGLE_GHA_CREDS_PATH=${{ steps.auth.outputs.credentials_file_path }}" >> $GITHUB_ENV
          echo "CLOUDSDK_AUTH_ACCESS_TOKEN=${{ steps.auth.outputs.access_token }}" >> $GITHUB_ENV
      - name: Set up Cloud SDK
        uses: "google-github-actions/setup-gcloud@v2"
        with:
          install_components: "kubectl, gke-gcloud-auth-plugin"

      - name: "Setup GCloud Project"
        shell: bash
        run: gcloud config set project aptos-devinfra-0

      - uses: ./.github/actions/python-setup
        with:
          pyproject_directory: testsuite/replay-verify

      - name: "Provision Storage"
        env:
          GOOGLE_CLOUD_PROJECT: aptos-devinfra-0
        run: cd testsuite/replay-verify && poetry run python archive_disk_utils.py --network ${{ inputs.NETWORK }}

119 changes: 119 additions & 0 deletions .github/workflows/workflow-run-replay-verify-on-archive.yaml
@@ -0,0 +1,119 @@
name: "*run replay-verify on archive reusable workflow"

on:
  # This allows the workflow to be triggered from another workflow
  workflow_call:
    inputs:
      NETWORK:
        required: true
        type: string
        description: The network to run replay verify on.
      IMAGE_TAG:
        required: false
        type: string
        description: The image tag of the feature branch to test. If not specified, it will use the latest commit on the current branch.
      START_VERSION:
        required: false
        type: string
        description: Optional version to start replaying. If not specified, replay-verify will determine the start version itself.
      END_VERSION:
        required: false
        type: string
        description: Optional version to end replaying. If not specified, replay-verify will determine the end version itself.

  workflow_dispatch:
    inputs:
      NETWORK:
        required: true
        type: string
        description: The network to run replay verify on.
      IMAGE_TAG:
        required: false
        type: string
        description: The image tag of the feature branch to test. If not specified, it will use the latest commit on the current branch.
      START_VERSION:
        required: false
        type: string
        description: The history start to use for the backup. If not specified, it will use the default history start.
      END_VERSION:
        required: false
        type: string
        description: The end version to use for the backup. If not specified, it will use the latest version.
jobs:
  run-replay-verify:
    runs-on: ubuntu-latest-32-core
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event.inputs.BRANCH || 'add_replay_verify_workflow' }}

      - uses: aptos-labs/aptos-core/.github/actions/docker-setup@main
        id: docker-setup
        with:
          GCP_WORKLOAD_IDENTITY_PROVIDER: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
          GCP_SERVICE_ACCOUNT_EMAIL: ${{ secrets.GCP_SERVICE_ACCOUNT_EMAIL }}
          EXPORT_GCP_PROJECT_VARIABLES: "false"
          GIT_CREDENTIALS: ${{ secrets.GIT_CREDENTIALS }}

      # Authenticate to Google Cloud (project aptos-ci) and generate a credentials file
      - name: Authenticate to Google Cloud
        id: auth
        uses: "google-github-actions/auth@v2"
        with:
          workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
          service_account: ${{ secrets.GCP_SERVICE_ACCOUNT_EMAIL }}
          export_environment_variables: false
          create_credentials_file: true

      # This is required since we need to switch from aptos-ci to aptos-devinfra-0
      - name: Setup credentials
        run: |
          echo "GOOGLE_APPLICATION_CREDENTIALS=${{ steps.auth.outputs.credentials_file_path }}" >> $GITHUB_ENV
          echo "CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE=${{ steps.auth.outputs.credentials_file_path }}" >> $GITHUB_ENV
          echo "GOOGLE_GHA_CREDS_PATH=${{ steps.auth.outputs.credentials_file_path }}" >> $GITHUB_ENV
          echo "CLOUDSDK_AUTH_ACCESS_TOKEN=${{ steps.auth.outputs.access_token }}" >> $GITHUB_ENV
      - name: Set up Cloud SDK
        uses: "google-github-actions/setup-gcloud@v2"
        with:
          install_components: "kubectl, gke-gcloud-auth-plugin"

      - name: "Setup GCloud project"
        shell: bash
        run: gcloud config set project aptos-devinfra-0

      - uses: ./.github/actions/python-setup
        with:
          pyproject_directory: testsuite/replay-verify

      - name: Schedule replay verify
        env:
          GOOGLE_CLOUD_PROJECT: aptos-devinfra-0
        run: |
          cd testsuite/replay-verify
          CMD="poetry run python main.py --network ${{ inputs.NETWORK }}"
          if [ -n "${{ inputs.START_VERSION }}" ]; then
            CMD="$CMD --start ${{ inputs.START_VERSION }}"
          fi
          if [ -n "${{ inputs.END_VERSION }}" ]; then
            CMD="$CMD --end ${{ inputs.END_VERSION }}"
          fi
          if [ -n "${{ inputs.IMAGE_TAG }}" ]; then
            # The image tag must not reuse --end; this assumes main.py accepts an --image_tag flag
            CMD="$CMD --image_tag ${{ inputs.IMAGE_TAG }}"
          fi
          eval $CMD
      # If the user manually cancels the step above, we still want to clean up the resources
      - name: Post-run cleanup
        env:
          GOOGLE_CLOUD_PROJECT: aptos-devinfra-0
        if: ${{ always() }}
        run: |
          cd testsuite/replay-verify
          poetry run python main.py --network ${{ inputs.NETWORK }} --cleanup
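
The "Schedule replay verify" step builds its command by string concatenation and runs it with `eval`. As an illustration of the same optional-flag logic, here is a minimal Python sketch that builds an argument list instead, which avoids re-parsing the inputs through a shell. The function name `build_replay_command` is hypothetical, and the `--image_tag` flag name is an assumption about `main.py`'s interface, which this page does not show.

```python
import shlex


def build_replay_command(network, start_version="", end_version="", image_tag=""):
    """Build the argv list for main.py, adding only the flags that were provided."""
    cmd = ["poetry", "run", "python", "main.py", "--network", network]
    if start_version:
        cmd += ["--start", start_version]
    if end_version:
        cmd += ["--end", end_version]
    if image_tag:
        # Assumption: the flag name --image_tag is not confirmed by this page.
        cmd += ["--image_tag", image_tag]
    return cmd


cmd = build_replay_command("testnet", start_version="100", end_version="200")
# shlex.join renders the list as a shell-safe string for logging.
print(shlex.join(cmd))
```

A list like this can be passed directly to `subprocess.run(cmd, check=True)`, so no `eval` is needed.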
13 changes: 7 additions & 6 deletions storage/db-tool/src/replay_on_archive.rs
@@ -204,6 +204,13 @@ impl Verifier {
        let mut expected_txn_infos = Vec::new();
        let mut chunk_start_version = start;
        for (idx, item) in txn_iter.enumerate() {
            // timeout check
            if let Some(duration) = self.timeout_secs {
                if self.replay_stat.get_elapsed_secs() >= duration {
                    return Ok(total_failed_txns);
Contributor

So a timeout is treated as an Ok, not an error?

Contributor Author

The timeout here forces the worker to stop replaying and return the existing results immediately. Returning an error would discard all the results, which would be a waste.

Contributor

Why not use something like tokio::timeout here instead of making our own timeout?

If we don't enforce a timeout on a future using tokio (which also has its own caveats), then it seems like this might never actually time out here?

Contributor Author

The process will stop because it runs in small batches, and timeout checks occur per batch. I wrote this solution to address the issue of handling some long-running ranges, such as graffio transactions. It's not crucial for me that it stops exactly at the timeout; a few minutes later is acceptable. My main goal is to save the results from whatever has been replayed so that we can have partial results of these transactions.
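
To make the pattern discussed in this thread concrete, here is a minimal Python sketch (not the Rust code from the PR) of a cooperative timeout: the loop checks the elapsed time before each batch and, once the budget is spent, returns the failures counted so far instead of raising an error. All names here (`replay_with_timeout`, `replay_chunk`, `clock`) are hypothetical.

```python
import time


def replay_with_timeout(chunks, replay_chunk, timeout_secs=None, clock=time.monotonic):
    """Replay chunks in order and return the number of failed transactions.

    The deadline is only checked between chunks, so a run can overshoot the
    timeout by up to one chunk's duration, but partial results are kept.
    """
    start = clock()
    total_failed = 0
    for chunk in chunks:
        if timeout_secs is not None and clock() - start >= timeout_secs:
            # Stop early and report what has been replayed so far.
            return total_failed
        total_failed += replay_chunk(chunk)
    return total_failed


# Demo with a fake clock that advances one "second" per call,
# and a replay function that reports one failure per chunk.
ticks = iter(range(100))
failed = replay_with_timeout(
    chunks=[1, 2, 3, 4, 5],
    replay_chunk=lambda chunk: 1,
    timeout_secs=3,
    clock=lambda: next(ticks),
)
print(failed)  # 2: the third check sees 3 elapsed seconds and stops
```

The trade-off matches the one described above: the stop is not exact, but nothing already replayed is thrown away.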

                }
            }

            let (input_txn, expected_txn_info, expected_event, expected_writeset) = item?;
            let is_epoch_ending = expected_event.iter().any(ContractEvent::is_new_epoch_event);
            cur_txns.push(input_txn);
@@ -224,12 +231,6 @@ impl Verifier {
                self.replay_stat.update_cnt(cur_txns.len() as u64);
                self.replay_stat.print_tps();

                if let Some(duration) = self.timeout_secs {
                    if self.replay_stat.get_elapsed_secs() >= duration {
                        return Ok(total_failed_txns);
                    }
                }

                // empty for the new chunk
                chunk_start_version = start + (idx as u64) + 1;
                cur_txns.clear();
8 changes: 8 additions & 0 deletions testsuite/replay-verify/__init__.py
@@ -0,0 +1,8 @@
import os
import sys


path = os.path.dirname(__file__)

if path not in sys.path:
    sys.path.append(path)
20 changes: 20 additions & 0 deletions testsuite/replay-verify/archive-pvc-template.yaml
@@ -0,0 +1,20 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    volume.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
  name: testnet-archive-claim
  labels:
    run: some-label
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 10Ti
  storageClassName: ssd-data-xfs
  volumeMode: Filesystem
  dataSourceRef:
    name: testnet-archive
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
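
The template above hard-codes testnet names and a placeholder label, which suggests that the values are substituted per network before the claim is created. How the PR's Python code does this is not shown on this page, so the following is only a sketch of one way to render such a template. The function `render_pvc` and the naming convention `<network>-archive` / `<network>-archive-claim` are assumptions extrapolated from the testnet values.

```python
import copy
import json

# The template from archive-pvc-template.yaml, expressed as a Python dict.
PVC_TEMPLATE = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {
        "annotations": {
            "volume.kubernetes.io/storage-provisioner": "pd.csi.storage.gke.io"
        },
        "name": "testnet-archive-claim",
        "labels": {"run": "some-label"},
    },
    "spec": {
        "accessModes": ["ReadOnlyMany"],
        "resources": {"requests": {"storage": "10Ti"}},
        "storageClassName": "ssd-data-xfs",
        "volumeMode": "Filesystem",
        "dataSourceRef": {
            "name": "testnet-archive",
            "kind": "VolumeSnapshot",
            "apiGroup": "snapshot.storage.k8s.io",
        },
    },
}


def render_pvc(network: str, run_label: str) -> dict:
    """Return a copy of the template with per-network names filled in.

    Assumption: claims and snapshots are named after the network.
    """
    pvc = copy.deepcopy(PVC_TEMPLATE)
    pvc["metadata"]["name"] = f"{network}-archive-claim"
    pvc["metadata"]["labels"]["run"] = run_label
    pvc["spec"]["dataSourceRef"]["name"] = f"{network}-archive"
    return pvc


print(json.dumps(render_pvc("mainnet", "replay-verify-demo"), indent=2))
```

Copying the template before editing keeps the original untouched, so the same template can be rendered for several networks in one run.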