Failed to find SST files during compaction #2746

Closed
1 task
v0y4g3r opened this issue Nov 14, 2023 · 5 comments
Labels
C-bug Category Bugs

Comments

@v0y4g3r
Contributor

v0y4g3r commented Nov 14, 2023

What type of bug is this?

Data corruption

What subsystems are affected?

Datanode

Description

When using S3 as storage, the datanode complains that it failed to compact a region because it cannot find some of the input files.

Failed to compact region: 5905580032000(1375, 0) err=0: OpenDAL operator failed, at greptimedb/src/mito2/src/sst/parquet/reader.rs:126:14
1: NotFound (persistent) at  => File not found: data/[...]/public/1375/1375_0000000000/62a9180f-ff63-4137-9d5e-6eae7ae5e178.parquet

The missing file was created by another compaction, which may still be affected by this EntityTooSmall issue. But in the normal control flow, the SST file is uploaded to S3 before the manifest is updated, so a successfully updated manifest implies the file already exists on S3. What we observe instead is:

  • GreptimeDB believes the SST file has been uploaded, and there are no error logs about this SST file.
  • The manifest file has been updated.
  • There is no sign that this SST file was ever deleted.
  • No version of this SST file is listed in the S3 console.

In summary, the missing SST file was probably never uploaded successfully, yet for some reason the datanode mistakenly thought the upload was done and updated the manifest.

TODO

@v0y4g3r
Contributor Author

v0y4g3r commented Nov 14, 2023

We can list all incomplete multipart uploads via awscli, and a missing file shows up there as "incomplete":

{
  "UploadId": "[...]",
  "Key": "cluster-prod1-2/data/[...]/public/1205/1205_0000000000/0ad457b9-9f21-4593-9d1f-dd1b968e1813.parquet",
  "Initiated": "2023-11-08T12:47:34+00:00",
  "StorageClass": "STANDARD",
  "Owner": {
    "DisplayName": "[...]",
    "ID": "[...]"
  },
  "Initiator": {
    "ID": "[...]",
    "DisplayName": "[...]"
  }
}

@evenyag
Contributor

evenyag commented Nov 14, 2023

Added a flag to ignore those files in #2745.

We can enable it to skip incorrect manifests, and remove it after all manifests are fixed.

@v0y4g3r
Contributor Author

v0y4g3r commented Nov 15, 2023

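As per the AWS ticket response (translated from Chinese):

Using internal tools, I can see that you initiated a multipart upload at Nov 08 12:47:33 and uploaded about 81 parts, each of about 4194304 bytes (the last one 29508).

The total number of bytes uploaded to S3 was 4194304*80+29508=335,573,828, which is the same as the file size in the manifest: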

        "files_to_add": [
          {
            "region_id": 5175435591680,
            "file_id": "0ad457b9-9f21-4593-9d1f-dd1b968e1813",
            "time_range": [
              {
                "value": 1698049810950,
                "unit": "Millisecond"
              },
              {
                "value": 1699447600950,
                "unit": "Millisecond"
              }
            ],
            "level": 1,
            "file_size": 335573828
          }
        ],
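The arithmetic above can be checked with a small helper. This is only an illustrative sketch: the function name `multipart_total_bytes` is hypothetical and not part of GreptimeDB or OpenDAL, and it assumes every part except the last has the same size, as the AWS response suggests.

```rust
/// Total size in bytes of a multipart upload in which every part has
/// `part_size` bytes except the final one, which has `last_part_size` bytes.
/// Returns `None` if there are no parts or if the sum overflows `u64`.
/// (Hypothetical helper for illustration only.)
fn multipart_total_bytes(part_count: u64, part_size: u64, last_part_size: u64) -> Option<u64> {
    if part_count == 0 {
        return None;
    }
    // All parts but the last are full-sized.
    let full_parts = part_count - 1;
    full_parts
        .checked_mul(part_size)?
        .checked_add(last_part_size)
}
```

With the numbers from the ticket, `multipart_total_bytes(81, 4194304, 29508)` yields `Some(335573828)`, matching `file_size` in the manifest entry.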

We now suspect the following causes:

  • The datanode did not invoke "complete multipart upload", which is wrapped inside the shutdown method of opendal's writer.
  • The datanode invoked "complete multipart upload" and the call actually failed, but the datanode falsely considered it a success and proceeded to update the manifest.
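In both suspected cases the object never becomes visible on S3, because a multipart upload only produces an object once it has been completed. One possible defensive check, sketched below, is to confirm that the object exists with the expected size before the manifest is updated. This is not what GreptimeDB does and not OpenDAL's API: the `ObjectStat` trait, `InMemoryStore`, `VerifyError`, and `verify_before_manifest_update` are all hypothetical names, and the in-memory store only stands in for a real metadata lookup.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an object store's metadata lookup.
/// This is NOT OpenDAL's API; it only models "is the object visible, and how big is it".
trait ObjectStat {
    /// Returns the size of a visible (fully committed) object, or `None` if absent.
    fn stat_size(&self, key: &str) -> Option<u64>;
}

/// In-memory mock used for illustration: maps object keys to sizes in bytes.
struct InMemoryStore {
    objects: HashMap<String, u64>,
}

impl ObjectStat for InMemoryStore {
    fn stat_size(&self, key: &str) -> Option<u64> {
        self.objects.get(key).copied()
    }
}

#[derive(Debug, PartialEq)]
enum VerifyError {
    /// The object is not visible, e.g. the multipart upload was never completed.
    Missing { key: String },
    /// The object exists but its size differs from what the manifest would record.
    SizeMismatch { key: String, expected: u64, actual: u64 },
}

/// Checks that `key` exists with `expected_size` bytes.
/// The caller is assumed to update the manifest only when this returns `Ok(())`.
fn verify_before_manifest_update<S: ObjectStat>(
    store: &S,
    key: &str,
    expected_size: u64,
) -> Result<(), VerifyError> {
    match store.stat_size(key) {
        None => Err(VerifyError::Missing { key: key.to_string() }),
        Some(actual) if actual != expected_size => Err(VerifyError::SizeMismatch {
            key: key.to_string(),
            expected: expected_size,
            actual,
        }),
        Some(_) => Ok(()),
    }
}
```

Such a check costs an extra metadata request per SST file, so whether it is worth adding depends on how often the upload path can fail silently.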

@evenyag
Contributor

evenyag commented Apr 3, 2024

Another "file not found" error was reported in #3633, but that one was caused by a bug.

@killme2008
Contributor

It looks like this has not happened again.

@github-project-automation github-project-automation bot moved this from Todo to Done in mito2 Jun 25, 2024