Failed to find SST files during compaction #2746
Comments
We can list all incomplete multipart uploads via awscli, and this missing file is marked as "incomplete":
{
"UploadId": "[...]",
"Key": "cluster-prod1-2/data/[...]/public/1205/1205_0000000000/0ad457b9-9f21-4593-9d1f-dd1b968e1813.parquet",
"Initiated": "2023-11-08T12:47:34+00:00",
"StorageClass": "STANDARD",
"Owner": {
"DisplayName": "[...]",
"ID": "[...]"
},
"Initiator": {
"ID": "[...]",
"DisplayName": "[...]"
}
} |
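For anyone repeating this check, here is a minimal sketch of filtering such a listing for one SST file. It assumes the listing was saved from `aws s3api list-multipart-uploads`, whose JSON output keeps the entries under an `Uploads` array; the helper name `incomplete_uploads_for` is made up for illustration, and elided values stay as `[...]`.

```python
import json

# Sample shaped like the output of `aws s3api list-multipart-uploads`.
# Elided values are kept as "[...]", exactly as in the report above.
listing = json.loads("""
{
  "Uploads": [
    {
      "UploadId": "[...]",
      "Key": "cluster-prod1-2/data/[...]/public/1205/1205_0000000000/0ad457b9-9f21-4593-9d1f-dd1b968e1813.parquet",
      "Initiated": "2023-11-08T12:47:34+00:00",
      "StorageClass": "STANDARD"
    }
  ]
}
""")


def incomplete_uploads_for(listing, file_id):
    """Return incomplete multipart uploads whose key ends with `<file_id>.parquet`.

    Hypothetical helper, not part of awscli or GreptimeDB.
    """
    suffix = f"/{file_id}.parquet"
    # "Uploads" may be absent when nothing is pending, so default to an empty list.
    return [u for u in listing.get("Uploads", []) if u["Key"].endswith(suffix)]


matches = incomplete_uploads_for(listing, "0ad457b9-9f21-4593-9d1f-dd1b968e1813")
for upload in matches:
    print(upload["Initiated"], upload["Key"])
```

A file ID that shows up here but not as a regular object was started as a multipart upload and never completed.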
Added a flag to ignore those files. We can enable this to skip incorrect manifests and remove it after all manifests are fixed. |
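To illustrate the idea only (this is not the actual GreptimeDB code), a rough sketch of what such a flag could do when picking compaction inputs. The names `select_compaction_inputs` and `ignore_missing_files` are hypothetical, as is the shape of the manifest entries.

```python
def select_compaction_inputs(manifest_files, existing_keys, ignore_missing_files=False):
    """Split manifest entries into files that exist in storage and files that do not.

    Hypothetical sketch: with the flag off, a missing file is a hard error;
    with the flag on, missing files are skipped and reported to the caller.
    """
    present, missing = [], []
    for entry in manifest_files:
        (present if entry["key"] in existing_keys else missing).append(entry)
    if missing and not ignore_missing_files:
        raise FileNotFoundError(
            "missing SST files: " + ", ".join(e["key"] for e in missing)
        )
    return present, missing


# Illustrative data only; the keys are placeholders, not real object paths.
manifest_files = [{"key": "a.parquet"}, {"key": "b.parquet"}]
existing_keys = {"a.parquet"}

present, missing = select_compaction_inputs(
    manifest_files, existing_keys, ignore_missing_files=True
)
print([e["key"] for e in present], [e["key"] for e in missing])
```

Returning the skipped entries separately keeps a record of which manifests still need fixing before the flag is turned off again.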
As per AWS ticket response:
The total bytes uploaded to S3 were 4194304 * 80 + 29508 = 335,573,828, which matches the file size in the manifest:
"files_to_add": [
{
"region_id": 5175435591680,
"file_id": "0ad457b9-9f21-4593-9d1f-dd1b968e1813",
"time_range": [
{
"value": 1698049810950,
"unit": "Millisecond"
},
{
"value": 1699447600950,
"unit": "Millisecond"
}
],
"level": 1,
"file_size": 335573828
}
],
We now suspect the following causes:
|
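The byte count from the AWS response can be re-checked with a few lines. This sketch assumes that "4194304*80+29508" means 80 parts of 4194304 bytes (4 MiB) followed by one final part of 29508 bytes; the variable names are only for illustration.

```python
PART_SIZE = 4194304            # 4 MiB, the part size quoted in the AWS response
FULL_PARTS = 80                # assumption: 80 parts of PART_SIZE bytes each
LAST_PART = 29508              # assumption: one smaller trailing part
MANIFEST_FILE_SIZE = 335573828  # "file_size" from the manifest entry above

# Sum the individual part sizes the way S3 would when assembling the object.
part_sizes = [PART_SIZE] * FULL_PARTS + [LAST_PART]
total_uploaded = sum(part_sizes)

print(total_uploaded, total_uploaded == MANIFEST_FILE_SIZE)  # 335573828 True
```

So every byte of the file reached S3 as parts; what is missing is the completed object itself.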
Another file-not-found error was reported in #3633, but that one was caused by a bug. |
It looks like this has not happened again. |
What type of bug is this?
Data corruption
What subsystems are affected?
Datanode
Description
When using S3 as storage, a datanode reports that it failed to compact a region because it cannot find some of the input files.
Failed to compact region: 5905580032000(1375, 0) err=0: OpenDAL operator failed, at greptimedb/src/mito2/src/sst/parquet/reader.rs:126:14 1: NotFound (persistent) at => File not found: data/[...]/public/1375/1375_0000000000/62a9180f-ff63-4137-9d5e-6eae7ae5e178.parquet
The missing file was created by another compaction, which may still be affected by this EntityTooSmall issue. In the normal control flow, however, the SST file is uploaded to S3 before the manifest is updated, so a successfully updated manifest implies that the upload had already finished. So the problem is:
In summary, this missing SST file might never have been uploaded successfully, while for some reason the datanode mistakenly thought the upload was done and updated the manifest.
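The ordering described above can be sketched with an in-memory stand-in for S3, where uploaded parts stay invisible until the multipart upload is completed. Everything here (`FakeObjectStore`, `commit_sst`, the example key) is hypothetical and only models the invariant; it is not how mito2 or OpenDAL implement it.

```python
class FakeObjectStore:
    """In-memory stand-in for S3: parts are not readable until the upload completes."""

    def __init__(self):
        self.objects = {}   # completed objects, key -> bytes
        self.pending = {}   # incomplete multipart uploads, key -> list of parts

    def upload_part(self, key, data):
        self.pending.setdefault(key, []).append(data)

    def complete_upload(self, key):
        self.objects[key] = b"".join(self.pending.pop(key))

    def size(self, key):
        """Return the object size, or None if no completed object exists."""
        obj = self.objects.get(key)
        return None if obj is None else len(obj)


def commit_sst(store, manifest, key, expected_size):
    """Record the SST in the manifest only if the object is really readable."""
    actual = store.size(key)
    if actual != expected_size:
        raise RuntimeError(f"{key}: expected {expected_size} bytes, found {actual}")
    manifest.append({"key": key, "file_size": expected_size})


store = FakeObjectStore()
manifest = []
key = "example/0000.parquet"  # hypothetical key, for illustration only
store.upload_part(key, b"a" * 8)
store.upload_part(key, b"b" * 4)

try:
    # All bytes were sent, but the upload was never completed.
    commit_sst(store, manifest, key, expected_size=12)
except RuntimeError as err:
    print("refused:", err)

store.complete_upload(key)
commit_sst(store, manifest, key, expected_size=12)
print(manifest)
```

The first `commit_sst` call is the situation suspected in this issue: all parts were sent, the object does not exist, and a manifest update at that point would reference a file that later reads as NotFound.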
TODO