Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compacted SST file reappears in region #2013

Closed
v0y4g3r opened this issue Jul 21, 2023 · 4 comments · Fixed by #2020
Closed

Compacted SST file reappears in region #2013

v0y4g3r opened this issue Jul 21, 2023 · 4 comments · Fixed by #2020
Labels
C-bug Category Bugs

Comments

@v0y4g3r
Copy link
Contributor

v0y4g3r commented Jul 21, 2023

What type of bug is this?

Unexpected error

What subsystems are affected?

Datanode

What happened?

pod_name:greptimedb-frontend-5db6cc4db4-2sgpm
pod_namespace:greptimedb
message:2023-07-14T08:35:21.681888Z ERROR client::database: Failed to do Flight get, addr: greptimedb-datanode-0.greptimedb-datanode.greptimedb:4001, code: Internal error, source: Failed to do Flight get, code: Internal error, source: NotFound (persistent) at read, context: { response: Parts { status: 404, version: HTTP/1.1, headers: {"x-amz-request-id": "[...]", "x-amz-id-2": "[...]", "x-amz-delete-marker": "true", "x-amz-version-id": "[...]", "content-type": "application/xml", "transfer-encoding": "chunked", "date": "Fri, 14 Jul 2023 08:35:21 GMT", "server": "AmazonS3"} }, service: s3, path: data/[...]/public/1968/1968_0000000000/4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a.parquet, range: 0- } => S3Error { code: "NoSuchKey", message: "The specified key does not exist.", resource: "", request_id: "RG8E0ZM8E94P029B" }

Datanode logs:

root@greptimedb-datanode-0:/# grep -rF 4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a 


# The SST file [4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a ] was created
/tmp/greptimedb/logs/greptime-datanode.2023-07-08-14:{"v":0,"name":"greptime-datanode","msg":"Successfully flush memtables, region:8452495638528, files: [FileId(4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a)]","level":30,"hostname":"greptimedb-datanode-0","pid":1,"time":"2023-07-08T14:47:31.85146213Z","target":"storage::flush","line":305,"file":"greptimedb/src/storage/src/flush.rs"}


# [4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a] being compacted
/tmp/greptimedb/logs/greptime-datanode.2023-07-08-19:{"v":0,"name":"greptime-datanode","msg":"Compacting SST files, input: [FileId(101a7479-4666-4bc9-9996-f709e5352eb4), FileId(92c5b2c3-4001-48be-b9b0-fe40ac7a4327), FileId(b2f0e5fc-c798-4e90-ae85-aaa22e3a18bc), FileId(4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a), FileId(9a2562e6-df9d-4d85-b363-199b15445a8e), FileId(f8e8a2e6-6894-419c-8ab7-6bb24d6ed5d6), FileId(ce692d00-7434-44ae-a144-091abfefb96c), FileId(13f0465a-f3b2-4db1-81a0-d4d48d0a29e2), FileId(680d04ff-8e53-4bc0-a8fa-cfc09e024db6)], output: [FileId(a5511947-1946-4711-86f4-6ef56051d3d5)], window: Some(604800)","level":30,"hostname":"greptimedb-datanode-0","pid":1,"time":"2023-07-08T19:38:57.200354659Z","target":"storage::compaction::task","line":154,"file":"greptimedb/src/storage/src/compaction/task.rs"}


# [4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a] was deleted (file purge task was submitted)
/tmp/greptimedb/logs/greptime-datanode.2023-07-08-19:{"v":0,"name":"greptime-datanode","msg":"Submitted task: \"8452495638528/4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a\"","level":30,"hostname":"greptimedb-datanode-0","pid":1,"time":"2023-07-08T19:38:57.306224808Z","target":"storage::scheduler","line":267,"file":"greptimedb/src/storage/src/scheduler.rs"}


# And after a week, file [4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a] reappeared in regions's file list and the region found it can be compacted
/tmp/greptimedb/logs/greptime-datanode.2023-07-14-02:{"v":0,"name":"greptime-datanode","msg":"Compaction output [13f0465a-f3b2-4db1-81a0-d4d48d0a29e2,ce692d00-7434-44ae-a144-091abfefb96c,680d04ff-8e53-4bc0-a8fa-cfc09e024db6,101a7479-4666-4bc9-9996-f709e5352eb4,f8e8a2e6-6894-419c-8ab7-6bb24d6ed5d6,4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a,9a2562e6-df9d-4d85-b363-199b15445a8e,b2f0e5fc-c798-4e90-ae85-aaa22e3a18bc]-> 3476ecbc-9e90-46f8-940e-aa36d3deb1ad","level":30,"hostname":"greptimedb-datanode-0","pid":1,"time":"2023-07-14T02:49:08.761007994Z","target":"storage::compaction::task","line":84,"file":"greptimedb/src/storage/src/compaction/task.rs"}


/tmp/greptimedb/logs/greptime-datanode.2023-07-14-04:{"v":0,"name":"greptime-datanode","msg":"Compaction output [4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a,9a2562e6-df9d-4d85-b363-199b15445a8e,f8e8a2e6-6894-419c-8ab7-6bb24d6ed5d6,13f0465a-f3b2-4db1-81a0-d4d48d0a29e2,101a7479-4666-4bc9-9996-f709e5352eb4,b2f0e5fc-c798-4e90-ae85-aaa22e3a18bc,680d04ff-8e53-4bc0-a8fa-cfc09e024db6,ce692d00-7434-44ae-a144-091abfefb96c]-> 2302dc53-69e4-478a-a26f-b8666187865e","level":30,"hostname":"greptimedb-datanode-0","pid":1,"time":"2023-07-14T04:07:24.5846269Z","target":"storage::compaction::task","line":84,"file":"greptimedb/src/storage/src/compaction/task.rs"}


/tmp/greptimedb/logs/greptime-datanode.2023-07-14-04:{"v":0,"name":"greptime-datanode","msg":"Failed to compact region: 1968_0000000000","level":50,"hostname":"greptimedb-datanode-0","pid":1,"time":"2023-07-14T04:07:25.448327862Z","target":"storage::compaction::task","line":165,"file":"greptimedb/src/storage/src/compaction/task.rs","err.msg":"Fail to read object from path: data/.../public/1968/1968_0000000000/4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a.parquet, source: NotFound (persistent) at read, context: { response: Parts { status: 404, version: HTTP/1.1, headers: {...}, service: s3, path: data/etox4eivxxv2helpless_credit/public/1968/1968_0000000000/4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a.parquet, range: 0- } => S3Error { code: \"NoSuchKey\", message: \"The specified key does not exist.\", resource: \"\", request_id: \"Y6MP5QSDD6DXMT6N\" }","err.code":"StorageUnavailable","err.source":"NotFound (persistent) at read, context: { response: Parts { status: 404, version: HTTP/1.1, headers: {...} }, service: s3, path: data/etox4eivxxv2helpless_credit/public/1968/1968_0000000000/4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a.parquet, range: 0- } => S3Error { code: \"NoSuchKey\", message: \"The specified key does not exist.\", resource: \"\", request_id: \"Y6MP5QSDD6DXMT6N\" }"}

Looks like the compacted SST file 4caac3ad-4ad7-4b3b-8a97-f00b39bfe57a is deleted, but after a week, it reappears in region's SST file list.

What operating system did you use?

NA

Relevant log output and stack trace

No response

How can we reproduce the bug?

NA

@v0y4g3r v0y4g3r added the C-bug Category Bugs label Jul 21, 2023
@v0y4g3r
Copy link
Contributor Author

v0y4g3r commented Jul 21, 2023

We suspect it may related to some checkpoint issue:

/tmp/greptimedb/logs/greptime-datanode.2023-07-17-06:{"v":0,"name":"greptime-datanode","msg":"Region manifest checkpoint, start_version: 211, last_version: 253, compacted actions: 43","level":30,"hostname":"greptimedb-datanode-0","pid":1,"time":"2023-07-17T06:42:19.21837659Z","target":"storage::manifest::region","line":138,"file":"greptimedb/src/storage/src/manifest/region.rs"}
/tmp/greptimedb/logs/greptime-datanode.2023-07-17-07:{"v":0,"name":"greptime-datanode","msg":"Region manifest checkpoint, start_version: 211, last_version: 254, compacted actions: 1","level":30,"hostname":"greptimedb-datanode-0","pid":1,"time":"2023-07-17T07:50:54.588092373Z","target":"storage::manifest::region","line":138,"file":"greptimedb/src/storage/src/manifest/region.rs"}

We can see two checkpoint ran with same start_version 211.

@v0y4g3r
Copy link
Contributor Author

v0y4g3r commented Jul 21, 2023

The reason of this bug is the dirty _last_checkpoint opendal cache causing the checkpoint read has another start_version.

The previous checkpoint deleted all actions between 211~253, while the latter checkpoint losts these actions, which includes those removing SST files.

killme2008 added a commit to killme2008/greptimedb that referenced this issue Jul 24, 2023
killme2008 added a commit to killme2008/greptimedb that referenced this issue Jul 24, 2023
github-merge-queue bot pushed a commit that referenced this issue Jul 24, 2023
fix: dirty last checkpoint metadata file when enable object store caching, #2013
shuiyisong pushed a commit that referenced this issue Jul 24, 2023
fix: dirty last checkpoint metadata file when enable object store caching, #2013
shuiyisong pushed a commit that referenced this issue Aug 9, 2023
fix: dirty last checkpoint metadata file when enable object store caching, #2013
@evenyag
Copy link
Contributor

evenyag commented Aug 22, 2023

Looks like this problem reappeared.

Failed to do Flight get, addr: greptimedb-datanode-1.greptimedb-datanode.greptimedb:4001, code: Internal error, source: Failed to do Flight get, code: Internal error, source: NotFound (persistent) at read, context: { response: Parts { status: 404, version: HTTP/1.1, headers: {"x-amz-request-id": "***", "x-amz-id-2": "***", "x-amz-delete-marker": "true", "x-amz-version-id": "***", "content-type": "application/xml", "transfer-encoding": "chunked", "date": "Mon, 21 Aug 2023 05:34:08 GMT", "server": "AmazonS3"} }, service: s3, path: data/nowawfppg0gtgreptime_example/public/29203/29203_0000000000/038a8491-e0a1-47c4-b4ee-33d4431a7755.parquet, range: 0- } => S3Error { code: "NoSuchKey", message: "The specified key does not exist.", resource: "", request_id: "****" }

@evenyag evenyag reopened this Aug 22, 2023
paomian pushed a commit to paomian/greptimedb that referenced this issue Oct 19, 2023
fix: dirty last checkpoint metadata file when enable object store caching, GreptimeTeam#2013
@evenyag
Copy link
Contributor

evenyag commented Mar 27, 2024

This issue is duplicated to #2746

We already added a flag in #2745 to workaround corrupted manifests.

@evenyag evenyag closed this as not planned Won't fix, can't repro, duplicate, stale Mar 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category Bugs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants