Longevity test 2024-01-16 cn OOM #14613

Closed
huangjw806 opened this issue Jan 17, 2024 · 6 comments

@huangjw806
Contributor

huangjw806 commented Jan 17, 2024

"!!! longevity Result!!!"
Today's date:2024-01-17
Result               FAIL                
Pipeline Message     @Nightly run all nexmark (8 sets of nexmark queries) with 10k throughput
TestBed              kubebench/3264g-medium-3cn-all-affinity
RW Version           nightly-20240116    
Test Start time      2024-01-16 15:11:48 
Test End time        2024-01-17 03:14:00 
Namespace            reglngvty-20240116-150232
Queries              nexmark_q0,nexmark_q1,nexmark_q2,nexmark_q3,nexmark_q4,nexmark_q5,nexmark_q7,nexmark_q8,nexmark_q9,nexmark_q10,nexmark_q12,nexmark_q14,nexmark_q15,nexmark_q16,nexmark_q17,nexmark_q18,nexmark_q19,nexmark_q20,nexmark_q21,nexmark_q22,nexmark_q101,nexmark_q102,nexmark_q103,nexmark_q104,nexmark_q105
Grafana Metric       https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&var-namespace=reglngvty-20240116-150232&from=1705417908000&to=1705461240000
Grafana Logs         https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&var-namespace=reglngvty-20240116-150232&from=1705417908000&to=1705461240000
Memory Dumps to S3   https://s3.console.aws.amazon.com/s3/buckets/test-useast1-mgmt-bucket-archiver?region=us-east-1&prefix=k8s/reglngvty-20240116-150232/&showversions=false
Buildkite Job        https://buildkite.com/risingwave-test/longevity-test/builds/915
Crash Container logs https://rw-qa-artifacts-public.s3.us-east-1.amazonaws.com/longevity/915_logs.txt
Report               https://rw-qa-artifacts-public.s3.us-east-1.amazonaws.com/longevity/915_report.txt


================================================================================
Restarted/Crashed Containers Details 
================================================================================
CONTAINER crashed/Restarted: benchmark-risingwave-compute-c-0 restart_count:11  phase:Running status:True
CONTAINER crashed/Restarted: benchmark-risingwave-compute-c-1 restart_count:1  phase:Running status:True
CONTAINER crashed/Restarted: benchmark-risingwave-compute-c-2 restart_count:1  phase:Running status:True
@fuyufjh
Member

fuyufjh commented Jan 17, 2024

Generating flamegraph: https://buildkite.com/risingwave-test/generate-collapsed-heap-files/builds/9

wcy-fdu self-assigned this Jan 17, 2024
@huangjw806
Contributor Author

Same issue for the longevity test on nightly-20240117:

"!!! longevity Result!!!"
Today's date:2024-01-18
Result               FAIL                
Pipeline Message     @Nightly run all nexmark (8 sets of nexmark queries) with 10k throughput
TestBed              kubebench/3264g-medium-3cn-all-affinity
RW Version           nightly-20240117    
Test Start time      2024-01-17 15:12:21 
Test End time        2024-01-18 03:14:39 
Namespace            reglngvty-20240117-150243
Queries              nexmark_q0,nexmark_q1,nexmark_q2,nexmark_q3,nexmark_q4,nexmark_q5,nexmark_q7,nexmark_q8,nexmark_q9,nexmark_q10,nexmark_q12,nexmark_q14,nexmark_q15,nexmark_q16,nexmark_q17,nexmark_q18,nexmark_q19,nexmark_q20,nexmark_q21,nexmark_q22,nexmark_q101,nexmark_q102,nexmark_q103,nexmark_q104,nexmark_q105
Grafana Metric       https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&var-namespace=reglngvty-20240117-150243&from=1705504341000&to=1705547679000
Grafana Logs         https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&var-namespace=reglngvty-20240117-150243&from=1705504341000&to=1705547679000
Memory Dumps to S3   https://s3.console.aws.amazon.com/s3/buckets/test-useast1-mgmt-bucket-archiver?region=us-east-1&prefix=k8s/reglngvty-20240117-150243/&showversions=false
Buildkite Job        https://buildkite.com/risingwave-test/longevity-test/builds/925
Crash Container logs https://rw-qa-artifacts-public.s3.us-east-1.amazonaws.com/longevity/925_logs.txt
Report               https://rw-qa-artifacts-public.s3.us-east-1.amazonaws.com/longevity/925_report.txt


================================================================================
Restarted/Crashed Containers Details 
================================================================================
CONTAINER crashed/Restarted: benchmark-risingwave-compute-c-0 restart_count:3  phase:Running status:True
CONTAINER crashed/Restarted: benchmark-risingwave-compute-c-2 restart_count:2  phase:Running status:True

@lmatz
Contributor

lmatz commented Jan 23, 2024

Any findings? cc: @wcy-fdu

@fuyufjh
Member

fuyufjh commented Jan 24, 2024

@wcy-fdu
Contributor

wcy-fdu commented Jan 25, 2024

From the heap file, no single piece of memory is particularly large. The largest is the write chunk of the materialized executor, which is 600MB.
I suspect it is caused by more MVs writing to the state store at the same time. To verify, I suggest lowering the spill threshold and checking whether the OOM still occurs.
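
A rough back-of-envelope sketch of the hypothesis above, not taken from the heap profile: if each concurrently writing MV can buffer up to the spill threshold before flushing, the worst-case buffered memory grows with the number of writers times the threshold. The writer count (25 queries x 8 sets, one MV each) and the threshold values are illustrative assumptions, and `aggregate_buffer_bytes` is a hypothetical helper, not a RisingWave API.

```python
MIB = 1024 * 1024


def aggregate_buffer_bytes(num_writers: int, spill_threshold_bytes: int) -> int:
    """Worst-case bytes buffered if every writer fills its buffer up to the threshold."""
    return num_writers * spill_threshold_bytes


# Assumption: 25 Nexmark queries x 8 sets, each creating one MV that writes concurrently.
num_writers = 25 * 8

# Hypothetical threshold values, chosen only to show the scaling.
for threshold_mib in (4, 2, 1):
    total = aggregate_buffer_bytes(num_writers, threshold_mib * MIB)
    print(f"threshold {threshold_mib} MiB -> up to {total // MIB} MiB buffered")
```

If this model is roughly right, halving the threshold should halve the write-chunk footprint, which is what the proposed experiment would check.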

@fuyufjh
Copy link
Member

fuyufjh commented Jan 31, 2024

Closed by #14855
