[Bug]: Milvus Query worker CPU skewness #30978
Comments
I don't have monitoring metrics. A rough guess is that the single shard delegator is the bottleneck.
To verify this:
To fix this:
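A minimal sketch (not necessarily the verification the commenter had in mind) of how the shard count, and hence the number of shard delegators, could be checked from the Java SDK. It assumes milvus-sdk-java v2.x; the host, port, and collection name are placeholders rather than values from this issue. Each shard of a collection is served by one shard delegator, so a shard count of 1 means a single delegator fronts all search traffic.

```java
import io.milvus.client.MilvusServiceClient;
import io.milvus.grpc.DescribeCollectionResponse;
import io.milvus.param.ConnectParam;
import io.milvus.param.R;
import io.milvus.param.collection.DescribeCollectionParam;

public class ShardCountCheck {
    public static void main(String[] args) {
        // Connect to Milvus; host and port are placeholders.
        MilvusServiceClient client = new MilvusServiceClient(
                ConnectParam.newBuilder()
                        .withHost("localhost")
                        .withPort(19530)
                        .build());

        // Describe the collection and read its shard count. A collection with
        // shards_num == 1 is fronted by a single shard delegator, which would
        // make that delegator's query node a likely hotspot.
        R<DescribeCollectionResponse> resp = client.describeCollection(
                DescribeCollectionParam.newBuilder()
                        .withCollectionName("my_collection") // hypothetical name
                        .build());
        System.out.println("shards_num = " + resp.getData().getShardsNum());

        client.close();
    }
}
```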
/assign @madogar
I can see the following WARN logs in the query node that has the CPU spike:
Also, can you please shed some light on:
/assign @xiaofan-luan
@madogar The error "delegator failed to release segment" doesn't affect search latency; it is a normal warning in v2.3.3 and can be ignored.
Do you have a monitoring system? https://milvus.io/docs/monitor.md If you have one, please take some screenshots for me like this:
Unfortunately we do not have metrics dashboards like the above for the cluster. We are working on it.
@yhmo can you please also take a look at the segment balance API issue:
Did you check what the balance API does in the Java SDK?
Here is the doc we referred to: and here is a similar issue filed by another dev: milvus-io/milvus-sdk-java#356
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/assign @yhmo
Is there an existing issue for this?
Environment
Current Behavior
We have a Milvus cluster with ~200 million records and ingestion happening constantly at ~500 rows/sec. In parallel we run vector search at ~20 QPS. The query worker replica count is 15 (each worker with 14 cores and 60 GB RAM). We are noticing a weird scenario where the query latency suddenly shoots up from 200 ms to 3000 ms.
Upon digging deeper we find CPU usage on 1-2 query workers is very high (~99%, with CPU load average > 100), while the other nodes sit at ~50% usage. Our hypothesis is that the overall query latency shoots up because of these query workers with high CPU.
Further, we checked whether segments are balanced and realised the worker with very high CPU usage holds more segments (90 vs 60 on the other workers), although memory usage is balanced across workers. We understand segment balancing happens based on memory, but we observe that when the segment count is high, CPU usage shoots up. Why does this happen? What is the solution to this problem?
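A hedged sketch of how per-node segment counts could be dumped with the Java SDK to quantify the skew described above. It assumes milvus-sdk-java v2.x against Milvus 2.3, where getQuerySegmentInfo() reports the query node IDs holding each sealed segment (older versions expose only the singular nodeID field instead). Host, port, and collection name are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

import io.milvus.client.MilvusServiceClient;
import io.milvus.grpc.GetQuerySegmentInfoResponse;
import io.milvus.grpc.QuerySegmentInfo;
import io.milvus.param.ConnectParam;
import io.milvus.param.R;
import io.milvus.param.control.GetQuerySegmentInfoParam;

public class SegmentDistribution {
    public static void main(String[] args) {
        MilvusServiceClient client = new MilvusServiceClient(
                ConnectParam.newBuilder().withHost("localhost").withPort(19530).build());

        // Ask which sealed segments are loaded on which query nodes.
        R<GetQuerySegmentInfoResponse> resp = client.getQuerySegmentInfo(
                GetQuerySegmentInfoParam.newBuilder()
                        .withCollectionName("my_collection") // hypothetical name
                        .build());

        // Count segments per query node to make the skew visible (e.g. 90 vs 60).
        Map<Long, Integer> segmentsPerNode = new HashMap<>();
        for (QuerySegmentInfo info : resp.getData().getInfosList()) {
            for (long nodeId : info.getNodeIdsList()) {
                segmentsPerNode.merge(nodeId, 1, Integer::sum);
            }
        }
        segmentsPerNode.forEach((node, count) ->
                System.out.println("query node " + node + " holds " + count + " segments"));

        client.close();
    }
}
```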
Also, we tried to manually rebalance segments using the loadBalance() Java API but got the following error: "ERROR: LoadBalanceRequest failed:collection=0: collection not found". It would have been interesting if we could have moved the segments around and validated whether that resolves the CPU skewness.
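For reference, a hedged sketch of the manual rebalance call, assuming a milvus-sdk-java version whose LoadBalanceParam builder exposes withCollectionName(); builders that do not send a collection name appear to be what produces the "collection=0: collection not found" error (see milvus-io/milvus-sdk-java#356). All IDs and names below are placeholders; in practice the node and segment IDs would come from a getQuerySegmentInfo() dump.

```java
import io.milvus.client.MilvusServiceClient;
import io.milvus.param.ConnectParam;
import io.milvus.param.R;
import io.milvus.param.RpcStatus;
import io.milvus.param.control.LoadBalanceParam;

public class ManualRebalance {
    public static void main(String[] args) {
        MilvusServiceClient client = new MilvusServiceClient(
                ConnectParam.newBuilder().withHost("localhost").withPort(19530).build());

        // Move two sealed segments off the overloaded query node.
        // Node and segment IDs are placeholders taken from a segment-info dump.
        // Setting the collection name requires an SDK version whose builder supports it.
        R<RpcStatus> resp = client.loadBalance(
                LoadBalanceParam.newBuilder()
                        .withCollectionName("my_collection") // hypothetical name
                        .withSourceNodeID(7L)                // overloaded node
                        .addDestinationNodeID(8L)            // lightly loaded node
                        .addSegmentID(445566L)
                        .addSegmentID(445567L)
                        .build());

        if (resp.getStatus() == R.Status.Success.getCode()) {
            System.out.println("load balance request accepted");
        } else {
            System.out.println("load balance failed, status = " + resp.getStatus());
        }

        client.close();
    }
}
```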
Expected Behavior
CPU usage skewness across query workers shouldn't persist for a long time.
Steps To Reproduce
No response
Milvus Log
milvus_cluster_query_coordlogs-2024-03-01.pdf
milvus_cluster_query-workerlogs-2024-03-01.pdf
Anything else?
No response