[Bug]: Milvus Query worker CPU skewness #30978
Comments
I don't have monitoring metrics. A rough guess is that the single shard delegator is the bottleneck.
To verify this:
To fix this:
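A minimal sketch (not necessarily the verification the commenter had in mind) of how the shard count, and hence the number of shard delegators, could be checked from the Java SDK. It assumes milvus-sdk-java v2.x; the host, port, and collection name are placeholders rather than values from this issue. Each shard of a collection is served by one shard delegator, so a shard count of 1 means a single delegator fronts all search traffic.

```java
import io.milvus.client.MilvusServiceClient;
import io.milvus.grpc.DescribeCollectionResponse;
import io.milvus.param.ConnectParam;
import io.milvus.param.R;
import io.milvus.param.collection.DescribeCollectionParam;

public class ShardCountCheck {
    public static void main(String[] args) {
        // Connect to Milvus; host and port are placeholders.
        MilvusServiceClient client = new MilvusServiceClient(
                ConnectParam.newBuilder()
                        .withHost("localhost")
                        .withPort(19530)
                        .build());

        // Describe the collection and read its shard count. A collection with
        // shards_num == 1 is fronted by a single shard delegator, which would
        // make that delegator's query node a likely hotspot.
        R<DescribeCollectionResponse> resp = client.describeCollection(
                DescribeCollectionParam.newBuilder()
                        .withCollectionName("my_collection") // hypothetical name
                        .build());
        System.out.println("shards_num = " + resp.getData().getShardsNum());

        client.close();
    }
}
```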
/assign @madogar
I can see the following WARN logs in the query node that has the CPU spike:
Also, can you please shed some light on:
/assign @xiaofan-luan
@madogar The error "delegator failed to release segment" doesn't affect search latency; it is a normal warning in v2.3.3 and can be ignored.
Do you have a monitoring system? https://milvus.io/docs/monitor.md If you have one, please take some screenshots for me like this:
Unfortunately we do not have metrics dashboards like the above for the cluster. We are working on it.
@yhmo can you please also take a look at the segment balance API issue:
Did you check what the balance API does in the Java SDK?
Here is the doc we referred to: and here is a similar issue filed by another dev: milvus-io/milvus-sdk-java#356
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/assign @yhmo
Is there an existing issue for this?
Environment
Current Behavior
We have a Milvus cluster with ~200 million records and ingestion happening constantly at ~500 rows/sec. In parallel we run vector search at ~20 QPS. The query worker replica count is 15 (each worker with 14 cores and 60 GB RAM). We are noticing a weird scenario where the query latency suddenly shoots up from 200 ms to 3000 ms.
Upon digging deeper we find CPU usage on 1-2 query workers is very high (~99%, with CPU load average > 100), while the other nodes sit at ~50% usage. Our hypothesis is that the overall query latency shoots up because of these query workers with high CPU.
Further, we checked whether segments are balanced and realised the worker with very high CPU usage holds more segments (90 vs 60 on the other workers), although memory usage is balanced across workers. We understand segment balancing happens based on memory, but we observe that when the segment count is high, CPU usage shoots up. Why does this happen? What is the solution to this problem?
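A hedged sketch of how per-node segment counts could be dumped with the Java SDK to quantify the skew described above. It assumes milvus-sdk-java v2.x against Milvus 2.3, where getQuerySegmentInfo() reports the query node IDs holding each sealed segment (older versions expose only the singular nodeID field instead). Host, port, and collection name are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

import io.milvus.client.MilvusServiceClient;
import io.milvus.grpc.GetQuerySegmentInfoResponse;
import io.milvus.grpc.QuerySegmentInfo;
import io.milvus.param.ConnectParam;
import io.milvus.param.R;
import io.milvus.param.control.GetQuerySegmentInfoParam;

public class SegmentDistribution {
    public static void main(String[] args) {
        MilvusServiceClient client = new MilvusServiceClient(
                ConnectParam.newBuilder().withHost("localhost").withPort(19530).build());

        // Ask which sealed segments are loaded on which query nodes.
        R<GetQuerySegmentInfoResponse> resp = client.getQuerySegmentInfo(
                GetQuerySegmentInfoParam.newBuilder()
                        .withCollectionName("my_collection") // hypothetical name
                        .build());

        // Count segments per query node to make the skew visible (e.g. 90 vs 60).
        Map<Long, Integer> segmentsPerNode = new HashMap<>();
        for (QuerySegmentInfo info : resp.getData().getInfosList()) {
            for (long nodeId : info.getNodeIdsList()) {
                segmentsPerNode.merge(nodeId, 1, Integer::sum);
            }
        }
        segmentsPerNode.forEach((node, count) ->
                System.out.println("query node " + node + " holds " + count + " segments"));

        client.close();
    }
}
```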
Also, we tried to manually rebalance segments using the loadBalance() Java API but got the following error: "ERROR: LoadBalanceRequest failed:collection=0: collection not found". It would have been interesting if we could have moved the segments around and validated whether that resolves the CPU skewness.
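For reference, a hedged sketch of the manual rebalance call, assuming a milvus-sdk-java version whose LoadBalanceParam builder exposes withCollectionName(); builders that do not send a collection name appear to be what produces the "collection=0: collection not found" error (see milvus-io/milvus-sdk-java#356). All IDs and names below are placeholders; in practice the node and segment IDs would come from a getQuerySegmentInfo() dump.

```java
import io.milvus.client.MilvusServiceClient;
import io.milvus.param.ConnectParam;
import io.milvus.param.R;
import io.milvus.param.RpcStatus;
import io.milvus.param.control.LoadBalanceParam;

public class ManualRebalance {
    public static void main(String[] args) {
        MilvusServiceClient client = new MilvusServiceClient(
                ConnectParam.newBuilder().withHost("localhost").withPort(19530).build());

        // Move two sealed segments off the overloaded query node.
        // Node and segment IDs are placeholders taken from a segment-info dump.
        // Setting the collection name requires an SDK version whose builder supports it.
        R<RpcStatus> resp = client.loadBalance(
                LoadBalanceParam.newBuilder()
                        .withCollectionName("my_collection") // hypothetical name
                        .withSourceNodeID(7L)                // overloaded node
                        .addDestinationNodeID(8L)            // lightly loaded node
                        .addSegmentID(445566L)
                        .addSegmentID(445567L)
                        .build());

        if (resp.getStatus() == R.Status.Success.getCode()) {
            System.out.println("load balance request accepted");
        } else {
            System.out.println("load balance failed, status = " + resp.getStatus());
        }

        client.close();
    }
}
```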
Expected Behavior
CPU usage skewness across query workers shouldn't persist for a long time.
Steps To Reproduce
No response
Milvus Log
milvus_cluster_query_coordlogs-2024-03-01.pdf
milvus_cluster_query-workerlogs-2024-03-01.pdf
Anything else?
No response