
[Bug]: Milvus Query worker CPU skewness #30978

Closed
madogar opened this issue Mar 1, 2024 · 15 comments
Assignees
Labels
kind/bug Issues or changes related to a bug stale Indicates no updates for 30 days triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@madogar
Contributor

madogar commented Mar 1, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.3
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2): milvus-sdk-java
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

We have a Milvus cluster with ~200 million records and constant ingestion at ~500 rows/sec. In parallel we run vector search at ~20 QPS. The query worker replica count is 15 (each with 14 cores and 60 GB RAM). We are seeing a strange scenario where query latency suddenly shoots up from 200 ms to 3000 ms.
Digging deeper, we find that CPU usage on 1-2 query workers is very high (~99%, with CPU load average > 100), while the other nodes are at ~50%. Our hypothesis is that the overall query latency shoots up because of these few query workers running at high CPU.
We then checked whether segments are balanced and realised that the worker with very high CPU usage holds more segments (90 vs 60 on the other workers), even though memory usage is balanced across workers. We understand that segment balancing is based on memory, but we observe that CPU usage shoots up when the segment count is high. Why does this happen, and what is the solution?

Also, we tried to manually rebalance segments using the loadBalance() Java API but got the following error: "ERROR: LoadBalanceRequest failed:collection=0: collection not found". It would have been interesting to move segments around and validate whether that resolves the CPU skewness.

Expected Behavior

CPU usage skewness across query workers shouldn't persist for a long time.

Steps To Reproduce

No response

Milvus Log

milvus_cluster_query_coordlogs-2024-03-01.pdf
milvus_cluster_query-workerlogs-2024-03-01.pdf

Anything else?

No response

@madogar madogar added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 1, 2024
@xiaofan-luan
Collaborator

I don't have your monitoring metrics, so a rough guess: the single shard delegator is the bottleneck.

To verify this:
find the node with the highest CPU and check whether its logs mention the delegator.

To fix this:

  1. Make sure you are running the latest 2.3.x; it has a growing-segment index that can accelerate shard delegator processing.
  2. @sunby has a new balance policy that can distribute data more evenly under streaming insertion. He should be able to help with that.

@yanliang567
Contributor

/assign @madogar
/unassign

@sre-ci-robot sre-ci-robot assigned madogar and unassigned yanliang567 Mar 2, 2024
@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 2, 2024
@madogar
Contributor Author

madogar commented Mar 2, 2024

I can see the following WARN log on the query node with the CPU spike:
"[WARN] [querynodev2/services.go:569] ["delegator failed to release segment"] [traceID=1de1f04620581da219ed2dc11c4c533a] [collectionID=448030081733463925] [shard=milvusshreeshahpa-rootcoord-dml_6_448030081733463925v0] [segmentIDs="[448030081688217575]"] [currentNodeID=62]"

  1. We are running version 2.3.3.
  2. Are you hinting that the root cause is growing segments produced by streaming ingestion? The segment distribution skewness I observed was only for sealed segments. I also tried stopping ingestion and flushing the data (a flush sketch follows below), which didn't help bring down the CPU skewness.
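
For reference, a minimal sketch of the flush step mentioned in point 2, assuming the same milvus-sdk-java client wrapper (milvusClient) and collectionName used elsewhere in this thread; this is illustrative, not the exact code we ran:

import java.util.Collections;
import io.milvus.grpc.FlushResponse;
import io.milvus.param.R;
import io.milvus.param.collection.FlushParam;

// Seal the current growing segments of the collection and persist them,
// blocking until the flush completes.
R<FlushResponse> flushResp = milvusClient.getMilvusServiceClient()
        .flush(FlushParam.newBuilder()
                .withCollectionNames(Collections.singletonList(collectionName))
                .withSyncFlush(Boolean.TRUE)
                .build());
if (flushResp.getStatus() != R.Status.Success.getCode()) {
    log.warn("flush failed: " + flushResp.getMessage());
}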

Also, can you please shed some light on:

  1. The shard delegator concept
  2. The logic of the new segment balance policy you mentioned

@madogar madogar removed their assignment Mar 2, 2024
@madogar
Contributor Author

madogar commented Mar 2, 2024

/assign @xiaofan-luan

@yhmo
Contributor

yhmo commented Mar 8, 2024

@madogar
Could you show me the client code for how you call the search() interface?
And can you print the collection's schema for me?
print(collection.describe())

The error "delegator failed to release segment" doesn't affect search latency; it is a normal warning in v2.3.3 and can be ignored.

@madogar
Contributor Author

madogar commented Mar 8, 2024

// `begin` was not included in the pasted snippet; declared here so the code compiles.
long begin = System.currentTimeMillis();

List<String> outFields = Collections.singletonList(AGE_FIELD);
List<List<Float>> vectors = generateFloatVectors(1);

// Build the search request: random topK in [1, 7], IP metric, eventual consistency.
SearchParam searchParam =
        SearchParam.newBuilder()
                .withCollectionName(collectionName)
                .withMetricType(MetricType.IP)
                .withOutFields(outFields)
                .withTopK(topkRandom.nextInt(7) + 1)
                .withVectors(vectors)
                .withVectorFieldName(VECTOR_FIELD)
                .withExpr(searchExpr)
                //.withParams("{\"ef\":" + EF_VAL + "}")
                .withIgnoreGrowing(IGNORE_GROWING)
                .withConsistencyLevel(ConsistencyLevelEnum.EVENTUALLY)
                .build();

log.info("========== searchFace() making actual call==========");

// 10 s timeout, up to 2 retries with a 5 s interval between attempts.
R<SearchResults> searchResponse = milvusClient.getMilvusServiceClient()
        .withTimeout(10, TimeUnit.SECONDS)
        //.withTimeout(120, TimeUnit.SECONDS)
        .withRetryInterval(5, TimeUnit.SECONDS)
        .withRetry(2)
        .search(searchParam);
handleResponseStatus(searchResponse);
long end = System.currentTimeMillis();
long cost = (end - begin);
log.info("Search time cost: " + cost + "ms");

SearchResultsWrapper wrapper = new SearchResultsWrapper(searchResponse.getData().getResults());
@yhmo
Contributor

yhmo commented Mar 8, 2024

Do you have a monitoring system? https://milvus.io/docs/monitor.md

If you have a monitoring system, take some screenshots of the relevant panels for me, like these:
(four example screenshots of Milvus monitoring dashboard panels)

@madogar
Contributor Author

madogar commented Mar 12, 2024

Unfortunately we do not have metrics dashboards like the ones above for this cluster. We are working on it.

@madogar
Contributor Author

madogar commented Mar 12, 2024

@yhmo can you please also take a look at the segment balance API issue:
"Also, we tried to manually rebalance segments using the loadBalance() Java API but got the following error: "ERROR: LoadBalanceRequest failed:collection=0: collection not found". It would have been interesting to move segments around and validate whether that resolves the CPU skewness."

@xiaofan-luan
Collaborator

@yhmo

did you check what the balance API does in the Java SDK?

@madogar
Contributor Author

madogar commented Mar 12, 2024

here is the doc we referred to:
https://milvus.io/docs/load_balance.md

and here is a similar issue filed by another dev: milvus-io/milvus-sdk-java#356


stale bot commented Apr 13, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale Indicates no updates for 30 days label Apr 13, 2024
@yanliang567
Contributor

/assign @yhmo
any updates?

@stale stale bot removed the stale Indicates no updates for 30 days label Apr 15, 2024
@yhmo
Contributor

yhmo commented Apr 19, 2024

How to use the loadBalance() interface in the Java SDK:
Assume we have a collection with 2 segments, and the Milvus cluster has 2 query nodes.
By default, the 2 segments are sent to different query nodes (each query node loads one segment).
(screenshot of the initial segment distribution)

Now we call loadBalance() to move the segment "449183061651634632" from query node "5" to query node "6":

R<RpcStatus> resp = milvusClient.loadBalance(LoadBalanceParam.newBuilder()
                .withCollectionName(COLLECTION_NAME)
                .addSegmentID(449183061651634632L)
                .withSourceNodeID(5L)
                .addDestinationNodeID(6L)
                .build());
System.out.println(resp);

If loadBalance() returns a success status code, check Attu and you will see that the segment "449183061651634632" has been moved to query node "6".
(screenshot of the segment distribution after the move)

But after a while, the internal load balancer will move the segment back to query node "5" because it detects that the load is not balanced.

This interface is intended for maintainers to handle special cases that the automatic load balancer cannot. You can try it; if the segment is moved back, that could be a defect in the load balancer.
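
To check whether the move sticks, here is a minimal sketch that lists the sealed segments of a collection and the query node(s) serving each one, to be run before and after loadBalance(). It assumes the same milvusClient and COLLECTION_NAME as the snippet above; the protobuf toString() is used for printing so the sketch doesn't depend on exact getter names, which can vary slightly between SDK versions:

import io.milvus.grpc.GetQuerySegmentInfoResponse;
import io.milvus.grpc.QuerySegmentInfo;
import io.milvus.param.R;
import io.milvus.param.control.GetQuerySegmentInfoParam;

R<GetQuerySegmentInfoResponse> segInfo = milvusClient.getQuerySegmentInfo(
        GetQuerySegmentInfoParam.newBuilder()
                .withCollectionName(COLLECTION_NAME)
                .build());
if (segInfo.getStatus() == R.Status.Success.getCode()) {
    for (QuerySegmentInfo info : segInfo.getData().getInfosList()) {
        // Each entry reports the segment ID, row count, and the node(s) serving it.
        System.out.println(info);
    }
} else {
    System.err.println("getQuerySegmentInfo failed: " + segInfo.getMessage());
}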


stale bot commented May 19, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale Indicates no updates for 30 days label May 19, 2024
@stale stale bot closed this as completed Jun 16, 2024