
[Feature request] Setting model size and max concurrency specifically for each model (Triton). #93

Open
haiminh2001 opened this issue Sep 14, 2024 · 0 comments


message LoadModelResponse {
    // OPTIONAL - If nontrivial cost is involved in
    // determining the size, return 0 here and
    // do the sizing in the modelSize function
    uint64 sizeInBytes = 1;

    // EXPERIMENTAL - Applies only if limitModelConcurrency = true
    // was returned from runtimeStatus rpc.
    // See RuntimeStatusResponse.limitModelConcurrency for more detail
    uint32 maxConcurrency = 2;
}

Hi, in model-runtime.proto, the LoadModelResponse message specifies the model's size in bytes and its max concurrency. Currently, the size in bytes is computed as the total size of the model files, which may be reasonable for deep learning weights but is inaccurate in other cases, for example the Triton Python backend. In addition, different models may well need different max concurrency values.
Therefore, I propose that the adapter could read these settings from a separate config file within the model folder (just like the config.pbtxt file) to override the defaults.
I am open to creating a PR.
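A minimal sketch of what the proposed override could look like, written in Go. The file name serving-overrides.json, its field names, and the diskSize fallback helper are all hypothetical illustrations of the idea, not existing adapter behavior:

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// overrides mirrors the two LoadModelResponse fields this issue proposes
// to make configurable per model. A model folder could contain, e.g.:
//   serving-overrides.json:
//   {"sizeInBytes": 2147483648, "maxConcurrency": 4}
// (file name and schema are hypothetical)
type overrides struct {
	SizeInBytes    uint64 `json:"sizeInBytes"`
	MaxConcurrency uint32 `json:"maxConcurrency"`
}

// diskSize sums the sizes of all files under the model directory; this
// stands in for the current file-size-based estimate.
func diskSize(modelDir string) (uint64, error) {
	var total uint64
	err := filepath.Walk(modelDir, func(_ string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			total += uint64(info.Size())
		}
		return nil
	})
	return total, err
}

// loadOverrides returns the per-model settings if the override file
// exists, falling back to the disk-size estimate otherwise. A zero
// MaxConcurrency simply leaves the field unset in LoadModelResponse.
func loadOverrides(modelDir string) (overrides, error) {
	var o overrides
	data, err := os.ReadFile(filepath.Join(modelDir, "serving-overrides.json"))
	if err == nil {
		if jsonErr := json.Unmarshal(data, &o); jsonErr != nil {
			return o, jsonErr
		}
	}
	if o.SizeInBytes == 0 { // no override: keep today's behavior
		o.SizeInBytes, err = diskSize(modelDir)
		if err != nil {
			return o, err
		}
	}
	return o, nil
}

func main() {
	o, err := loadOverrides("/models/my-python-model")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("sizeInBytes=%d maxConcurrency=%d\n", o.SizeInBytes, o.MaxConcurrency)
}

With something like this, a model folder carrying the override file would report those values in LoadModelResponse, while models without it keep the current file-size-based sizing.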
