[neo] Allow setting GPU memory-related LMI options in sharding jobs #2546
Conversation
```diff
@@ -103,6 +103,7 @@ def shard_lmi_dist_model(self, input_dir: str, output_dir: str,
     enforce_eager=True,
     disable_custom_all_reduce=True,
     distributed_executor_backend="mp",
+    gpu_memory_utilization=0.99,
```
While this may allow for the sharding of 405b on a p4de, will the resulting converted artifacts even be usable by the customer on the same instance? Do we also provide the necessary configs in the output to ensure the model can be loaded on the instance for inference?
That is correct, it would have to be set to the same value during runtime. I updated the PR to simply allow customers to specify `option.gpu_memory_utilization` during sharding jobs, and with #2545 that value will be propagated to the `serving.properties`. Same with `option.enforce_eager`, since it also affects GPU memory utilization. Thus, customers who want to use a model/instance combination that requires a change to `gpu_memory_utilization` or `enforce_eager` will have to be aware of this at the time of converting the artifacts.
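As a rough sketch of the propagation described above, assuming hypothetical helper and argument names (not the PR's actual code), the sharding job could both override the engine argument and write the same value into `serving.properties` so the runtime loads the artifacts with identical settings:

```python
import os

# Hypothetical sketch: apply a customer-supplied option.gpu_memory_utilization
# during sharding and echo it into serving.properties so inference uses the
# same value. Names and structure are illustrative only.
def apply_gpu_memory_option(properties: dict, engine_args: dict, output_dir: str) -> None:
    value = properties.get("option.gpu_memory_utilization")
    if value is not None:
        # Override the default used while sharding.
        engine_args["gpu_memory_utilization"] = float(value)
        # Persist the same value for the runtime config (parity with #2545).
        with open(os.path.join(output_dir, "serving.properties"), "a") as f:
            f.write(f"option.gpu_memory_utilization={value}\n")
```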
Awesome, this sounds like the right approach!
Let's also add:

- `option.max_rolling_batch_size` -> `max_num_seqs`
- `option.max_model_len` -> `max_model_len`

That should be good for this PR. We should look at the engine configs and determine the full set, but with these 4 (in total) we can do the rest in a fast follow.
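A minimal sketch of what these option-to-engine-argument mappings could look like, assuming a helper that translates customer-supplied properties into vLLM engine kwargs (the helper name and call shape are hypothetical, not the PR's actual code):

```python
# Hypothetical sketch of mapping the four LMI options onto vLLM engine
# arguments during the sharding job. Option names follow the PR description.
def lmi_options_to_engine_args(properties: dict) -> dict:
    mapping = {
        "option.gpu_memory_utilization": ("gpu_memory_utilization", float),
        "option.enforce_eager": ("enforce_eager", lambda v: str(v).lower() == "true"),
        "option.max_rolling_batch_size": ("max_num_seqs", int),
        "option.max_model_len": ("max_model_len", int),
    }
    engine_args = {}
    for option_key, (engine_key, cast) in mapping.items():
        if option_key in properties:
            engine_args[engine_key] = cast(properties[option_key])
    return engine_args
```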
This PR allows customers to pass in values for the LMI options

- `option.gpu_memory_utilization`
- `option.enforce_eager`
- `option.max_rolling_batch_size`
- `option.max_model_len`

during Neo AOT sharding jobs. Changing the values of these options can allow a customer to shard a model on an instance that would otherwise result in OOM when using the default values. Note that the customized options need to be set to the same values when the artifacts are loaded back at runtime; the generated `serving.properties` in the output artifacts will ensure this parity.
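For illustration only, the generated `serving.properties` might then carry the same values that were used during sharding (the values below are placeholders, not recommendations):

```properties
# Illustrative example; values are placeholders.
option.gpu_memory_utilization=0.99
option.enforce_eager=true
option.max_rolling_batch_size=64
option.max_model_len=8192
```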