
[neo] Allow setting GPU memory-related LMI options in sharding jobs #2546

Merged
merged 4 commits into deepjavalibrary:master from the fml-gpu-mem branch on Nov 12, 2024

Conversation

Contributor
@ethnzhng commented Nov 12, 2024

This PR allows customers to pass in values for the LMI options

  • option.gpu_memory_utilization
  • option.enforce_eager
  • option.max_rolling_batch_size
  • option.max_model_len

during Neo AOT sharding jobs. Changing these options can allow a customer to shard a model on an instance that would otherwise hit OOM with the default values. Note that the customized options need to be set to the same values when the artifacts are loaded back at runtime; the serving.properties file generated in the output artifacts ensures this parity (see the illustrative example below).
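The exact contents of the generated file are not shown in this PR; the snippet below is a sketch assuming the four options above are simply echoed into serving.properties with the customer-supplied values (the values shown are placeholders, not defaults):

```properties
# Illustrative only; the generated file would carry the values the customer
# actually passed to the sharding job.
option.gpu_memory_utilization=0.99
option.enforce_eager=true
option.max_rolling_batch_size=32
option.max_model_len=8192
```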

@ethnzhng ethnzhng requested review from zachgk and a team as code owners November 12, 2024 18:34
@@ -103,6 +103,7 @@ def shard_lmi_dist_model(self, input_dir: str, output_dir: str,
enforce_eager=True,
disable_custom_all_reduce=True,
distributed_executor_backend="mp",
gpu_memory_utilization=0.99,
Contributor
While this may allow for the sharding of 405b on a p4de, will the resulting converted artifacts even be usable by the customer on the same instance? Do we also provide the necessary configs in the output to ensure the model can be loaded on the instance for inference?

Contributor Author

That is correct, it would have to be set to the same value at runtime. I updated the PR to simply allow customers to specify option.gpu_memory_utilization during sharding jobs, and with #2545 that value will be propagated to serving.properties. The same applies to option.enforce_eager, since it also affects GPU memory utilization.

Thus customers wanting to use a model/instance combination that requires a change to gpu_memory_utilization or enforce_eager will have to be aware of this at the time of converting the artifacts.

Contributor
@siddvenk Nov 12, 2024

Awesome, this sounds like the right approach!

Let's also add:

  • option.max_rolling_batch_size -> max_num_seqs
  • option.max_model_len -> max_model_len

That should be good for this PR. We should look at the engine configs and determine the full set, but with these four options in total we can do the rest in a fast follow-up.
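As an illustration of the mapping discussed above, here is a minimal Python sketch of how the customer-facing LMI options could be translated into engine keyword arguments. The function name `build_engine_overrides` and the `properties` dict are assumptions for illustration, not the actual Neo/LMI handler code:

```python
# Hypothetical sketch (not the actual implementation): translate the four LMI
# options discussed in this PR into engine keyword arguments before sharding.

# Mapping from LMI option name to (engine kwarg name, type converter).
_OPTION_TO_ENGINE_ARG = {
    "option.gpu_memory_utilization": ("gpu_memory_utilization", float),
    "option.enforce_eager": ("enforce_eager", lambda v: str(v).lower() == "true"),
    "option.max_rolling_batch_size": ("max_num_seqs", int),
    "option.max_model_len": ("max_model_len", int),
}


def build_engine_overrides(properties: dict) -> dict:
    """Return engine kwargs only for options the customer explicitly set."""
    overrides = {}
    for option, (engine_arg, convert) in _OPTION_TO_ENGINE_ARG.items():
        if option in properties:
            overrides[engine_arg] = convert(properties[option])
    return overrides


if __name__ == "__main__":
    # Example: values a customer might supply for a sharding job.
    props = {
        "option.gpu_memory_utilization": "0.99",
        "option.enforce_eager": "true",
    }
    print(build_engine_overrides(props))
    # -> {'gpu_memory_utilization': 0.99, 'enforce_eager': True}
```

The same values would then be written into the output serving.properties so the runtime engine configuration matches what was used during sharding.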

@ethnzhng ethnzhng changed the title [neo] Increase gpu_memory_utilization of lmi-dist engine in sharding jobs [neo] Allow setting gpu_memory_utilization & enforce_eager in sharding jobs Nov 12, 2024
@ethnzhng ethnzhng changed the title [neo] Allow setting gpu_memory_utilization & enforce_eager in sharding jobs [neo] Allow setting GPU memory-related LMI options in sharding jobs Nov 12, 2024
@ethnzhng ethnzhng merged commit a014254 into deepjavalibrary:master Nov 12, 2024
6 of 9 checks passed
@ethnzhng ethnzhng deleted the fml-gpu-mem branch November 12, 2024 22:27