[neo] Allow setting GPU memory-related LMI options in sharding jobs #2546
Conversation
```diff
@@ -103,6 +103,7 @@ def shard_lmi_dist_model(self, input_dir: str, output_dir: str,
     enforce_eager=True,
     disable_custom_all_reduce=True,
     distributed_executor_backend="mp",
+    gpu_memory_utilization=0.99,
```
While this may allow for the sharding of 405b on a p4de, will the resulting converted artifacts even be usable by the customer on the same instance? Do we also provide the necessary configs in the output to ensure the model can be loaded on the instance for inference?
That is correct, it would have to be set to the same value during runtime. I updated the PR to simply allow customers to specify `option.gpu_memory_utilization` during sharding jobs, and with #2545 that value will be propagated to the `serving.properties`. Same with `option.enforce_eager`, since it also affects GPU memory utilization. Thus, customers who want to use a model/instance combination that requires a change to `gpu_memory_utilization` or `enforce_eager` will have to be aware of this at the time of converting the artifacts.
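As a rough sketch of the propagation described above, assuming hypothetical helper and argument names (not the PR's actual code), the sharding job could both override the engine argument and write the same value into `serving.properties` so the runtime loads the artifacts with identical settings:

```python
import os

# Hypothetical sketch: apply a customer-supplied option.gpu_memory_utilization
# during sharding and echo it into serving.properties so inference uses the
# same value. Names and structure are illustrative only.
def apply_gpu_memory_option(properties: dict, engine_args: dict, output_dir: str) -> None:
    value = properties.get("option.gpu_memory_utilization")
    if value is not None:
        # Override the default used while sharding.
        engine_args["gpu_memory_utilization"] = float(value)
        # Persist the same value for the runtime config (parity with #2545).
        with open(os.path.join(output_dir, "serving.properties"), "a") as f:
            f.write(f"option.gpu_memory_utilization={value}\n")
```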
Awesome, this sounds like the right approach!
Let's also add:

- `option.max_rolling_batch_size` -> `max_num_seqs`
- `option.max_model_len` -> `max_model_len`

That should be good for this PR. We should look at the engine configs and determine the full set, but with these 4 (in total) we can do the rest in a fast follow.
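A minimal sketch of what these option-to-engine-argument mappings could look like, assuming a helper that translates customer-supplied properties into vLLM engine kwargs (the helper name and call shape are hypothetical, not the PR's actual code):

```python
# Hypothetical sketch of mapping the four LMI options onto vLLM engine
# arguments during the sharding job. Option names follow the PR description.
def lmi_options_to_engine_args(properties: dict) -> dict:
    mapping = {
        "option.gpu_memory_utilization": ("gpu_memory_utilization", float),
        "option.enforce_eager": ("enforce_eager", lambda v: str(v).lower() == "true"),
        "option.max_rolling_batch_size": ("max_num_seqs", int),
        "option.max_model_len": ("max_model_len", int),
    }
    engine_args = {}
    for option_key, (engine_key, cast) in mapping.items():
        if option_key in properties:
            engine_args[engine_key] = cast(properties[option_key])
    return engine_args
```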
This PR allows customers to pass in values for the LMI options

- `option.gpu_memory_utilization`
- `option.enforce_eager`
- `option.max_rolling_batch_size`
- `option.max_model_len`

during Neo AOT sharding jobs. Changing the values of these options can allow a customer to shard a model on an instance that would otherwise result in OOM when using the default values. Note that the customized options need to be set to the same values when the artifacts are loaded back at runtime; the generated `serving.properties` in the output artifacts will ensure this parity.
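For illustration only, the generated `serving.properties` might then carry the same values that were used during sharding (the values below are placeholders, not recommendations):

```properties
# Illustrative example; values are placeholders.
option.gpu_memory_utilization=0.99
option.enforce_eager=true
option.max_rolling_batch_size=64
option.max_model_len=8192
```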