**Prerequisite:** Make sure the LLM inference framework can be launched in SPMD style. For example, the LLM inference script should be launchable with `torchrun --standalone --nproc_per_node=8 offline_inference.py`.
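For reference, a minimal SPMD-style `offline_inference.py` might look like the sketch below. The `DummyEngine` class is a placeholder for the new backend's real engine; only the `torch.distributed` setup reflects the actual torchrun contract:

```python
# offline_inference.py — minimal SPMD-style sketch.
# Launch: torchrun --standalone --nproc_per_node=8 offline_inference.py
# Every rank executes the same script; torchrun injects RANK / WORLD_SIZE / LOCAL_RANK.
import os

import torch
import torch.distributed as dist


class DummyEngine:
    """Stand-in for the real inference engine (e.g., vLLM's LLMEngine).

    The key requirement is that it can be constructed and called identically
    on every rank, handling any tensor-parallel sharding internally.
    """

    def generate(self, prompts):
        return [p + " -> <generated>" for p in prompts]


def main():
    # torchrun sets these environment variables for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    engine = DummyEngine()  # replace with the new backend's engine
    outputs = engine.generate(["Hello, world!"])

    if dist.get_rank() == 0:
        print(outputs)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```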
**A Rollout class:** Build an `xxx_rollout.py` script similar to `vllm_rollout.py`. In this file, define an `xxxRollout` class that inherits from `BaseRollout`.
This class should have a `generate_sequences` API that accepts a batch of `input_ids`, `response_masks`, and `position_ids` from the `DataProto` as input. The `self.inference_engine` (e.g., `LLMEngine` in vLLM) is responsible for performing auto-regressive generation and outputting a batch of responses. These responses should then be concatenated with `input_ids`, and the `response_masks` and `position_ids` should be reconstructed accordingly, as in the sketch below.
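A minimal sketch of such a class might look like this. The import paths follow verl's layout at the time of writing, the engine's `generate` call is a hypothetical stand-in for the backend's real API, and `config.pad_token_id` is an assumed config field; consult `vllm_rollout.py` for the exact contract:

```python
# xxx_rollout.py — sketch of a new rollout backend.
import torch

from verl import DataProto
from verl.workers.rollout.base import BaseRollout


class XXXRollout(BaseRollout):
    def __init__(self, inference_engine, config):
        super().__init__()
        self.inference_engine = inference_engine  # the backend's engine object
        self.config = config

    @torch.no_grad()
    def generate_sequences(self, prompts: DataProto) -> DataProto:
        input_ids = prompts.batch["input_ids"]        # (bs, prompt_len)
        position_ids = prompts.batch["position_ids"]  # (bs, prompt_len)

        # Hypothetical engine call: auto-regressive generation for the batch.
        responses = self.inference_engine.generate(input_ids)  # (bs, resp_len)

        # Concatenate prompts and responses, then rebuild the masks and
        # position_ids so training sees one contiguous sequence per sample.
        seq = torch.cat([input_ids, responses], dim=-1)
        resp_pos = position_ids[:, -1:] + torch.arange(
            1, responses.shape[1] + 1, device=position_ids.device
        ).unsqueeze(0)
        position_ids = torch.cat([position_ids, resp_pos], dim=-1)
        response_masks = (responses != self.config.pad_token_id).long()

        return DataProto.from_dict({
            "input_ids": seq,
            "responses": responses,
            "response_masks": response_masks,
            "position_ids": position_ids,
        })
```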
**ShardingManager Classes for Weight Synchronization with Training Frameworks:** Create files named `fsdp_xxx.py` and `megatron_xxx.py`, similar to `fsdp_vllm.py` and `megatron_vllm.py`. These files should define `XXXShardingManager` classes (i.e., the HybridEngine) that handle weight sharding between the training and inference frameworks.
In `megatron_vllm.py`, we define an `AllGatherPPModel` class to collect weights across the pipeline-parallelism dimension. The parameters stored in the `memory_buffers` of `AllGatherPPModel` will be used to synchronize the weights with the models in the vLLM rollout.
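As a rough sketch, such a sharding manager is typically a context manager: entering it pushes the current training weights into the inference engine before a rollout, and exiting releases the inference-side copies. The engine-side method names below (`sync_model_weights`, `offload_model_weights`) are assumptions for a new backend, not a fixed interface:

```python
# fsdp_xxx.py — sketch of a sharding manager bridging FSDP training weights
# and a new inference backend (engine-side method names are hypothetical).
import torch


class XXXShardingManager:
    def __init__(self, module, inference_engine):
        self.module = module                    # FSDP-wrapped training model
        self.inference_engine = inference_engine

    def __enter__(self):
        # Gather full (unsharded) parameters from the training framework and
        # push them into the inference engine before each rollout phase.
        state_dict = self.module.state_dict()
        self.inference_engine.sync_model_weights(state_dict)  # hypothetical API
        del state_dict
        torch.cuda.empty_cache()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Release the inference-side weight copies so training can reclaim
        # the GPU memory (hypothetical API).
        self.inference_engine.offload_model_weights()
        torch.cuda.empty_cache()


# Usage: wrap each rollout so weights are fresh on entry and memory is
# returned to the trainer on exit.
# with XXXShardingManager(actor_module, engine):
#     output = rollout.generate_sequences(prompts)
```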
**Weight loading issues:** It may be necessary to provide model-specific weight loaders for transferring weights between the LLM inference and training backends, similar to the `dtensor_weight_loader.py` and `megatron_weight_loader.py` files for vLLM.
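A per-model weight loader is essentially a name-mapping plus copy routine. Here is a hedged sketch; the mapping table, function name, and handling of sharded tensors are illustrative only (see `dtensor_weight_loader.py` for a real implementation):

```python
# xxx_weight_loader.py — sketch of a per-model weight loader.
import torch

# Map training-side parameter names to the inference engine's names.
# One entry per mismatched name for this model family (example entry only).
_NAME_MAPPING = {
    "model.embed_tokens.weight": "embedding.word_embeddings.weight",
}


def load_weights_for_mymodel(actor_weights: dict, inference_model: torch.nn.Module):
    """Copy training weights into the inference model, renaming as needed."""
    params = dict(inference_model.named_parameters())
    for src_name, tensor in actor_weights.items():
        dst_name = _NAME_MAPPING.get(src_name, src_name)
        if dst_name not in params:
            continue  # e.g., tied or backend-internal parameters
        # Shapes must already match here; sharded tensors would need a
        # gather or narrow first, depending on the parallelism layout.
        params[dst_name].data.copy_(tensor.to(params[dst_name].dtype))
```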
Could you describe the primary changes that need to be made in `verl/third_party/vllm/`, assuming that most of the code in that directory comes from vLLM? If we could somehow simplify the dependency on vLLM, it would be much easier to upgrade to newer versions of vLLM.