I'm currently using this model for inference, and I would like to know how to generate inference results in batch mode. Specifically, I'm trying to avoid processing inputs one by one and instead process multiple inputs in a single forward pass for efficiency.
Could you please provide guidance or examples on how to:
Structure inputs for batch processing.
Modify the inference pipeline to handle batches.
Optimize batch size for performance without running into memory issues.
Any advice, sample code, or references to the documentation would be greatly appreciated.
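For context, here is the kind of batched call I have in mind, sketched with a generic Hugging Face causal LM (the checkpoint name is a placeholder, and I realize this repo's actual pipeline may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint, not this repo's model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Causal LMs often lack a pad token; reuse EOS and pad on the left so
# generation continues from the true end of each prompt.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = [
    "The capital of France is",
    "Speculative decoding speeds up inference by",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```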
Thanks for your help!
Different sentences within a batch may have different acceptance lengths in speculative sampling [1,2], which requires careful padding, scheduling, or a custom kernel implementation. Our implementation does not support batch processing yet.
[1] Xiaoxuan Liu et al. "Optimizing Speculative Decoding for Serving Large Language Models Using Goodput." 2024.
[2] Haifeng Qian et al. "BASS: Batched Attention-optimized Speculative Sampling." 2024.
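To make the padding problem concrete: after each speculative step, every sequence in the batch may have grown by a different number of accepted tokens, so the batch must be re-padded (or re-scheduled) before the next forward pass. A minimal sketch of that bookkeeping, with hypothetical `draft_fn`/`verify_fn` stand-ins rather than this repo's actual drafter and verifier:

```python
import torch
import torch.nn.functional as F

def speculative_batch_step(batch_ids, draft_fn, verify_fn, pad_id):
    """One batched speculative-sampling step (conceptual sketch only).

    draft_fn(batch_ids)  -> (B, k) tensor of drafted candidate tokens
    verify_fn(ids, cand) -> (B,) count of accepted tokens per sequence
    Both are hypothetical placeholders for a real drafter/verifier.
    For clarity, batch_ids rows are assumed unpadded here; a real
    implementation must also track and strip padding per row.
    """
    drafts = draft_fn(batch_ids)             # (B, k)
    accepted = verify_fn(batch_ids, drafts)  # (B,), values in 0..k

    # Each sequence accepts a different number of drafted tokens, so the
    # sequences now have unequal lengths; re-pad to the new maximum to
    # keep them in one rectangular batch for the next step.
    rows = [
        torch.cat([ids, cand[: int(n)]])
        for ids, cand, n in zip(batch_ids, drafts, accepted)
    ]
    max_len = max(r.size(0) for r in rows)
    return torch.stack(
        [F.pad(r, (0, max_len - r.size(0)), value=pad_id) for r in rows]
    )
```

This per-step re-padding wastes compute on the shorter sequences, which is the overhead the cited papers address: roughly, [1] from the scheduling/goodput side and [2] via batched attention kernels.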