
How to Perform Inference with Batch Processing. #5

Open
Chlience opened this issue Oct 10, 2024 · 1 comment

@Chlience
Contributor

I'm currently using this model for inference, and I would like to know how to generate inference results in batch mode. Specifically, I'm trying to avoid processing inputs one by one and instead process multiple inputs in a single forward pass for efficiency.

Could you please provide guidance or examples on how to:

  1. Structure inputs for batch processing.
  2. Modify the inference pipeline to handle batches.
  3. Optimize batch size for performance without running into memory issues.

Any advice, sample code, or references to the documentation would be greatly appreciated.

Thanks for your help!

@Achazwl
Collaborator

Achazwl commented Oct 14, 2024

Different sentences within a batch may have different acceptance lengths in speculative sampling [1,2], which requires careful padding, scheduling, or a custom kernel implementation. Our implementation does not support batch processing yet.

[1] Xiaoxuan Liu, et al. Optimizing Speculative Decoding for Serving Large Language Models Using Goodput. 2024.
[2] Haifeng Qian, et al. BASS: Batched Attention-optimized Speculative Sampling. 2024.
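
To illustrate the raggedness problem described above, here is a minimal, hypothetical PyTorch sketch (not code from this repository; all names and the fake draft/verify step are assumptions): after one speculative step, each sequence in the batch accepts a different number of draft tokens, so the padded batch tensor and attention mask have to be rebuilt every step.

```python
# Hypothetical sketch of why per-sequence acceptance lengths make
# batched speculative decoding awkward. The draft/verify logic is faked.
import torch

pad_id = 0      # assumed padding token id
draft_len = 4   # number of draft tokens proposed per step

# A batch of 3 sequences of unequal length, right-padded into one tensor.
seqs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
lens = torch.tensor([len(s) for s in seqs])
max_len = int(lens.max())
batch = torch.full((len(seqs), max_len), pad_id)
for i, s in enumerate(seqs):
    batch[i, : len(s)] = torch.tensor(s)

# One speculative step: a draft model proposes `draft_len` tokens per
# sequence, but verification accepts a *different* prefix length for each
# sequence (here faked as 1, 4, and 2 accepted tokens).
draft = torch.randint(100, 200, (len(seqs), draft_len))
accepted = torch.tensor([1, 4, 2])

# Because acceptance lengths differ, each row grows by a different amount,
# so the whole batch must be re-padded and the attention mask recomputed
# after every step (or handled by a custom ragged-batch kernel).
new_lens = lens + accepted
new_max = int(new_lens.max())
new_batch = torch.full((len(seqs), new_max), pad_id)
for i in range(len(seqs)):
    n, m = int(lens[i]), int(new_lens[i])
    new_batch[i, :n] = batch[i, :n]
    new_batch[i, n:m] = draft[i, : int(accepted[i])]

attention_mask = (torch.arange(new_max)[None, :] < new_lens[:, None]).long()
print(new_batch)
print(attention_mask)
```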
