Feature Request: Support for Streamed Responses in LLM API Calls #18

Open
yhbcode000 opened this issue Jul 18, 2024 · 0 comments

Description:

We would like to request support for streamed responses in the Large Language Model (LLM) API. Currently, the API returns a response only after the entire output has been generated. Streaming the response would allow for more efficient and user-friendly interactions, especially for long generations.

Use Cases:

  1. Improved User Experience:

    • Users can see responses in real-time, enhancing the interactivity of applications such as chatbots, real-time data processing, and virtual assistants.
    • Early partial responses can improve the perceived speed and responsiveness of the application.
  2. Efficiency in Long-Form Content Generation:

    • For applications generating long-form content, such as articles, essays, or reports, streaming can provide immediate feedback and allow users to start reading or editing as the content is being generated.
  3. Resource Management:

    • Streaming enables incremental data transfer and processing, which can reduce server and network load compared to buffering the entire response before sending it.

Proposed Implementation:

  1. API Endpoint:

    • Introduce a new endpoint or modify the existing one to support streaming. The endpoint should start returning data as soon as the model begins generating the response.
  2. Response Format:

    • The response should be sent in chunks, with each chunk representing a portion of the generated text. This can be achieved using server-sent events (SSE), WebSockets, or HTTP/2 streams; a minimal SSE-based server sketch follows this list.
    • Ensure each chunk contains metadata, such as a sequence number or completion status, to help the client assemble the final response correctly.
  3. Client-Side Handling:

    • Provide guidelines and examples for client-side implementation to handle streamed responses, ensuring compatibility with various programming languages and frameworks (see the client sketch after this list).
  4. Error Handling:

    • Implement robust error-handling mechanisms for interruptions in the stream, ensuring clients can retry or resume from where the stream was interrupted (the client sketch below includes a simple retry loop).
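
Example Sketch (Server Side):

To make the proposal concrete, here is a minimal sketch of what items 1 and 2 could look like, assuming a FastAPI backend and SSE. The endpoint path, request model, chunk fields (sequence, text, done), and the generate_tokens placeholder are our assumptions for illustration, not the project's current API.

```python
# Hypothetical server-side sketch: a FastAPI endpoint that streams chunks
# over Server-Sent Events (SSE). Endpoint path, request model, and chunk
# fields are illustrative assumptions, not the project's actual API.
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


class GenerateRequest(BaseModel):
    prompt: str


async def generate_tokens(prompt: str):
    """Placeholder for the real LLM call; yields text fragments as they arrive."""
    for word in ("streamed", "responses", "look", "like", "this"):
        yield word + " "


@app.post("/v1/generate/stream")
async def generate_stream(request: GenerateRequest):
    async def event_source():
        sequence = 0
        async for fragment in generate_tokens(request.prompt):
            # Each SSE event carries a JSON chunk with ordering metadata
            # so the client can reassemble the final response.
            chunk = {"sequence": sequence, "text": fragment, "done": False}
            yield f"data: {json.dumps(chunk)}\n\n"
            sequence += 1
        # Final event signals completion so the client knows the stream ended cleanly.
        yield f"data: {json.dumps({'sequence': sequence, 'text': '', 'done': True})}\n\n"

    return StreamingResponse(event_source(), media_type="text/event-stream")
```

SSE keeps the transport on plain HTTP, which is usually simpler to proxy and load-balance than WebSockets; WebSockets would mainly be worth the extra complexity if the client also needs to stream data to the server mid-generation.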
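
Example Sketch (Client Side):

A matching client-side sketch for items 3 and 4, using the requests library to consume the SSE stream, reassemble chunks by sequence number, and retry on interruption. The URL and chunk fields mirror the server sketch above and are likewise assumptions.

```python
# Hypothetical client-side sketch: consume the SSE stream with `requests`,
# reassemble chunks by sequence number, and retry if the stream is
# interrupted. URL and chunk fields match the server sketch above.
import json

import requests

STREAM_URL = "http://localhost:8000/v1/generate/stream"


def stream_completion(prompt: str, max_retries: int = 2) -> str:
    chunks: dict[int, str] = {}
    for attempt in range(max_retries + 1):
        try:
            with requests.post(STREAM_URL, json={"prompt": prompt}, stream=True, timeout=60) as resp:
                resp.raise_for_status()
                for line in resp.iter_lines(decode_unicode=True):
                    if not line or not line.startswith("data: "):
                        continue  # skip SSE keep-alives and blank separators
                    chunk = json.loads(line[len("data: "):])
                    if chunk["done"]:
                        # Assemble the final text in sequence order.
                        return "".join(text for _, text in sorted(chunks.items()))
                    chunks[chunk["sequence"]] = chunk["text"]
        except requests.RequestException:
            if attempt == max_retries:
                raise
            # A real client could send the last received sequence number so the
            # server resumes instead of restarting; here we simply retry.
            continue
    return "".join(text for _, text in sorted(chunks.items()))


if __name__ == "__main__":
    print(stream_completion("Explain streaming responses in one paragraph."))
```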

Benefits:

  • Enhanced user engagement and satisfaction due to faster and more responsive interactions.
  • Ability to handle large responses more effectively.
  • Potential to reduce server load and improve overall performance.

Priority: Medium/High (Adjust based on your internal prioritization criteria)

Attachments: (Include any relevant mockups, diagrams, or examples if applicable)

Additional Notes:

We believe that introducing streamed responses aligns with the overall goal of providing a more responsive and efficient API service. We are open to discussions on the best implementation approach and are willing to assist in testing the new feature.

Thank you for considering this feature request. We look forward to the potential enhancement of the LLM API.

@yhbcode000 added the documentation, enhancement, backend, and database labels on Jul 18, 2024
@yhbcode000 moved this to Todo in MultiverseNote on Jul 18, 2024