Wrapping completions and chat/completions endpoint #2
Conversation
Current state.
Chat/Completion non-stream

Expected:

{"id":"chatcmpl-12345","object":"chat.completion","created":1690902873,"model":"gpt","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The capital of Bavaria is Munich (München in German)."}}],"usage":{"completion_tokens":13,"prompt_tokens":16,"total_tokens":29}}

With current state:

{"id":"chatcmpl-1690902875","object":"chat.completion","created":1690902875,"model":"bigscience/bloom-560m","choices":[{"message":{"role":"assistant","content":"\"\n\n\"It is the capital of Bavaria,\" said the man, with a smile. \""},"finish_reason":"length","index":0}],"usage":{"total_tokens":20,"completion_tokens":20,"prompt_tokens":0}}

Completely working as is. Only remaining issues: there are no prompt_tokens if prefill is not on, and the fake id could be improved.

Chat/completion stream

Expected:

data: {"id":"chatcmpl-7ikx6GJPtSN7lcmdRHgj22hSGPCze","object":"chat.completion.chunk","created":1690901272,"model":"gpt","choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant"}}],"usage":null}

data: {"id":"chatcmpl-7ikx6GJPtSN7lcmdRHgj22hSGPCze","object":"chat.completion.chunk","created":1690901272,"model":"gpt","choices":[{"index":0,"finish_reason":null,"delta":{"content":")."}}],"usage":null}

data: {"id":"chatcmpl-7ikx6GJPtSN7lcmdRHgj22hSGPCze","object":"chat.completion.chunk","created":1690901272,"model":"gpt","choices":[{"index":0,"finish_reason":"stop","delta":{}}],"usage":null}

data: [DONE]

Current state:

data:{"id":"id-1690901237","object":"text_completion","created":1690901237,"model":"TGImodel","choices":[{"delta":{"role":"assistant","content":" smile"},"finish_reason":null,"index":0}]}

data:{"id":"id-1690901237","object":"text_completion","created":1690901237,"model":"TGImodel","choices":[{"delta":{"role":"assistant","content":"."},"finish_reason":null,"index":0}]}

data:{"id":"id-1690901237","object":"text_completion","created":1690901237,"model":"TGImodel","choices":[{"delta":{"role":"assistant","content":" \""},"finish_reason":"length","index":0}]}

Completion

Expected:

{"id":"cmpl-7ilSRaYo3","object":"text_completion","created":1690903215,"model":"gpt","choices":[{"text":" Sudhi !\n\nWhen u clik on the links it goes to the destination site right. there the","index":0,"finish_reason":"length","logprobs":null}],"usage":{"completion_tokens":20,"prompt_tokens":1,"total_tokens":21}}

With this PR:

{"id":"id-1690903295","object":"text_completion","created":1690903295,"model":"bigscience/bloom-560m","choices":[{"text":", I just want to say that I am just new to weblog and seriously savored your","finish_reason":"length","logprobs":null,"index":0}],"usage":{"total_tokens":20,"completion_tokens":20,"prompt_tokens":0}}

Completion stream
Expected
With this PR:
I think the chat API is critical to avoid prompt injection. I just want to understand this PR better: how can the server add special tokens while API callers cannot?
Changes are addressed now; stream and non-stream are both working.
router/src/server.rs
Outdated
let span = tracing::Span::current();
let start_time = Instant::now();
let created_time = std::time::SystemTime::now().duration_since(std::time::UNIX_EPOCH).unwrap().as_secs() as u32;
Would this be a year 2038 problem?
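For illustration only (not the PR's code): since `as_secs()` already returns a `u64`, one option is to keep that width and drop the cast, which sidesteps the overflow question; the downstream `created_time: u32` parameters would then need to widen to `u64` as well.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// as_secs() is already a u64; keeping it avoids the wrap-around entirely
// (a u32 of seconds since the epoch overflows in 2106, a signed 32-bit
// timestamp in 2038).
let created_time: u64 = SystemTime::now()
    .duration_since(UNIX_EPOCH)
    .expect("system time is after the Unix epoch")
    .as_secs();
```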
router/src/server.rs
Outdated
match stream_type {
    StreamType::ChatCompletionsStreamResponse => {
        let start_msg = chat_start_message(created_time);
        yield Ok(Event::from(Event::default().json_data(start_msg).unwrap()))
I am not feeling good about the `unwrap` call here, although I understand that the original Hugging Face TGI already includes a ton of `unwrap`s, unfortunately.
I wonder if you could handle or convert the error instead of unwrapping it.
If `unwrap` is unavoidable, could you instead use `expect` to give a message? See https://www.thecodedmessage.com/posts/2022-07-14-programming-unwrap/
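For what it's worth, a sketch of both options against the snippet from this diff, assuming the surrounding stream generator yields `Result<Event, _>` and its error type can absorb the serialization error (illustrative only, not the PR's code):

```rust
// Option 1: convert the serialization error instead of unwrapping it.
match Event::default().json_data(start_msg) {
    Ok(event) => yield Ok(event),
    Err(err) => {
        tracing::error!("failed to serialize start message: {}", err);
        yield Err(err.into());
    }
}

// Option 2: if unwrap really is unavoidable, at least name the invariant.
// yield Ok(Event::default()
//     .json_data(start_msg)
//     .expect("start message is statically serializable to JSON"));
```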
router/src/completion.rs
Outdated
})
}

pub(crate) fn patch_streaming_event(st: StreamResponse, stream_type: StreamType, created_time: u32) -> Box<dyn Send + erased_serde::Serialize> {
I don't understand the purpose of `patch_streaming_event`. Could you add some comment about what it does?
If needed, it reformats the StreamResponse into another protocol. It takes a `StreamResponse` and may return one of the following as output:
- StreamResponse
- ChatCompletionsStreamResponse
- CompletionsResponse
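To make that concrete for other readers, the dispatch could look roughly like this; the `from_stream` constructors and the extra `StreamType` variants are illustrative names, not necessarily what the PR implements:

```rust
// Convert the internal StreamResponse into whichever wire format the
// calling endpoint expects, as late as possible in the pipeline.
pub(crate) fn patch_streaming_event(
    st: StreamResponse,
    stream_type: StreamType,
    created_time: u32,
) -> Box<dyn Send + erased_serde::Serialize> {
    match stream_type {
        // /generate_stream: pass the internal event through unchanged.
        StreamType::StreamResponse => Box::new(st),
        // /chat/completions: repackage the token as an OpenAI chat.completion.chunk.
        StreamType::ChatCompletionsStreamResponse => {
            Box::new(ChatCompletionsStreamResponse::from_stream(st, created_time))
        }
        // /completions: repackage the token as an OpenAI text_completion chunk.
        StreamType::CompletionsResponse => {
            Box::new(CompletionsResponse::from_stream(st, created_time))
        }
    }
}
```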
It's the only viable way I found to switch what the SSE returns.
The alternative would be to duplicate async fn generate_stream() per protocol, which would be highly redundant and more error-prone in my view.
I would argue that duplicating might be good for modularity if they are different. For example, if we change one protocol in the future, we would not affect other protocols.
That also makes sense to me.
Originally, I was looking for a way to include everything in completion.rs.
That would also make the code more modular; the changes would take some days.
Could you run
Also, I have a concern about the complexity of `_generate_stream_typed`. Basically, this PR is trying to implement 3 different endpoints in a single `_generate_stream_typed` function, making it complicated. What is worse: if OpenAI adds more features to their streaming API in the future and we want to adopt them too, we will have to add even more code to `_generate_stream_typed`.

I think a more sustainable way is to implement different streaming protocols in different functions, and reuse fine-grained code pieces as self-contained functions, instead of trying to reuse the whole function.
I would argue that the internal protobuf protocol is more likely to change than OpenAI's stable external streaming API. It is therefore important to wrap the internal generate_stream as much as possible, enable code reuse, and do the conversion at the latest possible point.
Sorry, I didn't propose to wrap it. My suggestion was to reuse fine-grained code pieces as self-contained functions, if there are any. If you did not find fine-grained code pieces worth reusing, it's OK to copy the implementation and leave the original untouched.
I will merge this PR if we can avoid patching the original implementation. To be honest, the consideration here is that Preemo (or other contributors) are expected to change it, so duplication is relatively tolerable. Even if we want to deduplicate similar code, I would extract the common code into a utility module, instead of patching the existing implementation.
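As an illustration of the "extract common code into a utility module" idea (all names below are hypothetical, not code from this PR or from TGI):

```rust
// completion_util.rs: small, self-contained helpers that each endpoint
// handler can call, so the original generate/generate_stream stay untouched.

/// Map the internal finish reason onto the string the OpenAI API expects.
pub(crate) fn openai_finish_reason(reason: &FinishReason) -> &'static str {
    match reason {
        FinishReason::Length => "length",
        _ => "stop",
    }
}

/// Build the placeholder id used in the OpenAI-style responses.
pub(crate) fn openai_response_id(prefix: &str, created_time: u64) -> String {
    format!("{prefix}-{created_time}")
}
```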
Sorry for messing up history. Docker image is at:
Hello, is the OpenAI API wrapper now available? Can you give me an example? Thank you.
@Atry Can this be tested on the CI? This involved two other libs, which have now respectively integrated the changes; now just this one is waiting!
I can merge it anyway, but I cannot test this PR in our internal CI because our internal implementation is already partially refactored for modularization, as we promised in the README. Once the internal version is ready to publish to this open-source repository, I expect that changes like this PR will need to migrate to the new architecture.
What does this PR do?
This PR implements a compatibility endpoint by wrapping `generate` and `generate_stream` in `server.rs`. It does so by minimally modifying existing files, packing the required changes into a `mod completion.rs` to keep it as modular as possible.

I guess there are a couple of applications that want to post exactly the following request, which is hard-coded in several ways in some Python applications (e.g. LangChain, AutoGPT, stuff that is already legacy).
For background: a lot of issues were also created about this, up until recently, e.g. huggingface/text-generation-inference#497 and huggingface/text-generation-inference#735.
Note: I only recently (last week) learned Rust, and it's quite a leap for a Python dev - any constructive feedback is welcome.
POST /completions
And get an answer in this format:
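Conceptually, the wrapping boils down to a thin handler that converts between the OpenAI shapes and the internal ones; a rough sketch of the idea, where the handler, struct, and field names are assumptions for illustration rather than the PR's actual code:

```rust
// Hypothetical /completions handler: convert the OpenAI-style request into
// the internal GenerateRequest, reuse the existing generation path, and
// repackage the result in the OpenAI text_completion shape.
async fn completions(
    infer: Extension<Infer>,
    Json(req): Json<CompletionRequest>, // model, prompt, max_tokens, stream, ...
) -> Result<Json<CompletionResponse>, (StatusCode, Json<ErrorResponse>)> {
    let generate_req = GenerateRequest {
        inputs: req.prompt,
        parameters: GenerateParameters {
            max_new_tokens: req.max_tokens,
            temperature: req.temperature,
            ..Default::default()
        },
    };

    // Delegate to the unchanged internal handler.
    let generated = generate(infer, Json(generate_req)).await?;

    // Convert the internal response into the OpenAI completions shape.
    Ok(Json(CompletionResponse::from_generated(generated.0, now_unix_secs())))
}
```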
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR.