
Wrapping completions and chat/completions endpoint #2

Merged: 3 commits into Preemo-Inc:main from completion_endpoint, Sep 27, 2023

Conversation

@michaelfeil (Contributor) commented on Jul 31, 2023

What does this PR do?

This PR implements a compatibility endpoint by wrapping generate and generate_stream in server.rs. It does so while minimally modifying the existing files, packing the required changes into a mod completion.rs to keep them as modular as possible.

I expect there are a couple of applications that want to post exactly the following request, which is hard-coded in several ways in some Python applications (e.g. LangChain, AutoGPT, and code that is already legacy).
For background: a lot of related issues have been created up until recently, e.g. huggingface/text-generation-inference#497 and huggingface/text-generation-inference#735.

Note: I only recently (last week) learned Rust, and it is quite a leap for a Python dev, so any constructive feedback is welcome.

POST /completions

{
  "max_tokens": 20,
  "messages": [
    {
      "content": "What is the capital of Bavaria?",
      "role": "assistant"
    }
  ]  
}

And get an answer in this format:

{"id":"chatcmpl-12345","object":"chat.completion","created":1690902873,"model":"gpt","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The capital of Bavaria is Munich (München in German)."}}],"usage":{"completion_tokens":13,"prompt_tokens":16,"total_tokens":29}}

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR.

@michaelfeil marked this pull request as draft on July 31, 2023, 08:05

@michaelfeil (Contributor, Author) commented on Aug 1, 2023

Current state.

  • chat/completion:
    • non-stream working
      - [x] compliant
      - [ ] tested
    • stream working
      - [ ] tested
  • completion:
    • non-stream working
      - [x] compliant
      - [ ] tested
    • stream working
      - [ ] tested

Chat/Completion non-stream

Expected:

{"id":"chatcmpl-12345","object":"chat.completion","created":1690902873,"model":"gpt","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The capital of Bavaria is Munich (München in German)."}}],"usage":{"completion_tokens":13,"prompt_tokens":16,"total_tokens":29}}

With current state:

{"id":"chatcmpl-1690902875","object":"chat.completion","created":1690902875,"model":"bigscience/bloom-560m","choices":[{"message":{"role":"assistant","content":"\"\n\n\"It is the capital of Bavaria,\" said the man, with a smile. \""},"finish_reason":"length","index":0}],"usage":{"total_tokens":20,"completion_tokens":20,"prompt_tokens":0}}

Completely working as is. Remaining: no prompt_tokens are reported if prefill is not on, and the fake id could be improved.

Chat/completion stream

  • model name correct
  • fixed id and timestamp -> both are fixed and depend on the streaming start time
  • special event: start with delta message {"role":"assistant","content":""} implemented
  • special event: done token implemented
    Stream behaves as expected for the protocol.
data: {"id":"chatcmpl-7ikx6GJPtSN7lcmdRHgj22hSGPCze","object":"chat.completion.chunk","created":1690901272,"model":"gpt","choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant"}}],"usage":null}

data: {"id":"chatcmpl-7ikx6GJPtSN7lcmdRHgj22hSGPCze","object":"chat.completion.chunk","created":1690901272,"model":"gpt","choices":[{"index":0,"finish_reason":null,"delta":{"content":")."}}],"usage":null}

data: {"id":"chatcmpl-7ikx6GJPtSN7lcmdRHgj22hSGPCze","object":"chat.completion.chunk","created":1690901272,"model":"gpt","choices":[{"index":0,"finish_reason":"stop","delta":{}}],"usage":null}

data: [DONE]

Current state:

data:{"id":"id-1690901237","object":"text_completion","created":1690901237,"model":"TGImodel","choices":[{"delta":{"role":"assistant","content":" smile"},"finish_reason":null,"index":0}]}

data:{"id":"id-1690901237","object":"text_completion","created":1690901237,"model":"TGImodel","choices":[{"delta":{"role":"assistant","content":"."},"finish_reason":null,"index":0}]}

data:{"id":"id-1690901237","object":"text_completion","created":1690901237,"model":"TGImodel","choices":[{"delta":{"role":"assistant","content":" \""},"finish_reason":"length","index":0}]}

Completion

Expected

{"id":"cmpl-7ilSRaYo3","object":"text_completion","created":1690903215,"model":"gpt","choices":[{"text":" Sudhi !\n\nWhen u clik on the links it goes to the destination site right. there the","index":0,"finish_reason":"length","logprobs":null}],"usage":{"completion_tokens":20,"prompt_tokens":1,"total_tokens":21}}

With this PR:

{"id":"id-1690903295","object":"text_completion","created":1690903295,"model":"bigscience/bloom-560m","choices":[{"text":", I just want to say that I am just new to weblog and seriously savored your","finish_reason":"length","logprobs":null,"index":0}],"usage":{"total_tokens":20,"completion_tokens":20,"prompt_tokens":0}}

Completion stream

  • special event: Done Token Implemented

Expected

data: {"id":"cmpl-7ilVsJ","object":"text_completion","created":1690903428,"model":"gpt","choices":[{"text":"\"))","index":0,"finish_reason":null,"logprobs":null}],"usage":null}

...

data: {"id":"cmpl-7ilVsJ","object":"text_completion","created":1690903428,"model":"gpt","choices":[{"text":"                           ","index":0,"finish_reason":null,"logprobs":null}],"usage":null}"

data: {"id":"cmpl-7ilVsJ","object":"text_completion","created":1690903428,"model":"gpt","choices":[{"text":" +\"","index":0,"finish_reason":"length","logprobs":null}],"usage":null}

data: [DONE]

With this PR:

data:{"id":"id-1690903496","object":"text_completion","created":1690903496,"model":"model_stream","choices":[{"text":",","logprobs":null,"index":0}]}
...

data:{"id":"id-1690903496","object":"text_completion","created":1690903496,"model":"model_stream","choices":[{"text":"avor","logprobs":null,"index":0}]}

data:{"id":"id-1690903496","object":"text_completion","created":1690903496,"model":"model_stream","choices":[{"text":"ed","logprobs":null,"index":0}]}

data:{"id":"id-1690903496","object":"text_completion","created":1690903496,"model":"model_stream","choices":[{"text":" your","finish_reason":"length","logprobs":null,"index":0}]}

@michaelfeil changed the title from "Wrapping v1/completions endpoint" to "Wrapping completions and chat/completions endpoint" on Aug 1, 2023
@michaelfeil marked this pull request as ready for review on August 1, 2023, 15:36
router/src/completion.rs: review comment (outdated, resolved)
@Atry (Contributor) commented on Aug 1, 2023

I think the chat API is critical for avoiding prompt injection. I just want to understand this PR better: how does it let the server add special tokens while API callers cannot?

router/src/lib.rs: review comment (outdated, resolved)
router/Cargo.toml: review comment (outdated, resolved)
@michaelfeil (Contributor, Author) commented:

Changes are addressed now; stream and non-stream are both working.

let span = tracing::Span::current();
let start_time = Instant::now();
let created_time = std::time::SystemTime::now().duration_since(std::time::UNIX_EPOCH).unwrap().as_secs() as u32;
Contributor review comment:

Would this be a year 2038 problem?
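
As an aside, here is a minimal sketch of what a 64-bit timestamp would look like at this point; this is an illustration only, not the PR's code as written:

use std::time::{SystemTime, UNIX_EPOCH};

fn main() {
    // Keeping the full u64 avoids truncating the epoch seconds to 32 bits.
    let created_time: u64 = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before the Unix epoch")
        .as_secs();
    println!("created: {created_time}");
}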

match stream_type {
    StreamType::ChatCompletionsStreamResponse => {
        let start_msg = chat_start_message(created_time);
        yield Ok(Event::from(Event::default().json_data(start_msg).unwrap()))
@Atry (Contributor) commented on Aug 4, 2023:

I am not feeling good about the unwrap call here, although I understand the original Hugging Face TGI already includes a ton of unwraps, unfortunately.

I wonder if you could handle or convert the error instead of unwrapping it.

If unwrap is unavoidable, could you instead use expect to give a message? See https://www.thecodedmessage.com/posts/2022-07-14-programming-unwrap/
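
A minimal sketch of the non-panicking option, using the axum SSE Event type the diff already calls; the helper name here is hypothetical and not part of the PR:

use axum::response::sse::Event;
use serde::Serialize;

// Hypothetical helper: build the SSE event without a bare unwrap, logging and
// skipping the chunk if serialization ever fails instead of panicking.
fn to_sse_event<T: Serialize>(payload: &T) -> Option<Event> {
    match Event::default().json_data(payload) {
        Ok(event) => Some(event),
        Err(err) => {
            tracing::error!("failed to serialize SSE payload: {err}");
            None
        }
    }
}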

})
}

pub(crate) fn patch_streaming_event (st: StreamResponse, stream_type: StreamType, created_time: u32) -> Box<dyn Send + erased_serde::Serialize> {
Contributor review comment:

I don't understand the purpose of patch_streaming_event. Could you add some comment about what it does?

@michaelfeil (Author) replied:

If needed, it reformats the StreamResponse into another protocol. It takes a StreamResponse and may return one of the following as output (see the sketch after this list):

  • StreamResponse
  • ChatCompletionsStreamResponse
  • CompletionsResponse
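
As a sketch of that dispatch (an alternative shape, not the PR's actual code), the same "one of three serializable responses" idea could also be expressed as an untagged enum instead of a boxed erased_serde trait object, assuming the three response structs from completion.rs are in scope:

use serde::Serialize;

// Sketch only: each variant serializes to exactly the shape of its inner
// struct, so callers return one value regardless of the requested protocol.
#[derive(Serialize)]
#[serde(untagged)]
enum PatchedStreamEvent {
    Plain(StreamResponse),
    Chat(ChatCompletionsStreamResponse),
    Completions(CompletionsResponse),
}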

@michaelfeil (Author) added:

It's the only other viable way to switch what the SSE returns. Any other approach would be to duplicate async fn generate_stream() per protocol, which would be highly redundant and more error prone in my view.

@Atry (Contributor) replied:

I would argue that duplicating might be good for modularity if the protocols are different. For example, if we change one protocol in the future, we would not affect the other protocols.

@michaelfeil (Author) replied:

That also makes sense to me. Originally, I was looking for a way to include everything in completion.rs.

That would also make the code more modular; the changes would take some days.

@Atry (Contributor) commented on Aug 4, 2023

Could you run rustfmt?

@Atry (Contributor) left a review:

Also, I have a concern about the complexity of _generate_stream_typed. Basically, this PR is trying to implement 3 different endpoints in a single _generate_stream_typed function, making it complicated. What is worse, if OpenAI adds more features to their streaming API in the future and we want to adopt the same features, we will have to add more code to _generate_stream_typed.

I think a more sustainable way is to implement different streaming protocols in different functions, and reuse fine-grained code pieces as self-contained functions, instead of trying to reuse the whole function.

@michaelfeil (Author) replied, quoting the review above:

I would argue that the internal protobuf protocol is more likely to change than OpenAI's stable external streaming API. It is therefore important to wrap the internal generate_stream as much as possible, enable code reuse, and do the conversion at the latest possible point.
It is not possible to wrap the fn generate_stream function as a whole.

@Atry (Contributor) commented on Aug 4, 2023

It is not possible to wrap the fn generate_stream function as a whole.

Sorry, I didn't propose to wrap generate_stream. Instead, I proposed the reverse approach: don't reuse generate_stream as a whole.

My suggestion was to reuse fine-grained code pieces as self-contained functions, if there are any.

If you did not find fine-grained code pieces worth reusing, it's OK to copy the implementation and leave the original generate_stream unchanged.

@Atry (Contributor) commented on Aug 4, 2023

I will merge this PR if we can avoid patching the original generate_stream.

To be honest, the consideration here is that Preemo (or other contributors) are expected to change generate_stream in the future, and we don't want to add complexity to generate_stream or risk breaking the patch function.

Duplication is relatively tolerable. Even if we want to deduplicate similar code, I would extract the common code into a utility module instead of patching the existing implementation.
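
To illustrate what extracting common code into a utility module could look like, here is a sketch with hypothetical names (not code from this PR): each protocol keeps its own chunk builder, and only the shared envelope is factored out.

use serde_json::{json, Value};

// Hypothetical utility module: the shared piece is the common envelope
// (id, created, model); each protocol adds only its own fields.
pub fn base_chunk(id: &str, created: u64, model: &str, object: &str) -> Value {
    json!({ "id": id, "object": object, "created": created, "model": model })
}

pub fn chat_completion_chunk(id: &str, created: u64, model: &str, token: &str) -> Value {
    let mut chunk = base_chunk(id, created, model, "chat.completion.chunk");
    chunk["choices"] = json!([{ "index": 0, "finish_reason": null,
                                "delta": { "content": token } }]);
    chunk
}

pub fn text_completion_chunk(id: &str, created: u64, model: &str, token: &str) -> Value {
    let mut chunk = base_chunk(id, created, model, "text_completion");
    chunk["choices"] = json!([{ "index": 0, "finish_reason": null, "text": token }]);
    chunk
}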

@michaelfeil (Author) commented:

Sorry for messing up the history. The commits are not squashed; the changes are addressed.

Docker image is at:

docker pull docker.io/michaelf34/tgi:0.9.4-completion_endpoint

@babytdream commented:

Hello, is the OpenAI API wrapper now available? Can you give me an example? Thank you.

@michaelfeil (Author) commented:

@Atry Can this be tested on the CI?

I involved two other libraries, which have now respectively integrated the changes; now just this one is waiting!
tokio-rs/axum#2149
openai/openai-python#559

@Atry (Contributor) commented on Sep 27, 2023

I can merge it anyway, but I cannot test this PR in our internal CI, because our internal implementation has already been partially refactored for modularization, as we promised in the README.

Once the internal version is ready to publish to this open-source repository, I expect that changes like this PR will need to migrate to the new architecture.

@Atry merged commit 012c917 into Preemo-Inc:main on Sep 27, 2023
@michaelfeil deleted the completion_endpoint branch on October 3, 2023, 09:04