
Wrapping completions and chat/completions endpoint #2

Merged: 3 commits into Preemo-Inc:main from completion_endpoint, Sep 27, 2023

Conversation

@michaelfeil (Contributor) commented on Jul 31, 2023

What does this PR do?

This PR implements a compatibility endpoint by wrapping generate and generate_stream in server.rs. It does so while minimally modifying the existing files, packing the required changes into a mod completion.rs to keep them as modular as possible.

I expect there are a couple of applications that want to post exactly the following request, which is hard-coded in several ways in some Python applications (e.g. LangChain, AutoGPT, and code that is already legacy).
For background: a lot of related issues have been created up until recently, e.g. huggingface/text-generation-inference#497 and huggingface/text-generation-inference#735.

Note: I only recently (last week) learned Rust, and it is quite a leap for a Python dev, so any constructive feedback is welcome.

POST /completions

{
  "max_tokens": 20,
  "messages": [
    {
      "content": "What is the capital of Bavaria?",
      "role": "assistant"
    }
  ]  
}

And get an answer in this format:

{"id":"chatcmpl-12345","object":"chat.completion","created":1690902873,"model":"gpt","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The capital of Bavaria is Munich (München in German)."}}],"usage":{"completion_tokens":13,"prompt_tokens":16,"total_tokens":29}}

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR.

@michaelfeil marked this pull request as draft on July 31, 2023, 08:05

@michaelfeil (Contributor, Author) commented on Aug 1, 2023

Current state.

  • chat/completion:
    • non-stream working
      - [x] compliant
      - [ ] tested
    • stream working
      - [ ] tested
  • completion:
    • non-stream working
      - [x] compliant
      - [ ] tested
    • stream working
      - [ ] tested

Chat/Completion non-stream

Expected:

{"id":"chatcmpl-12345","object":"chat.completion","created":1690902873,"model":"gpt","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The capital of Bavaria is Munich (München in German)."}}],"usage":{"completion_tokens":13,"prompt_tokens":16,"total_tokens":29}}

With current state:

{"id":"chatcmpl-1690902875","object":"chat.completion","created":1690902875,"model":"bigscience/bloom-560m","choices":[{"message":{"role":"assistant","content":"\"\n\n\"It is the capital of Bavaria,\" said the man, with a smile. \""},"finish_reason":"length","index":0}],"usage":{"total_tokens":20,"completion_tokens":20,"prompt_tokens":0}}

Completely working as is. Remaining: no prompt_tokens are reported if prefill is not on, and the fake id could be improved.

Chat/completion stream

  • model name correct
  • fixed id and timestamp -> both are fixed and depend on the streaming start time
  • special event: start with delta message {"role":"assistant","content":""} implemented
  • special event: done token implemented
    Stream behaves as expected for the protocol.
data: {"id":"chatcmpl-7ikx6GJPtSN7lcmdRHgj22hSGPCze","object":"chat.completion.chunk","created":1690901272,"model":"gpt","choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant"}}],"usage":null}

data: {"id":"chatcmpl-7ikx6GJPtSN7lcmdRHgj22hSGPCze","object":"chat.completion.chunk","created":1690901272,"model":"gpt","choices":[{"index":0,"finish_reason":null,"delta":{"content":")."}}],"usage":null}

data: {"id":"chatcmpl-7ikx6GJPtSN7lcmdRHgj22hSGPCze","object":"chat.completion.chunk","created":1690901272,"model":"gpt","choices":[{"index":0,"finish_reason":"stop","delta":{}}],"usage":null}

data: [DONE]

Current state:

data:{"id":"id-1690901237","object":"text_completion","created":1690901237,"model":"TGImodel","choices":[{"delta":{"role":"assistant","content":" smile"},"finish_reason":null,"index":0}]}

data:{"id":"id-1690901237","object":"text_completion","created":1690901237,"model":"TGImodel","choices":[{"delta":{"role":"assistant","content":"."},"finish_reason":null,"index":0}]}

data:{"id":"id-1690901237","object":"text_completion","created":1690901237,"model":"TGImodel","choices":[{"delta":{"role":"assistant","content":" \""},"finish_reason":"length","index":0}]}

Completion

Expected

{"id":"cmpl-7ilSRaYo3","object":"text_completion","created":1690903215,"model":"gpt","choices":[{"text":" Sudhi !\n\nWhen u clik on the links it goes to the destination site right. there the","index":0,"finish_reason":"length","logprobs":null}],"usage":{"completion_tokens":20,"prompt_tokens":1,"total_tokens":21}}

With this PR:

{"id":"id-1690903295","object":"text_completion","created":1690903295,"model":"bigscience/bloom-560m","choices":[{"text":", I just want to say that I am just new to weblog and seriously savored your","finish_reason":"length","logprobs":null,"index":0}],"usage":{"total_tokens":20,"completion_tokens":20,"prompt_tokens":0}}

Completion stream

  • special event: Done Token Implemented

Expected

data: {"id":"cmpl-7ilVsJ","object":"text_completion","created":1690903428,"model":"gpt","choices":[{"text":"\"))","index":0,"finish_reason":null,"logprobs":null}],"usage":null}

...

data: {"id":"cmpl-7ilVsJ","object":"text_completion","created":1690903428,"model":"gpt","choices":[{"text":"                           ","index":0,"finish_reason":null,"logprobs":null}],"usage":null}"

data: {"id":"cmpl-7ilVsJ","object":"text_completion","created":1690903428,"model":"gpt","choices":[{"text":" +\"","index":0,"finish_reason":"length","logprobs":null}],"usage":null}

data: [DONE]

With this PR:

data:{"id":"id-1690903496","object":"text_completion","created":1690903496,"model":"model_stream","choices":[{"text":",","logprobs":null,"index":0}]}
...

data:{"id":"id-1690903496","object":"text_completion","created":1690903496,"model":"model_stream","choices":[{"text":"avor","logprobs":null,"index":0}]}

data:{"id":"id-1690903496","object":"text_completion","created":1690903496,"model":"model_stream","choices":[{"text":"ed","logprobs":null,"index":0}]}

data:{"id":"id-1690903496","object":"text_completion","created":1690903496,"model":"model_stream","choices":[{"text":" your","finish_reason":"length","logprobs":null,"index":0}]}

@michaelfeil changed the title from "Wrapping v1/completions endpoint" to "Wrapping completions and chat/completions endpoint" on Aug 1, 2023
@michaelfeil marked this pull request as ready for review on August 1, 2023, 15:36
router/src/completion.rs: review comment (outdated, resolved)
@Atry (Contributor) commented on Aug 1, 2023

I think the chat API is critical for avoiding prompt injection. I just want to understand this PR better: how does it let the server add special tokens while API callers cannot?

router/src/lib.rs: review comment (outdated, resolved)
router/Cargo.toml: review comment (outdated, resolved)
@michaelfeil (Contributor, Author) commented:

Changes are addressed now; stream and non-stream are both working.

let span = tracing::Span::current();
let start_time = Instant::now();
let created_time = std::time::SystemTime::now().duration_since(std::time::UNIX_EPOCH).unwrap().as_secs() as u32;
Contributor review comment:

Would this be a year 2038 problem?
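
As an aside, here is a minimal sketch of what a 64-bit timestamp would look like at this point; this is an illustration only, not the PR's code as written:

use std::time::{SystemTime, UNIX_EPOCH};

fn main() {
    // Keeping the full u64 avoids truncating the epoch seconds to 32 bits.
    let created_time: u64 = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before the Unix epoch")
        .as_secs();
    println!("created: {created_time}");
}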

match stream_type {
    StreamType::ChatCompletionsStreamResponse => {
        let start_msg = chat_start_message(created_time);
        yield Ok(Event::from(Event::default().json_data(start_msg).unwrap()))
@Atry (Contributor) commented on Aug 4, 2023:

I am not feeling good about the unwrap call here, although I understand the original Hugging Face TGI already includes a ton of unwraps, unfortunately.

I wonder if you could handle or convert the error instead of unwrapping it.

If unwrap is unavoidable, could you instead use expect to give a message? See https://www.thecodedmessage.com/posts/2022-07-14-programming-unwrap/
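
A minimal sketch of the non-panicking option, using the axum SSE Event type the diff already calls; the helper name here is hypothetical and not part of the PR:

use axum::response::sse::Event;
use serde::Serialize;

// Hypothetical helper: build the SSE event without a bare unwrap, logging and
// skipping the chunk if serialization ever fails instead of panicking.
fn to_sse_event<T: Serialize>(payload: &T) -> Option<Event> {
    match Event::default().json_data(payload) {
        Ok(event) => Some(event),
        Err(err) => {
            tracing::error!("failed to serialize SSE payload: {err}");
            None
        }
    }
}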

})
}

pub(crate) fn patch_streaming_event (st: StreamResponse, stream_type: StreamType, created_time: u32) -> Box<dyn Send + erased_serde::Serialize> {
Contributor review comment:

I don't understand the purpose of patch_streaming_event. Could you add some comment about what it does?

@michaelfeil (Author) replied:

If needed, it reformats the StreamResponse into another protocol. It takes a StreamResponse and may return one of the following as output (see the sketch after this list):

  • StreamResponse
  • ChatCompletionsStreamResponse
  • CompletionsResponse
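
As a sketch of that dispatch (an alternative shape, not the PR's actual code), the same "one of three serializable responses" idea could also be expressed as an untagged enum instead of a boxed erased_serde trait object, assuming the three response structs from completion.rs are in scope:

use serde::Serialize;

// Sketch only: each variant serializes to exactly the shape of its inner
// struct, so callers return one value regardless of the requested protocol.
#[derive(Serialize)]
#[serde(untagged)]
enum PatchedStreamEvent {
    Plain(StreamResponse),
    Chat(ChatCompletionsStreamResponse),
    Completions(CompletionsResponse),
}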

@michaelfeil (Author) added:

It's the only other viable way to switch what the SSE returns. Any other approach would be to duplicate async fn generate_stream() per protocol, which would be highly redundant and more error prone in my view.

@Atry (Contributor) replied:

I would argue that duplicating might be good for modularity if the protocols are different. For example, if we change one protocol in the future, we would not affect the other protocols.

@michaelfeil (Author) replied:

That also makes sense to me. Originally, I was looking for a way to include everything in completion.rs.

That would also make the code more modular; the changes would take some days.

@Atry (Contributor) commented on Aug 4, 2023

Could you run rustfmt?

@Atry (Contributor) left a review:

Also, I have a concern about the complexity of _generate_stream_typed. Basically, this PR is trying to implement 3 different endpoints in a single _generate_stream_typed function, making it complicated. What is worse, if OpenAI adds more features to their streaming API in the future and we want to adopt the same features, we will have to add more code to _generate_stream_typed.

I think a more sustainable way is to implement different streaming protocols in different functions, and reuse fine-grained code pieces as self-contained functions, instead of trying to reuse the whole function.

@michaelfeil (Author) replied, quoting the review above:

I would argue that the internal protobuf protocol is more likely to change than OpenAI's stable external streaming API. It is therefore important to wrap the internal generate_stream as much as possible, enable code reuse, and do the conversion at the latest possible point.
It is not possible to wrap the fn generate_stream function as a whole.

@Atry (Contributor) commented on Aug 4, 2023

It is not possible to wrap the fn generate_stream function as a whole.

Sorry, I didn't propose to wrap generate_stream. Instead, I proposed the reverse approach: don't reuse generate_stream as a whole.

My suggestion was to reuse fine-grained code pieces as self-contained functions, if there are any.

If you did not find fine-grained code pieces worth reusing, it's OK to copy the implementation and leave the original generate_stream unchanged.

@Atry (Contributor) commented on Aug 4, 2023

I will merge this PR if we can avoid patching the original generate_stream.

To be honest, the consideration here is that Preemo (or other contributors) are expected to change generate_stream in the future, and we don't want to add complexity to generate_stream or risk breaking the patch function.

Duplication is relatively tolerable. Even if we want to deduplicate similar code, I would extract the common code into a utility module instead of patching the existing implementation.
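
To illustrate what extracting common code into a utility module could look like, here is a sketch with hypothetical names (not code from this PR): each protocol keeps its own chunk builder, and only the shared envelope is factored out.

use serde_json::{json, Value};

// Hypothetical utility module: the shared piece is the common envelope
// (id, created, model); each protocol adds only its own fields.
pub fn base_chunk(id: &str, created: u64, model: &str, object: &str) -> Value {
    json!({ "id": id, "object": object, "created": created, "model": model })
}

pub fn chat_completion_chunk(id: &str, created: u64, model: &str, token: &str) -> Value {
    let mut chunk = base_chunk(id, created, model, "chat.completion.chunk");
    chunk["choices"] = json!([{ "index": 0, "finish_reason": null,
                                "delta": { "content": token } }]);
    chunk
}

pub fn text_completion_chunk(id: &str, created: u64, model: &str, token: &str) -> Value {
    let mut chunk = base_chunk(id, created, model, "text_completion");
    chunk["choices"] = json!([{ "index": 0, "finish_reason": null, "text": token }]);
    chunk
}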

@michaelfeil (Author) commented:

Sorry for messing up the history. The commits are not squashed; the changes are addressed.

Docker image is at:

docker pull docker.io/michaelf34/tgi:0.9.4-completion_endpoint

@babytdream commented:

Hello, is the OpenAI API wrapper now available? Can you give me an example? Thank you.

@michaelfeil (Author) commented:

@Atry Can this be tested on the CI?

I involved two other libraries, which have now respectively integrated the changes; now just this one is waiting!
tokio-rs/axum#2149
openai/openai-python#559

@Atry (Contributor) commented on Sep 27, 2023

I can merge it anyway, but I cannot test this PR in our internal CI, because our internal implementation has already been partially refactored for modularization, as we promised in the README.

Once the internal version is ready to publish to this open-source repository, I expect that changes like this PR will need to migrate to the new architecture.

@Atry merged commit 012c917 into Preemo-Inc:main on Sep 27, 2023
@michaelfeil deleted the completion_endpoint branch on October 3, 2023, 09:04