# Documentation of OpenAI API format (#51)

**Open** · wants to merge 1 commit into `main`
`README.md`: 43 additions & 8 deletions
You can try it out at **https://chat.petals.dev** or run the backend on your server:

```bash
git clone https://github.com/petals-infra/chat.petals.dev.git
cd chat.petals.dev
pip install -r requirements.txt
python3 openai_api.py --host=0.0.0.0 --port=5000
```

🦙 **Want to serve Llama 2?** Request access to its weights at the ♾️ [Meta AI website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and 🤗 [Model Hub](https://huggingface.co/meta-llama/Llama-2-70b-hf), then run `huggingface-cli login` in the terminal before starting the web app. If you don't want Llama 2, just remove the `meta-llama` models from [config.py](https://github.com/petals-infra/chat.petals.dev/blob/main/config.py).

🦄 **Deploying with Gunicorn.** In production, we recommend using gunicorn instead of the Flask dev server:

```bash
gunicorn openai_api:app --bind 0.0.0.0:5000 --worker-class gthread --threads 4 --timeout 1000
```

The chat uses the WebSocket API under the hood.

## APIs

The backend provides four API endpoints:

- [OpenAI API Format Completion](#openai-format-api) (`/v1/completions`, recommended)
- [OpenAI API Format Chat](#openai-format-api) (`/v1/chat/completions`, recommended)
- [WebSocket API](#websocket-api-apiv2generate) (`/api/v2/generate`)
- [HTTP API](#http-api-apiv1) (`/api/v1/...`)

Please use the WebSocket API when possible: it is much faster, more powerful, and consumes fewer resources.

If you develop your own web app, you can use our endpoint at `https://chat.petals.dev/api/...` for research and development, then set up your own backend for production using the commands above.

> **Note:** We do not recommend using the endpoint at `https://chat.petals.dev/api/...` in production. It has a limited throughput, and we may pause or stop it any time.

## OpenAI Format API

### Overview

Petals Chat provides an API compatible with the OpenAI format, giving users familiar with OpenAI's API structure a drop-in way to interact with the backend. This format is now the primary option; the other formats are considered deprecated.

### API Endpoints

Two key API endpoints are provided in this format:

1. **POST /v1/completions**: For general text completions.
2. **POST /v1/chat/completions**: Specifically designed for chat-like interactions.
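
For instance, a chat completion request can be sent with plain HTTP. Below is a minimal sketch, assuming the backend is running locally on port 5000 as started above; the model name is an example and should match one of the models in [config.py](https://github.com/petals-infra/chat.petals.dev/blob/main/config.py):

```python
# Minimal sketch: POST a chat completion request to a locally running backend.
# Assumes the server was started with `python3 openai_api.py --host=0.0.0.0 --port=5000`.
# The model name below is an example; use one of the models from config.py.
import requests

response = requests.post(
    "http://localhost:5000/v1/chat/completions",
    json={
        "model": "stabilityai/StableBeluga2",
        "messages": [{"role": "user", "content": "What is Petals?"}],
        "max_tokens": 128,
    },
    timeout=600,  # distributed inference can take a while on the first token
)
print(response.json())
```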

### Parameters

#### ChatCompletionRequest and CompletionRequest

Both endpoints use similar request structures, defined by the `CompletionRequest` and `ChatCompletionRequest` classes, respectively:

- **model (str)**: Specifies the model to use.
- **messages (Union[str, List[Dict[str, str]]])**: The input to the model; a list of role/content messages for chat, or a plain string.
- **temperature (Optional[float])**: Sampling temperature; higher values make responses more random.
- **top_p (Optional[float])**: Controls nucleus (top-p) sampling.
- **n (Optional[int])**: Number of completions to generate.
- **max_tokens (int)**: Maximum number of tokens to generate.
- **stop (Optional[Union[str, bool]])**: Sequence or flag indicating when to stop generation.
- **stream (Optional[bool])**: Whether to stream the response.
- **presence_penalty (Optional[float])**: Penalizes tokens that already appear in the text, increasing the likelihood of new topics.
- **logit_bias (Optional[Dict[str, float]])**: Applies biases to specific tokens.
- **user (Optional[str])**: User identifier.
- Additional parameters supported by Petals (see the sketch after this list):
  - **best_of (Optional[int])** (only for `/v1/completions`): Number of completions to generate and return.
  - **top_k (Optional[int])**: Controls top-k sampling.
  - **use_beam_search (Optional[bool])**: Whether to use beam search instead of sampling.
  - **skip_special_tokens (Optional[bool])**: Whether to skip special tokens in the output.
  - **spaces_between_special_tokens (Optional[bool])**: Whether to add spaces between special tokens in the output.
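
The Petals-specific parameters can be combined with the standard ones in a single request. Below is a sketch for `/v1/completions` under the same assumptions as the example above (local server, example model name); it passes the input as a plain string via `messages`, following the `Union[str, ...]` type in the parameter list:

```python
# Sketch of a /v1/completions request that uses the Petals-specific extras.
# Server address and model name are assumptions, as in the earlier example.
import requests

payload = {
    "model": "stabilityai/StableBeluga2",
    "messages": "The Petals network is",  # plain string per the Union[str, ...] type
    "max_tokens": 32,
    "top_k": 40,                  # Petals extra: top-k sampling
    "use_beam_search": False,     # Petals extra: beam search instead of sampling
    "skip_special_tokens": True,  # Petals extra: drop special tokens from output
}
response = requests.post("http://localhost:5000/v1/completions", json=payload, timeout=600)
print(response.json())  # OpenAI-style JSON response
```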

## WebSocket API (`/api/v2/generate`)

With this API, you open a WebSocket connection and exchange JSON-encoded requests and responses.
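
For illustration, here is a minimal sketch using the `websocket-client` package against a locally running backend. The message fields follow the upstream Petals chat protocol (`open_inference_session`, then `generate`); treat them as assumptions if your backend version differs:

```python
# Minimal sketch of the WebSocket API using websocket-client
# (pip install websocket-client). Message fields follow the upstream
# Petals chat protocol and are assumptions if your backend differs.
import json
from websocket import create_connection

ws = create_connection("ws://localhost:5000/api/v2/generate")

# Open an inference session for an example model from config.py.
ws.send(json.dumps({
    "type": "open_inference_session",
    "model": "stabilityai/StableBeluga2",
    "max_length": 1024,
}))
print(ws.recv())  # expect {"ok": true} on success

# Request a short generation within the open session.
ws.send(json.dumps({
    "type": "generate",
    "inputs": "A cat sat on",
    "max_new_tokens": 8,
}))
print(ws.recv())  # JSON containing the generated outputs
ws.close()
```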