This page contains a curated list of examples, tutorials, and blog posts about WebLLM use cases. Please send a pull request if you find something that belongs here.
Note that all examples below run in-browser and use WebGPU as a backend.
- get-started: a minimal get-started example with chat completion (see the first sketch after this list).
- simple-chat-js: a minimal and complete chat bot app in vanilla JavaScript.
- simple-chat-ts: a minimal and complete chat bot app in TypeScript.
- get-started-web-worker: same as get-started, but using a web worker (see the sketch below).
- next-simple-chat: a minimal and complete chat bot app with Next.js.
- multi-round-chat: while the APIs are functional, we internally optimize so that multi-round chat usage can reuse the KV cache (see the sketch below).
- text-completion: demonstrates the API `engine.completions.create()`, which is pure text completion with no conversation, as opposed to `engine.chat.completions.create()` (see the sketch below).
- embeddings: demonstrates the API `engine.embeddings.create()`, integration with `EmbeddingsInterface` and `MemoryVectorStore` of LangChain.js, and RAG with LangChain.js using WebLLM for both the LLM and embeddings in a single engine (see the sketch below).
- multi-models: demonstrates loading multiple models in a single engine concurrently (see the sketch below).
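For orientation, here is a minimal sketch of the get-started flow. It assumes the `@mlc-ai/web-llm` package; the model id is illustrative and can be any id from the prebuilt model list.

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// The model id is illustrative; pick any id from WebLLM's prebuilt model list.
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC", {
  initProgressCallback: (report) => console.log(report.text), // download/compile progress
});

// OpenAI-style chat completion, running fully in-browser on WebGPU.
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "What is WebGPU?" }],
});
console.log(reply.choices[0].message.content);
```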
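The web-worker variant keeps the UI thread responsive by hosting the engine in a worker. A sketch of the two-file pattern the example uses; the worker-instantiation syntax depends on your bundler.

```typescript
// worker.ts — hosts the engine and serves requests forwarded from the main thread.
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);
```

```typescript
// main.ts — the resulting engine exposes the same chat.completions API as get-started.
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
  "Llama-3.1-8B-Instruct-q4f32_1-MLC", // illustrative model id
);
```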
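Multi-round chat needs no special API: resend the accumulated history each round, and the engine reuses the KV cache for the shared prefix internally. A sketch, reusing an `engine` created as in the first sketch; the message type name follows WebLLM's OpenAI-style typings.

```typescript
import type { ChatCompletionMessageParam } from "@mlc-ai/web-llm";

const history: ChatCompletionMessageParam[] = [
  { role: "user", content: "Name three planets." },
];
const first = await engine.chat.completions.create({ messages: history });
// Keep the assistant reply in the history for the next round.
history.push({ role: "assistant", content: first.choices[0].message.content ?? "" });

history.push({ role: "user", content: "Tell me more about the first one." });
// Same API call; KV-cache reuse for the shared prefix happens internally.
const second = await engine.chat.completions.create({ messages: history });
console.log(second.choices[0].message.content);
```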
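text-completion in miniature: `engine.completions.create()` takes a raw `prompt` and applies no chat template. The prompt and sampling options below are illustrative.

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC"); // illustrative id

// Pure text completion: no conversation, no chat template.
const completion = await engine.completions.create({
  prompt: "The capital of France is",
  max_tokens: 16,
});
console.log(completion.choices[0].text);
```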
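embeddings in miniature. The model id is illustrative and must be an embedding-capable model from the prebuilt list; the LangChain.js integration in the example wraps this same call behind `EmbeddingsInterface`.

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// An embedding model must be loaded; the id below is illustrative.
const engine = await CreateMLCEngine("snowflake-arctic-embed-m-q0f32-MLC-b4");

const resp = await engine.embeddings.create({
  input: ["What is WebGPU?", "WebGPU is a browser API for GPU compute."],
});
console.log(resp.data[0].embedding.length); // dimensionality of the first vector
```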
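multi-models, sketched under the assumption (taken from the example) that the engine accepts a list of model ids at creation and a per-request `model` field to route each call:

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Illustrative model ids; both are loaded concurrently into one engine.
const modelA = "Llama-3.1-8B-Instruct-q4f32_1-MLC";
const modelB = "Phi-3.5-mini-instruct-q4f16_1-MLC";
const engine = await CreateMLCEngine([modelA, modelB]);

// Route the request to one of the loaded models via the `model` field.
const reply = await engine.chat.completions.create({
  model: modelB,
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(reply.choices[0].message.content);
```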
These examples demonstrate various capabilities via WebLLM's OpenAI-like API.
- streaming: returns output as chunks in real time in the form of an AsyncGenerator (see the sketch at the end of this page).
- json-mode: efficiently ensures the output is in JSON format; see the OpenAI Reference for more (sketch at the end of this page).
- json-schema: besides guaranteeing that the output is JSON, ensures the output adheres to a specific JSON schema specified by the user (covered in the same sketch).
- seed-to-reproduce: uses the field `seed` to ensure reproducible output (see the sketch at the end of this page).
- function-calling (WIP): function calling with the fields `tools` and `tool_choice` (with preliminary support).
- vision-model: processes requests with image input using a vision language model (e.g. Phi-3.5-vision); see the sketch at the end of this page.
- chrome-extension: a Chrome extension that does not have a persistent background.
- chrome-extension-webgpu-service-worker: a Chrome extension using a service worker, hence having a persistent background.
- logit-processor: while `logit_bias` is supported (see the sketch at the end of this page), we additionally support stateful logit processing where users can specify their own rules. We also expose the low-level API `forwardTokensAndSample()`.
- cache-usage: demonstrates how WebLLM supports both the Cache API and IndexedDB cache, which users can pick between with `appConfig.useIndexedDBCache` (see the sketch at the end of this page). Also demonstrates various cache utils, such as checking whether a model is cached, deleting a model's weights from the cache, deleting a model library wasm from the cache, etc.
- simple-chat-upload: demonstrates how to upload local models to WebLLM instead of downloading them via a URL link.
- web-llm-embed: a document chat prototype using react-llm with Transformers.js embeddings.
- DeVinci: an AI chat app based on WebLLM and hosted on a decentralized cloud platform.
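The sketches below mirror the API-capability examples above. Unless they create their own, they assume an `engine` created as in the get-started sketch, and all literals are illustrative. First, streaming: `stream: true` turns the return value into an AsyncGenerator of chunks.

```typescript
const chunks = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Write a haiku about the sea." }],
  stream: true,
});

let text = "";
for await (const chunk of chunks) {
  // Each chunk carries an incremental delta, as in OpenAI's streaming format.
  text += chunk.choices[0]?.delta?.content ?? "";
}
console.log(text);
```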
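json-mode and json-schema both go through `response_format`. Passing the schema as a string via `response_format.schema` follows the json-schema example's pattern; treat the exact field shape as an assumption.

```typescript
// JSON mode: the output is guaranteed to be valid JSON.
const jsonReply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Describe a color as JSON." }],
  response_format: { type: "json_object" },
});

// JSON schema: additionally constrain the output to a user-specified schema.
const schema = JSON.stringify({
  type: "object",
  properties: { name: { type: "string" }, hex: { type: "string" } },
  required: ["name", "hex"],
});
const schemaReply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Describe a color." }],
  response_format: { type: "json_object", schema },
});
console.log(schemaReply.choices[0].message.content);
```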
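seed-to-reproduce in one call: fixing `seed` makes sampling deterministic, so repeating the same request yields the same output.

```typescript
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Pick a random animal." }],
  temperature: 1.0, // sampling is still on...
  seed: 42,         // ...but the same seed + same request reproduces the output
});
console.log(reply.choices[0].message.content);
```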
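vision-model, sketched assuming WebLLM mirrors OpenAI's image-content message parts (as the example suggests); the model id and image URL are illustrative, and the model must be vision-capable (e.g. a Phi-3.5-vision build).

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const visionEngine = await CreateMLCEngine("Phi-3.5-vision-instruct-q4f16_1-MLC"); // illustrative id

const described = await visionEngine.chat.completions.create({
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What is in this image?" },
        { type: "image_url", image_url: { url: "https://example.com/cat.png" } },
      ],
    },
  ],
});
console.log(described.choices[0].message.content);
```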
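For the `logit_bias` half of logit-processor, the OpenAI format applies: keys are token ids (as strings) of the loaded model's tokenizer, and values in [-100, 100] bias those logits; the token id below is illustrative. The stateful LogitProcessor and `forwardTokensAndSample()` paths are best read directly from the example.

```typescript
const biased = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Say something." }],
  // Token id "46510" is illustrative; -100 effectively bans that token.
  logit_bias: { "46510": -100 },
});
console.log(biased.choices[0].message.content);
```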
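cache-usage, sketched with `prebuiltAppConfig` plus the `useIndexedDBCache` flag. The cache-utility names follow the example's usage; treat them as illustrative of the utils the item lists.

```typescript
import * as webllm from "@mlc-ai/web-llm";

const modelId = "Llama-3.1-8B-Instruct-q4f32_1-MLC"; // illustrative id
const appConfig: webllm.AppConfig = {
  ...webllm.prebuiltAppConfig,
  useIndexedDBCache: true, // true => IndexedDB cache; false => Cache API
};

const engine = await webllm.CreateMLCEngine(modelId, { appConfig });

// Cache utils, as used in the example:
console.log(await webllm.hasModelInCache(modelId, appConfig)); // true once downloaded
await webllm.deleteModelAllInfoInCache(modelId, appConfig);    // remove weights, wasm, config
```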