A highly scalable chat assistant that provides real-time Wikipedia information using the Llama-2-7b-chat LLM, served with KServe for high concurrency, monitored with Prometheus, and presented through a user-friendly Streamlit interface.
This project consists of three main components, built around the llama-2-7b-chat-hf model, which is served using KServe, a standard, cloud-agnostic model inference platform on Kubernetes -
- Data Source: The wikipedia Python library scrapes text from Wikipedia pages based on the user's question.
- Vector Database: The scraped text is stored as a FAISS index with the help of Ray, which significantly speeds up generating and persisting the vector embeddings.
- Prompt: Using the LangChain wrapper for OpenAI chat-completion models, the hosted Llama model is queried in a Retrieval-Augmented Generation (RAG) fashion, combining the context retrieved from the FAISS index with the user's question (see the sketch after this list).
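The sketch below shows, at a high level, how these components could fit together in Python. It is an illustration rather than the repository's actual app.py: it assumes the wikipedia, langchain-community, langchain-openai, sentence-transformers, and faiss packages, and it omits the Ray-based parallelism used for embedding generation.

```python
import os
import wikipedia
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Endpoint details of the KServe-hosted model (exported during the deployment steps below).
INGRESS_HOST = os.environ.get("INGRESS_HOST", "localhost")
INGRESS_PORT = os.environ.get("INGRESS_PORT", "80")
SERVICE_HOSTNAME = os.environ.get("SERVICE_HOSTNAME", "")

question = "What is the Great Barrier Reef?"

# 1. Data source: scrape candidate Wikipedia pages for the question.
titles = wikipedia.search(question, results=2)
pages = [wikipedia.page(title, auto_suggest=False).content for title in titles]

# 2. Vector database: chunk the text, embed it, and store it in a FAISS index.
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_text("\n\n".join(pages))
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
index = FAISS.from_texts(chunks, embeddings)

# 3. Prompt: retrieve the most relevant chunks and query the hosted Llama 2 model
#    through its OpenAI-compatible endpoint via the LangChain wrapper.
context = "\n\n".join(doc.page_content for doc in index.similarity_search(question, k=3))
llm = ChatOpenAI(
    model="meta-llama/Llama-2-7b-chat-hf",
    base_url=f"http://{INGRESS_HOST}:{INGRESS_PORT}/openai/v1",
    api_key="not-needed",  # the KServe endpoint does not validate an OpenAI key
    default_headers={"Host": SERVICE_HOSTNAME},  # route through the Istio ingress gateway
)
answer = llm.invoke(f"Answer the question using only this context:\n\n{context}\n\nQuestion: {question}")
print(answer.content)
```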
To enable Prometheus metrics, add the serving.kserve.io/enable-prometheus-scraping annotation to the InferenceService YAML. The exported metrics (inference latency, explain-request latency, etc.) can then be visualised in Grafana.
- Kubernetes cluster (recommended: 16 CPUs x 64 GB RAM x 1 NVIDIA GPU per worker node)
- Python 3.9+
Install KServe on your cluster using the KServe Quick installation script -
curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.13/hack/quick_install.sh" | bash
Ensure you have accepted the terms and conditions of the llama-2-7b-chat-hf
repository on Hugging Face in order to use the LLM. Once you have the required permissions, store your Hugging Face access token in a Kubernetes Secret -
export HF_TOKEN={your_token}
kubectl create secret generic hf-token --from-literal=hf-token="$HF_TOKEN"
Now deploy the Llama 2 Chat model by applying the InferenceService resource to your cluster -
kubectl apply -f deployments/kserve-llama.yaml
Note - The KServe HuggingFace runtime uses vLLM by default to serve LLMs, for faster time-to-first-token (TTFT) and higher token-generation throughput. If the model is not supported by vLLM, KServe falls back to the HuggingFace backend as a failsafe.
kubectl get inferenceservices huggingface-llama2
Wait for ~5-10 minutes; you should then see the status READY=True.
To verify that inference works, perform a sample request against the OpenAI-compatible /v1/completions
endpoint.
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
export SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"model": "meta-llama/Llama-2-7b-chat-hf", "prompt": "Who is the president of the United States?", "stream":false, "max_tokens": 30}'
Your model is now ready for use!
The application uses a Streamlit frontend, which gives users an interactive interface for typing in questions and receiving informative answers, so you no longer have to scour through Wikipedia pages yourself.
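For orientation, a heavily simplified sketch of what such a Streamlit frontend can look like is shown below. The function answer_question is a placeholder for the RAG pipeline sketched earlier, and app.py in this repository is the authoritative implementation.

```python
import os
import streamlit as st


def answer_question(question: str) -> str:
    # Placeholder for the RAG pipeline (Wikipedia scrape -> FAISS retrieval ->
    # Llama 2 on KServe); see the sketch in the components section above.
    return f"(answer for: {question})"


# The KServe endpoint details are read from the environment, matching the
# variables exported during deployment.
st.title("Wiki Speaks")
st.caption(f"Model endpoint: {os.environ.get('SERVICE_HOSTNAME', 'not configured')}")

question = st.text_input("Ask a question about any Wikipedia topic")
if question:
    with st.spinner("Searching Wikipedia and asking Llama 2..."):
        st.write(answer_question(question))
```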
To deploy the application locally -
- Create a local python environment -
python -m venv env
source env/bin/activate
- Install the required dependencies for the project -
pip install -r requirements.txt
- Run streamlit -
env/bin/streamlit run app.py
The application will now run on your localhost!
A Dockerfile is provided so you can build your own image of the application. To do so, run -
docker build -t wikispeaks:v1 .
To start the application, run -
docker run -p 8080:8501 -e INGRESS_HOST=$INGRESS_HOST -e INGRESS_PORT=$INGRESS_PORT -e SERVICE_HOSTNAME=$SERVICE_HOSTNAME wikispeaks:v1
Then open localhost:8080 in your web browser to access the application.
A Kubernetes deployment file is provided to host your application on a K8s cluster -
kubectl apply -f deployments/app-deployment.yaml
Note: Make sure to update the values of the environment variables in the Secret before applying the deployment.
Ensure the deployment, pods and service are all up and running
kubectl get deployment wiki-app-deployment
kubectl get pod | grep wiki-app
kubectl get svc wiki-service
Access the application using the external IP of the load balancer service on your web browser!
Prometheus and Grafana are used to monitor the performance of the deployed model and application. For this application, I've only visualised the inference latency, but many other metrics can be visualised as well.
With the serving.kserve.io/enable-prometheus-scraping: "true"
annotation added to the InferenceService YAML, the kserve container exposes its custom metrics for the Prometheus server to scrape, and they can then be visualised on a Grafana dashboard.
Follow the KServe guide to set up Prometheus using the Prometheus Operator.
Once it is set up, port-forward the Prometheus service to access the scraped metrics -
kubectl port-forward service/prometheus-operated -n kfserving-monitoring 9090:9090
To generate model requests, query the Llama model from the application; the request-latency metric can then be captured in Prometheus.
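With the port-forward active, the scraped latency metric can also be queried programmatically through the Prometheus HTTP API. The sketch below is only illustrative: the metric name in the PromQL expression is an assumption, so first check the metric names actually exported by your InferenceService in the Prometheus UI.

```python
import requests

# Illustrative PromQL query: 95th-percentile request latency over the last 5 minutes.
# NOTE: "request_predict_seconds_bucket" is an assumed metric name -- replace it with
# the histogram actually exported by your kserve container.
PROMQL = 'histogram_quantile(0.95, sum(rate(request_predict_seconds_bucket[5m])) by (le))'

# Assumes the `kubectl port-forward service/prometheus-operated ... 9090:9090` command above is running.
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": PROMQL},
    timeout=10,
)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"][1], "seconds")
```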
The captured Prometheus metrics can be visualised more effectively in Grafana. Follow the Grafana installation guide to install and configure Grafana on your Kubernetes cluster.
Open the Grafana dashboard by port-forwarding the service -
kubectl port-forward service/grafana 3000:3000 --namespace=my-grafana
Add Prometheus as a data source, making sure to use the external IP of the Prometheus service.
You can then visualise the various metrics that Prometheus scrapes from KServe.
Any contributions are welcome! Please raise an issue or PR, and I'll address them as soon as possible!
- Ray Operator
- Custom Kserve model deployment