Use simple in-memory vector database instead of Pinecone (#4)
* Add simple in-memory vector database
* switch to blue color style
* better UI

---------

Co-authored-by: florian <[email protected]>
fpgmaas and florian authored Jun 23, 2024
1 parent 0287408 commit ab800ac
Showing 39 changed files with 795 additions and 825 deletions.
Binary file removed .DS_Store
Binary file not shown.
5 changes: 4 additions & 1 deletion .env.template
@@ -1 +1,4 @@
PINECONE_TOKEN=your-api-token
STORAGE_BACKEND=BLOB
STORAGE_BACKEND_BLOB_ACCOUNT_NAME=
STORAGE_BACKEND_BLOB_CONTAINER_NAME=
STORAGE_BACKEND_BLOB_KEY=
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -29,7 +29,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
fail-fast: false
steps:
- name: Check out
1 change: 1 addition & 0 deletions .github/workflows/publish.yml
@@ -55,3 +55,4 @@ jobs:
tags: pypiscoutacr.azurecr.io/pypi-scout-frontend:latest
build-args: |
NEXT_PUBLIC_API_URL=https://pypiscout.com/api
NEXT_PUBLIC_GA_TRACKING_ID=${{ secrets.NEXT_PUBLIC_GA_TRACKING_ID }}
47 changes: 19 additions & 28 deletions README.md
@@ -12,48 +12,38 @@ Inspired by [this blog post](https://koaning.io/posts/search-boxes/) about findi

## How does this work?

The project works by collecting project summaries and descriptions for all packages on PyPI with more than 50 weekly downloads. These are then converted into vector representations using [Sentence Transformers](https://www.sbert.net/). When the user enters a query, it is converted into a vector representation, and the most similar package descriptions are fetched from the vector database. Additional weight is given to the amount of weekly downloads before presenting the results to the user in a dashboard.
The project works by collecting project summaries and descriptions for all packages on PyPI with more than 100 weekly downloads. These are then converted into vector representations using [Sentence Transformers](https://www.sbert.net/). When the user enters a query, it is converted into a vector representation, and the most similar package descriptions are fetched from the vector database. Additional weight is given to the number of weekly downloads before the results are presented to the user in a dashboard.
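
The search step described above can be illustrated with a minimal, dependency-free sketch. It is not the project's implementation: the names `cosine_similarity` and `search`, the `download_weight` parameter, and the log-scaled popularity bonus are assumptions made for illustration, and the tiny hand-written vectors stand in for real Sentence Transformers embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, packages, top_k=3, download_weight=0.1):
    """Rank packages by similarity to the query plus a small popularity bonus.

    `packages` is a list of dicts with keys "name", "vector" and
    "weekly_downloads". The scoring formula is an assumption for this
    sketch, not the formula used by the project.
    """
    if not packages:
        return []
    # Log-scale downloads so very popular packages do not dominate the score.
    max_log = max(math.log10(p["weekly_downloads"] + 1) for p in packages)
    scored = []
    for p in packages:
        similarity = cosine_similarity(query_vector, p["vector"])
        popularity = math.log10(p["weekly_downloads"] + 1) / max_log if max_log else 0.0
        scored.append((p["name"], similarity + download_weight * popularity))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data: in practice the vectors would come from an embedding model.
example_packages = [
    {"name": "plotting-lib", "vector": [0.9, 0.1, 0.0], "weekly_downloads": 50_000},
    {"name": "http-client", "vector": [0.0, 0.2, 0.9], "weekly_downloads": 2_000_000},
]
for name, score in search([1.0, 0.0, 0.0], example_packages):
    print(f"{name}: {score:.3f}")
```

Because everything lives in a plain Python list, a brute-force scan like this needs no external service, which is the general idea behind replacing a hosted vector database with an in-memory one.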

## Architecture
## Stack

The project uses the following technologies:

1. **[Pinecone](https://www.pinecone.io/)** as vector database
2. **[FastAPI](https://fastapi.tiangolo.com/)** for the API backend
3. **[NextJS](https://nextjs.org/) and [TailwindCSS](https://tailwindcss.com/)** for the frontend
4. **[Sentence Transformers](https://www.sbert.net/)** for vector embeddings

<br/>

![Architecture](./static/architecture.png)
1. **[FastAPI](https://fastapi.tiangolo.com/)** for the API backend
2. **[NextJS](https://nextjs.org/) and [TailwindCSS](https://tailwindcss.com/)** for the frontend
3. **[Sentence Transformers](https://www.sbert.net/)** for vector embeddings

## Getting Started

### Prerequisites

1. **Set Up Pinecone**

Since PyPI Scout uses [Pinecone](https://www.pinecone.io/) as the vector database, register for a free account on their website. Obtain your API key using the instructions [here](https://docs.pinecone.io/guides/get-started/quickstart).

2. **Create a `.env` File**
### Build and Setup

Copy the `.env.template` to create a new `.env` file:
#### 1. (Optional) **Create a `.env` file**

```sh
cp .env.template .env
```
By default, all data will be stored on your local machine. It is also possible to store the data for the API on Azure Blob storage, and
have the API read from there. To do so, create a `.env` file:

Then add your Pinecone API key from step 1 to this file.
```sh
cp .env.template .env
```

### Build and Setup
and fill in the required fields.

#### 1. **Run the Setup Script**
#### 2. **Run the Setup Script**

The setup script will:

- Download and process the PyPI dataset and store the results in the `data` directory.
- Set up your Pinecone index.
- Create vector embeddings for the PyPI dataset and upsert them to the Pinecone index.
- Create vector embeddings for the PyPI dataset.
- If the `STORAGE_BACKEND` environment variable is set to `BLOB`: Upload the datasets to blob storage.
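
As a rough illustration of how the `STORAGE_BACKEND` switch could be read, the sketch below loads the variables from `.env.template` into a small config object. The `StorageConfig` class, the `load_storage_config` function, and the `"LOCAL"` default name are hypothetical; only the environment variable names come from the template.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class StorageConfig:
    backend: str  # "LOCAL" (assumed default name) or "BLOB"
    account_name: Optional[str] = None
    container_name: Optional[str] = None
    key: Optional[str] = None


def load_storage_config(env: Optional[Mapping[str, str]] = None) -> StorageConfig:
    """Build a StorageConfig from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    backend = (env.get("STORAGE_BACKEND") or "LOCAL").upper()
    if backend == "LOCAL":
        return StorageConfig(backend="LOCAL")
    if backend != "BLOB":
        raise ValueError(f"Unsupported STORAGE_BACKEND: {backend!r}")
    # For blob storage, all three settings from .env.template must be filled in.
    required = {
        "account_name": "STORAGE_BACKEND_BLOB_ACCOUNT_NAME",
        "container_name": "STORAGE_BACKEND_BLOB_CONTAINER_NAME",
        "key": "STORAGE_BACKEND_BLOB_KEY",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise ValueError("Missing settings for BLOB backend: " + ", ".join(missing))
    return StorageConfig(
        backend="BLOB",
        **{field: env[var] for field, var in required.items()},
    )
```

Failing early on missing blob settings keeps a half-filled `.env` file from surfacing later as an upload error halfway through the setup script.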

There are three methods to run the setup script, depending on whether you have an NVIDIA GPU and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) installed. Please run the setup script using the method that applies to you:

@@ -62,9 +52,10 @@ There are three methods to run the setup script, dependent on if you have a NVID
- [Option 3: Using Docker without NVIDIA GPU and NVIDIA Container Toolkit](SETUP.md#option-3-using-docker-without-nvidia-gpu-and-nvidia-container-toolkit)

> [!NOTE]
> Although the dataset contains all packages on PyPI with more than 50 weekly downloads, by default only the top 25% of packages with the highest weekly downloads (those with more than approximately 650 downloads per week) are added to the vector database. To include packages with less weekly downloads in the database, you can increase the value of `FRAC_DATA_TO_INCLUDE` in `pypi_scout/config.py`.
> The dataset contains approximately 100,000 packages on PyPI with more than 100 weekly downloads. To speed up local development,
> you can reduce the number of packages processed locally by lowering the value of `FRAC_DATA_TO_INCLUDE` in `pypi_scout/config.py`.
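
One plausible reading of `FRAC_DATA_TO_INCLUDE` is "keep only the most-downloaded fraction of the dataset". The sketch below shows that interpretation; the `select_top_fraction` helper and the package dictionaries are hypothetical, and the actual logic in `pypi_scout` may differ.

```python
import math

# Illustrative value only; the real default lives in pypi_scout/config.py.
FRAC_DATA_TO_INCLUDE = 0.25


def select_top_fraction(packages, frac):
    """Return the `frac` share of packages with the highest weekly downloads."""
    if not 0 < frac <= 1:
        raise ValueError("frac must be in the interval (0, 1]")
    ranked = sorted(packages, key=lambda p: p["weekly_downloads"], reverse=True)
    # Round up so that a non-empty dataset always keeps at least one package.
    keep = math.ceil(len(ranked) * frac)
    return ranked[:keep]


packages = [{"name": f"pkg-{i}", "weekly_downloads": 100 * i} for i in range(1, 9)]
subset = select_top_fraction(packages, FRAC_DATA_TO_INCLUDE)
print([p["name"] for p in subset])
```

With a smaller fraction, fewer descriptions need to be embedded, which is where most of the setup time goes.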
#### 2. **Run the Application**
#### 3. **Run the Application**

Start the application using Docker Compose:

4 changes: 2 additions & 2 deletions SETUP.md
@@ -3,8 +3,8 @@
The setup script will:

- Download and process the PyPI dataset and store the results in the `data` directory.
- Set up your Pinecone index.
- Create vector embeddings for the PyPI dataset and upsert them to the Pinecone index.
- Create vector embeddings for the PyPI dataset.
- If the `STORAGE_BACKEND` environment variable is set to `BLOB`: Upload the datasets to blob storage.

There are three ways to run the setup script:

6 changes: 3 additions & 3 deletions frontend/Dockerfile
@@ -13,11 +13,11 @@ RUN npm install
# Copy the rest of the application code to the container
COPY . .

# Build argument to accept the API URL during build time
# Add build arguments to environment
ARG NEXT_PUBLIC_API_URL

# Set environment variable within the container
ARG NEXT_PUBLIC_GA_TRACKING_ID
ENV NEXT_PUBLIC_API_URL=${NEXT_PUBLIC_API_URL}
ENV NEXT_PUBLIC_GA_TRACKING_ID=${NEXT_PUBLIC_GA_TRACKING_ID}

# Build the Next.js application
RUN npm run build
2 changes: 1 addition & 1 deletion frontend/app/components/GitHubButton.tsx
@@ -6,7 +6,7 @@ const GitHubButton: React.FC = () => {
href="https://github.com/fpgmaas/pypi-scout"
target="_blank"
rel="noopener noreferrer"
className="flex items-center p-2 border border-gray-700 rounded bg-gray-900 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-gray-700"
className="flex items-center p-2 border border-sky-700 rounded bg-sky-900 text-white hover:bg-sky-700 focus:outline-none focus:ring-2 focus:ring-sky-700"
>
<svg
height="24"
29 changes: 29 additions & 0 deletions frontend/app/components/GoogleAnalytics.tsx
@@ -0,0 +1,29 @@
// app/components/GoogleAnalytics.tsx
"use client";

import { useEffect } from "react";

const GoogleAnalytics = () => {
useEffect(() => {
const trackingId = process.env.NEXT_PUBLIC_GA_TRACKING_ID;
if (trackingId) {
const script1 = document.createElement("script");
script1.async = true;
script1.src = `https://www.googletagmanager.com/gtag/js?id=${trackingId}`;
document.head.appendChild(script1);

const script2 = document.createElement("script");
script2.innerHTML = `
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', '${trackingId}');
`;
document.head.appendChild(script2);
}
}, []);

return null;
};

export default GoogleAnalytics;
37 changes: 37 additions & 0 deletions frontend/app/components/Header.tsx
@@ -0,0 +1,37 @@
import { useState } from "react";
import GitHubButton from "./GitHubButton";
import SupportButton from "./SupportButton";
import { FaBars, FaTimes } from "react-icons/fa";

const Header: React.FC = () => {
const [isMenuOpen, setIsMenuOpen] = useState(false);

const toggleMenu = () => {
setIsMenuOpen(!isMenuOpen);
};

return (
<header className="w-full flex justify-end items-center p-4 bg-sky-950">
<div className="hidden md:flex space-x-4 ">
<GitHubButton />
<SupportButton />
</div>
<div className="md:hidden flex-grow flex justify-end">
<button
onClick={toggleMenu}
className="text-white focus:outline-none focus:ring-2 focus:ring-sky-700"
>
{isMenuOpen ? <FaTimes size={24} /> : <FaBars size={24} />}
</button>
</div>
{isMenuOpen && (
<div className="absolute top-16 right-4 bg-sky-900 p-4 rounded shadow-lg flex flex-col space-y-4 md:hidden">
<GitHubButton />
<SupportButton />
</div>
)}
</header>
);
};

export default Header;
20 changes: 11 additions & 9 deletions frontend/app/components/InfoBox.tsx
@@ -8,21 +8,23 @@ const InfoBox: React.FC<InfoBoxProps> = ({ infoBoxVisible }) => {
if (!infoBoxVisible) return null;

return (
<div className="w-3/5 bg-gray-800 p-6 rounded-lg shadow-lg mt-4 text-white">
<h2 className="text-2xl font-bold mb-2">How does this work?</h2>
<p className="text-gray-300">
<div className="w-3/5 bg-sky-900 p-6 rounded-lg shadow-lg mt-4 text-white">
<h2 className="text-2xl font-bold mb-2 text-gray-100">
How does this work?
</h2>
<p className="text-gray-100">
This application allows you to search for Python packages on PyPI using
natural language queries. For example, a query could be &quot;a package
that creates plots and beautiful visualizations&quot;.
</p>
<br />
<p className="text-gray-300">
<p className="text-gray-100">
Once you click search, your query will be matched against the summary
and the first part of the description of the ~30.000 most popular
packages on PyPI, which are all packages with at least ~600 downloads
per week. The results are then scored based on their similarity to the
query and their number of weekly downloads, and the best results are
displayed in the table below.
and the first part of the description of the ~100.000 most popular
packages on PyPI, which includes all packages with at least ~100
downloads per week. The results are then scored based on their
similarity to the query and their number of weekly downloads, and the
best results are displayed in the table below.
</p>
</div>
);
10 changes: 5 additions & 5 deletions frontend/app/components/SearchResultsTable.tsx
@@ -33,8 +33,8 @@ const SearchResultsTable: React.FC<SearchResultsTableProps> = ({

return (
<div className="overflow-x-auto w-full">
<table className="min-w-full divide-y divide-gray-700">
<thead className="bg-gray-800">
<table className="min-w-full divide-y divide-sky-800">
<thead className="bg-sky-950">
<tr>
<th
className="px-4 py-2 text-left text-xs font-medium text-gray-200 uppercase tracking-wider cursor-pointer whitespace-nowrap"
@@ -72,9 +72,9 @@ const SearchResultsTable: React.FC<SearchResultsTableProps> = ({
</th>
</tr>
</thead>
<tbody className="bg-gray-800 divide-y divide-gray-700">
<tbody className="bg-sky-900 divide-y divide-sky-800">
{results.map((result, index) => (
<tr key={index} className="hover:bg-gray-700">
<tr key={index} className="hover:bg-sky-800">
<td className="px-4 py-2 whitespace-nowrap text-gray-200">
{truncateText(result.name, 20)}
</td>
@@ -92,7 +92,7 @@ const SearchResultsTable: React.FC<SearchResultsTableProps> = ({
href={`https://pypi.org/project/${result.name}/`}
target="_blank"
rel="noopener noreferrer"
className="text-blue-400 hover:underline flex items-center"
className="text-sky-500 hover:underline flex items-center hover:text-orange-800"
>
<FaExternalLinkAlt className="mr-1" />
PyPI
2 changes: 1 addition & 1 deletion frontend/app/components/SupportButton.tsx
@@ -6,7 +6,7 @@ const SupportButton: React.FC = () => {
href="https://ko-fi.com/fpgmaas"
target="_blank"
rel="noopener noreferrer"
className="flex items-center p-2 border border-gray-700 rounded bg-gray-900 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-gray-700"
className="flex items-center p-2 border border-sky-700 rounded bg-sky-900 text-white hover:bg-sky-700 focus:outline-none focus:ring-2 focus:ring-sky-700"
>
<img
src="kofi.png"
4 changes: 2 additions & 2 deletions frontend/app/globals.css
@@ -6,8 +6,8 @@
--foreground-rgb: 0, 0, 0;
--background-start-rgb: 214, 219, 220;
--background-end-rgb: 255, 255, 255;
--dark-bg-start-rgb: 17, 24, 39; /* Dark gray (bg-gray-900) */
--dark-bg-end-rgb: 17, 24, 39; /* Dark gray (bg-gray-900) */
--dark-bg-start-rgb: 8, 47, 73; /* Dark sky (bg-sky-950) */
--dark-bg-end-rgb: 8, 47, 73; /* Dark sky (bg-sky-950) */
--dark-foreground-rgb: 255, 255, 255;
}

6 changes: 5 additions & 1 deletion frontend/app/layout.tsx
@@ -1,6 +1,7 @@
import type { Metadata } from "next";
import { Inter } from "next/font/google";
import "./globals.css";
import GoogleAnalytics from "./components/GoogleAnalytics";

const inter = Inter({ subsets: ["latin"] });

@@ -16,7 +17,10 @@ export default function RootLayout({
}>) {
return (
<html lang="en">
<body className={inter.className}>{children}</body>
<body className={inter.className}>
<GoogleAnalytics />
{children}
</body>
</html>
);
}