Commit 11e9ae8
Merge pull request #5 from KevKibe/main
Update: Support for Indexing using Google Generative AI Embeddings and README, DOCS Update
KevKibe authored Apr 8, 2024
2 parents 4c39cd7 + 43981c9 commit 11e9ae8
Showing 8 changed files with 232 additions and 50 deletions.
45 changes: 45 additions & 0 deletions DOCS/CODE_OF_CONDUCT.md
@@ -0,0 +1,45 @@
# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to make participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment include:

- Using welcoming and inclusive language
- Being respectful of differing viewpoints and experiences
- Gracefully accepting constructive criticism
- Focusing on what is best for the community
- Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

- The use of sexualized language or imagery and unwelcome sexual attention or advances
- Trolling, insulting/derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others' private information, such as a physical or email address, without their explicit permission
- Other conduct which could reasonably be considered inappropriate in a professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned with this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project email address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at [[email protected]](mailto:[email protected]). All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

For answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.
48 changes: 48 additions & 0 deletions DOCS/CONTRIBUTING.md
@@ -0,0 +1,48 @@
# Contributing to DocIndex
Welcome to DocIndex! We appreciate your interest in contributing to our open-source project. Please take a moment to review the following guidelines to ensure a smooth and collaborative experience for everyone.

## Code of Conduct

Before contributing, please read and adhere to our [Code of Conduct](https://github.com/KevKibe/docindex/blob/master/DOCS/CODE_OF_CONDUCT.md). We are committed to fostering an inclusive and respectful community.

## How to Contribute

### Reporting Bugs or Issues

If you encounter a bug or issue with the project, please search the [issue tracker](https://github.com/KevKibe/docindex/issues) to see if it has already been reported. If not, please open a new issue with a clear and descriptive title, along with detailed steps to reproduce the issue.

### Suggesting Enhancements or New Features

We welcome suggestions for enhancements or new features. Please open a new issue with a clear description of the enhancement or feature you'd like to see, along with any relevant context or use cases.

### Submitting Pull Requests

We appreciate contributions via pull requests. Before submitting a pull request, please ensure that:

<!-- - Your code follows our [code style guidelines](link-to-code-style-guidelines) -->
- You have added appropriate tests (if applicable)
- Your pull request addresses a specific issue or feature request

Please reference the relevant issue or feature request in your pull request description.

## Getting Started

To get started with contributing to DocIndex, follow these steps:

1. Fork the repository and clone it to your local machine.
2. Install dependencies by running `pip install -r requirements.txt` (or equivalent).
3. Create a new branch for your changes: `git checkout -b my-feature-branch`.
4. Make your changes and commit them: `git commit -am 'Add new feature'`.
5. Push your changes to your fork: `git push origin my-feature-branch`.
6. Submit a pull request to the repository's `master` branch.

## Communication

<!-- Join our [community forum](link-to-forum) or [chat channel](link-to-chat-channel) to connect with other contributors and project maintainers. -->

## License

By contributing to DocIndex, you agree to license your contributions under the [project's license](https://github.com/KevKibe/docindex/blob/master/LICENSE).

Thank you for your contributions!

110 changes: 95 additions & 15 deletions README.md
@@ -1,4 +1,4 @@
<h1 align="center">DocIndex: Fast Document Storage for RAG</h1>
<h1 align="center">DocIndex: Fast Document Embeddings Storage for RAG</h1>
<p align="center">

<a href="https://github.com/KevKibe/docindex/commits/">
@@ -8,14 +8,14 @@
<img src="https://img.shields.io/github/license/KevKibe/docindex?" alt="License">
</a>

*Efficiently store multiple documents and their metadata, whether they're offline or online, in a Pinecone Vector Database optimized for Retrieval Augmented Generation (RAG) models Fast*
*Efficiently store multiple document embeddings and their metadata, whether the documents are offline or online, in a Pinecone Vector Database optimized for Retrieval Augmented Generation (RAG) models, fast.*

## Features

- ⚡️ **Rapid Indexing**: Quickly index multiple documents along with their metadata, including source, page details, and content, into Pinecone DB.<br>
- 📚 **Document Flexibility**: Index documents from your local storage or online sources with ease.<br>
- 📂 **Format Support**: Seamlessly handle various document formats, including PDF, docx(in-development), etc.<br>
- 🔁 **Embedding Services Integration**: Enjoy support for multiple embedding services such as OpenAIEmbeddings, GoogleGenerativeAIEmbeddings and more in development.<br>
- 🔁 **Embedding Services Integration**: Enjoy support for multiple embedding services such as OpenAI Embeddings, Google Generative AI Embeddings and more in development.<br>
- 🛠️ **Configurable Vectorstore**: Configure a vectorstore directly from the index to facilitate RAG pipelines effortlessly.

## Setup
@@ -24,20 +24,20 @@
pip install docindex
```

## Usage
## Getting Started
## Using OpenAI
```python
from _openai.index import OpenaiPineconeIndexer
from _openai.docindex import OpenaiPineconeIndexer

# Replace these values with your actual Pinecone API key, index name, OpenAI API key, and environment
pinecone_api_key = "pinecone-api-key"
index_name = "pinecone-index-name"
openai_api_key = "openai-api-key"
environment = "pinecone-index-environment"
batch_limit = 20 # Batch limit for upserting documents
chunk_size = 256 # Optional: size of texts per chunk.

# Define the batch limit for indexing, how many pages per pass.
batch_limit = 20

# List of URLs of the documents to be indexed. (offline on your computer or an online)
# List of URLs of the documents to be indexed. (offline on your computer or online)
urls = [
"your-document-1.pdf",
"your-document-2.pdf"
@@ -47,10 +47,9 @@ urls = [
pinecone_index = OpenaiPineconeIndexer(index_name, pinecone_api_key, environment, openai_api_key)

# Index the documents with the specified URLs and batch limit
pinecone_index.index_documents(urls,batch_limit)
pinecone_index.index_documents(urls, batch_limit, chunk_size)
```

## Initialize Vectorstore
## Initialize Vectorstore (using OpenAI)

```python
from pinecone import Pinecone as IndexPinecone
@@ -71,9 +70,64 @@ embed = OpenAIEmbeddings(
text_field = "text"

# Initialize the Vectorstore with the Pinecone index and OpenAI embeddings
vectorstore = VectorStorePinecone(index, embed.embed_query, text_field)
vectorstore = VectorStorePinecone(index, embed, text_field)
```


## Using Google Generative AI

```python
from _google.docindex import GooglePineconeIndexer

# Replace these values with your actual Pinecone API key, index name, Google API key, and environment
pinecone_api_key = "pinecone-api-key"
index_name = "pinecone-index-name"
google_api_key = "google-api-key"
environment = "pinecone-index-environment"
batch_limit = 20 # Batch limit for upserting documents
chunk_size = 256 # Optional: size of texts per chunk.

# List of URLs of the documents to be indexed. (offline on your computer or online)
urls = [
"your-document-1.pdf",
"your-document-2.pdf"
]

# Initialize the Pinecone indexer
pinecone_index = GooglePineconeIndexer(index_name, pinecone_api_key, environment, google_api_key)

# Index the documents with the specified URLs and batch limit
pinecone_index.index_documents(urls, batch_limit, chunk_size)
```


## Initialize Vectorstore (using Google Generative AI)

```python
from pinecone import Pinecone as IndexPinecone
from langchain_community.vectorstores import Pinecone as VectorStorePinecone
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Initialize the Pinecone index
index_pc = IndexPinecone(api_key=pinecone_api_key)
index = index_pc.Index(index_name)

# Initialize embeddings
embed = GoogleGenerativeAIEmbeddings(
model="models/embedding-001",
google_api_key=google_api_key
)

# Define the text field
text_field = "text"

# Initialize the Vectorstore with the Pinecone index and Google Generative AI embeddings
vectorstore = VectorStorePinecone(index, embed, text_field)
```




## Using the CLI

- Clone the Repository: Clone or download the application code to your local machine.
@@ -83,21 +83,47 @@
git clone https://github.com/KevKibe/docindex.git

- Create a virtual environment for the project and activate it.
```bash
# Navigate to project repository
cd docindex

# create virtual environment
python -m venv venv

# activate virtual environment
source venv/bin/activate
```
- Install dependencies by running this command
```bash
pip install -r requirements.txt
```

- Navigate to src and run this command to index documents
- Navigate to src
```bash
cd src
```

python -m _openai.doc_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --openai_api_key "your_openai_api_key" --environment "your_environment" --batch_limit 10 --docs "doc-1.pdf" "doc-2.pdf"
- Run the command to start indexing the documents

```bash
# Using OpenAI
python -m _openai.doc_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --openai_api_key "your_openai_api_key" --environment "your_environment" --batch_limit 10 --docs "doc-1.pdf" "doc-2.pdf" --chunk_size 256
```
```bash
# Using Google Generative AI
python -m _google.doc_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --google_api_key "your_google_api_key" --environment "your_environment" --batch_limit 10 --docs "doc-1.pdf" "doc-2.pdf" --chunk_size 256
```
## Contributing
Contributions are welcome and encouraged.
Before contributing, please take a moment to review our [Contribution Guidelines](https://github.com/KevKibe/docindex/blob/master/DOCS/CONTRIBUTING.md) for important information on how to contribute to this project.
If you're unsure about anything or need assistance, don't hesitate to reach out to us or open an issue to discuss your ideas.
We look forward to your contributions!
## License
This project is licensed under the MIT License - see the [LICENSE](https://github.com/KevKibe/docindex/blob/master/LICENSE) file for details.
## Contact
For any enquiries, please reach out to me through [email protected]
1 change: 1 addition & 0 deletions setup.cfg
@@ -20,6 +20,7 @@ install_requires =
langchain-community==0.0.31
langchain==0.1.14
langchain-openai==0.1.1
langchain-google-genai==1.0.1
package_dir=
=src

19 changes: 19 additions & 0 deletions src/_google/doc_index.py
@@ -0,0 +1,19 @@
from .docindex import GooglePineconeIndexer
import argparse

def parse_args():
parser = argparse.ArgumentParser(description="Index documents on Pinecone using Google Generative AI embeddings.")
parser.add_argument("--pinecone_api_key", type=str, help="Pinecone API key")
parser.add_argument("--index_name", type=str, help="Name of the Pinecone index")
parser.add_argument("--google_api_key", type=str, help="Google API key")
parser.add_argument("--environment", type=str, help="Environment for Pinecone service")
parser.add_argument("--batch_limit", type=int, help="Maximum batch size for indexing")
parser.add_argument("--docs", nargs="+", help="URLs of the documents to be indexed")
parser.add_argument("--chunk_size", type=int, default=256, help="Size of texts per chunk")
return parser.parse_args()


if __name__ == "__main__":
args = parse_args()
pinecone_indexer = GooglePineconeIndexer(args.index_name, args.pinecone_api_key, args.environment, args.google_api_key)
pinecone_indexer.index_documents(args.docs, args.batch_limit, args.chunk_size)
28 changes: 11 additions & 17 deletions src/_google/docindex.py
@@ -77,34 +77,27 @@ def embed(self) -> GoogleGenerativeAIEmbeddings:
google_api_key=self.google_api_key
)

def text_splitter(self) -> RecursiveCharacterTextSplitter:
"""
Initialize RecursiveCharacterTextSplitter object.
Returns:
RecursiveCharacterTextSplitter: RecursiveCharacterTextSplitter object.
"""
return RecursiveCharacterTextSplitter(
chunk_size=400,
chunk_overlap=20,
length_function=self.tiktoken_len,
separators=["\n\n", "\n", " ", ""]
)

def upsert_documents(self, documents: List[Page], batch_limit: int) -> None:
def upsert_documents(self, documents: List[Page], batch_limit: int, chunk_size: int = 256) -> None:
"""
Upsert documents into the Pinecone index.
Args:
documents (List[Page]): List of documents to upsert.
batch_limit (int): Maximum batch size for upsert operation.
chunk_size (int): Size of texts per chunk.
Returns:
None
"""
texts = []
metadatas = []
text_splitter = self.text_splitter()
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=int(chunk_size),
chunk_overlap=20,
length_function=self.tiktoken_len,
separators=["\n\n", "\n", " ", ""]
)
embed = self.embed()
for i, record in enumerate(tqdm(documents)):
metadata = {
@@ -126,13 +119,14 @@ def upsert_documents(self, documents: List[Page], batch_limit: int) -> None:
metadatas = []


def index_documents(self, urls: List[str], batch_limit: int) -> None:
def index_documents(self, urls: List[str], batch_limit: int, chunk_size: int = 256) -> None:
"""
Process a list of URLs and upsert documents to a Pinecone index.
Args:
urls (List[str]): List of URLs to process.
batch_limit (int): Batch limit for upserting documents.
chunk_size (int): Size of texts per chunk.
Returns:
None
@@ -152,6 +146,6 @@ def index_documents(self, urls: List[str], batch_limit: int) -> None:
]

print(f"Upserting {len(pages_data)} pages to the Pinecone index...")
self.upsert_documents(pages_data, batch_limit)
self.upsert_documents(pages_data, batch_limit, chunk_size)
print("Finished upserting documents for this URL.")
print("Indexing complete.")
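The flush-on-threshold batching used by `upsert_documents` above can be sketched in isolation. This is a simplified, hypothetical stand-in that only groups items, with no Pinecone client or embedding calls, to illustrate why the final partial batch must be flushed after the loop:

```python
from typing import List


def batch_upsert(items: List[str], batch_limit: int) -> List[List[str]]:
    """Group items into batches of at most batch_limit, mirroring the
    accumulate-then-flush loop in upsert_documents."""
    batches: List[List[str]] = []
    buffer: List[str] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= batch_limit:
            batches.append(buffer)  # flush a full batch
            buffer = []
    if buffer:  # flush the final partial batch left over after the loop
        batches.append(buffer)
    return batches


# Example: 7 chunks with a batch limit of 3 -> batch sizes [3, 3, 1]
print([len(b) for b in batch_upsert([f"chunk-{i}" for i in range(7)], 3)])
```

In the real indexer, each flush is where the embed-and-upsert call to Pinecone would happen; forgetting the trailing flush would silently drop the last few chunks of every document.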
3 changes: 2 additions & 1 deletion src/_openai/doc_index.py
@@ -9,13 +9,14 @@ def parse_args():
parser.add_argument("--environment", type=str, help="Environment for Pinecone service")
parser.add_argument("--batch_limit", type=int, help="Maximum batch size for indexing")
parser.add_argument("--docs", nargs="+", help="URLs of the documents to be indexed")
parser.add_argument("--chunk_size", type=int, default=256, help="Size of texts per chunk")
return parser.parse_args()


if __name__ == "__main__":
args = parse_args()
pinecone_indexer = OpenaiPineconeIndexer(args.index_name, args.pinecone_api_key, args.environment, args.openai_api_key)
pinecone_indexer.index_documents(args.docs, args.batch_limit)
pinecone_indexer.index_documents(args.docs, args.batch_limit, args.chunk_size)


