Liqun/doc retriever role (#334)
Re-implement document retriever as a role from the previous plugin.
ShilinHe authored May 10, 2024
2 parents cc3368a + 0a5fe91 commit 3207520
Showing 12 changed files with 220 additions and 162 deletions.
62 changes: 0 additions & 62 deletions project/plugins/README.md
@@ -24,65 +24,3 @@ Finally, we use the SQL query to pull data from the sqlite database.
Because we need to generate the SQL query, we need access to a GPT model.
So, you need to configure the GPT model (similar to configuring the main project) in the plugin configuration file `sql_pull_data.yaml`.


## vision_web_explorer
This plugin has been re-implemented as a role in the `taskweaver/ext_role/web_explorer` directory.

[Plugin Demo](https://github.com/microsoft/TaskWeaver/assets/7489260/7f819524-2c5b-46a8-9c0c-e001a2c7131b)

## web_search

This plugin has been re-implemented as a role in the `taskweaver/ext_role/web_search` directory.

A video demo that uses web search to find information and then completes the task based on the retrieved information:

[Plugin Demo](https://github.com/microsoft/TaskWeaver/assets/7489260/d078a05b-a19b-498c-b712-6f8c4855cefa)


## document_retriever

This plugin is **not** enabled by default. To use it, enable it in the plugin configuration file `document_retriever.yaml`.
The plugin loads a previously indexed document collection and retrieves the top-k documents based on a natural language query.
In the same `document_retriever.yaml` file, you also need to configure the path to the folder containing the index files.
A pre-built sample index is provided in the `project/sample_data/knowledge_base` folder; it covers all TaskWeaver documents under the `website/docs` folder.

To build your own index, we provide the script `scripts/document_indexer.py`.
You can run the following command to build the index:
```bash
python scripts/document_indexer.py \
--doc_path project/sample_data/knowledge_base/website/docs \
--output_path project/sample_data/knowledge_base/index
```
Please take a look at the import section of the script to install the required Python packages.
Two parameters, `--chunk_step` and `--chunk_size`, control how the documents are chunked:
`--chunk_step` is the step size of the sliding window and `--chunk_size` is the size of the sliding window.
The default values are `--chunk_step=64` and `--chunk_size=64`.
The size is measured in tokens, using a tokenizer based on an OpenAI GPT model (i.e., `gpt-3.5-turbo`).
We intentionally split the documents into such small chunks because small chunks are easier to match against a query, which improves retrieval accuracy.
Make sure you understand the consequences of changing these two parameters before you change them, for example by experimenting with different values on your dataset.

The retrieval is based on FAISS; you can find more details about FAISS [here](https://ai.meta.com/tools/faiss/).
FAISS is a library for efficient similarity search over dense vectors.
In our implementation, we use the wrapper class provided by LangChain to call FAISS.
The embeddings of the documents and the query are computed with Hugging Face's Sentence Transformers.

The retrieved document chunks are presented in the following format:
```json
{
  "chunk": "The chunk of the document",
  "metadata": {
    "source": "str, the path to the document",
    "title": "str, the title of the document",
    "chunk_id": "integer, the id of the chunk inside the document"
  }
}
```
The title in the metadata is heuristically inferred from the file content.
The chunk_id is the id of the chunk inside the document.
Neighboring chunks in the same document have consecutive chunk ids, so we can locate the previous and next chunks of any retrieved chunk.
In our implementation, we expand the retrieved chunks to include the previous and next chunks in the same document.
Recall that the raw chunk size is only 64 tokens; the expanded chunk size is 256 tokens by default.


39 changes: 0 additions & 39 deletions project/plugins/document_retriever.yaml

This file was deleted.

Binary file modified project/sample_data/knowledge_base/chunk_id_to_index.pkl
Binary file modified project/sample_data/knowledge_base/index.faiss
Binary file modified project/sample_data/knowledge_base/index.pkl
97 changes: 59 additions & 38 deletions scripts/document_indexer.py
@@ -5,7 +5,7 @@
import pickle
import re
import traceback
from typing import Dict, List, Literal, Tuple
from typing import Dict, List, Literal, Optional, Tuple

try:
import tiktoken
@@ -169,6 +169,12 @@ def text_parser(
soup = None
supported_extensions = ["md", "markdown", "html", "htm", "txt", "json", "jsonl"]
other_extensions = ["docx", "pptx", "pdf", "csv"]
if extension not in supported_extensions + other_extensions:
print(
f"Unsupported file extension: {extension}. "
f"The supported extensions are {supported_extensions}",
)
return title, ""

# utf-8-sig treats the BOM header as file metadata, not as part of the file content
default_encoding = "utf-8-sig"
@@ -218,13 +224,14 @@


def chunk_document(
doc_path: str,
doc_paths: List[str],
chunk_size: int,
chunk_step: int,
extensions: Optional[List[str]] = None,
) -> Tuple[int, List[str], List[Dict[str, str]], Dict[str, int]]:
"""
Split documents into chunks
:param doc_path: the path of the documents
:param doc_paths: the paths of the documents
:param chunk_size: the size of the chunk
:param chunk_step: the step size of the chunk
"""
@@ -237,39 +244,43 @@ def chunk_document(

# traverse all files under dir
print("Split documents into chunks...")
for root, dirs, files in os.walk(doc_path):
for name in files:
f = os.path.join(root, name)
print(f"Reading {f}")
try:
title, content = text_parser(f)
file_count += 1
if file_count % 100 == 0:
print(f"{file_count} files read.")

if len(content) == 0:
for doc_path in doc_paths:
for root, dirs, files in os.walk(doc_path):
for name in files:
extension = name.split(".")[-1]
if extensions is not None and extension not in extensions:
continue

chunks = chunk_str_overlap(
content.strip(),
num_tokens=chunk_size,
step_tokens=chunk_step,
separator="\n",
encoding=enc,
)
source = os.path.sep.join(f.split(os.path.sep)[4:])
for i in range(len(chunks)):
# custom metadata if needed
metadata = {
"source": source,
"title": title,
"chunk_id": i,
}
chunk_id_to_index[f"{source}_{i}"] = len(texts) + i
metadata_list.append(metadata)
texts.extend(chunks)
except Exception as e:
print(f"Error encountered when reading {f}: {traceback.format_exc()} {e}")
f = os.path.join(root, name)
print(f"Reading {f}")
try:
title, content = text_parser(f)
file_count += 1
if file_count % 100 == 0:
print(f"{file_count} files read.")

if len(content) == 0:
continue

chunks = chunk_str_overlap(
content.strip(),
num_tokens=chunk_size,
step_tokens=chunk_step,
separator="\n",
encoding=enc,
)
source = os.path.sep.join(f.split(os.path.sep)[4:])
for i in range(len(chunks)):
# custom metadata if needed
metadata = {
"source": source,
"title": title,
"chunk_id": i,
}
chunk_id_to_index[f"{source}_{i}"] = len(texts) + i
metadata_list.append(metadata)
texts.extend(chunks)
except Exception as e:
print(f"Error encountered when reading {f}: {traceback.format_exc()} {e}")
return file_count, texts, metadata_list, chunk_id_to_index


@@ -278,10 +289,11 @@ def chunk_document(
parser = argparse.ArgumentParser()
parser.add_argument(
"-d",
"--doc_path",
"--doc_paths",
help="the path of the documents",
type=str,
default="",
nargs="+",
default=".",
)
parser.add_argument(
"-c",
@@ -304,12 +316,21 @@
type=str,
default="",
)
parser.add_argument(
"-e",
"--extensions",
help="the extensions of the files",
type=str,
nargs="+",
default=None,
)
args = parser.parse_args()

file_count, texts, metadata_list, chunk_id_to_index = chunk_document(
doc_path=args.doc_path,
doc_paths=args.doc_paths,
chunk_size=args.chunk_size,
chunk_step=args.chunk_step,
extensions=args.extensions,
)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_texts(
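
For reference, here is a minimal sketch of how the updated `chunk_document` signature shown above might be invoked; the `doc_paths` and `extensions` values are illustrative, mirroring the command in the new README below.

```python
# Hypothetical invocation of the updated chunk_document signature; the paths and
# extensions are illustrative values, not a required configuration.
file_count, texts, metadata_list, chunk_id_to_index = chunk_document(
    doc_paths=["website/docs", "website/blog"],
    chunk_size=64,
    chunk_step=64,
    extensions=["md"],
)
print(f"Read {file_count} files and produced {len(texts)} chunks.")
```
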
53 changes: 53 additions & 0 deletions taskweaver/ext_role/document_retriever/README.md
@@ -0,0 +1,53 @@
This role loads a previously indexed document collection and retrieves the top-k documents based on a natural language query.
To enable the role, you need to configure the path to the folder containing the index files in the project configuration file `project/taskweaver_config.json`.
In addition, you need to add `document_retriever` to the `session.roles` list in the same configuration file.
A pre-built sample index, covering all TaskWeaver documents, is provided under the `project/sample_data/knowledge_base` folder.
So, an example configuration is as follows:
```json
{
  "session.roles": ["document_retriever", "planner", "code_interpreter"],
  "document_retriever.index_folder": "/path/to/TaskWeaver/project/sample_data/knowledge_base"
}
```

To build your own index, we provide the script `scripts/document_indexer.py`.
You can run the following command to build the index:
```bash
cd TaskWeaver
python scripts/document_indexer.py \
--doc_paths website/docs website/blog \
--output_path project/sample_data/knowledge_base \
--extensions md
```
Please take a look at the import section of the script to install the required Python packages.
Two parameters, `--chunk_step` and `--chunk_size`, control how the documents are chunked:
`--chunk_step` is the step size of the sliding window and `--chunk_size` is the size of the sliding window.
The default values are `--chunk_step=64` and `--chunk_size=64`.
The size is measured in tokens, using a tokenizer based on an OpenAI GPT model (i.e., `gpt-3.5-turbo`).
We intentionally split the documents into such small chunks because small chunks are easier to match against a query, which improves retrieval accuracy.
Make sure you understand the consequences of changing these two parameters before you change them, for example by experimenting with different values on your dataset.
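
To make the chunking behavior concrete, here is a simplified sketch of token-based sliding-window chunking. It assumes `tiktoken` is installed and is only an illustration of the idea; the actual logic lives in `chunk_str_overlap` in `scripts/document_indexer.py`, which additionally splits on a separator.

```python
# Simplified illustration of sliding-window chunking by token count.
from typing import List

import tiktoken


def sliding_window_chunks(text: str, chunk_size: int = 64, chunk_step: int = 64) -> List[str]:
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = enc.encode(text)
    chunks = []
    for start in range(0, len(tokens), chunk_step):
        window = tokens[start : start + chunk_size]
        if not window:
            break
        chunks.append(enc.decode(window))
    return chunks

# With the defaults (chunk_step == chunk_size == 64) the windows do not overlap;
# a smaller chunk_step makes consecutive chunks share tokens.
```
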

The retrieval is based on FAISS; you can find more details about FAISS [here](https://ai.meta.com/tools/faiss/).
FAISS is a library for efficient similarity search over dense vectors.
In our implementation, we use the wrapper class provided by LangChain to call FAISS.
The embeddings of the documents and the query are computed with Hugging Face's Sentence Transformers.
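
As a rough sketch (not the role's actual code), loading and querying the index through LangChain's FAISS wrapper could look like the following. Import paths and the `load_local` signature vary across LangChain versions (newer releases use `langchain_community.*` and may require `allow_dangerous_deserialization=True`), and the index folder path is taken from the sample configuration above.

```python
# A rough sketch of querying the pre-built index via LangChain's FAISS wrapper.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.load_local("project/sample_data/knowledge_base", embeddings)

# Retrieve the top-k document chunks for a natural language query.
docs = vectorstore.similarity_search("How do I configure the embedding model?", k=5)
for doc in docs:
    print(doc.metadata["source"], doc.metadata["chunk_id"])
    print(doc.page_content)
```
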

The retrieved document chunks are presented in the following format:
```json
{
  "chunk": "The chunk of the document",
  "metadata": {
    "source": "str, the path to the document",
    "title": "str, the title of the document",
    "chunk_id": "integer, the id of the chunk inside the document"
  }
}
```
The title in the metadata is heuristically inferred from the file content.
The chunk_id is the id of the chunk inside the document.
Neighboring chunks in the same document have consecutive chunk ids, so we can locate the previous and next chunks of any retrieved chunk.
In our implementation, we expand the retrieved chunks to include the previous and next chunks in the same document.
Recall that the raw chunk size is only 64 tokens; the expanded chunk size is 256 tokens by default.
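
The expansion step can be pictured with the following hedged sketch. It assumes you keep the `chunk_id_to_index` mapping produced at indexing time (persisted as `chunk_id_to_index.pkl`, with keys of the form `f"{source}_{chunk_id}"`) plus a list of chunk texts; the role's actual implementation may differ.

```python
# Hedged sketch of chunk expansion: given a retrieved chunk, also gather its
# neighbors in the same document via the chunk_id_to_index mapping.
import pickle
from typing import Dict, List

with open("project/sample_data/knowledge_base/chunk_id_to_index.pkl", "rb") as f:
    chunk_id_to_index: Dict[str, int] = pickle.load(f)


def expand_chunk(source: str, chunk_id: int, texts: List[str], window: int = 1) -> str:
    # Collect the previous, current, and next chunks that exist for this document.
    pieces = []
    for cid in range(chunk_id - window, chunk_id + window + 1):
        idx = chunk_id_to_index.get(f"{source}_{cid}")
        if idx is not None:
            pieces.append(texts[idx])
    return "\n".join(pieces)
```
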

Empty file.
