Liqun/doc retriever role (#334)
Re-implement document retriever as a role from the previous plugin.
ShilinHe authored May 10, 2024
2 parents cc3368a + 0a5fe91 commit 3207520
Showing 12 changed files with 220 additions and 162 deletions.
62 changes: 0 additions & 62 deletions project/plugins/README.md
@@ -24,65 +24,3 @@ Finally, we use the SQL query to pull data from the sqlite database.
Because we need to generate the SQL query, we need access to a GPT model.
So, you need to configure the GPT model (similar to configuring the main project) in the plugin configuration file `sql_pull_data.yaml`.


## vision_web_explorer
This plugin has been re-implemented as a role in the `taskweaver/ext_role/web_explorer` directory.

[Plugin Demo](https://github.com/microsoft/TaskWeaver/assets/7489260/7f819524-2c5b-46a8-9c0c-e001a2c7131b)

## web_search

This plugin has been re-implemented as a role in the `taskweaver/ext_role/web_search` directory.

A video demo that uses web search to find information and then completes the task based on the retrieved information:

[Plugin Demo](https://github.com/microsoft/TaskWeaver/assets/7489260/d078a05b-a19b-498c-b712-6f8c4855cefa)


## document_retriever

This plugin is **not** enabled by default. To use it, enable it in the plugin configuration file `document_retriever.yaml`.
The plugin loads a previously indexed document collection and retrieves the top-k documents based on a natural language query.
In the same `document_retriever.yaml` file, you also need to configure the path to the folder containing the index files.
A pre-built sample index is provided in the `project/sample_data/knowledge_base` folder; it covers all TaskWeaver documents under the `website/docs` folder.

To build your own index, we provide the script `scripts/document_indexer.py`.
You can run the following command to build the index:
```bash
python scripts/document_indexer.py \
--doc_path project/sample_data/knowledge_base/website/docs \
--output_path project/sample_data/knowledge_base/index
```
Please take a look at the import section of the script to install the required Python packages.
Two parameters, `--chunk_step` and `--chunk_size`, control how the documents are chunked:
`--chunk_step` is the step size of the sliding window and `--chunk_size` is the size of the sliding window.
The default values are `--chunk_step=64` and `--chunk_size=64`.
The size is measured in tokens, using a tokenizer based on an OpenAI GPT model (i.e., `gpt-3.5-turbo`).
We intentionally split the documents into such small chunks because small chunks are easier to match against a query, which improves retrieval accuracy.
Make sure you understand the consequences of changing these two parameters before you change them, for example by experimenting with different values on your dataset.

The retrieval is based on FAISS; you can find more details about FAISS [here](https://ai.meta.com/tools/faiss/).
FAISS is a library for efficient similarity search over dense vectors.
In our implementation, we use the wrapper class provided by LangChain to call FAISS.
The embeddings of the documents and the query are computed with Hugging Face's Sentence Transformers.

The retrieved document chunks are presented in the following format:
```json
{
  "chunk": "The chunk of the document",
  "metadata": {
    "source": "str, the path to the document",
    "title": "str, the title of the document",
    "chunk_id": "integer, the id of the chunk inside the document"
  }
}
```
The title in the metadata is heuristically inferred from the file content.
The chunk_id is the id of the chunk inside the document.
Neighboring chunks in the same document have consecutive chunk ids, so we can locate the previous and next chunks of any retrieved chunk.
In our implementation, we expand the retrieved chunks to include the previous and next chunks in the same document.
Recall that the raw chunk size is only 64 tokens; the expanded chunk size is 256 tokens by default.


39 changes: 0 additions & 39 deletions project/plugins/document_retriever.yaml

This file was deleted.

Binary file modified project/sample_data/knowledge_base/chunk_id_to_index.pkl
Binary file modified project/sample_data/knowledge_base/index.faiss
Binary file modified project/sample_data/knowledge_base/index.pkl
97 changes: 59 additions & 38 deletions scripts/document_indexer.py
@@ -5,7 +5,7 @@
import pickle
import re
import traceback
from typing import Dict, List, Literal, Tuple
from typing import Dict, List, Literal, Optional, Tuple

try:
import tiktoken
@@ -169,6 +169,12 @@ def text_parser(
soup = None
supported_extensions = ["md", "markdown", "html", "htm", "txt", "json", "jsonl"]
other_extensions = ["docx", "pptx", "pdf", "csv"]
if extension not in supported_extensions + other_extensions:
print(
f"Unsupported file extension: {extension}. "
f"The supported extensions are {supported_extensions}",
)
return title, ""

# utf-8-sig treats the BOM header as file metadata, not as part of the file content
default_encoding = "utf-8-sig"
@@ -218,13 +224,14 @@


def chunk_document(
doc_path: str,
doc_paths: List[str],
chunk_size: int,
chunk_step: int,
extensions: Optional[List[str]] = None,
) -> Tuple[int, List[str], List[Dict[str, str]], Dict[str, int]]:
"""
Split documents into chunks
:param doc_path: the path of the documents
:param doc_paths: the paths of the documents
:param chunk_size: the size of the chunk
:param chunk_step: the step size of the chunk
"""
@@ -237,39 +244,43 @@ def chunk_document(

# traverse all files under dir
print("Split documents into chunks...")
for root, dirs, files in os.walk(doc_path):
for name in files:
f = os.path.join(root, name)
print(f"Reading {f}")
try:
title, content = text_parser(f)
file_count += 1
if file_count % 100 == 0:
print(f"{file_count} files read.")

if len(content) == 0:
for doc_path in doc_paths:
for root, dirs, files in os.walk(doc_path):
for name in files:
extension = name.split(".")[-1]
if extensions is not None and extension not in extensions:
continue

chunks = chunk_str_overlap(
content.strip(),
num_tokens=chunk_size,
step_tokens=chunk_step,
separator="\n",
encoding=enc,
)
source = os.path.sep.join(f.split(os.path.sep)[4:])
for i in range(len(chunks)):
# custom metadata if needed
metadata = {
"source": source,
"title": title,
"chunk_id": i,
}
chunk_id_to_index[f"{source}_{i}"] = len(texts) + i
metadata_list.append(metadata)
texts.extend(chunks)
except Exception as e:
print(f"Error encountered when reading {f}: {traceback.format_exc()} {e}")
f = os.path.join(root, name)
print(f"Reading {f}")
try:
title, content = text_parser(f)
file_count += 1
if file_count % 100 == 0:
print(f"{file_count} files read.")

if len(content) == 0:
continue

chunks = chunk_str_overlap(
content.strip(),
num_tokens=chunk_size,
step_tokens=chunk_step,
separator="\n",
encoding=enc,
)
source = os.path.sep.join(f.split(os.path.sep)[4:])
for i in range(len(chunks)):
# custom metadata if needed
metadata = {
"source": source,
"title": title,
"chunk_id": i,
}
chunk_id_to_index[f"{source}_{i}"] = len(texts) + i
metadata_list.append(metadata)
texts.extend(chunks)
except Exception as e:
print(f"Error encountered when reading {f}: {traceback.format_exc()} {e}")
return file_count, texts, metadata_list, chunk_id_to_index


@@ -278,10 +289,11 @@ def chunk_document(
parser = argparse.ArgumentParser()
parser.add_argument(
"-d",
"--doc_path",
"--doc_paths",
help="the path of the documents",
type=str,
default="",
nargs="+",
default=".",
)
parser.add_argument(
"-c",
@@ -304,12 +316,21 @@
type=str,
default="",
)
parser.add_argument(
"-e",
"--extensions",
help="the extensions of the files",
type=str,
nargs="+",
default=None,
)
args = parser.parse_args()

file_count, texts, metadata_list, chunk_id_to_index = chunk_document(
doc_path=args.doc_path,
doc_paths=args.doc_paths,
chunk_size=args.chunk_size,
chunk_step=args.chunk_step,
extensions=args.extensions,
)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_texts(
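
For reference, here is a minimal sketch of how the updated `chunk_document` signature shown above might be invoked; the `doc_paths` and `extensions` values are illustrative, mirroring the command in the new README below.

```python
# Hypothetical invocation of the updated chunk_document signature; the paths and
# extensions are illustrative values, not a required configuration.
file_count, texts, metadata_list, chunk_id_to_index = chunk_document(
    doc_paths=["website/docs", "website/blog"],
    chunk_size=64,
    chunk_step=64,
    extensions=["md"],
)
print(f"Read {file_count} files and produced {len(texts)} chunks.")
```
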
53 changes: 53 additions & 0 deletions taskweaver/ext_role/document_retriever/README.md
@@ -0,0 +1,53 @@
This role loads a previously indexed document collection and retrieves the top-k documents based on a natural language query.
To enable the role, you need to configure the path to the folder containing the index files in the project configuration file `project/taskweaver_config.json`.
In addition, you need to add `document_retriever` to the `session.roles` list in the same configuration file.
A pre-built sample index, covering all TaskWeaver documents, is provided under the `project/sample_data/knowledge_base` folder.
So, an example configuration is as follows:
```json
{
  "session.roles": ["document_retriever", "planner", "code_interpreter"],
  "document_retriever.index_folder": "/path/to/TaskWeaver/project/sample_data/knowledge_base"
}
```

To build your own index, we provide the script `scripts/document_indexer.py`.
You can run the following command to build the index:
```bash
cd TaskWeaver
python scripts/document_indexer.py \
--doc_paths website/docs website/blog \
--output_path project/sample_data/knowledge_base \
--extensions md
```
Please take a look at the import section of the script to install the required Python packages.
Two parameters, `--chunk_step` and `--chunk_size`, control how the documents are chunked:
`--chunk_step` is the step size of the sliding window and `--chunk_size` is the size of the sliding window.
The default values are `--chunk_step=64` and `--chunk_size=64`.
The size is measured in tokens, using a tokenizer based on an OpenAI GPT model (i.e., `gpt-3.5-turbo`).
We intentionally split the documents into such small chunks because small chunks are easier to match against a query, which improves retrieval accuracy.
Make sure you understand the consequences of changing these two parameters before you change them, for example by experimenting with different values on your dataset.
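
To make the chunking behavior concrete, here is a simplified sketch of token-based sliding-window chunking. It assumes `tiktoken` is installed and is only an illustration of the idea; the actual logic lives in `chunk_str_overlap` in `scripts/document_indexer.py`, which additionally splits on a separator.

```python
# Simplified illustration of sliding-window chunking by token count.
from typing import List

import tiktoken


def sliding_window_chunks(text: str, chunk_size: int = 64, chunk_step: int = 64) -> List[str]:
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = enc.encode(text)
    chunks = []
    for start in range(0, len(tokens), chunk_step):
        window = tokens[start : start + chunk_size]
        if not window:
            break
        chunks.append(enc.decode(window))
    return chunks

# With the defaults (chunk_step == chunk_size == 64) the windows do not overlap;
# a smaller chunk_step makes consecutive chunks share tokens.
```
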

The retrieval is based on FAISS; you can find more details about FAISS [here](https://ai.meta.com/tools/faiss/).
FAISS is a library for efficient similarity search over dense vectors.
In our implementation, we use the wrapper class provided by LangChain to call FAISS.
The embeddings of the documents and the query are computed with Hugging Face's Sentence Transformers.
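
As a rough sketch (not the role's actual code), loading and querying the index through LangChain's FAISS wrapper could look like the following. Import paths and the `load_local` signature vary across LangChain versions (newer releases use `langchain_community.*` and may require `allow_dangerous_deserialization=True`), and the index folder path is taken from the sample configuration above.

```python
# A rough sketch of querying the pre-built index via LangChain's FAISS wrapper.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.load_local("project/sample_data/knowledge_base", embeddings)

# Retrieve the top-k document chunks for a natural language query.
docs = vectorstore.similarity_search("How do I configure the embedding model?", k=5)
for doc in docs:
    print(doc.metadata["source"], doc.metadata["chunk_id"])
    print(doc.page_content)
```
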

The retrieved document chunks are presented in the following format:
```json
{
  "chunk": "The chunk of the document",
  "metadata": {
    "source": "str, the path to the document",
    "title": "str, the title of the document",
    "chunk_id": "integer, the id of the chunk inside the document"
  }
}
```
The title in the metadata is heuristically inferred from the file content.
The chunk_id is the id of the chunk inside the document.
Neighboring chunks in the same document have consecutive chunk ids, so we can locate the previous and next chunks of any retrieved chunk.
In our implementation, we expand the retrieved chunks to include the previous and next chunks in the same document.
Recall that the raw chunk size is only 64 tokens; the expanded chunk size is 256 tokens by default.
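
The expansion step can be pictured with the following hedged sketch. It assumes you keep the `chunk_id_to_index` mapping produced at indexing time (persisted as `chunk_id_to_index.pkl`, with keys of the form `f"{source}_{chunk_id}"`) plus a list of chunk texts; the role's actual implementation may differ.

```python
# Hedged sketch of chunk expansion: given a retrieved chunk, also gather its
# neighbors in the same document via the chunk_id_to_index mapping.
import pickle
from typing import Dict, List

with open("project/sample_data/knowledge_base/chunk_id_to_index.pkl", "rb") as f:
    chunk_id_to_index: Dict[str, int] = pickle.load(f)


def expand_chunk(source: str, chunk_id: int, texts: List[str], window: int = 1) -> str:
    # Collect the previous, current, and next chunks that exist for this document.
    pieces = []
    for cid in range(chunk_id - window, chunk_id + window + 1):
        idx = chunk_id_to_index.get(f"{source}_{cid}")
        if idx is not None:
            pieces.append(texts[idx])
    return "\n".join(pieces)
```
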

Empty file.
