codelittinc/tasketeer-nlp-processor

Tasketeer NLP

Tasketeer NLP is a microservice written in Python that serves as the NLP (Natural Language Processing) processor for the Tasketeer application. Tasketeer is a bot that runs on Slack and helps users quickly find information within their private documents. When a user asks the bot a question, it searches the indexed files and retrieves the relevant information.

For more information about Tasketeer, please visit the Tasketeer GitHub repository.

Requirements

  • Docker and Docker Compose

Service Characteristics

  1. Written in Python:
    The Tasketeer NLP microservice is implemented using the Python programming language, leveraging its rich ecosystem of libraries and frameworks for natural language processing.

  2. Uses MongoDB to store data:
    The microservice utilizes MongoDB, a popular NoSQL database, to store and manage the indexed file contents and associated metadata. MongoDB's flexible document model and scalability make it suitable for storing and retrieving large volumes of data efficiently.

  3. Uses Pinecone to store vector data:
    Pinecone is employed as the storage engine for vector data. Pinecone is a vector database that allows fast indexing and similarity search over high-dimensional vectors, making it well-suited for storing and querying the vectorized data generated by the OpenAI embeddings API.

  4. Uses OpenAI embeddings API to vectorize the data from the documents:
    The microservice leverages the OpenAI embeddings API to transform the text data from the documents into high-dimensional vector representations. These vectors capture semantic similarities and relationships between different documents, enabling efficient search and retrieval based on context.

  5. Uses LangChain to coordinate everything needed for OpenAI and Pinecone:
    LangChain serves as the coordination layer between the OpenAI embeddings API and Pinecone. It provides a single interface for orchestrating the data processing pipeline, handling communication between the microservice, OpenAI, and Pinecone.

  6. Requires an Authorization header:
    To protect sensitive information, the Tasketeer NLP microservice requires an Authorization header on requests to the API. The value of the header must match the AUTHORIZATION environment variable; this is how the service authenticates callers.
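The header check described in item 6 can be sketched as follows. This is a minimal illustration, not the service's actual code: the function name `is_authorized` and the plain-dict `headers` argument are assumptions, and only the `AUTHORIZATION` environment variable name comes from this README.

```python
import hmac
import os


def is_authorized(headers, expected=None):
    """Return True when the Authorization header matches the expected secret.

    `headers` is a plain dict here (an assumption); a web framework would
    supply its own header mapping.
    """
    if expected is None:
        # The README names AUTHORIZATION as the variable holding the secret.
        expected = os.environ.get("AUTHORIZATION", "")
    supplied = headers.get("Authorization", "")
    # Reject when no secret is configured, so an empty header never passes.
    if not expected:
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))
```

Comparing with `hmac.compare_digest` instead of `==` is a design choice of this sketch; the README does not say how the service performs the comparison.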

How to Run the Project

Make sure to update the environment file .env with your environment variables.

The bin/dev bash script builds the necessary containers. Run sh bin/dev to start the Docker container and open a bash session inside it. From that session:

  • Install Python libraries and dependencies
pip install -r requirements.txt
  • Run the server
python3 app.py

API Endpoints Overview

Health Endpoint

Checks whether the server is running as expected.

curl --location 'http://localhost:8080/health'

Response:

{
    "datetime": "Fri, 10 Mar 2023 17:37:36 GMT",
    "success": "true"
}
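For callers that prefer Python over curl, the same check might look like the sketch below. The helper names `parse_health` and `check_health` are hypothetical, and the base URL is assumed to be the one used in the curl example. Note that the sample response encodes `success` as the string `"true"`, so the parser accepts either the string or a JSON boolean.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: same host and port as the curl example


def parse_health(body):
    """Return True when a /health response body reports success."""
    payload = json.loads(body)
    # The sample response uses the string "true"; a JSON boolean is accepted too.
    return payload.get("success") in ("true", True)


def check_health(base_url=BASE_URL, timeout=5):
    """Call GET /health on a running server and report whether it is healthy."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return parse_health(resp.read().decode("utf-8"))
```

`check_health()` needs a running server; `parse_health` can be exercised on a saved response body alone.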

Index Content

Stores file contents for an organization so that they can later be searched by context.

curl --location 'http://localhost:8080/contents' \
--header 'Content-Type: application/json' \
--data '{
    "organization": "codelitt",
    "content": "Travel Expense Report Process on Codelitt: You need to make sure to include: 1) the business purpose of the trip, 2) dates traveled, and 3) the client’s information and details (if applicable). You need to make sure to include all receipts or documents related to the expense for our review. Business trip expense reports need to be submitted to Cody, cc Mary no more than a week after traveling. When submitting your report and receipts, please make sure they are in PDF format and email them in a .Zip file."
}'

Response:

{
    "Document_ID": "640a0bb44789ef2014f53513",
    "Status": "Successfully Inserted"
}
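A Python equivalent of the request above could be built as in this sketch. The function name `build_index_request` and the optional `token` parameter are hypothetical; the curl example does not send an Authorization header, so the sketch only adds one when a token is supplied, on the assumption that a deployment enforcing the header would need it.

```python
import json
import urllib.request


def build_index_request(organization, content, token=None,
                        base_url="http://localhost:8080"):
    """Build (but do not send) a POST /contents request."""
    body = json.dumps({"organization": organization, "content": content}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token is not None:
        # Assumption: the value must match the server's AUTHORIZATION variable.
        headers["Authorization"] = token
    return urllib.request.Request(
        f"{base_url}/contents", data=body, headers=headers, method="POST"
    )
```

To send it against a running server, pass the result to `urllib.request.urlopen` and decode the JSON body, which should carry the `Document_ID` and `Status` fields shown above.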

Search Indexed Content

Searches by context, answering from the files already uploaded for the organization or from the OpenAI model's general knowledge.

curl --location 'http://localhost:8080/search?organization=codelitt&q=explain%20the%20Travel%20Expense%20Report%20Process%20on%20Codelitt'

Response:

{
    "response": {
        "extra_info": null,
        "response": "\nThe Travel Expense Report Process on Codelitt requires that the employee submit a report with the business purpose of the trip, dates traveled, and client information (if applicable). All receipts and documents related to the expense must be included and submitted to Cody, cc Mary, no more than a week after traveling. The report and receipts must be in PDF format and emailed in a .Zip file.",
        "source_nodes": [
            {
                "doc_id": "c76a1959-f620-406d-bfe1-258e2ac3481f",
                "extra_info": null,
                "node_info": null,
                "similarity": null,
                "source_text": "Travel Expense Report Process on Codelitt: You need to make sure to include: 1) the business purpose of the trip, 2) dates traveled, and 3) the client’s information and details (if applicable). You need to make sure to include all receipts or documents related to the expense for our review. Business trip expense reports need to be submitted to Cody, cc Mary no more than a week after traveling. When submitting your report and receipts, please make sure they are in PDF format and email them in a .Zip file."
            }
        ]
    }
}
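The search call and its nested response can be handled from Python roughly as follows. Both helper names are hypothetical, and the field names (`response`, `source_nodes`, `source_text`) are taken from the sample response above rather than from a published schema.

```python
import json
from urllib.parse import quote, urlencode


def build_search_url(organization, question, base_url="http://localhost:8080"):
    """Build the GET /search URL with a percent-encoded question."""
    # quote_via=quote encodes spaces as %20, matching the curl example.
    query = urlencode({"organization": organization, "q": question}, quote_via=quote)
    return f"{base_url}/search?{query}"


def extract_answer(body):
    """Return (answer, source_texts) from a /search response body."""
    inner = json.loads(body).get("response") or {}
    # The sample answer starts with a newline, so strip surrounding whitespace.
    answer = (inner.get("response") or "").strip()
    sources = [node.get("source_text") for node in inner.get("source_nodes") or []]
    return answer, sources
```

The `source_nodes` list shows which indexed text the answer was drawn from, which is useful when checking an answer against the original document.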