Updated: 2024.09.30
Create an information retrieval system whose effectiveness at retrieving information can be measured.
This project will be a graph-based information retrieval system that uses Neo4J, Python, and the Natural Language Toolkit (NLTK) to map content and retrieve information. The system will focus on measuring performance using an F-score. You can find my explanation of the project and its progress for the Hackathon 2024.
This exercise loads a list of repositories and crawls each repo to find 'toc.yml' files. It then feeds each toc.yml file to a function that graphs the TOC. A routine then produces Cypher files that can be loaded into Neo4j.
The output of the graphing section is format agnostic, and could accommodate different graph formats.
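As a rough illustration of the graphing step, the sketch below flattens an already-parsed TOC into plain node and edge lists, which any output format could then consume. It assumes the usual DocFX layout (entries with `name`, an optional `href`, and nested `items`); the function name `toc_to_graph` and the integer IDs are hypothetical and not taken from this project's code.

```python
from itertools import count


def toc_to_graph(toc_items, root_name="root"):
    """Flatten a nested TOC into (nodes, edges).

    Assumption: each entry is a dict with "name", an optional "href",
    and optional child entries under "items".
    """
    ids = count(1)  # node 0 is reserved for the synthetic root
    nodes = [{"id": 0, "name": root_name, "href": None}]
    edges = []

    def walk(items, parent_id):
        for item in items or []:
            node_id = next(ids)
            nodes.append(
                {"id": node_id, "name": item.get("name"), "href": item.get("href")}
            )
            edges.append((parent_id, node_id))  # parent -> child
            walk(item.get("items"), node_id)

    walk(toc_items, 0)
    return nodes, edges


sample_toc = [
    {"name": "Overview", "href": "overview.md"},
    {"name": "Concepts", "items": [{"name": "Billing", "href": "billing.md"}]},
]
nodes, edges = toc_to_graph(sample_toc)
print(len(nodes), edges)
```

Keeping the result as plain lists is one way to stay format agnostic: a Cypher writer or a CSV writer can each iterate over the same two lists.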
markdownvalidator: https://github.com/mattbriggs/markdown-validator. I am using this as a local package.
The system starts with: tocgrapher.py
- Update jobtoc.yml with the repos. Here is an example of jobtoc.yml:

      output: "C:\\data\\tocgraphs\\"
      type: "neo4j"
      limit: 1000
      folders:
      - folder: "C:\\git\\ms\\azure-docs-pr\\articles\\"

  | Property | Value | Description |
  | --- | --- | --- |
  | output | file path (escaped virgule) | Output directory where the logs are stored and, for formats that write files, where the output is placed. |
  | type | Enum | neo4j: connects to a Neo4j graph database and loads the graph. csv: drops each TOC graph as a node/edge pair of files into the output folder. |
  | limit | number | Limits the number of TOCs. Nothing will happen if you type 0. |
  | folders | array | A list of file paths (escaped virgule) to repositories to scan for toc.yml files. |

- Update working/fowler.yml with Neo4j credentials. Here is an example of working/fowler.yml:

      ---
      username: <username>
      password: <token>
      domain: <neo url>

- Type: tocgrapher.py
- Type: tockeywords.py
- Type: toctaxonomy.py
This script calculates the F-score for an information retrieval system using a Neo4j database. It reads configuration data and Cypher queries from a queries.yml file.
- Initialization: Establishes a connection to the Neo4j database using the credentials.
- Query Execution: Runs Cypher queries against the database based on provided terms, retrieving relevant content IDs.
- F-Score Calculation: Compares the retrieved IDs with expected (golden) results to compute precision, recall, and F-score metrics.
- Report Generation: Outputs a summary of the results to f_score_report.txt.
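The F-score step above can be sketched as follows. This is a minimal illustration of the standard precision, recall, and F-beta formulas applied to two sets of content IDs; the function name `f_score` and the `beta` parameter are assumptions for the example, not the script's actual interface.

```python
def f_score(retrieved_ids, golden_ids, beta=1.0):
    """Return (precision, recall, f_beta) for retrieved vs. golden IDs."""
    retrieved, golden = set(retrieved_ids), set(golden_ids)
    true_positives = len(retrieved & golden)

    # Guard against empty sets so we never divide by zero.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(golden) if golden else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0

    beta_sq = beta ** 2
    score = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return precision, recall, score


# Two of the four retrieved IDs are in the golden set of three.
precision, recall, score = f_score(["a", "b", "c", "d"], ["a", "b", "e"])
print(f"precision={precision:.3f} recall={recall:.3f} f1={score:.3f}")
```

With `beta=1.0` this is the balanced F1 score; a larger `beta` weights recall more heavily, which may suit a system where missing a relevant document costs more than returning an extra one.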
- Prepare working/fowler.yml with Neo4j credentials.
- Create queries.yml with queries and golden_queries mappings.
- Run the script: python script_name.py.
- Check the generated report at output/f_score_report.txt.
Used for evaluating and fine-tuning search queries in an information retrieval system.
This script queries a Neo4j database to retrieve a hierarchical structure of categories and terms starting from a given root node, then outputs it as a formatted text file. It uses fowler.yml to load Neo4j credentials and connects to the database to run a Cypher query that captures category and term relationships.
- Initialize Connection: Establishes a connection to the Neo4j database.
- Query Execution: Runs a Cypher query to retrieve categories and terms starting from the specified root ID.
- Hierarchy Construction: Builds a nested structure of categories and their children, linking terms.
- Export to File: Outputs the hierarchy in a readable tree format to a specified text file.
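The hierarchy construction and export steps might look roughly like the sketch below. It assumes the Cypher query returns (parent, child) name pairs and that the graph is acyclic; the helper names `build_tree` and `render_tree` are hypothetical, and the database call itself is left out.

```python
from collections import defaultdict


def build_tree(rows):
    """Group (parent, child) pairs into a parent -> [children] mapping."""
    children = defaultdict(list)
    for parent, child in rows:
        children[parent].append(child)
    return children


def render_tree(children, root, depth=0):
    """Return the hierarchy as indented lines, one node per line.

    Assumption: no cycles, otherwise the recursion would not terminate.
    """
    lines = ["  " * depth + str(root)]
    for child in children.get(root, []):
        lines.extend(render_tree(children, child, depth + 1))
    return lines


rows = [("Root", "Billing"), ("Billing", "Invoices"), ("Root", "Support")]
tree_lines = render_tree(build_tree(rows), "Root")
print("\n".join(tree_lines))
```

Writing `"\n".join(tree_lines)` to the chosen output file would give the readable tree format described above.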
- Prepare fowler.yml with Neo4j credentials.
- Update the script with your root_id and desired output_file path.
- Run the script (python script_name.py).
- Check the generated hierarchy output file.
Used for visualizing category structures in Neo4j databases.
This script is used for graphing Table of Contents (TOCs) from specified repositories. It supports multiple output formats, including Neo4j and CSV. The script processes the TOC files in parallel using multiple threads for efficiency and can handle up to four separate ranges. The primary components include reading configuration settings from a YAML file, fetching the TOC files from the repository, and creating graph representations for the TOCs.
- Load Configuration File (jobtoc.yml):
  - Reads settings like output type, path, and folder locations to collect TOC files.
- Fetch TOC Files:
  - Collects TOC files from each folder specified in the config file.
- Split TOCs into Ranges:
  - Splits the TOC list into four segments for parallel processing.
- Process TOCs in Parallel:
  - Each TOC segment is processed in a separate thread. Depending on the configuration, TOCs are either written to a Neo4j database or exported as a CSV.
- Output the Results:
  - Writes logs and graph outputs to the specified output directory.
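The fetch step in the list above could be sketched with `os.walk`. The project's own scanner module may work differently; `find_toc_files` is a hypothetical name, and the case-insensitive match on the file name is an assumption.

```python
import os
import tempfile


def find_toc_files(folders, filename="toc.yml"):
    """Recursively collect paths whose file name matches `filename`.

    Assumption: matching ignores case, so TOC.yml and toc.yml both count.
    `filename` is expected in lowercase.
    """
    found = []
    for folder in folders:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in filenames:
                if name.lower() == filename:
                    found.append(os.path.join(dirpath, name))
    return sorted(found)


# Small demo against a throwaway directory tree.
with tempfile.TemporaryDirectory() as repo:
    nested = os.path.join(repo, "articles", "billing")
    os.makedirs(nested)
    for path in (os.path.join(repo, "toc.yml"), os.path.join(nested, "TOC.yml")):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("- name: Overview\n")
    demo_count = len(find_toc_files([repo]))

print(demo_count)
```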
- yaml: To parse configuration files.
- threading: To enable parallel processing.
- datetime: To handle timestamps and date formatting.
- logging: To capture runtime logs and errors.
- neo4j: To connect and write to a Neo4j database.
- tocharvestor, tocscanner, tocformats, mdbutilities: Custom modules for TOC parsing, file scanning, graph creation, and utilities.
Splits a given number into four equal segments. Returns a list of 4 tuples, each indicating the start and end index of a segment.
- Parameters:
  - innumber (int): The total number of items to split.
- Returns:
  - List of tuples with start and end indices: [(a1, a2), (b1, b2), (c1, c2), (d1, d2)].
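One way to implement the split described above is sketched here. The function name `split_into_four`, the half-open `(start, end)` convention, and the handling of a remainder (the extra items go to the earliest segments) are all assumptions; the script's own boundaries may differ.

```python
def split_into_four(innumber):
    """Split range(innumber) into four contiguous (start, end) segments.

    Assumption: `end` is exclusive, so items[start:end] selects the segment.
    """
    base, extra = divmod(innumber, 4)
    ranges = []
    start = 0
    for i in range(4):
        # The first `extra` segments each take one leftover item.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


print(split_into_four(10))
```

When the count is not divisible by four, the segments cannot be exactly equal, so some rule for the leftover items is needed; spreading them over the first segments keeps the sizes within one of each other.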
Processes a segment of the TOC list. Converts each TOC file to a graph format and writes the output to a specified file or database.
- Parameters:
  - index_start (int): Start index for this segment.
  - index_end (int): End index for this segment.
  - outtype (str): The output type (neo4j or csv).
  - outputpath (str): Path for output files.
- Functionality:
  - Loads the credentials for Neo4j from a YAML file.
  - Writes graph representations to Neo4j or outputs as a CSV file.
The main execution point for the script. It performs the following steps:
- Load the jobtoc.yml Config File:
  - Reads and parses the YAML config file to extract settings.
- Determine Output Type (neo4j or csv).
- Fetch the TOC Files:
  - Retrieves TOCs from the specified folders in the config file.
- Limit TOC Processing:
  - If a limit is specified in the config file, only processes up to that number of TOCs.
- Process TOCs in Segments:
  - Splits the TOC list into four segments and launches a separate thread for each segment.
- Logs Start and Finish Times:
  - Records the process's start and finish times.
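The segment-per-thread step above might be sketched like this. It assumes each segment is a half-open `(start, end)` pair and that the worker is a plain function over a slice of the TOC list; `process_in_threads` and the `worker` callback are hypothetical names, and the real script writes to Neo4j or CSV rather than returning values.

```python
import threading


def process_in_threads(items, ranges, worker):
    """Run `worker` on each items[start:end] slice in its own thread."""
    results = [None] * len(ranges)

    def run(slot, start, end):
        # Each thread writes to its own slot, so no lock is needed here.
        results[slot] = worker(items[start:end])

    threads = [
        threading.Thread(target=run, args=(slot, start, end))
        for slot, (start, end) in enumerate(ranges)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for every segment before reporting the finish time
    return results


segment_sums = process_in_threads(
    list(range(8)), [(0, 2), (2, 4), (4, 6), (6, 8)], sum
)
print(segment_sums)
```

Threads suit this workload if most of the time is spent waiting on file or database I/O; for CPU-bound parsing, Python's global interpreter lock would limit the speedup.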
The configuration file should be structured as follows:
type: "neo4j" # Output type: "neo4j" or "csv"
output: "path_to_output_directory"
limit: 10 # Limit the number of TOCs to process (0 for no limit)
folders: # List of folders containing TOCs
- folder: "folder_path_1"
- folder: "folder_path_2"
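Before processing, it can help to check the parsed configuration. The sketch below assumes the file has already been loaded into a dictionary (for example with PyYAML's `yaml.safe_load`) and validates the four keys shown above; `validate_job_config` is a hypothetical helper, and treating `limit` as optional is an assumption.

```python
VALID_TYPES = {"neo4j", "csv"}


def validate_job_config(config):
    """Return a list of human-readable problems; empty means the config looks usable."""
    problems = []

    if config.get("type") not in VALID_TYPES:
        problems.append('type must be "neo4j" or "csv"')

    output = config.get("output")
    if not isinstance(output, str) or not output:
        problems.append("output must be a non-empty path string")

    # Assumption: limit is optional and defaults to 0.
    limit = config.get("limit", 0)
    if not isinstance(limit, int) or isinstance(limit, bool) or limit < 0:
        problems.append("limit must be a non-negative integer")

    folders = config.get("folders")
    if not isinstance(folders, list) or not all(
        isinstance(entry, dict) and "folder" in entry for entry in folders
    ):
        problems.append('folders must be a list of {"folder": path} entries')

    return problems


good = {
    "type": "neo4j",
    "output": "path_to_output_directory",
    "limit": 10,
    "folders": [{"folder": "folder_path_1"}, {"folder": "folder_path_2"}],
}
print(validate_job_config(good))
```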
If the script is executed directly, it calls the main() function, which runs the entire workflow.
if __name__ == "__main__":
    main()
- The script is ideal for processing TOCs in DocFX/Learn.microsoft.com repositories, building graph structures from TOCs, and outputting those graphs to a database or text file for further analysis.
- Logs are written to a file with the format: {output_path}/{todays_date}-logs.log.
- Errors encountered while processing TOCs are captured using logging.error() and output to the logs.
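A logging setup matching the notes above could look like the following sketch. The ISO date format (`YYYY-MM-DD`) and the helper names `build_log_path` and `configure_logging` are assumptions; the script may format the date differently.

```python
import logging
import os
from datetime import date


def build_log_path(output_path, today=None):
    """Build {output_path}/{todays_date}-logs.log (date format is an assumption)."""
    today = today or date.today()
    return os.path.join(output_path, f"{today.isoformat()}-logs.log")


def configure_logging(output_path):
    """Send log records, including logging.error() calls, to the dated log file."""
    os.makedirs(output_path, exist_ok=True)
    log_path = build_log_path(output_path)
    logging.basicConfig(
        filename=log_path,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    return log_path


print(build_log_path("output", date(2024, 9, 30)))
```

After `configure_logging(...)` runs, a failure while processing a TOC can be recorded with `logging.error("...")` and will land in that day's file.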
Here's an example of what the keys file might look like, including the necessary keys for Neo4j credentials, the OpenAI API key, and content context:
# Neo4j Credentials
domain: bolt://localhost:7687
username: neo4j
password: your-neo4j-password
# OpenAI API Key
openai-key: your-openai-api-key
# Content context
content: "an Azure billing service for Microsoft."
# Root node name
rootnode: "Root node name"
- Neo4j Credentials:
  - domain: Specifies the connection string to your Neo4j database.
  - username: Your Neo4j username.
  - password: Your Neo4j password.
- OpenAI API Key:
  - openai-key: The key you use to authenticate with OpenAI's API.
- Content:
  - content: The specific context related to the subject domain (e.g., in this case, an Azure billing service for Microsoft). This will be used in the prompt to OpenAI GPT-4 to help generate more contextually appropriate category names.
- Root Node Name:
  - rootnode: The name of the root node in the graph.
Make sure to replace the placeholders (your-neo4j-password, your-openai-api-key) with your actual credentials and content description.