Merge pull request #21 from discoverygarden/DDST_71
DDST-71: Update analysis tooling
chrismacdonaldw authored May 2, 2024
2 parents c7e2514 + a10f682 commit 49b4be5
Showing 17 changed files with 761 additions and 226 deletions.
37 changes: 37 additions & 0 deletions scripts/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# Python
*.py[cod]
__pycache__/
venv/
*.pyc

# Jupyter Notebook
.ipynb_checkpoints/

# IDEs
.idea/
.vscode/

# Compiled files
*.pyd
*.pyo
*.pyw
*.pyz
*.pyzw

# Distribution / packaging
dist/
build/
*.egg-info/
*.egg

# Environment
.env
.env.*

# Script Outputs
results/*
output/*
*.xml

# Etc
.DS_Store
85 changes: 61 additions & 24 deletions scripts/README.md
Original file line number Diff line number Diff line change
@@ -1,56 +1,93 @@
# FCREPO3 Analysis Helpers

## Introduction
Tools to analyse and export metadata from an FCREPO3 instance using Python scripts.

## Table of Contents

* [Setup](#setup)
* [Features](#features)
* [Usage](#usage)

## Setup
These tools are designed to be run in a Python environment. Ensure Python 3.6 or higher is installed on your system; you can check the version with `python3 --version`. You will need to set up a Python virtual environment and install the required packages; this can be done with the following commands from within the `scripts` directory:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

The scripts also require an FCREPO3 instance. If these tools are run on a system separate from where the repository is hosted, modifications might be necessary in the `fedora-xacml-policies` directory at `$FEDORA_HOME/data/fedora-xacml-policies`.

## Features

### Metadata Analysis
A script to run SPARQL queries against an FCREPO3 resource index (RI) and gather information. Current queries include:
- Content model distribution
- Total object count
- Count of active and deleted objects
- List of deleted objects
- Datastream distribution
- Owner distribution
- Collection distribution
- List of relationships
- List of orphaned objects
- MIME type distribution

Before running the script, verify that the queries provided in `queries.py` are compatible with the system you are querying. If the system uses Mulgara instead of Blazegraph, it is restricted to SPARQL 1.0; to see which features are unavailable, consult the list of new features added in SPARQL 1.1 at the bottom of https://www.w3.org/TR/sparql11-query/.
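For instance, a Mulgara-safe breakdown can be gathered with plain SPARQL 1.0 constructs and tallied client-side; the query below is illustrative only, written with the same predicate shorthand used by the queries in this repository:

```sparql
# Lists each object with its content model; no aggregates (COUNT, GROUP BY)
# are used, so the distribution must be counted client-side.
SELECT ?obj ?model
WHERE {
  ?obj <fedora-model:hasModel> ?model .
}
```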

If the system you are querying contains relationships beyond the ones these queries cover by default, you will need to modify the relevant queries to get an accurate analysis. For example, relationships sometimes appear with different capitalization or typos; accounting for them in this analysis phase keeps the analysis complete and accurate, and ensures they get mapped appropriately in the actual migration.

### Metadata Export
Script to export all objects (or a specified list of PIDs) within the repository that contain a specified metadata datastream ID, saving results as XML.

### FOXML Export
Script to export FOXML archival objects from a Fedora repository given a list of PIDs.

### Datastream Updater
Script to inject a binary into an archival FOXML as base64 encoded data within a datastream.

## Usage
### Metadata Analysis
#### Command
```bash
python3 data_analysis.py --url=<http://your-fedora-url> --user=<admin> --password=<secret> --output_dir=<./results>
```

#### Output
Exports the results of all queries found in `queries.py` to their own CSV files, in the `results` folder by default; this can be changed with the `--output_dir` flag.

### Metadata Export
#### Command
```bash
python3 datastream_export.py --url=<http://your-fedora-url:8080> --user=<admin> --password=<secret> --dsid=<DSID> --output_dir=<./output> --pid_file=<./some_pids>
```
> The script supports adding comments in the pid_file using `#`. PIDs can also contain URL encoded characters (e.g., `%3A` for `:` which will be automatically decoded). Expected format of the `pid_file` is one PID per line.
If `--pid_file` isn't specified, the script runs a query that retrieves all PIDs in the system and exports every one of them.
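The pid_file handling above implies parsing along these lines; a minimal sketch, assuming behavior equivalent to the `process_pid_file` helper imported from `utils` (which is not shown in this diff):

```python
from urllib.parse import unquote

def parse_pid_file(path):
    """Read one PID per line, skipping blank lines and '#' comments,
    and decode URL-encoded characters (e.g. '%3A' -> ':')."""
    pids = []
    with open(path) as f:
        for line in f:
            # Drop anything after a '#' plus surrounding whitespace.
            line = line.split("#", 1)[0].strip()
            if line:
                pids.append(unquote(line))
    return pids
```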

#### Output
Exports all metadata entries related to the specified DSID into XML files stored in the defined output directory.
Each file's name will be in the format `pid-DSID.xml`.

### FOXML Export
#### Command
```bash
python3 foxml_export.py --url=<http://your-fedora-url:8080> --user=<admin> --password=<secret> --pid_file=<./some_pids_to_export> --output_dir=<./output>
```
> The script supports adding comments in the pid_file using `#`. PIDs can also contain URL encoded characters (e.g., `%3A` for `:` which will be automatically decoded). Expected format of the `pid_file` is one PID per line.
#### Output
Exports the archival FOXML for each PID listed in the supplied PID file to its own folder under `output_dir/FOXML`.

### Datastream Updater
#### Command
```bash
python3 datastream_updater.py --xml=<input.xml> --dsid=<DSID> --content=<content.bin> --label=<'New Version'> --output=<output.xml>
```
> This script allows you to specify the XML file to modify, the datastream ID, the binary content file (which will be base64 encoded), and optionally a label for the new datastream version.
The only optional argument is `--label`, for specifying a custom label. If previous datastream versions do not have a label and one isn't supplied in the arguments, the script will prompt for one.
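The core of the injection is plain base64 encoding of the binary content; a hedged sketch of that step (the helper name here is made up for illustration, not the script's actual code):

```python
import base64

def encode_datastream_content(path):
    """Return the base64 text that an archival FOXML embeds inside a
    <foxml:binaryContent> element of a datastream version."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```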

#### Output
Updates the specified XML file with a new version of the datastream, encoding the provided binary content into base64. The updated XML is saved to the specified output file.

## Known Issues
* `datastream_updater.py` is very finicky and will probably fail on most FOXML objects.
* The eventual intention is to rewrite this script using `xmltodict` and simplify it further. Most of its current issues derive from XML namespaces.
55 changes: 55 additions & 0 deletions scripts/data_analysis.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
import argparse
import os
from utils import perform_http_request
from queries import queries


def parse_args():
parser = argparse.ArgumentParser(
description="Process SPARQL queries and save results."
)
parser.add_argument("--url", type=str, help="Fedora server URL", required=True)
parser.add_argument("--user", type=str, help="Fedora username", required=True)
parser.add_argument("--password", type=str, help="Fedora password", required=True)
parser.add_argument(
"--output_dir",
type=str,
default="./results",
help="Directory to save CSV files",
)
return parser.parse_args()


def save_to_csv(data, filename, output_dir):
"""
Save the given data to a CSV file.
Args:
data (str): The data to be written to the CSV file.
filename (str): The name of the CSV file.
output_dir (str): The directory where the CSV file will be saved.
Returns:
None
"""
os.makedirs(output_dir, exist_ok=True)
with open(os.path.join(output_dir, filename), "w", newline="") as file:
file.write(data)


def main():
args = parse_args()

for query_name, query in queries.items():
print(f"Processing query '{query_name}'...")
result = perform_http_request(query, args.url, args.user, args.password)
if result:
csv_filename = f"{query_name}.csv"
print(f"Saving results to {csv_filename}...\n")
save_to_csv(result, csv_filename, args.output_dir)
else:
print(f"Failed to retrieve data for query '{query_name}'.\n")


if __name__ == "__main__":
main()
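`perform_http_request` is imported from `utils`, which is not part of this diff. As a sketch of how such a helper might work, the request below targets FCREPO3's resource index search endpoint; the `/fedora/risearch` path and parameter names follow the standard FCREPO3 risearch API, but treat them as assumptions to verify against the actual `utils.py`:

```python
import requests

def build_risearch_params(query):
    """Build the form parameters for a tuple query returning CSV."""
    # Parameter names per the FCREPO3 risearch API; verify for your instance.
    return {
        "type": "tuples",
        "lang": "sparql",
        "format": "CSV",
        "query": query,
    }

def perform_http_request(query, base_url, user, password):
    """POST a SPARQL query to the resource index; return CSV text or None."""
    try:
        response = requests.post(
            f"{base_url}/fedora/risearch",
            data=build_risearch_params(query),
            auth=(user, password),
        )
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return None
```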
113 changes: 113 additions & 0 deletions scripts/datastream_export.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
import argparse
import requests
from tqdm import tqdm
import concurrent.futures
import os
import mimetypes
from utils import perform_http_request, process_pid_file


def parse_args():
parser = argparse.ArgumentParser(
description="Export metadata using SPARQL query and save as XML."
)
parser.add_argument("--url", required=True, help="Fedora base URL")
parser.add_argument("--user", required=True, help="Username for Fedora access")
parser.add_argument("--password", required=True, help="Password for Fedora access")
parser.add_argument("--dsid", required=True, help="Datastream ID for querying")
parser.add_argument(
"--output_dir", default="./output", help="Directory to save XML files"
)
parser.add_argument(
"--pid_file", type=str, help="File containing PIDs to process", required=False
)
return parser.parse_args()


def fetch_data(dsid, base_url, user, password, output_dir, pid):
"""
Fetches the datastream content for a given datastream ID (dsid) and PID from a Fedora repository.
Args:
dsid (str): The ID of the datastream to fetch.
base_url (str): The base URL of the Fedora repository.
user (str): The username for authentication.
password (str): The password for authentication.
output_dir (str): The directory where the fetched data will be saved.
pid (str): The PID of the object that contains the datastream.
Returns:
bool: True if the datastream content was successfully fetched and saved, False otherwise.
"""
pid = pid.replace("info:fedora/", "")
url = f"{base_url}/fedora/objects/{pid}/datastreams/{dsid}/content"
print(f"Downloading {dsid} for PID: {pid}")
try:
response = requests.get(url, auth=(user, password))
response.raise_for_status()
dsid_dir = os.path.join(output_dir, dsid)
os.makedirs(dsid_dir, exist_ok=True)
content_type = response.headers.get("Content-Type", "")
extension = mimetypes.guess_extension(content_type) if content_type else ""
filename = f"{pid}-{dsid}{extension}"
with open(os.path.join(dsid_dir, filename), "wb") as f:
f.write(response.content)
        print(f"Successfully saved {filename}\n")
return True
except Exception as e:
print(f"Failed to fetch data for {pid}, error: {str(e)}\n")
return False


def main():
args = parse_args()
os.makedirs(args.output_dir, exist_ok=True)

pids = []

# If a PID file is provided, process the file to get the list of PIDs.
if args.pid_file:
pids = process_pid_file(args.pid_file)
else:
query = f"""
SELECT ?obj WHERE {{
?obj <fedora-model:hasModel> <info:fedora/fedora-system:FedoraObject-3.0>;
<fedora-model:hasModel> ?model;
<fedora-view:disseminates> ?ds.
?ds <fedora-view:disseminationType> <info:fedora/*/{args.dsid}>
FILTER(!sameTerm(?model, <info:fedora/fedora-system:FedoraObject-3.0>))
FILTER(!sameTerm(?model, <info:fedora/fedora-system:ContentModel-3.0>))
}}
"""

        result = perform_http_request(query, args.url, args.user, args.password)
        if result:
            # Skip the CSV header row; the remaining lines are PIDs.
            pids.extend(result.strip().split("\n")[1:])

# Download metadata for each PID in parallel using ThreadPoolExecutor.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor, tqdm(
total=len(pids), desc="Downloading Metadata"
) as progress:
futures = {
executor.submit(
fetch_data,
args.dsid,
args.url,
args.user,
args.password,
args.output_dir,
pid,
): pid
for pid in pids
}
        for future in concurrent.futures.as_completed(futures):
            pid = futures[future]
            try:
                future.result()
            except Exception as exc:
                print(f"{pid} generated an exception: {exc}")
            finally:
                # Advance the bar whether or not the download succeeded,
                # so the progress total is always reached.
                progress.update(1)


if __name__ == "__main__":
main()
