
Commit

Update analysis tooling
chrismacdonaldw committed Apr 19, 2024
1 parent c7e2514 commit c1eb474
Showing 11 changed files with 598 additions and 213 deletions.
37 changes: 37 additions & 0 deletions scripts/.gitignore
@@ -0,0 +1,37 @@
# Python
*.py[cod]
__pycache__/
venv/
*.pyc

# Jupyter Notebook
.ipynb_checkpoints/

# IDEs
.idea/
.vscode/

# Compiled files
*.pyd
*.pyo
*.pyw
*.pyz
*.pyzw

# Distribution / packaging
dist/
build/
*.egg-info/
*.egg

# Environment
.env
.env.*

# Script Outputs
results/*
output/*
*.xml

# Etc
.DS_Store
56 changes: 31 additions & 25 deletions scripts/README.md
@@ -1,56 +1,62 @@
# FCREPO3 Analysis Helpers

## Introduction
Tools to analyse and export metadata from an FCREPO3 instance using Python scripts.

## Table of Contents

* [Setup](#setup)
* [Features](#features)
* [Usage](#usage)

## Setup
These tools are designed to be run with a Python environment. Ensure Python 3.6 or higher is installed on your system. You will need to set up a Python virtual environment and install the required packages:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

The scripts also require an FCREPO3 instance. If these tools are run on a system separate from where the repository is hosted, modifications might be necessary in the `fedora-xacml-policies` directory at `$FEDORA_HOME/data/fedora-xacml-policies`.

## Features

### Metadata Analysis
Python scripts that perform the following:
1. Count all objects in the repository.
2. Provide a breakdown of objects by content models (`models.csv`).
3. Output a breakdown of unique datastream IDs (`dsids.csv`).

### Metadata Export
Scripts to export all objects within the repository that contain a specified metadata datastream ID, saving results as XML.

## Usage

### Metadata Analysis
#### Command
```bash
python3 data_analysis.py --url=http://your-fedora-url --user=admin --password=secret --output_dir=./results
```

#### Output
Exports the results of each query in `queries.py` to its own CSV file, saved to the `results` directory by default; this can be changed with the `--output_dir` flag.
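
For context, `data_analysis.py` iterates over a `queries` dict imported from `queries.py`, where each key becomes the CSV filename. A hypothetical entry might look like the following (the actual queries in this commit may differ, and Fedora 3's Mulgara resource index supports only a subset of SPARQL):

```python
# queries.py -- illustrative sketch; names and query text are assumptions.
queries = {
    # Saved as models.csv by data_analysis.py.
    "models": """
        SELECT ?obj ?model
        WHERE { ?obj <fedora-model:hasModel> ?model . }
    """,
}
```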

### Metadata Export
#### Command
```bash
python3 datastream_export.py --url=http://your-fedora-url:8080 --user=admin --password=secret --dsid=DSID --output_dir=./output --pid_file=./some_pids
```
> The script supports comments in the pid_file using `#`. PIDs may also contain URL-encoded characters (e.g., `%3A` for `:`), which are automatically decoded.
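
For illustration, a `pid_file` might look like this (the PIDs are hypothetical):

```
# objects from the migration batch
islandora:1234
islandora%3A5678  # decoded to islandora:5678
```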
#### Output
Exports all metadata entries related to the specified DSID into files stored in a per-DSID subdirectory of the output directory. Each file's name will be in the format `pid-DSID.ext`, with the extension guessed from the returned content type (typically `.xml`).

### Datastream Updater
#### Command
```bash
python3 datastream_updater.py --xml=input.xml --dsid=DSID --content=content.bin --label='New Version' --output=output.xml
```
> This script allows you to specify the XML file to modify, the datastream ID, the binary content file (which will be base64 encoded), and optionally a label for the new datastream version.
The only optional argument is `--label`, for supplying a custom label for the new datastream version. If previous datastream versions have no label and none is given, the script will prompt for one.
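
As context for the encoding step, here is a minimal sketch of base64-encoding binary content for embedding in a FOXML `binaryContent` element (element name taken from the FOXML schema; this is not the script's exact code):

```python
import base64

def encode_content(path):
    # Read the binary file and return base64 text suitable for
    # embedding inside a <foxml:binaryContent> element.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```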

#### Output
Updates the specified XML file with a new version of the datastream, encoding the provided binary content into base64. The updated XML is saved to the specified output file.

55 changes: 55 additions & 0 deletions scripts/data_analysis.py
@@ -0,0 +1,55 @@
import argparse
import os
from utils import perform_http_request
from queries import queries


def parse_args():
parser = argparse.ArgumentParser(
description="Process SPARQL queries and save results."
)
parser.add_argument("--url", type=str, help="Fedora server URL", required=True)
parser.add_argument("--user", type=str, help="Fedora username", required=True)
parser.add_argument("--password", type=str, help="Fedora password", required=True)
parser.add_argument(
"--output_dir",
type=str,
default="./results",
help="Directory to save CSV files",
)
return parser.parse_args()


def save_to_csv(data, filename, output_dir):
"""
Save the given data to a CSV file.
Args:
data (str): The data to be written to the CSV file.
filename (str): The name of the CSV file.
output_dir (str): The directory where the CSV file will be saved.
Returns:
None
"""
os.makedirs(output_dir, exist_ok=True)
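    # The query result is already CSV-formatted text, so write it verbatim;
    # newline="" prevents extra line-ending translation on Windows.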
with open(os.path.join(output_dir, filename), "w", newline="") as file:
file.write(data)


def main():
args = parse_args()

for query_name, query in queries.items():
print(f"Processing query '{query_name}'...")
result = perform_http_request(query, args.url, args.user, args.password)
if result:
csv_filename = f"{query_name}.csv"
print(f"Saving results to {csv_filename}...\n")
save_to_csv(result, csv_filename, args.output_dir)
else:
print(f"Failed to retrieve data for query '{query_name}'.\n")


if __name__ == "__main__":
main()
137 changes: 137 additions & 0 deletions scripts/datastream_export.py
@@ -0,0 +1,137 @@
import argparse
import requests
from tqdm import tqdm
import concurrent.futures
import os
import mimetypes
from utils import perform_http_request


def parse_args():
parser = argparse.ArgumentParser(
description="Export metadata using SPARQL query and save as XML."
)
parser.add_argument("--url", required=True, help="Fedora base URL")
parser.add_argument("--user", required=True, help="Username for Fedora access")
parser.add_argument("--password", required=True, help="Password for Fedora access")
parser.add_argument("--dsid", required=True, help="Datastream ID for querying")
parser.add_argument(
"--output_dir", default="./output", help="Directory to save XML files"
)
parser.add_argument(
"--pid_file", type=str, help="File containing PIDs to process", required=False
)
return parser.parse_args()


def fetch_data(dsid, base_url, user, password, output_dir, obj_id):
"""
Fetches the datastream content for a given datastream ID (dsid) and object ID (obj_id) from a Fedora repository.
Args:
dsid (str): The ID of the datastream to fetch.
base_url (str): The base URL of the Fedora repository.
user (str): The username for authentication.
password (str): The password for authentication.
output_dir (str): The directory where the fetched data will be saved.
obj_id (str): The ID of the object that contains the datastream.
Returns:
bool: True if the datastream content was successfully fetched and saved, False otherwise.
"""
obj_id = obj_id.replace("info:fedora/", "")
url = f"{base_url}/fedora/objects/{obj_id}/datastreams/{dsid}/content"
print(f"Downloading {dsid} for PID: {obj_id}")
try:
response = requests.get(url, auth=(user, password))
response.raise_for_status()
dsid_dir = os.path.join(output_dir, dsid)
os.makedirs(dsid_dir, exist_ok=True)
content_type = response.headers.get("Content-Type", "")
extension = mimetypes.guess_extension(content_type) if content_type else ""
filename = f"{obj_id}-{dsid}{extension}"
with open(os.path.join(dsid_dir, filename), "wb") as f:
f.write(response.content)
print(f"Successfully saved {filename}")
return True
except Exception as e:
print(f"Failed to fetch data for {obj_id}, error: {str(e)}")
return False


def process_pid_file(filepath):
"""
Process a file containing PIDs (Persistent Identifiers) and return a list of PIDs.
Supports comments in the file using '#' character.
Replace '%3A' with ':' in PIDs.
Args:
filepath (str): The path to the file containing PIDs.
Returns:
list: A list of PIDs extracted from the file.
"""
pids = []
with open(filepath, "r") as file:
for line in file:
line = line.strip()
if "#" in line:
line = line[: line.index("#")].strip()
if line:
line = line.replace("%3A", ":")
pids.append(line)
return pids


def main():
args = parse_args()
os.makedirs(args.output_dir, exist_ok=True)

object_ids = []

# If a PID file is provided, process the file to get the list of PIDs.
if args.pid_file:
object_ids = process_pid_file(args.pid_file)
else:
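        # No PID file given: query the resource index for every object that
        # disseminates a datastream with the requested DSID, excluding the
        # system content-model objects via the FILTER clauses.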
query = f"""
SELECT ?obj WHERE {{
?obj <fedora-model:hasModel> <info:fedora/fedora-system:FedoraObject-3.0>;
<fedora-model:hasModel> ?model;
<fedora-view:disseminates> ?ds.
?ds <fedora-view:disseminationType> <info:fedora/*/{args.dsid}>
FILTER(!sameTerm(?model, <info:fedora/fedora-system:FedoraObject-3.0>))
FILTER(!sameTerm(?model, <info:fedora/fedora-system:ContentModel-3.0>))
}}
"""

        result = perform_http_request(query, args.url, args.user, args.password)
        # Guard against a failed request; skip the CSV header row on success.
        if result:
            object_ids.extend(result.strip().split("\n")[1:])

# Download metadata for each PID in parallel using ThreadPoolExecutor.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor, tqdm(
total=len(object_ids), desc="Downloading Metadata"
) as progress:
futures = {
executor.submit(
fetch_data,
args.dsid,
args.url,
args.user,
args.password,
args.output_dir,
obj_id,
): obj_id
for obj_id in object_ids
}
for future in concurrent.futures.as_completed(futures):
obj_id = futures[future]
try:
success = future.result()
if success:
progress.update(1)
except Exception as exc:
print(f"{obj_id} generated an exception: {exc}")


if __name__ == "__main__":
main()