Merge pull request #21 from discoverygarden/DDST_71
DDST-71: Update analysis tooling
chrismacdonaldw authored May 2, 2024
2 parents c7e2514 + a10f682 commit 49b4be5
Showing 17 changed files with 761 additions and 226 deletions.
37 changes: 37 additions & 0 deletions scripts/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# Python
*.py[cod]
__pycache__/
venv/
*.pyc

# Jupyter Notebook
.ipynb_checkpoints/

# IDEs
.idea/
.vscode/

# Compiled files
*.pyd
*.pyo
*.pyw
*.pyz
*.pyzw

# Distribution / packaging
dist/
build/
*.egg-info/
*.egg

# Environment
.env
.env.*

# Script Outputs
results/*
output/*
*.xml

# Etc
.DS_Store
85 changes: 61 additions & 24 deletions scripts/README.md
Original file line number Diff line number Diff line change
@@ -1,56 +1,93 @@
# FCREPO3 Analysis Helpers

## Introduction
Tools to analyse and export metadata from an FCREPO3 instance using Python scripts.

## Table of Contents

* [Setup](#setup)
* [Features](#features)
* [Usage](#usage)

## Setup
These tools are designed to be run in a Python environment. Ensure Python 3.6 or higher is installed on your system; you can check the version with `python3 --version`. You will need to set up a Python virtual environment and install the required packages; this can be done with the following commands from within the `scripts` directory:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

The scripts also require an FCREPO3 instance. If these tools are run on a system separate from where the repository is hosted, modifications might be necessary in the `fedora-xacml-policies` directory at `$FEDORA_HOME/data/fedora-xacml-policies`.

## Features

### Metadata Analysis
A script to run SPARQL queries against an FCREPO3 resource index (RI) and gather information. Current queries include:
- Content model distribution
- Total object count
- Count of active and deleted objects
- List of deleted objects
- Datastream distribution
- Owner distribution
- Collection distribution
- List of relationships
- List of orphaned objects
- MIME type distribution

Before running the script, verify that the queries provided in `queries.py` are compatible with the system you are querying. If the system uses Mulgara instead of Blazegraph, it is restricted to SPARQL 1.0; to see which features are unavailable, consult the list of new features added in SPARQL 1.1 at the bottom of https://www.w3.org/TR/sparql11-query/.
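For instance, a Mulgara-safe breakdown can be gathered with plain SPARQL 1.0 constructs and tallied client-side; the query below is illustrative only, written with the same predicate shorthand used by the queries in this repository:

```sparql
# Lists each object with its content model; no aggregates (COUNT, GROUP BY)
# are used, so the distribution must be counted client-side.
SELECT ?obj ?model
WHERE {
  ?obj <fedora-model:hasModel> ?model .
}
```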

If the system you are querying contains relationships beyond the ones these queries cover by default, you will need to modify the relevant queries to get an accurate analysis. For example, relationships sometimes appear with different capitalization or typos; accounting for them in this analysis phase keeps the analysis complete and accurate, and ensures they get mapped appropriately in the actual migration.

### Metadata Export
Script to export all objects (or a specified list of PIDs) within the repository that contain a specified metadata datastream ID, saving results as XML.

### FOXML Export
Script to export FOXML archival objects from a Fedora repository given a list of PIDs.

### Datastream Updater
Script to inject a binary into an archival FOXML as base64 encoded data within a datastream.

## Usage
### Metadata Analysis
#### Command
```bash
python3 data_analysis.py --url=<http://your-fedora-url> --user=<admin> --password=<secret> --output_dir=<./results>
```

#### Output
Exports the results of all queries found in `queries.py` to their own CSV files, in the `results` folder by default; this can be changed with the `--output_dir` flag.

### Metadata Export
#### Command
```bash
python3 datastream_export.py --url=<http://your-fedora-url:8080> --user=<admin> --password=<secret> --dsid=<DSID> --output_dir=<./output> --pid_file=<./some_pids>
```
> The script supports adding comments in the pid_file using `#`. PIDs can also contain URL encoded characters (e.g., `%3A` for `:` which will be automatically decoded). Expected format of the `pid_file` is one PID per line.
If `--pid_file` isn't specified, the script runs a query that retrieves all PIDs in the system and exports every one of them.
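The pid_file handling above implies parsing along these lines; a minimal sketch, assuming behavior equivalent to the `process_pid_file` helper imported from `utils` (which is not shown in this diff):

```python
from urllib.parse import unquote

def parse_pid_file(path):
    """Read one PID per line, skipping blank lines and '#' comments,
    and decode URL-encoded characters (e.g. '%3A' -> ':')."""
    pids = []
    with open(path) as f:
        for line in f:
            # Drop anything after a '#' plus surrounding whitespace.
            line = line.split("#", 1)[0].strip()
            if line:
                pids.append(unquote(line))
    return pids
```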

#### Output
Exports all metadata entries related to the specified DSID into XML files stored in the defined output directory.
Each file's name will be in the format `pid-DSID.xml`.

### FOXML Export
#### Command
```bash
python3 foxml_export.py --url=<http://your-fedora-url:8080> --user=<admin> --password=<secret> --pid_file=<./some_pids_to_export> --output_dir=<./output>
```
> The script supports adding comments in the pid_file using `#`. PIDs can also contain URL encoded characters (e.g., `%3A` for `:` which will be automatically decoded). Expected format of the `pid_file` is one PID per line.
#### Output
Exports the archival FOXML for each PID listed in the supplied PID file to its own folder under `output_dir/FOXML`.

### Datastream Updater
#### Command
```bash
python3 datastream_updater.py --xml=<input.xml> --dsid=<DSID> --content=<content.bin> --label=<'New Version'> --output=<output.xml>
```
> This script allows you to specify the XML file to modify, the datastream ID, the binary content file (which will be base64 encoded), and optionally a label for the new datastream version.
The only optional argument is `--label`, for specifying a custom label. If previous datastream versions do not have a label and one isn't supplied in the arguments, the script will prompt for one.
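The core of the injection is plain base64 encoding of the binary content; a hedged sketch of that step (the helper name here is made up for illustration, not the script's actual code):

```python
import base64

def encode_datastream_content(path):
    """Return the base64 text that an archival FOXML embeds inside a
    <foxml:binaryContent> element of a datastream version."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```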

#### Output
Updates the specified XML file with a new version of the datastream, encoding the provided binary content into base64. The updated XML is saved to the specified output file.

## Known Issues
* `datastream_updater.py` is very finicky and will probably fail on most FOXML objects.
* The eventual intention is to rewrite this script using `xmltodict` and simplify it further. Most of its current issues derive from XML namespaces.
55 changes: 55 additions & 0 deletions scripts/data_analysis.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
import argparse
import os
from utils import perform_http_request
from queries import queries


def parse_args():
parser = argparse.ArgumentParser(
description="Process SPARQL queries and save results."
)
parser.add_argument("--url", type=str, help="Fedora server URL", required=True)
parser.add_argument("--user", type=str, help="Fedora username", required=True)
parser.add_argument("--password", type=str, help="Fedora password", required=True)
parser.add_argument(
"--output_dir",
type=str,
default="./results",
help="Directory to save CSV files",
)
return parser.parse_args()


def save_to_csv(data, filename, output_dir):
"""
Save the given data to a CSV file.
Args:
data (str): The data to be written to the CSV file.
filename (str): The name of the CSV file.
output_dir (str): The directory where the CSV file will be saved.
Returns:
None
"""
os.makedirs(output_dir, exist_ok=True)
with open(os.path.join(output_dir, filename), "w", newline="") as file:
file.write(data)


def main():
args = parse_args()

for query_name, query in queries.items():
print(f"Processing query '{query_name}'...")
result = perform_http_request(query, args.url, args.user, args.password)
if result:
csv_filename = f"{query_name}.csv"
print(f"Saving results to {csv_filename}...\n")
save_to_csv(result, csv_filename, args.output_dir)
else:
print(f"Failed to retrieve data for query '{query_name}'.\n")


if __name__ == "__main__":
main()
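`perform_http_request` is imported from `utils`, which is not part of this diff. As a sketch of how such a helper might work, the request below targets FCREPO3's resource index search endpoint; the `/fedora/risearch` path and parameter names follow the standard FCREPO3 risearch API, but treat them as assumptions to verify against the actual `utils.py`:

```python
import requests

def build_risearch_params(query):
    """Build the form parameters for a tuple query returning CSV."""
    # Parameter names per the FCREPO3 risearch API; verify for your instance.
    return {
        "type": "tuples",
        "lang": "sparql",
        "format": "CSV",
        "query": query,
    }

def perform_http_request(query, base_url, user, password):
    """POST a SPARQL query to the resource index; return CSV text or None."""
    try:
        response = requests.post(
            f"{base_url}/fedora/risearch",
            data=build_risearch_params(query),
            auth=(user, password),
        )
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return None
```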
113 changes: 113 additions & 0 deletions scripts/datastream_export.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
import argparse
import requests
from tqdm import tqdm
import concurrent.futures
import os
import mimetypes
from utils import perform_http_request, process_pid_file


def parse_args():
parser = argparse.ArgumentParser(
description="Export metadata using SPARQL query and save as XML."
)
parser.add_argument("--url", required=True, help="Fedora base URL")
parser.add_argument("--user", required=True, help="Username for Fedora access")
parser.add_argument("--password", required=True, help="Password for Fedora access")
parser.add_argument("--dsid", required=True, help="Datastream ID for querying")
parser.add_argument(
"--output_dir", default="./output", help="Directory to save XML files"
)
parser.add_argument(
"--pid_file", type=str, help="File containing PIDs to process", required=False
)
return parser.parse_args()


def fetch_data(dsid, base_url, user, password, output_dir, pid):
"""
Fetches the datastream content for a given datastream ID (dsid) and PID from a Fedora repository.
Args:
dsid (str): The ID of the datastream to fetch.
base_url (str): The base URL of the Fedora repository.
user (str): The username for authentication.
password (str): The password for authentication.
output_dir (str): The directory where the fetched data will be saved.
pid (str): The PID of the object that contains the datastream.
Returns:
bool: True if the datastream content was successfully fetched and saved, False otherwise.
"""
pid = pid.replace("info:fedora/", "")
url = f"{base_url}/fedora/objects/{pid}/datastreams/{dsid}/content"
print(f"Downloading {dsid} for PID: {pid}")
try:
response = requests.get(url, auth=(user, password))
response.raise_for_status()
dsid_dir = os.path.join(output_dir, dsid)
os.makedirs(dsid_dir, exist_ok=True)
content_type = response.headers.get("Content-Type", "")
extension = mimetypes.guess_extension(content_type) if content_type else ""
filename = f"{pid}-{dsid}{extension}"
with open(os.path.join(dsid_dir, filename), "wb") as f:
f.write(response.content)
        print(f"Successfully saved {filename}\n")
return True
except Exception as e:
print(f"Failed to fetch data for {pid}, error: {str(e)}\n")
return False


def main():
args = parse_args()
os.makedirs(args.output_dir, exist_ok=True)

pids = []

# If a PID file is provided, process the file to get the list of PIDs.
if args.pid_file:
pids = process_pid_file(args.pid_file)
else:
query = f"""
SELECT ?obj WHERE {{
?obj <fedora-model:hasModel> <info:fedora/fedora-system:FedoraObject-3.0>;
<fedora-model:hasModel> ?model;
<fedora-view:disseminates> ?ds.
?ds <fedora-view:disseminationType> <info:fedora/*/{args.dsid}>
FILTER(!sameTerm(?model, <info:fedora/fedora-system:FedoraObject-3.0>))
FILTER(!sameTerm(?model, <info:fedora/fedora-system:ContentModel-3.0>))
}}
"""

        result = perform_http_request(query, args.url, args.user, args.password)
        if result:
            # Skip the CSV header row; the remaining lines are PIDs.
            pids.extend(result.strip().split("\n")[1:])

# Download metadata for each PID in parallel using ThreadPoolExecutor.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor, tqdm(
total=len(pids), desc="Downloading Metadata"
) as progress:
futures = {
executor.submit(
fetch_data,
args.dsid,
args.url,
args.user,
args.password,
args.output_dir,
pid,
): pid
for pid in pids
}
        for future in concurrent.futures.as_completed(futures):
            pid = futures[future]
            try:
                future.result()
            except Exception as exc:
                print(f"{pid} generated an exception: {exc}")
            finally:
                # Advance the bar whether or not the download succeeded,
                # so the progress total is always reached.
                progress.update(1)


if __name__ == "__main__":
main()
