
Commit

Update analysis tooling
chrismacdonaldw committed Apr 19, 2024
1 parent c7e2514 commit c1eb474
Showing 11 changed files with 598 additions and 213 deletions.
37 changes: 37 additions & 0 deletions scripts/.gitignore
@@ -0,0 +1,37 @@
# Python
*.py[cod]
__pycache__/
venv/
*.pyc

# Jupyter Notebook
.ipynb_checkpoints/

# IDEs
.idea/
.vscode/

# Compiled files
*.pyd
*.pyo
*.pyw
*.pyz
*.pyzw

# Distribution / packaging
dist/
build/
*.egg-info/
*.egg

# Environment
.env
.env.*

# Script Outputs
results/*
output/*
*.xml

# Etc
.DS_Store
56 changes: 31 additions & 25 deletions scripts/README.md
@@ -1,56 +1,62 @@
# FCREPO3 Analysis Helpers

## Introduction
Tools to analyse and export metadata from an FCREPO3 instance using Python scripts.

## Table of Contents

* [Setup](#setup)
* [Features](#features)
* [Usage](#usage)

## Setup
These tools are designed to be run with a Python environment. Ensure Python 3.6 or higher is installed on your system. You will need to set up a Python virtual environment and install the required packages:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

The scripts also require an FCREPO3 instance. If these tools are run on a system separate from where the repository is hosted, modifications might be necessary in the `fedora-xacml-policies` directory at `$FEDORA_HOME/data/fedora-xacml-policies`.

## Features

### Metadata Analysis
Python scripts that perform the following:
1. Count all objects in the repository.
2. Provide a breakdown of objects by content models (`models.csv`).
3. Output a breakdown of unique datastream IDs (`dsids.csv`).

### Metadata Export
Scripts to export all objects within the repository that contain a specified metadata datastream ID, saving results as XML.

## Usage

### Metadata Analysis
#### Command
```bash
python3 data_analysis.py --url=http://your-fedora-url --user=admin --password=secret --output_dir=./results
```

#### Output
Exports the results of each query in `queries.py` to its own CSV file, saved to the `results` directory by default; this can be changed with the `--output_dir` flag.
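
For context, `data_analysis.py` iterates over a `queries` dict imported from `queries.py`, where each key becomes the CSV filename. A hypothetical entry might look like the following (the actual queries in this commit may differ, and Fedora 3's Mulgara resource index supports only a subset of SPARQL):

```python
# queries.py -- illustrative sketch; names and query text are assumptions.
queries = {
    # Saved as models.csv by data_analysis.py.
    "models": """
        SELECT ?obj ?model
        WHERE { ?obj <fedora-model:hasModel> ?model . }
    """,
}
```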

### Metadata Export
#### Command
```bash
python3 datastream_export.py --url=http://your-fedora-url:8080 --user=admin --password=secret --dsid=DSID --output_dir=./output --pid_file=./some_pids
```
> The script supports comments in the pid_file using `#`. PIDs may also contain URL-encoded characters (e.g., `%3A` for `:`), which are automatically decoded.
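
For illustration, a `pid_file` might look like this (the PIDs are hypothetical):

```
# objects from the migration batch
islandora:1234
islandora%3A5678  # decoded to islandora:5678
```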
#### Output
Exports all metadata entries related to the specified DSID into files stored in a per-DSID subdirectory of the output directory. Each file's name will be in the format `pid-DSID.ext`, with the extension guessed from the returned content type (typically `.xml`).

### Datastream Updater
#### Command
```bash
python3 datastream_updater.py --xml=input.xml --dsid=DSID --content=content.bin --label='New Version' --output=output.xml
```
> This script allows you to specify the XML file to modify, the datastream ID, the binary content file (which will be base64 encoded), and optionally a label for the new datastream version.
The only optional argument is `--label`, for supplying a custom label for the new datastream version. If previous datastream versions have no label and none is given, the script will prompt for one.
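
As context for the encoding step, here is a minimal sketch of base64-encoding binary content for embedding in a FOXML `binaryContent` element (element name taken from the FOXML schema; this is not the script's exact code):

```python
import base64

def encode_content(path):
    # Read the binary file and return base64 text suitable for
    # embedding inside a <foxml:binaryContent> element.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```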

#### Output
Updates the specified XML file with a new version of the datastream, encoding the provided binary content into base64. The updated XML is saved to the specified output file.

55 changes: 55 additions & 0 deletions scripts/data_analysis.py
@@ -0,0 +1,55 @@
import argparse
import os
from utils import perform_http_request
from queries import queries


def parse_args():
parser = argparse.ArgumentParser(
description="Process SPARQL queries and save results."
)
parser.add_argument("--url", type=str, help="Fedora server URL", required=True)
parser.add_argument("--user", type=str, help="Fedora username", required=True)
parser.add_argument("--password", type=str, help="Fedora password", required=True)
parser.add_argument(
"--output_dir",
type=str,
default="./results",
help="Directory to save CSV files",
)
return parser.parse_args()


def save_to_csv(data, filename, output_dir):
"""
Save the given data to a CSV file.
Args:
data (str): The data to be written to the CSV file.
filename (str): The name of the CSV file.
output_dir (str): The directory where the CSV file will be saved.
Returns:
None
"""
os.makedirs(output_dir, exist_ok=True)
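    # The query result is already CSV-formatted text, so write it verbatim;
    # newline="" prevents extra line-ending translation on Windows.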
with open(os.path.join(output_dir, filename), "w", newline="") as file:
file.write(data)


def main():
args = parse_args()

for query_name, query in queries.items():
print(f"Processing query '{query_name}'...")
result = perform_http_request(query, args.url, args.user, args.password)
if result:
csv_filename = f"{query_name}.csv"
print(f"Saving results to {csv_filename}...\n")
save_to_csv(result, csv_filename, args.output_dir)
else:
print(f"Failed to retrieve data for query '{query_name}'.\n")


if __name__ == "__main__":
main()
137 changes: 137 additions & 0 deletions scripts/datastream_export.py
@@ -0,0 +1,137 @@
import argparse
import requests
from tqdm import tqdm
import concurrent.futures
import os
import mimetypes
from utils import perform_http_request


def parse_args():
parser = argparse.ArgumentParser(
description="Export metadata using SPARQL query and save as XML."
)
parser.add_argument("--url", required=True, help="Fedora base URL")
parser.add_argument("--user", required=True, help="Username for Fedora access")
parser.add_argument("--password", required=True, help="Password for Fedora access")
parser.add_argument("--dsid", required=True, help="Datastream ID for querying")
parser.add_argument(
"--output_dir", default="./output", help="Directory to save XML files"
)
parser.add_argument(
"--pid_file", type=str, help="File containing PIDs to process", required=False
)
return parser.parse_args()


def fetch_data(dsid, base_url, user, password, output_dir, obj_id):
"""
Fetches the datastream content for a given datastream ID (dsid) and object ID (obj_id) from a Fedora repository.
Args:
dsid (str): The ID of the datastream to fetch.
base_url (str): The base URL of the Fedora repository.
user (str): The username for authentication.
password (str): The password for authentication.
output_dir (str): The directory where the fetched data will be saved.
obj_id (str): The ID of the object that contains the datastream.
Returns:
bool: True if the datastream content was successfully fetched and saved, False otherwise.
"""
obj_id = obj_id.replace("info:fedora/", "")
url = f"{base_url}/fedora/objects/{obj_id}/datastreams/{dsid}/content"
print(f"Downloading {dsid} for PID: {obj_id}")
try:
response = requests.get(url, auth=(user, password))
response.raise_for_status()
dsid_dir = os.path.join(output_dir, dsid)
os.makedirs(dsid_dir, exist_ok=True)
content_type = response.headers.get("Content-Type", "")
extension = mimetypes.guess_extension(content_type) if content_type else ""
filename = f"{obj_id}-{dsid}{extension}"
with open(os.path.join(dsid_dir, filename), "wb") as f:
f.write(response.content)
print(f"Successfully saved {filename}")
return True
except Exception as e:
print(f"Failed to fetch data for {obj_id}, error: {str(e)}")
return False


def process_pid_file(filepath):
"""
Process a file containing PIDs (Persistent Identifiers) and return a list of PIDs.
Supports comments in the file using '#' character.
Replace '%3A' with ':' in PIDs.
Args:
filepath (str): The path to the file containing PIDs.
Returns:
list: A list of PIDs extracted from the file.
"""
pids = []
with open(filepath, "r") as file:
for line in file:
line = line.strip()
if "#" in line:
line = line[: line.index("#")].strip()
if line:
line = line.replace("%3A", ":")
pids.append(line)
return pids


def main():
args = parse_args()
os.makedirs(args.output_dir, exist_ok=True)

object_ids = []

# If a PID file is provided, process the file to get the list of PIDs.
if args.pid_file:
object_ids = process_pid_file(args.pid_file)
else:
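        # No PID file given: query the resource index for every object that
        # disseminates a datastream with the requested DSID, excluding the
        # system content-model objects via the FILTER clauses.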
query = f"""
SELECT ?obj WHERE {{
?obj <fedora-model:hasModel> <info:fedora/fedora-system:FedoraObject-3.0>;
<fedora-model:hasModel> ?model;
<fedora-view:disseminates> ?ds.
?ds <fedora-view:disseminationType> <info:fedora/*/{args.dsid}>
FILTER(!sameTerm(?model, <info:fedora/fedora-system:FedoraObject-3.0>))
FILTER(!sameTerm(?model, <info:fedora/fedora-system:ContentModel-3.0>))
}}
"""

        result = perform_http_request(query, args.url, args.user, args.password)
        # Guard against a failed request; skip the CSV header row on success.
        if result:
            object_ids.extend(result.strip().split("\n")[1:])

# Download metadata for each PID in parallel using ThreadPoolExecutor.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor, tqdm(
total=len(object_ids), desc="Downloading Metadata"
) as progress:
futures = {
executor.submit(
fetch_data,
args.dsid,
args.url,
args.user,
args.password,
args.output_dir,
obj_id,
): obj_id
for obj_id in object_ids
}
for future in concurrent.futures.as_completed(futures):
obj_id = futures[future]
try:
success = future.result()
if success:
progress.update(1)
except Exception as exc:
print(f"{obj_id} generated an exception: {exc}")


if __name__ == "__main__":
main()