Commit `c1eb474` (1 parent: `c7e2514`), showing 11 changed files with 598 additions and 213 deletions.
**New file** `.gitignore` (37 lines):

```
# Python
*.py[cod]
__pycache__/
venv/
*.pyc

# Jupyter Notebook
.ipynb_checkpoints/

# IDEs
.idea/
.vscode/

# Compiled files
*.pyd
*.pyo
*.pyw
*.pyz
*.pyzw

# Distribution / packaging
dist/
build/
*.egg-info/
*.egg

# Environment
.env
.env.*

# Script Outputs
results/*
output/*
*.xml

# Etc
.DS_Store
```
**Modified file** `README.md`:
# FCREPO3 Analysis Helpers

## Introduction
Tools to analyse and export metadata from an FCREPO3 instance using Python scripts.

## Table of Contents

* [Setup](#setup)
* [Features](#features)
* [Usage](#usage)

## Setup
These tools are designed to be run with a Python environment. Ensure Python 3.6 or higher is installed on your system. You will need to set up a Python virtual environment and install the required packages:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

The scripts also require an FCREPO3 instance. If these tools are run on a system separate from where the repository is hosted, modifications might be necessary in the `fedora-xacml-policies` directory at `$FEDORA_HOME/data/fedora-xacml-policies`.
## Features

### Metadata Analysis
Python scripts that perform the following:
1. Count all objects in the repository.
2. Provide a breakdown of objects by content models (`models.csv`).
3. Output a breakdown of unique datastream IDs (`dsids.csv`).

### Metadata Export
Scripts to export all objects within the repository that contain a specified metadata datastream ID, saving results as XML.
## Usage

### Metadata Analysis
#### Command
```bash
python3 data_analysis.py --url=http://your-fedora-url --user=admin --password=secret --output_dir=./results
```

#### Output
Exports all queries found in `queries.py` to their own CSV in the `results` folder by default. This can be changed with the `--output_dir` flag.
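The `queries.py` file referenced above is not included in this commit view; `data_analysis.py` (later in this diff) imports a dict named `queries` from it and writes each entry's result to `<name>.csv`. A hedged sketch of its expected shape, with illustrative queries only (the repository's real queries are not shown, and aggregate forms such as `COUNT` depend on the triplestore backing Fedora's Resource Index):

```python
# Illustrative sketch of queries.py: a dict mapping a query name
# (which becomes the CSV filename) to a SPARQL string. These are
# NOT the repository's actual queries.
queries = {
    "total_objects": """
        SELECT (COUNT(?obj) AS ?total) WHERE {
            ?obj <fedora-model:hasModel> <info:fedora/fedora-system:FedoraObject-3.0> .
        }
    """,
    "models": """
        SELECT ?model (COUNT(?obj) AS ?count) WHERE {
            ?obj <fedora-model:hasModel> ?model .
        }
        GROUP BY ?model
    """,
}
```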
### Metadata Export
#### Command
```bash
python3 datastream_export.py --url=http://your-fedora-url:8080 --user=admin --password=secret --dsid=DSID --output_dir=./output --pid_file=./some_pids
```
> The script supports comments in the pid_file using `#`. PIDs may also contain URL-encoded characters (e.g., `%3A` for `:`), which are decoded automatically.

#### Output
Exports all metadata entries related to the specified DSID into XML files stored in the defined output directory. Each file's name is in the format `pid-DSID.xml`.
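The pid_file format described above can be illustrated with a short example (the PIDs here are made up):

```
# Objects to export, one PID per line
islandora:12          # plain PID
islandora%3A42        # URL-encoded ':' is decoded to islandora:42
```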
### Datastream Updater
#### Command
```bash
python3 datastream_updater.py --xml=input.xml --dsid=DSID --content=content.bin --label='New Version' --output=output.xml
```
> This script lets you specify the XML file to modify, the datastream ID, the binary content file (which will be base64 encoded), and optionally a label for the new datastream version.

The only optional argument is `--label`, for when you want to give the new version a custom label. If previous datastream versions have no label and none is supplied, the script prompts for one.

#### Output
Updates the specified XML file with a new version of the datastream, encoding the provided binary content into base64. The updated XML is saved to the specified output file.
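The base64 step described above can be sketched in isolation. `datastream_updater.py` itself is not shown in this commit, so the function below illustrates the described behaviour (inline FOXML datastream versions carry their bytes as base64 text) rather than the script's actual code:

```python
import base64


def encode_datastream_content(data: bytes) -> str:
    """Base64-encode raw datastream bytes for embedding as inline XML content.

    Illustrative only: the name and signature are ours, not the updater's.
    """
    return base64.b64encode(data).decode("ascii")


print(encode_datastream_content(b"hi"))  # → aGk=
```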
**New file** `data_analysis.py` (55 lines):

```python
import argparse
import os

from utils import perform_http_request
from queries import queries


def parse_args():
    parser = argparse.ArgumentParser(
        description="Process SPARQL queries and save results."
    )
    parser.add_argument("--url", type=str, help="Fedora server URL", required=True)
    parser.add_argument("--user", type=str, help="Fedora username", required=True)
    parser.add_argument("--password", type=str, help="Fedora password", required=True)
    parser.add_argument(
        "--output_dir",
        type=str,
        default="./results",
        help="Directory to save CSV files",
    )
    return parser.parse_args()


def save_to_csv(data, filename, output_dir):
    """
    Save the given data to a CSV file.

    Args:
        data (str): The data to be written to the CSV file.
        filename (str): The name of the CSV file.
        output_dir (str): The directory where the CSV file will be saved.

    Returns:
        None
    """
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, filename), "w", newline="") as file:
        file.write(data)


def main():
    args = parse_args()

    for query_name, query in queries.items():
        print(f"Processing query '{query_name}'...")
        result = perform_http_request(query, args.url, args.user, args.password)
        if result:
            csv_filename = f"{query_name}.csv"
            print(f"Saving results to {csv_filename}...\n")
            save_to_csv(result, csv_filename, args.output_dir)
        else:
            print(f"Failed to retrieve data for query '{query_name}'.\n")


if __name__ == "__main__":
    main()
```
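Both scripts in this commit import `perform_http_request` from a `utils` module whose source is not part of the visible hunks. A minimal sketch of what it plausibly looks like, assuming Fedora 3's Resource Index search interface (`/fedora/risearch`) with CSV output; the endpoint constant and both function bodies here are our guesses, not the repository's code:

```python
import requests

RISEARCH_PATH = "/fedora/risearch"  # Fedora 3's Resource Index query endpoint


def risearch_params(query, fmt="CSV"):
    """Build the form parameters for a SPARQL tuple query against risearch."""
    return {
        "type": "tuples",
        "lang": "sparql",
        "format": fmt,
        "query": query,
    }


def perform_http_request(query, base_url, user, password):
    """POST a SPARQL query to Fedora's risearch endpoint and return the CSV
    body, or None on failure (callers in this commit check for a falsy result)."""
    try:
        response = requests.post(
            base_url.rstrip("/") + RISEARCH_PATH,
            data=risearch_params(query),
            auth=(user, password),
        )
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Query failed: {e}")
        return None
```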
**New file** `datastream_export.py` (137 lines):

```python
import argparse
import concurrent.futures
import mimetypes
import os

import requests
from tqdm import tqdm

from utils import perform_http_request


def parse_args():
    parser = argparse.ArgumentParser(
        description="Export metadata using SPARQL query and save as XML."
    )
    parser.add_argument("--url", required=True, help="Fedora base URL")
    parser.add_argument("--user", required=True, help="Username for Fedora access")
    parser.add_argument("--password", required=True, help="Password for Fedora access")
    parser.add_argument("--dsid", required=True, help="Datastream ID for querying")
    parser.add_argument(
        "--output_dir", default="./output", help="Directory to save XML files"
    )
    parser.add_argument(
        "--pid_file", type=str, help="File containing PIDs to process", required=False
    )
    return parser.parse_args()


def fetch_data(dsid, base_url, user, password, output_dir, obj_id):
    """
    Fetch the datastream content for a given datastream ID (dsid) and object ID
    (obj_id) from a Fedora repository.

    Args:
        dsid (str): The ID of the datastream to fetch.
        base_url (str): The base URL of the Fedora repository.
        user (str): The username for authentication.
        password (str): The password for authentication.
        output_dir (str): The directory where the fetched data will be saved.
        obj_id (str): The ID of the object that contains the datastream.

    Returns:
        bool: True if the datastream content was successfully fetched and saved,
        False otherwise.
    """
    obj_id = obj_id.replace("info:fedora/", "")
    url = f"{base_url}/fedora/objects/{obj_id}/datastreams/{dsid}/content"
    print(f"Downloading {dsid} for PID: {obj_id}")
    try:
        response = requests.get(url, auth=(user, password))
        response.raise_for_status()
        dsid_dir = os.path.join(output_dir, dsid)
        os.makedirs(dsid_dir, exist_ok=True)
        # Strip header parameters (e.g. "; charset=UTF-8") before guessing:
        # mimetypes.guess_extension only understands a bare type/subtype,
        # and may return None, which would otherwise leak into the filename.
        content_type = response.headers.get("Content-Type", "").split(";")[0].strip()
        extension = (mimetypes.guess_extension(content_type) or "") if content_type else ""
        filename = f"{obj_id}-{dsid}{extension}"
        with open(os.path.join(dsid_dir, filename), "wb") as f:
            f.write(response.content)
        print(f"Successfully saved {filename}")
        return True
    except Exception as e:
        print(f"Failed to fetch data for {obj_id}, error: {e}")
        return False


def process_pid_file(filepath):
    """
    Process a file containing PIDs (persistent identifiers) and return a list
    of PIDs. Supports comments in the file using the '#' character and
    replaces '%3A' with ':' in PIDs.

    Args:
        filepath (str): The path to the file containing PIDs.

    Returns:
        list: A list of PIDs extracted from the file.
    """
    pids = []
    with open(filepath, "r") as file:
        for line in file:
            line = line.strip()
            if "#" in line:
                line = line[: line.index("#")].strip()
            if line:
                pids.append(line.replace("%3A", ":"))
    return pids


def main():
    args = parse_args()
    os.makedirs(args.output_dir, exist_ok=True)

    object_ids = []

    # If a PID file is provided, process the file to get the list of PIDs.
    if args.pid_file:
        object_ids = process_pid_file(args.pid_file)
    else:
        query = f"""
        SELECT ?obj WHERE {{
            ?obj <fedora-model:hasModel> <info:fedora/fedora-system:FedoraObject-3.0>;
                <fedora-model:hasModel> ?model;
                <fedora-view:disseminates> ?ds.
            ?ds <fedora-view:disseminationType> <info:fedora/*/{args.dsid}>
            FILTER(!sameTerm(?model, <info:fedora/fedora-system:FedoraObject-3.0>))
            FILTER(!sameTerm(?model, <info:fedora/fedora-system:ContentModel-3.0>))
        }}
        """

        result = perform_http_request(query, args.url, args.user, args.password)
        # Guard against a failed query before parsing; skip the CSV header row.
        if result:
            object_ids.extend(result.strip().split("\n")[1:])

    # Download metadata for each PID in parallel using ThreadPoolExecutor.
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor, tqdm(
        total=len(object_ids), desc="Downloading Metadata"
    ) as progress:
        futures = {
            executor.submit(
                fetch_data,
                args.dsid,
                args.url,
                args.user,
                args.password,
                args.output_dir,
                obj_id,
            ): obj_id
            for obj_id in object_ids
        }
        for future in concurrent.futures.as_completed(futures):
            obj_id = futures[future]
            try:
                future.result()
            except Exception as exc:
                print(f"{obj_id} generated an exception: {exc}")
            finally:
                # Advance the bar for every completed future so it always
                # reaches its total, even when some downloads fail.
                progress.update(1)


if __name__ == "__main__":
    main()
```
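`fetch_data` names each saved file using `mimetypes.guess_extension` on the response's Content-Type. One subtlety worth isolating: the header value often carries parameters such as `; charset=UTF-8`, which `mimetypes` does not strip for you, and unknown types yield `None`. A small standalone helper demonstrating the behaviour (the helper name is ours):

```python
import mimetypes


def extension_for(content_type):
    """Map a Content-Type header value to a filename extension.

    mimetypes.guess_extension only accepts a bare type/subtype, so any
    header parameters are stripped first. Unknown or empty types yield
    an empty string rather than None.
    """
    bare = content_type.split(";")[0].strip()
    return (mimetypes.guess_extension(bare) or "") if bare else ""


print(extension_for("text/xml; charset=UTF-8"))  # → .xml
```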