
Commit

Update readme.
Remove unused files.
recalcitrantsupplant committed Jun 24, 2024
1 parent 73c91e6 commit b2f40e5
Showing 5 changed files with 13 additions and 141 deletions.
14 changes: 9 additions & 5 deletions Readme.md
@@ -1,22 +1,26 @@
-This docker image is used to generate TDB2 datasets used by Fuseki.
+## Overview
+This repository contains a Dockerfile to build an image which generates TDB2 datasets used by Fuseki.
It includes:

1. Validation of RDF files using Apache Jena RIOT.
   Files that fail validation are renamed with the suffix `.error` - this prevents tdbloader from attempting to load them.
2. Creation of TDB2 datasets using `tdb2.tdbloader` or `tdb2.xloader` (for large datasets)
3. Creation of a Spatial Index for use with Apache Jena GeoSPARQL
4. Addition of feature counts (via a `tdb2.update` SPARQL update) - this is specific to OGC-conformant datasets which contain `geo:Feature` instances, and will be made optional in future versions (though the command runs harmlessly otherwise).
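The rename-on-failure logic in step 1 can be sketched as a small shell function. The image's actual script is not shown in this commit, so the validator command is passed in as arguments - assume something like Jena's `riot --validate`:

```shell
# Rename files in DIR that fail validation to *.error so tdbloader skips them.
# Usage: validate_rdf_dir DIR VALIDATOR [ARGS...]
validate_rdf_dir() {
  dir=$1; shift
  for f in "$dir"/*; do
    [ -f "$f" ] || continue          # skip subdirectories
    if ! "$@" "$f" >/dev/null 2>&1; then
      mv "$f" "$f.error"             # failed validation: park it out of the loader's way
    fi
  done
}
```

For example, `validate_rdf_dir /rdf riot --validate` would mirror the behaviour described above.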

An additional set of instructions is also provided for running this image on an EC2 instance - note this has only been necessary for very large datasets.

Example command to build this image:
-`docker build -t tdb-generation .`
+`docker build -t tdb-generation:<tag> .`

Example command to run this image locally:
```
-docker run -v $(pwd)/output:/databases -v $(pwd)/data:/rdf tdb2-generation
+docker run -v $(pwd)/output:/databases -v $(pwd)/data:/rdf tdb2-generation:<tag>
```

Where:
- `$(pwd)/output` is the directory where the TDB2 databases will be created
- `$(pwd)/data` is the directory containing the RDF files to be loaded
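Behaviour is controlled through environment variables - the EC2 notes in this commit pass `DATASET`, `SKIP_VALIDATION`, `THREADS` and `USE_XLOADER`. A sketch of a local run setting these (the values shown are illustrative assumptions, not documented defaults):

```
docker run -v $(pwd)/output:/databases -v $(pwd)/data:/rdf \
  -e DATASET=mydataset -e SKIP_VALIDATION=false -e THREADS=4 -e USE_XLOADER=false \
  tdb2-generation:<tag>
```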

## Text indexing

Not currently supported - this could be supported by adding functionality to optionally include a mounted config.ttl file, which is required because the text index is not configurable via the command line.
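If that were added, usage might look like the following - the container-side path `/config.ttl` is purely hypothetical, since the feature does not exist yet:

```
docker run -v $(pwd)/config.ttl:/config.ttl -v $(pwd)/output:/databases -v $(pwd)/data:/rdf tdb2-generation:<tag>
```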
27 changes: 0 additions & 27 deletions bitbucket-pipelines.yml

This file was deleted.

102 changes: 0 additions & 102 deletions project_configs/idn-config.ttl

This file was deleted.

8 changes: 4 additions & 4 deletions running_ec2_to_generate_data.txt
@@ -5,14 +5,14 @@
0.5 ssh in and run `sudo apt update`
1. run `sudo apt install awscli -y`
2. run `aws configure` -> enter creds for AWS account
-3. run `aws ecr get-login-password --region ap-southeast-2 | docker login --username AWS --password-stdin 049648851863.dkr.ecr.ap-southeast-2.amazonaws.com`
-4. run `docker pull 049648851863.dkr.ecr.ap-southeast-2.amazonaws.com/tdb-generation:0.1.5`
+3. run `aws ecr get-login-password --region ap-southeast-2 | docker login --username AWS --password-stdin <ecr repo>`
+4. run `docker pull <ecr repo>/tdb-generation:<tag>`
5. if using an nvme disk, mount it, see: https://stackoverflow.com/questions/45167717/mounting-a-nvme-disk-on-aws-ec2
5.1 `lsblk`
5.2 `file -s /dev/nvme0n1`
5.3 `mkfs -t xfs /dev/nvme0n1`
5.4 `mkdir /data`
5.5 `mount /dev/nvme0n1 /data` (the same device formatted in step 5.3)
6. optional: run `aws s3 sync` manually to get the RDF onto the EC2 instance - if you skip this and make a mistake in the container command, you will need to re-download all of the S3 content (RDF) you are trying to process. If using an NVMe disk, it is suggested to download the data there and mount it into the container.
-e.g. `aws s3 sync s3://digital-atlas-rdf /data`
-7. run the processing, e.g.: `docker run -v /mnt/efs/fs1/:/newdb --mount type=bind,source=/data,target=/rdf -e AWS_ACCESS_KEY_ID=<key> -e AWS_SECRET_ACCESS_KEY=<secret> -e DATASET=fsdf -e SKIP_VALIDATION=true -e THREADS=47 -e USE_XLOADER=true 049648851863.dkr.ecr.ap-southeast-2.amazonaws.com/tdb-generation:0.1.5`
+e.g. `aws s3 sync s3://<my-bucket> /data`
+7. run the processing, e.g.: `docker run -v /mnt/efs/fs1/:/databases --mount type=bind,source=/data,target=/rdf -e DATASET=fsdf -e SKIP_VALIDATION=true -e THREADS=47 -e USE_XLOADER=true <ecr repo>/tdb-generation:<tag>`
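`THREADS=47` in step 7 looks like "all vCPUs minus one" on a 48-vCPU instance; assuming that is the intent, the value can be derived instead of hardcoded:

```shell
# Leave one vCPU free for the OS - an assumption about why 47 is used above.
THREADS=$(( $(nproc) - 1 ))
if [ "$THREADS" -lt 1 ]; then THREADS=1; fi
echo "THREADS=$THREADS"
```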
3 changes: 0 additions & 3 deletions select_feature_counts.sparql

This file was deleted.

