
Commit

Update readme.
Remove unused files.
recalcitrantsupplant committed Jun 24, 2024
1 parent 73c91e6 commit b2f40e5
Showing 5 changed files with 13 additions and 141 deletions.
14 changes: 9 additions & 5 deletions Readme.md
@@ -1,22 +1,26 @@
-This docker image is used to generate TDB2 datasets used by Fuseki.
+## Overview
+This repository contains a Dockerfile to build an image which generates TDB2 datasets used by Fuseki.
It includes:

1. Validation of RDF files using Apache Jena RIOT.
   Files that fail validation are renamed with the suffix `.error` - this prevents tdbloader from attempting to load them.
2. Creation of TDB2 datasets using `tdb2.tdbloader` or `tdb2.xloader` (for large datasets)
3. Creation of a Spatial Index for use with Apache Jena GeoSPARQL
4. Addition of feature counts (via a `tdb2.update` SPARQL update) - this is specific to OGC-conformant datasets which contain `geo:Feature` instances, and will be made optional in future versions (though the command runs harmlessly otherwise).
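The rename-on-failure logic in step 1 can be sketched as a small shell function. The image's actual script is not shown in this commit, so the validator command is passed in as arguments - assume something like Jena's `riot --validate`:

```shell
# Rename files in DIR that fail validation to *.error so tdbloader skips them.
# Usage: validate_rdf_dir DIR VALIDATOR [ARGS...]
validate_rdf_dir() {
  dir=$1; shift
  for f in "$dir"/*; do
    [ -f "$f" ] || continue          # skip subdirectories
    if ! "$@" "$f" >/dev/null 2>&1; then
      mv "$f" "$f.error"             # failed validation: park it out of the loader's way
    fi
  done
}
```

For example, `validate_rdf_dir /rdf riot --validate` would mirror the behaviour described above.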

An additional set of instructions is also provided for running this image on an EC2 instance - note this has only been necessary for very large datasets.

Example command to build this image:
-`docker build -t tdb-generation .`
+`docker build -t tdb-generation:<tag> .`

Example command to run this image locally:
```
-docker run -v $(pwd)/output:/databases -v $(pwd)/data:/rdf tdb2-generation
+docker run -v $(pwd)/output:/databases -v $(pwd)/data:/rdf tdb2-generation:<tag>
```

Where:
- `$(pwd)/output` is the directory where the TDB2 databases will be created
- `$(pwd)/data` is the directory containing the RDF files to be loaded
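Behaviour is controlled through environment variables - the EC2 notes in this commit pass `DATASET`, `SKIP_VALIDATION`, `THREADS` and `USE_XLOADER`. A sketch of a local run setting these (the values shown are illustrative assumptions, not documented defaults):

```
docker run -v $(pwd)/output:/databases -v $(pwd)/data:/rdf \
  -e DATASET=mydataset -e SKIP_VALIDATION=false -e THREADS=4 -e USE_XLOADER=false \
  tdb2-generation:<tag>
```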

## Text indexing

Not currently supported - this could be supported by adding functionality to optionally include a mounted config.ttl file, which is required because the text index is not configurable via the command line.
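If that were added, usage might look like the following - the container-side path `/config.ttl` is purely hypothetical, since the feature does not exist yet:

```
docker run -v $(pwd)/config.ttl:/config.ttl -v $(pwd)/output:/databases -v $(pwd)/data:/rdf tdb2-generation:<tag>
```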
27 changes: 0 additions & 27 deletions bitbucket-pipelines.yml

This file was deleted.

102 changes: 0 additions & 102 deletions project_configs/idn-config.ttl

This file was deleted.

8 changes: 4 additions & 4 deletions running_ec2_to_generate_data.txt
@@ -5,14 +5,14 @@
0.5 ssh in and run `sudo apt update`
1. run `sudo apt install awscli -y`
2. run `aws configure` -> enter creds for AWS account
-3. run `aws ecr get-login-password --region ap-southeast-2 | docker login --username AWS --password-stdin 049648851863.dkr.ecr.ap-southeast-2.amazonaws.com`
-4. run `docker pull 049648851863.dkr.ecr.ap-southeast-2.amazonaws.com/tdb-generation:0.1.5`
+3. run `aws ecr get-login-password --region ap-southeast-2 | docker login --username AWS --password-stdin <ecr repo>`
+4. run `docker pull <ecr repo>/tdb-generation:<tag>`
5. if using an nvme disk, mount it, see: https://stackoverflow.com/questions/45167717/mounting-a-nvme-disk-on-aws-ec2
5.1 `lsblk`
5.2 `file -s /dev/nvme0n1`
5.3 `mkfs -t xfs /dev/nvme0n1`
5.4 `mkdir /data`
5.5 `mount /dev/nvme0n1 /data` (the same device formatted in step 5.3)
6. optional: run `aws s3 sync` manually to get the RDF onto the EC2 instance - if you skip this and make a mistake in the container command, you will need to re-download all of the S3 content (RDF) you are trying to process. If using an NVMe disk, it is suggested to download the data there and mount it into the container.
-e.g. `aws s3 sync s3://digital-atlas-rdf /data`
-7. run the processing, e.g.: `docker run -v /mnt/efs/fs1/:/newdb --mount type=bind,source=/data,target=/rdf -e AWS_ACCESS_KEY_ID=<key> -e AWS_SECRET_ACCESS_KEY=<secret> -e DATASET=fsdf -e SKIP_VALIDATION=true -e THREADS=47 -e USE_XLOADER=true 049648851863.dkr.ecr.ap-southeast-2.amazonaws.com/tdb-generation:0.1.5`
+e.g. `aws s3 sync s3://<my-bucket> /data`
+7. run the processing, e.g.: `docker run -v /mnt/efs/fs1/:/databases --mount type=bind,source=/data,target=/rdf -e DATASET=fsdf -e SKIP_VALIDATION=true -e THREADS=47 -e USE_XLOADER=true <ecr repo>/tdb-generation:<tag>`
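`THREADS=47` in step 7 looks like "all vCPUs minus one" on a 48-vCPU instance; assuming that is the intent, the value can be derived instead of hardcoded:

```shell
# Leave one vCPU free for the OS - an assumption about why 47 is used above.
THREADS=$(( $(nproc) - 1 ))
if [ "$THREADS" -lt 1 ]; then THREADS=1; fi
echo "THREADS=$THREADS"
```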
3 changes: 0 additions & 3 deletions select_feature_counts.sparql

This file was deleted.

