feat: Action only runs if changes were made to the input files
elytvyno committed Dec 17, 2021
1 parent 62453a3 commit 086538f
Showing 5 changed files with 304 additions and 29 deletions.
48 changes: 32 additions & 16 deletions README.md
@@ -1,11 +1,11 @@
# rml-action

`rml-action` is a GitHub Action that converts a structured data source file (e.g. JSON, XML, CSV...) to RDF
[Resource Description Framework (RDF)](https://www.w3.org/RDF/). Multiple serialization formats are supported: `nquads` (default), `turtle`, `trig`, `trix`, `jsonld`, `hdt`.
`rml-action` is a GitHub Action that converts a structured data source file (e.g. JSON, XML, CSV...) to [Resource Description Framework (RDF)](https://www.w3.org/RDF/) rules.
Multiple serialization formats are supported: `nquads` (default), `turtle`, `trig`, `trix`, `jsonld`, `hdt`.

## Usage

Create a `.github/workflows/data.yaml` file in the repository where you want to fetch data. An example:
Create a `.github/workflows/data.yaml` file in the repository where you want to fetch and convert data. An example:

```yaml
name: Convert to RDF Workflow
@@ -26,45 +26,48 @@ jobs:
GLOBAL_PATTERN: "*.yml"
SERIALIZATION_FORMAT: turtle
OUTPUT_DIRECTORY: output
CONVERT_ALL: true
steps:
# Checks-out your repository
- uses: actions/checkout@v2

- name: Creates an output directory for RDF files (if doesn't exist)
run: mkdir -p output
shell: bash

- name: Converts YARRRML rules to RDF
uses: RMLio/rml-action@main
uses: RMLio/rml-action@v1.0.0
with:
# the global pattern for all YARRRML mappings
global-pattern: ${{ env.GLOBAL_PATTERN }}
# serialization format is optional; default - "nquads"
serialization-format: ${{ env.SERIALIZATION_FORMAT }}
# the name of the directory where all the output files will be stored
output-directory: ${{ env.OUTPUT_DIRECTORY }}
# convert-all is optional; default - "false"
# if convert-all is "true", the action will always convert all the files to
# RDF based on the yarrrml-files provided by `GLOBAL_PATTERN`, even if no
# changes were detected
convert-all: ${{ env.CONVERT_ALL }}

# Push the generated RDF files to the repository
- name: Commit and push the output
run: |
git config --global user.name 'your_username'
git config --global user.email '[email protected]'
git add .
set +e
git status | grep "nothing to commit, working tree clean"
if [ $? -eq 0 ]; then set -e; echo "No changes since last run"; else set -e; \
if [ $? -eq 0 ]; then set -e; echo "INFO: No changes since last run"; else set -e; \
git commit -m "feat: convert to RDF with Github Actions"; git push; fi
shell: bash
```

If you are using the example that was provided above:
If you are using the example workflow that was provided above, make sure to update it as follows:

- Make sure to check whether the conditions to trigger the action are set properly (change the name of the branch(-es) if needed etc.).
- Configure the input parameters for the action (`GLOBAL_PATTERN`, `SERIALIZATION_FORMAT` and `OUTPUT_DIRECTORY`).
- Verify whether the conditions to trigger the action are set properly (change the name of the branch(-es) if needed etc.).
- Configure the environment variables for the input parameters for the action under `jobs` > `build` > `env` (`GLOBAL_PATTERN`, `SERIALIZATION_FORMAT`, `OUTPUT_DIRECTORY` and `CONVERT_ALL`).
- In the "Commit and push the output" step, replace `user.name` and `user.email` from the example with your GitHub username and email. You may also want to change the commit message that will be used to commit the files created by the action.

The `RMLio/rml-action` action will perform the following operations:

1. iterate over all files matching the provided global pattern (which are all expected to contain `YARRRML` rules)
1. iterate over all files matching the provided global pattern (which are all expected to contain `YARRRML` rules and have an extension `.yaml` or `.yml`)
2. convert `YARRRML` rules in all these files to RDF

**Note:** you need to follow the guidelines of the above workflow file example (step "Commit and push the output") to commit and push all of the generated data to your repository.
@@ -73,12 +76,25 @@ The `RMLio/rml-action` action will perform the following operations:

### `global-pattern`

The global pattern that matches all the mapping files that need to be converted.
The global pattern that matches all the mapping files that need to be converted (e.g. `"*.yml"`). The pattern has to be surrounded by quotes.
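
A small sketch of why the quotes matter once the pattern reaches the shell, assuming a throwaway directory with two hypothetical mapping files. In the workflow file the quotes also keep YAML from reading the leading `*` as an alias marker. The file names below are illustrative only and are not part of the action.

```bash
#!/bin/bash
# Sketch only: shows why the pattern handed to `find -name` must stay quoted.
# The files created here are hypothetical.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/mappings"
touch "$demo_dir/a.yml" "$demo_dir/mappings/b.yml"
cd "$demo_dir" || exit 1

# Quoted: find receives the literal pattern and applies it at every depth.
quoted_count=$(find . -name "*.yml" -type f | wc -l | tr -d ' ')

# Unquoted: the shell expands *.yml to a.yml first, so only ./a.yml matches.
# shellcheck disable=SC2061,SC2035
unquoted_count=$(find . -name *.yml -type f | wc -l | tr -d ' ')

echo "quoted: $quoted_count, unquoted: $unquoted_count"
```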

### `serialization-format` (optional)

The serialization format that needs to be used for convertion. Default: `nquads`.
The serialization format that needs to be used for conversion. Default: `nquads`. Possible values: `nquads`, `turtle`, `trig`, `trix`, `jsonld`, `hdt`.
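
Each format ends up as a different output file extension. The mapping below is taken from the `if`/`elif` chain in `converter.sh` in this commit, rewritten as a `case` statement for brevity; the function name `extension_for` is a hypothetical helper and does not exist in the action.

```bash
#!/bin/bash
# Sketch: serialization format -> output file extension, as in converter.sh.
# `extension_for` is a hypothetical helper name.
extension_for() {
  case "$1" in
    nquads) echo "nq" ;;
    turtle) echo "ttl" ;;
    trig)   echo "trig" ;;
    trix)   echo "xml" ;;
    jsonld) echo "jsonld" ;;
    hdt)    echo "hdt" ;;
    *)      echo "ERROR: Unsupported serialization format: $1" >&2
            return 1 ;;
  esac
}

extension_for turtle   # prints "ttl"
```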

### `output-directory`

The relative path from the root of your repository to a directory where the output files will be stored.
The relative path from the root of your repository to the directory where the output files will be stored. For example, `output` saves all the output files to a folder named `output` at the root of the repository, and `path_from_root/output_folder_name` saves them to that nested folder.
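
As a rough illustration of where a given mapping file ends up, the sketch below follows the naming used in `converter.sh` (`<name>_output.<extension>` inside the output directory). It uses the parameter expansion `${base%%.*}` in place of the script's `sed` call, and `output_path` is a hypothetical helper name. Since only the basename is kept, two mapping files with the same name in different directories would appear to share one output file.

```bash
#!/bin/bash
# Sketch: derive the output file path for one mapping file.
# `output_path` is a hypothetical helper; arguments are
#   $1 path to the mapping file, $2 output directory, $3 file extension.
output_path() {
  local mapping_path=$1 output_dir=$2 extension=$3
  local base name
  base=$(basename "$mapping_path")
  # Drop everything from the first dot onwards, like sed 's/\..*//'.
  name=${base%%.*}
  echo "${output_dir}/${name}_output.${extension}"
}

output_path mappings/people.yml output ttl   # prints "output/people_output.ttl"
```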

### `convert-all` (optional)

An indicator as to whether or not the conversion should be run for all files. Default: `false`. Possible values: `true`, `false`.
If `convert-all` is set to `true`, all files will be converted, even if no changes were detected.
If the meta folder of the action (`rml_action_meta`) or some file in that folder is not present (e.g. it was deleted), again, all files will be converted, even if no changes were made to the input files.

## Important remarks

- Don't remove the meta folder for this action (`rml_action_meta`). This folder is created when the action runs for the first time and contains the information that is relevant for it. Removing this folder won't cause any errors - it will just be created again, but this will result in a performance loss, since all the files will be converted again.
- Changes to the output folder are not detected. This means that if you remove a part of or all of the files that were already generated and are stored in the output folder, they will not be generated again by default. In this case, you might want to set `convert-all` to `true` to convert all the files once again.
- If some files (yarrrml-files or data source files) have been added/removed or renamed, the action will run for all the files (all of them will be converted).
- If some files (yarrrml-files or data source files) have been modified, the action will only convert the affected files: the yarrrml-files that were modified themselves, and the yarrrml-files whose data source files were modified.
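
The remarks above boil down to a small decision table. The sketch below restates it as a function; `decide_conversion` and its four flags are hypothetical names used for illustration, not something the action exposes.

```bash
#!/bin/bash
# Sketch: which files get converted, according to the remarks above.
# Arguments (hypothetical flags, each "true" or "false"):
#   $1 convert-all input
#   $2 meta folder and both checksum files are present
#   $3 files were added, removed or renamed
#   $4 contents of some files were modified
decide_conversion() {
  local convert_all=$1 meta_complete=$2 list_changed=$3 contents_changed=$4
  if [[ $convert_all == "true" || $meta_complete != "true" || $list_changed == "true" ]]; then
    echo "all"        # convert every file matched by the global pattern
  elif [[ $contents_changed == "true" ]]; then
    echo "affected"   # convert only the mappings touched by the change
  else
    echo "none"       # nothing changed, skip the conversion
  fi
}

decide_conversion false true false true   # prints "affected"
```
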
68 changes: 64 additions & 4 deletions action.yml
@@ -12,33 +12,93 @@ inputs:
output-directory:
description: "Output directory to store Linked Data"
required: true
convert-all:
description: "Convert all data to RDF (false (default), true)"
required: false
default: "false"
runs:
using: "composite"
steps:
- name: "Get all YARRRML filenames"
# create a directory with metadata if it does not exist
# create a directory for the output if it does not exist
# save all filenames (YARRRML) that will be used for conversion
run: |
mkdir -p rml_action_meta
mkdir -p ${{ inputs.output-directory }}
find . -name "${{ inputs.global-pattern }}" -type f | grep -v ".github/workflows" > \
rml_action_meta/filenames.txt
shell: bash

- name: "Setup Node"
uses: actions/setup-node@v2
with:
node-version: "14"

- name: "Get yarrrml-parser"
run: npm i -g @rmlio/yarrrml-parser
id: set-node-modules
run: |
NODE_MODULES_PATH=$(npm i -g @rmlio/yarrrml-parser | \
grep -o -m 1 ' -> .*node_modules' | sed 's/ -> //')
echo "::set-output name=NODE_MODULES_PATH::${NODE_MODULES_PATH}"
shell: bash

- name: "Map YARRRML mappings to data source files"
# export NODE_PATH to be able to use yarrrml-parser
# that was installed globally instead of installing it again
run: |
cd ${{ github.action_path }}
export NODE_PATH=${{ steps.set-node-modules.outputs.NODE_MODULES_PATH }}
chmod +x datasource_mapping.js
npm i n-readlines
npm i uuid
npm i yamljs
npm i path
cd ${{ github.workspace }}
node ${{ github.action_path }}/datasource_mapping.js
rm rml_action_meta/filenames.txt
shell: bash

- name: "Check filenames and contents"
# decide whether the action should be fully run or not
# (e.g. there's no need to convert the files if nothing
# was changed)
id: check-contents
run: |
chmod +x ${{ github.action_path }}/check_contents.sh
set +e
${{ github.action_path }}/check_contents.sh
[ $? == 1 ] && RUN_ACTION="true" || RUN_ACTION="false"
set -e
echo "::set-output name=RUN_ACTION::${RUN_ACTION}"
shell: bash
env:
WORKING_DIRECTORY: ${{ github.workspace }}
GLOBAL_PATTERN: ${{ inputs.global-pattern }}
CONVERT_ALL: ${{ inputs.convert-all }}
RUN_ACTION: "false"

- name: "Get rml-mapper"
run: curl -L https://github.com/RMLio/rmlmapper-java/releases/download/v4.12.0/rmlmapper.jar --output rmlmapper.jar
if: ${{ steps.check-contents.outputs.RUN_ACTION == 'true' }}
run: |
curl -L https://github.com/RMLio/rmlmapper-java/releases/download/v4.12.0/rmlmapper.jar \
--output rmlmapper.jar
shell: bash

- name: "Convert YARRRML rules to RDF for all files"
# Get all files that have to be converted and run the "converter.sh"-script on each file to convert YARRRML to RDF
if: ${{ steps.check-contents.outputs.RUN_ACTION == 'true' }}
# run the "converter.sh"-script to convert YARRRML rules to RDF rules
run: |
chmod +x ${{ github.action_path }}/converter.sh
find . -name "${{ inputs.global-pattern }}" -type f | ${{ github.action_path }}/converter.sh
${{ github.action_path }}/converter.sh
shell: bash
env:
INPUTS_OUTPUT_DIRECTORY: ${{ inputs.output-directory }}
WORKING_DIRECTORY: ${{ github.workspace }}
SERIALIZATION_FORMAT: ${{ inputs.serialization-format }}
CONVERT_ALL: ${{ inputs.convert-all }}

- name: "Remove the rml-mapper jar file"
if: ${{ steps.check-contents.outputs.RUN_ACTION == 'true' }}
run: rm rmlmapper.jar
shell: bash
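
Note on the "Check filenames and contents" step: it turns the exit status of `check_contents.sh` into a step output with `::set-output`, a workflow command that GitHub has since deprecated in favour of appending to the file named by `$GITHUB_OUTPUT`. A minimal sketch of the same pattern with the newer mechanism follows; `check` is a hypothetical stand-in for the script, and the temp-file fallback exists only so the snippet runs outside a runner.

```bash
#!/bin/bash
# Sketch: map a script's exit status to a step output via $GITHUB_OUTPUT.
# `check` is a hypothetical stand-in for check_contents.sh, where
# exit status 1 means "conversion needed" and 0 means "no changes".
check() { return 1; }

# Outside GitHub Actions the variable is unset, so fall back to a temp file.
GITHUB_OUTPUT="${GITHUB_OUTPUT:-$(mktemp)}"

# Capture the status without letting a non-zero exit abort the step.
status=0
check || status=$?

if [ "$status" -eq 1 ]; then
  RUN_ACTION="true"
else
  RUN_ACTION="false"
fi

echo "RUN_ACTION=${RUN_ACTION}" >> "$GITHUB_OUTPUT"
```

Later steps would still read the value as `steps.check-contents.outputs.RUN_ACTION`.
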
47 changes: 47 additions & 0 deletions check_contents.sh
@@ -0,0 +1,47 @@
#!/bin/bash

# if "convert_all" input parameter was set to TRUE,
# convert all files anyway (e.g. if one of the output files or the output folder and
# all the files were deleted and need to be recovered)
if [[ $CONVERT_ALL == "true" ]]
then
echo "INFO: convert-all is true => running the action"
exit 1
fi

cd $WORKING_DIRECTORY
meta_dir="rml_action_meta"

# "rml_action_meta" is the directory that contains the metadata for the action
# "yamlToDs.md5" is the checksum of a list of all mapping files used for conversion
# and the data source files that need to be converted
# "contents.md5" is the checksum of contents of all mapping files and all the data source files
# if both checksums exist and are correct, finish the action without conversion

if [[ ! -f $meta_dir/yamlToDs.md5 || ! -f $meta_dir/contents.md5 ]]
then
# one of the checksums is not present => run the action
echo "INFO: one of the checksums is not present => running the action"
exit 1
fi

# check if the checksum for the list of filenames hasn't changed
# if it has, some files have been added/removed or renamed,
# so the action should be run further (conversion)
md5sum --status --check $meta_dir/yamlToDs.md5
FIRST_CHECKSUM_RESULT=$?
# check if the checksum for the contents of mapping files and data source files
# is still the same; if it's not, run the action further (conversion)
md5sum --status --check $meta_dir/contents.md5
SECOND_CHECKSUM_RESULT=$?

if [[ $FIRST_CHECKSUM_RESULT == 0 && $SECOND_CHECKSUM_RESULT == 0 ]]
then
# there are no changes, don't run the action
echo "INFO: No changes, stopping the action"
exit 0
fi

# the action needs to be run in this case
echo "INFO: Changes detected: running the action"
exit 1
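
The script relies on the exit status of `md5sum --status --check`: zero when every listed file still matches its recorded checksum, non-zero otherwise. A self-contained sketch of that behaviour, assuming GNU coreutils and a hypothetical file `mapping.yml` in a temporary directory:

```bash
#!/bin/bash
# Sketch: checksum-based change detection with GNU md5sum.
# mapping.yml and contents.md5 are created in a throwaway directory.
work=$(mktemp -d)
cd "$work" || exit 1

echo "prefixes: {}" > mapping.yml
md5sum mapping.yml > contents.md5      # record the current checksum

# Nothing changed yet: --check succeeds and --status keeps it silent.
unchanged_status=0
md5sum --status --check contents.md5 || unchanged_status=$?

# Modify the file: the recorded checksum no longer matches.
echo "prefixes: {ex: 'http://example.com/'}" > mapping.yml
changed_status=0
md5sum --status --check contents.md5 || changed_status=$?

echo "unchanged: $unchanged_status, changed: $changed_status"
```
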
103 changes: 94 additions & 9 deletions converter.sh
@@ -1,28 +1,113 @@
#!/bin/bash

while read filepath
do
if [[ "$filepath" == *".github"* ]]; then
continue
cd $WORKING_DIRECTORY
meta_dir="rml_action_meta"

# a list of files that will be used for conversion
temp_filenames="$meta_dir/temp_filenames.txt"
> $temp_filenames

# if the action was called, then either 'yamlToDs.md5' or 'contents.md5' has changed,
# or one of them is not present
# if 'yamlToDs.md5' has changed, some files have been added/removed or renamed
# if 'contents.md5' has changed, the contents of some files listed in 'yamlToDs' have changed

# check if the checksum for "yamlToDs.txt" has changed
md5sum --status --check $meta_dir/yamlToDs.md5
FIRST_CHECKSUM_RESULT=$?

# if convert-all input parameter was set to true or there is no "yamlToDs.md5" or
# the checksum "yamlToDs.md5" has changed, convert all files that were given by the
# global pattern
if [[ $CONVERT_ALL == "true" || ! -f $meta_dir/yamlToDs.md5 || \
! -f $meta_dir/contents.md5 || $FIRST_CHECKSUM_RESULT != 0 ]]
then
echo "INFO: either convert-all is true or one of the checksums is not present \
or the checksum for the list of filenames does not match"
# add all the filenames to a list of files that will be used for conversion, because either
# some files have been added/removed/renamed or the checksum of the contents is not present
cat $meta_dir/yamlToDs.txt | cut -d ' ' -f1 > $temp_filenames
if [[ ! -f $meta_dir/yamlToDs.md5 || $FIRST_CHECKSUM_RESULT != 0 ]]
then
echo "INFO: the checksum for the list of filenames is not present or doesn't match"
# recalculate the checksum if it does not exist or has changed
md5sum $meta_dir/yamlToDs.txt > $meta_dir/yamlToDs.md5
fi
else
# the checksum "contents.md5" has changed
# get all files that have changed, save yaml files to a list of files that will be used for
# conversion, map data source files to the mapping files and then save these mapping files
# to the same list in `$temp_filenames`
echo "INFO: the second checksum (contents) doesn't match"
md5sum --check $meta_dir/contents.md5 | grep -F "FAILED" | cut -f 1 -d ":" > changed_files.txt
echo "INFO: changed files are:"
cat changed_files.txt
echo
egrep "*.yml|*.yaml" changed_files.txt >> $temp_filenames
egrep -v "*.yml|*.yaml" changed_files.txt | grep -F -f - $meta_dir/yamlToDs.txt | \
cut -d " " -f1 >> $temp_filenames
rm -f changed_files.txt
fi

# (re-)calculate the checksum for the contents
md5sum $(cat $meta_dir/yamlToDs.txt | tr -s ' ' '\n') > $meta_dir/contents.md5

# determine the correct extension for the output file based on the chosen serialization format
EXTENSION=""
if [[ $SERIALIZATION_FORMAT == "nquads" ]]
then
EXTENSION="nq"
elif [[ $SERIALIZATION_FORMAT == "turtle" ]]
then
EXTENSION="ttl"
elif [[ $SERIALIZATION_FORMAT == "trig" ]]
then
EXTENSION="trig"
elif [[ $SERIALIZATION_FORMAT == "trix" ]]
then
EXTENSION="xml"
elif [[ $SERIALIZATION_FORMAT == "jsonld" ]]
then
EXTENSION="jsonld"
elif [[ $SERIALIZATION_FORMAT == "hdt" ]]
then
EXTENSION="hdt"
else
echo "ERROR: Unsupported serialization format" >> /dev/stderr
exit 1
fi

# get rid of the duplicates (e.g. in case multiple data source files
# for the same mapping file were modified)
sort $temp_filenames | uniq > unique_filenames.txt
cp unique_filenames.txt $temp_filenames
rm -f unique_filenames.txt

echo "INFO: Files for conversion are:"
cat $temp_filenames

while read filepath
do
# get a basename from the path
FILE_BASENAME=$(basename $filepath)
# get a filename without an extension
FILENAME=$(echo "$FILE_BASENAME" | sed -e 's/\..*//')
# filename for the output file containing RDF
OUTPUT_FILENAME="${FILENAME}_output.ttl"
OUTPUT_FILENAME="${FILENAME}_output.${EXTENSION}"
# get a directory name from the path
# and go to that directory
FILE_DIRNAME=$(dirname $filepath)
cd $FILE_DIRNAME
# convert YARRRML rules to RML
# convert YARRRML rules to RML rules
yarrrml-parser -i $FILE_BASENAME -o $WORKING_DIRECTORY/temp_rml_rules.rml.ttl
# convert RML rules to RDF and save it to the output folder
# convert RML rules to RDF and save the result to the output folder
java -jar $WORKING_DIRECTORY/rmlmapper.jar -m $WORKING_DIRECTORY/temp_rml_rules.rml.ttl \
-o $WORKING_DIRECTORY/$INPUTS_OUTPUT_DIRECTORY/$OUTPUT_FILENAME -s $SERIALIZATION_FORMAT
cd $WORKING_DIRECTORY
done
done < $temp_filenames

# remove the temporary file with RML rules
rm -f $WORKING_DIRECTORY/temp_rml_rules.rml.ttl
rm -f temp_rml_rules.rml.ttl

# remove the temporary file with all the filenames for the action
rm -f $temp_filenames
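
The least obvious part of the script is the `else` branch, which maps changed data source files back to the mapping files that use them. The sketch below replays that step on made-up data. It assumes each line of `yamlToDs.txt` has the form `<mapping file> <data source> [<data source> ...]`, which is inferred from the `cut -d ' ' -f1` calls rather than confirmed, and it swaps the committed `egrep "*.yml|*.yaml"` for the anchored `grep -E '\.ya?ml$'`. All file names are hypothetical.

```bash
#!/bin/bash
# Sketch: find the mapping files affected by a set of changed files.
work=$(mktemp -d)
cd "$work" || exit 1

# Assumed format: first field is the mapping file, the rest its data sources.
cat > yamlToDs.txt <<'EOF'
./people.yml ./data/people.csv
./places.yml ./data/places.json ./data/regions.json
EOF

# Hypothetical list of files whose checksum no longer matches.
printf '%s\n' ./data/regions.json ./people.yml > changed_files.txt

# Changed mapping files are converted directly.
grep -E '\.ya?ml$' changed_files.txt > to_convert.txt

# Changed data sources are looked up in the map; field 1 names the mapping.
grep -Ev '\.ya?ml$' changed_files.txt \
  | grep -F -f - yamlToDs.txt \
  | cut -d ' ' -f1 >> to_convert.txt

# Deduplicate, as the script does with sort | uniq.
to_convert=$(sort -u to_convert.txt | paste -sd ' ' -)
echo "$to_convert"   # prints "./people.yml ./places.yml"
```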