Feat/13 add json schema #16

Merged

Changes from all commits (25 commits):

- 57f0f5d feat(app): #13 Add json schema for reexports (AlexAxthelm)
- b278289 feat(package): #13 Change metadata schema to expect array (AlexAxthelm)
- 4187df0 feat(app): #13 Add metadata validation to directory processing (AlexAxthelm)
- b43443b feat(app): #14 Add metadata file export (AlexAxthelm)
- f7296ec test(app): #14 Add test checking for exported metadata file (AlexAxthelm)
- fc8e4e2 test(app): #14 Add test for validity of output files (AlexAxthelm)
- 000ce0f feat(app): #13 factor validation and serialization to function (AlexAxthelm)
- 1165b93 feat(app): Add schema validation to reexport_portfolio() (AlexAxthelm)
- a7207d9 feat(app): #13 Add schema validation to portfolio exports (AlexAxthelm)
- 359e084 chore(app): linting (AlexAxthelm)
- fc25d14 feat(app): Add system info to exported metadata (AlexAxthelm)
- 87ec644 add $id to schema (AlexAxthelm)
- 2c94503 Use more informative feild name (md5 vs digest) (AlexAxthelm)
- 5322ac7 Add length validation to arrays (AlexAxthelm)
- 56cd079 fix logging string (AlexAxthelm)
- 0006c1d Disallow additional properties (AlexAxthelm)
- 861967b add system_info to json schema (AlexAxthelm)
- bb34d9f Define minimum lengths for input entities (AlexAxthelm)
- b8c6018 Make package name expectation explicit (AlexAxthelm)
- b966a22 Version JSON Schema (AlexAxthelm)
- 309b8b4 Linting (AlexAxthelm)
- 7a55ea3 Add missing system dependencies (AlexAxthelm)
- 245e532 Simplify package splitting for dependencies (AlexAxthelm)
- acf0d8e Resolve package version issue (AlexAxthelm)
- f0f153b Update README (AlexAxthelm)
@@ -0,0 +1,39 @@
schema_serialize <- function(
  object,
  schema_file = system.file(
    "extdata", "schema", "parsedPortfolio_0-1-0.json",
    package = "workflow.portfolio.parsing"
  ),
  reference = NULL
) {
  sch <- jsonvalidate::json_schema[["new"]](
    schema = readLines(schema_file),
    strict = TRUE,
    engine = "ajv",
    reference = reference
  )
  json <- sch[["serialise"]](object)
  json_is_valid <- sch[["validate"]](json, verbose = TRUE)
  if (json_is_valid) {
    logger::log_trace("JSON is valid.")
  } else {
    json_errors <- attributes(json_is_valid)[["errors"]]
    logger::log_warn(
      "object could not be validated against ",
      "JSON schema: \"", schema_file, "\",",
      " reference: \"", reference, "\"."
    )
    logger::log_trace(
      logger::skip_formatter(paste("JSON string: ", json))
    )
    logger::log_trace("Validation errors:")
    for (i in seq(from = 1L, to = nrow(json_errors), by = 1L)) {
      logger::log_trace(
        "instancePath: ", json_errors[i, "instancePath"],
        " message: ", json_errors[i, "message"]
      )
    }
    warning("Object could not be validated against schema.")
  }
  return(json)
}
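The shape of `schema_serialize()` (serialize, validate against a schema, log every violation, but still return the JSON string) can be sketched in Python. This is a toy stand-in using only the standard library and a hand-rolled key/type check, NOT the repository's JSON Schema or the ajv engine:

```python
import json

# Toy "schema": required keys and their expected Python types. This is a
# hypothetical stand-in for illustration, not the repo's actual schema.
REQUIRED = {"input_filename": str, "input_md5": str, "input_entries": int}

def schema_serialize(obj):
    """Serialize obj to JSON, collecting validation errors like the R helper."""
    errors = []
    for key, expected_type in REQUIRED.items():
        if key not in obj:
            errors.append(f"missing required property: {key}")
        elif not isinstance(obj[key], expected_type):
            errors.append(f"{key}: expected {expected_type.__name__}")
    json_text = json.dumps(obj)
    # Like the R helper: report every error, but still return the JSON.
    for msg in errors:
        print("validation error:", msg)
    return json_text, errors
```

The key design point mirrored here is that validation failure is non-fatal: the caller still receives the serialized output, with the problems surfaced through logging.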
@@ -0,0 +1,45 @@
get_system_info <- function() {
  logger::log_trace("Getting system information")
  package <- getPackageName()
  version <- as.character(packageVersion(package))
  logger::log_trace("Package: ", package, " version: ", version)
  raw_deps <- trimws(
    strsplit(
      x = packageDescription(package)[["Imports"]],
      split = ",",
      fixed = TRUE
    )[[1L]]
  )
  deps <- trimws(
    gsub(
      x = raw_deps,
      pattern = "\\s+\\(.*\\)",
      replacement = ""
    )
  )
  deps_version <- as.list(
    lapply(
      X = deps,
      FUN = function(x) {
        list(
          package = x,
          version = as.character(packageVersion(x))
        )
      }
    )
  )

  return(
    list(
      timestamp = format(
        Sys.time(),
        format = "%Y-%m-%dT%H:%M:%SZ",
        tz = "UTC"
      ),
      package = package,
      packageVersion = version,
      RVersion = as.character(getRversion()),
      dependencies = deps_version
    )
  )
}
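The dependency-parsing step in `get_system_info()` (split the DESCRIPTION `Imports` field on commas, then strip version constraints like `(>= 1.3.0)`) is the fiddly part. A Python sketch of the same two-step transformation; the sample `Imports` string below is made up for illustration:

```python
import re

def parse_imports(imports_field):
    """Split a DESCRIPTION 'Imports' field into bare package names,
    mirroring the strsplit()/trimws()/gsub() steps in get_system_info()."""
    # Step 1: split on commas and trim surrounding whitespace/newlines.
    raw = [part.strip() for part in imports_field.split(",")]
    # Step 2: drop version constraints such as " (>= 1.3.0)".
    return [re.sub(r"\s+\(.*\)", "", part).strip() for part in raw]

# Hypothetical Imports field, formatted the way DESCRIPTION files wrap it.
field = "digest,\n    jsonvalidate (>= 1.3.0),\n    logger"
print(parse_imports(field))  # -> ['digest', 'jsonvalidate', 'logger']
```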
@@ -2,4 +2,107 @@

[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)

The docker image defined by this repo accepts a directory of portfolios (mounted to `/mnt/input/`, see [Inputs](#inputs)) and exports sanitized versions of those portfolios, ready for further processing via PACTA, to `/mnt/output` (see [Outputs](#outputs)).

## Docker Image

The intended method for invoking this workflow is with a Docker container defined by the image in the [Dockerfile](Dockerfile).
GitHub Actions builds the official image, which is available at: `ghcr.io/rmi-pacta/workflow.portfolio.parsing:main`

Running the workflow from a docker image requires mounting an input and output directory:

```sh
# note that the input mount can have a readonly connection
# You can set logging verbosity via the LOG_LEVEL envvar
docker run --rm \
  --mount type=bind,source="$(pwd)"/input,target=/mnt/input,readonly \
  --mount type=bind,source="$(pwd)"/output,target=/mnt/output \
  --env LOG_LEVEL=TRACE \
  ghcr.io/rmi-pacta/workflow.portfolio.parsing:pr16
```

The container will process any files in the `input` directory, and export any valid portfolios along with a metadata file (see [Outputs](#outputs), below).

## Metadata file (`processed_portfolios.json`)

Along with portfolios (in a standardized `csv` format), the parser exports a metadata file describing the parsed inputs and the exported portfolio files.
The file is in JSON format, and validates against a [JSON Schema in this repository](inst/extdata/schema/parsedPortfolio_0-1-0.json).

The file is an array of objects. Each top-level object in the array describes one input file; the files exported from that input are listed in an array under that object's `portfolios` key.

A simple example of the output file:

```jsonc
[
  {
    "input_filename": "simple.csv",
    "input_md5": "8e84d71c0f3892e34e0d9342cfc91a4d",
    "system_info": {
      "timestamp": "2024-01-31T19:11:56Z",
      "package": "workflow.portfolio.parsing",
      "packageVersion": "0.0.0.9000",
      "RVersion": "4.3.2",
      "dependencies": [
        {
          "package": "digest",
          "version": "0.6.33"
        }
        // ... array elided
      ]
    },
    "input_entries": 1,
    "group_cols": [],
    "subportfolios_count": 1,
    "portfolios": [
      {
        "output_md5": "0f51946d64ef6ee4daca1a6969317cba",
        "output_filename": "be1e7db9-3d7c-4978-9c1c-4eba4ad2cff5.csv",
        "output_rows": 1
      }
    ]
  }
]
```

Note that an input file object may have an `errors` key (mutually exclusive with the `portfolios` key), or a `warnings` key (which can coexist with both `portfolios` and `errors`), indicating a processing error or warning.
The `errors` object will be an array of messages which are suitable for presentation to end users.

Here is an example `jq` query to see a simple mapping between input and output files:

```sh
jq '
  .[] |
  [{
    input_file: .input_filename,
    output_file: .portfolios[].output_filename
  }]
' output/processed_portfolios.json
```
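The same input-to-output mapping as the `jq` query can be built in Python. The inline sample below is a truncated, hypothetical version of the metadata file, keeping only the keys the mapping needs:

```python
import json

# Truncated, hypothetical sample of processed_portfolios.json; a real file
# carries additional keys (input_md5, system_info, and so on).
metadata = json.loads("""
[
  {
    "input_filename": "simple.csv",
    "portfolios": [
      {"output_filename": "be1e7db9-3d7c-4978-9c1c-4eba4ad2cff5.csv", "output_rows": 1}
    ]
  }
]
""")

# One row per (input file, exported portfolio) pair, like the jq query.
mapping = [
    {"input_file": entry["input_filename"], "output_file": p["output_filename"]}
    for entry in metadata
    # entries that failed with an "errors" key carry no "portfolios" key
    for p in entry.get("portfolios", [])
]
print(mapping)
```

Using `entry.get("portfolios", [])` rather than `entry["portfolios"]` keeps the mapping robust for inputs that failed to parse and therefore have an `errors` key instead.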

## R Package

This repo defines an R package, `{workflow.portfolio.parsing}`.
The R package structure allows for easy management of dependencies and tests, and for access to package files (such as the [JSON Schema](inst/extdata/schema/parsedPortfolio_0-1-0.json)).
Because local use of the R package is not the primary use case, running it locally (beyond development) is technically unsupported, but should not pose any issues.

The package exports several functions, but the main entrypoint is `process_directory()`. When called with default arguments, it works as intended for use with the docker image from this repo.

## Inputs

This workflow reads files from a directory (by convention mounted in the docker container as `/mnt/input`).
The files must be plain csv files (though they do not need to have a `csv` file extension), parsable by [`pacta.portfolio.import::read_portfolio_csv()`](https://rmi-pacta.github.io/pacta.portfolio.import/reference/read_portfolio_csv.html).
The workflow will attempt to parse other files in that directory, but will emit warnings for them.
The workflow will not recurse into subdirectories.

## Outputs

This workflow writes files to a directory (by convention mounted in the docker container as `/mnt/output`):

- `*.csv`: csv files that contain portfolio data, with columns and column names standardized, 1 portfolio per file.
- `processed_portfolios.json`: A JSON file with metadata about each input file, including file hashes and summary information for both input and output files.
  This file validates against the JSON Schema defined in this repository ([found here](inst/extdata/schema/parsedPortfolio_0-1-0.json)).