Add_Datasets
Lorenz Kapsner edited this page Mar 11, 2022 · 4 revisions
A tutorial that demonstrates in detail the necessary steps for applying DQAstats
to the publicly available SHIP-dataset is provided here.
The following steps are required to configure the DQA-tool for testing a new dataset:
- Define a so-called 'utilities' folder, which contains all the configuration files specific to your dataset.
  - The folder does not need to be integrated into an R-package (even though this is the case in the two examples below).
  - The folder can be named arbitrarily.
  - At least two subfolders need to be placed inside this 'utilities' folder:
    - 'MDR' folder: contains the `mdr.csv` file with the metadata repository (MDR)
    - 'RMD' folder: contains the `DQA_report.Rmd` Rmarkdown file
  - If you want to connect to SQL-databases, another subfolder named 'SQL' is required. It contains one JSON-file per database, which holds the SQL-statements for each data element to be tested by the DQA-tool.
  - Furthermore, customizations for the `DQAgui` go into the subfolder 'MISC':
    - `email.yml`: lets you define a default email address to which some basic DQA-report results can be sent directly from the Web-GUI
    - `sitenames.JSON`: lets you define 'sitenames', which is helpful when working in a federated research network.
- Examples:
  - Demo-folder of R-package DQAstats
  - Configuration of the MIRACUM-consortium's research IT-environment: miRacumDQA
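The folder layout described above can be scaffolded with a small helper script. The following is only a minimal sketch in Python, not part of DQAstats: the subfolder and file names come from the list above, while the function names `scaffold_utilities` and `missing_required_files` are hypothetical and chosen for illustration.

```python
import tempfile
from pathlib import Path

# Subfolder names are taken from the description above; the function
# names and the demo location are hypothetical.
REQUIRED_SUBFOLDERS = ("MDR", "RMD")


def scaffold_utilities(root, with_sql=False, with_misc=False):
    """Create the 'utilities' folder skeleton and return its path."""
    root = Path(root)
    names = list(REQUIRED_SUBFOLDERS)
    if with_sql:
        names.append("SQL")   # only needed when testing SQL-databases
    if with_misc:
        names.append("MISC")  # only needed for DQAgui customizations
    for name in names:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def missing_required_files(root, mdr_filename="mdr.csv"):
    """List the required files that are not yet present in the folder."""
    root = Path(root)
    required = [root / "MDR" / mdr_filename, root / "RMD" / "DQA_report.Rmd"]
    return [p.relative_to(root).as_posix() for p in required if not p.is_file()]


if __name__ == "__main__":
    # Demo in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        utils = scaffold_utilities(Path(tmp) / "utilities", with_sql=True)
        print(sorted(p.name for p in utils.iterdir()))
        print(missing_required_files(utils))
```

The script only creates empty subfolders; the MDR file and the report template still have to be supplied by hand, which is why the second function reports what is missing.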
- Fill the folder with life:
  - Create the file `MDR/mdr.csv` for your database(s) to be tested:
    - Please see the MDR-manual for further detailed information.
    - Please also find the minimum required MDR-fields (= CSV-file column-names) there.
  - Copy/paste the RMD folder to your 'utilities' folder and customize the report template `RMD/DQA_report.Rmd`, if necessary.
  - If you want to test SQL-databases, define the SQL-statements for each data element to be tested and store them as key-value-pairs in a single JSON-file inside the `SQL` folder.
    - The JSON-file must be named according to the pattern `SQL_{...}.JSON`, where `{...}` is the respective 'source_system_name' of the database also used in the `mdr.csv`.
    - The 'keys' for each data element in the JSON-file must match the respective 'key' from the `mdr.csv`.
    - 💡 For convenience, you can build your own Python scripts to create these 'SQL' JSON-files.
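As an illustration of the Python helper scripts mentioned above, the sketch below writes the key-value-pairs to a file named after the `SQL_{...}.JSON` pattern and then checks the keys against the MDR. It rests on several assumptions: the MDR is read as a plain CSV with columns named `key` and `source_system_name` (the two names used above), the delimiter is passed in explicitly because it may differ in your file, and both function names are hypothetical.

```python
import csv
import json
from pathlib import Path


def write_sql_json(utils_path, source_system_name, statements):
    """Write {key: SQL statement} pairs to SQL/SQL_<source_system_name>.JSON."""
    sql_dir = Path(utils_path) / "SQL"
    sql_dir.mkdir(parents=True, exist_ok=True)
    target = sql_dir / f"SQL_{source_system_name}.JSON"
    with target.open("w", encoding="utf-8") as fh:
        json.dump(statements, fh, indent=2, ensure_ascii=False)
    return target


def keys_missing_in_mdr(json_file, mdr_file, source_system_name, delimiter=","):
    """Return the JSON keys that have no matching 'key' in the MDR.

    Only MDR rows whose 'source_system_name' equals the given name are
    considered. The column names and the delimiter are assumptions.
    """
    with open(json_file, encoding="utf-8") as fh:
        sql_keys = set(json.load(fh))
    with open(mdr_file, newline="", encoding="utf-8") as fh:
        mdr_keys = {
            row["key"]
            for row in csv.DictReader(fh, delimiter=delimiter)
            if row["source_system_name"] == source_system_name
        }
    return sorted(sql_keys - mdr_keys)
```

A call such as `write_sql_json("path/to/custom/utilities", "my_database", {"some_key": "SELECT ..."})` would then produce `SQL/SQL_my_database.JSON`; the system name, key, and statement in this call are placeholders. An empty list from `keys_missing_in_mdr` indicates that every key in the JSON-file has a counterpart in the MDR.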
- To launch the DQA-tool with your customized 'utilities' folder, simply pass its path to the respective argument (`utils_path`) when starting the tool:
```r
# Load library DQAstats:
library(DQAstats)

# Set environment vars to demo files paths:
Sys.setenv("EXAMPLECSV_SOURCE_PATH" = system.file("demo_data", package = "DQAstats"))
Sys.setenv("EXAMPLECSV_TARGET_PATH" = system.file("demo_data", package = "DQAstats"))

# Set path to utilities folder where to find the mdr and template files:
utils_path <- "path/to/custom/utilities"

# Execute the DQA and generate a PDF report:
results <- DQAstats::dqa(
  source_system_name = "exampleCSV_source",
  target_system_name = "exampleCSV_target",
  utils_path = utils_path,
  mdr_filename = "mdr_example_data.csv",
  output_dir = "output/"
)
```