Add_Datasets

Configuration of New Datasets

A detailed tutorial that walks through the steps required to apply DQAstats to the publicly available SHIP-dataset is provided here.

The following steps are required to configure the DQA-tool for testing a new dataset:

Define a so-called 'utilities'-folder that contains all configuration files specific to your dataset.
- The folder does not need to be integrated into an R-package (even though this is the case in the two examples below).
- The folder can be named arbitrarily.
- At least two subfolders must be placed inside this 'utilities'-folder:
  - 'MDR'-folder: contains the mdr.csv-file with the metadata repository (MDR)
  - 'RMD'-folder: contains the DQA_report.Rmd Rmarkdown file
- If you want to connect to SQL-databases, an additional subfolder named 'SQL' is required. It contains one JSON-file per database, which holds the SQL-statements for all data elements to be tested by the DQA-tool.
- Furthermore, customizations for the DQAgui go into the subfolder 'MISC':
  - email.yml: optionally defines a default email address to which some basic DQA-report results can be sent directly from the Web-GUI
  - sitenames.JSON: optionally defines 'sitenames', which is helpful when working in a federated research network
Examples:
- Demo-folder of R-package DQAstats
- Configuration of the MIRACUM-consortium's research IT-environment: miRacumDQA
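The folder layout described above can be sketched with a small helper script. This is only an illustration, not part of DQAstats: the function names `scaffold_utilities` and `missing_subfolders` are hypothetical, and the script creates only the empty subfolders, not the configuration files inside them.

```python
from pathlib import Path

# Subfolders that every 'utilities'-folder needs according to the list above.
REQUIRED = ("MDR", "RMD")


def scaffold_utilities(root, with_sql=False, with_misc=False):
    """Create the 'utilities'-folder skeleton and return the created paths.

    Hypothetical helper: 'SQL' is only needed for SQL-databases,
    'MISC' only for DQAgui customizations.
    """
    names = list(REQUIRED)
    if with_sql:
        names.append("SQL")
    if with_misc:
        names.append("MISC")
    created = []
    for name in names:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


def missing_subfolders(root):
    """Return the names of required subfolders that do not exist yet."""
    return [name for name in REQUIRED if not (Path(root) / name).is_dir()]
```

Calling `scaffold_utilities("path/to/custom/utilities", with_sql=True)` would create the MDR, RMD and SQL subfolders, and `missing_subfolders(...)` can serve as a quick sanity check before starting the tool.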
Populate the folder:
- Create the file MDR/mdr.csv for your database(s) to be tested:
  - Please see the MDR-manual for further detailed information.
  - Please also find the minimum required MDR-fields (= CSV-file column-names) there.
- Copy the RMD folder into your 'utilities'-folder and customize the report template RMD/DQA_report.Rmd, if necessary.
- If you want to test SQL-databases, define the SQL-statement for each data element to be tested and store the statements as key-value-pairs in a single JSON file per database inside the SQL-folder.
  - The JSON-file must be named according to the pattern SQL_{...}.JSON, where {...} is the respective 'source_system_name' of the database also used in the mdr.csv.
  - The 'keys' for each data element in the JSON-file must match the respective 'key' from the mdr.csv.
  - 💡 For convenience, you can write your own Python scripts to create these 'SQL' JSON-files.
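Such a script could look roughly like the following sketch. It assumes that mdr.csv has the columns 'key' and 'source_system_name' (as referenced above) and that the file is comma-separated; adjust the delimiter to your file. The function name `build_sql_json` and the SQL-statements you pass in are hypothetical placeholders.

```python
import csv
import json
from pathlib import Path


def build_sql_json(mdr_path, sql_dir, source_system_name, statements,
                   delimiter=","):
    """Write SQL_<source_system_name>.JSON with one statement per MDR key.

    `statements` maps each MDR 'key' to its SQL-statement. The delimiter
    default is an assumption; change it to match your mdr.csv.
    """
    with open(mdr_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle, delimiter=delimiter))

    # Keys that the MDR declares for this database.
    mdr_keys = {
        row["key"] for row in rows
        if row["source_system_name"] == source_system_name
    }

    # The JSON keys must match the MDR keys, so refuse any mismatch.
    missing = sorted(mdr_keys - set(statements))
    unknown = sorted(set(statements) - mdr_keys)
    if missing or unknown:
        raise ValueError(
            f"missing statements: {missing}; keys not in MDR: {unknown}"
        )

    # File name follows the pattern SQL_{...}.JSON described above.
    out_path = Path(sql_dir) / f"SQL_{source_system_name}.JSON"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(statements, indent=2), encoding="utf-8")
    return out_path
```

Checking the keys against the MDR before writing surfaces naming mismatches early, rather than when the DQA-tool is run.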

To launch the DQA-tool with your customized 'utilities'-folder, pass its path to the respective argument (utils_path) when starting the tool:

# Load library DQAstats:
library(DQAstats)

# Set environment vars to demo files paths:
Sys.setenv("EXAMPLECSV_SOURCE_PATH" = system.file("demo_data",
                                                package = "DQAstats"))
Sys.setenv("EXAMPLECSV_TARGET_PATH" = system.file("demo_data",
                                                package = "DQAstats"))
# Set path to utilities folder where to find the mdr and template files:
utils_path <- "path/to/custom/utilities"

# Execute the DQA and generate a PDF report:
results <- DQAstats::dqa(
  source_system_name = "exampleCSV_source",
  target_system_name = "exampleCSV_target",
  utils_path = utils_path,
  mdr_filename = "mdr_example_data.csv",
  output_dir = "output/"
)
