Add_Datasets

Lorenz Kapsner edited this page Mar 11, 2022 · 4 revisions

Configuration of New Datasets

A tutorial that demonstrates in detail the necessary steps for applying DQAstats to the publicly available SHIP-dataset is provided here.

The following steps are required to configure the DQA-tool for testing a new dataset:

  1. Define a so-called 'utilities'-folder, which contains all the configuration files specific to your dataset.

    • The folder does not need to be integrated into an R-package (even though this is the case in the two examples below).
    • The folder can be named arbitrarily.
    • At least two subfolders need to be placed inside this 'utilities'-folder:
      • 'MDR'-folder: contains the mdr.csv-file with the metadata repository (MDR)
      • 'RMD'-folder: contains the DQA_report.Rmd Rmarkdown file
    • If you want to connect to SQL-databases, another subfolder named 'SQL' is required. It contains one JSON-file per database, and each file holds the SQL-statements for all data elements of that database to be tested by the DQA-tool.
    • Furthermore, customizations for the DQAgui go into the subfolder 'MISC':
      • email.yml: possibility to define a default email address to which some basic DQA-report results can be sent directly from the Web-GUI
      • sitenames.JSON: possibility to define 'sitenames', which is helpful when working in a federated research network.

    Examples:

    • Demo-folder of R-package DQAstats
    • Configuration of the MIRACUM-consortium's research IT-environment: miRacumDQA
  2. Populate the folder:

    • Create the file MDR/mdr.csv for your database(s) to be tested.
    • Copy/Paste the RMD folder to your 'utilities'-folder and customize the report template RMD/DQA_report.Rmd, if necessary.
    • If you want to test SQL-databases, define the SQL-statements for each data element to be tested and store them as key-value pairs in a single JSON-file per database inside the SQL-folder.
      • The JSON-file must be named according to the pattern SQL_{...}.JSON, where {...} is the respective 'source_system_name' of the database also used in the mdr.csv.
      • The 'keys' for each data element in the JSON-file must match the respective 'key' from the mdr.csv.
      • 💡 For convenience, you can write your own Python scripts to create these 'SQL' JSON-files.
  3. To launch the DQA-tool with your customized 'utilities'-folder, pass the folder's path to the respective argument (utils_path) when starting the tool:

    # Load library DQAstats:
    library(DQAstats)
    
    # Set environment vars to demo files paths:
    Sys.setenv("EXAMPLECSV_SOURCE_PATH" = system.file("demo_data",
                                                    package = "DQAstats"))
    Sys.setenv("EXAMPLECSV_TARGET_PATH" = system.file("demo_data",
                                                    package = "DQAstats"))
    # Set path to utilities folder where to find the mdr and template files:
    utils_path <- "path/to/custom/utilities"
    
    # Execute the DQA and generate a PDF report:
    results <- DQAstats::dqa(
      source_system_name = "exampleCSV_source",
      target_system_name = "exampleCSV_target",
      utils_path = utils_path,
      mdr_filename = "mdr_example_data.csv",
      output_dir = "output/"
    )
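
The hint in step 2 about generating the SQL JSON-files with Python could look like the following minimal sketch. It is not part of DQAstats: the helper names `write_sql_json`, `mdr_keys`, and `compare_keys` are hypothetical, and the sketch assumes that mdr.csv is comma-separated with column headers named `key` and `source_system_name`. The database name `mydb`, the keys, and the SQL-statements are placeholders.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_sql_json(utils_path, source_system_name, statements):
    """Write SQL/SQL_<source_system_name>.JSON with one SQL statement per key."""
    sql_dir = Path(utils_path) / "SQL"
    sql_dir.mkdir(parents=True, exist_ok=True)
    target = sql_dir / f"SQL_{source_system_name}.JSON"
    target.write_text(json.dumps(statements, indent=2), encoding="utf-8")
    return target


def mdr_keys(mdr_file, source_system_name, delimiter=","):
    """Collect the 'key' values listed in the MDR for one source system.

    Assumption: the MDR has columns named 'key' and 'source_system_name'.
    """
    with open(mdr_file, newline="", encoding="utf-8") as handle:
        rows = csv.DictReader(handle, delimiter=delimiter)
        return {
            row["key"]
            for row in rows
            if row["source_system_name"] == source_system_name and row["key"]
        }


def compare_keys(mdr_file, sql_json_file, source_system_name):
    """Return (MDR keys missing from the JSON, JSON keys unknown to the MDR)."""
    expected = mdr_keys(mdr_file, source_system_name)
    with open(sql_json_file, encoding="utf-8") as handle:
        defined = set(json.load(handle))  # top-level keys of the JSON object
    return sorted(expected - defined), sorted(defined - expected)


if __name__ == "__main__":
    # Demo in a temporary directory; replace with your own 'utilities'-folder.
    with tempfile.TemporaryDirectory() as utils_path:
        mdr_dir = Path(utils_path) / "MDR"
        mdr_dir.mkdir()
        mdr_file = mdr_dir / "mdr.csv"
        # Placeholder MDR with only the two columns this sketch relies on.
        mdr_file.write_text(
            "key,source_system_name\n"
            "patient_id,mydb\n"
            "birthdate,mydb\n",
            encoding="utf-8",
        )
        sql_file = write_sql_json(
            utils_path,
            "mydb",
            {"patient_id": "SELECT patient_id FROM patients"},
        )
        missing, unknown = compare_keys(mdr_file, sql_file, "mydb")
        print("Written:", sql_file.name)
        print("Keys in the MDR without SQL statement:", missing)
        print("Keys in the JSON not found in the MDR:", unknown)
```

The demo writes only into a temporary directory. To check an actual configuration, point `utils_path` at a real 'utilities'-folder and pass the path to its mdr.csv; if that file uses a different delimiter, adjust the `delimiter` argument of `mdr_keys`.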