- Compatibility
- Database Compatibility
- What databases does RDMP support?
- What database permissions does RDMP need?
- How does RDMP RSA encryption work?
- How do I set a custom port / SSL certificate / connection string option?
- When I connect to MySql it says 'The host localhost does not support SSL connections'
- Does RDMP Support Schemas?
- Does RDMP Support Views?
- Does RDMP Support Table Valued Functions?
- Cohort Creation
- Data Load Engine
- How does RDMP differ from classic tools e.g. SSIS?
- Can RDMP Load UnTyped Data?
- What is the purpose of the data load folders? (ForLoading, ForArchiving etc)
- What is hic_dataLoadRunID?
- How does RDMP deal with Csv/text files?
- Can RDMP read Excel files?
- How does RDMP handle / translate untyped, C# and Database Types?
- When loading data can I skip some columns?
- Can I run SQL Scripts during a load?
- RDMP is not listing my table as importable
- User Interface
- Anonymisation
- Curation
- User Interface Programming
- Other Programming
RDMP comes in three parts:
The RDMP Client is a graphical user interface that allows data analysts to create cohorts, perform linkage operations, run data extractions etc. The following system requirements exist for the RDMP Client:
Area | Requirement | Logic |
---|---|---|
Operating System | Windows 7 or later | Application uses Windows Forms with DotNet 4.6.1 (included in redistributable) |
Hardware | 4 cores, 16GB of RAM, 50MB disk space | RDMP's memory needs are driven by the size of the datasets/batches being processed at any one time. The install is per user |
Tooling | SQL Server Management Studio and any Office suite | SSMS is not required but recommended to validate generated SQL scripts. Office is useful to work with the documents generated by RDMP |
RDMP stores all configuration data in 'platform databases'. These must be created on a SQL Server instance accessible to RDMP Client users. This can be the same server that hosts your clinical data or a separate server.
Area | Requirement | Logic |
---|---|---|
Sql Server Version | 2014 or later | |
OS | Windows Server 2012R2 or higher | This requirement is dictated by the needs of the SQL Server version installed |
Hardware | >100GB disk space | RAM and CPU requirements are dictated by the SQL Server version installed. Disk space depends on the number of datasets and projects managed; in a real world deployment such as HIC's, the existing set of RDMP databases occupies about 60GB |
The RDMP command line client can be used to run unattended jobs (such as overnight extractions or automated loads).
Area | Requirement | Logic |
---|---|---|
Hardware | 2 cores, 8GB RAM, 100MB disk space plus space for logs | Logs are configurable and can be disabled or streamed to a UDP receiver. By default the CLI logs everything to disk and to console |
Operating System | Windows 7 or later or Linux on x86-64 | Application uses the cross platform .Net 6 API, self-contained |
Yes, read all about it in CommandLine
The RDMP CLI runs natively under Linux.
Download the RDMP CLI release binary for linux-x64
For example to create test databases run:
./rdmp install localhost TEST_ -u sa -p YourStrong\!Passw0rd
The rdmp `--help` command describes how to pass connection strings directly to the executable, but it is often easier to enter these in Databases.yaml e.g.
CatalogueConnectionString: Server=localhost;Database=TEST_Catalogue;User Id=sa;Password=YourStrong!Passw0rd
DataExportConnectionString: Server=localhost;Database=TEST_DataExport;User Id=sa;Password=YourStrong!Passw0rd
Each rdmp engine (e.g. dle) has its own verb e.g.
./rdmp dle --help
In addition core and plugin commands can be accessed through the cmd verb e.g.
./rdmp cmd ListSupportedCommands
Yes, RDMP can be controlled programmatically through its API, which is available as 3 NuGet packages:
- HIC.RDMP.Plugin: All core objects, engines etc
- HIC.RDMP.Plugin.UI: All definitions for writing user interface plugins that run in the main RDMP client application.
- HIC.RDMP.Plugin.Test: Classes to simplify writing Unit/Integration Tests for RDMP plugins
Yes, RDMP supports both functional plugins (e.g. new anonymisation components, new load plugins etc) as well as UI plugins (e.g. new operations when you right click a Catalogue).
See PluginWriting
RDMP builds carry a Major, Minor and Patch number (Semantic Versioning). When an update requires a change to the platform databases, accompanying SQL scripts are included to perform the update (these run during startup).
Platform databases are divided into three tiers:
Tier | Description |
---|---|
1 | The Catalogue and Data Export databases; these store all metadata (projects, datasets etc) as well as the locations of the other databases |
2 | Ancillary databases managed by RDMP e.g. Logging, DQE Results, Query Caches. You can have multiple or none of each of these configured. |
3 | This tier is reserved for Plugins which wish to persist objects/metadata in a database using the same versioning model (update scripts, data model) as the core RDMP databases. |
During application startup (following installing an update) you will be prompted to apply patches on any platform databases that are not up to date.
Download the latest release binary (e.g. rdmp-cli-win-x64.zip) and unzip into a new empty folder (e.g. c:\temp\latest). Locate your current CLI deployment folder (e.g. c:\rdmpcli).
Copy all files from the new release into your current CLI directory EXCEPT:
- Databases.yaml (contains connection strings for your RDMP database)
- NLog.config (contains logging settings)
Ensure the new version matches expectations:
./rdmp.exe --help
Ensure all your plugins are up to date; you can check compatibility with:
./rdmp.exe cmd ListSupportedCommands --logstartup
RDMP uses FAnsiSql to discover, query and connect to databases. Currently this includes support for Sql Server, MySql, PostgreSQL and Oracle.
Installing RDMP will create a number of 'platform' databases e.g.:
- RDMP_Catalogue
- RDMP_DataExport
- RDMP_Logging
- RDMP_DQE
Installation requires 'CREATE DATABASE' permissions. The user that installs RDMP will be the `db_owner` of the platform databases. This can be changed to the following roles if desired:
- db_datareader
- db_datawriter
- db_ddladmin
- db_executor
For an in-depth discussion of the implications of this change see DDL_admin vs db_owner permissions.
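For example, a minimal T-SQL sketch of granting this reduced role set (the user name [rdmp_admin] is hypothetical; note that db_executor is not a built-in SQL Server role, so it must be created once per database):

```sql
USE RDMP_Catalogue;

-- db_executor is not built in; create it once and grant it EXECUTE
CREATE ROLE db_executor;
GRANT EXECUTE TO db_executor;

-- Grant the reduced role set to the (hypothetical) installing user
ALTER ROLE db_datareader ADD MEMBER [rdmp_admin];
ALTER ROLE db_datawriter ADD MEMBER [rdmp_admin];
ALTER ROLE db_ddladmin   ADD MEMBER [rdmp_admin];
ALTER ROLE db_executor   ADD MEMBER [rdmp_admin];
```

The same statements would be repeated on each platform database (e.g. RDMP_DataExport).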
This account should be used for patching when a new version of RDMP is released that includes a database patch (any database patches are clearly marked in bold in the change log).
The same user permissions and processes should be followed when creating cohort databases/query cache etc.
After installation RDMP users require `db_datareader` and `db_datawriter` permissions on the core platform databases:
- RDMP_Catalogue
- RDMP_DataExport
- RDMP_Logging
- RDMP_DQE
Committing cohort lists requires `db_datareader` and `db_datawriter` on the target cohort database. If you are using a plugin for cohort identifier allocation then additional permissions may be required (e.g. `db_executor`).
The query cache database requires `db_ddladmin` in addition to `db_datareader` and `db_datawriter` because caching involves creating and dropping new tables.
RDMP can be configured to use separate credentials for each cohort database/query cache database. These credentials can be encrypted with an RSA private key.
The Data Load Engine includes support for pluggable modules (CSV Reader, FTP Downloader etc) and so may require any number of permissions. Each module should clearly articulate what permissions it requires (e.g. file read access for `DelimitedFlatFileAttacher`). The following are common to all RDMP data loads.
Running the DLE requires the following permissions for the database being loaded (your data repository):
- db_datareader
- db_datawriter
- db_executor (required for archiving)
Creating new data loads and changing the schema of live tables (e.g. adding a new column / changing primary key) requires `db_ddladmin`.
In addition, the DLE requires two databases: RAW and DLE_STAGING.
The RAW database is created at the beginning of a load and is the first place where unconstrained data is loaded. The DLE runner requires `CREATE DATABASE` and `DROP DATABASE` permissions on this server. For this reason it is recommended to use a separate RAW server (or separate named instance) from your live data repository.
The final stage of DLE execution requires a database called `DLE_STAGING`. This database should be created by a user with `CREATE DATABASE` permissions, but once created it remains and is not dropped (unlike RAW). When running, the DLE requires the following permissions on `DLE_STAGING`:
- db_ddladmin
- db_datareader
- db_datawriter
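A sketch of the one-off DLE_STAGING setup on SQL Server (the service account [rdmp_dle] is a hypothetical name):

```sql
-- One-off: create DLE_STAGING (requires CREATE DATABASE);
-- unlike RAW, this database persists and is never dropped
CREATE DATABASE DLE_STAGING;
GO

USE DLE_STAGING;

-- Grant the roles the DLE needs at run time
ALTER ROLE db_ddladmin   ADD MEMBER [rdmp_dle];
ALTER ROLE db_datareader ADD MEMBER [rdmp_dle];
ALTER ROLE db_datawriter ADD MEMBER [rdmp_dle];
```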
The Data Load Engine supports loading DBMS other than SQL Server (e.g. MySql, PostgreSQL, Oracle). When loading an alternate DBMS you will need to translate the above roles (e.g. `db_datawriter`) into the appropriate permissions for the DBMS you are targeting.
The recommended approach to managing permissions is to use domain authentication (integrated security) where possible. However, sometimes credentials need to be stored (e.g. for a remote web service) or SQL authentication is desired. In these cases RDMP supports storing encrypted credentials.
Encryption uses 4096 bit RSA. Encrypted strings are decrypted prior to use (e.g. before being sent to a web service) so runtime protections such as HTTPS should still be used. To generate an encryption certificate go to 'Tables (Advanced)' and open the 'Decryption Certificate' node (under 'Data Access Credentials').
This will generate a file on disk that contains the private key. Access to the key file allows decrypting passwords, so only RDMP users with file read access to the certificate can use them. The private key file must be stored securely and access granted only to users who need to use the encrypted passwords.
If you lose the private key file then encrypted passwords will be irretrievable (lost forever).
RDMP manages connection strings internally. If you want a keyword applied to your connection strings you can add it in the 'Connection String Keywords' node. Each keyword is associated with a single database provider (MySql, Oracle etc). You will need to restart RDMP for the changes to take effect.
If your MySql server does not support SSL connections then you can specify a Connection String Keyword 'SSLMode' with the value 'None' (make sure you select DatabaseType: MySQLServer).
Yes. In Microsoft Sql Server, a schema is a scope between Database and Table. By default all tables are created in the 'dbo' schema but it is possible to create tables in other schemas. For example:
--Table gets created in the default schema 'dbo'
create table test..MyTable1(MyCol int not null)
--Table gets created in the schema 'omg' within the database 'test'
create schema omg
create table test.omg.MyTable1(MyCol int not null)
When importing a table RDMP will record the schema it came from and fully qualify calls to the table. When running the data load engine RAW and STAGING tables will always be created in dbo (to avoid issuing schema creation commands).
Yes, when importing a table from a database to create a Catalogue any views in the database will also be shown. These are interacted with in exactly the same manner as regular tables.
You cannot load a view with data using the Data Load Engine.
Yes. When importing a table from a Microsoft Sql Server database to create a Catalogue, any table valued functions in the database will also be shown. When you import one you will get a TableInfo which contains default values to supply to the function when querying it. You can override these parameters e.g. for a project extraction, cohort identification configuration etc.
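For illustration, a hypothetical table valued function of the kind RDMP can import (the function, table and column names are invented):

```sql
-- RDMP records @cutoff's default as the value to supply when querying;
-- you can override it per project extraction, cohort configuration etc.
CREATE FUNCTION dbo.GetResultsSince(@cutoff datetime = '2020-01-01')
RETURNS TABLE
AS RETURN
(
    SELECT patientId, TestCode, Result
    FROM dbo.Biochemistry
    WHERE SampleDate >= @cutoff
);
-- Usage with the recorded default: SELECT * FROM dbo.GetResultsSince(DEFAULT)
```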
So you've decided to do a research study using Electronic Health Records and/or imaging and you've written a document outlining the requirements. How do we turn this list of inclusion / exclusion criteria into runnable code? With RDMP's Cohort Compiler of course!
The first task is to split up the criteria into bite sized chunks, each run on a single dataset:
- 3+ prescriptions for Drug A
- Biochemistry result for TestCode B > 500
- Alive at the time of study
- Has had a head MR in the past 5 years
How does RDMP compile this into SQL? To answer that question let's look at the end goal. Since the datasets share a common identifier we can JOIN the tables. But that can get complex fast and gives us a single gigantic query that's likely to bring the server to its knees. Instead, since we are dealing with lists of patients, we can use SET operations (UNION, INTERSECT, EXCEPT). This means we only need to pull a single column (e.g. patientId) from each dataset and we can then smash all the resulting lists together using the super fast operations that Relational Database Engines excel at. As an added bonus, if the datasets are on separate database servers or engines (MySql, Sql Server, Oracle and PostgreSQL) we can run the queries separately, store the results in a temporary common server and apply the SET operations there.
SELECT patientId FROM Prescribing WHERE Drug = 'Drug A' GROUP BY patientId HAVING COUNT(*) >= 3
UNION
SELECT patientId FROM Biochemistry WHERE TestCode = 'TestCode B' AND Result > 500
EXCEPT
SELECT patientId FROM Demography WHERE DateOfDeath IS NOT NULL
INTERSECT
SELECT patientId FROM Imaging WHERE Modality = 'MR' AND StudyDescription LIKE '%head%'
Since each section is runnable independently it is trivially easy for RDMP to produce totals for each separate set. The set results can even be cached to prevent having to re-run the entire query if you are only making a small change to one part.
For full details on technical implementation see Cohort Creation.
Yes. If a cohort is solely defined by the contents of such a file it can be committed directly to the cohort store.
If you need to combine data in the file (or multiple files) with data in your database then the first step is to create a new Catalogue by uploading the data into your database. This has a number of advantages:
- It provides a permanent record in RDMP of the data used to generate the cohort
- Querying is faster and more flexible
- The data in the file can be anonymised and released by the RDMP extraction engine if required for project extractions.
Cohorts are built by combining datasets (with filters) using SET operations UNION/EXCEPT/INTERSECT. These queries can become very big and run slowly. In some DBMS certain operations aren't even supported (e.g. INTERSECT/EXCEPT in MySql). Therefore it is recommended to create a query cache.
The query cache will be used to store the results of each subquery in indexed temporary tables (of patient identifiers). The cache also records the SQL executed in order to invalidate the cache if the configuration is changed.
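Conceptually (the actual cache schema is managed by RDMP), each subquery's result is persisted along these lines:

```sql
-- Conceptual sketch only: an indexed identifier list for one cached subquery;
-- the primary key index is what makes the later SET operations fast
CREATE TABLE CachedSubquery_PrescribingDrugA
(
    patientId varchar(10) NOT NULL PRIMARY KEY
);

INSERT INTO CachedSubquery_PrescribingDrugA
SELECT DISTINCT patientId FROM Prescribing WHERE Drug = 'Drug A';
```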
Not directly. RDMP supports multiple DBMS and does not enforce any specific table schemas on your data. There are many clinical ontologies (e.g. BNF, ICD9/ICD10, SNOMED CT) and these evolve over time. Cohort creation in RDMP is done with filters which are defined by the data analyst and are closely tied to their own data model. For this reason it would be difficult to create an ontology system which worked with all datasets without requiring table schema changes.
Ontology creation / management can still occur behind the scenes in your database e.g. with scalar functions or table valued functions. This approach allows maximum flexibility of the tool and reduces overhead when deploying RDMP against an existing mature data hosting environment (with established systems for handling ontologies).
RDMP supports lookup tables which can form part of an ontology mapping solution. The DLE can be helpful in ensuring lookup tables / ontology mappings remain current and avoid duplication.
Yes, RDMP supports both multiple identifier formats (text, numerical, guids etc) and multiple identifiers per dataset.
Yes, RDMP supports querying any data provider service through the Cohort Building API Plugin system.
RDMP is primarily a data management tool and does not seek to replace existing ETL tools (e.g. SSIS) that you might use for complex data transforms / data migrations etc. The RDMP DLE is designed to facilitate routine loading of clinical/research datasets; these differ from traditional ETL jobs in the following ways:
- Data sources are error prone:
  - Duplication
  - Badly formatted files (e.g. unescaped CSV)
  - Schema changes (e.g. columns changing periodically)
- Scale in dimensions other than row count:
  - By number of datasets
  - By number of columns (e.g. a researcher adds new columns to their dataset)
- Require a specific, narrow set of transformations (e.g. anonymisation)
- Benefit from traceability:
  - When each row appeared
  - When row values changed (when, and what the old values were)
  - Which job/batch loaded a given row (allows tracing rows back to original source files)
The Data Load Engine is designed to rapidly build simple data loads with a robust framework for error detection, traceability and duplication avoidance. Core to the data load process implemented in RDMP is the automatic migration of the data you are trying to load through increasingly structured states (RAW => STAGING => LIVE). This allows for rapid error detection and ensures that bad data remains in an isolated area for inspection by data analysts.
A full description of the mechanics and design of the DLE can be found in the UserManual
Yes, determining database types from untyped data (e.g. CSV) is a core feature.
RDMP data load jobs are configured by creating a LoadMetadata. When you create a new LoadMetadata you will be prompted to choose/create a set of load folders. The default directories created have the following layout:
someFolder
|_ Data
| |_ Cache
| |_ ForArchiving
| |_ ForLoading
|_ Executables
These folders are accessible to DLE load components. Some components put data into a folder (e.g. WebFileDownloader) while others load files from a folder (e.g. DelimitedFlatFileAttacher). Some components ignore these folders completely (e.g. RemoteTableAttacher).
When a load starts it is assumed that anything in `ForLoading` will be loaded. If `ForLoading` is empty then it is assumed that the components in the load do not require files (e.g. RemoteTableAttacher) or that a component will populate `ForLoading` at runtime (e.g. ImportFilesDataProvider).
At the end of a successful load all files in `ForLoading` (if any) are moved into `ForArchiving` and compressed into a zip archive. The name of the archive will be the hic_dataLoadRunID of the run (e.g. 132.zip). If the load fails then files are left in `ForLoading` for future runs.
The `Executables` directory is provided for storing executable files and/or SQL scripts that you might want to run during the load.
The `Cache` folder is used by the RDMP caching engine (long running fetch tasks, e.g. pulling images from a PACS server or reports from a webservice).
When a LoadMetadata is executed in RDMP it is assigned a unique number. This number provides an end to end audit of:
- Files stored on disk that were loaded
- Logging audit of messages, errors, insert/update counts etc
- Records in live table (INSERTed / UPDATEd by the run)
The number is allocated by the logging database and is unique to the RUN (i.e. it is not the ID of the LoadMetadata but reflects one execution of that configuration).
If the `hic_dataLoadRunID` of a record in the database is null then it was not loaded by the RDMP DLE. It may have been loaded by bulk importing a CSV or created by other means.
If a load issues an UPDATE then the archive trigger will move the old record state to an `_Archive` table. Therefore it is important to also look at the `_Archive` table when tracing where records from a given run (`hic_dataLoadRunID`) ended up. This trigger can be disabled for volatile loads where large numbers of records are updated frequently and the feature would impact performance.
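For example, to trace everything a given run touched (a sketch assuming a live table called MyTable with the standard audit column and its accompanying archive table):

```sql
-- Rows as they currently stand in the live table for run 132
SELECT * FROM MyTable WHERE hic_dataLoadRunID = 132;

-- Old states of rows that run 132 UPDATEd (moved here by the archive trigger)
SELECT * FROM MyTable_Archive WHERE hic_dataLoadRunID = 132;
```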
RDMP supports files delimited by any character (tab separated, pipe separated, comma separated etc). Since invalid formatting is a common problem with ETL of CSV files RDMP has several fault tolerance features.
Yes, support for Excel is described in the Excel Handling page.
TypeTranslation is handled by FAnsiSql.
The data load engine first loads all data to the temporary unconstrained RAW database then migrates it to STAGING and finally merges it with LIVE (See UserManual for more info). It is designed to make it easy to identify common issues such as data providers renaming columns, adding new columns etc.
The above message shows the case where there is a new column appearing for the first time in input files for the data load (Agenda) and an unmatched column in your RAW database (Schedule). This could be a renamed column or it could be a new column with a new meaning. Once you have identified the nature of the new column (new or renamed) then there are many ways to respond. You could handle the name change in the DLE (e.g. using ForceHeaders or a Find and Replace script). Or you could send an email to data provider rejecting the input file.
In order for this to work the DLE RAW Attachers enforce the following rules:
- Unmatched columns in RAW are ALLOWED. For example you could have a column 'IsSensitiveRecord' which is in your live table but doesn't appear in input files.
- Unmatched columns in files are NOT ALLOWED. If a flat file has a column 'Dangerous' you must have a corresponding column in your dataset.
If you have columns in your file that you simply don't want to load (see the second rule above) then you can list them explicitly using the IgnoreColumns setting of your Attacher.
If you need to process data in the columns but don't want them in your final LIVE database table (e.g. to merge two fields or populate an aggregate column) then you can create a virtual RAW only column (called a PreLoadDiscardedColumn). PreLoadDiscardedColumns are columns which are supplied by data providers but which you do not want in your LIVE database. Each PreLoadDiscardedColumn can either:
- Be created in RAW and then thrown away (`Oblivion`). This is useful if there are columns you don't care about or combo columns you want to use only to populate other columns (e.g. FullPatientName => Forename + Surname)
- Be dumped into an identifier dump (`StoreInIdentifiersDump`). This is useful if you are supplied with lots of identifiable columns that you want to keep track of but separated from the rest of the data
- Be promoted to LIVE in a diluted form (`Dilute`). For example you might want to promote PatientName as a 1 or a 0 indicating whether or not it was provided and store the full name in the identifier dump as above.
Creating a `PreLoadDiscardedColumn` can be done by right clicking the TableInfo. You will need to specify both the name of the virtual column and the datatype as it should be created in RAW (it won't appear in your LIVE table).
If you want to dump / dilute the column you must configure a dump server. In the case of dilution, you will also have to add a `Dilution` mutilator to `AdjustStaging` and a column to your live schema for storing the diluted value.
This approach gives a single workflow for acknowledging new columns and making conscious decisions about how to treat that data. And since columns that exist in the database but not in input files are ALLOWED, you can still run the new load configuration on old files without it breaking.
Yes, once you have a load set up you can add SQL scripts to adjust the data in RAW / STAGING during the load. You can also run scripts in PostLoad (which happens after data is migrated into the LIVE tables) but this is not recommended since it breaks the load isolation paradigm.
The Load Diagram shows the tables that will exist in RAW/STAGING. Notice that the names for the tables/databases change depending on the stage. This is handled by autocomplete when writing scripts (see below).
If you want to make a script agnostic of the LoadStage, or if you are writing your own `INameDatabasesAndTablesDuringLoads`, then you can reference columns/tables by ID using the following syntax:
{T:100}
Where T is for TableInfo and 100 is the `ID` of the TableInfo you want the name of. You can find the ID by viewing the ID column of the Tables collection:
If you want to share one script between lots of different loads you can drag the .sql file from Windows Explorer onto the bubble node (e.g. AdjustStaging)
In order to allow other people to run the data load it is advised to store all SQL Script files on a shared network drive.
RDMP can create references to tables and views in any database to which you have sufficient privileges. If you can see other tables in the database listed, check the following:
- The table does not have brackets in its name e.g. [MyDb]..[MyTbl (fff)]
- The table does not have dots in its name e.g. [MyDb]..[MyTbl.lol]
- The table is visible through other tools e.g. SQL Server Management Studio
If you cannot see any tables listed in the database
- Check that the servername/database name are correct
- If you need a connection string keyword (e.g. specific port) create a ConnectionStringKeyword
Some columns in RDMP are set to fill the available space (in the dialog), for example the Name column in the object selection dialog. You can still resize other columns (e.g. CatalogueName) by dragging the far edge of the column to the right.
RDMP stores user settings in IsolatedStorage. In Windows you can reach this by entering the following in Windows Explorer:
%LocalAppData%\IsolatedStorage
Yes, data can be anonymised in the following ways:
- When loading data
- Columns can be ignored by the DLE (do not load an identifiable column)
- Columns can be split off and stored in a separate table (e.g. split identifiers like name/address from data like prescribed drug)
- Column values can be allocated an anonymous substitution mapping (mapping tables are persisted in an ano database)
- Column values can be diluted (e.g. store only the left 4 digits of a postcode)
- When extracting data
- Columns can be marked as 'not extractable'
- Columns can be marked as requiring 'Special Approval' to be extracted
- Columns can be marked for hashing on extraction
- The linkage identifier (e.g. CHI) is substituted for an anonymous mapping (e.g. a guid)
- Sql transforms can be applied to the column being extracted
- Pipelines / ProcessTasks
- The plugin API makes it easy to write plugins that perform arbitrary anonymisations.
A Catalogue is RDMP's representation of one of your datasets e.g. 'Hospital Admissions'. A Catalogue consists of:
- Human readable names/descriptions of what is in the dataset
- A collection of items mapped to underlying columns in your database. Each of these:
- Can be extractable or not, or extractable only with SpecialApproval
- Can involve a transform on the underlying column (E.g. hash on extraction, UPPER etc)
- Have a human readable name/description of the column/transform
- Can have curated WHERE filters defined on them which can be reused for project extraction/cohort generation etc
- Validation rules for each of the extractable items in the dataset
- Graph definitions for viewing the contents of the dataset (and testing filters / cohorts built)
- Attachments which help understand the dataset (e.g. a pdf file)
A Catalogue can be part of project extraction configurations and used in cohort identification configurations. Catalogues can be marked as Deprecated, Internal etc.
The separation of dataset and underlying table allows you to have multiple datasets that draw data from the same table. It also makes it easier to handle moving a table/database (e.g. to a new server or database), renaming etc.
Internally Catalogues are stored in the Catalogue table of the RDMP platform database (e.g. RDMP_Catalogue). The ID field of this table is used by other objects to reference it (e.g. CatalogueItem.Catalogue_ID).
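For example, a read-only query you could run against the platform database to list each Catalogue alongside its items (these tables are normally maintained through the RDMP client, so treat this as illustrative):

```sql
USE RDMP_Catalogue;

-- List each dataset and its column descriptions via the Catalogue_ID reference
SELECT c.Name AS CatalogueName, ci.Name AS CatalogueItemName
FROM Catalogue c
JOIN CatalogueItem ci ON ci.Catalogue_ID = c.ID
ORDER BY c.Name;
```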
There are 4 ways to export/import dataset descriptions from RDMP:
This report is accessed through the 'Reports' menu. It uses DocX to write a Microsoft Word file containing the dataset description, row count, unique patient count as well as any graphs that are marked extractable.
This report can be generated for a specific dataset only or all datasets. A variant of this report is produced when performing a project extraction which shows only the subset of data extracted for the given project/cohort.
If you want to share dataset description, column descriptions etc with another RDMP user (in another organisation) you can export your Catalogue as a .sd file. This contains JSON definitions for the Catalogue and all CatalogueItem objects. This includes Validation XML and all the additional fields e.g. Granularity Coverage, Background Summary etc. The share does not include the mapping between descriptions and underlying columns or the extraction SQL since this is likely to vary by site.
You can import a Share Definition (.sd) file previously generated by yourself or another RDMP user to load the stored descriptions. This is useful if two or more organisations have semantically similar copies of the same dataset (e.g. SMR01). When importing a Share Definition file you will import all column descriptions (CatalogueItem) even if you do not have them in your database (they simply won't be mapped to an underlying database column).
Share Definition files include JSON serialization of values and are not designed to be human readable.
Catalogue descriptive metadata can be serialized into an XML file which follows the Dublin Core guidelines. RDMP can load dataset descriptions from `<metadata>` XElements which follow this xmlns.
An example can be seen below:
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">
<dc:title>biochemistry</dc:title>
<dcterms:alternative>BIO</dcterms:alternative>
<dc:subject>biochemistry, blood, urine, Tayside</dc:subject>
<dc:description>This is the biochemistry dataset it contains test results for HBA1c, CREAT etc</dc:description>
<dc:publisher />
<dcterms:isPartOf xsi:type="dcterms:URI" />
<dcterms:identifier xsi:type="dcterms:URI" />
<dc:format xsi:type="dcterms:IMT" />
</metadata>
You can export all dataset descriptions as .dita files and a single .ditamap using the Reports menu.
An example .dita file is shown below:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN"
"concept.dtd"><concept id="bio">
<title>biochemistry</title>
<conbody>
<simpletable keycol="1">
<strow>
<stentry>Acronym</stentry>
<stentry>BIO</stentry>
</strow>
<strow>
<stentry>Name</stentry>
<stentry>biochemistry</stentry>
</strow>
<strow>
<stentry>Description</stentry>
<stentry>This is the biochemistry dataset it contains test results for HBA1c, CREAT etc</stentry>
</strow>
<strow>
<stentry>Type</stentry>
<stentry>EHRExtract</stentry>
</strow>
<strow>
<stentry>Periodicity</stentry>
<stentry>Daily</stentry>
</strow>
...
RDMP supports exporting in arbitrary file formats (xml, markdown etc) through its 'custom metadata' engine. To do this you will need to use the RDMP CLI.
First create a simple template (e.g. a markdown file).
# Datasets
These are the datasets we hold:
- Biochemistry
- Haematology
- Prescribing
Then surround the actual values with a `$foreach Catalogue` and `$end`:
# Datasets
These are the datasets we hold:
$foreach Catalogue
- Biochemistry
- Haematology
- Prescribing
$end
Then replace the name with `$Name`:
# Datasets
These are the datasets we hold:
$foreach Catalogue
- $Name
$end
Save this in a file (e.g. C:\temp\myTemplate.md). Next run a metadata extraction using the CLI:
./rdmp ExtractMetadata Catalogue:* c:\temp\ C:\temp\myTemplate.md "out.md" true null null
Open out.md
and you should have something like:
# Datasets
These are the datasets we hold:
- AutomateExtraction
- AutomateExtractionSchedule
- Biochemistry
- Demography
- HospitalAdmissions
- Prescribing
- QueuedExtraction
- ReleaseIdentifiersSeen
- SuccessfullyExtractedResults
- vConditions
- vOperations
- z_someting
For a full list of substitutions available see CustomMetadataSubstitutions.md
How do I create a Catalogue from 2+ tables?
Start by importing the Tables as normal (but do not create a Catalogue). Double click the topmost table and check 'Is Primary Extraction Table'.
Next right click the topmost table and select 'Configure JoinInfo Where...'. In this window add the child table to join to and drag the column(s) that should be used to join.
You can configure a collation (if required) and join direction (LEFT / RIGHT / INNER).
Now that we have configured the technical layer we can create our Catalogue. Create a new Catalogue by right clicking in the Catalogues collection.
Then right click each Table in turn and select 'Create New Catalogue...' but, instead of closing the dialog with 'Ok', click 'Add To Existing Catalogue' (after configuring column extractability).
You can check that you have configured the join correctly by right clicking the Catalogue and selecting View [Catalogue] Extraction Sql.
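The generated SQL should have roughly the following shape (the table and column names here are hypothetical):

```sql
-- A two-table Catalogue's extraction SQL: the primary extraction table
-- joined to the child table via the configured JoinInfo column(s)
SELECT Parent.patientId, Parent.AdmissionDate, Child.WardCode
FROM Admissions Parent
LEFT JOIN Episodes Child ON Child.AdmissionId = Parent.AdmissionId
```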
Being able to visualize your data is incredibly important for data confidence, cohort building and correct filtering. RDMP has 2 graph systems:
There are 2 primary sources of 'logging' in RDMP:
- Audit
- System Logs
Audit logs are generated by running RDMP engines (extractions, cohort creation, ETL etc). Audit logs are written to the 'Logging' RDMP platform database and are designed to be consumed by data analysts. Audit logs are stored in a hierarchical format:
Table | Description |
---|---|
DataTask | Title |
Paragraph | Text |
Audit logs can be accessed by right clicking on the 'Logging' server in the RDMP GUI (under the `Tables (Advanced)` tree, or `Other` in the TUI).
Audit logs for specific objects can be accessed directly from the object itself, e.g. by right clicking on an [ExtractionConfiguration].
The audit server is automatically set up during the installation of RDMP and defaults to using 'integrated security' (windows authentication). However you can create a new logging server manually which can use (encrypted) username/password credentials (e.g. sql authentication). RDMP also supports creating the audit logging database in other DBMS types (e.g. MySql) but again this must be manually configured.
System logs can be enabled for RDMP. These are low level logs designed to be read by a software developer or IT specialist. To enable this feature, download NLog.config to your RDMP install. This template writes file logs to a ./logs/ sub-directory. You can adjust it to log anywhere, including to databases or a central server, by following the NLog targets guide. Logs may contain identifiable or sensitive information so they should be secured appropriately if enabled.
The RDMP CLI (command line) is supplied pre-configured for file and colour console logging.
RDMP uses Windows Forms with the Dock Panel Suite framework for docking. It uses Scintilla.Net for text editors (e.g. viewing SQL) and ObjectListView for the advanced navigation tree/list views.
User interfaces in RDMP follow a standard testable design which is described in the user interface overview.
Any user interface class which includes an abstract base class in its hierarchy (e.g. `CatalogueUI`) has an accompanying class (e.g. `CatalogueUI_Design`). This class exists solely to facilitate using the Visual Studio forms designer (which doesn't support abstract base classes). Even with this workaround Visual Studio will sometimes fail to show the designer for these controls; restarting Visual Studio usually solves the problem.
Sometimes you want to limit which nodes in an `RDMPCollectionUI` are reordered when the user clicks on the column header. In the below picture we want to allow the user to sort data loads by name but we don't want to reorder the ProcessTask nodes or the steps in them, since that would confuse the user as to the execution order.
You can prevent all nodes of a given Type from being reordered (relative to their branch siblings) by inheriting `IOrderable` and returning an appropriate value:
public class ExampleNode : IOrderable
{
    // Pin this node type at position 2 relative to its siblings;
    // the empty setter silently ignores attempts to reorder it
    public int Order { get { return 2; } set {} }
}
If you are unsure what Type a given node is you can right click it and select 'What Is This?'.
This requires a tutorial all of its own: CreatingANewCollectionTreeNode
The entire RDMP metadata model is stored in platform databases (Catalogue / Data Export etc). Classes e.g. Catalogue are fetched either all at once or by `ID`. The class properties can be used to fetch other related objects e.g. `Catalogue.CatalogueItems`. This usually does not result in a bottleneck but under some conditions deeply nested use of these properties can result in your platform database being hammered with requests. You can determine whether this is the case by using the PerformanceCounter. This tool will show every database request issued while it is running, including the number of distinct stack frames responsible for the query being issued. Hundreds or even thousands of requests isn't a problem but if you start getting into the tens of thousands for trivial operations you might want to refactor your code.
Typically you can solve these problems by fetching all the required objects up front e.g.
// Fetch everything in two round trips, rather than lazily walking
// object properties (which issues one query per object touched)
var catalogues = repository.GetAllObjects<Catalogue>();
var catalogueItems = repository.GetAllObjects<CatalogueItem>();
If you think the problem is more widespread then you can also use the `IInjectKnown<T>` system to perform `Lazy` loads, which prevents repeated calls to the same property going back to the database every time.
Yes, there are over 1,000 unit and integration tests; this is covered in Tests.