Intro doc, readme links and spelling edits
MichaelHanksSF committed Sep 10, 2024
1 parent 51bb1f2 commit c146924
Showing 4 changed files with 46 additions and 15 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -48,3 +48,7 @@ The idea is each code server will have its own setup which will be a copy of wha

Note: Multiple libraries, pipelines, etc. can exist in a single code server. Different servers should
be used if they have conflicting requirements (e.g. different Python versions)

### Documentation
Take a look at the documentation to understand what this code is designed to do and how to replicate it for your own dataset transformations.
We recommend reading [Intro to docs](docs/Intro_to_docs.md) first, followed by [General pipeline](docs/general_pipeline.md).
27 changes: 27 additions & 0 deletions docs/Intro_to_docs.md
@@ -0,0 +1,27 @@
# Intro to docs (read this first)

## What is this documentation for?

This documentation provides guidance for developers who want to understand and replicate the code in this repo, which transforms data uploaded by multiple data controllers (here assumed to be local authorities) into a compiled set of outputs for a data processor (here assumed to be a regional body tasked with analysing the data across all controllers).

## Some key terminology

* Pipeline: A pipeline is the sequence of processes (called jobs in Dagster) that take input data and produce outputs. We associate a pipeline with a dataset, e.g. the SSDA903 dataset used by Children's Services to record episodes of care for looked after children. The SSDA903 pipeline therefore includes all the processes needed to take each individual SSDA903 file uploaded by each data controller and produce all of the outputs that are shared with the data processor. The outputs may include 'cleaned' data files whose fields may have been validated, degraded or removed; log files that describe the processing that has taken place; and data models designed to plug directly into reporting software like Power BI. It's important to recognise that a pipeline may produce outputs for more than one use case (see below). For example, the SSDA903 pipeline produces outputs for two use cases: "PAN" and "SUFFICIENCY". (A toy Dagster sketch of this idea follows these definitions.)

* Use case: A use case refers to a use of the data sharing infrastructure for a defined purpose that has associated Information Governance agreements, such as a Data Sharing Agreement (DSA) and a Data Protection Impact Assessment (DPIA). Use cases are agreed between the data controllers and the data processor. They define the nature of the outputs that should be produced by the pipelines and what the outputs can be used for. Note that a use case may involve more than one pipeline. For example, the "PAN" use case defines outputs to be produced for the SSDA903, CIN Census and Annex A pipelines.
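To make the pipeline/job relationship concrete, here is a minimal, purely illustrative Dagster sketch. The op, job and dataset step names are hypothetical and are not taken from this repo; the real pipelines are considerably more involved, and the repo's processes each map to a Dagster job rather than being collapsed into one job as done here for brevity.

```python
from dagster import job, op


@op
def clean_file():
    # Placeholder: parse and validate one uploaded SSDA903 file.
    return {"table": "episodes", "rows": []}


@op
def degrade_fields(cleaned):
    # Placeholder: apply data-minimisation rules to the cleaned table.
    return cleaned


@job
def ssda903_pipeline():
    # Chain the steps; Dagster wires the output of one op into the next.
    degrade_fields(clean_file())
```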

## Some important foundations

* Data sharing infrastructure: The pipelines have been written to be run on cloud infrastructure that can be accessed by both the data controllers, who upload input data to the infrastructure, and the data processor, who retrieves output data from the infrastructure. The pipelines assume that the infrastructure has already managed the upload of data, has separated that data into distinct areas belonging to each data controller, and has labelled those areas in such a way that the controller can be identified by the pipeline code. When the pipeline code begins, it inherits from the infrastructure's scheduler the necessary information about the location of the input file to be processed and the identity of the data controller.

* Schema-based validation and processing: A key foundation for the pipelines is the schema. This describes two fundamental aspects of the processing:
* the structure of the input datasets, including:
* the fields
* formats of data fields
* values allowed in categorical fields
* whether field values are mandatory
* the instructions on how to transform data to produce the outputs, including:
* whether files, or fields within files, should be retained for a specific output
* whether fields should be degraded e.g. reducing the granularity of geographical locators
* how long files should be retained before being deleted
The schema for a dataset should correspond exactly to the processing detailed in the Information Governance for the use cases relevant to that dataset. An audit of the schema should be able to map every data transformation and decision to an instruction detailed in the Information Governance.
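To illustrate the idea only (this is not the repo's actual schema format, which is defined in the .yml and .xsd files described elsewhere in these docs), a simplified schema fragment and a field-level validation check might look something like the sketch below; all field names and rules are hypothetical.

```python
from datetime import datetime

# Hypothetical, simplified schema fragment for one table.
SCHEMA = {
    "CHILD_ID": {"required": True, "type": "string"},
    "SEX": {"required": True, "type": "category", "allowed": ["1", "2"]},
    "DOB": {"required": True, "type": "date", "format": "%d/%m/%Y", "degrade": "first_of_month"},
}


def validate_field(name: str, value: str) -> list[str]:
    """Return error messages for a single field value, judged against the schema."""
    rule = SCHEMA[name]
    if not value:
        return [f"{name}: missing mandatory value"] if rule["required"] else []
    errors = []
    if rule["type"] == "category" and value not in rule["allowed"]:
        errors.append(f"{name}: '{value}' is not an allowed value")
    if rule["type"] == "date":
        try:
            datetime.strptime(value, rule["format"])
        except ValueError:
            errors.append(f"{name}: '{value}' does not match format {rule['format']}")
    return errors
```

The same schema entries can also carry the processing instructions (such as the hypothetical 'degrade' rule above), so that validation and minimisation are both driven from a single, auditable definition.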
26 changes: 13 additions & 13 deletions docs/general_pipeline.md
@@ -20,23 +20,23 @@ For steps 1-6, there will be:
* a 'workspace' file area, containing 'current' and 'sessions' folders.
* the 'current' folder contains a copy of the processed data appropriately cleaned and minimised.
* the 'sessions' folder contains a history of each session, including the incoming, cleaned, enriched and degraded files as well as an error report.
* these folders are only visible to the pipeline but can be accessed by technical staff in case of troubleshooting.
* a 'shared' file area, containing 'current', 'concatenated' and 'error_report' folders.
* the 'current' folder contains a copy of the data from the 'workspace/current' folder.
* the 'concatenated' folder contains the concatenated data produced in step 6.
* the 'error_report' folder contains a copy of the error report from the 'input/sessions' folder.
* this folder can be accessed by central pipelines for creating reports.

For step 7, there will be:

* an 'input' file area, which will be the previous steps' 'shared/concatenated' folder.
* a 'workspace' file area, containing 'current' and 'sessions' folders.
* the 'current' folder contains a copy of the reports created for each use case.
* the 'sessions' folder contains a history of each session, including the incoming files.
* these folders are only visible to the pipeline but can be accessed by technical staff in case of troubleshooting.
* a 'shared' file area, containing a copy of all files to be shared with the Organisation and an 'error_report' folder.
* the 'error_report' folder contains a copy of the error report from the 'input/sessions' folder from the previous steps.
* this folder can be seen by the Organisation account, so any files in here can be downloaded by the regional hub users.

## Prep data

@@ -57,20 +57,20 @@ Returns:
* Detects the year.
* Checks the year is within the retention policy.
* Reads and parses the incoming files.
* Ensures that the data is in a format consistent with the schema and that all required fields are present.
* Collects "error" information of any quality problems identified such as:

* File older than retention policy
* Unknown files, i.e. files that cannot be matched against any in the schema
* Blank files
* Missing headers
* Missing fields
* Unknown fields
* Incorrectly formatted data / categories
* Missing data

* Creates dataframes for the identified tables
* Applies retention policy to dataframes, including file names, headers and year, e.g. a file (or a column within a file) that is not used in any outputs that the regional hub is permitted to access will not be processed.
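A heavily simplified sketch of this step is shown below. The column names are hypothetical and, in the real pipelines, would come from the schema rather than being hard-coded; it only illustrates parsing a file, collecting error information and keeping schema-known columns.

```python
import pandas as pd

EXPECTED_COLUMNS = {"CHILD_ID", "SEX", "DOB"}  # hypothetical: derived from the schema


def clean_file(path: str) -> tuple[pd.DataFrame | None, list[str]]:
    """Parse one uploaded CSV, returning the cleaned dataframe and any error messages."""
    errors: list[str] = []
    try:
        df = pd.read_csv(path, dtype=str)
    except pd.errors.EmptyDataError:
        return None, [f"{path}: blank file"]
    missing = EXPECTED_COLUMNS - set(df.columns)
    unknown = set(df.columns) - EXPECTED_COLUMNS
    errors += [f"{path}: missing field '{c}'" for c in sorted(missing)]
    errors += [f"{path}: unknown field '{c}'" for c in sorted(unknown)]
    # Keep only the columns the schema knows about.
    df = df[[c for c in df.columns if c in EXPECTED_COLUMNS]]
    return df, errors
```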

Inputs:

@@ -106,9 +106,9 @@ Outputs:

Removes sensitive columns and data, or masks / blanks / degrades the data to meet data minimisation rules.

Working on each of the tables in turn, this process will degrade the data to meet data minimisation rules, which should be specified in the processing instructions received. Examples include:

* Dates of birth all set to the first of the month
* Postcodes all set to the first 4 characters (excluding spaces)
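A sketch of what those two example rules could look like in pandas is shown below. The column names are hypothetical; in the real pipelines the rules are driven by the schema and processing instructions rather than hard-coded.

```python
import pandas as pd


def degrade(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    if "date_of_birth" in out.columns:
        # Truncate dates of birth to the first of the month.
        dob = pd.to_datetime(out["date_of_birth"], errors="coerce")
        out["date_of_birth"] = dob.dt.to_period("M").dt.to_timestamp()
    if "postcode" in out.columns:
        # Keep only the first 4 characters of the postcode, ignoring spaces.
        out["postcode"] = (
            out["postcode"].astype("string").str.replace(" ", "", regex=False).str[:4]
        )
    return out
```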

Inputs:
@@ -139,4 +139,4 @@ Concatenates the data of multiple years into a single dataframe for each LA and

## Prepare reports

Use the concatenated data to create reports to be shared. These can vary from a further concatenated dataset, combining multiple LAs' data, to specific analytical outputs built around several datasets.
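As an illustrative example only (the column names and the analysis are hypothetical and do not correspond to one of the repo's actual reports), combining several LAs' concatenated data into a simple cross-LA summary might look like this:

```python
import pandas as pd


def build_summary(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Combine per-LA dataframes and count episodes per LA per year."""
    combined = pd.concat(
        [df.assign(la_code=la) for la, df in frames.items()], ignore_index=True
    )
    return combined.groupby(["la_code", "year"]).size().reset_index(name="episodes")
```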
4 changes: 2 additions & 2 deletions docs/pipeline_creation.md
@@ -23,7 +23,7 @@ liia_tools/

## 2. Create the schemas: use a .yml schema for .csv and .xlsx files, use an .xsd schema for .xml files

The first .yml schema will be a complete schema for the earliest year of data collection. Afterwards you can create .yml.diff schemas which just contain the differences in a given year and will be applied to the initial .yml schema. \
For .xml files there is no equivalent .xsd.diff so each year will need a complete schema.
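One way to picture the relationship between the base schema and a .yml.diff is a simple recursive overlay, sketched below with PyYAML. This is only an illustration under that assumption; the repo's actual diff format and merge logic may work differently (for example, by listing explicit add/remove operations), and the file names are hypothetical.

```python
import yaml


def overlay(base: dict, diff: dict) -> dict:
    """Recursively overlay diff values onto a copy of the base schema."""
    merged = dict(base)
    for key, value in diff.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = value
    return merged


with open("ssda903_schema_2018.yml") as f:        # hypothetical earliest-year schema
    base_schema = yaml.safe_load(f)
with open("ssda903_schema_2019.yml.diff") as f:   # hypothetical diff for a later year
    year_diff = yaml.safe_load(f)

schema_2019 = overlay(base_schema, year_diff)
```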

* The .yml schema should follow this pattern:
@@ -157,7 +157,7 @@ String

## 3. Create the pipeline.json file; these files follow the same pattern across all pipelines

* The .json file should contain all the information relevant to determining which files and columns should be processed and how. This includes the retention period for the outputs linked to each use case of the platform, which of the data controllers in a region have approved each use case, and the rules for processing each file and each data field within it. The .json schema should follow this pattern:

```json
{
  ...
}
```
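Once pipeline.json exists, downstream code can consult it when deciding what to process and retain. The sketch below is purely illustrative: the key names are hypothetical and do not reflect the repo's actual JSON structure, which is defined by the pattern above.

```python
import json


def column_allowed(config: dict, use_case: str, table: str, column: str) -> bool:
    """Hypothetical lookup: is this column retained for the given use case and table?"""
    rules = config["use_cases"][use_case]["tables"][table]
    return column in rules["retain_columns"]


with open("pipeline.json") as f:
    config = json.load(f)

if column_allowed(config, "PAN", "episodes", "postcode"):
    ...  # process the column; otherwise drop it before any output is produced
```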
