From 0d032b7a7cec11b50c86be3306f240d7f2ccc1f1 Mon Sep 17 00:00:00 2001
From: Chad MILLER
Date: Mon, 20 May 2019 08:55:45 -0700
Subject: [PATCH] Remove download-csv description from README

---
 README.md | 26 +++++++-------------------
 1 file changed, 7 insertions(+), 19 deletions(-)

diff --git a/README.md b/README.md
index 7f49a31..34e52cf 100644
--- a/README.md
+++ b/README.md
@@ -19,16 +19,13 @@ reporting easier.
 
 Operational overview:
 
-1. Python script `load_data.py`:
+1. Django management command `load_data`:
    * downloads a zip clinical trials registry data from ClinicalTrials.gov
-   * converts the XML to JSON
-   * uploads it to BigQuery
-   * runs SQL to transform it to tabular format including fields to
-     indentify ACTs and their lateness
-   * downloads SQL as a CSV file
+   * transforms XML into a CSV file
+   * all of #2, `process_data`
 
 2. Django management command `process_data`:
-   * imports CSV file into Django models
+   * imports existing CSV file into Django models
    * precomputes aggregate statistics and turns these into rankings
    * handles other metadata (in particular, hiding trials that are no
      longer ACTs)
@@ -42,12 +39,9 @@ loaded into a staging database / website.
 A separate command copies new data from staging to production
 (following moderation).
 
-Much complex logic has been expressed in SQL, which makes it hard to read
-and test. This is a legacy of splitting the development between
-academics with the domain expertise (and who could use SQL to
-prototype) and software engineers. Now the project has been running
-for a while and new development interations are less frequent, a useful
-project would be as much of this logic to Python.
+In the past, importing processes computed and filtered in SQL through
+Bigtable service and some JSON processing, but that is largely gone.
+You may still see scars.
 
 Similarly, the only reason step (1) exists is to create a CSV which can
 be imported to the database. That CSV is useful in its own right
@@ -56,12 +50,6 @@ intermediate formats that could legitimately be dropped in a
 refactored solution (and the CSV could be generated directly from the
 database).
 
-The historic reason for the XML -> JSON route is because BigQuery
-includes a number of useful JSON functions which can be manipulated by
-people competent in SQL. At the time of writing, there
-is [an open issue](https://github.com/ebmdatalab/clinicaltrials-act-tracker/issues/121) with
-some ideas about refactoring this process.
-
 Static Pages
 ============