otavioon committed Jun 27, 2024
1 parent b461311 commit d5cece3
Showing 1 changed file with 4 additions and 5 deletions.
9 changes: 4 additions & 5 deletions notebooks/05_covid_anomaly_detection.ipynb
@@ -14,11 +14,6 @@
"![preprocessing](stanford_data_processing.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
@@ -27,14 +22,18 @@
"\n",
"The processing pipeline is then applied to the data. The pipeline is composed of the following steps:\n",
"1. Once the data has been standardized, the resting heart rate is extracted (``Resting Heart Rate Extractor``, in Figure). This step takes as input `min_minutes_rest`, the number of minutes a user must be at rest for the heart rate to count as resting: it looks at the user's step count and, when it is 0 for `min_minutes_rest` consecutive minutes, the heart rate is considered resting. At the end of this process, we will have a new dataframe with the date (`datetime` column) and the resting heart rate of the last minute (`RHR` column).\n",
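The resting-heart-rate extraction described above can be sketched in pandas roughly as follows. This is a minimal sketch, not the notebook's actual implementation: the `heart_rate` and `steps` column names and the `min_minutes_rest` default are illustrative assumptions.

```python
import pandas as pd

def extract_resting_hr(df: pd.DataFrame, min_minutes_rest: int = 12) -> pd.DataFrame:
    """Keep heart-rate samples taken after `min_minutes_rest` consecutive minutes with 0 steps.

    Assumes a per-minute dataframe with 'datetime', 'heart_rate', and 'steps' columns.
    """
    df = df.sort_values("datetime").reset_index(drop=True)
    at_rest = df["steps"].eq(0)
    # Length of the current run of consecutive zero-step minutes at each row.
    rest_run = at_rest.groupby((~at_rest).cumsum()).cumsum()
    resting = df[rest_run >= min_minutes_rest]
    return resting[["datetime", "heart_rate"]].rename(columns={"heart_rate": "RHR"})
```

The run-length trick (`groupby` on the cumulative count of non-rest minutes) resets the counter every time a step is taken, so only minutes at the end of a sufficiently long rest period survive.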
"\n",
"2. A smoothing step is applied to the data (`Smoother`, in Figure). It takes as input `smooth_window_sample`, the number of samples used to smooth the data, and `sample_rate`, the target sample rate. It applies a moving-average filter with a window of `smooth_window_sample` samples, and the data is then downsampled to `sample_rate` samples per minute. This produces a new dataframe with the date (`datetime` column) and the resting heart rate at the desired sampling rate (`RHR` column).\n",
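The smoothing-plus-downsampling step might look roughly like this in pandas. This is a hedged sketch under assumptions: the parameter defaults are guesses, and `sample_rate` is expressed here as a pandas resampling frequency string rather than samples per minute.

```python
import pandas as pd

def smooth_rhr(df: pd.DataFrame, smooth_window_sample: int = 400,
               sample_rate: str = "60min") -> pd.DataFrame:
    """Moving-average smoothing followed by downsampling to `sample_rate` intervals."""
    series = df.set_index("datetime").sort_index()["RHR"]
    # Moving-average filter over `smooth_window_sample` samples.
    smoothed = series.rolling(window=smooth_window_sample, min_periods=1).mean()
    # Downsample: one averaged value per `sample_rate` interval.
    return smoothed.resample(sample_rate).mean().dropna().reset_index()
```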
"\n",
"3. The third step adds labels (`Label Adder`, in Figure); it is also illustrated in the Figure below. It takes 3 inputs: `baseline_days`, `before_onset`, and `after_onset`. `baseline_days` is the number of days before the onset of the symptoms that we consider as baseline (in the Figure below, this is 21 days). Thus, using the dataframe from the previous step, a new boolean column named `baseline` is added, which is True if the date is in the baseline period (21 days before onset). `before_onset` and `after_onset` are the number of days before and after the onset of the symptoms that we consider as the anomaly period (7 days before and 21 days after, in the Figure below); a new boolean column named `anomaly` is added, which is True if the date is in the anomaly period. Finally, we also add a `status` column, a metadata column with a descriptive status of the date. It can be: \n",
" - `normal`: if the date is in the baseline period; \n",
" - `before onset`: if the date is in the period before the onset of the symptoms; \n",
" - `onset`: if the date is the onset of the symptoms (day); \n",
" - `after onset`: if the date is in the period after the onset of the symptoms, but before the recovery; \n",
" - `recovered`: if the date is in the recovery period.\n",
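The labeling step above can be sketched as follows. This is an illustrative version, not the notebook's code: it assumes a `datetime` column, takes the onset date explicitly, treats the baseline as the stretch from `baseline_days` before onset up to the start of the anomaly window, and uses the figure's example window sizes as defaults.

```python
import pandas as pd

def add_labels(df: pd.DataFrame, onset, baseline_days: int = 21,
               before_onset: int = 7, after_onset: int = 21) -> pd.DataFrame:
    """Add boolean `baseline`/`anomaly` columns and a descriptive `status` column."""
    onset = pd.Timestamp(onset).normalize()
    day = df["datetime"].dt.normalize()
    anomaly_start = onset - pd.Timedelta(days=before_onset)
    anomaly_end = onset + pd.Timedelta(days=after_onset)
    df = df.copy()
    df["baseline"] = (day >= onset - pd.Timedelta(days=baseline_days)) & (day < anomaly_start)
    df["anomaly"] = (day >= anomaly_start) & (day <= anomaly_end)

    def status(d: pd.Timestamp) -> str:
        if d < anomaly_start:
            return "normal"
        if d < onset:
            return "before onset"
        if d == onset:
            return "onset"
        if d <= anomaly_end:
            return "after onset"
        return "recovered"

    df["status"] = day.map(status)
    return df
```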
"\n",
"4. Once the labels have been added, we normalize the data (`Standardizer`, in Figure above). This step performs a Z-norm scaling of the data, calculated as: $z = \\frac{x - \\mu}{\\sigma}$, where $x$ is the value, $\\mu$ is the mean of the column, and $\\sigma$ is the standard deviation of the column. An important note: the mean and standard deviation are computed only on the baseline period and then applied to the entire dataset.\n",
"\n",
"5. The last step creates the sequences (`Transposer`, in Figure), which groups $n$ rows and transforms them into columns (features). It takes as input the `window_size` and `overlap` parameters and creates sequences of `window_size` samples with an overlap of `overlap` samples. Thus, if we have a dataset with 100 samples, a `window_size` of 20, and an `overlap` of 0, we will have 5 sequences of 20 samples each (*i.e.*, 5 rows with 20 columns). Each element of a sequence becomes a column in the dataframe, numbered from 0 to 19. For example, the sequences will have columns `RHR-0`, `RHR-1`, ..., `RHR-19`, where the first row holds the first 20 samples, the second row the next 20 samples, and so on. This is the input format the LSTM-AutoEncoder model expects. An important note: we do not mix sequences from anomaly and non-anomaly periods, so no labels are mixed; an anomaly sample contains only anomaly time steps.\n",
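The windowing step above can be sketched as follows. This is an assumption-laden sketch, not the notebook's `Transposer`: it keys the label split on the `anomaly` column alone and drops any trailing partial window.

```python
import pandas as pd

def to_sequences(df: pd.DataFrame, window_size: int = 20, overlap: int = 0) -> pd.DataFrame:
    """Turn consecutive RHR rows into fixed-length sequences, one sequence per row.

    Windows are built separately per label so anomaly and non-anomaly
    time steps are never mixed inside one sequence.
    """
    step = window_size - overlap
    rows = []
    for label, group in df.groupby("anomaly", sort=False):
        values = group["RHR"].to_numpy()
        for start in range(0, len(values) - window_size + 1, step):
            rows.append(list(values[start:start + window_size]) + [label])
    columns = [f"RHR-{i}" for i in range(window_size)] + ["anomaly"]
    return pd.DataFrame(rows, columns=columns)
```

With 100 samples, `window_size=20`, and `overlap=0` this yields the 5-row, 20-feature layout described in the text.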
"\n",
"This produces a dataframe (CSV file) for each user. In the processed dataset, we joined all users into a single file and added a `participant_id` column to identify the user, which makes the data easier to work with in the next steps."
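The final per-user join might look roughly like this (the dict-of-dataframes input is an assumption for illustration; the processed dataset is built from per-user CSV files):

```python
import pandas as pd

def join_users(user_frames: dict) -> pd.DataFrame:
    """Concatenate per-user dataframes, tagging each row with a `participant_id`."""
    parts = []
    for participant_id, frame in user_frames.items():
        frame = frame.copy()
        frame["participant_id"] = participant_id
        parts.append(frame)
    return pd.concat(parts, ignore_index=True)
```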