Commit

Internal docs review1 (R. Ennis). Additional comments: When the document was knit, you used to get a warning indicating the YAML title and the vignette title were different. I changed them to match and there is no longer a warning. It might be better to leave the title as just the package name, since you have a description of what the package does, how to install it, etc., and not just the Pensacola Bay specific example.
jbousquin committed Sep 30, 2023
1 parent 4e4f17a commit f5d941a
Showing 1 changed file with 95 additions and 75 deletions.
170 changes: 95 additions & 75 deletions demos/Harmonize_Pensacola.Rmd
@@ -1,86 +1,112 @@
---
title: "R markdown for harmonize-wq Harmonize_Pensacola"
title: "harmonize-wq in R"
author: "Justin Bousquin, Cristina Mullin, Marc Weber"
date: '2022-08-31'
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Harmonize_Pensacola R Markdown}
%\VignetteIndexEntry{harmonize-wq in R}
%\usepackage[utf8]{inputenc}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
---

## R Markdown
```{r setup, include = FALSE}
# Set chunk options
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

Standardize, clean and wrangle Water Quality Portal data in Pensacola and Perdido Bays into more analytic-ready formats using the harmonize_wq package
US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieve data using Python or R. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing that the definition of analysis-ready varies depending on the analysis, the harmonize_wq package is intended to be a flexible water quality specific framework to help:
<br>

## Overview

Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats using the harmonize_wq package. US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieve data using Python or R. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing that the definition of analysis-ready varies depending on the analysis, the harmonize_wq package is intended to be a flexible water quality specific framework to help:

* Identify differences in data units (including speciation and basis)
* Identify differences in sampling or analytic methods
* Resolve data errors using transparent assumptions
* Reduce data to the columns that are most commonly needed
* Transform data from long to wide format

Identify differences in data units (including speciation and basis)
Identify differences in sampling or analytic methods
Resolve data errors using transparent assumptions
Reduce data to the columns that are most commonly needed
Transform data from long to wide format
Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.

The first part of this notebook walks through a typical harmonization process on data retrieved from Perdido and Pensacola Bays, FL. The second part of the notebook takes a deeper dive into exactly what is done to each water quality characteristic result and some ways to leverage additional functions in the package for special use cases.
<br>

<br>

## Installation & Setup

## Set up working environment
#### Install the harmonize-wq package (Command Line)

Steps:
1) If needed, re-install [miniforge](https://github.com/conda-forge/miniforge). Once miniforge is installed, go to your Start menu and open the Miniforge Prompt.
2) At the Miniforge Prompt:
- conda create --name wq_harmonize
- activate wq_harmonize
- conda install geopandas pip dataretrieval pint
- may need to update conda
- conda update -n base -c conda-forge conda
- pip install harmonize-wq
- pip install git+https://github.com/USEPA/harmonize-wq.git (dev version)

ALTERNATIVELY, you may be able to set up your environment and import the required Python packages using the block of R code below:
To install and set up the harmonize-wq package using the command line:

```{r, results = 'hide', message = FALSE, warning = FALSE}
1. If needed, re-install [miniforge](https://github.com/conda-forge/miniforge). Once miniforge is installed, go to your Start menu and open the Miniforge Prompt.
2. At the Miniforge Prompt:
- conda create --name wq_harmonize
- activate wq_harmonize
- conda install geopandas pip dataretrieval pint
- may need to update conda
- conda update -n base -c conda-forge conda
- pip install harmonize-wq
- pip install git+https://github.com/USEPA/harmonize-wq.git (dev version)

<br>

#### Install the harmonize-wq package (R)

**Alternatively**, you may be able to set up your environment and import the required Python packages using the block of R code below:

```{r, results = 'hide', eval=FALSE}
# If needed, install the reticulate package to use Python in R
install.packages("reticulate")
library(reticulate)
#envname may need to be the full path, e.g.: "~/AppData/Local/miniforge3/envs/wq_harmonize"
# The reticulate package will automatically look for an installation of Conda
# However, you may specify the location if needed using options(reticulate.conda_binary = 'dir')
options(reticulate.conda_binary = '~/AppData/Local/miniforge3/Scripts/conda.exe')
# Create a new Python environment called "wq-reticulate"
# Note that the environment name may need to include the full path (e.g. "~/AppData/Local/miniforge3/envs/wq_harmonize")
conda_create("wq-reticulate")
# Install the following packages to the newly created environment
conda_install("wq-reticulate", "geopandas")
conda_install("wq-reticulate", "pint")
conda_install("wq-reticulate", "dataretrieval")
# Only works with py install (pip), which defaults to virtualenvs,
#Again, envname may need to be the full path, e.g.: "~/AppData/Local/miniforge3/envs/wq_harmonize"
py_install("harmonize-wq", pip = TRUE, envname = "C:/Users/cmulli01/AppData/Local/miniforge3/envs/wq_harmonize")
# Dev version
#py_install("git+https://github.com/USEPA/harmonize-wq.git", pip = TRUE, envname = "C:/Users/cmulli01/AppData/Local/miniforge3/envs/wq_harmonize")
# Install the harmonize-wq package
# This only works with py_install() (pip), which defaults to virtualenvs
# Note that the environment name may need to include the full path (e.g. "~/AppData/Local/miniforge3/envs/wq_harmonize")
py_install("harmonize-wq", pip = TRUE, envname = "wq-reticulate")
```

## Specify the environment where the dependencies in the above block were installed, and then load all the required dependencies
```{r, results = 'hide', message = FALSE, warning = FALSE}
library(reticulate)
# To install the dev version of harmonize-wq from GitHub
# Note that the environment name may need to include the full path (e.g. "~/AppData/Local/miniforge3/envs/wq_harmonize")
py_install("git+https://github.com/USEPA/harmonize-wq.git@new_release_0-3-8", pip = TRUE, envname = "wq-reticulate")
# If Conda is installed somewhere else other than where reticulate automatically looked, you can specify it
options(reticulate.conda_binary ='~/AppData/Local/miniforge3/Scripts/conda.exe')
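# Specify the Python environment to be used
# (use "wq_harmonize" if set up from the command line, or "wq-reticulate" if created with the R code above)
use_condaenv("wq_harmonize")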
# use these to test that your environment is set up correctly
# Test that your Python environment is correctly set up
# Both imports should return "Module(package_name)"
import("harmonize_wq")
import("dataretrieval")
```

## Import the required libraries. Check requirements.txt for dependencies that should be installed.
```{python}
# Note that outside of a markdown file, you can run python code w/ reticulate using:
# reticulate::repl_python()
<br>

#### Import required libraries

The full list of dependencies that should be installed to use the harmonize-wq package can be found in [`requirements.txt`](https://github.com/USEPA/harmonize-wq/blob/new_release_0-3-8/requirements.txt). **Note that `reticulate::repl_python()` must be called to execute these commands using the reticulate package in R.**

```{r}
# Use reticulate to execute python commands
reticulate::repl_python()
```

```{python}
# Use these reticulate imports to test the modules are installed
import harmonize_wq
import dataretrieval
@@ -94,55 +120,54 @@ from harmonize_wq import wrangle
from harmonize_wq import clean
from harmonize_wq import location
from harmonize_wq import visualize
```
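As a complement to the import test above, the following is a minimal sketch (not part of the vignette) that reports which modules are available without actually importing them. It relies only on the standard library's `importlib.util.find_spec`; the helper name `check_modules` is hypothetical, and the module list is taken from the install steps above.

```python
from importlib.util import find_spec


def check_modules(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    # find_spec returns None when a top-level module is not installed
    return {name: find_spec(name) is not None for name in names}


# Module names taken from the installation steps in this document
status = check_modules(["harmonize_wq", "dataretrieval", "geopandas", "pint"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Any module reported as missing would need to be installed into the active environment before running the chunks that follow.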

## Simple example workflow for temperatures
<br>

<br>

## Usage

dataretrieval Query for a geojson
The following example illustrates a typical harmonization process using the harmonize-wq package on WQP data retrieved from Perdido and Pensacola Bays, FL.

```{python include=FALSE}
First, determine an area of interest (AOI), build a query, and retrieve water temperature and Secchi disk depth data from WQP for the AOI using the dataretrieval package:

# File for area of interest
```{python, message=FALSE, warning=FALSE, error=FALSE}
# File for area of interest (Pensacola and Perdido Bays, FL)
aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
# Build query and get data with dataretrieval
# Build query and get WQP data with dataretrieval
query = {'characteristicName': ['Temperature, water',
'Depth, Secchi disk depth',
]}
#use harmonize-wq to wrangle
# Use harmonize-wq to wrangle
query['bBox'] = wrangle.get_bounding_box(aoi_url)
query['dataProfile'] = 'narrowResult'
# Run query
res_narrow, md_narrow = wqp.get_results(**query)
# dataframe of downloaded results
# DataFrame of downloaded results
res_narrow
```
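To illustrate what a bounding box for an area of interest amounts to, here is a standard-library sketch that walks a GeoJSON document and returns its west, south, east, and north extents. This is an assumption about the general idea only, not the implementation of `wrangle.get_bounding_box`; the helper `bounding_box` and the polygon coordinates are hypothetical.

```python
import json


def bounding_box(geojson_text):
    """Return (west, south, east, north) over all positions in a GeoJSON string."""
    def walk(node):
        # A position is a list whose first element is a number: [lon, lat, ...]
        if isinstance(node, (list, tuple)) and node and isinstance(node[0], (int, float)):
            yield node[0], node[1]
        elif isinstance(node, (list, tuple)):
            for child in node:
                yield from walk(child)
        elif isinstance(node, dict):
            # Only descend into keys that can hold geometry
            for key in ("features", "geometry", "geometries", "coordinates"):
                if key in node:
                    yield from walk(node[key])

    lons, lats = zip(*walk(json.loads(geojson_text)))
    return min(lons), min(lats), max(lons), max(lats)


# Hypothetical polygon, for illustration only (not the actual AOI file)
example = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[-87.5, 30.2], [-86.9, 30.2], [-86.9, 30.7],
                             [-87.5, 30.7], [-87.5, 30.2]]],
        },
    }],
})

# Comma-separated west,south,east,north string
bbox = ",".join(str(v) for v in bounding_box(example))
print(bbox)
```

The comma-separated west, south, east, north ordering is assumed here to match what the WQP `bBox` parameter expects.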

Harmonize and clean all results
Next, harmonize and clean all results:

```{python}
```{python, message=FALSE, warning=FALSE, error=FALSE}
df_harmonized = harmonize.harmonize_all(res_narrow, errors='raise')
df_harmonized
# Clean up other columns of data
df_cleaned = clean.datetime(df_harmonized) # datetime
df_cleaned = clean.harmonize_depth(df_cleaned) # Sample depth
# Clean up the datetime and sample depth columns
df_cleaned = clean.datetime(df_harmonized)
df_cleaned = clean.harmonize_depth(df_cleaned)
df_cleaned
```
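The idea behind harmonizing units can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the package's method (the install steps list pint as a dependency, which suggests a full unit registry is used instead); the function `to_celsius`, the accepted unit spellings, and the sample values are hypothetical. Raising on an unrecognized unit mirrors the spirit of `errors='raise'` in the call above.

```python
def to_celsius(value, unit):
    """Convert a temperature to degrees Celsius; raise on unrecognized units."""
    unit = unit.strip().lower()
    if unit in ("deg c", "degc", "c"):
        return value
    if unit in ("deg f", "degf", "f"):
        return (value - 32.0) * 5.0 / 9.0
    if unit in ("k", "kelvin"):
        return value - 273.15
    # Fail loudly rather than silently passing through an unknown unit
    raise ValueError(f"unrecognized temperature unit: {unit!r}")


# Hypothetical (value, unit) pairs reported by different organizations
results = [(25.0, "deg C"), (77.0, "deg F"), (298.15, "K")]
harmonized = [round(to_celsius(v, u), 2) for v, u in results]
print(harmonized)
```

Once every result is expressed in the same unit, values from different data originators can be compared directly.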

## Transform results from long to wide format
There are many columns in the data frame that are characteristic-specific; that is, they have different values for the same sample depending on the characteristic. To ensure one result for each sample after the transformation of the data, these columns must either be split, generating a new column for each characteristic with values, or moved out of the table if not being used.

There are many columns in the dataframe that are characteristic-specific; that is, they have different values for the same sample depending on the characteristic. To ensure one result for each sample after the transformation of the data, these columns must either be split, generating a new column for each characteristic with values, or moved out of the table if not being used.

```{python}
# Split QA column into multiple characteristic specific QA columns
```{python, message=FALSE, warning=FALSE, error=FALSE}
# Split the QA_flag column into multiple characteristic specific QA columns
df_full = wrangle.split_col(df_cleaned)
# Divide table into columns of interest (main_df) and characteristic specific metadata (chars_df)
@@ -153,25 +178,20 @@ df_wide = wrangle.collapse_results(main_df)
# Reduced columns
df_wide.columns
df_wide.head()
```
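The long-to-wide transformation, including splitting a characteristic-specific column, can be sketched without any dependencies. This is a conceptual illustration only, not how `wrangle.split_col` or `wrangle.collapse_results` are implemented; the row layout, the `to_wide` helper, and the sample values are hypothetical.

```python
from collections import defaultdict

# Long format: one row per (sample, characteristic) result
long_rows = [
    {"sample": "S1", "characteristic": "Temperature", "value": 25.1, "QA_flag": None},
    {"sample": "S1", "characteristic": "Secchi", "value": 1.8, "QA_flag": "unit assumed"},
    {"sample": "S2", "characteristic": "Temperature", "value": 24.7, "QA_flag": None},
]


def to_wide(rows):
    """Collapse long rows into one record per sample, one column per characteristic."""
    wide = defaultdict(dict)
    for row in rows:
        name = row["characteristic"]
        wide[row["sample"]][name] = row["value"]
        # The QA flag differs by characteristic, so it gets its own column
        wide[row["sample"]][f"QA_{name}"] = row["QA_flag"]
    return dict(wide)


wide = to_wide(long_rows)
for sample, record in wide.items():
    print(sample, record)
```

Each sample ends up as a single record, and a sample with no result for a characteristic simply lacks that column.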

## Map results
Finally, the cleaned and wrangled data may be visualized as a map:

```{python}
# Get harmonized stations clipped to the Area of Interest
```{python, message=FALSE, warning=FALSE, error=FALSE}
# Get harmonized stations clipped to the AOI
stations_gdf, stations, site_md = location.get_harmonized_stations(query, aoi=aoi_url)
# Map average temperature results at each station
gdf_temperature = visualize.map_measure(df_wide, stations_gdf, 'Temperature')
gdf_temperature.plot(column='mean', cmap='OrRd', legend=True)
```
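The map above is colored by a `mean` column, so the underlying aggregation is presumably an average of results per station. A dependency-free sketch of that aggregation is shown below; it is an assumption about the concept rather than the code inside `visualize.map_measure`, and the helper `station_means`, the station identifiers, and the values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def station_means(rows):
    """Average the non-missing values for each station."""
    grouped = defaultdict(list)
    for station, value in rows:
        if value is not None:  # skip missing results
            grouped[station].append(value)
    return {station: mean(values) for station, values in grouped.items()}


# Hypothetical (station, temperature in deg C) results
rows = [("A", 20.0), ("A", 22.0), ("B", 30.0), ("B", None)]
means = station_means(rows)
print(means)
```

Joining these per-station averages to the station geometries is what allows them to be plotted as a map.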

Download location data using dataretrieval

```{python}
```
<br>

<br>
