README.Rmd

---
title: "Curated NEON datasets"
output: github_document
---

Contains scripts for downloading and cleaning data, and the resulting data files. 
Metadata for original and curated datasets are in this README. 

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
library(dplyr)
library(tidyr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(usmap)
library(taxize)
```

### 1. Plant cover

The final curated dataset contains plant cover values by species at all NEON sites. 

#### Original data

- **Plant presence and percent cover** dataset 
- Product ID *DP1.10058.001*
- [Data portal link](https://data.neonscience.org/data-products/DP1.10058.001)
- Summary: Plant cover for each species of plant was estimated in six 1m2 subplots within 400m2 plots, where plant cover was percent of subplot ground covered as viewed from above. Each site has around 30 plots, with sites distributed across the USA. Plant cover was taken multiple times per year over multiple years, depending on the site. 
- Additional useful information
  - Some plants have vouchers/tissues collected that may be useful for genetic analyses
  - The only data for plant height is `heightPlantOver300cm`, which indicates whether plants are taller than 9.8 feet

#### File structure

- `plant_cover` folder
  - Scripts
    - `curate_data.R` cleans up data
  - Derived data and figures
    - `plant_cover.csv` is curated data

#### Curated data details

Columns: 

- `species`: species identification
- `lat`: latitude of plot (decimal degrees)
- `lon`: longitude of plot (decimal degrees)
- `sitename`: site, plot, and subplot info combined in format `sitecode_plotID_subplotID`; e.g., `DSNY_DSNY_017_32.4.1` is site DSNY, plot 017, subplot 32.4.1
- `date`: date of end of sampling in format YYYY-MM-DD
- `canopy_cover`: amount of ground covered by that species in 1m2 area (%)
- `uid`: unique identifier for each record as assigned by NEON

Summary figures and stats: 

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
plant_cover <- read.csv("plant_cover/plant_cover.csv")
```

**Locations**

```{r, echo=FALSE, message=FALSE, warning=FALSE}
sites_plots <- plant_cover %>% 
  separate(sitename, sep = "_", into = c("site", "also_site", "plot", "subplot")) %>% 
  group_by(site) %>% 
  summarise(count = n_distinct(plot)) %>% 
  rename(Site = site, Plots = count)
```

- `r nrow(sites_plots)` sites with `r sum(sites_plots$Plots)` total plots
- Coordinates correspond to plot, not subplot
- Map of plot locations: 

```{r, echo=FALSE}
map_background <- map_data("state") 

ggplot() +
  geom_polygon(data = map_background, aes(x = long, y = lat, group = group), 
               fill = "white", color = "black") + 
  geom_point(data = plant_cover, aes(x = lon, y = lat), 
             color = "blue", shape = 4) +
  labs(x = "", y = "") +
  theme_classic()
```

- Figure of number of plots per site, ordered by number of plots:  

```{r, echo=FALSE, message=FALSE, warning=FALSE}
ggplot(sites_plots) +
  geom_col(aes(x = reorder(Site, -Plots), y = Plots)) +
  xlab("Site") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90))
```

**Taxonomy**

- `r nrow(plant_cover)` records for `r length(unique(plant_cover$species))` species
- Table of the 20 species with the most records and their number of occurrences: 

```{r, echo=FALSE, message=FALSE, warning=FALSE}
species_counts <- plant_cover %>%
  group_by(species) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:20) %>%
  rename(Species = species, Occurrences = count)
kable(species_counts, format = "markdown")
```

**Time**

- Records taken on `r length(unique(as.Date(plant_cover$date)))` days from `r min(as.Date(plant_cover$date))` to `r max(as.Date(plant_cover$date))`
- Plot of number of records per day across entire time range: 

```{r, echo=FALSE, message=FALSE, warning=FALSE}
dates <- plant_cover %>% 
  select(date) %>% 
  mutate(date = as.Date(date)) %>% 
  group_by(date) %>% 
  summarize(count = n()) %>% 
  rename(Date = date, Records = count)

ggplot() +
  geom_col(data = dates, aes(x = Date, y = Records), color = "black", fill = "black") +
  theme_classic()
```

### 2. Phenology measurements

The final curated dataset contains first date for each individual of at least half of flowers open for species from an NPN list at all NEON sites, combined with the corresponding NEON-collected meteorological data. 

#### Original data

**Plant phenology observations** dataset 

- Product ID *DP1.10055.001*
- [Data portal link](https://data.neonscience.org/data-products/DP1.10055.001)
- Summary: Phenophase status recorded for ~100 individual plants at each site across multiple years. Records are made for all plants up to multiple times a week depending on phenology activity. Each site has one transect along which all plants are included, with each individual plant tracked across each year. Tracked phenophases include initial growth, young leaves/needles, open flowers/pollen cones, colored leaves/needles, and falling leaves/needles. 

**Precipitation** dataset 

- Product ID *DP1.00006.001*
- [Data portal link](https://data.neonscience.org/data-products/DP1.00006.001)
- Summary: Three methods of measuring precipiation were used, with only one or two used at some sites. Primary measurements were with a weighing gauge, second measurements with a tipping bucket on the tower, and throughfall measurements with tipping buckets on the ground. Both primary and throughfall methods were known to have errors in the data. 

**Relative humidity** dataset 

- Product ID *DP1.00098.001*
- [Data portal link](https://data.neonscience.org/data-products/DP1.00098.001)
- Summary: At each NEON site, a Vaisala probe sensor collected relative humidity, air temperature, and dew point temperature measurements at every minute and 30 minutes at multiple locations, including one on the tower at the site. There are missing datapoints for all sites. 

#### File structure

- `phenology` folder
  - Scripts
    - `curate_data.R` cleans up data
  - Input data
    - `NPN_species_subset1_notes.csv` and `NPN_species_subset2.csv` contain lists of species from NPN with sequenced genomes
  - Derived data and figures
    - `phenology.csv` is curated data

#### Curated data details

Columns: 

- `individualID`: unique identifier assigned to each plant
- `species`: species identification, including only species from [this NPN-based list](https://docs.google.com/document/d/1RnuLpn7sKXCJsJaM1UvufTpWRxXNYkgrRHPCijCIM1E/edit)
- `lat`: latitude of plot (decimal degrees)
- `lon`: longitude of plot (decimal degrees)
- `sitename`: site and unique transect identifier, in the format site_plotID
- `first_flower_date`: earliest date per year for each individual to reach at least 50% of flowers open (i.e., `open flowers` is categorized as `50-74%`)
- `uid_pheno`: unique identifier for the phenophase record
- `uid_ind`: unique identifier for the individual record
- `mean_daily_precip`: mean precipitation (millimeters) at that individual's site in the year of `first_flower_date`, after summing precipitation for each day of year with 48 measurements and taking the mean across the year 
- `mean_humid`: mean yearly value, from daily mean humidity values calculated from days with at least ten humidity measurements on tower and summarized across years with at least 180 days of values (%)
- `min_humid`: same as `mean_humid` but minimum value
- `max_humid`: same as `mean_humid` but maximum value
- `mean_temp`: mean yearly value, from daily mean air temperature values calculated from days with at least ten temperature measurements on tower and summarized across years with at least 180 days of values (C)
- `min_temp`: same as `mean_temp` but minimum value
- `max_temp`: same as `mean_temp` but maximum value
- `gdd`: cumulative growing degree days for date of individual's `first_flower_date` starting from beginning of year, summed from growing degree day calculated for each day of the year from minimum and maximum daily temperature for days with at least 24 measurements using 10 degrees as cutoff

Summary figures and stats: 

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
phenology <- read.csv("phenology/phenology.csv")
```

**Locations**

```{r, echo=FALSE, message=FALSE, warning=FALSE}
sites_transects <- phenology %>% 
  separate(sitename, sep = "_", into = c("site", "transect")) %>% 
  group_by(site) %>% 
  summarise(count = n_distinct(transect)) %>% 
  rename(Site = site, Transects = count)
```

- `r nrow(sites_transects)` sites with `r sum(sites_transects$Transects)` total transects
- From `r min(sites_transects$Transects)` to `r max(sites_transects$Transects)` transects per site
- Map of transect locations: 

```{r, echo=FALSE}
pheno_locs <- phenology %>% 
  select(lon, lat) %>% 
  drop_na() %>% 
  usmap_transform()

plot_usmap() +
  geom_point(data = pheno_locs, aes(x = lon.1, y = lat.1), 
             color = "blue", shape = 4) +
  theme_void()
```

**Taxonomy**

- `r nrow(phenology)` records for `r length(unique(phenology$species))` species
- Table of all species ordered by number of occurrences: 

```{r, echo=FALSE, message=FALSE, warning=FALSE}
species_counts <- phenology %>%
  group_by(species) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:20) %>%
  rename(Species = species, Occurrences = count)
kable(species_counts, format = "markdown")
```

- Table of all species ordered by number of individuals: 

```{r, echo=FALSE, message=FALSE, warning=FALSE}
inds_counts <- phenology %>% 
  group_by(species) %>% 
  summarize(count = n_distinct(individualID)) %>% 
  arrange(desc(count)) %>% 
  slice(1:20) %>% 
  rename(Species = species, Individuals = count)
kable(inds_counts, format = "markdown")
```

- Taxonomic tree of species: 

```{r, echo=FALSE, message=FALSE, warning=FALSE, results='hide'}
spnames <- unique(stringr::word(phenology$species, 1, 2))
out <- classification(spnames, db='ncbi')
tr <- class2tree(out)
plot(tr)
```

**Time**

- Records taken on `r length(unique(as.Date(phenology$first_flower_date)))` days from `r min(as.Date(phenology$first_flower_date))` to `r max(as.Date(phenology$first_flower_date))`
- Plot of number of records per day across entire time range: 

```{r, echo=FALSE, message=FALSE, warning=FALSE}
dates <- phenology %>% 
  select(first_flower_date) %>% 
  mutate(first_flower_date = as.Date(first_flower_date)) %>% 
  group_by(first_flower_date) %>% 
  summarize(count = n()) %>% 
  rename(Date = first_flower_date, Records = count)

ggplot() +
  geom_col(data = dates, aes(x = Date, y = Records), color = "black", fill = "black") +
  theme_classic()
```

### 3. Phenology images

The final curated dataset contains green chromatic coordinate values, which came from images of sites, for a subset of NEON sites, combined with meteorological data from Daymet. 

#### Original data

**Phenology images** dataset 

- Product ID *DP1.00033.001*
- [Data portal link](https://data.neonscience.org/data-products/DP1.00033.001)
  - Data stored on [PhenoCam](https://phenocam.sr.unh.edu/webcam/about/) website [here](https://phenocam.sr.unh.edu/webcam/network/search/?sitename=&type=&primary_vegtype=&dominant_species=&active=unknown&fluxdata=unknown&group=neon); probably have to be downloaded individually by site? 
- Summary: Images (RGB and IR) taken from tops of towers at each site every 15 minutes, available for most sites back to early 2017. 

**PhenoCam-derived phenology** data

- [Metadata descriptions](https://phenocam.sr.unh.edu/webcam/tools/) (under "Standard Data Products" tab)
  - [ROI Image Statistics](https://phenocam.sr.unh.edu/webcam/tools/roi_statistics_format/) files have values, including `gcc`, for each camera image 
  - [PhenoCam 1-day](https://phenocam.sr.unh.edu/webcam/tools/summary_file_format/) files contain daily summaries of values from ROI Image Statistics, including `gcc_90`

**Weather** dataset

- From ORNL's [Daymet](https://daymet.ornl.gov/) 
- Data downloaded using R package [daymetr](https://github.com/bluegreen-labs/daymetr)
  - Note: package has not been updated for Daymet Version 4, so 2020 data not available
- Summary: daily interpolated weather data on 1km x 1km grid for North America

#### File structure

- `pheno_images` folder
  - Scripts
    - `curate_weather.R` downloads, cleans, and joins Daymet weather data to GCC dataset
  - Derived data
    - `targets_gcc.csv` is data curated into targets by [EFI Forecasting Challenge](https://ecoforecast.org/efi-rcn-forecast-challenges/) team
    - `gcc_weather.csv` is joined GCC and Daymet data

#### Curated data details

[The script](https://github.com/eco4cast/neon4cast-phenology/blob/master/phenology-workflow.R) for downloading and cleaning the phenology data provided by EFI Forecasting team. Data up to the current date can be downloaded into this repo by doing the following: 

```{r, eval=FALSE, results='hide', message=FALSE, warning=FALSE}
targets_gcc <- readr::read_csv("https://data.ecoforecast.org/targets/phenology/phenology-targets.csv.gz")
write.csv(targets_gcc, "pheno_images/targets_gcc.csv", row.names = FALSE)
```

Columns: 

- `time`: date
- `siteID`: name of NEON site
- `gcc_90`: 90th percentile of green chromatic coordinate (GCC) from PhenoCam 1-day DB_1000 file
- `gcc_sd`: standard deviation of recalculated 90th percentile GCC from ROI Image Statistics DB_1000 file
- `daylength`: daily day light duration (seconds/day)
- `precipitation`: sum of daily precipitation (mm/day)
- `radiation`: shortwave radiation flux density (W/m2)
- `snow_water_equiv`: amount of water in snow pack (kg/m2)
- `max_temp`: daily maximum temperature (C)
- `min_temp`: daily minimum temperature (C)
- `vapor_pressure`: water vapor pressure (Pa)

Summary figures and stats: 

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
pheno_images <- read.csv("pheno_images/gcc_weather.csv") %>% 
  mutate(time = as.Date(time))
```

- `r length(unique(pheno_images$siteID))` sites and 7 weather variables

**GCC time series**

```{r, echo=FALSE, message=FALSE, warning=FALSE}
ggplot(pheno_images, aes(x = time, y = gcc_90)) +
  geom_line() +
  geom_ribbon(aes(ymin = gcc_90 - gcc_sd, ymax = gcc_90 + gcc_sd), fill = "red") +
  facet_wrap(~siteID) +
  labs(x = "Date", y = "GCC") +
  theme_classic()
```

**Data availability across time**

```{r, echo=FALSE, message=FALSE, warning=FALSE}
gcc_weather_avail <- pheno_images %>% 
  mutate_at(vars(gcc_90:precip), function(x) ifelse(!is.na(x), 0, NA)) %>% 
  pivot_longer(cols = gcc_90:precip, names_to = "variable", values_to = "variable_presence") %>% 
  mutate(variable = factor(variable, levels = c("gcc_90", "gcc_sd", "radiation", "max_temp", "min_temp", "precip")))

ggplot(gcc_weather_avail, aes(x = time, y = variable_presence)) +
  geom_point() +
  facet_grid(rows = vars(variable), cols = vars(siteID)) +
  xlab("Year") +
  theme_classic() +
  theme(axis.title.y = element_blank(), 
        axis.text.y = element_blank(), 
        axis.ticks.y = element_blank())
```