---
title: "Curated NEON datasets"
output: github_document
---
This repository contains scripts for downloading and cleaning data, along with the resulting data files.
Metadata for the original and curated datasets are in this README.
```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
library(dplyr)
library(tidyr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(usmap)
library(taxize)
```
### 1. Plant cover
The final curated dataset contains plant cover values by species at all NEON sites.
#### Original data
- **Plant presence and percent cover** dataset
  - Product ID *DP1.10058.001*
  - [Data portal link](https://data.neonscience.org/data-products/DP1.10058.001)
  - Summary: Plant cover for each plant species was estimated in six 1 m² subplots within 400 m² plots, where plant cover is the percent of subplot ground covered as viewed from above. Each site has around 30 plots, with sites distributed across the USA. Plant cover was measured multiple times per year over multiple years, depending on the site.
  - Additional useful information
    - Some plants have vouchers/tissues collected that may be useful for genetic analyses
    - The only data for plant height is `heightPlantOver300cm`, which indicates whether plants are taller than 300 cm (about 9.8 feet)
#### File structure
- `plant_cover` folder
  - Scripts
    - `curate_data.R` cleans up data
  - Derived data and figures
    - `plant_cover.csv` is curated data
#### Curated data details
Columns:
- `species`: species identification
- `lat`: latitude of plot (decimal degrees)
- `lon`: longitude of plot (decimal degrees)
- `sitename`: site, plot, and subplot info combined in format `sitecode_plotID_subplotID`; e.g., `DSNY_DSNY_017_32.4.1` is site DSNY, plot 017, subplot 32.4.1
- `date`: date of end of sampling in format YYYY-MM-DD
- `canopy_cover`: percent of ground covered by that species in a 1 m² area (%)
- `uid`: unique identifier for each record as assigned by NEON
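
As an illustration of the `sitename` format, here is a minimal base R sketch for splitting a value back into its parts. The helper name `split_sitename` is hypothetical and not part of this repository. It assumes every value has exactly four underscore-separated pieces, as in the `DSNY_DSNY_017_32.4.1` example.

```r
# Split curated `sitename` values into site, plot, and subplot parts.
# Assumes the `sitecode_plotID_subplotID` layout, where the plotID itself
# repeats the site code (e.g., "DSNY_017"), giving four "_"-separated pieces.
split_sitename <- function(sitename) {
  parts <- strsplit(sitename, "_", fixed = TRUE)
  data.frame(
    site    = vapply(parts, function(p) p[1], character(1)),
    plot    = vapply(parts, function(p) p[3], character(1)),
    subplot = vapply(parts, function(p) p[4], character(1)),
    stringsAsFactors = FALSE
  )
}

split_sitename("DSNY_DSNY_017_32.4.1")
```

The summary code in this README does the same split with `tidyr::separate()`; the base R version is shown only to avoid extra dependencies.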
Summary figures and stats:
```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
plant_cover <- read.csv("plant_cover/plant_cover.csv")
```
**Locations**
```{r, echo=FALSE, message=FALSE, warning=FALSE}
sites_plots <- plant_cover %>%
  separate(sitename, sep = "_", into = c("site", "also_site", "plot", "subplot")) %>%
  group_by(site) %>%
  summarise(count = n_distinct(plot)) %>%
  rename(Site = site, Plots = count)
```
- `r nrow(sites_plots)` sites with `r sum(sites_plots$Plots)` total plots
- Coordinates correspond to plot, not subplot
- Map of plot locations:
```{r, echo=FALSE}
map_background <- map_data("state")
ggplot() +
  geom_polygon(data = map_background, aes(x = long, y = lat, group = group),
               fill = "white", color = "black") +
  geom_point(data = plant_cover, aes(x = lon, y = lat),
             color = "blue", shape = 4) +
  labs(x = "", y = "") +
  theme_classic()
```
- Figure of number of plots per site, ordered by number of plots:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
ggplot(sites_plots) +
  geom_col(aes(x = reorder(Site, -Plots), y = Plots)) +
  xlab("Site") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90))
```
**Taxonomy**
- `r nrow(plant_cover)` records for `r length(unique(plant_cover$species))` species
- Table of the 20 species with the most records and their number of occurrences:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
species_counts <- plant_cover %>%
  group_by(species) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:20) %>%
  rename(Species = species, Occurrences = count)
kable(species_counts, format = "markdown")
```
**Time**
- Records taken on `r length(unique(as.Date(plant_cover$date)))` days from `r min(as.Date(plant_cover$date))` to `r max(as.Date(plant_cover$date))`
- Plot of number of records per day across entire time range:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
dates <- plant_cover %>%
  select(date) %>%
  mutate(date = as.Date(date)) %>%
  group_by(date) %>%
  summarize(count = n()) %>%
  rename(Date = date, Records = count)
ggplot() +
  geom_col(data = dates, aes(x = Date, y = Records), color = "black", fill = "black") +
  theme_classic()
```
### 2. Phenology measurements
The final curated dataset contains, for each individual plant, the first date on which at least half of its flowers were open, for species from an NPN list at all NEON sites, combined with the corresponding NEON-collected meteorological data.
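
The "first date with at least half of flowers open" rule can be sketched in base R as follows. This is an illustration on made-up records, not the code in `curate_data.R`; the column names `date` and `open_flowers` are assumptions, while the `50-74%` category comes from the column description in this README.

```r
# Toy phenophase records; column names are illustrative, not NEON's.
obs <- data.frame(
  individualID = c("A", "A", "A", "B", "B"),
  date = as.Date(c("2018-04-01", "2018-04-10", "2019-04-05",
                   "2018-05-01", "2018-05-03")),
  open_flowers = c("25-49%", "50-74%", "50-74%", "50-74%", "50-74%"),
  stringsAsFactors = FALSE
)

# Keep only records in the 50-74% open-flower category.
half_open <- obs[obs$open_flowers == "50-74%", ]
half_open$year <- format(half_open$date, "%Y")

# Sort by date, then keep the first record per individual and year.
half_open <- half_open[order(half_open$date), ]
first_flower <- half_open[!duplicated(half_open[c("individualID", "year")]),
                          c("individualID", "year", "date")]
names(first_flower)[3] <- "first_flower_date"
first_flower
```

Sorting before `duplicated()` keeps the earliest qualifying date in each individual-year group, so an individual observed in two years contributes two rows.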
#### Original data
**Plant phenology observations** dataset
- Product ID *DP1.10055.001*
- [Data portal link](https://data.neonscience.org/data-products/DP1.10055.001)
- Summary: Phenophase status recorded for ~100 individual plants at each site across multiple years. Records are made for all plants up to multiple times a week depending on phenology activity. Each site has one transect along which all plants are included, with each individual plant tracked across each year. Tracked phenophases include initial growth, young leaves/needles, open flowers/pollen cones, colored leaves/needles, and falling leaves/needles.
**Precipitation** dataset
- Product ID *DP1.00006.001*
- [Data portal link](https://data.neonscience.org/data-products/DP1.00006.001)
- Summary: Three methods of measuring precipitation were used, with only one or two used at some sites. Primary measurements were made with a weighing gauge, secondary measurements with a tipping bucket on the tower, and throughfall measurements with tipping buckets on the ground. Both primary and throughfall methods are known to have errors in the data.
**Relative humidity** dataset
- Product ID *DP1.00098.001*
- [Data portal link](https://data.neonscience.org/data-products/DP1.00098.001)
- Summary: At each NEON site, a Vaisala probe sensor collected relative humidity, air temperature, and dew point temperature measurements at 1-minute and 30-minute intervals at multiple locations, including one on the tower at the site. There are missing data points for all sites.
#### File structure
- `phenology` folder
  - Scripts
    - `curate_data.R` cleans up data
  - Input data
    - `NPN_species_subset1_notes.csv` and `NPN_species_subset2.csv` contain lists of species from NPN with sequenced genomes
  - Derived data and figures
    - `phenology.csv` is curated data
#### Curated data details
Columns:
- `individualID`: unique identifier assigned to each plant
- `species`: species identification, including only species from [this NPN-based list](https://docs.google.com/document/d/1RnuLpn7sKXCJsJaM1UvufTpWRxXNYkgrRHPCijCIM1E/edit)
- `lat`: latitude of plot (decimal degrees)
- `lon`: longitude of plot (decimal degrees)
- `sitename`: site and unique transect identifier, in the format `site_plotID`
- `first_flower_date`: earliest date per year for each individual to reach at least 50% of flowers open (i.e., `open flowers` is categorized as `50-74%`)
- `uid_pheno`: unique identifier for the phenophase record
- `uid_ind`: unique identifier for the individual record
- `mean_daily_precip`: mean daily precipitation (millimeters) at that individual's site in the year of `first_flower_date`, calculated by summing precipitation for each day of the year that has 48 measurements and then taking the mean across the year
- `mean_humid`: yearly mean relative humidity (%), from daily mean values calculated for days with at least ten humidity measurements on the tower and summarized across years with at least 180 days of values
- `min_humid`: same as `mean_humid` but minimum value
- `max_humid`: same as `mean_humid` but maximum value
- `mean_temp`: yearly mean air temperature (°C), from daily mean values calculated for days with at least ten temperature measurements on the tower and summarized across years with at least 180 days of values
- `min_temp`: same as `mean_temp` but minimum value
- `max_temp`: same as `mean_temp` but maximum value
- `gdd`: cumulative growing degree days from the beginning of the year through the individual's `first_flower_date`; growing degree days are calculated for each day of the year from minimum and maximum daily temperature, for days with at least 24 measurements, using 10 degrees as the cutoff
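
To make the `gdd` column more concrete, here is a small sketch of a cumulative growing degree day calculation on made-up temperatures. It assumes the common averaging form, `max(0, (tmax + tmin) / 2 - base)`, with a base of 10. The exact variant used in `curate_data.R` is not stated in this README, and the helper `daily_gdd` and the toy data are hypothetical.

```r
# Daily growing degree days with the simple averaging method.
# ASSUMPTION: the curated data may use a different GDD variant.
daily_gdd <- function(tmin, tmax, base = 10) {
  pmax((tmax + tmin) / 2 - base, 0)
}

# Toy daily minimum and maximum temperatures for the first five days of a year.
temps <- data.frame(
  date = as.Date("2019-01-01") + 0:4,
  tmin = c(2, 6, 8, 10, 12),
  tmax = c(10, 16, 20, 22, 24)
)
temps$gdd_daily <- daily_gdd(temps$tmin, temps$tmax)
temps$gdd <- cumsum(temps$gdd_daily)  # running total from the start of the year

# Cumulative GDD on a hypothetical first flower date.
temps$gdd[temps$date == as.Date("2019-01-05")]
```

Days that fail the measurement-count threshold would need to be dropped or handled before the `cumsum()` step; that filtering is left out here for brevity.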
Summary figures and stats:
```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
phenology <- read.csv("phenology/phenology.csv")
```
**Locations**
```{r, echo=FALSE, message=FALSE, warning=FALSE}
sites_transects <- phenology %>%
  separate(sitename, sep = "_", into = c("site", "transect")) %>%
  group_by(site) %>%
  summarise(count = n_distinct(transect)) %>%
  rename(Site = site, Transects = count)
```
- `r nrow(sites_transects)` sites with `r sum(sites_transects$Transects)` total transects
- From `r min(sites_transects$Transects)` to `r max(sites_transects$Transects)` transects per site
- Map of transect locations:
```{r, echo=FALSE}
pheno_locs <- phenology %>%
  select(lon, lat) %>%
  drop_na() %>%
  usmap_transform()
plot_usmap() +
  geom_point(data = pheno_locs, aes(x = lon.1, y = lat.1),
             color = "blue", shape = 4) +
  theme_void()
```
**Taxonomy**
- `r nrow(phenology)` records for `r length(unique(phenology$species))` species
- Table of species ordered by number of occurrences (at most the top 20 are shown):
```{r, echo=FALSE, message=FALSE, warning=FALSE}
species_counts <- phenology %>%
  group_by(species) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:20) %>%
  rename(Species = species, Occurrences = count)
kable(species_counts, format = "markdown")
```
- Table of species ordered by number of individuals (at most the top 20 are shown):
```{r, echo=FALSE, message=FALSE, warning=FALSE}
inds_counts <- phenology %>%
  group_by(species) %>%
  summarize(count = n_distinct(individualID)) %>%
  arrange(desc(count)) %>%
  slice(1:20) %>%
  rename(Species = species, Individuals = count)
kable(inds_counts, format = "markdown")
```
- Taxonomic tree of species:
```{r, echo=FALSE, message=FALSE, warning=FALSE, results='hide'}
spnames <- unique(stringr::word(phenology$species, 1, 2))
out <- classification(spnames, db='ncbi')
tr <- class2tree(out)
plot(tr)
```
**Time**
- Records taken on `r length(unique(as.Date(phenology$first_flower_date)))` days from `r min(as.Date(phenology$first_flower_date))` to `r max(as.Date(phenology$first_flower_date))`
- Plot of number of records per day across entire time range:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
dates <- phenology %>%
  select(first_flower_date) %>%
  mutate(first_flower_date = as.Date(first_flower_date)) %>%
  group_by(first_flower_date) %>%
  summarize(count = n()) %>%
  rename(Date = first_flower_date, Records = count)
ggplot() +
  geom_col(data = dates, aes(x = Date, y = Records), color = "black", fill = "black") +
  theme_classic()
```
### 3. Phenology images
The final curated dataset contains green chromatic coordinate (GCC) values, derived from site images, for a subset of NEON sites, combined with meteorological data from Daymet.
#### Original data
**Phenology images** dataset
- Product ID *DP1.00033.001*
- [Data portal link](https://data.neonscience.org/data-products/DP1.00033.001)
- Data are stored on the [PhenoCam](https://phenocam.sr.unh.edu/webcam/about/) website [here](https://phenocam.sr.unh.edu/webcam/network/search/?sitename=&type=&primary_vegtype=&dominant_species=&active=unknown&fluxdata=unknown&group=neon); they may need to be downloaded individually by site
- Summary: Images (RGB and IR) taken from tops of towers at each site every 15 minutes, available for most sites back to early 2017.
**PhenoCam-derived phenology** data
- [Metadata descriptions](https://phenocam.sr.unh.edu/webcam/tools/) (under "Standard Data Products" tab)
- [ROI Image Statistics](https://phenocam.sr.unh.edu/webcam/tools/roi_statistics_format/) files have values, including `gcc`, for each camera image
- [PhenoCam 1-day](https://phenocam.sr.unh.edu/webcam/tools/summary_file_format/) files contain daily summaries of values from ROI Image Statistics, including `gcc_90`
**Weather** dataset
- From ORNL's [Daymet](https://daymet.ornl.gov/)
- Data downloaded using R package [daymetr](https://github.com/bluegreen-labs/daymetr)
  - Note: package has not been updated for Daymet Version 4, so 2020 data not available
- Summary: daily interpolated weather data on 1km x 1km grid for North America
#### File structure
- `pheno_images` folder
  - Scripts
    - `curate_weather.R` downloads, cleans, and joins Daymet weather data to GCC dataset
  - Derived data
    - `targets_gcc.csv` is data curated into targets by [EFI Forecasting Challenge](https://ecoforecast.org/efi-rcn-forecast-challenges/) team
    - `gcc_weather.csv` is joined GCC and Daymet data
#### Curated data details
[The script](https://github.com/eco4cast/neon4cast-phenology/blob/master/phenology-workflow.R) for downloading and cleaning the phenology data is provided by the EFI Forecasting team. Data up to the current date can be downloaded into this repo by doing the following:
```{r, eval=FALSE, results='hide', message=FALSE, warning=FALSE}
targets_gcc <- readr::read_csv("https://data.ecoforecast.org/targets/phenology/phenology-targets.csv.gz")
write.csv(targets_gcc, "pheno_images/targets_gcc.csv", row.names = FALSE)
```
Columns:
- `time`: date
- `siteID`: name of NEON site
- `gcc_90`: 90th percentile of green chromatic coordinate (GCC) from PhenoCam 1-day DB_1000 file
- `gcc_sd`: standard deviation of recalculated 90th percentile GCC from ROI Image Statistics DB_1000 file
- `daylength`: daily daylight duration (seconds/day)
- `precipitation`: sum of daily precipitation (mm/day)
- `radiation`: shortwave radiation flux density (W/m²)
- `snow_water_equiv`: amount of water in snow pack (kg/m²)
- `max_temp`: daily maximum temperature (°C)
- `min_temp`: daily minimum temperature (°C)
- `vapor_pressure`: water vapor pressure (Pa)
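
For orientation, GCC is commonly defined as the green digital number divided by the sum of the red, green, and blue digital numbers. The sketch below computes it for made-up values and takes a 90th percentile across images, loosely mirroring `gcc_90`. The values, the `gcc` helper, and the default `quantile()` method are assumptions; PhenoCam's own processing (ROI masks, image filtering, percentile method) is more involved.

```r
# Green chromatic coordinate for one image region: G / (R + G + B).
gcc <- function(r, g, b) g / (r + g + b)

# Toy mean digital numbers for five images from one day (made-up values).
images <- data.frame(
  r = c(100, 110, 90, 105, 95),
  g = c(120, 130, 100, 125, 115),
  b = c(80, 85, 75, 82, 78)
)
images$gcc <- gcc(images$r, images$g, images$b)

# Daily summary analogous to `gcc_90`: 90th percentile across the day's images.
gcc_90 <- unname(quantile(images$gcc, probs = 0.9))
gcc_90
```

A high percentile is used for the daily value because it is less sensitive to dark or washed-out images than the mean.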
Summary figures and stats:
```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
pheno_images <- read.csv("pheno_images/gcc_weather.csv") %>%
  mutate(time = as.Date(time))
```
- `r length(unique(pheno_images$siteID))` sites and 7 weather variables
**GCC time series**
```{r, echo=FALSE, message=FALSE, warning=FALSE}
ggplot(pheno_images, aes(x = time, y = gcc_90)) +
  geom_line() +
  geom_ribbon(aes(ymin = gcc_90 - gcc_sd, ymax = gcc_90 + gcc_sd), fill = "red") +
  facet_wrap(~siteID) +
  labs(x = "Date", y = "GCC") +
  theme_classic()
```
**Data availability across time**
```{r, echo=FALSE, message=FALSE, warning=FALSE}
gcc_weather_avail <- pheno_images %>%
  mutate_at(vars(gcc_90:precip), function(x) ifelse(!is.na(x), 0, NA)) %>%
  pivot_longer(cols = gcc_90:precip, names_to = "variable", values_to = "variable_presence") %>%
  mutate(variable = factor(variable, levels = c("gcc_90", "gcc_sd", "radiation", "max_temp", "min_temp", "precip")))
ggplot(gcc_weather_avail, aes(x = time, y = variable_presence)) +
  geom_point() +
  facet_grid(rows = vars(variable), cols = vars(siteID)) +
  xlab("Year") +
  theme_classic() +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
```