
Compute percentiles for daily values and their colors #45

Merged (18 commits, Nov 7, 2018)

Conversation

Contributor

@lindsayplatt lindsayplatt commented Nov 6, 2018

Computes which color each site should be for each day, based on its associated stats and the daily discharge value. The general format for the output file, 2_process/out/dv_stat_colors.rds, is:

site_no   dateTime dv_val       per p50_va   color
1 50011200 2018-09-12   13.9 0.8070975    3.5 #8CC0D9
2 50011200 2018-09-13   14.1 0.7907767    3.3 #96C6DE
3 50011200 2018-09-14   17.8 0.7991803    3.4 #92C5DE
4 50011200 2018-09-15   13.5 0.8066584    3.5 #8DC0D9
5 50011200 2018-09-16   13.3 0.7775168    2.4 #9CC9DF
6 50011200 2018-09-17   13.0 0.7668142    3.6 #A1CBE0

Sometimes when I build, even after I've sat through the full 1_fetch/out/dv_data.rds.ind step (downloading daily discharge for all 8,364 sites), the download kicks off yet again when dv_data <- readRDS(scipiper::sc_retrieve(dv_data_ind, remake_file = '1_fetch.yml')) runs during 2_process/out/dv_stats.rds.ind. I haven't figured out why.

Also, there are still 24 observations that might be duplicates: the left_join in process_dv_stats.R adds 24 extra rows.

Fixes #19

Carr added 12 commits November 5, 2018 15:49
Merge branch 'master' of github.com:USGS-VIZLAB/gage-conditions-gif into compute_colors

# Conflicts:
#	2_process.yml
Merge branch 'master' of github.com:USGS-VIZLAB/gage-conditions-gif into compute_colors

# Conflicts:
#	1_fetch.yml
#	viz_config.yml
Merge branch 'master' of github.com:USGS-VIZLAB/gage-conditions-gif into compute_colors

# Conflicts:
#	2_process.yml
@lindsayplatt lindsayplatt changed the title Compute colors Compute percentiles for daily values and their colors Nov 6, 2018
last_site <- i+request_limit-1
if(i == tail(req_bks, 1) && last_site > length(sites)) {
last_site <- length(sites)
}
Contributor

wondering if the above 4 lines could be compressed to

last_site <- min(i+request_limit-1, length(sites))

Contributor Author

oooo nice!
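For reference, the equivalence can be sanity-checked with a minimal sketch (hypothetical values, not the real site list):

```r
sites <- paste0("site", 1:10)  # hypothetical: 10 sites
request_limit <- 4
req_bks <- seq(1, length(sites), by = request_limit)  # 1, 5, 9

for (i in req_bks) {
  # min() clamps the final chunk's endpoint so it never runs past
  # the last site, replacing the 4-line if() block above
  last_site <- min(i + request_limit - 1, length(sites))
}
last_site  # 10, not 12
```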



dv_data <- rbind(dv_data, data_i)
print(paste("Completed", last_site, "of", length(sites)))
Contributor

Use message instead of print?
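A quick sketch of the difference: message() writes to stderr as a condition and can be suppressed, while print() writes to stdout.

```r
message("Completed 100 of 8364")            # condition message, goes to stderr
print(paste("Completed", 100, "of", 8364))  # goes to stdout
suppressMessages(message("hidden"))         # emits nothing
```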

parameterCd = "00060",
statCd = "00003") %>%
dplyr::distinct() %>%
dplyr::pull(site_no) %>%
Contributor

Consider swapping the above two lines. distinct on multiple columns could still leave duplicates in site_no...which I think is what you want to avoid, right?

Contributor Author

True, but distinct can't work on a vector (which is what dplyr::pull returns), so I will need to use unique()

Contributor

Oh, right. unique, then.
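A minimal sketch (made-up data) of why the order matters here: distinct() over all columns keeps rows that differ in any column, while pull() then unique() de-duplicates the site IDs themselves.

```r
library(dplyr)

# hypothetical input: the same site returned under two different HUCs
sites_df <- data.frame(site_no = c("01", "01", "02"),
                       huc_cd  = c("0101", "0102", "0201"))

sites_df %>% dplyr::distinct()  # all 3 rows survive (they differ in huc_cd)

sites_df %>%
  dplyr::pull(site_no) %>%  # vector: "01" "01" "02"
  unique()                  # "01" "02" -- duplicate site_no gone
```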

req_bks <- seq(1, length(sites), by=request_limit)
stat_data <- data.frame()
for(i in req_bks) {
last_site <- i+request_limit-1
if(i == tail(req_bks, 1) && last_site > length(sites)) {
last_site <- length(sites)
}
Contributor

See comment under fetch_dv_data: I might be missing something, but I think this could be simplified.

2_process.yml Outdated

# -- config --
proj_str:
command: viz_config[[I('projection')]]
color_palette:
command: viz_config[[I('color_palette')]]
Contributor

Specify sites_color_palette to distinguish from the overall map palette (ocean and state colors, borders, etc.)?

@@ -1,5 +1,5 @@
choose_timesteps <- function(ind_file, dates) {
timesteps <- seq(as.POSIXct(dates$start, tz = "UTC"), as.POSIXct(dates$end, tz = "UTC"), by = 'hours')
timesteps <- seq(as.POSIXct(dates$start, tz = "UTC"), as.POSIXct(dates$end, tz = "UTC"), by = 'days')
Contributor

Looks good. Note for the future: if we ever decide to try out 12-hour timesteps instead of daily ones (#23), this is where I think we'd want to adjust that.

# Write the data file and the indicator file
data_file <- scipiper::as_data_file(ind_file)
saveRDS(dv_stats_with_color, data_file)
scipiper::gd_put(ind_file, data_file)
Contributor

FYI, this pattern is fine, but newer versions of scipiper also allow you to just pass the ind_file to gd_put, so these three lines could even be two if you want:

   saveRDS(dv_stats_with_color, scipiper::as_data_file(ind_file))
   scipiper::gd_put(ind_file)

Contributor Author

@lindsayplatt lindsayplatt Nov 7, 2018

Ah, OK! I have just been copying and pasting this command and changing the object to save. Good to know :) I'll change this one and hopefully remember to use it as an example of how to do it!

print(paste("Completed", last_site, "of", length(sites)))
}

dv_data_unique <- dplyr::distinct(dv_data) # need this to avoid some duplicates
Contributor

Other duplicates may be discoverable here if you look for rows that are distinct just among the columns agency_cd, site_no, and dateTime (e.g., check the results of dup_dv <- dv_data %>% group_by(site_no, month_nu, day_nu) %>% summarize(n=n()) %>% filter(n > 1) %>% left_join(dv_data, by=c('site_no','month_nu','day_nu')))

day_nu = as.numeric(format(dateTime, "%d")))

# merge stats with the dv data
# merge still results in extra rows - 24 extra to be exact
Contributor

I think I was seeing those with the test data you shared; they're duplicates in dv_data_md, right? They might be resolved by the suggestion above to call pull before distinct in fetch_dv_sites, or you could remove the duplicates in the lines just above this one or at the end of fetch_dv_data.

Contributor Author

even after adding that step, I am still getting 24 extra observations

Contributor

Hmmm. Can you figure out which ones they are?

Contributor Author

Working on this FYI!

stat_colnames <- sprintf("p%s_va", percentiles)
stat_perc <- as.numeric(percentiles)/100

int_per <- function(df){
Contributor

What does int_per stand for? A comment explaining the goal of this function might be useful.

Contributor Author

I keep thinking "interpolated percentile", but @jread-usgs created this code that I have more or less been cutting and pasting. That is what it is doing, though: taking the current daily value and interpolating its percentile based on the percentiles for the site and day of the year.


yes, that is what it does. interpolate to the percentile.

@aappling-usgs
Contributor

Lemme know how much more time you want to spend debugging the scipiper workflow, if any (the stuff we were discussing on slack last night).

@aappling-usgs
Contributor

Rats about the conflicts in 2_process/out/site_locations_sp.rds.ind and build/status/Ml9wcm9jZXNzL291dC9zaXRlX2xvY2F0aW9uc19zcC5yZHMuaW5k.yml. As a reminder, the information to keep is the information from the more recent change to Drive, whichever file that was. The date stamp is in the build/status file.

Carr added 5 commits November 7, 2018 11:24
Merge branch 'master' of github.com:USGS-VIZLAB/gage-conditions-gif into compute_colors

# Conflicts:
#	2_process.yml
#	2_process/out/site_locations_sp.rds.ind
#	build/status/Ml9wcm9jZXNzL291dC9zaXRlX2xvY2F0aW9uc19zcC5yZHMuaW5k.yml
#	viz_config.yml
@lindsayplatt
Contributor Author

Ok, @aappling-usgs how's this??

dplyr::pull(site_no) %>%
unique() %>%
c(sites)
Contributor

Is it possible that some sites show up in multiple HUCs? I don't know how that would happen, but since you're still seeing duplicates after many other fixes...

Contributor

@aappling-usgs aappling-usgs left a comment

Looks great. @lindsaycarr , go ahead and merge this PR whenever you're ready (sounds like you're still exploring that last duplicates question). I'll be in meetings for the next couple hours and approve this PR pending whatever changes you decide to make or not make.

@lindsayplatt
Contributor Author

Some progress. It comes down to duplicate (sometimes x2, sometimes x3) stats for these 4 sites: "01574500" "03292555" "06903900" "08188590". Calling that good for this PR, and will try to add something to the dv_stats processing step in a later PR.

library(dplyr)

dv_data <- readRDS('1_fetch/out/dv_data.rds')
site_stats <- readRDS('2_process/out/site_stats_clean.rds')

dv_data_md <- dv_data %>%
  dplyr::mutate(month_nu = as.numeric(format(dateTime, "%m")),
                day_nu = as.numeric(format(dateTime, "%d")))

dup_dv <- dv_data_md %>% 
  group_by(site_no, month_nu, day_nu) %>% 
  summarize(n=n()) %>% filter(n > 1) %>% 
  left_join(dv_data_md, by=c('site_no','month_nu','day_nu'))

dup_stats <- site_stats %>% 
  group_by(site_no, month_nu, day_nu, begin_yr, end_yr) %>% 
  summarize(n=n()) %>% filter(n > 1) %>% 
  left_join(site_stats, by=c('site_no','month_nu','day_nu')) %>% 
  filter(month_nu == 9, day_nu <= 19 & day_nu >= 12)

dup_stats_info <- dup_stats %>% 
  group_by(site_no, month_nu, day_nu) %>% 
  summarize(count = n())
sum(dup_stats_info$count)

dv_with_stats <- left_join(dv_data_md, site_stats, 
                           by = c("site_no", "month_nu", "day_nu"))

dv_with_stats_dup <- dv_with_stats %>% 
  select(site_no, month_nu, day_nu) %>% 
  filter(site_no %in% dup_stats_info$site_no) 

# How many unique?
nrow(unique(dv_with_stats_dup))
# 16 - 2 sites, 8 days. Makes sense

# How many duplicates?
nrow(dv_with_stats_dup) - nrow(unique(dv_with_stats_dup))
# 24 - AH! the number of extra observations we are seeing after the join

@lindsayplatt lindsayplatt merged commit f76e89d into DOI-USGS:master Nov 7, 2018
@lindsayplatt lindsayplatt deleted the compute_colors branch November 13, 2018 19:51