
Compute percentiles for daily values and their colors #45

Merged (18 commits, Nov 7, 2018)

Conversation

Contributor

@lindsayplatt lindsayplatt commented Nov 6, 2018

Computes which color each site should be for each day, based on its associated stats and the daily discharge value. The general format for the output file, 2_process/out/dv_stat_colors.rds, is:

site_no   dateTime dv_val       per p50_va   color
1 50011200 2018-09-12   13.9 0.8070975    3.5 #8CC0D9
2 50011200 2018-09-13   14.1 0.7907767    3.3 #96C6DE
3 50011200 2018-09-14   17.8 0.7991803    3.4 #92C5DE
4 50011200 2018-09-15   13.5 0.8066584    3.5 #8DC0D9
5 50011200 2018-09-16   13.3 0.7775168    2.4 #9CC9DF
6 50011200 2018-09-17   13.0 0.7668142    3.6 #A1CBE0

Sometimes when I build, even after I've sat through the full 1_fetch/out/dv_data.rds.ind step (downloading daily discharge for all 8,364 sites), the download kicks off yet again when dv_data <- readRDS(scipiper::sc_retrieve(dv_data_ind, remake_file = '1_fetch.yml')) runs during 2_process/out/dv_stats.rds.ind. I haven't figured out why.

Also, there are still 24 observations that might be duplicates: the left_join in process_dv_stats.R adds 24 extra rows.

Fixes #19

Carr added 12 commits November 5, 2018 15:49
Merge branch 'master' of github.com:USGS-VIZLAB/gage-conditions-gif into compute_colors

# Conflicts:
#	2_process.yml
Merge branch 'master' of github.com:USGS-VIZLAB/gage-conditions-gif into compute_colors

# Conflicts:
#	1_fetch.yml
#	viz_config.yml
Merge branch 'master' of github.com:USGS-VIZLAB/gage-conditions-gif into compute_colors

# Conflicts:
#	2_process.yml
@lindsayplatt lindsayplatt changed the title Compute colors Compute percentiles for daily values and their colors Nov 6, 2018
last_site <- i+request_limit-1
if(i == tail(req_bks, 1) && last_site > length(sites)) {
last_site <- length(sites)
}
Contributor

wondering if the above 4 lines could be compressed to

last_site <- min(i+request_limit-1, length(sites))

Contributor Author

oooo nice!
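For reference, the equivalence can be sanity-checked with a minimal sketch (hypothetical values, not the real site list):

```r
sites <- paste0("site", 1:10)  # hypothetical: 10 sites
request_limit <- 4
req_bks <- seq(1, length(sites), by = request_limit)  # 1, 5, 9

for (i in req_bks) {
  # min() clamps the final chunk's endpoint so it never runs past
  # the last site, replacing the 4-line if() block above
  last_site <- min(i + request_limit - 1, length(sites))
}
last_site  # 10, not 12
```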



dv_data <- rbind(dv_data, data_i)
print(paste("Completed", last_site, "of", length(sites)))
Contributor

Use message instead of print?
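A quick sketch of the difference: message() writes to stderr as a condition and can be suppressed, while print() writes to stdout.

```r
message("Completed 100 of 8364")            # condition message, goes to stderr
print(paste("Completed", 100, "of", 8364))  # goes to stdout
suppressMessages(message("hidden"))         # emits nothing
```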

parameterCd = "00060",
statCd = "00003") %>%
dplyr::distinct() %>%
dplyr::pull(site_no) %>%
Contributor

Consider swapping the above two lines. distinct on multiple columns could still leave duplicates in site_no...which I think is what you want to avoid, right?

Contributor Author

True, but distinct can't work on a vector (which is what dplyr::pull returns), so I will need to use unique()

Contributor

Oh, right. unique, then.
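A minimal sketch (made-up data) of why the order matters here: distinct() over all columns keeps rows that differ in any column, while pull() then unique() de-duplicates the site IDs themselves.

```r
library(dplyr)

# hypothetical input: the same site returned under two different HUCs
sites_df <- data.frame(site_no = c("01", "01", "02"),
                       huc_cd  = c("0101", "0102", "0201"))

sites_df %>% dplyr::distinct()  # all 3 rows survive (they differ in huc_cd)

sites_df %>%
  dplyr::pull(site_no) %>%  # vector: "01" "01" "02"
  unique()                  # "01" "02" -- duplicate site_no gone
```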

req_bks <- seq(1, length(sites), by=request_limit)
stat_data <- data.frame()
for(i in req_bks) {
last_site <- i+request_limit-1
if(i == tail(req_bks, 1) && last_site > length(sites)) {
last_site <- length(sites)
}
Contributor

See comment under fetch_dv_data: I might be missing something, but I think this could be simplified.

2_process.yml Outdated

# -- config --
proj_str:
command: viz_config[[I('projection')]]
color_palette:
command: viz_config[[I('color_palette')]]
Contributor

Specify sites_color_palette to distinguish from the overall map palette (ocean and state colors, borders, etc.)?

@@ -1,5 +1,5 @@
choose_timesteps <- function(ind_file, dates) {
timesteps <- seq(as.POSIXct(dates$start, tz = "UTC"), as.POSIXct(dates$end, tz = "UTC"), by = 'hours')
timesteps <- seq(as.POSIXct(dates$start, tz = "UTC"), as.POSIXct(dates$end, tz = "UTC"), by = 'days')
Contributor

Looks good. Note for the future: if we ever decide to try out 12-hour timesteps instead of daily ones (#23), this is where I think we'd want to adjust that.

# Write the data file and the indicator file
data_file <- scipiper::as_data_file(ind_file)
saveRDS(dv_stats_with_color, data_file)
scipiper::gd_put(ind_file, data_file)
Contributor

FYI, this pattern is fine, but newer versions of scipiper also allow you to just pass the ind_file to gd_put, so these three lines could even be two if you want:

   saveRDS(dv_stats_with_color, scipiper::as_data_file(ind_file))
   scipiper::gd_put(ind_file)

Contributor Author

@lindsayplatt lindsayplatt Nov 7, 2018

Ah, OK! I have just been copying and pasting this command and changing the object to save. Good to know :) I'll change this one and hopefully remember to use it as an example of how to do it!

print(paste("Completed", last_site, "of", length(sites)))
}

dv_data_unique <- dplyr::distinct(dv_data) # need this to avoid some duplicates
Contributor

Other duplicates may be discoverable here if you look for rows that are distinct just among the columns agency_cd, site_no, and dateTime (e.g., check the results of dup_dv <- dv_data %>% group_by(site_no, month_nu, day_nu) %>% summarize(n=n()) %>% filter(n > 1) %>% left_join(dv_data, by=c('site_no','month_nu','day_nu')))

day_nu = as.numeric(format(dateTime, "%d")))

# merge stats with the dv data
# merge still results in extra rows - 24 extra to be exact
Contributor

I think I was seeing those with the test data you shared; they're duplicates in dv_data_md, right? They might be resolved by the suggestion above to call pull before distinct in fetch_dv_sites, or you could remove the duplicates in the lines just above this one or at the end of fetch_dv_data.

Contributor Author

even after adding that step, I am still getting 24 extra observations

Contributor

Hmmm. Can you figure out which ones they are?

Contributor Author

Working on this FYI!

stat_colnames <- sprintf("p%s_va", percentiles)
stat_perc <- as.numeric(percentiles)/100

int_per <- function(df){
Contributor

What does int_per stand for? A comment explaining the goal of this function might be useful.

Contributor Author

I keep thinking "interpolated percentile", but @jread-usgs created this code that I have more or less been cutting and pasting. That is what it is doing, though: taking the current daily value and interpolating its percentile based on the percentiles for the site and day of the year.


yes, that is what it does. interpolate to the percentile.

@aappling-usgs
Contributor

Lemme know how much more time you want to spend debugging the scipiper workflow, if any (the stuff we were discussing on slack last night).

@aappling-usgs
Contributor

Rats about the conflicts in 2_process/out/site_locations_sp.rds.ind and build/status/Ml9wcm9jZXNzL291dC9zaXRlX2xvY2F0aW9uc19zcC5yZHMuaW5k.yml. As a reminder, the information to keep is the information from the more recent change to Drive, whichever file that was. The date stamp is in the build/status file.

Carr added 5 commits November 7, 2018 11:24
Merge branch 'master' of github.com:USGS-VIZLAB/gage-conditions-gif into compute_colors

# Conflicts:
#	2_process.yml
#	2_process/out/site_locations_sp.rds.ind
#	build/status/Ml9wcm9jZXNzL291dC9zaXRlX2xvY2F0aW9uc19zcC5yZHMuaW5k.yml
#	viz_config.yml
@lindsayplatt
Contributor Author

Ok, @aappling-usgs how's this??

dplyr::pull(site_no) %>%
unique() %>%
c(sites)
Contributor

Is it possible that some sites show up in multiple HUCs? I don't know how that would happen, but since you're still seeing duplicates after many other fixes...

Contributor

@aappling-usgs aappling-usgs left a comment

Looks great. @lindsaycarr , go ahead and merge this PR whenever you're ready (sounds like you're still exploring that last duplicates question). I'll be in meetings for the next couple hours and approve this PR pending whatever changes you decide to make or not make.

@lindsayplatt
Contributor Author

Some progress. It comes down to duplicate (sometimes x2, sometimes x3) stats for these 4 sites: "01574500" "03292555" "06903900" "08188590". Calling that good for this PR, and will try to add something to the dv_stats processing step in a later PR.

library(dplyr)

dv_data <- readRDS('1_fetch/out/dv_data.rds')
site_stats <- readRDS('2_process/out/site_stats_clean.rds')

dv_data_md <- dv_data %>%
  dplyr::mutate(month_nu = as.numeric(format(dateTime, "%m")),
                day_nu = as.numeric(format(dateTime, "%d")))

dup_dv <- dv_data_md %>% 
  group_by(site_no, month_nu, day_nu) %>% 
  summarize(n=n()) %>% filter(n > 1) %>% 
  left_join(dv_data_md, by=c('site_no','month_nu','day_nu'))

dup_stats <- site_stats %>% 
  group_by(site_no, month_nu, day_nu, begin_yr, end_yr) %>% 
  summarize(n=n()) %>% filter(n > 1) %>% 
  left_join(site_stats, by=c('site_no','month_nu','day_nu')) %>% 
  filter(month_nu == 9, day_nu <= 19 & day_nu >= 12)

dup_stats_info <- dup_stats %>% 
  group_by(site_no, month_nu, day_nu) %>% 
  summarize(count = n())
sum(dup_stats_info$count)

dv_with_stats <- left_join(dv_data_md, site_stats, 
                           by = c("site_no", "month_nu", "day_nu"))

dv_with_stats_dup <- dv_with_stats %>% 
  select(site_no, month_nu, day_nu) %>% 
  filter(site_no %in% dup_stats_info$site_no) 

# How many unique?
nrow(unique(dv_with_stats_dup))
# 16 - 2 sites, 8 days. Makes sense

# How many duplicates?
nrow(dv_with_stats_dup) - nrow(unique(dv_with_stats_dup))
# 24 - AH! the number of extra observations we are seeing after the join

@lindsayplatt lindsayplatt merged commit f76e89d into DOI-USGS:master Nov 7, 2018
@lindsayplatt lindsayplatt deleted the compute_colors branch November 13, 2018 19:51