Compute percentiles for daily values and their colors #45
Conversation
Merge branch 'master' of github.com:USGS-VIZLAB/gage-conditions-gif into compute_colors # Conflicts: # 2_process.yml
Merge branch 'master' of github.com:USGS-VIZLAB/gage-conditions-gif into compute_colors # Conflicts: # 1_fetch.yml # viz_config.yml
Merge branch 'master' of github.com:USGS-VIZLAB/gage-conditions-gif into compute_colors # Conflicts: # 2_process.yml
1_fetch/src/fetch_dv_data.R
Outdated
last_site <- i+request_limit-1
if(i == tail(req_bks, 1) && last_site > length(sites)) {
  last_site <- length(sites)
}
Wondering if the above 4 lines could be compressed to `last_site <- min(i+request_limit-1, length(sites))`
oooo nice!
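The suggested simplification drops the `if()` block entirely. A minimal base-R sketch of the batching loop with made-up `sites` and `request_limit` values (variable names follow the snippet above):

```r
sites <- sprintf("site%02d", 1:10)   # hypothetical site IDs
request_limit <- 4                   # hypothetical batch size

# start index of each batch: 1, 5, 9
req_bks <- seq(1, length(sites), by = request_limit)

batches <- list()
for (i in req_bks) {
  # min() clamps the final batch to the end of the vector,
  # replacing the four-line if() construction
  last_site <- min(i + request_limit - 1, length(sites))
  batches[[length(batches) + 1]] <- sites[i:last_site]
  # ...fetch this batch of sites here...
}
```

Every site lands in exactly one batch, and the final batch is simply shorter when `length(sites)` isn't a multiple of `request_limit`.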
1_fetch/src/fetch_dv_data.R
Outdated
dv_data <- rbind(dv_data, data_i)
print(paste("Completed", last_site, "of", length(sites)))
Use `message()` instead of `print()`?
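For context on this suggestion: `print()` writes to stdout, while `message()` writes to stderr, so progress reporting via `message()` can be silenced by callers and doesn't mix with captured output. A small illustration (the `report_progress` wrapper is hypothetical, not from the PR):

```r
report_progress <- function(done, total) {
  # message() goes to stderr and can be suppressed by callers
  message(sprintf("Completed %d of %d", done, total))
}

report_progress(50, 100)                    # emits a progress line on stderr
suppressMessages(report_progress(50, 100))  # silent
```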
parameterCd = "00060",
statCd = "00003") %>%
  dplyr::distinct() %>%
  dplyr::pull(site_no) %>%
Consider swapping the above two lines. `distinct` on multiple columns could still leave duplicates in `site_no`... which I think is what you want to avoid, right?
True, but `distinct` can't work on a vector (which is the output of `dplyr::pull`), so I will need to use `unique()`.
Oh, right. `unique`, then.
1_fetch/src/fetch_site_stats.R
Outdated
req_bks <- seq(1, length(sites), by=request_limit)
stat_data <- data.frame()
for(i in req_bks) {
  last_site <- i+request_limit-1
  if(i == tail(req_bks, 1) && last_site > length(sites)) {
    last_site <- length(sites)
  }
See comment under fetch_dv_data... I might be missing something, but I think this could be simplified.
2_process.yml
Outdated
# -- config --
proj_str:
  command: viz_config[[I('projection')]]
color_palette:
  command: viz_config[[I('color_palette')]]
Specify sites_color_palette to distinguish from the overall map palette (ocean and state colors, borders, etc.)?
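The rename being suggested might look like this in `2_process.yml` (a sketch; `sites_color_palette` is the proposed key name, and the corresponding `viz_config` entry would need to be renamed to match):

```yaml
# -- config --
proj_str:
  command: viz_config[[I('projection')]]
sites_color_palette:
  command: viz_config[[I('sites_color_palette')]]
```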
@@ -1,5 +1,5 @@
 choose_timesteps <- function(ind_file, dates) {
-  timesteps <- seq(as.POSIXct(dates$start, tz = "UTC"), as.POSIXct(dates$end, tz = "UTC"), by = 'hours')
+  timesteps <- seq(as.POSIXct(dates$start, tz = "UTC"), as.POSIXct(dates$end, tz = "UTC"), by = 'days')
Looks good. Note for the future: if we ever decide to try out 12-hour timesteps instead of daily ones (#23), this is where I think we'd want to adjust that.
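If 12-hour timesteps were ever tried (per #23), the change would presumably just be the `by` argument, since `seq()` on POSIXct accepts strings like `"12 hours"`. A sketch with made-up dates:

```r
start <- as.POSIXct("2018-01-01", tz = "UTC")
end   <- as.POSIXct("2018-01-03", tz = "UTC")

daily   <- seq(start, end, by = "days")      # 3 timesteps
halfday <- seq(start, end, by = "12 hours")  # 5 timesteps
```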
# Write the data file and the indicator file
data_file <- scipiper::as_data_file(ind_file)
saveRDS(dv_stats_with_color, data_file)
scipiper::gd_put(ind_file, data_file)
FYI, this pattern is fine, but newer versions of scipiper also allow you to just pass the `ind_file` to `gd_put`, so these three lines could even be two if you want:

saveRDS(dv_stats_with_color, scipiper::as_data_file(ind_file))
scipiper::gd_put(ind_file)
Ah, OK! I have just been copying and pasting this command and changing the object to save. Good to know :) I'll change this one and hopefully remember to use it as an example of how to do it!
print(paste("Completed", last_site, "of", length(sites)))
}

dv_data_unique <- dplyr::distinct(dv_data) # need this to avoid some duplicates
Other duplicates may be discoverable here if you look for rows that are distinct just among the columns agency_cd, site_no, and dateTime (e.g., check the results of `dup_dv <- dv_data %>% group_by(site_no, month_nu, day_nu) %>% summarize(n=n()) %>% filter(n > 1) %>% left_join(dv_data, by=c('site_no','month_nu','day_nu'))`)
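An equivalent duplicate check can be done in base R with `duplicated()` (a sketch substituting for the dplyr chain above; the toy data and column names follow the comment, not the real dataset):

```r
# Toy stand-in for dv_data with one duplicated site/month/day key
dv_data <- data.frame(
  site_no  = c("01574500", "01574500", "03292555"),
  month_nu = c(1, 1, 2),
  day_nu   = c(15, 15, 20),
  value    = c(10, 12, 7)
)

# Build a key from the columns that should uniquely identify a row
key <- paste(dv_data$site_no, dv_data$month_nu, dv_data$day_nu)

# Keep every row whose key occurs more than once: these are the
# candidate duplicates the comment suggests inspecting
dup_dv <- dv_data[key %in% key[duplicated(key)], ]
```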
day_nu = as.numeric(format(dateTime, "%d")))

# merge stats with the dv data
# merge still results in extra rows - 24 extra to be exact
I think I was seeing those with the test data you shared - they're duplicates in dv_data_md, right? They might be resolved by the suggestion above to call `pull` before `distinct` in `fetch_dv_sites`, or you could remove the duplicates in lines just above this one or at the end of `fetch_dv_data`.
even after adding that step, I am still getting 24 extra observations
Hmmm. Can you figure out which ones they are?
Working on this FYI!
2_process/src/process_dv_stats.R
Outdated
stat_colnames <- sprintf("p%s_va", percentiles)
stat_perc <- as.numeric(percentiles)/100

int_per <- function(df){
What does `int_per` stand for? A comment explaining the goal of this function might be useful.
I keep thinking interpolated percentile, but @jread-usgs created this code that I have more or less been cutting and pasting. That is what it is doing, though: taking the current daily value and interpolating what its percentile is based on the percentiles for the site and day of the year.
Yes, that is what it does: interpolate to the percentile.
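The interpolation being described can be sketched with base R's `approx()` (hypothetical percentile breakpoints for one site-day; this is an illustration of the idea, not the actual implementation in the PR):

```r
# Published stats for one site and day of year:
# discharge values observed at known percentiles
stat_perc   <- c(0.10, 0.25, 0.50, 0.75, 0.90)
stat_values <- c(5, 12, 30, 80, 150)

# Interpolate today's observed discharge to an estimated percentile;
# rule = 2 clamps values outside the known range to the end percentiles
int_per <- function(dv) {
  approx(x = stat_values, y = stat_perc, xout = dv, rule = 2)$y
}

int_per(30)   # exactly the median value, so 0.50
```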
Lemme know how much more time you want to spend debugging the scipiper workflow, if any (the stuff we were discussing on slack last night).
Rats about the conflicts in
Merge branch 'master' of github.com:USGS-VIZLAB/gage-conditions-gif into compute_colors # Conflicts: # 2_process.yml # 2_process/out/site_locations_sp.rds.ind # build/status/Ml9wcm9jZXNzL291dC9zaXRlX2xvY2F0aW9uc19zcC5yZHMuaW5k.yml # viz_config.yml
Ok, @aappling-usgs how's this??
dplyr::pull(site_no) %>%
  unique() %>%
  c(sites)
Is it possible that some sites show up in multiple HUCs? I don't know how that would happen, but since you're still seeing duplicates after many other fixes...
Looks great. @lindsaycarr , go ahead and merge this PR whenever you're ready (sounds like you're still exploring that last duplicates question). I'll be in meetings for the next couple hours and approve this PR pending whatever changes you decide to make or not make.
Some progress. It comes down to duplicate (sometimes x2, sometimes x3) stats for these 4 sites: "01574500" "03292555" "06903900" "08188590". Calling that good for this PR, and will try to add something to the dv_stats processing step in a later PR.
Computes which color the site should be for each day based on its associated stats and the daily value for discharge. The general format for the output file, `2_process/out/dv_stat_colors.rds`, is:

Sometimes when I build, even after I've sat through `1_fetch/out/dv_data.rds.ind` running through and downloading daily discharge for all 8,364 sites, when `dv_data <- readRDS(scipiper::sc_retrieve(dv_data_ind, remake_file = '1_fetch.yml'))` runs during `2_process/out/dv_stats.rds.ind`, it kicks off that download yet again... I haven't figured out why.

Also, there are still 24 observations that might be duplicated. When the `left_join` happens in `process_dv_stats.R`, 24 more rows are added.

Fixes #19