
column type validation #740

Open
mkoohafkan opened this issue Dec 19, 2024 · 4 comments

Comments

@mkoohafkan

mkoohafkan commented Dec 19, 2024

It would be great if retrieval functions such as readWQPqw() and readNWISdv() could guarantee returning consistent column types for all fields. I have run across issues where some queries return fields with different types, typically when the field contains all NA values. For example, I have seen readWQPqw() queries return the "ActivityEndDate" field as a character instead of a Date when all values are NA.

Inconsistent field types can cause issues in split-apply-combine workflows, e.g. using targets to branch over multiple gages/regions/etc. and recombine datasets, or really any workflow that uses dplyr::bind_rows() to merge results from multiple queries. I've written a small validation function in my own workflow to enforce the (presumably correct) field types for readWQPqw() (see below), but I think this is something that should be enforced in dataRetrieval.

library(dplyr)
library(lubridate)
# Coerce known readWQPqw() fields to their expected types so that results
# from multiple queries can be safely combined with dplyr::bind_rows()
format_WQPqw_data = function(df) {
  # fields that should always be character
  char_cols = c(
    "OrganizationIdentifier",
    "OrganizationFormalName",
    "ActivityIdentifier",
    "ActivityTypeCode",
    "ActivityMediaName",
    "ActivityMediaSubdivisionName",
    "ActivityStartTime.Time",
    "ActivityStartTime.TimeZoneCode",
    "ActivityEndTime.Time",
    "ActivityEndTime.TimeZoneCode",
    "ActivityDepthHeightMeasure.MeasureUnitCode",
    "ActivityDepthAltitudeReferencePointText",
    "ActivityTopDepthHeightMeasure.MeasureUnitCode",
    "ActivityBottomDepthHeightMeasure.MeasureUnitCode",
    "ProjectIdentifier",
    "ActivityConductingOrganizationText",
    "MonitoringLocationIdentifier",
    "ActivityCommentText",
    "SampleAquifer",
    "HydrologicCondition",
    "HydrologicEvent",
    "SampleCollectionMethod.MethodIdentifier",
    "SampleCollectionMethod.MethodIdentifierContext",
    "SampleCollectionMethod.MethodName",
    "SampleCollectionEquipmentName",
    "ResultDetectionConditionText",
    "CharacteristicName",
    "ResultSampleFractionText",
    "ResultMeasure.MeasureUnitCode",
    "MeasureQualifierCode",
    "ResultStatusIdentifier",
    "StatisticalBaseCode",
    "ResultValueTypeName",
    "ResultWeightBasisText",
    "ResultTimeBasisText",
    "ResultTemperatureBasisText",
    "ResultParticleSizeBasisText",
    "ResultCommentText",
    "USGSPCode",
    "ResultDepthHeightMeasure.MeasureUnitCode",
    "ResultDepthAltitudeReferencePointText",
    "SubjectTaxonomicName",
    "SampleTissueAnatomyName",
    "ResultAnalyticalMethod.MethodIdentifier",
    "ResultAnalyticalMethod.MethodIdentifierContext",
    "ResultAnalyticalMethod.MethodName",
    "MethodDescriptionText",
    "LaboratoryName",
    "ResultLaboratoryCommentText",
    "DetectionQuantitationLimitTypeName",
    "DetectionQuantitationLimitMeasure.MeasureUnitCode",
    "ProviderName"
  )
  # fields that should always be numeric
  numeric_cols = c(
    "ActivityDepthHeightMeasure.MeasureValue",
    "ActivityTopDepthHeightMeasure.MeasureValue",
    "ActivityBottomDepthHeightMeasure.MeasureValue",
    "ResultMeasureValue",
    "PrecisionValue",
    "ResultDepthHeightMeasure.MeasureValue",
    "DetectionQuantitationLimitMeasure.MeasureValue",
    "ActivityStartTime.TimeZoneCode_offset",
    "ActivityEndTime.TimeZoneCode_offset"
  )
  # fields that should always be Date
  date_cols = c(
    "ActivityStartDate",
    "ActivityEndDate",
    "AnalysisStartDate",
    "PreparationStartDate"
  )
  # fields that should always be POSIXct date-times
  datetime_cols = c(
    "ActivityStartDateTime",
    "ActivityEndDateTime"
  )

  # any_of() skips fields that are absent from this query's profile
  df |>
    mutate(
      across(any_of(char_cols), as.character),
      across(any_of(numeric_cols), as.numeric),
      across(any_of(date_cols), as_date),
      across(any_of(datetime_cols), as_datetime)
    )
}
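For illustration, here is a minimal, self-contained reproduction of the bind_rows() failure that this function guards against (the field name is a real WQP field; the data are made up):

```r
library(dplyr)

# query 1 returned real end dates; query 2 had all-NA values,
# so the field came back as character instead of Date
df1 <- tibble(site = "A", ActivityEndDate = as.Date("2024-01-15"))
df2 <- tibble(site = "B", ActivityEndDate = NA_character_)

res <- try(bind_rows(df1, df2), silent = TRUE)
inherits(res, "try-error")  # TRUE: can't combine <date> and <character>

# coercing to the expected type first (as format_WQPqw_data() does) fixes it
df2$ActivityEndDate <- as.Date(df2$ActivityEndDate)
combined <- bind_rows(df1, df2)
nrow(combined)  # 2
```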
@ldecicco-USGS
Collaborator

Believe me...I've wanted to do this for a long time. However, there are MANY edge cases where a column that is normally a date (or a number, or whatever) just isn't. Sometimes it is an error in the data, but in most cases it is actually how the data is supposed to be released. For example, there might be a value "ice" in the results_va. If we only output numeric, we'll lose important information. Similarly with dates, there are many examples that might be 12/2004 or just 2004. We cannot just add a fake middle date, and we also don't want to lose that information.

OK, so what to do? I usually use the argument convertType = FALSE, which brings everything in as character (and therefore allows bind_rows() and similar methods to work). Then, for WQP queries, you can run parse_WQP() to set the column types. The parse_WQP() function is similar to what you've got above; it also converts the dateTime columns to POSIXct objects.

readNWISdv() doesn't have the convertType option, but readNWISdata() does. There are going to be a lot of changes to the NWIS workflows in the near term, so I'm a bit hesitant to change those now to add an option like the one in the WQP functions.
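A sketch of the workflow described above, assuming a live WQP connection; the site numbers and parameter code are placeholders, and parse_WQP() is used as described in the comment:

```r
library(dataRetrieval)
library(dplyr)

# bring everything in as character so row-binding never hits a type clash
raw_a <- readWQPqw("USGS-04024315", "00300", convertType = FALSE)
raw_b <- readWQPqw("USGS-04024000", "00300", convertType = FALSE)

combined <- bind_rows(raw_a, raw_b)

# then assign the documented column types (dates, numerics, POSIXct dateTimes)
typed <- parse_WQP(combined)
```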

@mkoohafkan
Author

Thanks for the reference to parse_WQP(); I'll keep an eye out for changes to the NWIS workflows.

Your example surprises me: I would have expected "ice" to go in a status flag field rather than the result value field. The incomplete date issue is tricky.

@mkoohafkan
Author

mkoohafkan commented Dec 20, 2024

A different but related issue: what about consistent fields? An empty query can return a 0x0 tibble, but it would be better to return a 0xN tibble with all of the expected field names.
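For instance, a zero-row tibble that still carries the schema can be combined with non-empty results without special-casing; the column names below are a small illustrative subset:

```r
library(dplyr)

# 0xN result: no rows, but names and types are preserved
empty_result <- tibble(
  MonitoringLocationIdentifier = character(),
  ActivityStartDate = as.Date(character()),
  ResultMeasureValue = numeric()
)

nonempty_result <- tibble(
  MonitoringLocationIdentifier = "USGS-04024315",
  ActivityStartDate = as.Date("2024-06-01"),
  ResultMeasureValue = 8.7
)

combined <- bind_rows(empty_result, nonempty_result)
nrow(combined)  # 1
```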

@ldecicco-USGS
Collaborator

We used to do that, but the issue is that (especially with WQP) there's been an explosion of "dataProfiles", and each dataProfile brings a new set of columns and column names. It was easy to return an empty data frame with the correct names but no rows when there was a single format; now it is a lot more challenging. However, I think in the new beta services (ResultsWQX), calls are coming back with exactly what you are asking for, from the service itself, which is much more ideal than having dataRetrieval create it.

I could be wrong, but I think NWIS also comes back with the info needed to build an empty data.frame (we've avoided adding dplyr/tibble as a dependency). If not, NWIS is going through a major modernization this year, so something like that will probably happen.
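In base R, the same schema-preserving empty frame needs no tibble/dplyr dependency; again, the column names are illustrative:

```r
# zero-row data.frame with the expected names and types
empty_df <- data.frame(
  MonitoringLocationIdentifier = character(0),
  ActivityStartDate = as.Date(character(0)),
  ResultMeasureValue = numeric(0),
  stringsAsFactors = FALSE
)

# base rbind() appends later rows without losing the types
row1 <- data.frame(
  MonitoringLocationIdentifier = "USGS-04024315",
  ActivityStartDate = as.Date("2024-06-01"),
  ResultMeasureValue = 8.7,
  stringsAsFactors = FALSE
)
full_df <- rbind(empty_df, row1)
```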
