
column type validation #740

Open
mkoohafkan opened this issue Dec 19, 2024 · 4 comments

Comments

@mkoohafkan

mkoohafkan commented Dec 19, 2024

It would be great if retrieval functions such as readWQPqw() and readNWISdv() could guarantee returning consistent column types for all fields. I have run across issues where some queries return fields with different types, typically when the field contains all NA values. For example, I have seen readWQPqw() queries return the "ActivityEndDate" field as a character instead of a Date when all values are NA.

Inconsistent field types can cause issues in split-apply-combine workflows, e.g. using targets to branch over multiple gages/regions/etc. and recombine datasets, or really any workflow that uses dplyr::bind_rows() to merge results from multiple queries. I've written a small validation function in my own workflow to enforce the (presumably correct) field types for readWQPqw() (see below), but I think this is something that should be enforced in dataRetrieval.

library(dplyr)
library(lubridate)
# Coerce known readWQPqw() fields to their expected types so that results
# from multiple queries can be safely combined with dplyr::bind_rows()
format_WQPqw_data = function(df) {
  # fields that should always be character
  char_cols = c(
    "OrganizationIdentifier",
    "OrganizationFormalName",
    "ActivityIdentifier",
    "ActivityTypeCode",
    "ActivityMediaName",
    "ActivityMediaSubdivisionName",
    "ActivityStartTime.Time",
    "ActivityStartTime.TimeZoneCode",
    "ActivityEndTime.Time",
    "ActivityEndTime.TimeZoneCode",
    "ActivityDepthHeightMeasure.MeasureUnitCode",
    "ActivityDepthAltitudeReferencePointText",
    "ActivityTopDepthHeightMeasure.MeasureUnitCode",
    "ActivityBottomDepthHeightMeasure.MeasureUnitCode",
    "ProjectIdentifier",
    "ActivityConductingOrganizationText",
    "MonitoringLocationIdentifier",
    "ActivityCommentText",
    "SampleAquifer",
    "HydrologicCondition",
    "HydrologicEvent",
    "SampleCollectionMethod.MethodIdentifier",
    "SampleCollectionMethod.MethodIdentifierContext",
    "SampleCollectionMethod.MethodName",
    "SampleCollectionEquipmentName",
    "ResultDetectionConditionText",
    "CharacteristicName",
    "ResultSampleFractionText",
    "ResultMeasure.MeasureUnitCode",
    "MeasureQualifierCode",
    "ResultStatusIdentifier",
    "StatisticalBaseCode",
    "ResultValueTypeName",
    "ResultWeightBasisText",
    "ResultTimeBasisText",
    "ResultTemperatureBasisText",
    "ResultParticleSizeBasisText",
    "ResultCommentText",
    "USGSPCode",
    "ResultDepthHeightMeasure.MeasureUnitCode",
    "ResultDepthAltitudeReferencePointText",
    "SubjectTaxonomicName",
    "SampleTissueAnatomyName",
    "ResultAnalyticalMethod.MethodIdentifier",
    "ResultAnalyticalMethod.MethodIdentifierContext",
    "ResultAnalyticalMethod.MethodName",
    "MethodDescriptionText",
    "LaboratoryName",
    "ResultLaboratoryCommentText",
    "DetectionQuantitationLimitTypeName",
    "DetectionQuantitationLimitMeasure.MeasureUnitCode",
    "ProviderName"
  )
  # fields that should always be numeric
  numeric_cols = c(
    "ActivityDepthHeightMeasure.MeasureValue",
    "ActivityTopDepthHeightMeasure.MeasureValue",
    "ActivityBottomDepthHeightMeasure.MeasureValue",
    "ResultMeasureValue",
    "PrecisionValue",
    "ResultDepthHeightMeasure.MeasureValue",
    "DetectionQuantitationLimitMeasure.MeasureValue",
    "ActivityStartTime.TimeZoneCode_offset",
    "ActivityEndTime.TimeZoneCode_offset"
  )
  # fields that should always be Date
  date_cols = c(
    "ActivityStartDate",
    "ActivityEndDate",
    "AnalysisStartDate",
    "PreparationStartDate"
  )
  # fields that should always be POSIXct date-times
  datetime_cols = c(
    "ActivityStartDateTime",
    "ActivityEndDateTime"
  )

  # any_of() skips fields that are absent from this query's profile
  df |>
    mutate(
      across(any_of(char_cols), as.character),
      across(any_of(numeric_cols), as.numeric),
      across(any_of(date_cols), as_date),
      across(any_of(datetime_cols), as_datetime)
    )
}
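For illustration, here is a minimal, self-contained reproduction of the bind_rows() failure that this function guards against (the field name is a real WQP field; the data are made up):

```r
library(dplyr)

# query 1 returned real end dates; query 2 had all-NA values,
# so the field came back as character instead of Date
df1 <- tibble(site = "A", ActivityEndDate = as.Date("2024-01-15"))
df2 <- tibble(site = "B", ActivityEndDate = NA_character_)

res <- try(bind_rows(df1, df2), silent = TRUE)
inherits(res, "try-error")  # TRUE: can't combine <date> and <character>

# coercing to the expected type first (as format_WQPqw_data() does) fixes it
df2$ActivityEndDate <- as.Date(df2$ActivityEndDate)
combined <- bind_rows(df1, df2)
nrow(combined)  # 2
```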
@ldecicco-USGS
Collaborator

Believe me...I've wanted to do this for a long time. However, there are MANY edge cases where a column that is normally a date (or a number, or whatever) just isn't. Sometimes it is an error in the data, but in most cases it is actually how the data is supposed to be released. For example, there might be a value "ice" in the results_va. If we only output numeric, we'll lose important information. Similarly with dates, there are many examples that might be 12/2004 or just 2004. We cannot just add a fake middle date, and we also don't want to lose that information.

OK, so what to do? I usually use the argument convertType = FALSE, which brings everything in as character (and therefore allows bind_rows() and similar methods to work). Then, for WQP queries, you can run parse_WQP() to set the column types. The parse_WQP() function is similar to what you've got above; it also converts the dateTime columns to POSIXct objects.

readNWISdv() doesn't have the convertType option, but readNWISdata() does. There are going to be a lot of changes to the NWIS workflows in the near term, so I'm a bit hesitant to change those now to add an option like the one in the WQP functions.
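A sketch of the workflow described above, assuming a live WQP connection; the site numbers and parameter code are placeholders, and parse_WQP() is used as described in the comment:

```r
library(dataRetrieval)
library(dplyr)

# bring everything in as character so row-binding never hits a type clash
raw_a <- readWQPqw("USGS-04024315", "00300", convertType = FALSE)
raw_b <- readWQPqw("USGS-04024000", "00300", convertType = FALSE)

combined <- bind_rows(raw_a, raw_b)

# then assign the documented column types (dates, numerics, POSIXct dateTimes)
typed <- parse_WQP(combined)
```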

@mkoohafkan
Author

Thanks for the reference to parse_WQP(); I'll keep an eye out for changes to the NWIS workflows.

Your example surprises me: I would have expected "ice" to go in a status flag field rather than the result value field. The incomplete date issue is tricky.

@mkoohafkan
Author

mkoohafkan commented Dec 20, 2024

A different but related issue: what about consistent fields? An empty query can return a 0x0 tibble, but it would be better to return a 0xN tibble with all of the expected field names.
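For instance, a zero-row tibble that still carries the schema can be combined with non-empty results without special-casing; the column names below are a small illustrative subset:

```r
library(dplyr)

# 0xN result: no rows, but names and types are preserved
empty_result <- tibble(
  MonitoringLocationIdentifier = character(),
  ActivityStartDate = as.Date(character()),
  ResultMeasureValue = numeric()
)

nonempty_result <- tibble(
  MonitoringLocationIdentifier = "USGS-04024315",
  ActivityStartDate = as.Date("2024-06-01"),
  ResultMeasureValue = 8.7
)

combined <- bind_rows(empty_result, nonempty_result)
nrow(combined)  # 1
```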

@ldecicco-USGS
Collaborator

We used to do that, but the issue is that (especially with WQP) there's been an explosion of "dataProfiles", and each dataProfile brings a new set of columns and column names. It was easy to return an empty data frame with the correct names but no rows when there was a single format; now it is a lot more challenging. However, I think in the new beta services (ResultsWQX), calls are coming back with exactly what you are asking for, from the service itself, which is much more ideal than having dataRetrieval create it.

I could be wrong, but I think NWIS also comes back with the info needed to build an empty data.frame (we've avoided adding dplyr/tibble as a dependency). If not, NWIS is going through a major modernization this year, so something like that will probably happen.
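In base R, the same schema-preserving empty frame needs no tibble/dplyr dependency; again, the column names are illustrative:

```r
# zero-row data.frame with the expected names and types
empty_df <- data.frame(
  MonitoringLocationIdentifier = character(0),
  ActivityStartDate = as.Date(character(0)),
  ResultMeasureValue = numeric(0),
  stringsAsFactors = FALSE
)

# base rbind() appends later rows without losing the types
row1 <- data.frame(
  MonitoringLocationIdentifier = "USGS-04024315",
  ActivityStartDate = as.Date("2024-06-01"),
  ResultMeasureValue = 8.7,
  stringsAsFactors = FALSE
)
full_df <- rbind(empty_df, row1)
```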
