diff --git a/README.md b/README.md index 0d67de7..01bf068 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,5 @@ # harmonize-wq -Standardize, clean and wrangle Water Quality Portal data into more analytic-ready formats +Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats US EPA’s [Water Quality Portal (WQP)](https://www.waterqualitydata.us/) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieval data using [python](https://github.com/USGS-python/dataretrieval) or [R](https://github.com/USGS-R/dataRetrieval). Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing the definition of analysis-ready varies depending on the analysis, the harmonixe_wq package is intended to be a flexible water quality specific framework to help: - Identify differences in data units (including speciation and basis) diff --git a/docs/source/Code of Conduct.rst b/docs/source/Code of Conduct.rst index de90deb..d9a91f0 100644 --- a/docs/source/Code of Conduct.rst +++ b/docs/source/Code of Conduct.rst @@ -1,5 +1,5 @@ -# CONTRIBUTOR CODE OF CONDUCT -============================= +CONTRIBUTOR CODE OF CONDUCT +=========================== As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities. @@ -11,4 +11,4 @@ Project maintainers have the right and responsibility to remove, edit, or reject Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers. 
-This Code of Conduct is adapted from the Contributor Covenant, version 1.0.0, available at https://www.contributor-covenant.org/version/1/0/0/code-of-conduct.html +This Code of Conduct is adapted from the Contributor Covenant, version 1.0.0, available at https://www.contributor-covenant.org/version/1/0/0/code-of-conduct.html. diff --git a/docs/source/conf.py b/docs/source/conf.py index bf64351..7347951 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -42,6 +42,7 @@ "sphinx.ext.coverage", "sphinx.ext.napoleon", "sphinx.ext.intersphinx", + 'sphinx.ext.autosectionlabel', "sphinxcontrib.spelling", ] diff --git a/docs/source/contributing.rst b/docs/source/contributing.rst index 7d3308c..b7b5845 100644 --- a/docs/source/contributing.rst +++ b/docs/source/contributing.rst @@ -6,7 +6,8 @@ Contributing to harmonize_wq We’re so glad you’re thinking about contributing to an EPA open source project! If you’re unsure about anything, just ask — or submit your issue or pull request anyway. The worst that can happen is we’ll politely ask you to change something. We appreciate all friendly contributions. We encourage you to read this project’s CONTRIBUTING policy (you are here), its -LICENSE, and its `README `_. +`LICENSE `_, +and its `README `_. All contributions to this project will be released under the MIT dedication. By submitting a pull request or issue, you are agreeing to comply with this waiver of copyright interest. @@ -22,15 +23,15 @@ You can contribute in different ways: Report issues ------------- -You can report any issues with the package, the documentation to the `issue tracker`_. -Also feel free to submit feature requests, comments or questions. +You can report any issues with the package or the documentation to the `issue tracker`_. +Also feel free to submit feature requests, comments, or questions. 
Contribute code --------------- -To contribute fixes, code or documentation, fork harmonize_wq in GitHub_ and submit -the changes using a pull request against the **main** branch. +To contribute fixes, code, tests, or documentation, fork harmonize_wq in GitHub_ +and submit the changes using a pull request against the **main** branch. - If you are submitting new code, add tests (see below) and documentation. - Write "Closes #" in the PR description or a comment, as described in the @@ -41,7 +42,7 @@ In any case, feel free to use the `issue tracker`_ to discuss ideas for new feat Notice that we will not merge a PR if tests are failing. In certain cases tests pass in your machine but not in GitHub actions. There might be multiple reasons for this but these are some of -the most common +the most common: - Your new code does not work for other operating systems or Python versions. - The documentation is not being built properly or the examples in the docs are diff --git a/docs/source/example workflow.rst b/docs/source/example workflow.rst index 32832fa..850038b 100644 --- a/docs/source/example workflow.rst +++ b/docs/source/example workflow.rst @@ -70,7 +70,7 @@ There are many columns in the :class:`pandas.DataFrame` that are characteristic # Combine rows with the same sample organization, activity, location, and datetime df_wide = wrangle.collapse_results(main_df) -The number of columns in the resulting table is greatly reduced +The number of columns in the resulting table is greatly reduced: +----------------------------+-------------+----------------------------------------+-------------------------------+ | Output Column | Type | Source | Changes | diff --git a/docs/source/index.rst b/docs/source/index.rst index 9ac9b2e..996d24a 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -5,8 +5,8 @@ harmonize_wq: ============= -Standardize, clean and wrangle Water Quality Portal data into more analytic-ready formats 
------------------------------------------------------------------------------------------ +Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats +------------------------------------------------------------------------------------------ **Useful links**: `Code Repository `__ | `Issues `__ diff --git a/docs/source/overview.rst b/docs/source/overview.rst index ea0a0c3..b09b40f 100644 --- a/docs/source/overview.rst +++ b/docs/source/overview.rst @@ -3,7 +3,7 @@ Overview ======== -US EPA’s `Water Quality Portal (WQP) `_ aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieval data using `python `_ or `R `_. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing the definition of analysis-ready varies depending on the analysis, the harmonize_wq package is intended to be a flexible water quality specific framework to help: +US EPA’s `Water Quality Portal (WQP) `_ aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieve data using `python `_ or `R `_. Given the variety of data and data originators, using the data in analysis often requires cleaning to ensure it meets required quality standards and wrangling to get it in a more analytic-ready format. 
Recognizing the definition of analysis-ready varies depending on the analysis, the harmonize_wq package is intended to be a flexible water quality specific framework to help: * Identify differences in data units (including speciation and basis) * Identify differences in sampling or analytic methods diff --git a/harmonize_wq/basis.py b/harmonize_wq/basis.py index 4658a44..8b2a4d5 100644 --- a/harmonize_wq/basis.py +++ b/harmonize_wq/basis.py @@ -8,18 +8,21 @@ def unit_basis_dict(out_col): """Characteristic specific basis dictionary to define basis from units. + The out_col is often derived from :attr:`WQCharData.char_val`. The desired + basis can be used as a key to subset result. + Parameters ---------- out_col : str - Column name where results are written (char_val derived) + Column name where results are written. Returns ------- dict Dictionary with logic for determining basis from units string and - standard pint units to replace those with. + standard :mod:`pint` units to replace those with. The structure is {Basis: {standard units: [unit strings with basis]}}. - + Examples -------- Get dictionary for Phosphorus and subset for 'as P': @@ -43,7 +46,7 @@ def unit_basis_dict(out_col): def basis_conversion(): """Get dictionary of conversion factors to convert basis/speciation. - For example, this is used to convert 'as PO4' to 'as P' + For example, this is used to convert 'as PO4' to 'as P'. Returns ------- @@ -52,11 +55,10 @@ def basis_conversion(): See Also -------- - convert.moles_to_mass() + :func:`convert.moles_to_mass` - Originally from Table 1 in 'Best Practices for Submitting Nutrient Data to - the Water Quality eXchange (WQX) - ' + `Best Practices for Submitting Nutrient Data to the Water Quality eXchange + `_ """ return {'NH3': 0.822, 'NH4': 0.776, @@ -75,7 +77,7 @@ def stp_dict(): Returns ------- dict - Dictionary with {'standard temp' : {'units': [values to replace]}} + Dictionary with {'standard temp' : {'units': [values to replace]}}. 
Examples -------- @@ -88,12 +90,14 @@ def stp_dict(): return {'@25C': {'mg/mL': ['mg/mL @25C']}} -def basis_from_unit(df_in, basis_dict, unit_col, basis_col='Speciation'): - """Create standardized Basis column in :class:`pandas.DataFrame`. +def basis_from_unit(df_in, basis_dict, unit_col='Units', basis_col='Speciation'): + """Move basis from units to basis column in :class:`pandas.DataFrame`. - Standardizes units in units column based on basis_dict. Units column is - updated in place, it should not be original 'ResultMeasure/MeasureUnitCode' - to maintain data integrity. + Move basis information from units in unit_col column to basis in basis_col + column based on basis_dict. If basis_col does not exist in df_in it will be + created. The unit_col column is updated in place. To maintain data + integrity unit_col should not be the original + 'ResultMeasure/MeasureUnitCode' column. Parameters ---------- @@ -101,16 +105,17 @@ def basis_from_unit(df_in, basis_dict, unit_col, basis_col='Speciation'): DataFrame that will be updated. basis_dict : dict Dictionary with structure {basis:{new_unit:[old_units]}}. - unit_col : str - string for the column name in df to be used. + unit_col : str, optional + String for the units column name in df_in to use. + The default is 'Units'. basis_col : str, optional - string for the basis column name in df to be used. + String for the basis column name in df_in to use. The default is 'Speciation'. Returns ------- df : pandas.DataFrame - Updated copy of df_in + Updated copy of df_in. 
Examples -------- @@ -134,8 +139,8 @@ def basis_from_unit(df_in, basis_dict, unit_col, basis_col='Speciation'): 0 Phosphorus mg/l as P mg/l as P 1 Phosphorus mg/kg as P mg/kg as P - If an existing basis_col value is different a warning is issued when it is - updated and a QA_flag is assigned + If an existing basis_col value is different, a warning is issued when it is + updated and a QA_flag is assigned: >>> from numpy import nan >>> df['Speciation'] = [nan, 'as PO4'] @@ -226,7 +231,15 @@ def basis_from_methodSpec(df_in): def update_result_basis(df_in, basis_col, unit_col): - """Basis from result col that is not moved to a new col. + """Move basis from unit_col column to basis_col column. + + This is usually used in place of basis_from_unit when the basis_col is not + 'ResultMeasure/MeasureUnitCode' (i.e., not speciation). + + Notes + ----- + Rather than creating many new empty columns this function currently overwrites the original + basis_col values. The original values are noted in the QA_flag. Parameters ---------- @@ -234,7 +247,7 @@ def update_result_basis(df_in, basis_col, unit_col): DataFrame that will be updated. basis_col : str Column in df_in with result basis to update. Expected values are - 'ResultTemperatureBasisText' + 'ResultTemperatureBasisText'. unit_col : str Column in df_in with units that may contain basis. @@ -246,7 +259,6 @@ def update_result_basis(df_in, basis_col, unit_col): Examples -------- Build pandas DataFrame for example: - Note: 'Units' is used to preserve 'ResultMeasure/MeasureUnitCode' >>> from pandas import DataFrame >>> from numpy import nan @@ -258,7 +270,7 @@ def update_result_basis(df_in, basis_col, unit_col): CharacteristicName ResultTemperatureBasisText Units 0 Salinity 25 deg C mg/mL @25C 1 Salinity NaN mg/mL @25C - + >>> from harmonize_wq import basis >>> df_temp_basis = basis.update_result_basis(df, ... 
'ResultTemperatureBasisText', @@ -294,7 +306,7 @@ def update_result_basis(df_in, basis_col, unit_col): def set_basis(df_in, mask, basis, basis_col='Speciation'): - """Update basis_col to basis where col is expected_val. + """Update or create basis_col with basis as value. Parameters ---------- @@ -311,8 +323,34 @@ def set_basis(df_in, mask, basis, basis_col='Speciation'): Returns ------- df_out : pandas.DataFrame - Updated copy of df_in + Updated copy of df_in. + + Examples + -------- + Build pandas DataFrame for example: + + >>> from pandas import DataFrame + >>> df = DataFrame({'CharacteristicName': ['Phosphorus', + ... 'Phosphorus', + ... 'Salinity'], + ... 'MethodSpecificationName': ['as P', 'as PO4', ''], + ... }) + >>> df + CharacteristicName MethodSpecificationName + 0 Phosphorus as P + 1 Phosphorus as PO4 + 2 Salinity + + Build mask for example: + >>> mask = df['CharacteristicName']=='Phosphorus' + + >>> from harmonize_wq import basis + >>> basis.set_basis(df, mask, basis='as P') + CharacteristicName MethodSpecificationName Speciation + 0 Phosphorus as P as P + 1 Phosphorus as PO4 as P + 2 Salinity NaN """ df_out = df_in.copy() # Add Basis column if it doesn't exist @@ -321,35 +359,3 @@ def set_basis(df_in, mask, basis, basis_col='Speciation'): # Populate Basis column where expected value with basis df_out.loc[mask, basis_col] = basis return df_out - - -def basis_qa_flag(trouble, basis, spec_col='MethodSpecificationName'): - """Get QA_flag for different basis in MethodsSpeciation and units. - - NOTE: Deprecate - not currently in use anywhere - - Parameters - ---------- - trouble : str - Problem encountered (e.g., unit_basis != speciation). - basis : str - The basis from the unit that replaced the original speciation. - spec_col : str, optional - Column currently being checked. Default is 'MethodSpecificationName' - - Returns - ------- - str - Flag to use in QA_flag column. 
- - Examples - -------- - Formats QA_Flag - - >>> from harmonize_wq import basis - >>> basis.basis_qa_flag('(units)', - ... 'updated from 25 deg C to @25C', - ... 'ResultTemperatureBasisText') - 'ResultTemperatureBasisText: updated from 25 deg C to @25C (units)' - """ - return '{}: {} {}'.format(spec_col, basis, trouble) diff --git a/harmonize_wq/clean.py b/harmonize_wq/clean.py index cbb0ec0..9feddaf 100644 --- a/harmonize_wq/clean.py +++ b/harmonize_wq/clean.py @@ -8,17 +8,17 @@ def datetime(df_in): - """Format time using dataretrieval and 'ActivityStart'. + """Format time using :mod:`dataretrieval` and 'ActivityStart' columns. Parameters ---------- df_in : pandas.DataFrame - DataFrame with the expected activity date time columns. + DataFrame with the expected activity date, time and timezone columns. Returns ------- df_out : pandas.DataFrame - DataFrame with the converted date and datetime columns. + DataFrame with the converted datetime column. Examples -------- @@ -55,7 +55,15 @@ def datetime(df_in): def harmonize_depth(df_in, units='meter'): - """Doesn't currently pass errors or unit registry (ureg). + """Create 'Depth' column with result depth values in consistent units. + + The new column is based on values from the 'ResultDepthHeightMeasure/MeasureValue' column and + units from the 'ResultDepthHeightMeasure/MeasureUnitCode' column. + + Notes + ----- + If there are errors or unit registry (ureg) updates these are not currently + passed back. In the future activity depth columns may be considered if result depth missing. Parameters ---------- @@ -84,7 +92,7 @@ def harmonize_depth(df_in, units='meter'): 1 NaN NaN 2 10 ft - Get clean depth column: + Get clean 'Depth' column: >>> from harmonize_wq import clean >>> clean.harmonize_depth(df)[['ResultDepthHeightMeasure/MeasureValue', @@ -114,8 +122,12 @@ def harmonize_depth(df_in, units='meter'): def check_precision(df_in, col, limit=3): - """Note - be cautious of float type and real vs representable precision. 
+ """Add QA_flag if value in column has precision lower than limit. + Notes + ----- + Be cautious of float type and real vs representable precision. + Parameters ---------- df_in : pandas.DataFrame @@ -128,7 +140,7 @@ def check_precision(df_in, col, limit=3): Returns ------- df_out : pandas.DataFrame - DataFrame with the quality assurance flag for precision + DataFrame with the quality assurance flag for precision. """ df_out = df_in.copy() @@ -141,6 +153,10 @@ def check_precision(df_in, col, limit=3): def methods_check(df_in, char_val, methods=None): """Check methods against list of accepted methods. + + Notes + ----- + This is not fully implemented. Parameters ---------- @@ -151,12 +167,13 @@ def methods_check(df_in, char_val, methods=None): methods : dict, optional Dictionary where key is characteristic column name and value is list of dictionaries each with Source and Method keys. This allows updated - methods dictionaries to be used. The default None, uses the built-in - domains.accepted_methods(). + methods dictionaries to be used. The default None uses the built-in + :meth:`domains.accepted_methods`. Returns ------- - None. + accept : list + List of values from 'ResultAnalyticalMethod/MethodIdentifier' column in methods. """ if methods is None: @@ -179,19 +196,22 @@ def methods_check(df_in, char_val, methods=None): def wet_dry_checks(df_in, mask=None): - """Fix known errors in MediaName using WeightBasis/SampleFraction columns. + """Fix suspected errors in 'ActivityMediaName' column. + + Uses the 'ResultWeightBasisText' and 'ResultSampleFractionText' columns to switch if the media is + wet/dry where appropriate. Parameters ---------- df_in : pandas.DataFrame DataFrame that will be updated. mask : pandas.Series - Row conditional (bool) mask to limit df rows to check/fix + Row conditional (bool) mask to limit df rows to check/fix. The default is None. Returns ------- df_out : pandas.DataFrame - Updated DataFrame + Updated DataFrame. 
""" df_out = df_in.copy() @@ -225,14 +245,14 @@ def wet_dry_drop(df_in, wet_dry='wet', char_val=None): df_in : pandas.DataFrame DataFrame that will be updated. wet_dry : str, optional - Which values (Water/Sediment) to keep. The default is 'wet' (Water) + Which values (Water/Sediment) to keep. The default is 'wet' (Water). char_val : str, optional Apply to specific characteristic name. The default is None (for all). Returns ------- df2 : pandas.DataFrame - Updated copy of df_in + Updated copy of df_in. """ df2 = df_in.copy() if char_val: diff --git a/harmonize_wq/convert.py b/harmonize_wq/convert.py index 53873c4..e5b1205 100644 --- a/harmonize_wq/convert.py +++ b/harmonize_wq/convert.py @@ -1,8 +1,7 @@ # -*- coding: utf-8 -*- -""" -Functions to convert from one unit to another, sometimes using pint decorators. +"""Functions to convert from one unit to another, sometimes using :mod:`pint` decorators. -Contains several unit conversion functions not in Pint. +Contains several unit conversion functions not in :mod:`pint`. """ import pint @@ -49,12 +48,12 @@ def mass_to_moles(ureg, char_val, Q_): Examples -------- - Build standard Pint unit registry + Build standard pint unit registry: >>> import pint >>> ureg = pint.UnitRegistry() - Build quantity + Build pint quantity: >>> Q_ = pint.Quantity('1 g') >>> convert.mass_to_moles(ureg, 'Phosphorus', Q_) @@ -68,6 +67,8 @@ def mass_to_moles(ureg, char_val, Q_): def moles_to_mass(ureg, Q_, basis=None, char_val=None): """Convert moles substance to mass. + Either basis or char_val must have a non-default value. 
+ Parameters ---------- ureg : pint.UnitRegistry @@ -88,7 +89,7 @@ def moles_to_mass(ureg, Q_, basis=None, char_val=None): Examples -------- - Build standard Pint unit registry: + Build standard pint unit registry: >>> import pint >>> ureg = pint.UnitRegistry() @@ -114,7 +115,7 @@ def moles_to_mass(ureg, Q_, basis=None, char_val=None): @u_reg.wraps(u_reg.NTU, u_reg.centimeter) def cm_to_NTU(val): - """Convert Turbidity measured in centimeters to NTU. + """Convert turbidity measured in centimeters to NTU. Parameters ---------- @@ -123,16 +124,17 @@ def cm_to_NTU(val): Returns ------- + pint.Quantity The turbidity value in NTU. Examples -------- - Build standard Pint unit registry: + Build standard pint unit registry: >>> import pint >>> ureg = pint.UnitRegistry() - Build cm units aware Quantity (already in standard unit registry): + Build cm units aware pint Quantity (already in standard unit registry): >>> turbidity = ureg.Quantity('cm') >>> str(turbidity) @@ -158,7 +160,7 @@ def cm_to_NTU(val): @u_reg.wraps(u_reg.centimeter, u_reg.NTU) def NTU_to_cm(val): - """Convert Turbidity in NTU (Nephelometric Turbidity Units) to centimeters. + """Convert turbidity in NTU (Nephelometric Turbidity Units) to centimeters. Parameters ---------- @@ -167,11 +169,12 @@ def NTU_to_cm(val): Returns ------- + pint.Quantity The turbidity value in centimeters. Examples -------- - NTU is not a standard Pint unit and must be added to a unit registry first + NTU is not a standard pint unit and must be added to a unit registry first (normally done by WQCharData.update_ureg() method): >>> import pint @@ -180,7 +183,7 @@ def NTU_to_cm(val): >>> for definition in domains.registry_adds_list('Turbidity'): ... 
ureg.define(definition) - Build NTU units aware Quantity: + Build NTU units aware pint Quantity: >>> turbidity = ureg.Quantity('NTU') >>> str(turbidity) @@ -206,23 +209,25 @@ def NTU_to_cm(val): @u_reg.wraps(u_reg.NTU, u_reg.dimensionless) def JTU_to_NTU(val): - """Convert turbidity units JTU (Jackson Turbidity Units) to NTU. + """Convert turbidity units from JTU (Jackson Turbidity Units) to NTU. - Note: this is based on linear relationship: 1 -> 19, 0.053 -> 1, 0.4 -> 7.5 + Notes + ----- + This is based on linear relationship: 1 -> 19, 0.053 -> 1, 0.4 -> 7.5 Parameters ---------- - val : pint.quantity - The turbidity value in JTU units (dimensionless). + val : pint.Quantity + The turbidity value in JTU (dimensionless). Returns ------- - NTU : pint.quantity + NTU : pint.Quantity The turbidity value in dimensionless NTU. Examples -------- - JTU is not a standard Pint unit and must be added to a unit registry first + JTU is not a standard pint unit and must be added to a unit registry first (normally done by WQCharData.update_ureg() method): >>> import pint @@ -231,7 +236,7 @@ def JTU_to_NTU(val): >>> for definition in domains.registry_adds_list('Turbidity'): ... ureg.define(definition) - Build NTU units aware Quantity: + Build JTU units aware pint Quantity: >>> turbidity = ureg.Quantity('JTU') >>> str(turbidity) @@ -256,24 +261,26 @@ def JTU_to_NTU(val): @u_reg.wraps(u_reg.NTU, u_reg.dimensionless) def SiO2_to_NTU(val): - """Convert turbidity units SiO2 (silicon dioxide) to NTU. + """Convert turbidity units from SiO2 (silicon dioxide) to NTU. - Note: this is based on a Linear relationship: 2.5 -> 19, 0.13 -> 1, 1 -> 7.5 + Notes + ----- + This is based on a linear relationship: 0.13 -> 1, 1 -> 7.5, 2.5 -> 19 Parameters ---------- - val : pint.quantity.build_quantity_class + val : pint.Quantity.build_quantity_class The turbidity value in SiO2 units (dimensionless). 
Returns ------- - NTU : pint.quantity.build_quantity_class + NTU : pint.Quantity.build_quantity_class The turbidity value in dimensionless NTU. Examples -------- - SiO2 is not a standard Pint unit and must be added to a unit registry first - (normally done by WQCharData.update_ureg() method): + SiO2 is not a standard pint unit and must be added to a unit registry first + (normally done using WQCharData.update_ureg() method): >>> import pint >>> ureg = pint.UnitRegistry() @@ -281,7 +288,7 @@ def SiO2_to_NTU(val): >>> for definition in domains.registry_adds_list('Turbidity'): ... ureg.define(definition) - Build NTU units aware Quantity: + Build SiO2 units aware pint Quantity: >>> turbidity = ureg.Quantity('SiO2') >>> str(turbidity) @@ -301,12 +308,12 @@ def SiO2_to_NTU(val): def FNU_to_NTU(val): - """Convert turbidity units FNU to NTU. + """Convert turbidity units from FNU (Formazin Nephelometric Units) to NTU. Parameters ---------- val : float - The turbidity magnitude (FNU units is dimensionless). + The turbidity magnitude (FNU is dimensionless). Returns ------- @@ -335,21 +342,21 @@ def density_to_PSU(val, Parameters ---------- - val : pint.quantity.build_quantity_class + val : pint.Quantity.build_quantity_class The salinity value in density units. - pressure : pint.quantity.build_quantity_class, optional + pressure : pint.Quantity.build_quantity_class, optional The pressure value. The default is 1*ureg("atm"). - temperature : pint.quantity.build_quantity_class, optional + temperature : pint.Quantity.build_quantity_class, optional The temperature value. The default is ureg.Quantity(25, ureg("degC")). Returns ------- - PSU : pint.quantity.build_quantity_class + PSU : pint.Quantity.build_quantity_class The salinity value in dimensionless PSU. 
Examples -------- - PSU (Practical Salinity Units) is not a standard Pint unit and must be added to a unit registry + PSU (Practical Salinity Units) is not a standard pint unit and must be added to a unit registry first (normally done by WQCharData.update_ureg() method): >>> import pint @@ -358,7 +365,7 @@ def density_to_PSU(val, >>> for definition in domains.registry_adds_list('Salinity'): ... ureg.define(definition) - Build units aware input, as string: + Build units aware pint Quantity, as string: >>> input_density = '1000 milligram / milliliter' @@ -387,8 +394,11 @@ def density_to_PSU(val, def PSU_to_density(val, pressure=1*u_reg("atm"), temperature=u_reg.Quantity(25, u_reg("degC"))): - """Convert salinity as Practical Salinity Units to density (mass/volume). + """Convert salinity as Practical Salinity Units (PSU) to density. + Dimensionality changes from dimensionless Practical Salinity Units (PSU) to + mass/volume density. + Parameters ---------- val : pint.Quantity @@ -400,13 +410,13 @@ def PSU_to_density(val, Returns ------- - density : pint.quantity.build_quantity_class + density : pint.Quantity.build_quantity_class The salinity value in density units (mg/ml). Examples -------- - PSU (Practical Salinity Units) is not a standard Pint unit and must be - added to a unit registry first (normally by WQCharData.update_ureg method): + PSU is not a standard pint unit and must be added to a unit registry first. + This can be done using the WQCharData.update_ureg method: >>> import pint >>> ureg = pint.UnitRegistry() @@ -414,7 +424,8 @@ def PSU_to_density(val, >>> for definition in domains.registry_adds_list('Salinity'): ... 
ureg.define(definition) - Build units aware input, as string because it is an altered unit registry: + Build units aware pint Quantity, as string because it is an altered unit + registry: >>> unit = ureg.Quantity('PSU') >>> unit @@ -474,14 +485,14 @@ def PSU_to_density(val, def DO_saturation(val, pressure=1*u_reg("atm"), temperature=u_reg.Quantity(25, u_reg("degC"))): - """Convert Dissolved Oxygen from saturation (%) concentration (mg/l). + """Convert Dissolved Oxygen (DO) from saturation (%) to concentration (mg/l). Defaults assume STP where pressure is 1 atmosphere and temperature 25C. Parameters ---------- - val : pint.quantity.build_quantity_class - The Dissolved Oxygen saturation value in dimensionless percent. + val : pint.Quantity.build_quantity_class + The DO saturation value in dimensionless percent. pressure : pint.Quantity, optional The pressure value. The default is 1*ureg("atm"). temperature : pint.Quantity, optional @@ -490,10 +501,13 @@ def DO_saturation(val, Returns ------- pint.Quantity - Value in mg/l. + DO value in mg/l. Examples -------- + >>> from harmonize_wq import convert + >>> convert.DO_saturation(70) + 578.3632692599999 milligram / liter """ p, t = pressure, temperature if p == 1 & (t == 25): @@ -513,24 +527,30 @@ def DO_saturation(val, def DO_concentration(val, pressure=1*u_reg("atm"), temperature=u_reg.Quantity(25, u_reg("degC"))): - """Convert Dissolved Oxygen from concentration (mg/ml) to saturation (%). + """Convert Dissolved Oxygen (DO) from concentration (mg/l) to saturation (%). Parameters ---------- - val : pint.quantity.build_quantity_class - The DO value (converted to mg/L) + val : pint.Quantity.build_quantity_class + The DO value (converted to mg/L). pressure : pint.Quantity, optional The pressure value. The default is 1*ureg("atm"). - temperature : TYPE, optional + temperature : pint.Quantity, optional The temperature value. The default is ureg.Quantity(25, ureg("degC")). 
Returns ------- float - Dissolved Oxygen as saturation (dimensionless). + Dissolved Oxygen (DO) as saturation (dimensionless). Examples -------- + Build units aware pint Quantity, as string: + + >>> input_DO = '578 mg/l' + + >>> from harmonize_wq import convert + >>> convert.DO_concentration(input_DO) """ # TODO: switch to kelvin? # https://www.waterontheweb.org/under/waterquality/oxygen.html#:~: @@ -554,11 +574,11 @@ def conductivity_to_PSU(val, Parameters ---------- - val : pint.quantity.build_quantity_class - The conductivity value (converted to microsiemens / centimeter) + val : pint.Quantity.build_quantity_class + The conductivity value (converted to microsiemens / centimeter). pressure : pint.Quantity, optional The pressure value. The default is 0*ureg("atm"). - temperature : TYPE, optional + temperature : pint.Quantity, optional The temperature value. The default is ureg.Quantity(25, ureg("degC")). Returns @@ -566,11 +586,12 @@ def conductivity_to_PSU(val, pint.Quantity Estimated salinity (PSU). - Additional Notes: - Conductivity to salinity conversion PSS 1978 method + Notes + ----- + Conductivity to salinity conversion PSS 1978 method. c-numeric conductivity in uS (microsiemens). - t-numeric Celsius temperature (defaults to 25) - P-numeric optional pressure (defaults to 0) + t-numeric Celsius temperature (defaults to 25). + P-numeric optional pressure (defaults to 0). References ---------- @@ -586,7 +607,7 @@ def conductivity_to_PSU(val, Examples -------- - PSU (Practical Salinity Units) is not a standard Pint unit and must be + PSU (Practical Salinity Units) is not a standard pint unit and must be added to a unit registry first: >>> import pint @@ -595,7 +616,7 @@ def conductivity_to_PSU(val, >>> for definition in domains.registry_adds_list('Salinity'): ... 
ureg.define(definition) - Build units aware input, as string: + Build units aware pint Quantity, as string: >>> input_conductivity = '111.0 uS/cm' diff --git a/harmonize_wq/domains.py b/harmonize_wq/domains.py index 38aec44..8925a10 100644 --- a/harmonize_wq/domains.py +++ b/harmonize_wq/domains.py @@ -72,7 +72,7 @@ # get_domain_list(field): def get_domain_dict(table, cols=None): - """Retrieve domain values for specified table. + """Get domain values for specified table. Parameters ---------- @@ -84,13 +84,13 @@ def get_domain_dict(table, cols=None): Returns ------- - dictionary + dict Dictionary where {cols[0]: cols[1]} Examples -------- - Return dict for domain from WQX table (e.g., 'ResultSampleFraction'), just - the default keys (Name) are shown as values (Description) can be long: + Return dictionary for domain from WQP table (e.g., 'ResultSampleFraction'), + The default keys ('Name') are shown as values ('Description') are long: >>> domains.get_domain_dict('ResultSampleFraction').keys() dict_keys(['Acid Soluble', 'Bed Sediment', 'Bedload', 'Bioavailable', 'Comb Available', @@ -124,11 +124,7 @@ def harmonize_TADA_dict(): Returns ------- full_dict : dict - {'TADA.CharacteristicName': - {Target.TADA.CharacteristicName: - {Target.TADA.ResultSampleFractionText: - [Target.TADA.ResultSampleFractionText]}}} - + {'TADA.CharacteristicName': {Target.TADA.CharacteristicName: {Target.TADA.ResultSampleFractionText [Target.TADA.ResultSampleFractionText]}}} """ # Note: too nested for refactor into single function w/ char_tbl_TADA @@ -166,14 +162,14 @@ def re_case(word, domain_list): Parameters ---------- word : str - Word to alter in domain_list + Word to alter in domain_list. domain_list : list - List including word + List including word. + Returns ------- str - Word from domain_list in UPPERCASE - + Word from domain_list in UPPERCASE. 
""" domain_list_upper = [x.upper() for x in domain_list] try: @@ -196,9 +192,7 @@ def char_tbl_TADA(df, char): Returns ------- new_char_dict : dict - {Target.TADA.CharacteristicName: - {Target.TADA.ResultSampleFractionText: - [Target.TADA.ResultSampleFractionText]} + {Target.TADA.CharacteristicName: {Target.TADA.ResultSampleFractionText: [Target.TADA.ResultSampleFractionText]} """ cols = ['Target.TADA.CharacteristicName', 'TADA.ResultSampleFractionText', @@ -228,23 +222,24 @@ def char_tbl_TADA(df, char): def registry_adds_list(out_col): - """Get units to add to Pint unit registry by out_column. + """Get units to add to :mod:`pint` unit registry by out_col column. - Out_column typically refers back to CharacteristicName. + Typically out_col refers back to column used for a value from the + 'CharacteristicName' column. Parameters ---------- out_col : str - The result column a unit registry is being built for + The result column a unit registry is being built for. Returns ------- list - List of strings with unit additions in expected format + List of strings with unit additions in expected format. Examples -------- - Generate a new Pint unit registry object for e.g., Sediment + Generate a new pint unit registry object for e.g., Sediment: >>> from harmonize_wq import domains >>> domains.registry_adds_list('Sediment') @@ -294,13 +289,13 @@ def registry_adds_list(out_col): def bacteria_reg(ureg=None): - """Generate standard pint unit registry with bacteria units defined. + """Generate :class:`pint.UnitRegistry` with bacteria units defined. Parameters ---------- ureg : pint.UnitRegistry, optional Unit Registry Object with any custom units defined. Default None - starts with new unit registry + starts with new unit registry. 
Returns ------- @@ -309,7 +304,7 @@ def bacteria_reg(ureg=None): Examples -------- - Generate a new Pint unit registry object for e.g., bacteria + Generate a new pint UnitRegistry for e.g., bacteria: >>> domains.bacteria_reg() @@ -323,8 +318,8 @@ def bacteria_reg(ureg=None): def out_col_lookup(): """Get {CharacteristicName: out_column_name}. - This is often subset and used to write results to a new column based on - CharacteristicName. + This is often subset and used to write results to a new column from the + 'CharacteristicName' column. Returns ------- @@ -333,9 +328,9 @@ def out_col_lookup(): Examples -------- - The function returns the full dict {CharacteristicName: out_column_name}, - it can be subset by a CharactisticName to get the name of the column for - results. + The function returns the full dictionary {CharacteristicName: out_column_name}. + It can be subset by a 'CharacteristicName' column value to get the name of + the column for results: >>> domains.out_col_lookup()['Escherichia coli'] 'E_coli' @@ -376,8 +371,8 @@ def characteristic_cols(category=None): Examples -------- - Running the function without a category returns the full list of column names, including a - category returns only the columns in that category + Running the function without a category returns the full list of column + names; including a category returns only the columns in that category: >>> domains.characteristic_cols('QA') ['ResultDetectionConditionText', @@ -541,12 +536,13 @@ def xy_datum(): """Get dictionary of expected horizontal datums. The structure has {key as expected string: value as {"Description": string - (Not currently used) and "EPSG": int (4-digit code)}. + and "EPSG": integer (4-digit code)}. Notes ----- - source URL: f'{BASE_URL}HorizontalCoordinateReferenceSystemDatum_CSV.zip' - Anything not in dict will be nan, i.e.
must be int so these are missing: + source WQP: HorizontalCoordinateReferenceSystemDatum_CSV.zip + + Anything not in dict will be nan, and non-integer EPSG will be missing: "OTHER": {"Description": 'Other', "EPSG": nan}, "UNKWN": {"Description": 'Unknown', "EPSG": nan} @@ -559,8 +555,8 @@ def xy_datum(): Examples -------- - Running the function returns the full {abbreviation: {Description:values, - EPSG:values}}, here we show how the abbreviation can be used as a key to + Running the function returns the full dictionary with {abbreviation: + {'Description':values, 'EPSG':values}}. The abbreviation key can be used to get the EPSG code: >>> domains.xy_datum()['NAD83'] @@ -613,7 +609,7 @@ def stations_rename(): Returns ------- field_mapping : dict - dictionary where key = WQP field name and value = short name for .shp. + Dictionary where key = WQP field name and value = short name for .shp. Examples -------- @@ -668,8 +664,10 @@ def stations_rename(): def accepted_methods(): """Get accepted methods for each characteristic. - Note: Source should be in 'ResultAnalyticalMethod/MethodIdentifierContext' - This is not fully implemented + Notes + ----- + Source should be in 'ResultAnalyticalMethod/MethodIdentifierContext' + column. This is not fully implemented. Returns ------- diff --git a/harmonize_wq/harmonize.py b/harmonize_wq/harmonize.py index 91eb005..818efd0 100644 --- a/harmonize_wq/harmonize.py +++ b/harmonize_wq/harmonize.py @@ -19,7 +19,7 @@ def df_checks(df_in, columns=None): DataFrame that will be checked. columns : list, optional List of strings for column names. Default None, uses: - 'ResultMeasure/MeasureUnitCode','ResultMeasureValue','CharacteristicName' + 'ResultMeasure/MeasureUnitCode','ResultMeasureValue','CharacteristicName'. Examples -------- @@ -71,7 +71,7 @@ def convert_unit_series(quantity_series, unit_series, units, ureg=None, errors=' """Convert quantities to consistent units. 
Convert list of quantities (quantity_list), each with a specified old unit, - to a quantity in units using pint constructor method. + to a quantity in units using :mod:`pint` constructor method. Parameters ---------- @@ -83,7 +83,7 @@ def convert_unit_series(quantity_series, unit_series, units, ureg=None, errors=' units : str Desired units. ureg : pint.UnitRegistry, optional - Unit Registry Object with any custom units defined. The default is None + Unit Registry Object with any custom units defined. The default is None. errors : str, optional Values of ‘ignore’, ‘raise’, or ‘skip’. The default is ‘raise’. If ‘raise’, invalid dimension conversions will raise an exception. @@ -103,7 +103,7 @@ def convert_unit_series(quantity_series, unit_series, units, ureg=None, errors=' >>> quantity_series = Series([1, 10]) >>> unit_series = Series(['mg/l', 'mg/ml',]) - Convert series to series of pint objects in 'mg/l' + Convert series to series of pint Quantity objects in 'mg/l': >>> from harmonize_wq import harmonize >>> harmonize.convert_unit_series(quantity_series, unit_series, units = 'mg/l') @@ -145,7 +145,7 @@ def convert_unit_series(quantity_series, unit_series, units, ureg=None, errors=' def add_qa_flag(df_in, mask, flag): - """Add flag to "QA_field" column in df_in. + """Add flag to 'QA_flag' column in df_in. Parameters ---------- @@ -174,7 +174,7 @@ def add_qa_flag(df_in, mask, flag): 1 Phosphorus 0.265 2 Carbon 2.1 - Assign simple flag string and mask to assign flag only to Carbon + Assign simple flag string and mask to assign flag only to Carbon: >>> flag = 'words' >>> mask = df['CharacteristicName']=='Carbon' @@ -211,7 +211,8 @@ def units_dimension(series_in, units, ureg=None): units : str Desired units. ureg : pint.UnitRegistry, optional - Unit Registry Object with any custom units defined. The default is None + Unit Registry Object with any custom units defined. + The default is None. 
Returns ------- @@ -230,7 +231,7 @@ def units_dimension(series_in, units, ureg=None): 2 g/kg dtype: object - Get list of unique units not in desired units dimension 'mg/l' + Get list of unique units not in desired units dimension 'mg/l': >>> from harmonize_wq import harmonize >>> harmonize.units_dimension(unit_series, units='mg/l') @@ -249,19 +250,21 @@ def units_dimension(series_in, units, ureg=None): def dissolved_oxygen(wqp): - """Standardize 'Dissolved oxygen (DO)' characteristic. + """Standardize 'Dissolved Oxygen (DO)' characteristic. - Uses and returns WQP Characteristic Info Object. + Uses :class:`wq_data.WQCharData` to check units, check unit + dimensionality and perform appropriate unit conversions. Parameters ---------- - wqp : WQCharData Object - WQP Characteristic Info Object. + wqp : wq_data.WQCharData + WQP Characteristic Info Object to check units, check unit + dimensionality and perform appropriate unit conversions. Returns ------- - wqp : WQP Characteristic Info Object. - WQP Characteristic Info Object with updated attributes + wqp : wq_data.WQCharData + WQP Characteristic Info Object with updated attributes. """ wqp.check_units() # Replace know problem units, fix and flag missing units @@ -281,19 +284,23 @@ def dissolved_oxygen(wqp): def salinity(wqp): """Standardize 'Salinity' characteristic. - Uses and returns WQP Characteristic Info Object. + Uses :class:`wq_data.WQCharData` to check basis, check units, check unit + dimensionality and perform appropriate unit conversions. - Note: PSU=PSS=ppth and 'ppt' is picopint in pint so it is changed to 'ppth' + Notes + ----- + PSU=PSS=ppth and 'ppt' is picopint in :mod:`pint` so it is changed to + 'ppth'. Parameters ---------- - wqp : WQCharData Object + wqp : wq_data.WQCharData WQP Characteristic Info Object. Returns ------- - wqp : WQP Characteristic Info Object. - WQP Characteristic Info Object with updated attributes + wqp : wq_data.WQCharData + WQP Characteristic Info Object with updated attributes. 
""" wqp.check_basis(basis_col='ResultTemperatureBasisText') # Moves '@25C' out wqp.check_units() # Replace know problem units, fix and flag missing units @@ -318,44 +325,46 @@ def salinity(wqp): def turbidity(wqp): """Standardize 'Turbidity' characteristic. - Uses and returns WQP Characteristic Info Object. + Uses :class:`wq_data.WQCharData` to check units, check unit + dimensionality and perform appropriate unit conversions - See USGS Report Chapter A6. Section 6.7. Turbidity - r"https://pubs.usgs.gov/twri/twri9a6/twri9a67/twri9a_Section6.7_v2.1.pdf" + See `USGS Report Chapter A6. Section 6.7. Turbidity + ` See ASTM D\315-17 for equivalent unit definitions: - 'NTU' - 400-680nm (EPA 180.1), range 0.0-40 - 'NTRU' - 400-680nm (2130B), range 0-10,000 - 'NTMU' - 400-680nm - 'FNU' - 780-900nm (ISO 7027), range 0-1000 - 'FNRU' - 780-900nm (ISO 7027), range 0-10,000 - 'FAU' - 780-900nm, range 20-1000 + 'NTU' - 400-680nm (EPA 180.1), range 0.0-40. + 'NTRU' - 400-680nm (2130B), range 0-10,000. + 'NTMU' - 400-680nm. + 'FNU' - 780-900nm (ISO 7027), range 0-1000. + 'FNRU' - 780-900nm (ISO 7027), range 0-10,000. + 'FAU' - 780-900nm, range 20-1000. Older methods: 'FTU' - lacks instrumentation specificity 'SiO2' (ppm or mg/l) - concentration of calibration standard (=JTU) 'JTU' - candle instead of formazin standard, near 40 NTU these may be - equivalent, but highly variable + equivalent, but highly variable. Conversions used: - cm <-> NTU see convert.cm_to_NTU() - r"https://extension.usu.edu/utahwaterwatch/monitoring/field-instructions/" - - Alternative conversions not currently used by default: - convert.FNU_to_NTU from Gohin (2011) Ocean Sci., 7, 705–732 - r"https://doi.org/10.5194/os-7-705-2011" - convert.SiO2_to_NTU linear relation from Otilia et al. 2013 - convert.JTU_to_NTU linear relation from Otilia et al. 
2013 + cm <-> NTU see :func:`convert.cm_to_NTU` from + `USU ` + + Alternative conversions available but not currently used by default: + :func:`convert.FNU_to_NTU` from Gohin (2011) Ocean Sci., 7, 705–732 + ``. + :func:`convert.SiO2_to_NTU` linear relation from Otilia et al. 2013. + :func:`convert.JTU_to_NTU` linear relation from Otilia et al. 2013. + Otilia, Rusănescu Carmen, Rusănescu Marin, and Stoica Dorel. - "MONITORING OF PHYSICAL INDICATORS IN WATER SAMPLES." - r"https://hidraulica.fluidas.ro/2013/nr_2/84_89.pdf" + MONITORING OF PHYSICAL INDICATORS IN WATER SAMPLES. + ``. Parameters ---------- - wqp : WQCharData Object + wqp : wq_data.WQCharData WQP Characteristic Info Object. Returns ------- - wqp : WQP Characteristic Info Object. - WQP Characteristic Info Object with updated attributes + wqp : wq_data.WQCharData + WQP Characteristic Info Object with updated attributes. """ #These units exist but have not been encountered yet #formazin nephelometric multibeam unit (FNMU); @@ -391,17 +400,18 @@ def turbidity(wqp): def sediment(wqp): """Standardize 'Sediment' characteristic. - Uses and returns WQP Characteristic Info Object. + Uses :class:`wq_data.WQCharData` to check basis, check units, and check + unit dimensionality. Parameters ---------- - wqp : WQCharData Object + wqp : wq_data.WQCharData WQP Characteristic Info Object. Returns ------- - wqp : WQP Characteristic Info Object. - WQP Characteristic Info Object with updated attributes + wqp : wq_data.WQCharData + WQP Characteristic Info Object with updated attributes. """ #'< 0.0625 mm', < 0.125 mm, < 0.25 mm, < 0.5 mm, < 1 mm, < 2 mm, < 4 mm wqp.check_basis(basis_col='ResultParticleSizeBasisText') @@ -419,15 +429,17 @@ def sediment(wqp): def harmonize_all(df_in, errors='raise'): - """Harmonization all 'CharacteristicNames' with existing functions. + """Harmonizes all 'CharacteristicNames' column values with methods. All results are standardized to default units. Intermediate columns are - not retained. 
+ not retained. See :func:`domains.out_col_lookup` for list of values with + methods. Parameters ---------- df_in : pandas.DataFrame - DataFrame with the expected columns. + DataFrame with the expected columns (changes based on values in + 'CharacteristicNames' column). errors : str, optional Values of ‘ignore’, ‘raise’, or ‘skip’. The default is ‘raise’. If ‘raise’, invalid dimension conversions will raise an exception. @@ -437,11 +449,13 @@ def harmonize_all(df_in, errors='raise'): Returns ------- df : pandas.DataFrame - Updated copy of df_in + Updated copy of df_in. Examples -------- - Build example table from tests to use in place of Water Quality Portal query response + Build example df_in table from harmonize_wq tests to use in place of Water + Quality Portal query response. This table has 'Temperature, water' and + 'Phosphorus' results: >>> import pandas >>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests' @@ -467,6 +481,8 @@ def harmonize_all(df_in, errors='raise'): [359505 rows x 42 columns] + List columns that were added: + >>> df_result.columns[-7:] Index(['QA_flag', 'Phosphorus', 'Speciation', 'TP_Phosphorus', 'TDP_Phosphorus', 'Other_Phosphorus', 'Temperature'], @@ -475,9 +491,9 @@ def harmonize_all(df_in, errors='raise'): See Also -------- See any of the 'Simple' notebooks found in - :ref:'demos' for - examples of how this function is used to standardize, clean and wrangle a Water Quality Portal - query response. + 'demos' for + examples of how this function is used to standardize, clean, and wrangle a + Water Quality Portal query response. """ df_out = df_in.copy() @@ -490,16 +506,21 @@ def harmonize_all(df_in, errors='raise'): def harmonize_generic(df_in, char_val, units_out=None, errors='raise', intermediate_columns=False, report=False): - """Harmonize a given char_val using the appropriate function. + """Harmonize char_val rows based on methods specific to that char_val.
+ All rows where the value in the 'CharacteristicName' column matches + char_val will have their results harmonized based on available methods for + that char_val. + Parameters ---------- df_in : pandas.DataFrame - DataFrame with the expected activity date time columns. + DataFrame with the expected columns (changes based on char_val). char_val : str - Expected 'CharacteristicName'. + Target value in 'CharacteristicName' column. units_out : str, optional - Desired units to convert values into. The default is None. + Desired units to convert results into. + The default, None, uses the constant domains.OUT_UNITS. errors : str, optional Values of ‘ignore’, ‘raise’, or ‘skip’. The default is ‘raise’. If ‘raise’, then invalid dimension conversions will raise an exception. @@ -513,12 +534,14 @@ def harmonize_generic(df_in, char_val, units_out=None, errors='raise', Returns ------- df : pandas.DataFrame - Updated copy of df_in + Updated copy of df_in. Examples -------- - Build example table from tests to use in place of Water Quality Portal query response - + Build example df_in table from harmonize_wq tests to use in place of Water + Quality Portal query response. This table has 'Temperature, water' and + 'Phosphorus' results: + >>> import pandas >>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests' >>> df1 = pandas.read_csv(tests_url + '/data/wqp_results.txt') @@ -543,15 +566,17 @@ def harmonize_generic(df_in, char_val, units_out=None, errors='raise', [359505 rows x 37 columns] + List columns that were added: + >>> df_result.columns[-2:] Index(['QA_flag', 'Temperature'], dtype='object') See Also -------- See any of the 'Detailed' notebooks found in - :ref:'demos' for - examples of how this function is used to standardize, clean and wrangle a Water Quality Portal - query response, one CharacteristicName at a time.
+ 'demos' for examples + of how this function is used to standardize, clean, and wrangle a Water + Quality Portal query response, one 'CharacteristicName' value at a time. """ # Check/retrieve standard attributes and df columns as object wqp = WQCharData(df_in, char_val) diff --git a/harmonize_wq/location.py b/harmonize_wq/location.py index d292da0..7422e88 100644 --- a/harmonize_wq/location.py +++ b/harmonize_wq/location.py @@ -18,7 +18,7 @@ def infer_CRS(df_in, crs_col='HorizontalCoordinateReferenceSystemDatumName'): """Replace missing or unrecognized Coordinate Reference System (CRS). - Replaces with desired CRS and adds QA_flag about it. + Replaces with desired CRS and notes it was missing in 'QA_flag' column. Parameters ---------- @@ -38,12 +38,12 @@ def infer_CRS(df_in, Returns ------- df_out : pandas.DataFrame - Updated copy of df_in + Updated copy of df_in. Examples -------- Build pandas DataFrame to use in example, where crs_col name is 'Datum' - rather than default 'HorizontalCoordinateReferenceSystemDatumName' + rather than default 'HorizontalCoordinateReferenceSystemDatumName': >>> from numpy import nan >>> df_in = pandas.DataFrame({'Datum': ['NAD83', 'WGS84', '', None, nan]}) @@ -64,8 +64,8 @@ def infer_CRS(df_in, 3 None Datum: MISSING datum, EPSG:4326 assumed 4326.0 4 NaN Datum: MISSING datum, EPSG:4326 assumed 4326.0 - Note: missing (nan) and bad CRS values (bad_crs_val=None) are given an EPSG - and QA flag, but others (e.g., '') are not. + NOTE: missing (NaN) and bad CRS values (bad_crs_val=None) are given an EPSG + and noted in the 'QA_flag' column. """ df_out = df_in.copy() if bad_crs_val: @@ -89,19 +89,23 @@ def harmonize_locations(df_in, out_EPSG=4326, """Create harmonized geopandas GeoDataframe from pandas DataFrame.
Takes a :class:`~pandas.DataFrame` with lat/lon in multiple Coordinate - Reference Systems, transforms them to outCRS and converts to - :class:`geopandas.GeoDataFrame` + Reference Systems (CRS), transforms them to out_EPSG CRS, and converts to + :class:`geopandas.GeoDataFrame`. A 'QA_flag' column is added to the result + and populated for any row that has location based problems like limited + decimal precision or an unknown input CRS. Parameters ---------- df_in : pandas.DataFrame - DataFrame with the required columns to be converted to GeoDataFrame. + DataFrame with the required columns (see kwargs for expected defaults) + to be converted to GeoDataFrame. out_EPSG : int, optional EPSG factory code for desired output Coordinate Reference System datum. The default is 4326, for the WGS84 Datum used by WQP queries. intermediate_columns : Boolean, optional Return intermediate columns. Default 'False' does not return these. - Keyword Arguments: + **kwargs: optional + Accepts crs_col, lat_col, and lon_col parameters if non-default: crs_col : str, optional Name of column in DataFrame with the Coordinate Reference System datum. The default is 'HorizontalCoordinateReferenceSystemDatumName'. @@ -119,7 +123,7 @@ def harmonize_locations(df_in, out_EPSG=4326, Examples -------- - Build pandas DataFrame to use in example + Build pandas DataFrame to use in example: >>> df_in = pandas.DataFrame({'LatitudeMeasure': [27.5950355, ... 27.52183, @@ -146,12 +150,7 @@ def harmonize_locations(df_in, out_EPSG=4326, 1 27.521830 -82.644760 ... NaN POINT (-82.64476 27.52183) 2 28.066111 -82.377500 ... NaN POINT (-82.37750 28.06611) - [3 rows x 5 columns] - - Note: both geometries where the CRS was not the default 4326 (WGS1984) have - been transformed in the geometry column, a QA_flag column was also added to - record any location based problems like limited decimal precision or an - unknown input CRS. 
+ [3 rows x 5 columns] """ df2 = df_in.copy() @@ -199,21 +198,21 @@ def harmonize_locations(df_in, out_EPSG=4326, def transform_vector_of_points(df_in, datum, out_EPSG): - """Transform points by vector (sub-set by datum). + """Transform points by vector (sub-sets points by EPSG==datum). Parameters ---------- df_in : pandas.DataFrame DataFrame that will be updated. - datum : TYPE - DESCRIPTION. + datum : int + Current datum (EPSG code) to transform. out_EPSG : int EPSG factory code for desired output Coordinate Reference System datum. Returns ------- df : pandas.DataFrame - Updated copy of df_in + Updated copy of df_in. """ # Create transform object for input datum (EPSG colum) and out_EPSG transformer = Transformer.from_crs(datum, out_EPSG) @@ -230,33 +229,34 @@ def transform_vector_of_points(df_in, datum, out_EPSG): def get_harmonized_stations(query, aoi=None): """Query, harmonize and clip stations. - Queries the Water Quality Portal (https://waterquality.data.us) for - stations with data matching the query, harmonizes those stations location - information and clips it to the Area Of Interest (AOI) if specified. + Queries the `Water Quality Portal `_ for + stations with data matching the query, harmonizes those stations' location + information, and clips it to the area of interest (aoi) if specified. - See www.waterqualitydata.us/webservices_documentation for API reference + See ``_ for API + reference. Parameters ---------- query : dict - Water Quality Portal query as dictionary + Water Quality Portal query as dictionary. aoi : geopandas.GeoDataFrame, optional Area of interest to clip stations to. The default None returns all stations in the query extent. Returns ------- - stations_gdf : geopandas.GeoDataFrame + stations_gdf : ``geopandas.GeoDataFrame`` Harmonized stations. - stations : pandas.DataFrame + stations : ``pandas.DataFrame`` Raw station results from WQP. - site_md : TYPE - WQP query metadata. 
+ site_md : ``dataretrieval.utils.Metadata`` + Custom ``dataretrieval`` metadata object pertaining to the WQP query. Examples -------- See any of the 'Simple' notebooks found in - :ref:'demos' for + `demos`_ for examples of how this function is used to query and harmonize stations. """ diff --git a/harmonize_wq/visualize.py b/harmonize_wq/visualize.py index 0b5f469..260c35c 100644 --- a/harmonize_wq/visualize.py +++ b/harmonize_wq/visualize.py @@ -11,7 +11,7 @@ def print_report(results_in, out_col, unit_col_in, threshold=None): Parameters ---------- - results_in : pandas.Dataframe + results_in : pandas.DataFrame DataFrame with subset of results. out_col : str Name of column in results_in with final result. @@ -27,9 +27,9 @@ def print_report(results_in, out_col, unit_col_in, threshold=None): See Also -------- See any of the 'Detailed' notebooks found in - :ref:'demos' for + `demos`_ for examples of how this function is leveraged by the - harmonize.harmonize_generic() report argument. + :func:`harmonize.harmonize_generic` report argument. """ # Series with just usable results. @@ -73,8 +73,9 @@ def map_counts(df_in, gdf, col=None): DataFrame with subset of results. gdf : geopandas.GeoDataFrame GeoDataFrame with monitoring locations. - col : str + col : str, optional Column in df_in to aggregate results to in addition to location. + The default is None, where results are only aggregated on location. Returns ------- @@ -159,11 +160,11 @@ def map_measure(df_in, gdf, col): Returns ------- geopandas.GeoDataFrame - GeoDataFrame with average value of results for each station + GeoDataFrame with average value of results for each station. Examples -------- - Build array of pint quantities for Temperature: + Build array of pint Quantity for Temperature: >>> from pint import Quantity >>> u = 'degree_Celsius' @@ -229,19 +230,19 @@ def station_summary(df_in, col): """Get summary table for stations.
Summary table as :class:`~pandas.DataFrame` with rows for each - station, count and column average. + station, count, and column average. Parameters ---------- df_in : pandas.DataFrame - DataFrame with subset of results. + DataFrame with results to summarize. col : str Column name in df_in to summarize results for. Returns ------- pandas.DataFrame - + Table with result count and average summarized by station. """ # Column for station loc_id = 'MonitoringLocationIdentifier' diff --git a/harmonize_wq/wq_data.py b/harmonize_wq/wq_data.py index 505f75e..9d51e78 100644 --- a/harmonize_wq/wq_data.py +++ b/harmonize_wq/wq_data.py @@ -18,7 +18,7 @@ class WQCharData(): df_in : pandas.DataFrame DataFrame that will be updated. char_val : str - Expected CharacteristicName. + Expected value in 'CharacteristicName' column. Attributes ---------- @@ -27,15 +27,15 @@ class WQCharData(): c_mask : pandas.Series Row conditional (bool) mask to limit df rows to only those for the specific characteristic. - col : SimpleNamespace - Standard df column names for unit_in, unit_out, and measure. + col : types.SimpleNamespace + Standard WQCharData.df column names for unit_in, unit_out, and measure. out_col : str Column name in df for results, set using char_val. - ureg = pint.UnitRegistry() - Pint unit registry, starts set to standard unit registry. - units: str - Units all results in out_col will be converted into. Default units are - returned from domains.OUT_UNITS[out_col]. + ureg : pint.UnitRegistry + pint unit registry, initially standard unit registry. + units : str + Units all results in out_col column will be converted into. + Default units are returned from :func:`domains.OUT_UNITS`[out_col]. Examples -------- @@ -261,9 +261,13 @@ def check_units(self, flag_col=None): Parameters ---------- flag_col : str, optional - Column to reference in QA_flags. - The default None uses WQCharData.col.unit_out instead. - + Column to reference in string for 'QA_flag'.
+ The default None uses WQCharData.col.unit_out attribute. + + Returns + ------- + None. + Examples -------- Build DataFrame to use as input: @@ -300,7 +304,8 @@ def check_units(self, flag_col=None): 1 Temperature, water NaN NaN 2 Phosphorus mg/l ResultMeasure/MeasureUnitCode: 'Unknown' UNDEF... - Note: it didn't infer units for 'Temperature, water' because wq is Phosphorus specific + Note: it didn't infer units for 'Temperature, water' because wq is + Phosphorus specific. """ # Replace unit by dict using domain self.replace_unit_by_dict(domains.UNITS_REPLACE[self.out_col]) @@ -336,7 +341,11 @@ def check_basis(self, basis_col='MethodSpecificationName'): ---------- basis_col : str, optional Basis column name. Default is 'MethodSpecificationName' which is - replaced by 'Speciation', others are updated in place. + replaced by 'Speciation'. Other columns are updated in place. + + Returns + ------- + None. Examples -------- @@ -375,7 +384,8 @@ def check_basis(self, basis_col='MethodSpecificationName'): 1 NaN NaN 2 NaN PO4 - Note where basis was part of ResultMeasure/MeasureUnitCode it has been removed in Units: + Note where basis was part of 'ResultMeasure/MeasureUnitCode' it has + been removed in 'Units': >>> wq.df.iloc[0] CharacteristicName Phosphorus @@ -437,13 +447,17 @@ def update_ureg(self): def update_units(self, units_out): """Update class units attribute to convert everything into. - Note: It does not perform the conversion. + This just updates the attribute, it does not perform the conversion. Parameters ---------- units_out : str Units to convert results into. - + + Returns + ------- + None. + Examples -------- Build WQ Characteristic Data class: @@ -468,8 +482,13 @@ def measure_mask(self, column=None): Parameters ---------- column : str, optional - DataFrame column name to use. Default None uses self.out_col - + DataFrame column name to use. Default None uses WQCharData.out_col + attribute. + + Returns + ------- + None. 
+ Examples -------- >>> from harmonize_wq import wq_data @@ -499,7 +518,11 @@ def convert_units(self, default_unit=None, errors='raise'): If ‘raise’, invalid dimension conversions will raise an exception. If ‘skip’, invalid dimension conversions will not be converted. If ‘ignore’, invalid dimension conversions will be NaN. - + + Returns + ------- + None. + Examples -------- Build pandas DataFrame to use as input: @@ -554,6 +577,10 @@ def apply_conversion(self, convert_fun, unit, u_mask=None): Mask to use to identify what is being converted. The default is None, creating a unit mask based on unit. + Returns + ------- + None. + Examples -------- Build pandas DataFrame to use as input: @@ -600,13 +627,13 @@ def apply_conversion(self, convert_fun, unit, u_mask=None): self.df = df_out def dimensions_list(self, m_mask=None): - """Get list of unique dimensions. + """Get list of unique unit dimensions. Parameters ---------- m_mask : pandas.Series, optional Conditional mask to limit rows. - The default None, uses measure_mask(). + The default None, uses :meth:`measure_mask`. Returns ------- @@ -643,14 +670,14 @@ def dimensions_list(self, m_mask=None): self.ureg) def replace_unit_str(self, old, new, mask=None): - """Replace ALL instances of old str with new str in units. + """Replace ALL instances of old with new in WQCharData.col.unit_out column. Parameters ---------- old : str - sub-string to find and replace + Sub-string to find and replace. new : str - sub-string to replace old sub-string + Sub-string to replace old sub-string. mask : pandas.Series, optional Conditional mask to limit rows. The default None, uses the c_mask attribute. @@ -703,7 +730,11 @@ def replace_unit_by_dict(self, val_dict, mask=None): mask : pandas.Series, optional Conditional mask to limit rows. The default None, uses the c_mask attribute. - + + Returns + ------- + None.
+ Examples -------- Build pandas DataFrame to use as input: @@ -741,7 +772,7 @@ def replace_unit_by_dict(self, val_dict, mask=None): def fraction(self, frac_dict=None, suffix=None, fract_col='ResultSampleFractionText'): - """Create columns for sample fractions, use frac_dict to set names. + """Create columns for sample fractions using frac_dict to set names. Parameters ---------- @@ -758,11 +789,11 @@ def fraction(self, frac_dict=None, suffix=None, Returns ------- frac_dict : dict - frac_dict updated to include any frac_col not already defined. + frac_dict updated to include any fract_col not already defined. Examples -------- - Not fully implemented with TADA table yet + Not fully implemented with TADA table yet. """ c_mask = self.c_mask if suffix is None: @@ -850,15 +881,21 @@ def fraction(self, frac_dict=None, suffix=None, def dimension_fixes(self): """ Input/output for dimension handling. + + Result dictionary key is old_unit and value is equation to get it into + the desired dimension. Result list has substance to include as part of + unit. - Note: this is done one dimension at a time, except for mole - conversions which are further divided by basis (one at a time) + Notes + ----- + These are next processed interactively, one dimension at a time, except + for mole conversions which are further split by basis (one at a time). Returns ------- - dimension_dict : dict + dimension_dict : ``dict`` Dictionary with old_unit:new_unit. - mol_list : list + mol_list : ``list`` List of Mole (substance) units. Examples @@ -920,7 +957,11 @@ def moles_convert(self, mol_list): ---------- mol_list : list List of Mole (substance) units. - + + Returns + ------- + None. 
+ Examples -------- Build pandas DataFrame to use as input: diff --git a/harmonize_wq/wrangle.py b/harmonize_wq/wrangle.py index 8f4304c..ea10f2a 100644 --- a/harmonize_wq/wrangle.py +++ b/harmonize_wq/wrangle.py @@ -14,7 +14,10 @@ def split_table(df_in): Splits :class:`pandas.DataFrame` in two, one with main results columns and one with Characteristic based metadata. - Note: runs datetime() and harmonize_depth() if expected columns are missing + Notes + ----- + Runs :func:`clean.datetime` and :func:`clean.harmonize_depth` if expected + columns ('Activity_datetime' and 'Depth') are missing. Parameters ---------- @@ -31,7 +34,7 @@ def split_table(df_in): Examples -------- See any of the 'Simple' notebooks found in - :ref:'demos' for + `demos`_ for examples of how this function is used to divide the table into columns of interest (main_df) and characteristic specific metadata (chars_df). @@ -53,8 +56,11 @@ def split_table(df_in): def split_col(df_in, result_col='QA_flag', col_prefix='QA'): - """Split column so that each value is in a characteristic specific column. + """Move each row value from a column to a characteristic specific column. + Values are moved from the result_col in df_in to a new column where the + column name is col_prefix + characteristic. + Parameters ---------- df_in : pandas.DataFrame @@ -72,7 +78,7 @@ def split_col(df_in, result_col='QA_flag', col_prefix='QA'): Examples -------- See any of the 'Simple' notebooks found in - :ref:'demos' for + `demos`_ for examples of how this function is used to split the QA column into multiple characteristic specific QA columns. @@ -152,7 +158,7 @@ def collapse_results(df_in, cols=None): Examples -------- See any of the 'Simple' notebooks found in - :ref:'demos' for + `demos`_ for examples of how this function is used to combine rows with the same sample organization, activity, location, and datetime. 
@@ -253,7 +259,7 @@ def get_activities_by_loc(characteristic_names, locations): Examples -------- - See wrangle.add_activities_to_df() + See :func:`wrangle.add_activities_to_df` """ # Split loc_list as query by list may cause the query url to be too long seg = 200 # Max length of each segment @@ -277,7 +283,8 @@ def add_activities_to_df(df_in, mask=None): df_in : pandas.DataFrame DataFrame that will be updated. mask : pandas.Series - Row conditional mask to sub-set rows to get activities for + Row conditional mask to sub-set rows to get activities for. + The default None, uses the entire set. Returns ------- @@ -286,7 +293,9 @@ def add_activities_to_df(df_in, mask=None): Examples -------- - Build example tables from tests + Build example df_in table from harmonize_wq tests to use in place of Water + Quality Portal query response, this table has 'Temperature, water' and + 'Phosphorous' results: >>> import pandas >>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests' @@ -299,7 +308,7 @@ def add_activities_to_df(df_in, mask=None): >>> df_activities.shape (359505, 100) - Look at the columns added + Look at the columns added: >>> df_activities.columns[-65:] Index(['ActivityTypeCode', 'ActivityMediaName', 'ActivityMediaSubdivisionName', @@ -376,7 +385,7 @@ def add_detection(df_in, char_val): df_in : pandas.DataFrame DataFrame that will be updated. char_val : str - Specific characteristic name to apply to + Specific characteristic name to apply to. 
Returns ------- @@ -385,7 +394,9 @@ def add_detection(df_in, char_val): Examples -------- - Build example tables from tests + Build example df_in table from harmonize_wq tests to use in place of Water + Quality Portal query response, this table has 'Temperature, water' and + 'Phosphorous' results: >>> import pandas >>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests' @@ -401,7 +412,7 @@ def add_detection(df_in, char_val): Note: the additional rows are due to one result being able to be assigned multiple detection results. - Look at the columns added + Look at the columns added: >>> df_detects.columns[-3:] Index(['DetectionQuantitationLimitTypeName', @@ -427,13 +438,16 @@ def add_detection(df_in, char_val): def get_detection_by_loc(loc_series, result_id_series, char_val=None): """Get detection quantitation by location and characteristic (optional). - Retrieves detection quantitation results by location, and characteristic - name (Optional). ResultIdentifier can not be used to search, location id is - used instead and then results are limited by ResultIdentifiers. + Retrieves detection quantitation results by location and characteristic + name (optional). ResultIdentifier can not be used to search. Instead + location id from loc_series is used and then results are limited by + ResultIdentifiers from result_id_series. - NOTES: There can be multiple Result Detection Quantitation limits / result. - A result may have a ResultIdentifier without any corresponding data - in the Detection Quantitation limits table (NaN in return). + Notes + ----- + There can be multiple Result Detection Quantitation limits / result. + A result may have a ResultIdentifier without any corresponding data in the + Detection Quantitation limits table (NaN in return). Parameters ---------- @@ -443,7 +457,7 @@ def get_detection_by_loc(loc_series, result_id_series, char_val=None): Series of result IDs to limit retrieved data. char_val : str, optional. 
Specific characteristic name to retrieve detection limits for. - The default None, uses all CharacteristicNames + The default None, uses all 'CharacteristicName' values returned. Returns ------- @@ -483,7 +497,7 @@ def merge_tables(df1, df2, df2_cols='all', merge_cols='activity'): df1 : pandas.DataFrame DataFrame that will be updated. df2 : pandas.DataFrame - DataFrame with new columns (df2_cols) that will be added to df_in. + DataFrame with new columns (df2_cols) that will be added to df1. df2_cols : str, optional Columns in df2 to add to df1. The default is 'all', for all columns not already in df1. @@ -494,11 +508,12 @@ def merge_tables(df1, df2, df2_cols='all', merge_cols='activity'): Returns ------- merged_results : pandas.DataFrame - Updated copy of df_in. + Updated copy of df1. Examples -------- - Build example tables from tests + Build example table from harmonize_wq tests to use in place of Water + Quality Portal query responses: >>> import pandas >>> tests_url = 'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests' @@ -574,8 +589,9 @@ def as_gdf(shp): Examples -------- - Use area of interest from tests GeoJSON: - + Use area of interest GeoJSON for Pensacola and Perdido Bays, FL from + harmonize_wq tests: + >>> from harmonize_wq import wrangle >>> aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson' >>> type(wrangle.as_gdf(aoi_url)) @@ -603,7 +619,8 @@ def get_bounding_box(shp, idx=None): Examples -------- - Use area of interest from tests GeoJSON: + Use area of interest GeoJSON for Pensacola and Perdido Bays, FL from + harmonize_wq tests: >>> from harmonize_wq import wrangle >>> aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson' @@ -625,12 +642,14 @@ def get_bounding_box(shp, idx=None): def clip_stations(stations, aoi): - """ - Clip stations to area of interest (aoi). 
+ """Clip stations to area of interest (aoi). + + Locations and results are queried by extent rather than the exact geometry. + Clipping by the exact geometry helps reduce the size of the results. Notes ----- - aoi is first transformed to stations CRS. + aoi is first transformed to CRS of stations. Parameters ---------- @@ -646,10 +665,7 @@ def clip_stations(stations, aoi): Examples -------- - Locations and results are queried by extent rather than the exact geometry, clipping - by the exact geometry helps reduce the size of the results - - Build example GeoDataFrame of locations for stations + Build example geopandas GeoDataFrame of locations for stations: >>> import geopandas >>> from shapely.geometry import Point @@ -663,7 +679,8 @@ def clip_stations(stations, aoi): 0 In POINT (-87.12500 30.50000) 1 Out POINT (-87.50000 30.50000) - Use area of interest from tests GeoJSON: + Use area of interest GeoJSON for Pensacola and Perdido Bays, FL from + harmonize_wq tests: >>> aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson' @@ -682,8 +699,8 @@ def clip_stations(stations, aoi): def to_simple_shape(gdf, out_shp): """Simplify GeoDataFrame for better export to shapefile. - Adopts and adapts 'Simple' from NWQMC/pywqp. See domains.stations_rename() - for renaming of columns. + Adopts and adapts 'Simple' from `NWQMC/pywqp`_. + See :func:`domains.stations_rename` for renaming of columns. 
Parameters ---------- diff --git a/pyproject.toml b/pyproject.toml index 9943871..ae6dc8d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -15,7 +15,7 @@ version = "0.3.1" authors = [ { name="Justin Bousquin", email="Bousquin.Justin@epa.gov" }, ] -description = "Package to standardize, clean and wrangle Water Quality Portal data into more analytic-ready formats" +description = "Package to standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats" readme = "README.md" requires-python = ">=3.7" keywords = ["USEPA", "water data", "water quality"]