
Feature/smoothing and extrapolating gps coordinates #268

Merged

Conversation

@BaptisteVandecrux (Member) commented Jun 27, 2024

This PR builds on top of #262. It:

  • reads merged (raw+tx for v2+v3 stations) aws-l2 data
  • runs toL3 on that L2 file, in which the GPS latitude, longitude and altitude are post-processed into smooth, gap-free series.

This is done by:

  • finding breaks in the available GPS coordinates that correspond to station maintenance. Smoothing and interpolation are done separately for each interval between maintenance visits, so that station relocations are not smoothed over.

  • applying, for each interval, a linear fit to each coordinate. The fit provides a smoothed alternative during the measurement periods and is used to interpolate gaps and to extrapolate to the next maintenance, to the latest timestamp, or backward in time if necessary (a sketch of this step follows the lists below).

  • assigning, for stations where no GPS data is available, the static value stored as an attribute in the aws-l2 data file to every timestamp.

The result is three new variables:

  • lat, lon, alt: the smoothed and gapless time-dependent coordinates for each station

And three new attributes:

  • lat_avg, lon_avg, alt_avg: the mean values of the three variables above, which can be given as attributes in the netCDF files and potentially in the AWS_metadata.csv file distributed with the dataset.
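For illustration only, a minimal sketch of the per-interval linear fit described above, assuming a pandas Series `gps` indexed by time and a list of break timestamps `breaks` (names hypothetical, not the actual toL3 code):

import numpy as np
import pandas as pd

def piecewise_linear_fill(gps: pd.Series, breaks: list) -> pd.Series:
    # pad with None so the first and last segments are open-ended
    edges = [None] + sorted(breaks) + [None]
    pieces = []
    for start, end in zip(edges[:-1], edges[1:]):
        segment = gps.loc[start:end]   # all timestamps in this interval
        obs = segment.dropna()         # GPS fixes actually observed
        if len(obs) < 2:
            pieces.append(segment)     # too few fixes to fit a line
            continue
        # fit value = a*t + b, with time expressed in seconds
        t = obs.index.astype("int64") / 1e9
        a, b = np.polyfit(t, obs.values, 1)
        t_all = segment.index.astype("int64") / 1e9
        pieces.append(pd.Series(a * t_all + b, index=segment.index))
    out = pd.concat(pieces)
    # break timestamps fall in two segments; keep the later segment's value
    return out[~out.index.duplicated(keep="last")]

The actual implementation reads the break dates from the station_configurations files and handles the no-GPS (static coordinates) case separately.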

@BaptisteVandecrux BaptisteVandecrux changed the base branch from main to make-python-friendly-CLI-scripts June 27, 2024 08:43
Base automatically changed from make-python-friendly-CLI-scripts to develop June 28, 2024 12:05
This update:
- clears up the SHF LHF calculation
- finds breaks in the gps coordinates (indicative of station relocation)
- for each interval between breaks, smooths, interpolates and extrapolates the gps coordinates

new required package: statsmodels
- both functions now take the station_configuration config file as input
- added some error handling in l2tol3.py when variables required for processing are not available
- now breaks in station locations are listed in the station_configurations file
- the smoothed lat/lon/alt between two breaks (i.e. between two station relocations) is now linear, no longer lowess
@BaptisteVandecrux BaptisteVandecrux force-pushed the feature/smoothing-and-extrapolating-gps-coordinates branch from 4343ea8 to 1bfe305 Compare July 2, 2024 08:09
@ladsmund ladsmund force-pushed the feature/smoothing-and-extrapolating-gps-coordinates branch from 1bfe305 to d6d1735 Compare July 2, 2024 12:31
…we now also recalculate dirWindSpd if needed for historical data
ds=ds.rename(var_name)

standard_vars_to_drop = ["NR", "TA3", "TA4", "TA5", "NR_cor",
Contributor:

Are these variables related to the NEAD format or specifically for the files you are reading? 🤔
Consider creating a module dedicated for reading (and preprocessing) historical gc-net data.

Contributor:

Would it make sense to add them to skipped_variables in the station configuration file?

Member Author:

Technically, they are related to the files we are reading (here historical GC-Net).
But there are 31 historical GC-Net stations, and I'd rather have one list here than 31 skipped_variables lists in the config files. Additionally, the only other NEAD files we might read at some point are the old GlacioBasis stations (4-5 AWS). For those few GlacioBasis stations, I'd consider using the skipped_variables info.
Lastly, this standard_vars_to_drop list will be reduced very soon, as

"z_surf_1",   "z_surf_2", "z_surf_combined", 
            "depth_t_i_1", "depth_t_i_2", "depth_t_i_3", "depth_t_i_4", "depth_t_i_5", 
            "depth_t_i_6", "depth_t_i_7", "depth_t_i_8", "depth_t_i_9", "depth_t_i_10", "t_i_10m"

will shortly be derived for the GEUS stations, and therefore won't need to be skipped anymore in the historical files.

In the longer term, we could also calculate

"TA2m", "RH2m", "VW10m", "SZA", "SAA",

and thereafter remove them from the list of variables to skip in the historical files.

Once again, keeping this list in one single place (rather than multiple config files) makes every update easier.
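As a hedged aside, applying one shared drop list when reading a file can be as simple as the following sketch (function name hypothetical):

import xarray as xr

def drop_standard_vars(ds: xr.Dataset, standard_vars_to_drop: list) -> xr.Dataset:
    # drop only the variables actually present, so names missing
    # from a given file do not raise a KeyError
    return ds.drop_vars([v for v in standard_vars_to_drop if v in ds.data_vars])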

Member:

I think we will look at making a smart solution for this in the future, but I think it works fine for now

if not os.path.isfile(filepath):
logger.info(stid+' is from an project '+folder_l3+' or '+folder_gcnet)
if not os.path.isfile(filepath):
logger.info(stid+' was listed as station but could not be found in '+folder_l3+' nor '+folder_gcnet)
Contributor:

I anticipated that this would trigger an error and raise an exception. If the station list specifies data files, I would have expected those files to exist.

Member Author:

My intention was that if a historical station is listed but cannot be found, the latest data (which can most likely be found, because it was produced by pypromice) should still be loaded and written to an l3/sites file, so that people fetching the latest data are not affected if there is a mess-up with the historical files' paths. A sketch of that behaviour follows.
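Hedged sketch of that skip-and-continue loading (hypothetical helper; file naming and paths assumed, not the actual pypromice code):

import logging
import os

logger = logging.getLogger(__name__)

def find_station_files(stids, folder_l3, folder_gcnet):
    # skip missing historical files instead of raising, so the latest
    # pypromice output can still be joined and written to l3/sites
    found = []
    for stid in stids:
        candidates = [os.path.join(folder_l3, stid + "_hour.nc"),
                      os.path.join(folder_gcnet, stid + ".csv")]
        filepath = next((p for p in candidates if os.path.isfile(p)), None)
        if filepath is None:
            logger.info("%s was listed as station but could not be found in %s nor %s",
                        stid, folder_l3, folder_gcnet)
            continue
        found.append(filepath)
    return found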

and corrected relative humidity
df_all = pd.Series(dtype=float) # Initialize an empty Series to gather all smoothed pieces
breaks = [None] + breaks + [None]
for i in range(len(breaks) - 1):
Contributor:

I don't think you need to call copy() since you are not mutating df. It is significantly slower to copy all the data than to just use slices.

The suggestion below is just for inspiration, as an alternative using indexes.

Suggested change
for i in range(len(breaks) - 1):
slice_start = None
for slice_end in breaks + [None]:
df = slice(slice_start, slice_end)

Member Author:

I removed the copy().
But I don't see how slice_start is updated at each iteration (from None, through breaks[0]…, to breaks[-1] on the last iteration).
I'd like to stick with my proposition unless there is something wrong or inefficient about it.

Contributor:

I forgot to add a slice_start = slice_end
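With that line added, the suggested alternative becomes (sketch, same fragment style as the suggestion above; `df_in` and `breaks` as in the surrounding code):

slice_start = None
for slice_end in breaks + [None]:
    segment = df_in.loc[slice_start:slice_end]   # a view; no copy() needed
    # ... fit/smooth the segment and collect the result here ...
    slice_start = slice_end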

).values

# Remove duplicate indices and return values as numpy array
df_all = df_all[~df_all.index.duplicated(keep='first')]
Contributor:

I don't see how this function can cause duplicated indices

Member Author:

Because for each iteration/segment we select:

df = df_in.loc[slice(breaks[i], breaks[i+1])].copy()

which includes the timestamps of both breaks. So in one iteration the end break receives an estimated value, and in the next iteration/segment that same break is a start break and receives another value.

We therefore need to drop duplicates at the end.
I'm changing to keep='last' so that if a timestamp is a break, it receives the value estimated within the segment where it is a start break (and not an end break).
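A tiny self-contained example of the effect of keep='last' (values invented for illustration):

import pandas as pd

s1 = pd.Series([1.0, 2.0], index=pd.to_datetime(["2024-01-01", "2024-01-02"]))  # segment ending at the break
s2 = pd.Series([9.0, 3.0], index=pd.to_datetime(["2024-01-02", "2024-01-03"]))  # segment starting at the break
df_all = pd.concat([s1, s2])
# 2024-01-02 appears twice; keep='last' retains 9.0, the start-break estimate
df_all = df_all[~df_all.index.duplicated(keep="last")]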

read config file in get_l2tol3 and pass only dictionary to L2toL3.py

+ fixed doc on L2toL3.py
+ better reverse of L1 dataset list in aws.py
@BaptisteVandecrux BaptisteVandecrux force-pushed the feature/smoothing-and-extrapolating-gps-coordinates branch from 73f6359 to 55b5f3a Compare July 4, 2024 12:12
PennyHow previously approved these changes Jul 5, 2024
@PennyHow (Member) left a comment:

Looks good to me so far. The only additional thing I would like to see is checks on the get_l2tol3, join_l2 and join_l3 scripts in the actions script, so that the workflow is checked on PRs. As long as it is being run and checked on the development VM, I think the actions script can be updated later on, once we have all changes implemented.

@@ -1,91 +1,91 @@
field,standard_name,long_name,units,coverage_content_type,coordinates,instantaneous_hourly,where_to_find,lo,hi,OOL,station_type,L0,L2,L3,max_decimals
time,time,Time,yyyy-mm-dd HH:MM:SS,physicalMeasurement,time lat lon alt,,,,,,all,1,1,1,
Member:

The time lat lon alt coordinate description is very specific to CF conventions, as each measurement has a "coordinate" in both time and space. See here in the CF conventions docs for an example. Did you find something else to suggest that the coordinate description should just be time?

Member Author:

Unfortunately, the stations' lat, lon, alt change with time, which makes them unsuitable as stable coordinates.
The CF convention is not very well designed for "floating" stations like ours. I really don't want users to think the stations are static.

As a note for future enhancement, it would be great to have station_id or site_id as a coordinate (instead of an attribute), so that we could do things like:

ds = xr.open_mfdataset(list_files, combine='nested', concat_dim='station_id')
ds.t_u.mean(dim='time')  # gives the average temperature for each station

@@ -91,7 +91,7 @@ def getL1(self):
logger.info('Level 1 processing...')
self.L0 = [utilities.addBasicMeta(item, self.vars) for item in self.L0]
self.L1 = [toL1(item, self.vars) for item in self.L0]
self.L1A = reduce(xr.Dataset.combine_first, self.L1)
self.L1A = reduce(xr.Dataset.combine_first, reversed(self.L1))
Member:

Is there a specific reason why the order of the L1 datasets is reversed for combine_first?

Member Author:

This is because the logger/transmission files are loaded from oldest to newest.
combine_first starts with the first element of the list and adds info (variables but also attributes) from the other elements as needed.
This means the attributes of the oldest elements prevail over those of the newer elements.

For sites like EGP and KAN_U that switched from one boom to two booms, we need the newest data to be the first element, so that the latest value of attributes such as "number_of_booms" is used instead of older values; we therefore need to reverse the list.
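A toy example of that precedence (hedged; xarray's combine_first keeps the attributes of the dataset it is called on):

from functools import reduce
import xarray as xr

old = xr.Dataset(attrs={"number_of_booms": "1"})
new = xr.Dataset(attrs={"number_of_booms": "2"})

merged = reduce(xr.Dataset.combine_first, reversed([old, new]))
print(merged.attrs["number_of_booms"])  # prints "2": the newest dataset's attributes win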

@BaptisteVandecrux BaptisteVandecrux merged commit 2eb7f79 into develop Jul 5, 2024
8 checks passed
@BaptisteVandecrux BaptisteVandecrux deleted the feature/smoothing-and-extrapolating-gps-coordinates branch July 5, 2024 13:07
BaptisteVandecrux added a commit that referenced this pull request Jul 5, 2024
ladsmund pushed a commit that referenced this pull request Aug 8, 2024
BaptisteVandecrux added a commit that referenced this pull request Aug 20, 2024
* Update .gitignore

* L2 split from L3 CLI processing

* unit tests moved to separate module

* file writing functions moved to separate module

* Loading functions moved to separate module

* Handling and reformatting functions moved

* resampling functions moved

* aws module updated with structure changes

* get_l2 and l2_to_l3 process test added

* data prep and write function moved out of AWS class

* stations for testing changed

* creating folder before writing files, writing hourly daily monthly files out in L2toL3, trying not to re-write sorted tx file if already sorted

* update get_l3 to add historical data

* resampling frequency specified

* renamed join_levels to join_l2 because join_l3 will have different merging function, attribute management and use site_id and list_station_id

* skipping resample after join_l2, fixed setup.py for join_l2

* fixing test

* fixed function names

* update get_l3 to add historical data

* update get_l3 to add historical data

* Create get_l3_new.py

* further work on join_l3, variable_aliases in ressource folder

* cleaning up debug code in join_l3

* small fix in join_l3

* working version

* delete encoding info after reading netcdf, debug of getColNames

* delete get_l3.py

* removing variables and output files metadata

* new variable description files

* added back ressource files, use level attributes for output definition

* make list of sites from station_config, switched print to logger.info

* removing get_l3, remove inst. values from averaged files, fixes on logging, attributes and tests,

* Updates to numpy dependency version and pandas deprecation warnings (#258)

* numpy dependency <2.0

* resample rules updated (deprecation warning)

* fillna replaced with ffill (deprecation warning)

* get_l3 called directly rather than from file

* process action restructured

* small changes following review, restored variable.csv history

renamed new variable.csv

moved old variable.csv

renamed new variables.csv

recreate variables.csv

* building a station list instead of a station_dict

* renamed l3m to l3_merged, reintroduced getVars and getMeta

* moving gcnet_postprocessing as part of readNead

* sorting out the station_list in reverse chronological order

* using tilde notation in setup.py

* better initialisation of station_attributes attribute

* moved addMeta, addVars, roundValues, reformatTime, reformatLon to write.py

* Inline comment describing encoding attribute removal when reading a netcdf

* loading toml file as dictionary within join_l3

instead of just reading the stid to join

* ressources renamed to resources (#261)

* using the project attribute of a station to locate the AWS file and specify whether it's a NEAD file

* update test after moving addVars and addMeta

* fixed logger message in resample

* better definition of monthly sample rates in addMeta

* dummy dataset built in unit test now has 'level' attribute

* not storing timestamp_max for each station but pulling the info directly from the dataset when sorting

* removing unnecessary import of addMeta, roundValues

* make CLI scripts usable within python

* return result in join_l2 and join_l3

* removing args from join_l2 function

* proper removal of encoding info when reading netcdf

* Refactored and Organized Test Modules

- Moved test modules and data from the package directory to the root-level tests directory.
- Updated directory structure to ensure clear separation of source code and tests.
- Updated import statements in test modules to reflect new paths.
- Restructured the tests module:
  - Renamed original automatic tests to `e2e` as they primarily test the main CLI scripts.
  - Added `unit` directory for unit tests.
  - Created `data` directory for shared test data files.

This comprehensive refactoring improves project organization by clearly separating test code from application code. It facilitates easier test discovery and enhances maintainability by following common best practices.

* Limited the ci tests to only run e2e

* naming conventions changed

* Feature/smoothing and extrapolating gps coordinates (#268)

* implemented gps postprocessing on top of the #262

This update:
- clears up the SHF LHF calculation
- reads dates of station relocations (when station coordinates are discontinuous) from the `aws-l0/metadata/station_configurations` 
- for each interval between station relocations, a linear function is fitted to the GPS observations of latitude longitude and altitude and is used to interpolate and extrapolate the gps observations
- these new smoothed and gap-free coordinates are the variables `lat, lon, alt`
- for bedrock stations (like KAN_B) static coordinates are used to build `lat, lon, alt`
- finally, `lat_avg`, `lon_avg` and `alt_avg` are calculated from `lat, lon, alt` and added as attributes to the netcdf files.

Several minor fixes were also brought like:
- better encoding info removal when reading netcdf
- printing variables full of NaNs to the `L2` and `L3/stations` files, but not to the `L3/sites` files.
- recalculate dirWindSpd if needed for historical data
- due to the xarray version, new columns need to be added manually before concatenating different datasets in join_l3

* Updated persistence.py to use explicit variable thresholds

Avoided applying the persistence filter on averaged pressure variables (`p_u` and `p_l`) due to their 0 decimal precision often leading to incorrect filtering.

* Fixed bug in persistence QC where initial repetitions were ignored

* Relocated unit persistence tests
* Added explicit test for `get_duration_consecutive_true`
* Renamed `duration_consecutive_true` to `get_duration_consecutive_true` for imperative clarity

* Updated python version in unittest

* Fixed bug in get_bufr

Configuration variables were too strictly validated.
* Made bufr_integration_test explicit

* Added __all__ to get_bufr.py

* Applied black code formatting

* Made bufr_to_csv as cli script in setup.py

* Updated read_bufr_file to use wmo_id as index

* Added script to recreate bufr files

* Added corresponding unit tests
* Added flag to raise exceptions on errors
* Added create_bufr_files.py to setup

* Updated tests parameters

Updated station config:
* Added sonic_ranger_from_gps
* Changed height_of_gps_from_station_ground from 0 to 1

* Added test for missing data in get_bufr

- Ensure get_bufr_variables raises AttributeError when station dimensions are missing

* Updated get_bufr to support static GPS heights.

* Bedrock stations shouldn’t depend on the noisy GPS signal for elevation.
* Added station dimension values for WEG_B
* Added corresponding unittest

* Updated github/workflow to run unittests

Added eccodes installation

* Updated get_bufr to support station config files in folder

* Removed station_configurations.toml from repository
* Updated bufr_utilities.set_station to validate wmo id
* Implemented StationConfig io tests
* Extracted StationConfiguration utils from get_bufr
* Added support for loading multiple station configuration files

Other
* Made ArgumentParser instantiation inline

* Updated BUFRVariables with scales and descriptions

* Added detailed descriptions with references to the attributes in BUFRVariables
* Change the attribute order to align with the exported schema
* Changed variable roundings to align with the scales defined in the BUFR schemas:
  * Latitude and longitude is set to 5. Was 6
  * heightOfStationGroundAboveMeanSeaLevel is set to 1. Was 2
  * heightOfBarometerAboveMeanSeaLevel is set to 1. Was 2
  * pressure is set to -1. Was 1. Note: The BUFRVariable unit is Pa and not hPa
  * airTemperature is set to 2. Was 1.
  * heightOfSensorAboveLocalGroundOrDeckOfMarinePlatformTempRH is set to 2. Was 4
  * heightOfSensorAboveLocalGroundOrDeckOfMarinePlatformWSPD is set to 2. Was 4
 * Added unit tests to test the roundings
* Updated existing unit tests to align with corrected precision

* Increased the real_time_utilities rounding precisions

* Updated get_bufr to separate station position from bufr

* The station position determination (AWS_latest_locations) is separated from the bufr file export
* Updated the unit tests

Corrected minimum data check to allow p_i or t_i to be nan

Renamed process_station parameters for readability
* Rename now_timestamp -> target_timestamp
* Rename time_limit -> linear_regression_time_limit

Applied black

* Minor cleanup

* Updated StationConfiguration IO to handle unknown attributes from input

* Updated docstring in create_bufr_files.py

* Renamed e2e unittest methods

Added missing "test" prefix required by the unittest framework.

* Feature/surface heights and thermistor depths (#278)

* processes surface heights variables: `z_surf_combined`, `z_ice_surf`, `snow_height`, and thermistors' depths: `d_t_i_1-11`
* `variable.csv` was updated accordingly
* some clean-up of turbulent fluxes calculation, including renaming functions
* handling empty station configuration files and making errors understandable
* updated join_l3 so that surface height and thermistor depths in historical data are no longer ignored and to adjust the surface height between the merged datasets

* static values called `latitude`, `longitude` and `altitude`, calculated either from `gps_lat, gps_lon, gps_alt` or from `lat, lon, alt`, are saved as attributes, along with `latitude_origin`, `longitude_origin` and `altitude_origin`, which state whether they come from the gappy observations `gps_lat, gps_lon, gps_alt` or from the gap-filled postprocessed `lat, lon, alt`
* changed "H" to "h" in pandas and added ".iloc" when necessary to remove deprecation warnings

* made `make_metadata_csv.py` to update latest location in `aws-l3/AWS_station_metadata.csv` and `aws-l3/AWS_sites_metadata.csv`

---------

Co-authored-by: Penny How <[email protected]>

* L2toL3 test added (#282)

* 3.8 and 3.9 tests removed, tests only for 3.10 and 3.11

* echo syntax changed

* updated input file paths
---------

* better adjustment of surface height in join_l3, also adjusting z_ice_surf (#289)

* different decoding of GPS data if "L" is in GPS string (#288)

* Updated pressure field for BUFR output files

* Updated get_l2 to use aws.vars and aws.meta

get_l2 was previously also loading vars and meta in addition to AWS.
AWS is populating meta with source information during instantiation.

* Removed static processing level attribute from file_attributes

* Run black on write.py

* Implemented alternative helper functions for reading variables and metadata files

* Refactor getVar getMeta
* Use pypromice.resources instead of pkg_resources

* Select format from multiple L0 input files

The format string was previously selected from the last l0 file.

* Updated attribute metadata

* Added test case for output meta data
* Added json formatted source string to attributes
* Added title string to attributes
* Updated ID string to include level
* Added utility function for fetching git commit id

* Updated test_process with full pipeline test

* Added test station configuration
* Cleanup test data files

* Removed station configuration generation

* Renamed folder name in temporary test directory

* Added data issues repository path as an explicit parameter to AWS

* Added data issues path to process_test.yml

* Applied black on join_l3

* Updated join_l3 to generate source attribute for sites

Validate attribute keys in e2e test

* job name changed

* Bugfix/passing adj dir to l3 processing plus attribute fix (#292)

* passing adjustment_dir to L2toL3.py

* fixing attributes in join_l3

- station_attribute containing info from merged dataset was lost when concatenating the datasets
- The key "source" is not present in the attributes of the old GC-Net files so `station_source = json.loads(station_attributes["source"])` was throwing an error

* give data_issues_path to get_l2tol3 in test_process

* using data_adjustments_dir as input in AWS.getL3

* adding path to dummy data_issues folder to process_test

* making sure data_issues_path  is Path in get_l2tol3

---------

Co-authored-by: PennyHow <[email protected]>
Co-authored-by: Mads Christian Lund <[email protected]>
BaptisteVandecrux added a commit that referenced this pull request Aug 20, 2024