
Better Parameter Validation #80

Open · wants to merge 25 commits into main
Conversation

@AstroPatty commented Oct 27, 2023

Hey all,

As promised, here is an implemented system for managing parameter defaults and validating input values. This kind of checking is critical for long-running pipelines because it catches mistakes early in a run and gives a clear explanation of what is wrong. We want to place clear constraints on what types a parameter can be, what values it can take, and what its defaults are. Importantly, we also want to know exactly where all this information lives, so we can update it when needed. This system uses pydantic for parameter validation, an industry-standard library for exactly this kind of work.

Note: This pull request is also intended as basic documentation for people contributing to slsim, so it'll be fairly long.

The system I've built here is very easy to use, but does require a bit of explanation. It works (by design) with inputs to __init__ methods, other class methods, and standalone functions, and we can always extend it further in the future. Importantly, it can be adopted on a per-class basis.

To mark a class for validation, we simply decorate the relevant methods with the @check_params decorator:

slsim/SomeModule/some_class.py

from slsim.Util.params import check_params

class SomeClass:

    @check_params
    def __init__(self, param_a, param_b, param_c):
        self.a = param_a
        self.b = param_b
        self.c = param_c

    @check_params
    def calculate_something(self, param_d):
        # do some calculation
        return param_d

Then, create a subfolder called "_params". Within that subfolder, create a file with the same name as the file that contains the class:

slsim/SomeModule/_params/some_class.py

from pydantic import BaseModel, Field, PositiveFloat

class SomeClass(BaseModel):
    param_a: list[float]
    param_b: str | list[str] = "some_input"
    param_c: int = Field(default=2, ge=0)

class SomeClass_calculate_something(BaseModel):
    param_d: PositiveFloat

The first class defines the validation logic for the parameters that go into the __init__ method of SomeClass. Note that the name of the validator class is also SomeClass; this is required. It enforces the following constraints on the parameters:

  • param_a must be a list of floats
  • param_b must be a string OR a list of strings. If nothing is provided, the default will be "some_input"
  • param_c must be an integer that is greater than or equal to 0. If nothing is provided, the default will be 2

The second class validates the inputs to the SomeClass.calculate_something method. The name of such a validation class must always follow the pattern ClassName_MethodName, but otherwise everything is identical. This method has a single argument, which must be a positive floating point number.
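For example, with the SomeClass_calculate_something validator in place, a bad method argument fails immediately. A quick illustration (using the toy SomeClass from above):

obj = SomeClass(param_a=[1.0, 2.0])
obj.calculate_something(3.0)   # passes: a positive float
obj.calculate_something(-1.0)  # raises pydantic.ValidationError: param_d must be positive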

We can also validate inputs to functions this way. For example:

slsim/SomeModule/some_functionality.py

from slsim.Util.params import check_params

@check_params
def some_function(arg_a, arg_b, arg_c):
    # Do some really cool computation
    ...
    return result

Then we add a validation class:

slsim/SomeModule/_params/some_functionality.py

from pydantic import BaseModel, Field, PositiveFloat

class some_function(BaseModel):
    arg_a: list[PositiveFloat]
    arg_b: float
    arg_c: int = Field(default=2, ge=0)

As expected, the name of the validation class must match the name of the decorated function.
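For intuition, here is a minimal sketch of how a decorator like this can validate call arguments against a pydantic model. This is illustrative only: the PR's actual decorator in slsim/Util/params.py resolves the model automatically through the _params naming convention, whereas this simplified stand-in takes the model explicitly.

import inspect
from functools import wraps

from pydantic import BaseModel


def check_params_with(model: type[BaseModel]):
    """Simplified stand-in: validate call arguments against `model`."""

    def decorator(fn):
        sig = inspect.signature(fn)

        @wraps(fn)
        def wrapper(*args, **kwargs):
            bound = sig.bind_partial(*args, **kwargs)
            arguments = dict(bound.arguments)
            self_obj = arguments.pop("self", None)
            # Raises pydantic.ValidationError on bad input and fills in defaults
            validated = model(**arguments)
            if self_obj is not None:
                return fn(self_obj, **validated.model_dump())
            return fn(**validated.model_dump())

        return wrapper

    return decorator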

Important note: All of this is entirely transparent to the user. They simply create an instance of SomeClass as they normally would. They will only run into an issue if the parameters they provide are invalid.
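Concretely, with the toy SomeClass from above, valid calls behave exactly as before, and invalid ones fail at construction time (illustrative):

SomeClass(param_a=[1.0, 2.0])         # fine; param_b and param_c take their defaults
SomeClass(param_a=[1.0], param_c=-1)  # raises pydantic.ValidationError (param_c violates ge=0)
SomeClass(param_a="oops")             # raises pydantic.ValidationError (param_a must be a list of floats)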

All validation is done by pydantic, which is incredibly fast. I've updated the GaussianMixtureModel class and the vel_disp_composite_model function as an example. The validation class for GaussianMixtureModel looks like this:

slsim/ParamDistributions/_params/GaussianMixtureModel.py

from pydantic import BaseModel, PositiveFloat, field_validator, model_validator
import numpy as np

class GaussianMixtureModel(BaseModel):
    means: list[float] = [0.00330796, -0.07635054, 0.11829008]
    stds: list[PositiveFloat] = [np.sqrt(0.00283885), np.sqrt(0.01066668), np.sqrt(0.0097978)]
    weights: list[PositiveFloat] = [0.62703102, 0.23732313, 0.13564585]

    @field_validator("weights")
    @classmethod
    def check_weights(cls, weight_values):
        if sum(weight_values) != 1:
            raise ValueError("The sum of the weights must be 1")
        return weight_values

    @model_validator(mode="after")
    def check_lengths(self):
        if len(self.means) != len(self.stds) or len(self.means) != len(self.weights):
            raise ValueError("The lengths of means, stds and weights must be equal")
        return self

Note: I just pulled the defaults from the original class definition.

This does three things in addition to simple type checking:

  1. Ensures stds and weights are positive values
  2. Ensures the weights sum to 1
  3. Ensures the lengths of all three parameters are the same

Some of these checks were already implemented in the GaussianMixtureModel __init__ function. However, the checking is now much more robust and requires much less code. GaussianMixtureModel now just looks like this:
slsim/ParamDistributions/gaussian_mixture_model.py

from slsim.Util.params import check_params

class GaussianMixtureModel:
    """A Gaussian Mixture Model (GMM) class.

    This class is used to represent a mixture of Gaussian distributions, each of which
    is defined by its mean, standard deviation and weight.
    """
    @check_params
    def __init__(self, means, stds, weights):
        """
        The constructor for GaussianMixtureModel class. The default values are the
        means, standard deviations, and weights of the fits to the data in table
        2 of https://doi.org/10.1093/mnras/stac2235 and others. See "params.py" for
        defaults and validation logic.

        :param means: the mean values of the Gaussian components.
        :type means: list of float
        :param stds: The standard deviations of the Gaussian components.
        :type stds: list of float
        :param weights: The weights of the Gaussian components in the mixture.
        :type weights: list of float
        """
        self.means = means
        self.stds = stds
        self.weights = weights

Here's an example of a failure case:

>>> from slsim.ParamDistributions.gaussian_mixture_model import GaussianMixtureModel as GMM
>>> GMM(means=[1, 2, 3], stds = [4, 5], weights = [0.4, 0.4, 0.2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/patrick/code/DESC/slsim/slsim/Util/params.py", line 45, in new_init_fn
    parsed_args = get_defaults(init_fn)(**pargs, **kwargs)
  File "/Users/patrick/opt/anaconda3/envs/dev/lib/python3.10/site-packages/pydantic/main.py", line 164, in __init__
    __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)
pydantic_core._pydantic_core.ValidationError: 1 validation error for GaussianMixtureModel
  Value error, The lengths of means, stds and weights must be equal [type=value_error, input_value={'means': [1, 2, 3], 'std...ights': [0.4, 0.4, 0.2]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.4/v/value_error

Here pydantic tells us exactly what went wrong: our three input arrays must all be the same length.

Additional notes:

  • We can introduce higher-level validation tools as needed. For example, we could validate that a value has particular units as defined in astropy.units (see the sketch after this list). This will require some additional code, but I've basically already written it for cosmap, so it will be straightforward.
  • Related to the above, pydantic is incredibly powerful and has tons of features. We likely won't use most of them, but they're always there.
  • We can eventually extend this system to read information in from configuration files, if we want.
  • We can always expand this system to do checking on more than just initialization functions, but this will definitely introduce some significant complexity on the backend. Happy to do this work though.
  • I would strongly recommend we update our minimum Python version to at least 3.10. This allows us to write union types (that is, marking a parameter as allowing multiple types) like float | list[float] whereas 3.8 would require Union[float, list[float]] and an additional import. Performance has also been a big focus of more recent Python versions, so we're losing out on all that if we stay on 3.8. With virtual environments, changing Python versions is trivial.
  • One thing we may want to change about this code as written is the naming scheme. "Parameter" is the standard name for function inputs, but of course it's also a word we use in a scientific context. So we may want to pick some convention that avoids confusion.
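To make the astropy.units idea above concrete, here is a hedged sketch of what a unit-aware validator could look like (hypothetical model and field names, not code from this PR):

import astropy.units as u
from pydantic import BaseModel, ConfigDict, field_validator


class VelocityModel(BaseModel):
    # Quantity is not a native pydantic type, so arbitrary types must be allowed
    model_config = ConfigDict(arbitrary_types_allowed=True)

    sigma: u.Quantity  # expected to carry velocity units

    @field_validator("sigma")
    @classmethod
    def check_velocity_units(cls, value):
        if not value.unit.is_equivalent(u.km / u.s):
            raise ValueError("sigma must have velocity units (e.g. km / s)")
        return value

With this, VelocityModel(sigma=200 * u.km / u.s) validates, while a bare float or a quantity with the wrong dimensions is rejected.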

@AstroPatty (Author) commented Oct 27, 2023

Note: The build failed because it is on Python 3.8, and I used the more modern type hinting conventions introduced in Python 3.9. Namely, in Python 3.8 list type hints have to be declared like this:

from typing import List
value: List[float] = [1.0, 2.0]

While in Python 3.9 the type hinting can be done like this:

value: list[float] = [1.0, 2.0]

I think the second is much more natural and requires us to think less, which is an additional reason I advocate for updating to a newer version of Python. It won't break backward compatibility and just gives us new features.

@ajshajib (Collaborator) commented
@PatrickRWells, thanks a lot for the PR! You can change the Python version in this file as part of this PR.

@sibirrer (Contributor) commented
@PatrickRWells @ajshajib I am a bit worried about requiring a 3.9+ Python version. 3.8 should be a long-term supported version, and many (or some) machines (be they personal computers or cluster environments) may still run on 3.8. This would narrow down the user base or cause some hassle for some users.

@AstroPatty (Author) commented Oct 28, 2023

@sibirrer @ajshajib I must admit I really don't see this being an issue when virtual environments exist. I have never met an astronomer who doesn't use Anaconda. Virtual environment support is ubiquitous on clusters. It's not a hassle.

Also, Python 3.8 is scheduled for end-of-life next year. It is not a "long-term supported" version. Major libraries will start to drop support (numpy already has), as will major operating systems (outside of virtual environments, of course). We really should not be building new software on a tool that is about to be abandoned when perfectly good alternatives exist that require virtually no overhead.

@sibirrer (Contributor) commented
I see @PatrickRWells. I am fine with having a >=3.9 requirement. You just cannot assume that all users follow the guidelines, and then be too upset if their experience is impacted. In the end, it is on the developers to provide a solution to the users. I tend to favor better backwards compatibility when the implemented features do not suffer too much from it and dropping it is not necessary.

@nkhadka21 (Collaborator) commented
Hi @PatrickRWells, thanks for the PR. You said that this only works for the __init__ method. How straightforward is it to extend to other parts of the code? It would be nice if you could write test functions for your implementation.

@AstroPatty (Author) commented Oct 30, 2023

@sibirrer I think if we're building software with a long-term lifetime in mind, we should try to remain compatible with up-to-date versions of our biggest dependencies. Both numpy and astropy have already dropped support for version 3.8, so we should do the same. Otherwise we run the risk down the line of running into problems (due to bugs or anything else) that are unfixable because we're using an old version, and miss out on updates that may meaningfully improve our experience. I think those teams probably understand the Python ecosystem well enough to make good calls on how much backward compatibility is the right amount.

It's also worth mentioning that one of the current big focuses of the Python team is performance, which is critical for our future work at LSST scale. 3.11 is roughly 30% faster than 3.10 in some applications. It will probably be wise for us to run our code in modern environments anyway.

Fortunately or unfortunately, Python is a language under active development, and new versions are expected to be released approximately once a year. I think this is a good thing in the long term, because the language gets meaningfully better. It does require some work on our part as developers if we really care about building tools that work for the long term, but that's just part of the deal.

Changes I will make to the PR:

  • I will update the GitHub workflow to do test builds on 3.9-3.11. That will allow us to be confident that our code works across versions. It is highly unlikely we will run into a scenario where our code builds on 3.9 but not on a later version.
  • I will update requirements.txt to explicitly list 3.9 as the minimum version.

Some things we should consider:

  • We should introduce a true dependency management solution. I've used poetry extensively and found it very useful (and easy to use) for pure-Python packages. It can also build packages automatically with basically no work required from us as developers. Combined with GitHub Actions, we can completely automate releasing builds to PyPI, if we so choose.

@AstroPatty (Author) commented Oct 30, 2023

> Hi @PatrickRWells, thanks for the PR. You said that this only works for the __init__ method. How straightforward is it to extend to other parts of the code? It would be nice if you could write test functions for your implementation.

@nkhadka21 Functionally it's not so hard. Right now, the only things preventing it from working with other object methods are an explicit check that raises an exception if the decorator is not being used on an __init__ method, and the fact that I haven't designed a standard for naming pydantic models for these methods. This should be a fairly easy change on the back end. Extending it to work with functions is a bit more work, but should be doable.

The tricky part here is developer experience. If we're not careful, we run the risk of our "params.py" file getting big and unwieldy, which largely defeats the purpose. One possible approach is to put defaults directly in the same file as the classes/functions. I am happy to think a bit more about this and get back to you. In essence, I'm trying to build functionality that encourages us to be serious about good software engineering practices without getting in the way too much.

I think the long-term best solution is to combine this approach with a much tighter API. Right now, this package is mostly a bunch of classes we have to import directly. This is normally how things start off, but we should consider sitting down at some point and properly designing an API. That would let us control more easily how information flows through the package, and place checks where they make the most sense and are most likely to catch mistakes. This is a big job, but it's the kind of work I enjoy and I'm happy to do it here. We should be having these conversations as a group, though, to make sure we understand how the library is expected to be used.

Changes I will make to this PR:

  • I will add test cases for this functionality. I'll make a test class, then build out a test suite that checks all the different ways people might instantiate an object: all positional arguments, all keyword arguments, a mix, keyword arguments in different orders, invalid inputs, etc. A sketch of what such a suite could look like follows.
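As a hedged sketch (hypothetical test names, reusing the toy SomeClass from my first comment):

import pytest
from pydantic import ValidationError

from slsim.SomeModule.some_class import SomeClass  # toy example from above


def test_positional_and_keyword_mix():
    obj = SomeClass([1.0, 2.0], param_c=3, param_b="label")
    assert obj.b == "label" and obj.c == 3


def test_defaults_are_applied():
    obj = SomeClass(param_a=[1.0])
    assert obj.c == 2  # default supplied by the pydantic model


def test_invalid_value_raises():
    with pytest.raises(ValidationError):
        SomeClass(param_a=[1.0], param_c=-1)  # violates ge=0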

@AstroPatty (Author) commented Nov 7, 2023

@nkhadka21 @sibirrer

Updates have been made based on the discussion above. Here's a summary:

Small changes:

  • Minimum Python version is set to 3.9 for installation
  • The testing workflow now tests on Python 3.9-3.11 (I have verified on my machine that all tests already pass on 3.11)
  • All code now passes the pre-commit linter workflow

Bigger Changes:

  • Parameter checking now works with standalone functions and with methods other than __init__. This required a change in the naming conventions; I have updated my original comment on this pull request with the new information.
  • This pull request now includes a full test suite for the parameter checking functionality (/tests/test_Params/test_params_check.py)
  • I have added validation to the vel_disp_composite_model function in slsim/Deflectors/velocity_dispersion so I can use it as a test case. Someone who knows more about velocity dispersion stuff may want to look it over and make sure my choices are sane.

Possible Future Changes:

One issue with this approach is that the language server loses access to the function's parameter signature once the function is wrapped for checking. As a result, the popup help in tools like VS Code looks like this:

[Screenshot: VS Code popup help showing only the docstring, without parameter information]

Note that the popup includes the docstring, but does not show the parameters and their types in the standard way. What we want is this:

[Screenshot: VS Code popup help showing the full parameter signature and types]

This is possible to fix, but requires Python 3.10+. I would still advocate for moving to 3.10 for other reasons, but we can save this for another time.
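For reference, my assumption about why this needs 3.10: typing.ParamSpec (PEP 612, new in 3.10) lets a decorator declare that it preserves the wrapped callable's signature, which is exactly what language servers need. A minimal sketch (not the PR's implementation):

from functools import wraps
from typing import Callable, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")


def check_params(fn: Callable[P, R]) -> Callable[P, R]:
    @wraps(fn)  # copies __name__ and __doc__, and sets __wrapped__ for inspection
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        # ... validate args/kwargs against the matching pydantic model here ...
        return fn(*args, **kwargs)

    return wrapper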

codecov bot commented Nov 7, 2023

Codecov Report

Merging #80 (cfec7a6) into main (a94055a) will increase coverage by 0.86%.
Report is 74 commits behind head on main.
The diff coverage is 90.44%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #80      +/-   ##
==========================================
+ Coverage   78.70%   79.56%   +0.86%     
==========================================
  Files          26       30       +4     
  Lines        1235     1385     +150     
==========================================
+ Hits          972     1102     +130     
- Misses        263      283      +20     
Files Coverage Δ
setup.py 0.00% <ø> (ø)
slsim/Deflectors/_params/velocity_dispersion.py 100.00% <100.00%> (ø)
slsim/Deflectors/all_lens_galaxies.py 100.00% <100.00%> (ø)
slsim/Deflectors/deflector_base.py 84.61% <ø> (ø)
slsim/Deflectors/elliptical_lens_galaxies.py 88.40% <100.00%> (ø)
slsim/Deflectors/velocity_dispersion.py 100.00% <100.00%> (ø)
slsim/Observations/image_quality_lenstronomy.py 75.00% <ø> (ø)
slsim/Observations/roman_speclite.py 90.00% <100.00%> (ø)
slsim/ParamDistributions/gaussian_mixture_model.py 100.00% <100.00%> (ø)
slsim/Pipelines/skypy_pipeline.py 95.74% <100.00%> (ø)
... and 20 more
