refactor: standardize repo structure and other prep for open-sourcing #60
Conversation
- Fixed all imports
- Fixed test and coverage settings in pyproject.toml
- Removed all Python path magic in `__init__.py` files
- Moved data files into the repo, and used `importlib` to load files by package name instead of by path. This is more portable, especially once we turn this into a distributable package.
- Refactored global_data to only load data once, at module load time
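To illustrate the `importlib`-based data loading and the load-once pattern described above, here is a minimal sketch. The package path (`regtech_data_validator.data`), file name (`naics_codes.csv`), and column names are assumptions for illustration, not necessarily what the repo actually uses.

```python
# Hypothetical sketch: load a packaged CSV once, at module import time,
# via importlib.resources instead of filesystem paths.
from importlib.resources import files

import pandas as pd

# Assumed package/file names, for illustration only.
_NAICS_RESOURCE = files("regtech_data_validator.data").joinpath("naics_codes.csv")

# Module-level load: this runs once, the first time the module is imported,
# and every importer then shares the same dict.
with _NAICS_RESOURCE.open("r", encoding="utf-8") as f:
    naics_codes: dict[str, str] = {
        row["code"]: row["title"] for _, row in pd.read_csv(f, dtype=str).iterrows()
    }
```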
- Move file format settings from config.py into respective data transform scripts
- Move src/dest file settings from config.py to CLI args
- Use consistent CLI arg and file-exists handling
- Add openpyxl dependency for handling NAICS Excel reading
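A rough sketch of src/dest CLI args with consistent file-exists handling might look like the following; the argument names and error message are illustrative, not the actual scripts' interface.

```python
# Hypothetical sketch: src/dest paths as CLI args with a consistent existence check.
import argparse
import sys
from pathlib import Path


def main() -> None:
    parser = argparse.ArgumentParser(description="Transform a raw source file into the packaged CSV format.")
    parser.add_argument("src", type=Path, help="path to the source data file")
    parser.add_argument("dest", type=Path, help="path to write the transformed CSV")
    args = parser.parse_args()

    if not args.src.exists():
        # Consistent handling across scripts: report to stderr and exit non-zero.
        print(f"error: source file does not exist: {args.src}", file=sys.stderr)
        sys.exit(1)

    # ... read args.src, transform, and write to args.dest ...


if __name__ == "__main__":
    main()
```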
- Use a required enum-based `severity` arg instead of the boolean `warning` arg, which defaulted to false. That default is likely part of why several warning-level validations ended up set to error.
- Remove `test_checks.py`, as those tests are no longer needed with the refactored `SBLCheck`.
- Refactor JSON output to use the `severity` enum value
- Refactor exception handling for Pandera's `SchemaErrors`
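For illustration, a minimal sketch of the enum-based `severity` approach; the class and constructor shown here are assumptions based on the description above, not the actual implementation.

```python
# Hypothetical sketch: a required, enum-backed severity instead of `warning: bool`.
from enum import Enum

from pandera import Check


class Severity(str, Enum):
    WARNING = "warning"
    ERROR = "error"


class SBLCheck(Check):
    """A check that must be explicitly labeled as either a warning or an error."""

    def __init__(self, check_fn, id: str, name: str, description: str, severity: Severity, **kwargs):
        self.id = id
        self.severity = severity
        # Pass the shared args straight through to the Pandera Check parent class.
        super().__init__(check_fn, name=name, description=description, **kwargs)
```

Because `severity` has no default, every check definition has to state it explicitly, which avoids warning-level validations silently falling back to an error default.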
Coverage report
The coverage rate went from

Diff Coverage details:
regtech_data_validator/checks.py
regtech_data_validator/create_schemas.py
regtech_data_validator/global_data.py
regtech_data_validator/main.py
regtech_data_validator/phase_validations.py
regtech_data_validator/schema_template.py
@@ -16,6 +16,9 @@ pytest-cov = "4.1.0"
black = "23.3.0"
ruff = "0.0.259"

+[tool.poetry.group.data.dependencies]
+openpyxl = "^3.1.2"
Is this used? I may have missed the usage
Yeah, it's required by the NAICS code processing script.
I can add that detail to the new README I added for that dataset. Each of those could use instructions on how to run those two scripts too.
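For reference, here is a rough sketch of how an Excel-backed NAICS source can be read with openpyxl through pandas; the file name, sheet, and column headers are assumptions, not the actual script's values.

```python
# Hypothetical sketch: read a NAICS code list from an Excel workbook and
# write the processed CSV that the validator packages.
import pandas as pd

# Assumed file/sheet/column names, for illustration only.
naics_df = pd.read_excel(
    "2022_NAICS_Codes.xlsx",
    sheet_name=0,
    dtype=str,
    engine="openpyxl",  # the dependency added in this PR's pyproject.toml change
)

naics_df = naics_df.rename(columns={"NAICS Code": "code", "NAICS Title": "title"})[["code", "title"]]
naics_df.to_csv("naics_codes.csv", index=False)
```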
@aharjati, I haven't been running this in the DevContainer setup. What's the best way to test the fix? Is it basically just that the built-in VSCode test runner works as expected?
...and now
we may need to update this list:
The build is back
Thanks!
lgtm
This PR is a bit of a grab bag of tune-up in prep for open-sourcing this repo. It includes:

1. Restructure the repo to be more compliant with modern Python projects.
    1. Move `tests` out to a top-level directory.
    2. Rename `src/validator` to `regtech_data_validator`.
2. Consolidate external datasource code and data into the `data` dir.
    1. Move `config.py` settings into their respective scripts; file paths are now passed in as CLI args instead.
3. Move the processed CSV files into the project itself. This allows for simpler data lookups by package name via `importlib.resources`, which allowed the removal of the `ROOT_PATH` Python path logic in all of the `__init__.py`s.
4. Refactor `global_data.py` to load data only once, where the module is first imported.
5. Refactor `SBLCheck`:
    1. Replace the `warning: bool` arg with a more explicit `severity`, backed by an enum that only allows `ERROR` and `WARNING`.
        1. Several of the warning-level validations were not setting `warning=True`, and were thus defaulting to `False`. This will prevent that. I also fixed all of those instances.
        2. This removes the need for translation to `severity` when building JSON output.
    2. Use explicit args in the constructor, and pass all shared args on to the parent class. This removes the need for the `name`/`id` arg error handling.
6. Switch CLI output from a Python dict to JSON.
7. Roll back the `black` version used in the linting Action due to a bug in the latest version (psf/black#3953).

**Note:** Some of the files that I both moved _and_ changed now show as having the old file deleted and a new one created. I'm not sure why. I did the moves and changes in separate commits, which usually prevents this, but that doesn't seem to be the case here. Perhaps there's just so much change in some files that git considers them whole new files? 🤷 It's kind of annoying, especially if it results in losing git history for those files.
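To make the JSON-output and `SchemaErrors`-handling changes concrete, here is a rough sketch under assumed names; the findings structure and the `severities` lookup are illustrative stand-ins, not the validator's actual output format.

```python
# Hypothetical sketch: run lazy Pandera validation, catch SchemaErrors, and
# emit JSON findings tagged with an enum-style severity value.
import json

import pandas as pd
import pandera as pa
from pandera.errors import SchemaErrors


def validate_to_json(schema: pa.DataFrameSchema, df: pd.DataFrame, severities: dict[str, str]) -> str:
    """`severities` maps a check name to "error" or "warning" (the enum values);
    it stands in for looking severity up on an SBLCheck-style object."""
    findings = []
    try:
        schema.validate(df, lazy=True)  # lazy=True collects all failures before raising
    except SchemaErrors as err:
        # failure_cases is a DataFrame with one row per failing value.
        for _, case in err.failure_cases.iterrows():
            findings.append(
                {
                    "column": case["column"],
                    "check": case["check"],
                    "failed_value": str(case["failure_case"]),
                    "severity": severities.get(case["check"], "error"),
                }
            )
    return json.dumps(findings, indent=2)
```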