refactor: standardize repo structure and other prep for open-sourcing #60

hkeeler · 2023-10-17T07:29:21Z

This PR is a bit of a grab bag of tune-ups in prep for open-sourcing this repo. It includes:

1. Restructure the repo to be more compliant with modern Python projects.
   1. Move `tests` out to top-level directory.
   2. Rename `src/validator` to `regtech_data_validator`.
2. Consolidate external datasource transform code and source data under top-level `data` directory.
   1. Move `config.py` settings into their respective scripts; file paths are now passed in as CLI args instead.
3. Move processed CSV files into the project itself. This allows for simpler data lookups by package name via `importlib.resources`, which in turn allowed the removal of the `ROOT_PATH` Python path logic in all of the `__init__.py` files.
4. Refactor `global_data.py` to load data only once, when the module is first imported.
5. Refactor `SBLCheck`:
   1. Replace `warning: bool` with a more explicit `severity`, backed by an enum that only allows `ERROR` and `WARNING`.
      1. Several of the warning-level validations were not setting `warning=True`, and were thus defaulting to `False`. This will prevent that. I also fixed all these instances.
      2. Removes the need for translation to `severity` when building JSON output.
   2. Use explicit args in the constructor, and pass all shared args on to parent class. This removes the need for the arg `name`/`id` error handling.
6. Switch CLI output from Python dict to JSON.
7. Roll back `black` version used in linting Action due to bug in latest version.
   - GitHub Action fails psf/black#3953
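
As a rough illustration of why switching the CLI output from a Python dict to JSON matters: printing a dict emits its `repr`, which uses single quotes and `False`/`None`, so other tools cannot parse it as JSON. The sketch below is only an example; the `findings` structure and its field names are invented here and are not the validator's actual output schema.

```python
import json

# Hypothetical validation findings; the field names are illustrative only.
findings = [
    {"id": "E0001", "severity": "error", "passed": False, "detail": None},
]

# What `print(findings)` would emit: the Python repr of the list.
as_repr = str(findings)

# What a JSON-emitting CLI would print instead.
as_json = json.dumps(findings, indent=2)


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(as_repr))  # False
print(is_valid_json(as_json))  # True
```

The JSON form round-trips through `json.loads`, which is what makes the output usable from scripts and other languages.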

Note: Some of the files that I both moved and changed seem to now show as having deleted the old file and created a new one. I'm not sure why it's doing this. I did the moves and changes in separate commits, which usually prevents this, but doesn't seem to be the case here. Perhaps there's just so much change in some that git considers it a whole new file? 🤷 It's kind of annoying, especially if it results in losing git history for those files.

- Fixed all imports
- Fixed test and coverage settings in `pyproject.toml`
- Removed all Python path magic in `__init__.py` files
- Moved data files into the repo, and used `importlib` to load files by package name instead of path. This is more portable, especially once we turn this into a distributable package.
- Refactored `global_data` to only load data once at module load time
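
A minimal sketch of the package-relative loading idea described above, assuming Python 3.9+ for `importlib.resources.files`. The package name `demo_pkg.data`, the file `codes.csv`, and its columns are hypothetical stand-ins, not the repo's real names; the throwaway package is built in a temp directory only so the example runs on its own.

```python
import csv
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path

# Build a throwaway package so the sketch is runnable anywhere. In a real
# project the CSV would simply be committed alongside the package's modules.
_tmp = Path(tempfile.mkdtemp())
_data_dir = _tmp / "demo_pkg" / "data"
_data_dir.mkdir(parents=True)
(_tmp / "demo_pkg" / "__init__.py").write_text("")
(_data_dir / "__init__.py").write_text("")
(_data_dir / "codes.csv").write_text(
    "code,title\n111,Crop Production\n112,Animal Production\n"
)
sys.path.insert(0, str(_tmp))
importlib.invalidate_caches()


def load_codes(package: str, filename: str) -> dict[str, str]:
    """Read a two-column CSV shipped inside `package` into a dict."""
    csv_file = resources.files(package).joinpath(filename)
    with csv_file.open("r", encoding="utf-8", newline="") as f:
        return {row["code"]: row["title"] for row in csv.DictReader(f)}


# Module-level assignment: in a real module this runs once, the first time
# the module is imported; later imports reuse the cached module object.
CODES = load_codes("demo_pkg.data", "codes.csv")

print(CODES["111"])  # Crop Production
```

Because the lookup goes through the package name rather than a filesystem path, the same code keeps working when the package is installed from a wheel instead of run from a checkout.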

- Move file format settings from `config.py` into respective data transform scripts
- Move src/dest file settings from `config.py` to CLI args
- Use consistent CLI arg and file exists handling
- Add `openpyxl` dependency for handling NAICS Excel reading
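
One way the src/dest CLI args and file-exists handling could look, sketched with the standard-library `argparse`. The argument names, the error messages, and the choice to refuse to overwrite an existing destination are all assumptions made for the example; the PR does not spell out the scripts' actual behavior.

```python
import argparse
import os
import tempfile


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the two positional paths a transform script would need."""
    parser = argparse.ArgumentParser(description="Transform a raw source file into a CSV.")
    parser.add_argument("src", help="path to the raw input file")
    parser.add_argument("dest", help="path to write the processed CSV to")
    return parser.parse_args(argv)


def check_paths(src: str, dest: str) -> list[str]:
    """Return a list of human-readable problems; empty means OK to proceed."""
    problems = []
    if not os.path.isfile(src):
        problems.append(f"source file not found: {src}")
    if os.path.exists(dest):
        # Assumption: refuse to overwrite rather than clobber existing output.
        problems.append(f"destination already exists: {dest}")
    return problems


def main(argv: list[str]) -> int:
    """Return a process exit status: 0 on success, 1 on bad paths."""
    args = parse_args(argv)
    problems = check_paths(args.src, args.dest)
    for problem in problems:
        print(f"error: {problem}")
    if problems:
        return 1
    # The actual transform would run here.
    return 0


# Demo: first call fails because the source is missing, second succeeds.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "raw.txt")
    dest = os.path.join(tmp, "out.csv")
    missing_status = main([src, dest])
    open(src, "w").close()
    ok_status = main([src, dest])
```

Sharing one `check_paths`-style helper across scripts is what keeps the handling consistent between them.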

- Use required enum-based `severity` arg over boolean `warning` with default to false. This default is likely partially the cause of several warning-level validations being set to error.
- Remove `test_checks.py` as those tests are no longer needed with refactored `SBLCheck`.
- Refactor JSON output to use `severity` enum value
- Refactor exception handling for Pandera's `SchemaErrors`
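
The idea of a required, enum-backed `severity` can be sketched as follows. This is not the project's `SBLCheck` (which builds on Pandera); `BaseCheck` and `DemoCheck` are hypothetical stand-ins, and the lowercase enum values are an assumption. The point is that a keyword-only argument with no default cannot be silently omitted, and the enum value can go straight into JSON output.

```python
import json
from enum import Enum


class Severity(str, Enum):
    """Only two levels are allowed; the string values are assumed."""
    ERROR = "error"
    WARNING = "warning"


class BaseCheck:
    """Hypothetical parent class that receives the shared args."""

    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description


class DemoCheck(BaseCheck):
    """Stand-in for a check with explicit, required constructor args."""

    def __init__(self, *, id: str, name: str, description: str, severity: Severity):
        super().__init__(name=name, description=description)
        self.id = id
        # Severity(...) raises ValueError for anything but the two levels.
        self.severity = Severity(severity)

    def to_dict(self) -> dict:
        # No bool-to-string translation needed: the enum carries the value.
        return {"id": self.id, "name": self.name, "severity": self.severity.value}


check = DemoCheck(
    id="W0001",
    name="demo.warning_check",
    description="An illustrative warning-level check.",
    severity=Severity.WARNING,
)
payload = json.dumps(check.to_dict())
print(payload)
```

Omitting `severity` raises a `TypeError` at construction time, so a warning-level check can no longer fall back to error by accident.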

github-actions · 2023-10-17T07:30:33Z

Coverage report

The coverage rate went from 93.97% to 85.13% ⬇️
The branch rate is 82%.

81.25% of new lines are covered.

Diff Coverage details

regtech_data_validator/checks.py

100% of new lines are covered (100% of the complete file).

regtech_data_validator/create_schemas.py

90% of new lines are covered (93.24% of the complete file).
Missing lines: 57, 62

regtech_data_validator/global_data.py

100% of new lines are covered (100% of the complete file).

regtech_data_validator/main.py

0% of new lines are covered (0% of the complete file).
Missing lines: 8, 13, 20, 26, 27, 29, 32, 34, 49, 50

regtech_data_validator/phase_validations.py

100% of new lines are covered (100% of the complete file).

regtech_data_validator/schema_template.py

100% of new lines are covered (100% of the complete file).

aharjati · 2023-10-18T19:59:24Z

pyproject.toml

@@ -16,6 +16,9 @@ pytest-cov = "4.1.0"
 black = "23.3.0"
 ruff = "0.0.259"

+[tool.poetry.group.data.dependencies]
+openpyxl = "^3.1.2"


Is this used? I may have missed the usage

Yeah, it's required by the NAICS code processing script.

I can add that detail to the new README I added for that dataset. Each of those could use instructions on how to run those two scripts too.

aharjati · 2023-10-18T20:01:01Z

This file needs to be updated to use the new tests path: `.../regtech-data-validator/.devcontainer/devcontainer.json`

"python.testing.pytestArgs": [
  "--rootdir",
  "${workspaceFolder}/tests"
]

hkeeler · 2023-10-18T22:30:05Z

@aharjati, I haven't been running this in the DevContainer setup. What's the best way to test the fix? Is it basically just checking that the built-in VSCode test runner works as expected?

hkeeler · 2023-10-18T22:46:09Z

I think we're good. This seems to work now...

hkeeler · 2023-10-18T22:51:37Z

...and now black seems to be failing on the PR Check, though it doesn't seem related to the code itself. Maybe there's something going on environment-wise with GitHub Actions at the moment? I tried kicking the Action back off manually, but it failed the same way. I'll try again later. 😕

aharjati · 2023-10-19T20:40:49Z

> ...and now black seems to be failing on the PR Check, though it doesn't seem related to the code itself. Maybe there's something going on environment-wise with GitHub Actions at the moment? I tried kicking the Action back off manually, but it failed the same way. I'll try again later. 😕

We may need to update this list:
https://github.com/cfpb/regtech-data-validator/blob/pre-open-source/pyproject.toml#L39

hkeeler · 2023-10-20T07:29:09Z

The build is back ~~in black 🤘~~ to ✅ . Looks like there's a bug in the latest version (psf/black#3953). I pinned it to the last major version, and it's happy again. We can roll it forward once they have a fix up for that.

aharjati · 2023-10-20T14:56:29Z

> The build is back ~~in black 🤘~~ to ✅ . Looks like there's a bug in the latest version (psf/black#3953). I pinned it to the last major version, and it's happy again. We can roll it forward once they have a fix up for that.

Thanks!

aharjati

lgtm


hkeeler added 8 commits October 16, 2023 19:11

Move files into more standard Python project layout

722b981

Move data-related code and config under data dir

ccee738

Add README for external data sources

9044db8

Fix multi-line string that was setup as a tuple

58873bc

Print CLI output as JSON instead of Python dict.

b732419

hkeeler requested a review from aharjati October 17, 2023 07:29

black and ruff fixups

3b18289

aharjati reviewed Oct 18, 2023


Fix path to tests dir in DevContainer setup

5a515e9

Remove tools dir from black exclude list

3901ca3

hkeeler requested a review from aharjati October 20, 2023 06:45

hkeeler added 2 commits October 20, 2023 03:03

Add --verbose to black Action to debug failures

a86f79e

Will reverting Action's black version back help?

3793bca

aharjati approved these changes Oct 20, 2023


hkeeler changed the title ~~[WIP] refactor: standardize repo structure and other prep for open-sourcing~~ refactor: standardize repo structure and other prep for open-sourcing Oct 20, 2023

hkeeler merged commit ba6a1c4 into main Oct 20, 2023
3 checks passed

hkeeler deleted the pre-open-source branch October 20, 2023 23:56
