201 restrict number of validation findings processed #202

Merged
20 commits merged into main on Jun 3, 2024

Conversation

@jcadam14 (Contributor) commented May 31, 2024

Closes #196
Closes #201

Lots of updates. Part of this was in the original memory improvements PR #197. Copied comment:

  • Removed static validation data (description, name, link, severity, scope) from the initial error dataframe to drastically reduce its size
  • Changed df_to_download to use groupby before pivot to reduce the size of the to_csv dataframes, concatenate the to_csv outputs and prepend headers, fixed the two-field swap that wasn't occurring, and added description, link, etc. during the groupby
  • Created a df_to_dicts function so clients (i.e. the filing-api) can use the "json" object without having to ujson.dumps/ujson.loads it, which takes up processing time for larger data sets (see the sketch after this list)
  • Created get_all_checks and find_check to look up a check and pull its static validation data into the final results
  • Updated pytests
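For illustration, a rough sketch of how a client such as the filing-api might consume df_to_dicts instead of round-tripping the findings through ujson. The module paths, the max_records keyword, and the surrounding variables (context, df, max_errors) are assumptions for the sketch, not the exact API merged here:

```python
# Hypothetical sketch; module paths and the max_records keyword are assumptions.
from regtech_data_validator.create_schemas import get_phase_2_schema_for_lei, validate
from regtech_data_validator.data_formatters import df_to_dicts

# context/df/max_errors stand in for the caller's LEI context, register
# dataframe, and configured findings cap.
results = validate(get_phase_2_schema_for_lei(context), df, max_errors)

# The list of dicts can be serialized once at the response boundary instead of
# ujson.dumps-ing in the validator and ujson.loads-ing again in the filing-api.
findings = df_to_dicts(results.findings, max_records=10_000)
for group in findings:
    ...  # inspect or persist each validation group
```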

cfpb/sbl-filing-api#257 was written to update the filing-api to use df_to_dicts and to change how the submission state is determined, since the validation df will no longer include severity.

Question: I left df_to_table, df_to_csv, and df_to_str alone, without adding the static data. I'm not sure these are actually used anywhere other than the tests (which I updated to just have the trimmed-down error dataframe columns). Do we want to add the static data to those, too?

For limiting error and JSON size, added the concept of passed-in maxes to the validation and df_to_json/df_to_dicts functions. These maxes are used to compute 'ratios' of each validation group's errors or records relative to the original total number of errors; if the total exceeds the max, those ratios are applied against the max to calculate new error totals per validation group.

Also added, for df_to_json, a 'static' max records per group that can be passed in to skip the ratio calculations and return the same number of errors for every group.
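As a rough illustration of the ratio and static-max ideas (this is not the PR's actual implementation; the function and its arguments are made up for the example), each group's allowance can be derived from its share of the total findings, or replaced by a flat per-group cap:

```python
import math

def per_group_limits(group_counts: dict[str, int], max_errors: int,
                     static_per_group: int | None = None) -> dict[str, int]:
    """Hypothetical sketch of the max/ratio concept described above."""
    total = sum(group_counts.values())
    if total <= max_errors:
        return dict(group_counts)  # under the cap, nothing to trim
    if static_per_group is not None:
        # 'static' mode: same cap for every validation group
        return {vid: min(count, static_per_group) for vid, count in group_counts.items()}
    # ratio mode: each group keeps its proportional share of max_errors
    return {
        vid: max(1, math.floor(max_errors * count / total))
        for vid, count in group_counts.items()
    }

# Example: 3 validation groups, 1000 total findings, capped at 100
print(per_group_limits({"E0001": 600, "E0100": 300, "W0003": 100}, max_errors=100))
# -> {'E0001': 60, 'E0100': 30, 'W0003': 10}
```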

Added a first row below the download CSV header that warns that the submission has more errors than can be returned. The download formatter does not do any ratio/max truncation; that happens earlier, in the validation step. I could change this function to further truncate the download report's results, like df_to_json does, but currently that doesn't seem to be needed.
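A minimal sketch of the kind of warning row described here, written just below the CSV header when the submission had more findings than could be returned; the function name, wording, and layout are assumptions, not the PR's formatter:

```python
import io
import pandas as pd

def to_download_csv(report_df: pd.DataFrame, total_errors: int, returned_errors: int) -> str:
    """Hypothetical sketch: header line, optional warning row, then the findings."""
    buffer = io.StringIO()
    # write the header line only
    report_df.head(0).to_csv(buffer, index=False)
    if returned_errors < total_errors:
        # single-cell warning row directly below the header
        buffer.write(
            f'"Your register contains {total_errors} findings, '
            f'but only {returned_errors} are included in this report."\n'
        )
    # findings without repeating the header
    report_df.to_csv(buffer, index=False, header=False)
    return buffer.getvalue()
```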

Added a ValidationResults class to hold all the info from validation (single-field counts, multi-field counts, register counts, etc.), since the amount of data that needs to come out of validation has grown a bit.

Updated pytests to test the max concepts, and beefed up the data the existing pytests were using so that we hit more cases than we previously did.

@jcadam14 jcadam14 self-assigned this May 31, 2024
@jcadam14 jcadam14 linked an issue May 31, 2024 that may be closed by this pull request
github-actions bot commented May 31, 2024

Coverage report

File | Statements | Missing | Coverage | Coverage (new stmts) | Lines missing
src/regtech_data_validator/
  cli.py
  create_schemas.py
  data_formatters.py 190
  phase_validations.py
  validation_results.py
Project Total

This report was generated by python-coverage-comment-action

results.phase = ValidationPhase.SYNTACTICAL.value
return results

results = validate(get_phase_2_schema_for_lei(context), df)
lchen-2101 (Collaborator) commented:

guessing this one should have the max_errors passed in as well?

jcadam14 (Contributor, Author) replied:

Updated, missed that.

results = validate(get_phase_1_schema_for_lei(context), df, max_errors)

if not results.is_valid:
    results.phase = ValidationPhase.SYNTACTICAL.value
lchen-2101 (Collaborator) commented:

Small code cleanliness nitpick (don't do it if it changes too many things): it feels like results should be immutable coming out of validate, so the phase should be part of the results instead of having to be set here.

jcadam14 (Contributor, Author) replied on Jun 3, 2024:

Nope, that makes sense. Added PHASE_1 and PHASE_2 statics that are used by the validation() function to determine the phase of the results based on the DataFrameSchema name.

jcadam14 (Contributor, Author) added on Jun 3, 2024:

Actually you know what, I might remove the PHASE_1 and PHASE_2 and use SYNTACTICAL and LOGICAL in all those places instead. Give me a little bit to see how big a change that would be.
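For illustration only, a sketch of what schema-name-based phase determination could look like using the ValidationPhase values directly, as discussed above; the import path and schema construction are assumptions, not the merged code:

```python
# Hypothetical sketch (not the merged code): the phase is derived from the
# DataFrameSchema's name rather than mutated onto results by the caller.
import pandera as pa

from regtech_data_validator.checks import ValidationPhase  # assumed import path

syntactical_schema = pa.DataFrameSchema(name=ValidationPhase.SYNTACTICAL.value)
logical_schema = pa.DataFrameSchema(name=ValidationPhase.LOGICAL.value)

def phase_for(schema: pa.DataFrameSchema) -> str:
    # The schema name is the phase, so validate() can set it on the results
    # it builds instead of callers assigning results.phase afterwards.
    return schema.name
```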

@jcadam14 jcadam14 requested a review from lchen-2101 June 3, 2024 19:49
register_count: int
is_valid: bool
findings: pd.DataFrame
phase: ValidationPhase = None
lchen-2101 (Collaborator) commented:

Let's get rid of the default placeholder now; also let's slap frozen=True on the decorator since we aren't modifying it anymore.

jcadam14 (Contributor, Author) replied:

Very cool, I learned a thing.
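For context, a minimal sketch of what the suggested change might look like, based on the fields shown in the diff above; the import path and the elided count fields are assumptions:

```python
from dataclasses import dataclass

import pandas as pd

from regtech_data_validator.checks import ValidationPhase  # assumed import path

@dataclass(frozen=True)  # immutable, per the review suggestion
class ValidationResults:
    # ...single-field / multi-field / register count fields elided...
    register_count: int
    is_valid: bool
    findings: pd.DataFrame
    phase: ValidationPhase  # required now; no default placeholder
```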

@lchen-2101 (Collaborator) left a comment:

LGTM

@lchen-2101 merged commit c986982 into main on Jun 3, 2024
6 checks passed
@lchen-2101 deleted the 201-restrict-number-of-validation-findings-processed branch on June 3, 2024 at 21:26

Successfully merging this pull request may close these issues:

  • Restrict number of validation findings processed
  • Data validator memory improvements

2 participants