201 restrict number of validation findings processed #202

Merged
20 commits merged into main on Jun 3, 2024

Conversation

@jcadam14 (Contributor) commented May 31, 2024

Closes #196
Closes #201

Lots of updates. Part of this was in the original memory improvements PR #197. Copied comment:

  • Removed static validation data (description, name, link, severity, scope) from the initial error dataframe to drastically reduce its size
  • Changed df_to_download to use groupby before pivot to reduce the size of the to_csv dataframes, concatenate the to_csv outputs and prepend headers, fixed the two-field swap that wasn't occurring, and added description, link, etc. during the groupby
  • Created a df_to_dicts function so clients (i.e. the filing-api) can use the "json" object without having to ujson.dumps/ujson.loads it, which takes up processing time for larger data sets (see the sketch after this list)
  • Created get_all_checks and find_check to look up a check and pull its static validation data into the final results
  • Updated pytests
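For illustration, a rough sketch of how a client such as the filing-api might consume df_to_dicts instead of round-tripping the findings through ujson. The module paths, the max_records keyword, and the surrounding variables (context, df, max_errors) are assumptions for the sketch, not the exact API merged here:

```python
# Hypothetical sketch; module paths and the max_records keyword are assumptions.
from regtech_data_validator.create_schemas import get_phase_2_schema_for_lei, validate
from regtech_data_validator.data_formatters import df_to_dicts

# context/df/max_errors stand in for the caller's LEI context, register
# dataframe, and configured findings cap.
results = validate(get_phase_2_schema_for_lei(context), df, max_errors)

# The list of dicts can be serialized once at the response boundary instead of
# ujson.dumps-ing in the validator and ujson.loads-ing again in the filing-api.
findings = df_to_dicts(results.findings, max_records=10_000)
for group in findings:
    ...  # inspect or persist each validation group
```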

cfpb/sbl-filing-api#257 was written to update the filing-api to use df_to_dicts and to change how the submission state is determined, since the validation df will no longer include severity.

Question: I left df_to_table, df_to_csv, and df_to_str alone, without adding the static data. I'm not sure these are actually used anywhere other than the tests (which I updated to just have the trimmed-down error dataframe columns). Do we want to add the static data to those, too?

For limiting error and JSON size, added the concept of passed-in maxes to the validation and df_to_json/df_to_dicts functions. These maxes are used to compute 'ratios' of each validation group's errors or records relative to the original total number of errors; if the total exceeds the max, those ratios are applied against the max to calculate new error totals per validation group.

Also added, for df_to_json, a 'static' max records per group that can be passed in to skip the ratio calculations and return the same number of errors for every group.
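As a rough illustration of the ratio and static-max ideas (this is not the PR's actual implementation; the function and its arguments are made up for the example), each group's allowance can be derived from its share of the total findings, or replaced by a flat per-group cap:

```python
import math

def per_group_limits(group_counts: dict[str, int], max_errors: int,
                     static_per_group: int | None = None) -> dict[str, int]:
    """Hypothetical sketch of the max/ratio concept described above."""
    total = sum(group_counts.values())
    if total <= max_errors:
        return dict(group_counts)  # under the cap, nothing to trim
    if static_per_group is not None:
        # 'static' mode: same cap for every validation group
        return {vid: min(count, static_per_group) for vid, count in group_counts.items()}
    # ratio mode: each group keeps its proportional share of max_errors
    return {
        vid: max(1, math.floor(max_errors * count / total))
        for vid, count in group_counts.items()
    }

# Example: 3 validation groups, 1000 total findings, capped at 100
print(per_group_limits({"E0001": 600, "E0100": 300, "W0003": 100}, max_errors=100))
# -> {'E0001': 60, 'E0100': 30, 'W0003': 10}
```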

Added a first row below the download CSV header that warns that the submission has more errors than can be returned. The download formatter does not do any ratio/max truncation; that happens earlier, in the validation step. I could change this function to further truncate the download report's results, like df_to_json does, but currently that doesn't seem to be needed.
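A minimal sketch of the kind of warning row described here, written just below the CSV header when the submission had more findings than could be returned; the function name, wording, and layout are assumptions, not the PR's formatter:

```python
import io
import pandas as pd

def to_download_csv(report_df: pd.DataFrame, total_errors: int, returned_errors: int) -> str:
    """Hypothetical sketch: header line, optional warning row, then the findings."""
    buffer = io.StringIO()
    # write the header line only
    report_df.head(0).to_csv(buffer, index=False)
    if returned_errors < total_errors:
        # single-cell warning row directly below the header
        buffer.write(
            f'"Your register contains {total_errors} findings, '
            f'but only {returned_errors} are included in this report."\n'
        )
    # findings without repeating the header
    report_df.to_csv(buffer, index=False, header=False)
    return buffer.getvalue()
```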

Added a ValidationResults class to hold all the info from validation (single-field counts, multi-field counts, register counts, etc.), since the amount of data that needs to come out of validation has grown a bit.

Updated pytests to test the max concepts, and beefed up the data the existing pytests were using so that we hit more cases than we previously did.

@jcadam14 jcadam14 self-assigned this May 31, 2024
@jcadam14 jcadam14 linked an issue May 31, 2024 that may be closed by this pull request
github-actions bot commented May 31, 2024

Coverage report

File | Statements | Missing | Coverage | Coverage (new stmts) | Lines missing
src/regtech_data_validator/
  cli.py
  create_schemas.py
  data_formatters.py 190
  phase_validations.py
  validation_results.py
Project Total

This report was generated by python-coverage-comment-action

results.phase = ValidationPhase.SYNTACTICAL.value
return results

results = validate(get_phase_2_schema_for_lei(context), df)
lchen-2101 (Collaborator) commented:

guessing this one should have the max_errors passed in as well?

jcadam14 (Contributor, Author) replied:

Updated, missed that.

results = validate(get_phase_1_schema_for_lei(context), df, max_errors)

if not results.is_valid:
    results.phase = ValidationPhase.SYNTACTICAL.value
lchen-2101 (Collaborator) commented:

Small code cleanliness nitpick (don't do it if it changes too many things): it feels like results should be immutable coming out of validate, so the phase should be part of the results instead of having to be set here.

jcadam14 (Contributor, Author) replied on Jun 3, 2024:

Nope, that makes sense. Added PHASE_1 and PHASE_2 statics that are used by the validation() function to determine the phase of the results based on the DataFrameSchema name.

jcadam14 (Contributor, Author) added on Jun 3, 2024:

Actually you know what, I might remove the PHASE_1 and PHASE_2 and use SYNTACTICAL and LOGICAL in all those places instead. Give me a little bit to see how big a change that would be.
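For illustration only, a sketch of what schema-name-based phase determination could look like using the ValidationPhase values directly, as discussed above; the import path and schema construction are assumptions, not the merged code:

```python
# Hypothetical sketch (not the merged code): the phase is derived from the
# DataFrameSchema's name rather than mutated onto results by the caller.
import pandera as pa

from regtech_data_validator.checks import ValidationPhase  # assumed import path

syntactical_schema = pa.DataFrameSchema(name=ValidationPhase.SYNTACTICAL.value)
logical_schema = pa.DataFrameSchema(name=ValidationPhase.LOGICAL.value)

def phase_for(schema: pa.DataFrameSchema) -> str:
    # The schema name is the phase, so validate() can set it on the results
    # it builds instead of callers assigning results.phase afterwards.
    return schema.name
```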

@jcadam14 jcadam14 requested a review from lchen-2101 June 3, 2024 19:49
register_count: int
is_valid: bool
findings: pd.DataFrame
phase: ValidationPhase = None
lchen-2101 (Collaborator) commented:

Let's get rid of the default placeholder now; also let's slap frozen=True on the decorator since we aren't modifying it anymore.

jcadam14 (Contributor, Author) replied:

Very cool, I learned a thing.
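For context, a minimal sketch of what the suggested change might look like, based on the fields shown in the diff above; the import path and the elided count fields are assumptions:

```python
from dataclasses import dataclass

import pandas as pd

from regtech_data_validator.checks import ValidationPhase  # assumed import path

@dataclass(frozen=True)  # immutable, per the review suggestion
class ValidationResults:
    # ...single-field / multi-field / register count fields elided...
    register_count: int
    is_valid: bool
    findings: pd.DataFrame
    phase: ValidationPhase  # required now; no default placeholder
```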

@lchen-2101 (Collaborator) left a comment:

LGTM

@lchen-2101 merged commit c986982 into main on Jun 3, 2024
6 checks passed
@lchen-2101 deleted the 201-restrict-number-of-validation-findings-processed branch on June 3, 2024 at 21:26

Successfully merging this pull request may close these issues:

  • Restrict number of validation findings processed
  • Data validator memory improvements

2 participants