201 restrict number of validation findings processed #202
Conversation
results.phase = ValidationPhase.SYNTACTICAL.value
return results

results = validate(get_phase_2_schema_for_lei(context), df)
guessing this one should have the max_errors passed in as well?
Updated, missed that.
results = validate(get_phase_1_schema_for_lei(context), df, max_errors)

if not results.is_valid:
    results.phase = ValidationPhase.SYNTACTICAL.value
Small code cleanliness nitpick (don't do it if it changes too many things): it feels like results should be immutable coming out of validate, so the phase should be part of the results instead of having to be set here.
Nope, that makes sense. Added PHASE_1 and PHASE_2 statics that are used by the validation function to determine the phase of the results based on the DataFrameSchema name.
Actually, you know what, I might remove PHASE_1 and PHASE_2 and use SYNTACTICAL and LOGICAL in all those places instead. Give me a little bit to see how big a change that would be.
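As a rough sketch of the idea being discussed (not the actual implementation), deriving the phase from the DataFrameSchema name might look something like the following; the enum values and helper name are assumptions:

```python
from enum import Enum

import pandera as pa


class ValidationPhase(str, Enum):
    SYNTACTICAL = "Syntactical"
    LOGICAL = "Logical"


def phase_for_schema(schema: pa.DataFrameSchema) -> ValidationPhase:
    # Hypothetical helper: map the schema's name back to a phase so callers
    # never have to set results.phase after validate() returns.
    if schema.name == ValidationPhase.SYNTACTICAL.value:
        return ValidationPhase.SYNTACTICAL
    return ValidationPhase.LOGICAL


phase_1 = pa.DataFrameSchema(name=ValidationPhase.SYNTACTICAL.value)
phase_2 = pa.DataFrameSchema(name=ValidationPhase.LOGICAL.value)

assert phase_for_schema(phase_1) is ValidationPhase.SYNTACTICAL
assert phase_for_schema(phase_2) is ValidationPhase.LOGICAL
```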
register_count: int
is_valid: bool
findings: pd.DataFrame
phase: ValidationPhase = None
Let's get rid of the default placeholder now; also, let's slap frozen=True on the decorator since we aren't modifying it anymore.
Very cool, I learned a thing.
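For illustration, a minimal sketch of the frozen dataclass being discussed, with field names taken from the diff above; the phase values and the constructed example are assumptions:

```python
from dataclasses import dataclass
from enum import Enum

import pandas as pd


class ValidationPhase(str, Enum):
    SYNTACTICAL = "Syntactical"
    LOGICAL = "Logical"


@dataclass(frozen=True)
class ValidationResults:
    register_count: int
    is_valid: bool
    findings: pd.DataFrame
    phase: ValidationPhase  # no default placeholder; set once when validate() builds the results


results = ValidationResults(
    register_count=0,
    is_valid=True,
    findings=pd.DataFrame(),
    phase=ValidationPhase.SYNTACTICAL,
)
# results.phase = ValidationPhase.LOGICAL  # frozen=True makes this raise FrozenInstanceError
```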
LGTM
Closes #196
Closes #201
Lots of updates. Part of this was in the original memory improvements PR #197. Copied comment:
cfpb/sbl-filing-api#257 was written to update the filing-api to use df_to_dicts and to change how the submission state is determined, since the validation DataFrame will no longer have a severity column.
Question: I left df_to_table, df_to_csv, and df_to_str alone, without adding the static data. I'm not sure these are actually used anywhere other than the tests (which I updated to use just the trimmed-down error DataFrame columns). Do we want to add that to those functions, too?
To limit error and JSON size, I added the concept of passed-in maxes to the validation and df_to_json/df_to_dicts functions. These maxes are used to create "ratios" of each validation group's errors or records relative to the original total number of errors; if the total exceeds the max, those ratios are applied against the max to calculate new error totals per validation group.
I also added a "static" max records per group for df_to_json that can be passed in to skip the ratio calculations and return the same number of errors for every group instead.
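A rough sketch of the ratio idea described above (the helper name and group keys are made up for illustration; the real implementation may differ):

```python
import math


def trimmed_group_counts(errors_per_group: dict[str, int], max_errors: int) -> dict[str, int]:
    # Hypothetical sketch: scale each validation group's error count down in
    # proportion to its share of the original total whenever that total
    # exceeds the passed-in max.
    total = sum(errors_per_group.values())
    if total <= max_errors:
        return dict(errors_per_group)
    return {
        group: max(1, math.floor(count / total * max_errors))
        for group, count in errors_per_group.items()
    }


# Example: 10,000 findings capped at 1,000 keeps each group's relative share.
print(trimmed_group_counts({"E0100": 7000, "E0200": 2500, "W0300": 500}, max_errors=1000))
# {'E0100': 700, 'E0200': 250, 'W0300': 50}
```

The "static" per-group max for df_to_json would bypass this scaling and simply cap every group at the same fixed count.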
Added a first row below the download CSV header with a warning when the submission has more errors than can be returned. The download formatter does not do any ratio/max trimming itself; that happens earlier, in the validation step. I could change this function to further truncate the download report's results, like df_to_json does, but currently that doesn't seem to be needed.
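A hedged sketch of what that warning row might look like; the warning text, function name, and truncation flag are assumptions, not the actual formatter:

```python
import io

import pandas as pd

# Assumed warning text; the real report wording may differ.
MAX_ERRORS_WARNING = (
    "Warning: this submission has more findings than can be returned; "
    "the rows below are a truncated subset."
)


def df_to_download_csv(findings: pd.DataFrame, truncated: bool) -> str:
    buffer = io.StringIO()
    # Header line first, then (optionally) a single warning row right below it.
    buffer.write(",".join(findings.columns) + "\n")
    if truncated:
        buffer.write(MAX_ERRORS_WARNING + "\n")
    # Then the findings themselves, without repeating the header.
    findings.to_csv(buffer, index=False, header=False)
    return buffer.getvalue()
```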
Added a ValidationResults class to hold all the info from validation (single-field counts, multi-field counts, register counts, etc.), since the amount of data that needs to come out of validation has grown a bit.
Updated the pytests to cover the max concepts, and beefed up the data the existing pytests were using so that we hit more cases than we previously did.