Change filing-api to use df_to_dicts in data-validator #257
This was referenced May 28, 2024
lchen-2101 pushed a commit to cfpb/regtech-data-validator that referenced this issue on Jun 3, 2024:
Closes #196 Closes #201

Lots of updates. Part of this was in the original memory improvements PR #197. Copied comment:

- Removed static validation data (description, name, link, severity, scope) from the initial error dataframe to drastically reduce its size
- Changed df_to_download to use groupby before pivot to reduce the size of the to_csv dataframes, concat the to_csvs and prepend headers; fixed a two-field swap that wasn't occurring; add description, link, etc. during grouping
- Created a df_to_dicts function to allow clients (i.e. filing-api) to use the "json" object without having to ujson.dumps/ujson.loads, which takes up processing time for larger data sets
- Created get_all_checks and find_check to pull out the static validation data for a check for the final data
- Updated pytests

cfpb/sbl-filing-api#257 was written to update the filing-api to use df_to_dicts and to change how the submission state is determined, since the validation dataframe will no longer have severity.

Question: I left df_to_table, df_to_csv, and df_to_str alone, without adding the static data. I'm not sure whether these are actually used anywhere other than the tests (which I updated to use just the trimmed-down error dataframe columns). Do we want to add that to those, too?

For limiting error and JSON size, added the concept of passed-in maxes to the validation and df_to_json/df_to_dicts functions. These maxes are used to create "ratios" of validation-group errors or records relative to the original total number of errors; if the total exceeds the max, the ratios are applied against the max to calculate new error totals per validation group. Also added, for df_to_json, a "static" max records per group that can be passed in to skip the ratio calculations and instead return the same number of errors per group. Added a first row below the download CSV header with a warning when the submission has more errors than can be returned.

The download formatter **does not** do any ratio/max setting; that is done earlier, in the validation section. I can change this function to further truncate the download report results, like df_to_json does, but currently that doesn't seem to be needed. Added a ValidationResults class to hold all the info from validation (single-field counts, multi-field counts, register counts, etc.), since the amount of data needing to come out of validation has grown a bit. Updated pytests to test the max concepts, and beefed up the data the current pytests were using so that we hit more cases than we previously had.
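The ratio-based truncation described above can be sketched roughly like this. This is a minimal illustration, not the actual regtech-data-validator code: the function name `truncate_group_errors` and the `validation_id` column are hypothetical stand-ins for whatever the real findings dataframe uses. Each validation group keeps a share of the overall cap proportional to its share of the total error count, with a floor of one row per group.

```python
import pandas as pd


def truncate_group_errors(findings: pd.DataFrame, max_errors: int) -> pd.DataFrame:
    """Sketch: cap total errors at `max_errors`, split proportionally by group."""
    total = len(findings)
    if total <= max_errors:
        # Under the cap: nothing to trim.
        return findings
    kept = []
    for _, group in findings.groupby("validation_id"):
        # Each group's ratio of the total, applied to the overall cap
        # (keep at least one row so no group disappears entirely).
        keep = max(1, int(len(group) / total * max_errors))
        kept.append(group.head(keep))
    return pd.concat(kept)
```

A "static" per-group max, by contrast, would skip the ratio and simply `head(n)` every group.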
lchen-2101 pushed a commit that referenced this issue on Jun 5, 2024:
Closes #257

- Updated submission_processor to use df_to_dicts instead of df_to_json
- Updated submission_processor to use ValidationResults instead of a tuple
- Updated the submission state logic to be based on results.findings being empty (successful), results.phase being SYNTACTICAL (always errors), or error counts being greater than 0; otherwise there were just warnings. This was done since the validation results dataframe doesn't have severity anymore.
- Added config Settings values for max validation errors, max json errors, and max records per group (json)
- Updated pytests to work with all this

I didn't include any local.env file updates to set values for the maxes. I can add those if desired.
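The state logic described above can be sketched as follows. This is a hedged illustration only: the `ValidationResults` field names, the `ValidationPhase` enum, and the state strings are assumptions modeled on the description, not the actual sbl-filing-api definitions.

```python
from dataclasses import dataclass, field
from enum import Enum


class ValidationPhase(Enum):
    SYNTACTICAL = "SYNTACTICAL"
    LOGICAL = "LOGICAL"


@dataclass
class ValidationResults:
    # Hypothetical shape of the results container mentioned above.
    phase: ValidationPhase
    findings: list = field(default_factory=list)
    error_counts: int = 0
    warning_counts: int = 0


def submission_state(results: ValidationResults) -> str:
    # No findings at all -> the submission validated cleanly.
    if not results.findings:
        return "VALIDATION_SUCCESSFUL"
    # Syntactical findings are always errors, as are nonzero error counts.
    if results.phase is ValidationPhase.SYNTACTICAL or results.error_counts > 0:
        return "VALIDATION_WITH_ERRORS"
    # Anything left is warnings only.
    return "VALIDATION_WITH_WARNINGS"
```

Note the key point from the commit: severity is no longer read off the dataframe; the decision is driven entirely by phase and counts.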
Based on cfpb/regtech-data-validator#196, update filing-api to use df_to_dicts, removing the need to call ujson.loads on the df_to_json output.

Update the validation handler to use the built JSON results instead of the validation dataframe to determine submission state, since the validation dataframe no longer has severity.

Update poetry.lock to pull in the latest data-validator.
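The motivation for df_to_dicts can be sketched with a minimal stand-in. This is not the real regtech-data-validator function; the column names and output shape here are hypothetical. The point is that the findings are grouped and emitted as plain Python dicts, so the filing-api never has to serialize to a JSON string with ujson.dumps and immediately parse it back with ujson.loads.

```python
import pandas as pd


def df_to_dicts(findings: pd.DataFrame) -> list[dict]:
    """Sketch: turn a findings dataframe into the 'json' structure directly."""
    results = []
    for validation_id, group in findings.groupby("validation_id"):
        results.append({
            # One entry per validation check...
            "validation": {"id": validation_id},
            # ...with its offending records as plain dicts.
            "records": group.drop(columns="validation_id").to_dict("records"),
        })
    return results
```

With dicts in hand, the caller can inspect results directly, and only serialize once, at the API boundary, if at all.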
This should be reviewed and merged after cfpb/regtech-data-validator#202 is approved.