
Change filing-api to use df_to_dicts in data-validator #257

Closed
jcadam14 opened this issue May 28, 2024 · 0 comments · Fixed by #265
jcadam14 commented May 28, 2024

Based on cfpb/regtech-data-validator#196, update the filing-api to use df_to_dicts, removing the need to call ujson.loads on the df_to_json output.

Update the validation handler to use the built JSON results instead of the validation dataframe to determine submission state, since the validation dataframe no longer includes severity.

Update poetry.lock to pull in latest data-validator.

This should be reviewed and merged after cfpb/regtech-data-validator#202 is approved.
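The round trip this issue removes can be sketched as below. These are simplified stand-ins for the real df_to_json/df_to_dicts helpers in regtech-data-validator (which take a findings DataFrame); the point is the call pattern, not the actual signatures.

```python
import json

# Hypothetical stand-ins for the data-validator helpers named in this issue.
def df_to_dicts(findings):
    # New path: return the built results directly as Python objects.
    return [{"validation_id": vid, "records": count} for vid, count in findings]

def df_to_json(findings):
    # Old path: build the same result dicts, then serialize them to a string.
    return json.dumps(df_to_dicts(findings))

findings = [("E0001", 3), ("W0010", 7)]

# Before: the filing-api had to round-trip through a string.
via_json = json.loads(df_to_json(findings))

# After: the same objects, with no serialize/deserialize step.
via_dicts = df_to_dicts(findings)
```

For large submissions the dumps/loads pair is pure overhead, which is why the issue asks the filing-api to consume the dicts directly.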

@jcadam14 jcadam14 self-assigned this May 28, 2024
lchen-2101 pushed a commit to cfpb/regtech-data-validator that referenced this issue Jun 3, 2024
Closes #196
Closes #201 

Lots of updates. Part of this was in the original memory-improvements PR
#197. Copied comment:
- Removed static validation data (description, name, link, severity,
scope) from the initial error dataframe to drastically reduce its size
- Changed df_to_download to use groupby before pivot to reduce the size
of the to_csv dataframes; concatenated the per-group to_csv outputs and
prepended the header; fixed a two-field swap that wasn't occurring; and
added description, link, etc. during the group step
- Created a df_to_dicts function so clients (i.e. the filing-api) can use
the "json" object without the ujson.dumps/ujson.loads round trip, which
takes up processing time for larger data sets
- Created get_all_checks and find_check to look up a check and pull its
static validation data into the final output
- Updated pytests
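The groupby-before-pivot idea in the df_to_download bullet can be sketched as follows. The column names here are assumptions, not the validator's actual schema; the point is that each validation group is rendered to CSV on its own and the header is prepended once, so no single wide frame ever holds every finding.

```python
import pandas as pd

# Hypothetical error DataFrame shaped like the trimmed-down findings
# described above (column names are illustrative).
errors = pd.DataFrame({
    "validation_id": ["E0001", "E0001", "E0002"],
    "record_no": [1, 2, 5],
    "field_name": ["uid", "uid", "ct_credit_product"],
    "field_value": ["", "", "999"],
})

# Render each validation group's CSV separately, concatenate the chunks,
# and prepend a single header line.
header = ",".join(errors.columns) + "\n"
chunks = [
    group.to_csv(index=False, header=False)
    for _, group in errors.groupby("validation_id", sort=True)
]
csv_text = header + "".join(chunks)
```

Working group-by-group keeps peak memory proportional to the largest group rather than the whole findings set, which matches the memory-improvement goal of PR #197.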

cfpb/sbl-filing-api#257 was written to update the filing-api to use
df_to_dicts and to change how the submission state is determined, since
the validation df will no longer have severity.

Question: I left df_to_table, df_to_csv, and df_to_str alone, without
adding the static data. I'm not sure whether these are actually used
anywhere other than the tests (which I updated to use just the
trimmed-down error dataframe columns). Do we want to add the static data
to those, too?

To limit error and JSON size, I added the concept of passed-in maxes to
the validation and df_to_json/df_to_dicts functions. These maxes are used
to compute ratios of each validation group's errors (or records) relative
to the original total number of errors; if the total exceeds the max,
those ratios are applied against the max to calculate new per-group error
totals.

Also added, for df_to_json, a 'static' max records per group that can be
passed in to skip the ratio calculations and instead return the same
number of errors for every group.
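The ratio calculation described above can be sketched like this. The function name and the rounding choice (floor) are assumptions for illustration; the commit only specifies the proportional-share idea.

```python
import math

def scaled_group_maxes(group_counts, max_errors):
    """Hypothetical sketch of the ratio-based cap: when the total error
    count exceeds the max, scale each validation group's allowance by its
    share of the original total."""
    total = sum(group_counts.values())
    if total <= max_errors:
        # Under the cap: every group keeps all of its errors.
        return dict(group_counts)
    return {
        group: math.floor(max_errors * count / total)
        for group, count in group_counts.items()
    }

# 1000 total errors against a cap of 100: each group keeps its
# proportional share of the budget.
caps = scaled_group_maxes({"E0001": 600, "W0010": 300, "W0011": 100}, max_errors=100)
```

The 'static' max-records-per-group option mentioned above would bypass this entirely and return the same fixed count for every group.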

Added a first row below the download CSV header that warns when the
submission has more errors than can be returned. The download formatter
**does not** do any ratio/max truncation; that happens earlier, during
validation. I could change this function to further truncate the download
report's results, as df_to_json does, but currently there doesn't seem to
be a need.

Added a ValidationResults class to hold all the info coming out of
validation (single-field counts, multi-field counts, register counts,
etc.), since the amount of data returned by validation has grown a bit.
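A plausible shape for that container, sketched as a dataclass. The actual ValidationResults class in regtech-data-validator may differ in field names and structure; the error_count property is an assumption added here for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical shape for the ValidationResults container described above.
@dataclass
class ValidationResults:
    phase: str                                   # e.g. "SYNTACTICAL" or "LOGICAL"
    findings: list = field(default_factory=list) # the built dict results
    single_field_count: int = 0
    multi_field_count: int = 0
    register_count: int = 0

    @property
    def error_count(self) -> int:
        # Convenience total across the per-scope counts.
        return self.single_field_count + self.multi_field_count + self.register_count

results = ValidationResults(phase="LOGICAL", single_field_count=2, register_count=1)
```

Bundling the counts with the findings lets callers like the filing-api make state decisions without re-deriving totals from the dataframe.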

Updated pytests to cover the max concepts, and beefed up the data the
current pytests use so that we hit more cases than we previously had.
@jcadam14 jcadam14 linked a pull request Jun 3, 2024 that will close this issue
lchen-2101 pushed a commit that referenced this issue Jun 5, 2024
Closes #257 

- Updated submission_processor to use df_to_dicts instead of df_to_json
- Updated submission_processor to use ValidationResults instead of a tuple
- Updated the submission state logic: the submission is successful when
results.findings is empty; it has errors when results.phase is
SYNTACTICAL (syntactical findings are always errors) or when the error
counts are greater than 0; otherwise there were only warnings. This was
done because the validation results dataframe no longer has severity.
- Added config Settings with values for max validation errors, max JSON
errors, and max records per group (JSON)
- Updated pytests to work with all of this
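The state logic described in the bullets above can be sketched as follows. The string states stand in for the filing-api's actual SubmissionState enum values, and the function name is hypothetical.

```python
from types import SimpleNamespace

def determine_state(results):
    """Hypothetical sketch of the severity-free state decision."""
    if not results.findings:
        return "VALIDATION_SUCCESSFUL"      # no findings at all
    if results.phase == "SYNTACTICAL":
        return "VALIDATION_WITH_ERRORS"     # syntactical findings are always errors
    if results.error_count > 0:
        return "VALIDATION_WITH_ERRORS"
    return "VALIDATION_WITH_WARNINGS"       # findings present, none are errors

# Stand-ins for ValidationResults objects covering the three branches.
clean = SimpleNamespace(findings=[], phase="LOGICAL", error_count=0)
warned = SimpleNamespace(findings=[{"id": "W0010"}], phase="LOGICAL", error_count=0)
failed = SimpleNamespace(findings=[{"id": "E0001"}], phase="SYNTACTICAL", error_count=1)
```

Because the dataframe no longer carries severity, the phase and the counts together are enough to classify the submission.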

I didn't include any local.env file updates to set values for the maxes.
I can add those if desired.
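A minimal stand-in for those config Settings, showing why missing local.env entries are harmless: in-code defaults apply. The filing-api likely uses a pydantic Settings class, and the environment variable names and default values here are assumptions for illustration.

```python
import os

# Hypothetical sketch of the max-related config Settings described above.
class Settings:
    def __init__(self, env=None):
        env = os.environ if env is None else env
        # Each cap falls back to an illustrative default when the
        # corresponding environment variable is not set.
        self.max_validation_errors = int(env.get("MAX_VALIDATION_ERRORS", 1_000_000))
        self.max_json_errors = int(env.get("MAX_JSON_ERRORS", 10_000))
        self.max_json_group_size = int(env.get("MAX_JSON_RECORDS_PER_GROUP", 200))

# With no local.env values set, the in-code defaults apply.
settings = Settings(env={})
```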