
Change filing-api to use df_to_dicts in data-validator #257

Closed
jcadam14 opened this issue May 28, 2024 · 0 comments · Fixed by #265
jcadam14 commented May 28, 2024

Based on cfpb/regtech-data-validator#196, update the filing-api to use df_to_dicts, removing the need to call ujson.loads on the df_to_json output.

Update the validation handler to use the built JSON results instead of the validation dataframe to determine submission state, since the validation dataframe no longer includes severity.

Update poetry.lock to pull in latest data-validator.

This should be reviewed and merged after cfpb/regtech-data-validator#202 is approved.
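The round trip this issue removes can be sketched as below. These are simplified stand-ins for the real df_to_json/df_to_dicts helpers in regtech-data-validator (which take a findings DataFrame); the point is the call pattern, not the actual signatures.

```python
import json

# Hypothetical stand-ins for the data-validator helpers named in this issue.
def df_to_dicts(findings):
    # New path: return the built results directly as Python objects.
    return [{"validation_id": vid, "records": count} for vid, count in findings]

def df_to_json(findings):
    # Old path: build the same result dicts, then serialize them to a string.
    return json.dumps(df_to_dicts(findings))

findings = [("E0001", 3), ("W0010", 7)]

# Before: the filing-api had to round-trip through a string.
via_json = json.loads(df_to_json(findings))

# After: the same objects, with no serialize/deserialize step.
via_dicts = df_to_dicts(findings)
```

For large submissions the dumps/loads pair is pure overhead, which is why the issue asks the filing-api to consume the dicts directly.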

@jcadam14 jcadam14 self-assigned this May 28, 2024
lchen-2101 pushed a commit to cfpb/regtech-data-validator that referenced this issue Jun 3, 2024
Closes #196
Closes #201 

Lots of updates. Part of this was in the original memory-improvements PR
#197. Copied comment:
- Removed static validation data (description, name, link, severity,
scope) from the initial error dataframe to drastically reduce its size
- Changed df_to_download to use groupby before pivot to reduce the size
of the to_csv dataframes; concatenated the per-group to_csv outputs and
prepended the header; fixed a two-field swap that wasn't occurring; and
added description, link, etc. during the group step
- Created a df_to_dicts function so clients (i.e. the filing-api) can use
the "json" object without the ujson.dumps/ujson.loads round trip, which
takes up processing time for larger data sets
- Created get_all_checks and find_check to look up a check and pull its
static validation data into the final output
- Updated pytests
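The groupby-before-pivot idea in the df_to_download bullet can be sketched as follows. The column names here are assumptions, not the validator's actual schema; the point is that each validation group is rendered to CSV on its own and the header is prepended once, so no single wide frame ever holds every finding.

```python
import pandas as pd

# Hypothetical error DataFrame shaped like the trimmed-down findings
# described above (column names are illustrative).
errors = pd.DataFrame({
    "validation_id": ["E0001", "E0001", "E0002"],
    "record_no": [1, 2, 5],
    "field_name": ["uid", "uid", "ct_credit_product"],
    "field_value": ["", "", "999"],
})

# Render each validation group's CSV separately, concatenate the chunks,
# and prepend a single header line.
header = ",".join(errors.columns) + "\n"
chunks = [
    group.to_csv(index=False, header=False)
    for _, group in errors.groupby("validation_id", sort=True)
]
csv_text = header + "".join(chunks)
```

Working group-by-group keeps peak memory proportional to the largest group rather than the whole findings set, which matches the memory-improvement goal of PR #197.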

cfpb/sbl-filing-api#257 was written to update the filing-api to use
df_to_dicts and to change how the submission state is determined, since
the validation df will no longer have severity.

Question: I left df_to_table, df_to_csv, and df_to_str alone, without
adding the static data. I'm not sure whether these are actually used
anywhere other than the tests (which I updated to use just the
trimmed-down error dataframe columns). Do we want to add the static data
to those, too?

To limit error and JSON size, I added the concept of passed-in maxes to
the validation and df_to_json/df_to_dicts functions. These maxes are used
to compute ratios of each validation group's errors (or records) relative
to the original total number of errors; if the total exceeds the max,
those ratios are applied against the max to calculate new per-group error
totals.

Also added, for df_to_json, a 'static' max records per group that can be
passed in to skip the ratio calculations and instead return the same
number of errors for every group.
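The ratio calculation described above can be sketched like this. The function name and the rounding choice (floor) are assumptions for illustration; the commit only specifies the proportional-share idea.

```python
import math

def scaled_group_maxes(group_counts, max_errors):
    """Hypothetical sketch of the ratio-based cap: when the total error
    count exceeds the max, scale each validation group's allowance by its
    share of the original total."""
    total = sum(group_counts.values())
    if total <= max_errors:
        # Under the cap: every group keeps all of its errors.
        return dict(group_counts)
    return {
        group: math.floor(max_errors * count / total)
        for group, count in group_counts.items()
    }

# 1000 total errors against a cap of 100: each group keeps its
# proportional share of the budget.
caps = scaled_group_maxes({"E0001": 600, "W0010": 300, "W0011": 100}, max_errors=100)
```

The 'static' max-records-per-group option mentioned above would bypass this entirely and return the same fixed count for every group.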

Added a first row below the download CSV header that warns when the
submission has more errors than can be returned. The download formatter
**does not** do any ratio/max truncation; that happens earlier, during
validation. I could change this function to further truncate the download
report's results, as df_to_json does, but currently there doesn't seem to
be a need.

Added a ValidationResults class to hold all the info coming out of
validation (single-field counts, multi-field counts, register counts,
etc.), since the amount of data returned by validation has grown a bit.
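A plausible shape for that container, sketched as a dataclass. The actual ValidationResults class in regtech-data-validator may differ in field names and structure; the error_count property is an assumption added here for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical shape for the ValidationResults container described above.
@dataclass
class ValidationResults:
    phase: str                                   # e.g. "SYNTACTICAL" or "LOGICAL"
    findings: list = field(default_factory=list) # the built dict results
    single_field_count: int = 0
    multi_field_count: int = 0
    register_count: int = 0

    @property
    def error_count(self) -> int:
        # Convenience total across the per-scope counts.
        return self.single_field_count + self.multi_field_count + self.register_count

results = ValidationResults(phase="LOGICAL", single_field_count=2, register_count=1)
```

Bundling the counts with the findings lets callers like the filing-api make state decisions without re-deriving totals from the dataframe.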

Updated pytests to cover the max concepts, and beefed up the data the
current pytests use so that we hit more cases than we previously had.
@jcadam14 jcadam14 linked a pull request Jun 3, 2024 that will close this issue
lchen-2101 pushed a commit that referenced this issue Jun 5, 2024
Closes #257 

- Updated submission_processor to use df_to_dicts instead of df_to_json
- Updated submission_processor to use ValidationResults instead of a tuple
- Updated the submission state logic: the submission is successful when
results.findings is empty; it has errors when results.phase is
SYNTACTICAL (syntactical findings are always errors) or when the error
counts are greater than 0; otherwise there were only warnings. This was
done because the validation results dataframe no longer has severity.
- Added config Settings with values for max validation errors, max JSON
errors, and max records per group (JSON)
- Updated pytests to work with all of this
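The state logic described in the bullets above can be sketched as follows. The string states stand in for the filing-api's actual SubmissionState enum values, and the function name is hypothetical.

```python
from types import SimpleNamespace

def determine_state(results):
    """Hypothetical sketch of the severity-free state decision."""
    if not results.findings:
        return "VALIDATION_SUCCESSFUL"      # no findings at all
    if results.phase == "SYNTACTICAL":
        return "VALIDATION_WITH_ERRORS"     # syntactical findings are always errors
    if results.error_count > 0:
        return "VALIDATION_WITH_ERRORS"
    return "VALIDATION_WITH_WARNINGS"       # findings present, none are errors

# Stand-ins for ValidationResults objects covering the three branches.
clean = SimpleNamespace(findings=[], phase="LOGICAL", error_count=0)
warned = SimpleNamespace(findings=[{"id": "W0010"}], phase="LOGICAL", error_count=0)
failed = SimpleNamespace(findings=[{"id": "E0001"}], phase="SYNTACTICAL", error_count=1)
```

Because the dataframe no longer carries severity, the phase and the counts together are enough to classify the submission.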

I didn't include any local.env file updates to set values for the maxes.
I can add those if desired.
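A minimal stand-in for those config Settings, showing why missing local.env entries are harmless: in-code defaults apply. The filing-api likely uses a pydantic Settings class, and the environment variable names and default values here are assumptions for illustration.

```python
import os

# Hypothetical sketch of the max-related config Settings described above.
class Settings:
    def __init__(self, env=None):
        env = os.environ if env is None else env
        # Each cap falls back to an illustrative default when the
        # corresponding environment variable is not set.
        self.max_validation_errors = int(env.get("MAX_VALIDATION_ERRORS", 1_000_000))
        self.max_json_errors = int(env.get("MAX_JSON_ERRORS", 10_000))
        self.max_json_group_size = int(env.get("MAX_JSON_RECORDS_PER_GROUP", 200))

# With no local.env values set, the in-code defaults apply.
settings = Settings(env={})
```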