Commit
Update README: Fix cfpb-val output, other small updates for clarity (#95)

Small changes to the docs, updating the `cfpb-val` examples' output to
match current functionality, and tweaked some other areas for clarity.
hkeeler authored Feb 9, 2024
1 parent 2395d8a commit 2e70654
Showing 1 changed file with 28 additions and 21 deletions.
49 changes: 28 additions & 21 deletions README.md
@@ -3,11 +3,11 @@
[![Coverage badge](https://github.com/cfpb/regtech-data-validator/raw/python-coverage-comment-action-data/badge.svg)](https://github.com/cfpb/regtech-data-validator/tree/python-coverage-comment-action-data)

Python-based tool for parsing and validating CFPB's RegTech-related data submissions.
It uses the [Pandera](https://pandera.readthedocs.io/en/stable/) data testing
framework to define schemas for datasets and to perform all data validations,
which is in turn based on the [Pandas](http://pandas.pydata.org/docs/getting_started/)
data analytics tool. It is intended to be used as a library for Python-based apps,
but can also be used directly via command-line interface.
It uses [Pandera](https://pandera.readthedocs.io/en/stable/), a
[Pandas](http://pandas.pydata.org/docs/getting_started/)-based data testing framework,
to define schemas for datasets and to perform all data validations. It is intended to
be used as a library for Python-based apps, but can also be used directly via
command-line interface.

We are currently focused on implementing the SBL (Small Business Lending) data
submission. For details on this dataset and its validations, please see:
@@ -55,7 +55,7 @@ This project includes the `cfpb-val` CLI utility for validating CFPB's RegTech-r
data collection file formats. It currently supports the small business lending (SBL) data
collected for 2024, but may support more formats in the future. This tool is intended for
testing purposes, allowing a quick way to check the validity of a file without having
to submit it through the full CFPB-hosted filing systems.
to submit it through the full filing systems.

### Validating data

@@ -76,13 +76,12 @@ $ cfpb-val validate --help

#### Examples

1. Validate file with no findings
1. Validate a file with no findings

$ cfpb-val validate tests/data/sbl-validations-pass.csv
status: SUCCESS, findings: 0

**Note:** No output is returned if the file contains no validations errors or warnings.

1. Validate file with findings, passing in LEI as context
1. Validate a file with findings, passing in LEI as context

$ cfpb-val validate tests/data/sbl-validations-fail.csv --context lei=000TESTFIUIDDONOTUSE

@@ -100,8 +99,9 @@ $ cfpb-val validate --help
│ 117 │ 302 │ po_4_gender_flag │ 9001 │ error │ E1040 │ po_4_gender_flag.invalid_enum_value │
│ 118 │ 306 │ po_4_gender_ff │ 12345678901234567890123456789012345678901234567890 │ error │ E1060 │ po_4_gender_ff.invalid_text_length │
╰────────────┴───────────┴──────────────────┴────────────────────────────────────────────────────┴─────────────────────┴───────────────┴──────────────────────────────────────╯
status: FAILURE, findings: 118

1. Validate file with findings with JSON output
1. Validate a file with findings with output in JSON format

$ cfpb-val validate tests/data/sbl-validations-fail.csv --output json

@@ -144,6 +144,7 @@ $ cfpb-val validate --help
]
},
...
status: FAILURE, findings: 118
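
If you script around `cfpb-val`, the trailing status line shown in the examples above is the easiest thing to check. The following is a minimal sketch that parses that line from text you have already captured. It assumes the line always has the `status: <WORD>, findings: <N>` shape seen in these examples; `parse_status_line` is a hypothetical helper, not part of this project, and how you capture the CLI output (stdout vs. stderr) is left to the caller.

```python
import re

# Assumed format, copied from the README examples:
#   status: SUCCESS, findings: 0
#   status: FAILURE, findings: 118
STATUS_LINE = re.compile(r"^status: (?P<status>[A-Z]+), findings: (?P<count>\d+)$")


def parse_status_line(output: str) -> tuple[str, int]:
    """Return (status, findings) from the last status line in captured CLI output."""
    # Search from the end, since the status line is printed after the findings.
    for line in reversed(output.splitlines()):
        match = STATUS_LINE.match(line.strip())
        if match:
            return match.group("status"), int(match.group("count"))
    raise ValueError("no status line found in output")


if __name__ == "__main__":
    sample = "...\nstatus: FAILURE, findings: 118\n"
    status, findings = parse_status_line(sample)
    print(status, findings)
```

Raising `ValueError` when no status line is present keeps a changed output format from being silently read as a pass.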

## Test Data

@@ -159,6 +160,11 @@ We use these test files in for automated test, but can also be passed in via the

## Development

This section is for developers who wish to contribute to this project.

**Note:** If you simply want to use the **cfpb-val** tool for testing your data,
you don't need to read any further.
### Best practices
#### `Check` functions
@@ -233,28 +239,29 @@ Test coverage details can be found in this project's
branch.
### Testing the FIG CSV
### Checking validation code vs. validations CSV
A standard pytest ([`test_csv_to_code_differences.py`](tests/test_csv_to_code_differences.py)) has been written that compares the validation code in [`phase_validations.py`](regtech_data_validator/phase_validations.py)
to the [`FIG CSV`](https://github.com/cfpb/sbl-content/blob/main/fig-files/validation-spec/2024-validations.csv). This test will check that
the list of validation IDs in one match the other, and will report on IDs that are missing in either.
The [`test_csv_to_code_differences.py`](tests/test_csv_to_code_differences.py) test compares the validation code in
[`phase_validations.py`](regtech_data_validator/phase_validations.py) against the CSV-based 2024 SBL validation spec
([`2024-validations.csv`](https://github.com/cfpb/sbl-content/blob/main/fig-files/validation-spec/2024-validations.csv)).
This test checks that the list of validation IDs in one matches the other, and will report on IDs that are missing in either.
The test will also validate that all severities (error or warning) match. The test will then
do a hard string compare between the violation descriptions, with a couple of caveats:
do a hard string compare between the validation descriptions, with a couple of caveats:
- Any python validation check whose description starts with a single quote will first add the single quote
to the CSV's description, if one doesn't exist. This is done because if someone modifies the CSV in Excel,
Excel will drop the beginning single quote, which it interprets as a formatter telling Excel "this field is a string"
- Certain descriptions in the CSV have 'complex' formatting to produce layouts with lists, new lines and white space
that may not compare correctly. Since how error descriptions will be formatted on the results page for a submission,
that may not compare correctly. Since how validation descriptions will be formatted on the results page for a submission is not yet settled,
currently the test will strip off some of this formatting and compare the text.
This test is ran automatically as part of our unit testing pipeline. A developer can also
This test runs automatically as part of our unit testing pipeline. You can also
run the test manually by running the command `poetry run pytest tests/test_csv_to_code_differences.py`.
This will create an errors.csv file at the root of the repo that can be used to easily view
This will create an `errors.csv` file at the root of the repo that can be used to easily view
differences found between the two files.
Normally the pytest will point to the main branch in the sbl-content repo, but a developer
can modify the test to point to a development branch that has upcoming changes, run the test with the above command,
Normally the pytest will point to the main branch in the sbl-content repo, but you can modify the test to
point to a development branch that has upcoming changes, run the test with the above command,
and then evaluate what changes may need to be made to the python validation code.
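
The two core checks described above, the ID comparison and the single-quote caveat, can be sketched as follows. This is an illustration only and not the code in `test_csv_to_code_differences.py`: both function names are hypothetical, and the IDs `X0001` and `X0002` are made up for the example (`E1040` and `E1060` come from the sample output earlier in this README).

```python
def diff_validation_ids(code_ids: set[str], csv_ids: set[str]) -> dict[str, list[str]]:
    """Report validation IDs that appear on one side but not the other."""
    return {
        "missing_in_csv": sorted(code_ids - csv_ids),
        "missing_in_code": sorted(csv_ids - code_ids),
    }


def normalize_csv_description(code_desc: str, csv_desc: str) -> str:
    """Re-add a leading single quote that Excel may have dropped from the CSV.

    Only applies when the Python description starts with a single quote
    and the CSV description does not.
    """
    if code_desc.startswith("'") and not csv_desc.startswith("'"):
        return "'" + csv_desc
    return csv_desc


if __name__ == "__main__":
    # X0001 and X0002 are hypothetical IDs used only to show a mismatch.
    code_ids = {"E1040", "E1060", "X0001"}
    csv_ids = {"E1040", "E1060", "X0002"}
    print(diff_validation_ids(code_ids, csv_ids))

    code_desc = "'Example' must be a valid value."
    csv_desc = "Example' must be a valid value."
    print(normalize_csv_description(code_desc, csv_desc) == code_desc)
```

Set differences keep the report symmetric, so an ID missing from either file shows up under its own key.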
## Linting