feat: Replace main.py with Typer-based CLI app (#63)
### Replace `main.py` with a [Typer](https://typer.tiangolo.com/) CLI app

The CLI setup in `main.py` is pretty primitive, and is really geared
towards _us_, developers on this project. I would like to create a
utility that outside technical users can use to test their files against
our validation logic. To get there without having to reinvent many
wheels, we need an actual CLI framework. That's where
[Typer](https://typer.tiangolo.com/) comes in.

Typer is a wrapper around
[Click](https://click.palletsprojects.com/en/8.1.x/), written by the
same dev as FastAPI, so it follows many of the patterns we're used to
there, such as leaning into Python's type hints, much as FastAPI does
with `typing` and [Pydantic](https://docs.pydantic.dev/latest/).
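
As a rough illustration of that type-hint-driven style, here is a minimal sketch of a Typer app. It is not the code from this commit: the `check` command, the `Fmt` enum, and the file name are hypothetical, and it assumes Typer 0.9 or later is installed (for `Annotated` support).

```python
from enum import Enum
from typing import Annotated

import typer
from typer.testing import CliRunner


class Fmt(str, Enum):
    """Hypothetical output formats; Typer exposes the values as CLI choices."""

    table = "table"
    json = "json"


app = typer.Typer()


@app.command()
def check(
    path: Annotated[str, typer.Argument(help="File to check (illustrative only)")],
    output: Annotated[Fmt, typer.Option(help="Output format")] = Fmt.table,
) -> None:
    # The type hints above are what Typer uses to build the parser and --help.
    typer.echo(f"checking {path} as {output.value}")


# Exercise the app in-process, the way a unit test would.
runner = CliRunner()
result = runner.invoke(app, ["data.csv", "--output", "json"])
print(result.output)
```

Because the option is typed as an enum, a value outside the declared choices is rejected by the framework before the function body runs.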


### Return validation results as Pandas `DataFrame` instead of vanilla `dict`

I also wanted to support multiple output formats: at least JSON and CSV,
with a human-readable table as the default. And while we _could_ do that
with a `dict`, it's much easier to work with `DataFrame`s as they're
already tabular in nature and have a ton of transform functionality
built in. So, I have also refactored the `validate` function to return a
Pandas `DataFrame` instead of a vanilla `dict`. This made it much easier
to output validation data in multiple formats.
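
To make the point concrete, here is a small sketch of one `DataFrame` rendered three ways using built-in pandas methods. The rows and validation IDs are made up for illustration; only the column names are borrowed from `cli.py` in this commit, and the real CLI uses its own `df_to_*` helpers rather than these exact calls.

```python
import json

import pandas as pd

# Made-up findings; column names mirror those referenced in cli.py.
findings_df = pd.DataFrame(
    [
        {"validation_id": "V001", "record_no": 0, "field_name": "uid", "field_value": "abc"},
        {"validation_id": "V001", "record_no": 3, "field_name": "uid", "field_value": "xyz"},
    ]
)

# One tabular object, three renderings.
csv_text = findings_df.to_csv(index=False)
json_text = findings_df.to_json(orient="records")
table_text = findings_df.to_string(index=False)

print(csv_text)
print(json.dumps(json.loads(json_text), indent=2))
print(table_text)
```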

Along with this, I did a fair bit of refactoring of that surrounding
code, breaking up the big monolith function into smaller testable
functions.

### `cfpb-val` script

To simplify using the CLI, I've added a shortcut script so you don't have
to call the full path to the Python script. In its current form, you
can now call just...

    cfpb-val --help

...instead of...

    python regtech_data_validator/cli.py --help 
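
The shortcut comes from the `[tool.poetry.scripts]` entry in `pyproject.toml`, which maps `cfpb-val` to `regtech_data_validator.cli:app`. As a simplified sketch of how such a `module:attribute` string is resolved (the `resolve_entry_point` helper is hypothetical, and the wrapper an installer actually generates does more than this):

```python
import importlib


def resolve_entry_point(spec: str):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    # Walk dotted attributes such as 'Class.method', if any were given.
    for attr in filter(None, attr_path.split(".")):
        obj = getattr(obj, attr)
    return obj


# With this package installed, the spec would be 'regtech_data_validator.cli:app'.
# A stdlib target is used here so the sketch runs anywhere.
dumps = resolve_entry_point("json:dumps")
print(dumps({"status": "SUCCESS"}))
```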

#### Better script name?

I'm not in love with `cfpb-val`. I started off with `cfpb-comply`, but I
didn't love that either. I'm looking for something that's short, but
also very clear that it's CFPB-related, and having to do with this
project. `cfpb-rdv`?

### Lots of `README.md` love

I've reworked the `README.md` quite a bit, focusing on users coming to
this repo for the first time, and trying to get the CLI to work. This
includes pushing the Poetry-based install up to the top and simplifying
it, slimming and pushing down development-related setup, and adding
details about how to use the CLI itself.

See:
https://github.com/cfpb/regtech-data-validator/blob/add-typer-cli/README.md

### Refactor `lei` arg to more generic `context`

In an attempt to future-proof this a bit, I've added a multi-value
`context` arg to the CLI, which allows you to pass arbitrary
`<key>=<value>` pairs, `lei` being the only key that's currently used
for anything.
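
A minimal sketch of turning repeated `<key>=<value>` strings into a context dict follows. The names `KeyValue`, `parse_pair`, and `build_context` and the `note` key are hypothetical. Note one deliberate difference from `parse_key_value` in this commit, which splits on every `=` and so rejects values containing one: this variant splits on the first `=` only.

```python
from dataclasses import dataclass


@dataclass
class KeyValue:
    key: str
    value: str


def parse_pair(text: str) -> KeyValue:
    """Split '<key>=<value>' on the first '=' only."""
    key, sep, value = text.partition("=")
    if not sep or not key:
        raise ValueError(f"Invalid key/value pair: {text}")
    return KeyValue(key, value)


def build_context(pairs: list[str]) -> dict[str, str]:
    # Later pairs overwrite earlier ones that share a key.
    return {kv.key: kv.value for kv in map(parse_pair, pairs)}


# 'note' is a made-up key; 'lei' is the only key the validator currently uses.
context = build_context(["lei=12345678901234567890", "note=a=b"])
print(context)
```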

### Other tidbits I found along the way

- Removed deprecated `python.formatting.provider` VS Code configuration
- Removed the no-longer-used `Makefile`
- Fixed several validations that had incorrect IDs, severities, or were
in the wrong phase

---------

Co-authored-by: lchen-2101 <[email protected]>
hkeeler and lchen-2101 authored Nov 8, 2023
1 parent 36007b2 commit e6210e3
Showing 14 changed files with 999 additions and 1,507 deletions.
1 change: 0 additions & 1 deletion .devcontainer/devcontainer.json
@@ -40,7 +40,6 @@
 ],
 "editor.tabSize": 4,
 "editor.formatOnSave": true,
-"python.formatting.provider": "none",
 "python.envFile": "${workspaceFolder}/.env",
 "editor.codeActionsOnSave": {
 "source.organizeImports": true
1 change: 0 additions & 1 deletion .github/workflows/linters.yml
@@ -10,7 +10,6 @@ jobs:
- uses: psf/black@stable
with:
options: "--check --diff --verbose"
version: "~= 22.0"
ruff:
runs-on: ubuntu-latest
steps:
8 changes: 0 additions & 8 deletions Makefile

This file was deleted.

1,371 changes: 220 additions & 1,151 deletions README.md

Large diffs are not rendered by default.

338 changes: 183 additions & 155 deletions poetry.lock

Large diffs are not rendered by default.

17 changes: 12 additions & 5 deletions pyproject.toml
@@ -1,3 +1,7 @@
+[build-system]
+requires = ["poetry-core"]
+build-backend = "poetry.core.masonry.api"
+
 [tool.poetry]
 name = "regtech-data-validator"
 version = "0.1.0"
@@ -13,15 +17,18 @@ pandera = "0.16.1"
 [tool.poetry.group.dev.dependencies]
 pytest = "7.4.0"
 pytest-cov = "4.1.0"
-black = "23.3.0"
-ruff = "0.0.259"
+black = "23.10.1"
+ruff = "0.1.4"

 [tool.poetry.group.data.dependencies]
 openpyxl = "^3.1.2"

-[build-system]
-requires = ["poetry-core"]
-build-backend = "poetry.core.masonry.api"
+[tool.poetry.group.cli.dependencies]
+tabulate = "^0.9.0"
+typer = "^0.9.0"
+
+[tool.poetry.scripts]
+cfpb-val = 'regtech_data_validator.cli:app'

 # Black formatting
 [tool.black]
1 change: 0 additions & 1 deletion regtech_data_validator/check_functions.py
@@ -11,7 +11,6 @@
the function. This may or may not align with the name of the validation
in the fig."""


import re
from datetime import datetime, timedelta
from typing import Dict
155 changes: 155 additions & 0 deletions regtech_data_validator/cli.py
@@ -0,0 +1,155 @@
from dataclasses import dataclass
from enum import StrEnum
import json
from pathlib import Path
from typing import Annotated, Optional

import pandas as pd
from tabulate import tabulate
import typer

from regtech_data_validator.create_schemas import validate_phases


app = typer.Typer(no_args_is_help=True, pretty_exceptions_enable=False)


@dataclass
class KeyValueOpt:
    key: str
    value: str


def parse_key_value(kv_str: str) -> KeyValueOpt:
    split_str = kv_str.split('=')

    if len(split_str) != 2:
        raise ValueError(f'Invalid key/value pair: {kv_str}')

    return KeyValueOpt(split_str[0], split_str[1])


class OutputFormat(StrEnum):
    CSV = 'csv'
    JSON = 'json'
    PANDAS = 'pandas'
    TABLE = 'table'


def df_to_str(df: pd.DataFrame) -> str:
    with pd.option_context('display.width', None, 'display.max_rows', None):
        return str(df)


def df_to_csv(df: pd.DataFrame) -> str:
    return df.to_csv()


def df_to_table(df: pd.DataFrame) -> str:
    # trim field_value field to just 50 chars, similar to DataFrame default
    table_df = df.drop(columns='validation_desc').sort_index()
    table_df['field_value'] = table_df['field_value'].str[0:50]

    # NOTE: `type: ignore` because tabulate package typing does not include Pandas
    # DataFrame as input, but the library itself does support it. ¯\_(ツ)_/¯
    return tabulate(table_df, headers='keys', showindex=True, tablefmt='rounded_outline')  # type: ignore


def df_to_json(df: pd.DataFrame) -> str:
    findings_json = []
    findings_by_v_id_df = df.reset_index().set_index(['validation_id', 'record_no', 'field_name'])

    for v_id_idx, v_id_df in findings_by_v_id_df.groupby(by='validation_id'):
        v_head = v_id_df.iloc[0]

        finding_json = {
            'validation': {
                'id': v_id_idx,
                'name': v_head.at['validation_name'],
                'description': v_head.at['validation_desc'],
                'severity': v_head.at['validation_severity'],
            },
            'records': [],
        }
        findings_json.append(finding_json)

        for rec_idx, rec_df in v_id_df.groupby(by='record_no'):
            record_json = {'record_no': rec_idx, 'fields': []}
            finding_json['records'].append(record_json)

            for field_idx, field_df in rec_df.groupby(by='field_name'):
                field_head = field_df.iloc[0]
                record_json['fields'].append({'name': field_idx, 'value': field_head.at['field_value']})

    json_str = json.dumps(findings_json, indent=4)

    return json_str


@app.command()
def describe() -> None:
    """
    Describe CFPB data submission formats and validations
    """

    print('Feature coming soon...')


@app.command(no_args_is_help=True)
def validate(
    path: Annotated[
        Path,
        typer.Argument(
            exists=True,
            dir_okay=False,
            readable=True,
            resolve_path=True,
            show_default=False,
            help='Path of file to be validated',
        ),
    ],
    context: Annotated[
        Optional[list[KeyValueOpt]],
        typer.Option(
            parser=parse_key_value,
            metavar='<key>=<value>',
            help='[example: lei=12345678901234567890]',
            show_default=False,
        ),
    ] = None,
    output: Annotated[Optional[OutputFormat], typer.Option()] = OutputFormat.TABLE,
) -> tuple[bool, pd.DataFrame]:
    """
    Validate CFPB data submission
    """
    context_dict = {x.key: x.value for x in context} if context else {}
    input_df = pd.read_csv(path, dtype=str, na_filter=False)
    is_valid, findings_df = validate_phases(input_df, context_dict)

    status = 'SUCCESS'
    no_of_findings = 0

    if not is_valid:
        status = 'FAILURE'
        no_of_findings = len(findings_df.index.unique())

    match output:
        case OutputFormat.PANDAS:
            print(df_to_str(findings_df))
        case OutputFormat.CSV:
            print(df_to_csv(findings_df))
        case OutputFormat.JSON:
            print(df_to_json(findings_df))
        case OutputFormat.TABLE:
            print(df_to_table(findings_df))
        case _:
            raise ValueError(f'output format "{output}" not supported')

    typer.echo(f"status: {status}, findings: {no_of_findings}", err=True)

    # returned values are only used in unit tests
    return is_valid, findings_df


if __name__ == '__main__':
    app()
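
The nested layout that `df_to_json` builds (validation, then records, then fields) can be approximated without pandas. The following is only a stdlib sketch of that shape, not the commit's implementation: the rows are made up, the `name`, `description`, and `severity` keys are omitted, and insertion order is kept where the pandas `groupby` calls would sort.

```python
import json

# Made-up flat findings, one row per (validation, record, field).
rows = [
    {"validation_id": "V001", "record_no": 0, "field_name": "uid", "field_value": "abc"},
    {"validation_id": "V001", "record_no": 0, "field_name": "app_date", "field_value": "2023"},
    {"validation_id": "V002", "record_no": 4, "field_name": "uid", "field_value": "xyz"},
]

# Group rows by validation, then by record, using plain dicts as indexes.
findings: dict[str, dict] = {}
for row in rows:
    finding = findings.setdefault(
        row["validation_id"],
        {"validation": {"id": row["validation_id"]}, "records": {}},
    )
    record = finding["records"].setdefault(
        row["record_no"], {"record_no": row["record_no"], "fields": []}
    )
    record["fields"].append({"name": row["field_name"], "value": row["field_value"]})

# Flatten the helper dicts into the list-of-findings layout.
findings_json = [
    {"validation": f["validation"], "records": list(f["records"].values())}
    for f in findings.values()
]
print(json.dumps(findings_json, indent=4))
```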