Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create alembic script for institutions table update #48

Closed
lchen-2101 opened this issue Oct 31, 2023 · 10 comments · Fixed by #60
Closed

Create alembic script for institutions table update #48

lchen-2101 opened this issue Oct 31, 2023 · 10 comments · Fixed by #60
Assignees

Comments

@lchen-2101
Copy link
Collaborator

the fields that need to be added and modified:

lei
tax_id
rssd
respondent_name (modified from name)
parent_id_rssd
parent_name
top_holder_rssd
top_holder_name
prim_fed_reg
street_line1
street_line2
city
state_cd
zip_cd
parent_lei
top_holder_lei
fi_type

@lchen-2101 lchen-2101 changed the title Create alembic script for institutions table Create alembic script for institutions table update Oct 31, 2023
@nargis-sultani nargis-sultani self-assigned this Nov 1, 2023
@hkeeler
Copy link
Member

hkeeler commented Nov 1, 2023

A few thoughts and questions. Just trying to make the column names more explicit, and keep things flexible for the future. Some of these are probably questions for @nongarak and @Kibrael.

  1. I agree name is probably not sufficient, but respondent_name feels like HMDA-specific terminology. How about legal_name, which is what the GLEIF data uses? I feel like we should lean into their terminology since we're using LEI as our primary identifier.

  2. I'd like to prefix the address fields in case we need to support multiple address in the future. GLEIF data has LegalAddress and HeadquartersAddress (plus 5 additional "other" address). I think we've said we want headquarters address, so something like:

    • hq_address_street_1
    • hq_address_street_2
    • hq_address_city
    • hq_address_state
    • hq_address_zip

    ...but if it's actually LegalAddress we're basing this off of, then legal_*.

  3. What does the _cd signify on state_cd and zip_cd?

  4. When we say tax_id, do we mean tin, which can be one of a few different ID types, or are we specifically talking about Employer Identification Number (EIN)? If the latter, do we want to name it as such? Or name it federal_tax_id, which sounds like it is synonymous with EIN?

  5. Looking at NIC's search website, it seems like it's generally referred to as RSSD ID. So...
    - rssd -> rssd_id
    - parent_id_rssd -> parent_rssd_id
    - top_holder_rssd_id

  6. I assume there's going to be a corresponding federal_regulator table. I think it's a nice convention to have to column names match corresponding lookup table, so something like primary_federal_regulator.

  7. Similarly, I assume we'll have an institution_type table, so I think the fi_type should match up there too.

  8. How do we feel about US-centric address fields? Is every entry in this database going to have a US address? I'm guessing yes if this is only for institutions who are filing, but if we wanted to include the parent/top-holder records, and link them by foreign key, the answer may turn into no. In past projects I've used something like the following, which also lines up with GLEIF's.

    • hq_address_state -> hq_address_region
    • hq_address_zip -> hq_address_postal_code
    • ...addition of hq_address_country

    ...also, not everything is a state, even if this is always going to be US addresses. For one, there's...

    • "District of Columbia (United States of America)","US-DC","COUNTRY_AND_SUBDIVISION"

    ...and GLEIF considers all the rest of the US territories to be countries, not regions within the US.

    • "Guam","GU","COUNTRY_ONLY"
    • "Puerto Rico","PR","COUNTRY_ONLY"
    • "Virgin Islands (U.S.)","VI","COUNTRY_ONLY"
    • etc.

    Are we going to follow that, adding a _country column? If yes, _state/_region will need to be nullable, which it wouldn't otherwise.

    I have mixed feelings about this one. I like that it lines up with GLEIF and is more open, but I suspect this will be 99% US addresses, and adding the extra flexibility/abstraction may lead to initial what's a region? type question when someone first comes to understand this data.

    See GLEIF Accepted Legal Jurisdictions Code List for how represent this data.

@Kibrael
Copy link

Kibrael commented Nov 2, 2023

  1. Agree with Hans

  2. Agree with Hans

  3. _cd means code

  4. @nongarak Can you compare between HMDA and SBL to find out the answer for each of these? I think HMDA is EIN, but I am not sure.

  5. No opinion

  6. I don't understand the need for a federal regulator table. I think this is just the primary federal regulator attribute from NIC currently.

  7. institution_type is the HMDA field, there is a differently defined column for SBL FI type. They do not have the same values.

  8. This might take some investigation. We need to accommodate for all US territories, which is bigger than the HMDA scope (US states + PR).
    For HMDA, I smashed the PR country code into the state column for the Panel data.

    There is some patterning in how they do the state coding in GLEIF, IE "US-DC", I think the US prefix is standard and can be
    used in patterns.

@nongarak can you post the answer Larry gave us about foreign owned/controlled entities that are covered for US reporting?

@hkeeler
Copy link
Member

hkeeler commented Nov 2, 2023

  1. I don't understand the need for a federal regulator table. I think this is just the primary federal regulator attribute from NIC currently.

By having a separate lookup table we get a few things.

  1. Guaranteed data integrity. You can only ever put a value in that field if it matches one of the values in that table.
  2. That's where you put the various metadata about each of the regulators. We may want the full name of the regulator in the SBL app. If we don't keep that in a table, we'll have to keep that logic in the app...and we don't want that.
  1. institution_type is the HMDA field, there is a differently defined column for SBL FI type. They do not have the same values.

Yeah. We should figure out how to differentiate those. institution_type and fi_type in the same table is too confusing in my book. We could...

  • Be explicit: hmda_institution_type and sbl_institution_type?
  • Some other way of differentiating these fields by the values in them?
  1. This might take some investigation. We need to accommodate for all US territories, which is bigger than the HMDA scope (US states + PR).
    For HMDA, I smashed the PR country code into the state column for the Panel data.
    There is some patterning in how they do the state coding in GLEIF, IE "US-DC", I think the US prefix is standard and can

👍 Yep. I think the the ISO-3166 standard. To add to the fun there, it sounds like the standard allows for those US subdivisions to be either their own country (VI) or under the US (US-VI). It seems like GLEIF is going with the first option. Be curious if that is consistent throughout. Maybe that's why they have their own spreadsheet of jurisdictions so there's no ambiguity.

@hkeeler
Copy link
Member

hkeeler commented Nov 2, 2023

@Kibrael and I chatted about some of this...

  1. I think we're good on why we'd want a lookup table for federal_regulator and similar tables.
  2. We're going to go with hmda_institution_type and sbl_institution_type.
  3. We're going to keep it simple to start, and just put US territories into a state_or_territory column, and not have a country column.

@hkeeler
Copy link
Member

hkeeler commented Nov 10, 2023

Here's my take on the current state of things here. There's still a fair number of questions, but nothing that is a show-stopper from starting work on all this. I'll work with @Kibrael and @nongarak to try to get the rest answered over the coming days.

Below are the tables we'll need for this work. I've included the data for the lookup tables when available.


institution table

column datatype constraints notes / questions
lei CHAR(20) required, unique Table's primary key
legal_name VARCHAR required
tax_id CHAR(9) ❓required, unique Q1: Do we know if this is only EIN (12-3456789) format, or will SSN-formatted IDs (123-45-6789) come through as well? Q2: Do we strip dashes and just have this be 9 int digits?
rssd_id INTEGER ❓required, unique
primary_federal_regulator_id VARCHAR(4) Foreign key to federal_regulator table.
hmda_institution_type_id CHAR(❓) Foreign key to hmda_institution_type table.
sbl_institution_type_id VARCHAR Foreign key to sbl_institution_type table.
hq_address_street_1 VARCHAR required Q: Do we need more than 2 street_x columns? @nongarak's demo showed how GLEIF's address data can has up to 5 (I think), and the first is frequently filled with C/O type data, not actual address data. Or will downstream NIC/GLEIF data munging get us down to just 2 useful fields?
hq_address_street_2 VARCHAR
hq_address_city VARCHAR
hq_address_state CHAR(2) Foreign key to address_state table
hq_address_zip CHAR(5) required Q: Do we need to support 12345-6789 extended zips? Q: Do all territories have "zip codes"? Should we make this a more universal hq_address_postal_code?
parent_lei CHAR(20)
parent_legal_name VARCHAR Q: Required if parent_lei or parent_rssd_id is set?
parent_rssd_id INTEGER
top_holder_lei CHAR(20)
top_holder_legal_name VARCHAR Q: Required if parent_lei or parent_rssd_id is set?
top_holder_rssd_id INTEGER

federal_regulator table

column datatype constraints notes / questions
id VARCHAR(4) required, unique
name VARCHAR required, unique

As for the data we insert into this lookup table, I think we can just follow what's in NIC. Per page 26 of NIC BULK DATA DOWNLOAD DATA DICTIONARY AND REFERENCE GUIDE

The Primary Federal Regulator column contains the name of the agency that is the primary regulator of a depository institution, non-deposit trust company, bank holding company, thrift holding company, farm credit bank, or federal housing enterprise. The following agencies are represented:

FCA = Farm Credit Administration
FDIC = Federal Deposit Insurance Corporation
FHFA = Federal Housing Finance Agency
FRS = Federal Reserve System
NCUA = National Credit Union Administration
OCC = Office of the Comptroller of the Currency
OTS = Office of Thrift Supervision (only valid until July 21, 2011)

Q: Should CFPB be in that list too?

hmda_institution_type table

column datatype constraints notes / questions
id required, unique Table's primary key
name VARCHAR required, unique

@Kibrael, what values will go in this table? Is this the same as NIC's ENTITY_TYPE? Per page 14 of Per page 26 of NIC BULK DATA DOWNLOAD DATA DICTIONARY AND REFERENCE GUIDE:

The Entity Type field is derived from other fields. The entity types listed below are in alphabetic
sequence by entity code.

AGB - Agreement Corporation - Banking
AGI - Agreement Corporation - Investment
BHC - Bank Holding Company
...

sbl_institution_type table

column datatype constraints notes / questions
id required, unique Table's primary key
name VARCHAR required, unique

@Kibrael, what values go in this table?

address_state table

column datatype constraints notes / questions
code CHAR(2) required, unique Table's primary key.
name VARCHAR required, unique

Values here should be pretty self-explanatory. All 50 states and D.C. and territories...

code name
AS American Samoa
DC District of Columbia
GU Guam
MP Northern Mariana Islands
PR Puerto Rico
UM United States Minor Outlying Islands
VI Virgin Islands, U.S.

...assuming we're going with ISO-3166.

@nongarak
Copy link

Whoops, sorry for the delay. Turns out I did not have notifications turned on for GH Public. I'm glad you've answered the hard questions without me though! Some thoughts:

re: TIN vs. SSN. I'll ask Regs, but you're right - sole props / single member LLCs can use SSN instead of TIN. I doubt we'll have that many of them, but it's a case worth considering. We probably couldn't publish those.

re: street address columns. I think it's probably best if we include all of them, as we will be pulling it straight from GLEIF. Long term maybe we could do some data munging, but I don't think it's a priority and also there is zero consistency in how folks submit this data.

re: ZIP code. There is nothing in the reg specifying what HQ address means, so I think we can safely suck in whatever GLEIF has. If we can suck in the full zip5+4 code for their confirmation purposes, we can later strip it to zip5 for any other analysis/reporting purposes if we need to. All US territories also use ZIP code. I don't think there's a need for calling it "postal code"

re: parent/top_holder_legal_name - definitely required if the parent RSSD or LEI are set, but also can be set without those things (not all parents will have RSSD or LEI).

re: should CFPB be on the prim fed reg list - that's definitely a @Kibrael question, but if there's no decisive answer we can ask Regs. My gut says no, because I feel like PFR has some definition somewhere that we aren't aware of, but also CFPB has "primary authority to enforce federal consumer financial laws for banks and other depository institutions with total assets of more than $10 billion, and their affiliates" (though I'm not sure what the difference between primary supervision/enforcement authority and PFR might be).

re: HMDA institution type code: I haven't parsed through it all, but this ticket outlines the logic used to describe it (on GHE)

re: SBL Type of Financial Institution - here is what the reg says the options are.

Paragraph 109(b)(9).

  1. Type of financial institution. A financial institution complies with § 1002.109(b)(9) by selecting the applicable type or types of financial institution from the list below. A financial institution shall select all applicable types.
    i. Bank or savings association.
    ii. Minority depository institution.
    iii. Credit union.
    iv. Nondepository institution.
    v. Community development financial institution (CDFI).
    vi. Other nonprofit financial institution.
    vii. Farm Credit System institution.
    viii. Government lender.
    ix. Commercial finance company.
    x. Equipment finance company.
    xi. Industrial loan company.
    xii. Online lender.
    xiii. Other

  2. Use of “other” for type of financial institution. A financial institution reports type of financial institution as “other” where none of the enumerated types of financial institution appropriately describe the applicable type of financial institution, and the institution reports the type of financial institution via free-form text field. A financial institution that selects at least one type from the list is permitted, but not required, to also report “other” (with appropriate free-form text) if there is an additional aspect of its business that is not one of the enumerated types set out in comment 109(b)(9)-1. 3. Additional types of financial institution. The Bureau may add additional types of financial institutions via the Filing Instructions Guide and related materials. Refer to the Filing Instructions Guide for any updates for each reporting year.

I guess that means we need to add another column for free form text field to your table, @hkeeler

re: @Kibrael's earlier question about the readout from Larry on foreign controlled entities, he said there should be a US HQ address that they would report. So a big international bank (like HSBC or whatever) would report their US entity information.

@nongarak
Copy link

Other notes:
GLEIF names the address data as headquartersaddress_firstaddressline, _additionaladdressline1, etc. which is where I originally got hq_address and hq_addressline1, etc. from. I think we should stick to calling it 'addresslineX' instead of hq_address_street_X because the additional address lines coming in from GLEIF are not actually all street addresses - there's the C/Os, the suite numbers, the floors, etc. and unless and until we're willing to devise some way of cleaning that info, I think we should just call it what GLEIF calls it since we're just sucking it in and displaying it on the page. Might as well have the internal variable name match the external variable name. For reference, here's the snipped of the SQL call I made:

                       , headquartersaddress_firstaddressline       AS hq_address
                       , headquartersaddress_additionaladdressline1 AS hq_addresline1
                       , headquartersaddress_additionaladdressline2 AS hq_addressline2
                       , headquartersaddress_additionaladdressline3 AS hq_addressline3

If you want _ between every word like hq_address_line_1 that's fine with me.

@nongarak
Copy link

I went through and changed all of the column names in my code to match the names hans gave above (except the address naming convention). The final list is this, in order of data point in the rule:

  1. lei_lei
  2. legal_name_lei
  3. hq_address_lei
  4. hq_address_line_1_lei
  5. hq_address_line_2_lei
  6. hq_address_line_3_lei
  7. hq_address_city_lei
  8. hq_address_state_lei
  9. hq_address_country_lei
  10. hq_address_zip_lei
  11. email_domain_hmda
  12. primary_federal_regulator_id_nic
  13. tax_id_hmda
  14. type_of_fi
  15. type_of_fi_other
  16. rssd_id_nic
  17. parent_legal_name_lei
  18. parent_lei_lei
  19. parent_rssd_id_nic
  20. top_holder_legal_name_lei
  21. top_holder_lei_lei
  22. top_holder_rssd_id_nic

@nongarak
Copy link

@nargis-sultani you're right, technically we don't need the domains field in this dataset. I have that all cleaned up and parsed out in a separate table and that's probably more appropriate, I can remove it from the current iteration of the dataset here.

@Kibrael
Copy link

Kibrael commented Dec 5, 2023

  • Tax ID: should check with privacy, likely will be suppressed on publication if relates to individuals. format can be annoying. we should use string if we have mixed source inputs that would compress to non-unique integers
  • primary_federal_regulator_id: should be prudential_federal_regulator (or similar). CFPB is not a prudential regulator, this category refers to the regulators of depositories with a safety and soundness mission, FDIC, FRS/FRB, OCC, NCUA.
    • Paragraph 109(b)(4)
      1. Federal prudential regulator. For purposes of § 1002.109(b)(4), Federal prudential regulator means, if applicable, the Federal prudential regulator for a financial institution that is a depository institution as determined pursuant to section 3q of the Federal Deposit Insurance Act (12 U.S.C. 1813(q)), including the Office of the Comptroller of the Currency, the Federal Deposit Insurance Corporation, or the Board of Governors of the Federal Reserve System; or the National Credit Union Administration Board for financial institutions that are Federal credit unions.
        Values: OCC, NCUA, FDIC, FRB. these can be enumerations or string values. I don't care either way b/c they're short.
  • hmda_institution_type_id: 1-18 enumerations, supply definitions, only needed for HMDA screens. is the source field for OLC and agency code. code link: [GHE]/Kibrael/hmda-panel/blob/31d964bae5ea3985b00bebd95de206e1b6502ca8/python/lib/panel_data_pull_and_update.py#L1135
    1 National Bank (OCC supervised)
    2 State Member Bank (FRS Supervised):
    3 State non-member bank (FDIC supervised)
    4 State Chartered Thrift (FDIC supervised)
    5 Federal Chartered Thrift (OCC supervised)
    6 Credit Union (NCUA supervised)
    7 Federal Branch or Agency of Foreign Banking Organization (FBO)
    8 Branch or Agency of FBO (FRS supervised)
    9 MBS of national Bank (OCC supervised)
    10 MBS of state member bank (FRS supervised)
    11 MBS of state non-member bank (FDIC supervised)
    12 MBS of Bank Holding Company (BHC) (FRS supervised)
    13 MBS of credit union (NCUA supervised)
    14 independent MBS, no depository affiliation
    15 MBS of Savings and Loan Holding Co
    16 MBS of state chartered Thrift
    17 MBS of federally chartered thrift (OCC supervised)
    18 Affiliate of depository institution. MBS is in the same ownership org as a depository.
  • sbl_institution_type: create enums for this field, use the reg definitions as values. dependent option for free form text
    1 Bank or savings association.
    2 Minority depository institution.
    3 Credit union.
    4 Nondepository institution.
    5 Community development financial institution (CDFI).
    6) Other nonprofit financial institution.
    7 Farm Credit System institution.
    8 Government lender.
    9 Commercial finance company.
    10 Equipment finance company.
    11 Industrial loan company.
    12 Online lender.
    13 Other
  • parent_legal_name: can be populated if parent LEI or parent RSSD are known, otherwise user input is needed. source will be dependent on which data values are available with LEI taking precedence over RSSD and RSSD over user supplied.
  • top_holder_legal_name: can be populated if top_holder LEI or top_holder RSSD are known, otherwise user input is needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants