From bc8731c1635455497c10226769076c3576c194e0 Mon Sep 17 00:00:00 2001
From: Jackson Hoffart
Date: Tue, 9 Apr 2024 11:11:19 +0200
Subject: [PATCH] fix matching coverage vignette

---
 vignettes/matching-coverage.Rmd | 53 +++++++++++++++++++++------------
 1 file changed, 34 insertions(+), 19 deletions(-)

diff --git a/vignettes/matching-coverage.Rmd b/vignettes/matching-coverage.Rmd
index b214d826..0f5fef76 100644
--- a/vignettes/matching-coverage.Rmd
+++ b/vignettes/matching-coverage.Rmd
@@ -17,9 +17,15 @@ sector_in_scope <- glue::glue_collapse(
 )
 ```
 
-`r2dii.match` allows you to match loans from your loanbook to the companies in an asset-based company dataset. However, matching every loan is unlikely -- some loan-taking companies may be missing from the asset-based company dataset, or they may not operate in the sectors 2DII focuses on (`r sector_in_scope`). Thus, you may want to measure how much of the loanbook matched some asset. This article shows two ways to calculate such matching coverage:
+`r2dii.match` allows you to match loans from your loanbook to the companies in
+an asset-based company dataset. However, matching every loan is unlikely -- some
+loan-taking companies may be missing from the asset-based company dataset, or
+they may not operate in the sectors PACTA focuses on (`r sector_in_scope`).
+Thus, you may want to measure how much of the loanbook matched some asset. This
+article shows two ways to calculate such matching coverage:
 
-(1) Calculate the portion of your loanbook covered, by dollar value (i.e. using one of the `loan_size_*` columns).
+(1) Calculate the portion of your loanbook covered, by dollar value (i.e. using
+one of the `loan_size_*` columns).
 
 (2) Count the number of companies matched.
 
@@ -35,7 +41,8 @@ library(r2dii.data)
 library(r2dii.match)
 ```
 
-We will use example datasets from `r2dii.data`. To demonstrate our point, we create a `loanbook` dataset with two mismatching loans:
+We will use example datasets from `r2dii.data`. To demonstrate our point, we
+create a `loanbook` dataset with two mismatching loans:
 
 ```{r}
 loanbook <- loanbook_demo %>%
@@ -43,7 +50,7 @@ loanbook <- loanbook_demo %>%
     name_ultimate_parent =
       ifelse(id_loan == "L1", "unmatched company name", name_ultimate_parent),
     sector_classification_direct_loantaker =
-      ifelse(id_loan == "L2", 99, sector_classification_direct_loantaker)
+      ifelse(id_loan == "L2", "99", sector_classification_direct_loantaker)
   )
 ```
 
@@ -55,9 +62,16 @@ matched <- loanbook %>%
   prioritize()
 ```
 
-Note that this `matched` dataset will contain _only_ loans that were matched successfully. To determine coverage, we need to go back to the original `loanbook` dataset. We must determine the 2DII sectors of each loan, as dictated by the `sector_classification_direct_loantaker` column.
+Note that this `matched` dataset will contain _only_ loans that were matched
+successfully. To determine coverage, we need to go back to the original
+`loanbook` dataset. We must determine the 2DII sectors of each loan, as dictated
+by the `sector_classification_direct_loantaker` column.
 
-For this, we join the loanbook with the [`sector_classifications`](https://rmi-pacta.github.io/r2dii.data/reference/sector_classifications.html) dataset, which lists all sector classification code standards used by 'PACTA'. Unfortunately we need to work around two caveats (you may ignore them because they are conceptually uninteresting):
+For this, we join the loanbook with the
+[`sector_classifications`](https://rmi-pacta.github.io/r2dii.data/reference/sector_classifications.html)
+dataset, which lists all sector classification code standards used by 'PACTA'.
+Unfortunately we need to work around two caveats (you may ignore them because
+they are conceptually uninteresting):
 
 * In the two datasets, the columns we want to merge by have different names. We use the argument `by` to `left_join()` to merge the columns `sector_classification_system` and `sector_classification_direct_loantaker` (from `loanbook`) with the columns `code_system` and `code` (from `sector_classifications`), respectively.
 
@@ -70,7 +84,7 @@ merge_by <- c("code_system", "code") %>%
 loanbook_with_sectors <- loanbook %>%
   modify_at(names(merge_by)[[2]], as.character) %>%
   left_join(sector_classifications, by = merge_by) %>%
-  modify_at(names(merge_by)[[2]], as.double)
+  modify_at(names(merge_by)[[2]], as.character)
 ```
 
 We can join these two datasets together, to generate our `coverage` dataset:
@@ -94,7 +108,9 @@ coverage <- left_join(loanbook_with_sectors, matched) %>%
 
 ### 1. Calculate the portion of your loanbook covered by dollar value
 
-From the `coverage` dataset, we can calculate the total loanbook coverage by dollar value. Let's create two helper functions, one to calculate dollar-value and another one to plot coverage in general.
+From the `coverage` dataset, we can calculate the total loanbook coverage by
+dollar value. Let's create two helper functions, one to calculate dollar-value
+and another one to plot coverage in general.
 
 ```{r}
 dollar_value <- function(data, ...) {
@@ -155,8 +171,8 @@ coverage %>%
 
 ### 2. Count the number of companies
 
-You might also be interested in knowing how many companies in your loanbook were
-matched. It probably makes most sense to do this at the `direct_loantaker`
+You might also be interested in knowing how many companies in your loanbook were
+matched. It probably makes most sense to do this at the `direct_loantaker`
 level:
 
 ``` {r}
@@ -185,17 +201,16 @@ In the example below, we see two classification codes coming from the SIC
 classification standard:
 
 ``` {r}
-r2dii.data::sic_classification %>%
-  filter(code %in% c(41111, 36200))
+r2dii.data::nace_classification %>%
+  filter(code %in% c("D35.11", "D35.14"))
 ```
 
-Notice that the code 41111 corresponds to power generation. This is an identical
-match to 2DII's `power` sector, and thus the `borderline` flag is set to
-`FALSE`. In contrast, code 36200 corresponds to the manufacture of electricity
-distribution and control apparatus. In a perfect world, we would set this code
-to `not in scope`, however there is still a chance that these companies produce
-electricity. For this reason, we have mapped it to `power` with
-`borderline = TRUE`.
+Notice that the code D35.11 corresponds to power generation. This is an
+identical match to PACTA's `power` sector, and thus the `borderline` flag is set
+to `FALSE`. In contrast, code D35.14 corresponds to the distribution of
+electricity. In a perfect world, we would set this code to `not in scope`,
+however there is still a chance that these companies produce electricity. For
+this reason, we have mapped it to `power` with `borderline = TRUE`.
 
 In practice, if a company has a `borderline` of `TRUE` and _is_ matched, then
 consider the company in scope. If it has a `borderline` of `TRUE` and _isn't_