Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: match_name gains join_id col, allowing for an initial matching override based on some unique ID column #460

Merged
merged 28 commits into from
Mar 14, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
# r2dii.match (development version)

* `match_name` gains argument `join_id` allowing an optional perfect join based on a mutual ID column between `loanbook` and `abcd` inputs, prior to attempting fuzzy matching (#135).

# r2dii.match 0.1.4

* `to_alias` can now handle strange encodings without error (#425, @kalashsinghal @Tilmon).
Expand Down
169 changes: 155 additions & 14 deletions R/match_name.R
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,10 @@
#' only `sector`, the value in the `name` column should be `NA` and
#' vice-versa. This file can be used to manually match loanbook companies to
#' abcd.
#' @param join_id A join specification passed to [dplyr::inner_join()]. If a
#' character string, it assumes identical join columns between `loanbook` and
#' `abcd`. If a named character vector, it uses the name as the join column of `loanbook` and
#' the value as the join column of `abcd`.
#' @param ... Arguments passed on to [stringdist::stringsim()].
#' @param ald `r lifecycle::badge('superseded')` `ald` has been superseded by
#' `abcd`.
Expand Down Expand Up @@ -83,6 +87,25 @@
#' code_system = "XYZ"
#' )
#'
#' # match on LEI
#' loanbook <- tibble(
#' sector_classification_system = "XYZ",
#' sector_classification_direct_loantaker = "D35.11",
#' id_ultimate_parent = "UP15",
#' name_ultimate_parent = "Won't fuzzy match",
#' id_direct_loantaker = "C294",
#' name_direct_loantaker = "Won't fuzzy match",
#' lei_direct_loantaker = "LEI123"
#' )
#'
#' abcd <- tibble(
#' name_company = "alpine knits india pvt. limited",
#' sector = "power",
#' lei = "LEI123"
#' )
#'
#' match_name(loanbook, abcd, join_by = c(lei_direct_loantaker = "lei"))
#'
#' restore <- options(r2dii.match.sector_classifications = your_classifications)
#'
#' loanbook <- tibble(
Expand Down Expand Up @@ -111,6 +134,7 @@
method = "jw",
p = 0.1,
overwrite = NULL,
join_id = NULL,
ald = deprecated(),
...) {
restore <- options(datatable.allow.cartesian = TRUE)
Expand All @@ -125,16 +149,77 @@
abcd <- ald
}

match_name_impl(
loanbook = loanbook,
abcd = abcd,
by_sector = by_sector,
min_score = min_score,
method = method,
p = p,
overwrite = overwrite,
...
)
if (!is.null(join_id)) {
check_join_id(join_id, loanbook, abcd)

Check warning on line 153 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L153

Added line #L153 was not covered by tests

crucial_names <- c("name_company", "sector", join_id)
check_crucial_names(abcd, crucial_names)

Check warning on line 156 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L155-L156

Added lines #L155 - L156 were not covered by tests

prep_abcd <- dplyr::transmute(
abcd,
name_abcd = .data[["name_company"]],
sector_abcd = tolower(.data[["sector"]]),
!!join_id := .data[[join_id]]

Check warning on line 162 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L158-L162

Added lines #L158 - L162 were not covered by tests
)

prep_abcd <- dplyr::distinct(prep_abcd)

Check warning on line 165 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L165

Added line #L165 was not covered by tests

prep_lbk <- may_add_sector_and_borderline(loanbook)
prep_lbk <- distinct(prep_lbk)

Check warning on line 168 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L167-L168

Added lines #L167 - L168 were not covered by tests

join_matched <- dplyr::inner_join(
prep_lbk,
prep_abcd,
by = join_id,
na_matches = "never"

Check warning on line 174 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L170-L174

Added lines #L170 - L174 were not covered by tests
)

join_by_list <- as_join_by(join_id)
loanbook_join_id <- join_by_list[[1]]

Check warning on line 178 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L177-L178

Added lines #L177 - L178 were not covered by tests

join_matched <- dplyr::mutate(
join_matched,
score = 1,
source = "id joined",
level = loanbook_join_id,
name = .data[["name_abcd"]]

Check warning on line 185 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L180-L185

Added lines #L180 - L185 were not covered by tests
)

loanbook <- dplyr::filter(
loanbook,
!.data[[loanbook_join_id]] %in% join_matched[[loanbook_join_id]]

Check warning on line 190 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L188-L190

Added lines #L188 - L190 were not covered by tests
)
}

if (nrow(loanbook) != 0) {
fuzzy_matched <- match_name_impl(
loanbook = loanbook,
abcd = abcd,
by_sector = by_sector,
min_score = min_score,
method = method,
p = p,
overwrite = overwrite,
...
)
} else {
fuzzy_matched <- tibble()

Check warning on line 206 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L206

Added line #L206 was not covered by tests
}

if (exists("join_matched")) {
out <- dplyr::bind_rows(join_matched, fuzzy_matched)

Check warning on line 210 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L210

Added line #L210 was not covered by tests
} else if (nrow(fuzzy_matched) == 0 && exists("join_matched")) {
out <- join_matched

Check warning on line 212 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L212

Added line #L212 was not covered by tests
} else {
out <- fuzzy_matched
}

if (identical(nrow(out), 0L)) {
rlang::warn("Found no match.")
return(empty_loanbook_tibble(loanbook, dplyr::groups(loanbook)))
}

out
}

match_name_impl <- function(loanbook,
Expand All @@ -145,6 +230,7 @@
p = 0.1,
overwrite = NULL,
...) {

old_groups <- dplyr::groups(loanbook)
loanbook <- ungroup(loanbook)

Expand All @@ -164,8 +250,8 @@
setDT(a)

if (identical(nrow(a), 0L)) {
rlang::warn("Found no match.")
return(empty_loanbook_tibble(loanbook, old_groups))
rlang::inform("Found no match via fuzzy matching.")
return(a)
}

a <- unique(a)[
Expand All @@ -179,8 +265,8 @@
a <- a[score >= min_score, ]

if (identical(nrow(a), 0L)) {
rlang::warn("Found no match.")
return(empty_loanbook_tibble(loanbook, old_groups))
rlang::inform("Found no match via fuzzy matching.")
return(a)
}

l <- rename(prep_lbk, alias_lbk = "alias")
Expand Down Expand Up @@ -349,3 +435,58 @@
"borderline"
)
}

check_join_id <- function(join_id, loanbook, abcd) {

join_id_list <- as_join_by(join_id)

Check warning on line 441 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L441

Added line #L441 was not covered by tests

if (!rlang::has_name(loanbook, join_id_list[[1]])) {
rlang::abort(
"join_id_not_in_loanbook",
message = glue(
"The join_id `{join_id_list[[1]]}` must be present in `loanbook` input."

Check warning on line 447 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L443-L447

Added lines #L443 - L447 were not covered by tests
)
)
} else if (!rlang::has_name(abcd, join_id_list[[2]])) {
rlang::abort(
"join_id_not_in_abcd",
message = glue(
"The join_id `{join_id_list[[2]]}` must be present in `abcd` input."

Check warning on line 454 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L450-L454

Added lines #L450 - L454 were not covered by tests
)
)
}

invisible(join_id)

Check warning on line 459 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L459

Added line #L459 was not covered by tests
}

as_join_by <- function(x) {

if (rlang::is_list(x)) {
if (length(x) != 1L) {
rlang::abort("`join_id` must be a vector of length 1.")

Check warning on line 466 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L464-L466

Added lines #L464 - L466 were not covered by tests
}
x_name <- names(x) %||% x
y_name <- unname(x)
} else if (rlang::is_character(x)) {
x_name <- names(x) %||% x
y_name <- unname(x)

Check warning on line 472 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L468-L472

Added lines #L468 - L472 were not covered by tests

# If x partially named, assume unnamed are the same in both tables
x_name[x_name == ""] <- y_name[x_name == ""]

Check warning on line 475 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L475

Added line #L475 was not covered by tests
} else {
rlang::abort("`by` must be a string or a character vector.")

Check warning on line 477 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L477

Added line #L477 was not covered by tests
}

if (!rlang::is_character(x_name)) {
rlang::abort("`by$x` must evaluate to a character vector.")

Check warning on line 481 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L480-L481

Added lines #L480 - L481 were not covered by tests
}
if (!rlang::is_character(y_name)) {
rlang::abort("`by$y` must evaluate to a character vector.")

Check warning on line 484 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L483-L484

Added lines #L483 - L484 were not covered by tests
}

c(x_name, y_name)

Check warning on line 487 in R/match_name.R

View check run for this annotation

Codecov / codecov/patch

R/match_name.R#L487

Added line #L487 was not covered by tests
}

`%||%` <- function(x, y) {
if (is.null(x)) y else x
}
9 changes: 7 additions & 2 deletions R/restructure_abcd.R
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@
#' from values in the `name_company` column.
#'
#' @param data A data frame. Should be an asset-level dataset.
#' @param join_id (Optional) A string giving the name of an `ID` column to
#' preserve in the restructuring
#'
#' @seealso [r2dii.data::abcd_demo] `to_alias()`.
#'
Expand All @@ -15,12 +17,15 @@
#' restructure_abcd(r2dii.data::abcd_demo)
#' @noRd
restructure_abcd <- function(data) {
check_crucial_names(data, c("name_company", "sector"))
crucial_names <- c("name_company", "sector")
check_crucial_names(data, crucial_names)

out <- dplyr::transmute(
data,
name = .data$name_company, sector = tolower(.data$sector)
name = .data$name_company,
sector = tolower(.data$sector)
)

out <- distinct(out)
add_alias(out)
}
Expand Down
25 changes: 25 additions & 0 deletions man/match_name.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading
Loading