Consider refactoring components with strict methodological significance to `pacta.data.preparation` or `pacta.scenario.preparation` #94

jdhoffa · 2024-02-08T16:00:34Z

PROBABLY DON'T DO THIS PRIOR TO SUCCESSFUL DELIVERY OF PACTA COP CH 2024

Towards a world where workflows mainly handle file I/O, configuration, and DevOps,
and pacta.* handles as much methodology as possible. Relates to RMI/practices#2

maybe refactor to pacta.data.preparation (done in https://github.com/RMI-PACTA/pacta.data.preparation/pull/357 and use pacta.data.preparation::determine_relevant_years() #208)

workflow.data.preparation/run_pacta_data_preparation.R

Lines 94 to 98 in 5801f95

    
    relevant_years <- sort(
      unique(
        market_share_target_reference_year:(market_share_target_reference_year + time_horizon)
      )
    )
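
For illustration, here is a minimal sketch of what wrapping the snippet above in a function with input checks could look like. The name `relevant_years_sketch()` and its checks are assumptions made for this example; this is not the actual `pacta.data.preparation::determine_relevant_years()` implementation.

```r
# Hypothetical helper: same year logic as the snippet above, plus basic input checks.
relevant_years_sketch <- function(reference_year, time_horizon) {
  stopifnot(
    is.numeric(reference_year), length(reference_year) == 1L, !is.na(reference_year),
    is.numeric(time_horizon), length(time_horizon) == 1L, !is.na(time_horizon),
    time_horizon >= 0
  )
  # every year from the reference year through reference year + horizon
  sort(unique(reference_year:(reference_year + time_horizon)))
}

years <- relevant_years_sketch(2023, 5)
print(years)
```

Keeping the checks next to the year logic means the workflow script only has to pass in its two config values.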

refactor to pacta.scenario.preparation ~~(done in https://github.com/RMI-PACTA/pacta.data.preparation/pull/358 and use pacta.data.preparation::prepare_scenarios_long() #209)~~ not going to do this until after a switch to pacta.scenario.data.preparation is made

workflow.data.preparation/run_pacta_data_preparation.R

Lines 131 to 146 in 5801f95

    
    # scenario values will be linearly interpolated for each group below
    interpolation_groups <- c(
      "source",
      "scenario",
      "sector",
      "technology",
      "scenario_geography",
      "indicator",
      "units"
    )
    scenario_raw_data %>%
      pacta.scenario.preparation::interpolate_yearly(!!!rlang::syms(interpolation_groups)) %>%
      filter(.data$year >= .env$market_share_target_reference_year) %>%
      pacta.scenario.preparation::add_market_share_columns(reference_year = market_share_target_reference_year) %>%
      pacta.scenario.preparation::format_p4i(green_techs) %>%
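
To make the "linearly interpolated, then filtered to the reference year" idea concrete, here is a rough base R sketch for a single group. It uses `stats::approx()` on made-up numbers and is only a stand-in; it does not reproduce how `pacta.scenario.preparation::interpolate_yearly()` is implemented.

```r
# Toy scenario pathway with values only at 5-year steps (made-up numbers).
known_years <- c(2020, 2025, 2030)
known_values <- c(10, 20, 15)

# Linear interpolation to yearly resolution for this one group.
yearly <- stats::approx(x = known_years, y = known_values, xout = 2020:2030)
interpolated <- data.frame(year = yearly$x, value = yearly$y)

# Keep only years from the (assumed) reference year onwards.
reference_year <- 2023
interpolated <- interpolated[interpolated$year >= reference_year, ]
print(interpolated)
```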

refactor to pacta.data.preparation (done in https://github.com/RMI-PACTA/pacta.data.preparation/pull/359 and use pacta.data.preparation::standardize_asset_type_names() #210)

workflow.data.preparation/run_pacta_data_preparation.R

Lines 175 to 186 in 5801f95

    
    factset_issue_code_bridge <-
      pacta.data.preparation::factset_issue_code_bridge %>%
      select(issue_type_code, asset_type) %>%
      mutate(
        asset_type = case_when(
          .data$asset_type == "Listed Equity" ~ "Equity",
          .data$asset_type == "Corporate Bond" ~ "Bonds",
          .data$asset_type == "Fund" ~ "Funds",
          .data$asset_type == "Other" ~ "Others",
          TRUE ~ "Others"
        )
      )
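
As a rough idea of how this mapping could live in a small, testable function, here is a base R sketch. The name `standardize_asset_type_sketch()` is made up for this example and is not the real `pacta.data.preparation::standardize_asset_type_names()`.

```r
# Hypothetical helper mirroring the case_when() mapping above.
standardize_asset_type_sketch <- function(asset_type) {
  lookup <- c(
    "Listed Equity" = "Equity",
    "Corporate Bond" = "Bonds",
    "Fund" = "Funds",
    "Other" = "Others"
  )
  out <- unname(lookup[asset_type])
  # anything unmatched (including NA) falls back to "Others", like TRUE ~ "Others"
  out[is.na(out)] <- "Others"
  out
}

standardized <- standardize_asset_type_sketch(c("Listed Equity", "Fund", "Mystery", NA))
print(standardized)
```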

refactor to pacta.scenario.preparation ~~(done in https://github.com/RMI-PACTA/pacta.data.preparation/pull/358 and use pacta.data.preparation::prepare_scenarios_long() #209)~~ not going to do this until after a switch to pacta.scenario.data.preparation is made

workflow.data.preparation/run_pacta_data_preparation.R

Lines 196 to 213 in 5801f95

    
    scenarios_long <- scenario_raw %>%
      inner_join(
        pacta.scenario.preparation::scenario_source_pacta_geography_bridge,
        by = c(
          scenario_source = "source",
          scenario_geography = "scenario_geography_source"
          )
        ) %>%
      select(-"scenario_geography") %>%
      rename(scenario_geography = "scenario_geography_pacta") %>%
      filter(
        .data$scenario_source %in% .env$scenario_sources_list,
        .data$ald_sector %in% c(.env$sector_list, .env$other_sector_list),
        .data$scenario_geography %in% unique(.env$scenario_regions$scenario_geography),
        .data$year %in% unique(
          c(.env$relevant_years, .env$market_share_target_reference_year + 10)
        )
      )

refactor to pacta.data.preparation (done in https://github.com/RMI-PACTA/pacta.data.preparation/pull/348)

workflow.data.preparation/run_pacta_data_preparation.R

Lines 260 to 270 in 5801f95

    
    ar_company_id__country_of_domicile <-
      entity_info %>%
      select("ar_company_id", "country_of_domicile") %>%
      filter(!is.na(.data$ar_company_id)) %>%
      distinct()
    ar_company_id__credit_parent_ar_company_id <-
      entity_info %>%
      select("ar_company_id", "credit_parent_ar_company_id") %>%
      filter(!is.na(.data$ar_company_id)) %>%
      distinct()

refactor to pacta.data.preparation (done in https://github.com/RMI-PACTA/pacta.data.preparation/pull/360 and Use pacta.data.preparation::prepare_masterdata_debt() #211)

workflow.data.preparation/run_pacta_data_preparation.R

Lines 291 to 320 in 5801f95

    
    masterdata_debt <- readr::read_csv(masterdata_debt_path, na = "", show_col_types = FALSE)
    company_id__creditor_company_id <-
      masterdata_debt %>%
      select("company_id", "creditor_company_id") %>%
      distinct() %>%
      mutate(across(.cols = dplyr::everything(), .fns = as.character))
    masterdata_debt %>%
      pacta.data.preparation::prepare_masterdata(
        ar_company_id__country_of_domicile,
        pacta_financial_timestamp,
        zero_emission_factor_techs
      ) %>%
      left_join(company_id__creditor_company_id, by = c(id = "company_id")) %>%
      left_join(ar_company_id__credit_parent_ar_company_id, by = c(id = "ar_company_id")) %>%
      mutate(id = if_else(!is.na(.data$credit_parent_ar_company_id), .data$credit_parent_ar_company_id, .data$id)) %>%
      mutate(id = if_else(!is.na(.data$creditor_company_id), .data$creditor_company_id, .data$id)) %>%
      mutate(id_name = "credit_parent_ar_company_id") %>%
      group_by(
        .data$id, .data$id_name, .data$ald_sector, .data$ald_location,
        .data$technology, .data$year, .data$country_of_domicile,
        .data$ald_production_unit, .data$ald_emissions_factor_unit,
      ) %>%
      summarise(
        ald_emissions_factor = stats::weighted.mean(.data$ald_emissions_factor, .data$ald_production, na.rm = TRUE),
        ald_production = sum(.data$ald_production, na.rm = TRUE),
        .groups = "drop"
      ) %>%
      saveRDS(file.path(data_prep_outputs_path, "masterdata_debt_datastore.rds"))

refactor to pacta.data.preparation (done in https://github.com/RMI-PACTA/pacta.data.preparation/pull/348 and https://github.com/RMI-PACTA/pacta.data.preparation/pull/353)

workflow.data.preparation/run_pacta_data_preparation.R

Lines 350 to 370 in 5801f95

    
    ar_company_id__sectors_with_assets__ownership <-
      readRDS(file.path(data_prep_outputs_path, "masterdata_ownership_datastore.rds")) %>%
      filter(year %in% relevant_years) %>%
      select(ar_company_id = id, ald_sector) %>%
      distinct() %>%
      group_by(ar_company_id) %>%
      summarise(sectors_with_assets = paste(unique(ald_sector), collapse = " + "))
    financial_data %>%
      left_join(factset_entity_id__ar_company_id, by = "factset_entity_id") %>%
      left_join(factset_entity_id__security_mapped_sector, by = "factset_entity_id") %>%
      left_join(ar_company_id__sectors_with_assets__ownership, by = "ar_company_id") %>%
      mutate(has_asset_level_data = if_else(is.na(sectors_with_assets) | sectors_with_assets == "", FALSE, TRUE)) %>%
      mutate(has_ald_in_fin_sector = if_else(stringr::str_detect(sectors_with_assets, security_mapped_sector), TRUE, FALSE)) %>%
      select(
        isin,
        has_asset_level_data,
        has_ald_in_fin_sector,
        sectors_with_assets
      ) %>%
      saveRDS(file.path(data_prep_outputs_path, "abcd_flags_equity.rds"))

refactor to pacta.data.preparation (done in https://github.com/RMI-PACTA/pacta.data.preparation/pull/348 and https://github.com/RMI-PACTA/pacta.data.preparation/pull/353)

workflow.data.preparation/run_pacta_data_preparation.R

Lines 375 to 403 in 5801f95

    
    ar_company_id__sectors_with_assets__debt <-
      readRDS(file.path(data_prep_outputs_path, "masterdata_debt_datastore.rds")) %>%
      filter(year %in% relevant_years) %>%
      select(ar_company_id = id, ald_sector) %>%
      distinct() %>%
      group_by(ar_company_id) %>%
      summarise(sectors_with_assets = paste(unique(ald_sector), collapse = " + "))
    financial_data %>%
      left_join(factset_entity_id__ar_company_id, by = "factset_entity_id") %>%
      left_join(factset_entity_id__security_mapped_sector, by = "factset_entity_id") %>%
      left_join(ar_company_id__sectors_with_assets__debt, by = "ar_company_id") %>%
      mutate(has_asset_level_data = if_else(is.na(sectors_with_assets) | sectors_with_assets == "", FALSE, TRUE)) %>%
      mutate(has_ald_in_fin_sector = if_else(stringr::str_detect(sectors_with_assets, security_mapped_sector), TRUE, FALSE)) %>%
      left_join(
        select(entity_info, "factset_entity_id", "credit_parent_id"),
        by = "factset_entity_id"
      ) %>%
      mutate(
        # If FactSet has no credit_parent, we define the company as it's own parent
        credit_parent_id = if_else(is.na(credit_parent_id), factset_entity_id, credit_parent_id)
      ) %>%
      group_by(credit_parent_id) %>%
      summarise(
        has_asset_level_data = sum(has_asset_level_data, na.rm = TRUE) > 0,
        has_ald_in_fin_sector = sum(has_ald_in_fin_sector, na.rm = TRUE) > 0,
        sectors_with_assets = paste(sort(unique(na.omit(unlist(str_split(sectors_with_assets, pattern = " [+] "))))), collapse = " + ")
      ) %>%
      ungroup() %>%

refactor to pacta.data.preparation (done in https://github.com/RMI-PACTA/pacta.data.preparation/pull/351)

workflow.data.preparation/run_pacta_data_preparation.R

Lines 421 to 441 in 5801f95

    
    fund_data <-
      fund_data %>%
      group_by(factset_fund_id, fund_reported_mv) %>%
      filter((fund_reported_mv[[1]] - sum(holding_reported_mv)) / fund_reported_mv[[1]] > -1e-5) %>%
      ungroup()
    # build MISSINGWEIGHT for under and over
    fund_missing_mv <-
      fund_data %>%
      group_by(factset_fund_id, fund_reported_mv) %>%
      summarise(
        holding_isin = "MISSINGWEIGHT",
        holding_reported_mv = fund_reported_mv[[1]] - sum(holding_reported_mv),
        .groups = "drop"
      ) %>%
      ungroup() %>%
      filter(holding_reported_mv != 0)
    fund_data %>%
      bind_rows(fund_missing_mv) %>%
      saveRDS(file.path(data_prep_outputs_path, "fund_data.rds"))
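
A small worked example may help show what this step does. The sketch below redoes the same two steps in base R on made-up data: drop funds whose holdings overshoot the reported market value by more than the 1e-5 relative tolerance, then add a `MISSINGWEIGHT` row for any remaining gap. It assumes each fund has a single `fund_reported_mv`, and it is not the code that ended up in `pacta.data.preparation`.

```r
# Made-up fund holdings: F1 is under-reported, F2 is complete, F3 overshoots.
fund_data <- data.frame(
  factset_fund_id = c("F1", "F1", "F2", "F2", "F3", "F3"),
  fund_reported_mv = c(100, 100, 50, 50, 10, 10),
  holding_isin = c("A", "B", "C", "D", "E", "F"),
  holding_reported_mv = c(60, 30, 25, 25, 8, 5)
)

# Sum of holdings per fund, and the gap to the reported market value.
totals <- aggregate(
  holding_reported_mv ~ factset_fund_id + fund_reported_mv,
  data = fund_data, FUN = sum
)
totals$gap <- totals$fund_reported_mv - totals$holding_reported_mv

# Step 1: drop funds whose holdings exceed the reported value beyond tolerance.
keep <- totals[totals$gap / totals$fund_reported_mv > -1e-5, ]
fund_data <- fund_data[fund_data$factset_fund_id %in% keep$factset_fund_id, ]

# Step 2: one MISSINGWEIGHT row per remaining fund with a non-zero gap.
fund_missing_mv <- data.frame(
  factset_fund_id = keep$factset_fund_id,
  fund_reported_mv = keep$fund_reported_mv,
  holding_isin = "MISSINGWEIGHT",
  holding_reported_mv = keep$gap
)
fund_missing_mv <- fund_missing_mv[fund_missing_mv$holding_reported_mv != 0, ]

fund_data_completed <- rbind(fund_data, fund_missing_mv)
print(fund_data_completed)
```

In this toy data F3 is removed entirely, F2 gets no extra row, and F1 gets a `MISSINGWEIGHT` holding for the unexplained part of its reported value.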

refactor to pacta.data.preparation (done in https://github.com/RMI-PACTA/pacta.data.preparation/pull/352)

workflow.data.preparation/run_pacta_data_preparation.R

Lines 453 to 476 in 5801f95

    
    isin_to_fund_table <- readRDS(factset_isin_to_fund_table_path)
    # filter out fsyms that have more than 1 row and no fund data
    isin_to_fund_table <-
      isin_to_fund_table %>%
      mutate(has_fund_data = factset_fund_id %in% fund_data$factset_fund_id) %>%
      group_by(fsym_id) %>%
      mutate(n = n()) %>%
      filter(n == 1 | (n > 1 & has_fund_data)) %>%
      ungroup() %>%
      select(-n, -has_fund_data)
    # filter out fsyms that have more than 1 row and have fund data for both rows
    isin_to_fund_table <-
      isin_to_fund_table %>%
      mutate(has_fund_data = factset_fund_id %in% fund_data$factset_fund_id) %>%
      group_by(fsym_id) %>%
      mutate(n = n()) %>%
      filter(!(all(has_fund_data) & n > 1)) %>%
      ungroup() %>%
      select(-n, -has_fund_data)
    isin_to_fund_table %>%
      saveRDS(file.path(data_prep_outputs_path, "isin_to_fund_table.rds"))
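
The two filters are easier to follow on a tiny example. Below is a base R sketch on made-up IDs that applies the same two rules in sequence; the object names are hypothetical and it is only meant to illustrate the logic, not the refactored package function.

```r
# Made-up mapping: S1 is unique; S2, S3, and S4 each map to two funds.
isin_to_fund <- data.frame(
  fsym_id = c("S1", "S2", "S2", "S3", "S3", "S4", "S4"),
  factset_fund_id = c("F1", "F2", "F3", "F4", "F5", "F6", "F7")
)
fund_ids_with_data <- c("F2", "F4", "F5")  # stands in for fund_data$factset_fund_id

# Rule 1: for fsyms with more than 1 row, keep only the rows that have fund data.
has_data <- isin_to_fund$factset_fund_id %in% fund_ids_with_data
n <- ave(seq_along(has_data), isin_to_fund$fsym_id, FUN = length)
step1 <- isin_to_fund[n == 1 | has_data, ]

# Rule 2: drop fsyms that still have more than 1 row, all of them with fund data.
has_data1 <- step1$factset_fund_id %in% fund_ids_with_data
n1 <- ave(seq_along(has_data1), step1$fsym_id, FUN = length)
all_have_data <- as.logical(ave(as.integer(has_data1), step1$fsym_id, FUN = min))
result <- step1[!(all_have_data & n1 > 1), ]
print(result)
```

Here S1 survives because it is unambiguous, S2 is resolved to the one fund that has data, S3 is dropped because both candidates have data, and S4 is dropped because neither does.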

refactor to pacta.data.preparation (done in https://github.com/RMI-PACTA/pacta.data.preparation/pull/348)

workflow.data.preparation/run_pacta_data_preparation.R

Lines 489 to 496 in 5801f95

    
    iss_company_emissions <-
      readRDS(factset_iss_emissions_data_path) %>%
      group_by(factset_entity_id) %>%
      summarise(
        icc_total_emissions = sum(icc_total_emissions + icc_scope_3_emissions, na.rm = TRUE),
        .groups = "drop"
      ) %>%
      mutate(icc_total_emissions_units = "tCO2e") # units are defined in the ISS/FactSet documentation (see #144)

refactor to pacta.data.preparation (done in https://github.com/RMI-PACTA/pacta.data.preparation/pull/361 and use new ISS prep functions #213)

workflow.data.preparation/run_pacta_data_preparation.R

Lines 502 to 533 in 5801f95

    
    iss_entity_emission_intensities <-
      readRDS(factset_entity_financing_data_path) %>%
      left_join(currencies, by = "currency") %>%
      mutate(
        ff_mkt_val = ff_mkt_val * exchange_rate,
        ff_debt = ff_debt * exchange_rate,
        currency = "USD"
      ) %>%
      select(-exchange_rate) %>%
      group_by(factset_entity_id, currency) %>%
      summarise(
        ff_mkt_val = sum(ff_mkt_val, na.rm = TRUE),
        ff_debt = sum(ff_debt, na.rm = TRUE),
        .groups = "drop"
      ) %>%
      inner_join(iss_company_emissions, by = "factset_entity_id") %>%
      transmute(
        factset_entity_id = factset_entity_id,
        emission_intensity_per_mkt_val = if_else(
          ff_mkt_val == 0,
          NA_real_,
          icc_total_emissions / ff_mkt_val
        ),
        emission_intensity_per_debt = if_else(
          ff_debt == 0,
          NA_real_,
          icc_total_emissions / ff_debt
        ),
        ff_mkt_val,
        ff_debt,
        units = paste0(icc_total_emissions_units, " / ", "$ USD")
      )
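
The core of this step is a ratio that is set to `NA` when the denominator is zero, so no `Inf` values leak into later aggregation. A minimal base R sketch of that guard, with a hypothetical helper name and made-up numbers:

```r
# Hypothetical helper: emissions per unit of financing, NA where financing is zero.
intensity_sketch <- function(emissions, financing) {
  ifelse(financing == 0, NA_real_, emissions / financing)
}

emissions <- c(1000, 200)  # tCO2e, made-up values
ff_mkt_val <- c(500, 0)    # USD, made-up values
intensity <- intensity_sketch(emissions, ff_mkt_val)
print(intensity)
```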

refactor to pacta.data.preparation (done in https://github.com/RMI-PACTA/pacta.data.preparation/pull/361 and use new ISS prep functions #213)

workflow.data.preparation/run_pacta_data_preparation.R

Lines 547 to 563 in 5801f95

    
    iss_entity_emission_intensities %>%
      inner_join(factset_entity_info, by = "factset_entity_id") %>%
      group_by(sector_code, factset_sector_desc, units) %>%
      summarise(
        emission_intensity_per_mkt_val = weighted.mean(
          emission_intensity_per_mkt_val,
          ff_mkt_val,
          na.rm = TRUE
        ),
        emission_intensity_per_debt = weighted.mean(
          emission_intensity_per_debt,
          ff_debt,
          na.rm = TRUE
        ),
        .groups = "drop"
      ) %>%
      ungroup() %>%
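
One detail worth spelling out: with `na.rm = TRUE`, `weighted.mean()` drops entities whose intensity is `NA` together with their weights, so they do not pull the sector average towards zero. A quick check with made-up numbers:

```r
# Made-up entity intensities and market values for one sector.
intensity <- c(2, NA, 4)
ff_mkt_val <- c(1, 5, 3)

# The NA entity and its weight of 5 are ignored: (2 * 1 + 4 * 3) / (1 + 3)
sector_intensity <- stats::weighted.mean(intensity, ff_mkt_val, na.rm = TRUE)
print(sector_intensity)
```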

AB#10388

jdhoffa · 2024-02-08T16:01:00Z

cc: @AlexAxthelm and @cjyetman this would get us closer to having "all methodology" in pacta.* and "all file I/O" in workflow.*

depends on https://github.com/RMI-PACTA/pacta.data.preparation/pull/333 works toward #94 - replaces explicit code in [run_pacta_data_preparation.R](https://github.com/RMI-PACTA/workflow.data.preparation/edit/main/run_pacta_data_preparation.R) with the function `pacta.data.preparation::determine_relevant_years()` which wraps all the necessary methodological logic and does robust input checking

cjyetman · 2024-02-10T13:48:09Z

note that any of the code here referencing masterdata* is unlikely to get added to pacta.data.preparation since we're very close to not using/relying on the masterdata files at all anymore

…` functions (#195) - [x] depends on https://github.com/RMI-PACTA/pacta.data.preparation/pull/353 - work towards #94 Co-authored-by: CJ Yetman - RMI <[email protected]>


cjyetman · 2024-04-05T08:01:43Z

all but the scenario stuff has been implemented, closing
(scenario stuff will defer until pacta.scenario.data.preparation is implemented)

jdhoffa added the potentially dangerous label Feb 8, 2024

This was referenced Feb 10, 2024

Proposal: Convert to Package infrastructure #77

Closed

use new pacta.data.preparation::determine_relevant_years() function #120

Closed

cjyetman mentioned this issue Feb 10, 2024

use new pacta.data.preparation::determine_relevant_years() function #121

Closed

jdhoffa mentioned this issue Feb 16, 2024

force garbage collection after large objects are removed #140

Merged

cjyetman added the priority label Mar 22, 2024

cjyetman added the ADO label Mar 25, 2024

cjyetman added a commit that referenced this issue Apr 5, 2024

use pacta.data.preparation::standardize_asset_type_names() (#210)

d9f22c6

- towards #94 - depends on https://github.com/RMI-PACTA/pacta.data.preparation/pull/359 Co-authored-by: CJ Yetman - RMI <[email protected]>

cjyetman added a commit that referenced this issue Apr 5, 2024

use pacta.data.preparation::determine_relevant_years() (#208)

d84dc80

- towards #94 - depends on https://github.com/RMI-PACTA/pacta.data.preparation/pull/357 --------- Co-authored-by: CJ Yetman - RMI <[email protected]>

cjyetman added a commit that referenced this issue Apr 5, 2024

Use pacta.data.preparation::prepare_masterdata_debt() (#211)

988ac5d

- towards #94 - depends on https://github.com/RMI-PACTA/pacta.data.preparation/pull/360 --------- Co-authored-by: CJ Yetman - RMI <[email protected]>

cjyetman added a commit that referenced this issue Apr 5, 2024

use new ISS prep functions (#213)

5b3c9bc

- towards #94 - depends on https://github.com/RMI-PACTA/pacta.data.preparation/pull/361 --------- Co-authored-by: CJ Yetman - RMI <[email protected]>

cjyetman closed this as completed Apr 5, 2024
