Skip to content
This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

new clean_data library of functions for preparing new data #247

Closed
wants to merge 109 commits into from

Conversation

cjyetman
Copy link
Member

@cjyetman cjyetman commented Oct 22, 2020

WIP: draft PR to add a new clean_data.R script that contains improved versions of all the functions needed to pull in, clean, merge, etc. a new set of data for a PACTA analysis, including a robust testing regime for each of the functions.

this should address #344 before being considered complete
also should possibly address 2DegreesInvesting/create_interactive_report/issues/268

closes #235

@cjyetman cjyetman marked this pull request as draft October 22, 2020 07:51
@cjyetman cjyetman changed the title initial commit for new clean_data script with tests new clean_data library of functions for preparing new data Oct 22, 2020
@cjyetman
Copy link
Member Author

@Clare2D in the current version of get_and_clean_currency_data(), it calls set_currency_timestamp(), which sets all currency exchange rate values in column "ExchangeRate_2019Q2" equal to 1.

I don't understand what the purpose of that is. Can you explain it to me... or is it not necessary... or should it be doing something more robust (e.g. always set the currently most recent exchange rate to 1)?

https://github.com/2DegreesInvesting/PACTA_analysis/blob/9978f00a3e871571d8e37b71d9e7bee848a38f24/0_portfolio_input_check_functions.R#L203-L216

@Clare2D
Copy link
Contributor

Clare2D commented Oct 22, 2020

This was a temporary solution as we didn't had currency data for 2019Q2. Best to delete this. It won't impact what we're doing at the moment. Thanks.

@cjyetman
Copy link
Member Author

cjyetman commented Oct 22, 2020

@Clare2D the get_average_emission_data() and get_company_emission_data() functions return an empty data frame if the first and only argument inc_emission_factors is FALSE

Is this desired/necessary behavior? My intuition would be that these functions should always return what they're intended to return (actual data) and it should be up to the caller to decide whether to ask for the data at all and/or what to do with the data once it has received it.

For example, I would think that this...

if (inc_emission_factors) { 
  average_emission_data <- get_average_emission_data()
} else {
  average_emission_data <- tibble()
}

or this...

average_emission_data <- get_average_emission_data()
if (!inc_emission_factors) { average_emission_data <- tibble() }

is more logical than...

average_emission_data <- get_average_emission_data(inc_emission_factors)

https://github.com/2DegreesInvesting/PACTA_analysis/blob/9978f00a3e871571d8e37b71d9e7bee848a38f24/0_portfolio_input_check_functions.R#L1288-L1304


same applies to the get_and_clean_revenue_data() function and its has_revenue argument
https://github.com/2DegreesInvesting/PACTA_analysis/blob/9978f00a3e871571d8e37b71d9e7bee848a38f24/0_portfolio_input_check_functions.R#L977-L991

@cjyetman cjyetman self-assigned this Oct 22, 2020
@cjyetman
Copy link
Member Author

cjyetman commented Oct 22, 2020

@Clare2D the current version of get_and_clean_fund_data() returns NA if it can't do what it's supposed to do, which is not even the same object type (data frame) of what you would expect to get if the function returned what it usually should. That seems like a bad idea. Presumably, the objective is to allow the caller to call this function and not cause an error that halts the entire script if something goes wrong with just this function?

One alternative way to achieve this would be to add an additional function that first tests whether get_and_clean_fund_data() can finish without error, and then only calling get_and_clean_fund_data() if that passes, like...

if (can_get_fund_data()) {
  fund_data <- get_and_clean_fund_data()
} else {
  fund_data <- tibble()  # or fund_data <- NA if really necessary
}

or at least (if it doesn't cause errors elsewhere) it would be preferable to return an object of the same type.

thoughts anyone (@maurolepore @jdhoffa @jacobvjk)?

https://github.com/2DegreesInvesting/PACTA_analysis/blob/9978f00a3e871571d8e37b71d9e7bee848a38f24/0_portfolio_input_check_functions.R#L863-L878

@jacobvjk
Copy link
Member

One alternative way to achieve this would be to add an additional function that first tests whether get_and_clean_fund_data() can finish without error, and then only calling get_and_clean_fund_data() if that passes, like...

if (can_get_fund_data()) {
  fund_data <- get_and_clean_fund_data()
} else {
  fund_data <- tibble()  # or fund_data <- NA if really necessary
}

or at least (if it doesn't cause errors elsewhere) it would be preferable to return an object of the same type.

thoughts anyone (@maurolepore @jdhoffa @jacobvjk)?

https://github.com/2DegreesInvesting/PACTA_analysis/blob/9978f00a3e871571d8e37b71d9e7bee848a38f24/0_portfolio_input_check_functions.R#L863-L878

From what I can tell, the fund_data object is being used in the next function call get_and_clean_fin_data(fund_data), where it's being checked for whether there are any relevant rows in it (data_check(fund_data) https://github.com/2DegreesInvesting/PACTA_analysis/blob/9978f00a3e871571d8e37b71d9e7bee848a38f24/0_global_functions.R#L2) and if so, some fund_isin Bloomberg check is done check_funds_wo_bbg(fund_data) https://github.com/2DegreesInvesting/PACTA_analysis/blob/9978f00a3e871571d8e37b71d9e7bee848a38f24/0_portfolio_input_check_functions.R#L611

So it would indeed seem like the safest approach to return an empty tibble, as far as I can tell.. maybe someone else knows of any other issues related to this?

@maurolepore
Copy link
Contributor

@cjyetman is this stale, WIP, or ready for review?

@cjyetman
Copy link
Member Author

cjyetman commented Nov 2, 2020

WIP... or better WOTBB (work-on-the-back-burner)

@AlexAxthelm AlexAxthelm marked this pull request as ready for review September 11, 2022 10:44
@jdhoffa
Copy link
Member

jdhoffa commented Nov 22, 2022

This seems stale, and also in a deprecated repo. Ok to Close?

@cjyetman
Copy link
Member Author

Can you imagine trying to resolve all the merge conflicts AND adapting this to work with the new data!?!? LOL closing

@cjyetman cjyetman closed this Nov 22, 2022
@cjyetman cjyetman deleted the new-clean_data-script branch November 22, 2022 17:29
@jdhoffa
Copy link
Member

jdhoffa commented Nov 22, 2022

hhahahhahaa, sounds like a fun Xmas holiday task :-O

@Clare2D
Copy link
Contributor

Clare2D commented Nov 23, 2022

Just got an email about this... Just wanted to say... you should definitely work on this over xmas! Hope you're all well!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extract data processing, cleaning and renaming
5 participants