
remove FactSet database code #73

Merged 18 commits into main from remove-fact-set-database-code on Jan 24, 2024

Conversation

@cjyetman (Member) commented Jan 15, 2024

closes #72

FactSet database is gone, so this simplifies things a bit.

@cjyetman cjyetman marked this pull request as ready for review January 15, 2024 22:38
@jdhoffa (Member) commented Jan 16, 2024

I have so many questions lol. Hopefully they have simple answers:

This seems absurdly dangerous to me. The `get_*` functions have a lot of information encoded in them.

  • Is this functionality replicated somewhere?
  • How can we ensure consistency between whatever new process pulls these tables and what is pulled here / has been pulled in the past?
  • Where is the new functionality stored and documented?

In any case, this PR at least requires:

  • an update to the README noting that these datasets are now required as input
  • specific documentation for exactly how to reproduce each dataset, or a pointer to a repo containing the updated pulling functionality

When/ why did we stop using the FactSet database?

Otherwise LGTM /s 😭

@cjyetman (Member, Author) commented Jan 16, 2024

As I understand it, the database is gone and it ain't coming back. The FactSet `get_*` functions live in `pacta.data.preparation`, so they are not removed in this PR.

Not exactly sure, but I guess the process of getting FactSet data is now encoded in a new repo that @AlexAxthelm created.

> an update to the README noting that these datasets are now required as input

Fair enough. I updated the README to remove the relevant FactSet settings, but I can add a note about the pre-prepared FactSet data being available in the inputs directory once I figure out exactly what that looks like.

> specific documentation for exactly how to reproduce each dataset, or a pointer to a repo containing the updated pulling functionality

Not sure replication instructions are feasible to detail in the README, but yeah, we can link to the new FactSet downloader repo.

@jdhoffa (Member) commented Jan 16, 2024

Yeah, fair enough; I think I overreacted a bit. Mainly I want to be certain that we have a smooth process for pulling/preparing these datasets and stitching them in as inputs, since that's what they have to be now.

So yeah, then I guess the minimal changes required are:

  • explain FS datasets that are necessary as input in README
  • link to whatever repo now pulls them, @AlexAxthelm ?

AlexAxthelm previously approved these changes Jan 16, 2024
@cjyetman cjyetman marked this pull request as draft January 16, 2024 11:39
@jdhoffa (Member) left a review comment

See comment thread

@jdhoffa (Member) commented Jan 16, 2024

It seems that the repo that now pulls this is here: https://github.com/RMI-PACTA/workflow.factset

@AlexAxthelm (Collaborator) commented

> explain FS datasets that are necessary as input in README

@jdhoffa I agree that we should document the input format of the files that are exported by workflow.factset.

AlexAxthelm previously approved these changes Jan 19, 2024

(resolved review thread on README.md)
@cjyetman (Member, Author) commented
@AlexAxthelm I don't have the bandwidth today (literally and figuratively) to fix it, but...

I think these paths will need to be changed for this to work (assuming the pre-prepared FactSet data files would be placed in the same input folder as the AI files)

```r
# pre-flight filepaths ---------------------------------------------------------
scenarios_analysis_input_path <- file.path(data_prep_inputs_path, "Scenarios_AnalysisInput.csv")
scenario_regions_path <- file.path(data_prep_inputs_path, "scenario_regions.csv")
currencies_data_path <- file.path(data_prep_inputs_path, "currencies.rds")
factset_financial_data_path <- file.path(data_prep_inputs_path, "factset_financial_data.rds")
factset_entity_info_path <- file.path(data_prep_inputs_path, "factset_entity_info.rds")
factset_entity_financing_data_path <- file.path(data_prep_inputs_path, "factset_entity_financing_data.rds")
factset_fund_data_path <- file.path(data_prep_inputs_path, "factset_fund_data.rds")
factset_isin_to_fund_table_path <- file.path(data_prep_inputs_path, "factset_isin_to_fund_table.rds")
factset_iss_emissions_data_path <- file.path(data_prep_inputs_path, "factset_iss_emissions.rds")
```

@jdhoffa (Member) commented Jan 19, 2024

> @AlexAxthelm I don't have the bandwidth today (literally and figuratively) to fix it, but...
>
> I think these paths will need to be changed for this to work (assuming the pre-prepared FactSet data files would be placed in the same input folder as the AI files)
>
> [pre-flight filepaths code block quoted above]

I think the file paths would be the same? The process just drops everything into the inputs folder, so it would stay as the inputs folder.

However, I guess we could move the lines of code logically from "pre-flight" to "input"

@cjyetman (Member, Author) commented
ah crud, you're right... data_prep_inputs_path seems appropriate... I'm confusing myself in circles.... I was remembering some former state where I used an "intermediary" directory or something like that for the "pre-flight" files

carry on 😉

@jdhoffa (Member) commented Jan 19, 2024

> ah crud, you're right... data_prep_inputs_path seems appropriate... I'm confusing myself in circles.... I was remembering some former state where I used an "intermediary" directory or something like that for the "pre-flight" files
>
> carry on 😉

no problem. We tend to change things fast, frequently, and profoundly,
so I'd say it's pretty fair that you forget the current set-up hahahaha

still, I do think a minor improvement would be to put those lines under the "input" heading, since they're no longer really "pre-flight"

@AlexAxthelm AlexAxthelm marked this pull request as ready for review January 21, 2024 21:04
AlexAxthelm previously approved these changes Jan 21, 2024
@AlexAxthelm AlexAxthelm marked this pull request as draft January 21, 2024 21:04
@jdhoffa (Member) left a review comment

1 change requested in documentation, otherwise LGTM

(resolved review thread on README.md; outdated)
@cjyetman (Member, Author) commented
> factset_dataset could probably work, though it looks like it's just in the config file but not used anywhere in the script

you're correct that it's not referenced in the run_data_preparation.R script, but it is in copy_raw_data.R:

```r
factset_files <- list.files(
  path = file.path(
    cfg[["factset-extracted_path"]],
    cfg[["factset_dataset"]]
  ),
  include.dirs = FALSE,
  full.names = TRUE
)
logger::log_trace("factset_file: {factset_files}")
files_to_copy <- c(
  masterdata_debt,
  masterdata_ownership,
  ar_fs_bridge,
  factset_files
)
```

Not to get too much into an explanation of #75, but since I was trying not to change the logic of the core script in that PR, I added a script to copy the raw data files into a data prep inputs directory (from their home locations on AFS), and then run run_data_preparation.R as-is (plus some more logging). So that config param determines which FS files get copied into data prep inputs, roughly as sketched below.
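A minimal sketch of that copy step (an assumption, not code from the PR; `files_to_copy` comes from the snippet above, and `data_prep_inputs_path` is assumed to be the destination directory the data prep run reads from):

```r
# Sketch (assumed): copy the selected raw data files (AI masterdata, the
# AR/FS bridge, and the chosen FactSet extraction) into the data prep
# inputs directory before running run_data_preparation.R as-is.
copied <- file.copy(
  from = files_to_copy,
  to = data_prep_inputs_path,
  overwrite = TRUE
)
stopifnot(all(copied))  # fail loudly if any file did not copy
```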

I'm pretty concerned about changes like that. As the person who originally created this repo, I've implicitly been the go-to person for figuring out what happened when something goes wrong, and the more we deviate from the path that I laid out, the less comfortable I am with that responsibility. You're going to have to start taking over that responsibility if we keep moving away from a process that I'm comfortable with towards one that you're comfortable with.

@jdhoffa (Member) commented Jan 23, 2024

Seems like we're so close here

> you're correct that it's not referenced in the run_data_preparation.R script, but it is in copy_raw_data.R: [...]
>
> I'm pretty concerned about changes like that. [...]

Sorry to be nit-picky, but we are talking about code changes in a different PR now. It is totally valid to be critical of that change, but can we keep the conversation focused on this PR for now to avoid confusion? (for my sake haha)

@cjyetman (Member, Author) commented

> Sorry to be nit-picky, but we are talking about code changes in a different PR now. It is totally valid to be critical of that change, but can we keep the conversation focused on this PR for now to avoid confusion? (for my sake haha)

The relevance to this PR is that it is still uncertain how to specify which set of FactSet data one is using in a way that is traceable and repeatable.

@jdhoffa (Member) commented Jan 23, 2024

>> Sorry to be nit-picky, but we are talking about code changes in a different PR now. It is totally valid to be critical of that change, but can we keep the conversation focused on this PR for now to avoid confusion? (for my sake haha)
>
> The relevance to this PR is that it is still uncertain how to specify which set of FactSet data one is using in a way that is traceable and repeatable.

And there is a proposed way to handle this in an unmerged PR? If so, I think maybe we move the whole conversation to that PR (and flag it as a dependency for this one)

@cjyetman (Member, Author) commented

> And there is a proposed way to handle this in an unmerged PR? If so, I think maybe we move the whole conversation to that PR (and flag it as a dependency for this one)

There is a proposed way in a different PR that I don't really agree with, so seems a bit strange to let this PR be dependent on something else that may or may not ultimately exist as-is, no?

@jdhoffa (Member) commented Jan 23, 2024

Well, that's up to you I think.

If you think that

  1. The AI datasets have their filenames encoded in the parameters... etc.

is something that needs to be addressed before this PR is merged, and if there is no suitable way to do it in the current set-up, then this PR automatically has to depend on something that hasn't been done yet.

If you don't want to address it before merging, then we can avoid that dependency, but we may be merging something half-baked

@jdhoffa (Member) commented Jan 23, 2024

But in the case that we are discussing "whatever PR it is that tries to solve data provenance", I think discussions about that PR should occur on that PR?

@cjyetman (Member, Author) commented
I think that this PR is not fully baked, i.e. not ready to merge, if it does not have a way of tracking the provenance of the FactSet data; that would be a regression from main.

@jdhoffa (Member) commented Jan 23, 2024

Yup agreed.

Then this PR depends on either #75 or #78 (or some as-of-yet unopened PR), and discussions about how that is solved should happen there.

@AlexAxthelm (Collaborator) commented
I've added a comment in #75 outlining a change to that PR which might address this discussion, but basically the idea is:

  • Change from specifying a single "inputs" directory with all of the files in it, to specifying the path to each type of data (AI, FS)
  • If we want to make a copy of the data inputs, use a copy_inputs config option.

That would let us keep the current behavior by setting the AI and FS paths both to /inputs, but point to the actual raw files on Azure Files when we're running on ACI (without copying to an intermediate directory in between).
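As a rough sketch of what that might look like (option names are taken from this thread; the Azure Files paths are hypothetical, made up for illustration):

```yaml
# Current behavior: both dataset types live in the single mounted /inputs dir.
asset_impact_data_path: "/inputs"
factset_data_path: "/inputs"
copy_inputs: false

# On ACI, point directly at the raw files on Azure Files (hypothetical paths):
# asset_impact_data_path: "/mnt/azure-files/asset-impact"
# factset_data_path: "/mnt/azure-files/factset-extracted"
# copy_inputs: true
```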

@cjyetman (Member, Author) commented
Implemented `asset_impact_data_path` and `factset_data_path` as @AlexAxthelm suggested here

So for the way I use this repo, the expectation is that one would download the appropriate FactSet archive from Azure, unzip the tar.gz, and set the factset_data_path to the name of the unzipped directory, e.g. "factset-pacta_timestamp-20221231T000000Z_pulled-20231221T195325Z", from which the provenance of the FactSet files can be inferred and recorded in the manifest file. Hence:

```yaml
2022Q4:
  factset_data_path: "factset-pacta_timestamp-20221231T000000Z_pulled-20231221T195325Z"
```

and

```r
timestamps = list(
  imf_quarter_timestamp = imf_quarter_timestamp,
  factset_data_identifier = basename(factset_data_path),
  pacta_financial_timestamp = pacta_financial_timestamp
),
```
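As an illustration (not code from this PR), the two timestamps could be recovered from such a directory name with base R, assuming the "factset-pacta_timestamp-<ts>_pulled-<ts>" naming pattern shown above:

```r
# Recover the data and pull timestamps from the unzipped directory name,
# e.g. "factset-pacta_timestamp-20221231T000000Z_pulled-20231221T195325Z".
factset_data_identifier <- basename(factset_data_path)

data_timestamp   <- sub(".*_timestamp-([0-9TZ]+)_.*", "\\1", factset_data_identifier)
pulled_timestamp <- sub(".*_pulled-([0-9TZ]+)$",      "\\1", factset_data_identifier)

data_timestamp    # "20221231T000000Z"
pulled_timestamp  # "20231221T195325Z"
```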

@AlexAxthelm (Collaborator) left a review comment

One (non-critical) suggestion. Overall LGTM.

```yaml
data_prep_outputs_path: "/outputs"
asset_impact_data_path: "/inputs"
factset_data_path: ""
```
@AlexAxthelm (Collaborator) suggested a change:

```diff
-factset_data_path: ""
+factset_data_path: "/inputs"
```

If this is "", then the script would expect to find those files at the current working directory (which is unlikely). Defaulting to /inputs mirrors the current behavior of "throw everything into a single dir, and mount it to /inputs".

@cjyetman (Member, Author) commented Jan 23, 2024

So the strategy with the config file is:

  • List every possible parameter in the default config, but if a parameter has no logical default value, e.g. one that is necessarily specific to a given quarter, leave the default value null/blank. The intention is that if one has not overridden the default value in their selected config, it should fail, because there is no way this should work properly unless that value has been explicitly set.
  • In each of the quarter-specific configs, strive to minimize the number of parameters set by inheriting as much as possible from the default config, but for parameters that are necessarily specific to that quarter/config, explicitly set them here in GH, so that if one day someone comes back to ask "which version of FactSet is supposed to be used to build 2022Q4 data?" it will be clearly recorded here in the config. (Though, if one manually edited that before running data.prep, the setting they used would be captured in the manifest file... so if one was working backwards in the other direction, e.g. "what version of the FactSet data was used to build the PACTA data in this folder?", they could determine that by looking at the manifest in that folder.)

With the last few commits I made here, the factset_data_path is required for any given quarter. That's the only way I see at the moment to have it clearly known precisely which version of the FactSet data was/should be used. Otherwise they're just files from who knows where.
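A minimal sketch of that fail-fast idea (an assumption, not code from the PR; it presumes the {config} package's default-inheritance, which matches the behavior described above):

```r
library(config)

# Load the quarter-specific config; values not set there are inherited
# from the default config in the same file.
cfg <- config::get(config = "2022Q4", file = "config.yml")

# factset_data_path has no logical default, so fail if it was not
# explicitly overridden for this quarter.
if (is.null(cfg[["factset_data_path"]]) || !nzchar(cfg[["factset_data_path"]])) {
  stop("`factset_data_path` must be set explicitly in the quarter's config.")
}
```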

@cjyetman (Member, Author) commented

For further explanation: I'd prefer that the filenames of the FactSet data files have the timestamp integrated into them, but in lieu of that, the directory name they come from is the only place where that is currently defined.

@AlexAxthelm (Collaborator) commented

That makes sense. I've opened RMI-PACTA/workflow.factset#16, which can make good use of the changes in this PR.

@cjyetman (Member, Author) commented
@AlexAxthelm @jdhoffa so I think this is ready to go now, with the caveat that once RMI-PACTA/workflow.factset#16 lands, I'd prefer to adapt this to use the FactSet data filenames as the source of truth for the FactSet timestamp.

@cjyetman cjyetman dismissed jdhoffa’s stale review January 24, 2024 09:36

I believe the documentation has been updated/reorganized as per Jackson's suggestion

@AlexAxthelm (Collaborator) commented

Approved on my side.

@cjyetman cjyetman added this pull request to the merge queue Jan 24, 2024
Merged via the queue into main with commit b58da87 Jan 24, 2024
@cjyetman cjyetman deleted the remove-fact-set-database-code branch January 24, 2024 14:33
@jdhoffa (Member) commented Jan 24, 2024

Approved! Sorry, I missed this somehow.

cjyetman added a commit that referenced this pull request Feb 26, 2024

This was removed in #73 when the FactSet database pulling info was removed, but `ent_entity_affiliates_last_update` is still created above and its value/info is still worthwhile to have in the manifest.