
remove FactSet database code #73

Merged 18 commits into main from remove-fact-set-database-code on Jan 24, 2024

Conversation

@cjyetman (Member) commented Jan 15, 2024

closes #72

FactSet database is gone, so this simplifies things a bit.

@cjyetman cjyetman marked this pull request as ready for review January 15, 2024 22:38
@jdhoffa (Member) commented Jan 16, 2024

I have so many questions lol. Hopefully they have simple answers:

This seems absurdly dangerous to me. The `get_*` functions have a lot of information encoded in them.

  • Is this functionality replicated somewhere?
  • How can we ensure consistency between whatever new process pulls these tables and what is pulled here / has been pulled in the past?
  • Where is the new functionality stored and documented?

In any case, this PR at least requires:

  • an update to the README noting that these datasets are now required as input
  • specific documentation for exactly how to reproduce each dataset, or a pointer to a repo containing the updated pulling functionality

When/ why did we stop using the FactSet database?

Otherwise LGTM /s 😭

@cjyetman (Member, Author) commented Jan 16, 2024

As I understand it, the database is gone and it ain't coming back. The FactSet `get_*` functions live in `pacta.data.preparation`, so they are not removed in this PR.

Not exactly sure, but I guess the process of getting FactSet data is now encoded in a new repo that @AlexAxthelm created.

> an update to the README noting that these datasets are now required as input

Fair enough. I updated the README to remove the relevant FactSet settings, but I can add a note about the pre-prepared FactSet data being available in the inputs directory once I figure out exactly what that looks like.

> specific documentation for exactly how to reproduce each dataset, or a pointer to a repo containing the updated pulling functionality

Not sure replication instructions are feasible to detail in the README, but yeah, we can link to the new FactSet downloader repo.

@jdhoffa (Member) commented Jan 16, 2024

Yeah, fair enough; I think I overreacted a bit. Mainly I want to be certain that we have a smooth process for pulling/preparing these datasets and stitching them in as inputs, since that's what they have to be now.

So yeah, then I guess the minimal changes required are:

  • explain FS datasets that are necessary as input in README
  • link to whatever repo now pulls them, @AlexAxthelm ?

AlexAxthelm previously approved these changes Jan 16, 2024
@cjyetman cjyetman marked this pull request as draft January 16, 2024 11:39
@jdhoffa (Member) left a review comment

See comment thread

@jdhoffa (Member) commented Jan 16, 2024

It seems that the repo that now pulls this is here: https://github.com/RMI-PACTA/workflow.factset

@AlexAxthelm (Collaborator) commented

> explain FS datasets that are necessary as input in README

@jdhoffa I agree that we should document the input format of the files that are exported by workflow.factset.

AlexAxthelm previously approved these changes Jan 19, 2024

(resolved review thread on README.md)
@cjyetman (Member, Author) commented
@AlexAxthelm I don't have the bandwidth today (literally and figuratively) to fix it, but...

I think these paths will need to be changed for this to work (assuming the pre-prepared FactSet data files would be placed in the same input folder as the AI files)

```r
# pre-flight filepaths ---------------------------------------------------------
scenarios_analysis_input_path <- file.path(data_prep_inputs_path, "Scenarios_AnalysisInput.csv")
scenario_regions_path <- file.path(data_prep_inputs_path, "scenario_regions.csv")
currencies_data_path <- file.path(data_prep_inputs_path, "currencies.rds")
factset_financial_data_path <- file.path(data_prep_inputs_path, "factset_financial_data.rds")
factset_entity_info_path <- file.path(data_prep_inputs_path, "factset_entity_info.rds")
factset_entity_financing_data_path <- file.path(data_prep_inputs_path, "factset_entity_financing_data.rds")
factset_fund_data_path <- file.path(data_prep_inputs_path, "factset_fund_data.rds")
factset_isin_to_fund_table_path <- file.path(data_prep_inputs_path, "factset_isin_to_fund_table.rds")
factset_iss_emissions_data_path <- file.path(data_prep_inputs_path, "factset_iss_emissions.rds")
```

@jdhoffa (Member) commented Jan 19, 2024

> @AlexAxthelm I don't have the bandwidth today (literally and figuratively) to fix it, but...
>
> I think these paths will need to be changed for this to work (assuming the pre-prepared FactSet data files would be placed in the same input folder as the AI files)
>
> [pre-flight filepaths code block quoted above]

I think the file paths would be the same? The process just drops everything into the inputs folder, so it would stay as the inputs folder.

However, I guess we could move the lines of code logically from "pre-flight" to "input"

@cjyetman (Member, Author) commented
ah crud, you're right... data_prep_inputs_path seems appropriate... I'm confusing myself in circles.... I was remembering some former state where I used an "intermediary" directory or something like that for the "pre-flight" files

carry on 😉

@jdhoffa (Member) commented Jan 19, 2024

> ah crud, you're right... data_prep_inputs_path seems appropriate... I'm confusing myself in circles.... I was remembering some former state where I used an "intermediary" directory or something like that for the "pre-flight" files
>
> carry on 😉

no problem. We tend to change things fast, frequently, and profoundly,
so I'd say it's pretty fair that you forget the current set-up hahahaha

still, I do think a minor improvement would be to put those lines under the "input" heading, since they're no longer really "pre-flight"

@AlexAxthelm AlexAxthelm marked this pull request as ready for review January 21, 2024 21:04
AlexAxthelm previously approved these changes Jan 21, 2024
@AlexAxthelm AlexAxthelm marked this pull request as draft January 21, 2024 21:04
@jdhoffa (Member) left a review comment

1 change requested in documentation, otherwise LGTM

(resolved review thread on README.md; outdated)
@cjyetman (Member, Author) commented
> factset_dataset could probably work, though it looks like it's just in the config file but not used anywhere in the script

you're correct that it's not referenced in the run_data_preparation.R script, but it is in copy_raw_data.R:

```r
factset_files <- list.files(
  path = file.path(
    cfg[["factset-extracted_path"]],
    cfg[["factset_dataset"]]
  ),
  include.dirs = FALSE,
  full.names = TRUE
)
logger::log_trace("factset_file: {factset_files}")
files_to_copy <- c(
  masterdata_debt,
  masterdata_ownership,
  ar_fs_bridge,
  factset_files
)
```

Not to get too much into an explanation of #75, but since I was trying not to change the logic of the core script in that PR, I added a script to copy the raw data files into a data prep inputs directory (from their home locations on AFS), and then run run_data_preparation.R as-is (plus some more logging). So that config param determines which FS files get copied into data prep inputs, roughly as sketched below.
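A minimal sketch of that copy step (an assumption, not code from the PR; `files_to_copy` comes from the snippet above, and `data_prep_inputs_path` is assumed to be the destination directory the data prep run reads from):

```r
# Sketch (assumed): copy the selected raw data files (AI masterdata, the
# AR/FS bridge, and the chosen FactSet extraction) into the data prep
# inputs directory before running run_data_preparation.R as-is.
copied <- file.copy(
  from = files_to_copy,
  to = data_prep_inputs_path,
  overwrite = TRUE
)
stopifnot(all(copied))  # fail loudly if any file did not copy
```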

I'm pretty concerned about changes like that. As the person who originally created this repo, I've implicitly been the go-to person for figuring out what happened when something goes wrong, and the more we deviate from the path that I laid out, the less comfortable I am with that responsibility. You're going to have to start taking over that responsibility if we keep moving away from a process that I'm comfortable with towards one that you're comfortable with.

@jdhoffa (Member) commented Jan 23, 2024

Seems like we're so close here

> you're correct that it's not referenced in the run_data_preparation.R script, but it is in copy_raw_data.R: [...]
>
> I'm pretty concerned about changes like that. [...]

Sorry to be nit-picky, but we are talking about code changes in a different PR now. It is totally valid to be critical of that change, but can we keep the conversation focused on this PR for now to avoid confusion? (for my sake haha)

@cjyetman (Member, Author) commented

> Sorry to be nit-picky, but we are talking about code changes in a different PR now. It is totally valid to be critical of that change, but can we keep the conversation focused on this PR for now to avoid confusion? (for my sake haha)

The relevance to this PR is that it is still uncertain how to specify which set of FactSet data one is using in a way that is traceable and repeatable.

@jdhoffa (Member) commented Jan 23, 2024

>> Sorry to be nit-picky, but we are talking about code changes in a different PR now. It is totally valid to be critical of that change, but can we keep the conversation focused on this PR for now to avoid confusion? (for my sake haha)
>
> The relevance to this PR is that it is still uncertain how to specify which set of FactSet data one is using in a way that is traceable and repeatable.

And there is a proposed way to handle this in an unmerged PR? If so, I think maybe we move the whole conversation to that PR (and flag it as a dependency for this one)

@cjyetman (Member, Author) commented

> And there is a proposed way to handle this in an unmerged PR? If so, I think maybe we move the whole conversation to that PR (and flag it as a dependency for this one)

There is a proposed way in a different PR that I don't really agree with, so seems a bit strange to let this PR be dependent on something else that may or may not ultimately exist as-is, no?

@jdhoffa (Member) commented Jan 23, 2024

Well, that's up to you I think.

If you think that

  1. The AI datasets have their filenames encoded in the parameters... etc.

is something that needs to be addressed before this PR is merged, and if there is no suitable way to do it in the current set-up, then this PR automatically has to depend on something that hasn't been done yet.

If you don't want to address it before merging, then we can avoid that dependency, but we may be merging something half-baked

@jdhoffa (Member) commented Jan 23, 2024

But in the case that we are discussing "whatever PR it is that tries to solve data provenance", I think discussions about that PR should occur on that PR?

@cjyetman (Member, Author) commented
I think that this PR is not fully baked, i.e. not ready to merge, if it does not have a way of tracking the provenance of the FactSet data; that would be a regression from main.

@jdhoffa (Member) commented Jan 23, 2024

Yup agreed.

Then this PR depends on either #75 or #78 (or some as-of-yet unopened PR), and discussions about how that is solved should happen there.

@AlexAxthelm (Collaborator) commented
I've added a comment in #75 outlining a change to that PR which might address this discussion, but basically the idea is:

  • Change from specifying a single "inputs" directory with all of the files in it, to specifying the path to each type of data (AI, FS)
  • If we want to make a copy of the data inputs, use a copy_inputs config option.

That would let us keep the current behavior by setting the AI and FS paths both to /inputs, but point to the actual raw files on Azure Files when we're running on ACI (without copying to an intermediate directory in between).
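As a rough sketch of what that might look like (option names are taken from this thread; the Azure Files paths are hypothetical, made up for illustration):

```yaml
# Current behavior: both dataset types live in the single mounted /inputs dir.
asset_impact_data_path: "/inputs"
factset_data_path: "/inputs"
copy_inputs: false

# On ACI, point directly at the raw files on Azure Files (hypothetical paths):
# asset_impact_data_path: "/mnt/azure-files/asset-impact"
# factset_data_path: "/mnt/azure-files/factset-extracted"
# copy_inputs: true
```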

@cjyetman (Member, Author) commented
Implemented `asset_impact_data_path` and `factset_data_path` as @AlexAxthelm suggested here

So for the way I use this repo, the expectation is that one would download the appropriate FactSet archive from Azure, unzip the tar.gz, and set the factset_data_path to the name of the unzipped directory, e.g. "factset-pacta_timestamp-20221231T000000Z_pulled-20231221T195325Z", from which the provenance of the FactSet files can be inferred and recorded in the manifest file. Hence:

```yaml
2022Q4:
  factset_data_path: "factset-pacta_timestamp-20221231T000000Z_pulled-20231221T195325Z"
```

and

```r
timestamps = list(
  imf_quarter_timestamp = imf_quarter_timestamp,
  factset_data_identifier = basename(factset_data_path),
  pacta_financial_timestamp = pacta_financial_timestamp
),
```
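As an illustration (not code from this PR), the two timestamps could be recovered from such a directory name with base R, assuming the "factset-pacta_timestamp-<ts>_pulled-<ts>" naming pattern shown above:

```r
# Recover the data and pull timestamps from the unzipped directory name,
# e.g. "factset-pacta_timestamp-20221231T000000Z_pulled-20231221T195325Z".
factset_data_identifier <- basename(factset_data_path)

data_timestamp   <- sub(".*_timestamp-([0-9TZ]+)_.*", "\\1", factset_data_identifier)
pulled_timestamp <- sub(".*_pulled-([0-9TZ]+)$",      "\\1", factset_data_identifier)

data_timestamp    # "20221231T000000Z"
pulled_timestamp  # "20231221T195325Z"
```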

@AlexAxthelm (Collaborator) left a review comment

One (non-critical) suggestion. Overall LGTM.

```yaml
data_prep_outputs_path: "/outputs"
asset_impact_data_path: "/inputs"
factset_data_path: ""
```
@AlexAxthelm (Collaborator) suggested a change:

```diff
-factset_data_path: ""
+factset_data_path: "/inputs"
```

If this is "", then the script would expect to find those files at the current working directory (which is unlikely). Defaulting to /inputs mirrors the current behavior of "throw everything into a single dir, and mount it to /inputs".

@cjyetman (Member, Author) commented Jan 23, 2024

So the strategy with the config file is:

  • List every possible parameter in the default config, but if a parameter has no logical default value, e.g. one that is necessarily specific to a given quarter, leave the default value null/blank. The intention is that if one has not overridden the default value in their selected config, it should fail, because there is no way this should work properly unless that value has been explicitly set.
  • In each of the quarter-specific configs, strive to minimize the number of parameters set by inheriting as much as possible from the default config, but for parameters that are necessarily specific to that quarter/config, explicitly set them here in GH, so that if one day someone comes back to ask "which version of FactSet is supposed to be used to build 2022Q4 data?" it will be clearly recorded here in the config. (Though, if one manually edited that before running data.prep, the setting they used would be captured in the manifest file... so if one was working backwards in the other direction, e.g. "what version of the FactSet data was used to build the PACTA data in this folder?", they could determine that by looking at the manifest in that folder.)

With the last few commits I made here, the factset_data_path is required for any given quarter. That's the only way I see at the moment to have it clearly known precisely which version of the FactSet data was/should be used. Otherwise they're just files from who knows where.
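A minimal sketch of that fail-fast idea (an assumption, not code from the PR; it presumes the {config} package's default-inheritance, which matches the behavior described above):

```r
library(config)

# Load the quarter-specific config; values not set there are inherited
# from the default config in the same file.
cfg <- config::get(config = "2022Q4", file = "config.yml")

# factset_data_path has no logical default, so fail if it was not
# explicitly overridden for this quarter.
if (is.null(cfg[["factset_data_path"]]) || !nzchar(cfg[["factset_data_path"]])) {
  stop("`factset_data_path` must be set explicitly in the quarter's config.")
}
```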

@cjyetman (Member, Author) commented

For further explanation: I'd prefer that the filenames of the FactSet data files have the timestamp integrated into them, but in lieu of that, the directory name they come from is the only place where that is currently defined.

@AlexAxthelm (Collaborator) commented

That makes sense. I've opened RMI-PACTA/workflow.factset#16, which can make good use of the changes in this PR.

@cjyetman (Member, Author) commented
@AlexAxthelm @jdhoffa so I think this is ready to go now, with the caveat that once RMI-PACTA/workflow.factset#16 lands, I'd prefer to adapt this to use the FactSet data filenames as the source of truth for the FactSet timestamp.

@cjyetman cjyetman dismissed jdhoffa’s stale review January 24, 2024 09:36

I believe the documentation has been updated/reorganized as per Jackson's suggestion

@AlexAxthelm (Collaborator) commented

Approved on my side.

@cjyetman cjyetman added this pull request to the merge queue Jan 24, 2024
Merged via the queue into main with commit b58da87 Jan 24, 2024
@cjyetman cjyetman deleted the remove-fact-set-database-code branch January 24, 2024 14:33
@jdhoffa (Member) commented Jan 24, 2024

Approved! Sorry, I missed this somehow.

cjyetman added a commit that referenced this pull request Feb 26, 2024

This was removed in #73 when the FactSet database pulling info was removed, but `ent_entity_affiliates_last_update` is still created above and its value/info is still worthwhile to have in the manifest.