remove FactSet database code #73
Conversation
I have so many questions lol. Hopefully they have simple answers: This seems absurdly dangerous to me. The 'get_*' functions have a lot of information encoded in them. Is this functionality replicated somewhere? In any case, this PR at least requires:
When/why did we stop using the FactSet database? Otherwise LGTM /s 😭
As I understand it, the database is gone and it ain't coming back. Not exactly sure, but I guess the process of getting FactSet data is now encoded in a new repo that @AlexAxthelm created.
Fair enough. I updated the README to remove the relevant FactSet settings, but I can add a note about the pre-prepared FactSet data being available in the inputs directory once I figure out exactly what that looks like.
not sure replication instructions are feasible to detail in the README, but yeah we can link to the new FactSet downloader repo
Yeah, fair enough. I think I overreacted a bit, but mainly I want to be certain that we have a smooth process for pulling/preparing these datasets and stitching them in as inputs, since that's what they have to be now. So yeah, then I guess minimal changes required are:
See comment thread
It seems that the repo that now pulls this is here: https://github.com/RMI-PACTA/workflow.factset
@AlexAxthelm I don't have the bandwidth today (literally and figuratively) to fix it, but... I think these paths will need to be changed for this to work (assuming the pre-prepared FactSet data files would be placed in the same input folder as the AI files): workflow.data.preparation/run_pacta_data_preparation.R, lines 72 to 82 at 7c9feed
I think the file paths would be the same? The process just drops everything into the inputs folder, so it would stay as the inputs folder. However, I guess we could move the lines of code logically from "pre-flight" to "input"
ah crud, you're right... carry on 😉
no problem. We tend to change things: fast, frequently, and profoundly. Still, I do think a minor improvement would be to put the lines under the "input" heading, since there's no longer really a 'pre-flight'
1 change requested in documentation, otherwise LGTM
I'm pretty concerned about changes like that. As the person that originally created this repo, I've implicitly been the go-to person for figuring out what happened when something goes wrong, and the more we deviate from the path that I laid out, the less comfortable I feel with that responsibility. So you're going to have to start taking over that responsibility if we keep moving away from a process that I'm comfortable with towards one that you're comfortable with.
Seems like we're so close here
Sorry to be nit-picky, but we are talking about code changes in a different PR now. It is totally valid to be critical of that change, but can we keep the conversation focused on this PR for now to avoid confusion? (for my sake haha)
The relevance to this PR is that it is still uncertain how to specify which set of FactSet data one is using in a way that is traceable and repeatable.
And there is a proposed way to handle this in an unmerged PR? If so, I think maybe we move the whole conversation to that PR (and flag it as a dependency for this one).
There is a proposed way in a different PR that I don't really agree with, so it seems a bit strange to let this PR be dependent on something else that may or may not ultimately exist as-is, no?
Well, that's up to you, I think. If you think that is something that needs to be addressed before this PR is merged, and if there is no suitable way to do it in the current set-up, then this PR automatically has to depend on something that hasn't been done yet. If you don't want to address it before merging, then we can avoid that dependency, but we may be merging something half-baked.
But, in the case that we are discussing "whatever the PR is that tries to solve data provenance", I think discussions about that PR should occur on that PR?
I think that this PR is not fully baked, i.e. ready to merge, if it does not have a way of tracking the provenance of the FactSet data, which would be a regression from main. |
I've added a comment in #75 outlining a change to that PR which might address this discussion, but basically the idea is:
That would let us keep the current behavior by setting the AI and FS paths both to the same mounted inputs directory.
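For illustration only, a minimal sketch of that idea using the config keys that come up later in this thread (the shared /inputs value is an assumption, not the final default):

```yaml
# hypothetical profile keeping the current single-directory behavior:
# both the Asset Impact (AI) and FactSet (FS) files sit in one mounted dir
asset_impact_data_path: "/inputs"
factset_data_path: "/inputs"
```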
Implemented. So for the way I use this repo, the expectation is that one would download the appropriate FactSet archive from Azure, unzip the tar.gz, and point the FactSet data path in the config at the result (see workflow.data.preparation/config.yml, lines 87 to 88 at 1f25f76, and workflow.data.preparation/run_pacta_data_preparation.R, lines 785 to 789 at 1f25f76).
One (non-critical) suggestion. Overall LGTM.
data_prep_outputs_path: "/outputs"
asset_impact_data_path: "/inputs"
factset_data_path: ""
Suggested change:
- factset_data_path: ""
+ factset_data_path: "/inputs"
If this is `""`, then the script would expect to find those files at the current working directory (which is unlikely). Defaulting to `/inputs` mirrors the current behavior of "throw everything into a single dir, and mount it to `/inputs`."
so the strategy with the config file is:
- list every possible parameter in the default config, but if it's a parameter that does not have a logical default value, e.g. one that is necessarily specific to a given quarter, then leave the default value null/blank/whatever. The intention there is that if one has not overridden the default value in their selected config, then it should fail, because there's no way this should work properly unless that has been explicitly set
- in each of the quarter-specific configs, strive to minimize the number of parameters set by inheriting as much as possible from the default config, but for parameters that are necessarily specific to that quarter/config, explicitly set them here in GH so that if one day someone comes back to ask "which version of FactSet is supposed to be used to build 2022Q4 data?" it will be clearly recorded here in the config. (Though, if one manually edited that before running data.prep, the setting they used would be captured in the manifest file... so if one was working backwards in the other direction, e.g. "what version of the FactSet data was used to build the PACTA data in this folder?", they could determine that by looking at the manifest in that folder.)
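As a rough sketch of that strategy (assuming config.yml follows the R config package's default/named-profile inheritance; the profile name and path value below are hypothetical, only the key names appear in this PR):

```yaml
default:
  data_prep_outputs_path: "/outputs"
  asset_impact_data_path: "/inputs"
  # no logical default; data prep should fail if a quarter config doesn't set this
  factset_data_path: ""

2022Q4:
  # everything else is inherited from default; only the quarter-specific value is set here
  factset_data_path: "/inputs/factset-2022Q4"
```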
With the last few commits I made here, the `factset_data_path` is required for any given quarter. That's the only way I see at the moment to have it clearly known precisely which version of the FactSet data was/should be used. Otherwise they're just files from who knows where.
For further explanation: I'd prefer that the filenames of the FactSet data files have the timestamp integrated into them, but in lieu of that, the directory name from which they come is the only place where that is currently defined.
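Purely as an illustration of the two options (both the filename and the directory name below are hypothetical):

```yaml
# preferred (not yet the case): timestamp carried by the filenames themselves,
# e.g. factset_entity_info_2022-12-31T120000Z.rds
# current reality: timestamp only recoverable from the directory the files come from
factset_data_path: "/inputs/factset-2022-12-31T120000Z"
```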
That makes sense. I've opened RMI-PACTA/workflow.factset#16, which can make good use of the changes in this PR.
@AlexAxthelm @jdhoffa so I think this is ready to go now, with the caveat that once RMI-PACTA/workflow.factset#16 lands I'd prefer to adapt to start using the FactSet data filenames as the source of truth for the FactSet timestamp
I believe the documentation has been updated/reorganized as per Jackson's suggestion
Approved on my side.
Approved! Sorry, missed this somehow.
This was removed in #73 when the FactSet database pulling info was removed, but `ent_entity_affiliates_last_update` is still created above and its value/info is still worthwhile to have in the manifest.
closes #72
FactSet database is gone, so this simplifies things a bit.