Figure out the JSON file structure to export the data and make it "cloud optimized" #18
Comments
For CSV, we would need to split out the entries for a particular day into multiple CSVs, one for each type of data. This is definitely lower priority than getting the JSON working end to end, but I wanted to document it for the record. |
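A rough sketch of that per-type CSV split, assuming pandas and assuming each entry carries its type under `metadata.key` (both are illustrative assumptions, not confirmed project conventions):

```python
import pandas as pd

def dump_csvs(entries, out_dir):
    # flatten the JSON entries and write one CSV per data type
    df = pd.json_normalize(entries)
    for key, group in df.groupby("metadata.key"):
        safe_key = key.replace("/", "~")  # '/' is not filename-safe
        group.to_csv(f"{out_dir}/{safe_key}.csv", index=False)
```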
one potential solution for the JSON is: concatenate the input parameters for every retrieval call to form the filename that holds the data. So if we retrieve data for a given set of parameters,
then we create a JSON file whose name encodes those parameters, and we go through and do this for all API calls that we currently execute |
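A minimal sketch of that naming scheme; the parameter names, separator, and extension here are assumptions for illustration:

```python
def make_filename(key, user, start_ts, end_ts):
    # concatenate the retrieval parameters into a unique filename;
    # '/' appears in keys like 'background/location', so swap it out
    safe_key = key.replace("/", "~")
    return f"{safe_key}_{user}_{start_ts}_{end_ts}.json"

# e.g. make_filename("background/location", "test_user", 1564026600, 1564035600)
# -> "background~location_test_user_1564026600_1564035600.json"
```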
we can then move the …. So, two main changes:
|
the script can be in … |
Design Document - File-based Data Retrieval
|
I would be a little careful about terminology. Also, as we discussed today, please have a … |
Spent some time today thinking through this -- is this updated design doc satisfactory? I realized having an … Design Document - File-based Data Retrieval
|
I think this is enough to be going on with. Have you already started implementing in parallel? |
Notes regarding the decision to parse command-line arguments using the following pattern:

```python
import argparse
import sys

parser = argparse.ArgumentParser()
# only mark these arguments as required when at least one of them is supplied
if any(arg in sys.argv for arg in ["--key", "--user", "--start-ts", "--end-ts"]):
    parser.add_argument("--key", type=str, required=True)
    parser.add_argument("--user", type=str, required=True)
    parser.add_argument("--start-ts", type=float, required=True)
    parser.add_argument("--end-ts", type=float, required=True)
```
This approach also eliminates the need for an … |
A few updates regarding the current implementation of the script:
|
It would be good to refactor the existing spec details code into two very similar classes:
Both of these will have very similar functions, including:
However, I expect that the first script can stop at the … |
Refactored the script significantly today to accommodate unit testing. Much of this has involved reorganizing the script into functions that represent individual chunks of the pipeline, parallel to the function calls occurring in … |
My suggestion is to heavily re-use the existing phoneview class. Concretely, phoneview already fills in data from the spec details class. The spec details class has retrieval code and a bunch of other code to find and match ground truth. So we can create an abstract class for SpecDetails which has one function for retrieval.
In the script, you can then use PhoneView to retrieve all the data and then just dump the entries into files. What am I missing here? |
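A sketch of what that abstraction could look like; the subclass and method names here are illustrative assumptions, not the project's confirmed API:

```python
from abc import ABC, abstractmethod

class SpecDetails(ABC):
    """Spec parsing and ground-truth matching stay in this base class."""

    @abstractmethod
    def retrieve_data(self, user, key_list, start_ts, end_ts):
        """Return matching entries; subclasses decide where they come from."""

class ServerSpecDetails(SpecDetails):
    def retrieve_data(self, user, key_list, start_ts, end_ts):
        ...  # call the REST API, as the existing code does

class FileSpecDetails(SpecDetails):
    def retrieve_data(self, user, key_list, start_ts, end_ts):
        ...  # read previously dumped JSON files instead
```

PhoneView then works against the base class and never needs to know which concrete source it was handed.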
there are two ways to dump the files in the script:
e.g. in …
you should be able to walk the tree and just dump out the … |
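For the tree-walk option, a minimal sketch, assuming the PhoneView map is nested as OS -> phone label -> key -> entry list (the nesting and helper name are assumptions):

```python
import json
import os

def dump_phone_view(pv_map, out_dir):
    for phone_os, phones in pv_map.items():
        for phone_label, detail_map in phones.items():
            for key, entries in detail_map.items():
                if not isinstance(entries, list):
                    continue  # skip scalar metadata fields
                safe_key = key.replace("/", "~")
                out_file = os.path.join(out_dir, f"{phone_os}_{phone_label}_{safe_key}.json")
                with open(out_file, "w") as fd:
                    # default=str stringifies non-JSON-safe values (e.g. ObjectIds)
                    json.dump(entries, fd, indent=2, default=str)
```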
Concretely:
Is there a problem with this?
Basically, add a … |
@singhish I also don't see "Small aside: there is a rudimentary batching solution" (from #18 (comment)) addressed in the current PR |
at least for …
|
I volunteered to set up the scaffolding for the test harness that @singhish can use for his code. In the e-mission-server repo, my structure is:
I don't remember why the tests are not in a separate top-level directory. I double-checked the repo, and this has been true since the initial push. Let's keep the tests in a separate top-level directory this time, and use it as a template to go back and fix the old code. |
Doesn't look too hard - just need to set the …. A couple of design thoughts:
Going to stick with unittest for now |
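A minimal unittest skeleton for that harness; the class name, test name, and canned fixture are placeholders, not the actual test code:

```python
import json
import tempfile
import unittest

class TestFileBasedRetrieval(unittest.TestCase):
    def setUp(self):
        # use canned entries instead of hitting a live server
        self.entries = [{"metadata": {"key": "background/location"}, "data": {}}]

    def test_dump_and_reload_round_trip(self):
        # dump the entries to a file, read them back, and compare
        with tempfile.NamedTemporaryFile("w+", suffix=".json") as fd:
            json.dump(self.entries, fd)
            fd.seek(0)
            self.assertEqual(json.load(fd), self.entries)

if __name__ == "__main__":
    unittest.main()
```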
As mentioned over Teams -- note to work through the design decision regarding how to name the … |
ok, can you repeat the challenges here for context? Not everybody will have access to the discussion over Teams. |
Yes -- so as things currently stand, the … |
the solution is fairly simple. You just need to map the phoneview keys to the REST API keys. The obvious way to do this is to change the keys in the phone view so that the mapping can be determined programmatically - e.g. instead of …. Then you can have the file name during the load/save be something like …
An alternative is to have a map between the PhoneView entry names and the REST API keys - e.g. …. A third alternative is indeed to dump the data in the … |
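A sketch of that second alternative, an explicit lookup table; the specific key strings below are illustrative guesses, not the project's actual names:

```python
# hypothetical mapping from PhoneView entry names to REST API keys
PHONE_VIEW_TO_API_KEY = {
    "location_entries": "background/location",
    "motion_activity_entries": "background/motion_activity",
    "battery_entries": "background/battery",
}

def api_key_for(pv_key):
    return PHONE_VIEW_TO_API_KEY[pv_key]
```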
The chosen solution was the first one, with the caveat that the keys would be listed as constants at the top of the file. @singhish can you confirm? Note that you can also add the dump code into …. Please remove the stopgap before checking in. |
@shankari yes. Constants still need to be added, but the first solution has been implemented. The stopgap has now been removed. |
|
This is because the current …. This is not a consideration for the file case, since we dump all the entries (after batching) to one file, and the file read does not have/require a batching mechanism. Note that this is another advantage of using the principled approach to dumping data; otherwise we would have ended up with multiple batches of files dumped, which would have been more confusing to the user. To fix this, you need to call …
|
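For context, the "accumulate all batches, dump once" shape might look like the following; the batch size, entry layout, and method name are assumptions carried over from the sketches above:

```python
def retrieve_all(spec_details, user, key, start_ts, end_ts, batch_size=10000):
    all_entries = []
    curr_start = start_ts
    while True:
        batch = spec_details.retrieve_data(user, [key], curr_start, end_ts)
        all_entries.extend(batch)
        if len(batch) < batch_size:
            break  # last (possibly partial) batch
        # resume just past the last entry we already have
        curr_start = batch[-1]["metadata"]["write_ts"]
    return all_entries  # dump this list to a single file afterwards
```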
We want to host this in OEDI, which will use AWS resources.
So they basically need an easy-to-access, machine-readable format.
We discussed two formats:
e-mission can already export data from mongodb in JSON format
However, our scripts currently don't load one day's worth of data at a time. Instead, we download the spec, and then use the spec to pull chunks of data across multiple days (that represent multiple repetitions of the timeline) into one notebook for comparison.
We need to: …