Figure out the JSON file structure to export the data and make it "cloud optimized" #18

Open · shankari opened this issue Feb 19, 2021 · 27 comments

@shankari

We want to host this in OEDI, which will use AWS resources.
So they basically need an easy-to-access, machine-readable format.

We discussed two formats:

  • JSON
  • CSV

e-mission can already export data from MongoDB in JSON format.

However, our scripts currently don't load one day's worth of data at a time. Instead, we download the spec, and then use the spec to pull chunks of data across multiple days (that represent multiple repetitions of the timeline) into one notebook for comparison.

We need to:

  • design a JSON file format that is compatible with a spec-based access method
  • generate files in that format
  • change the scripts to load from the new file format instead of the old API methods
@shankari

For CSV, we would need to split out the entries for a particular day into multiple CSVs, one for each type of data.
We would also need to flatten the object structure.
Examples of flattening the object structure are here:
https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/builtin_timeseries.py#L269
which calls
https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/builtin_timeseries.py#L149

This is definitely lower priority than getting the JSON working end to end, but I wanted to document it for the record.
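
To make the flattening concrete, here is a minimal sketch using pandas json_normalize; the entry shape is a simplified stand-in, and the linked builtin_timeseries.py code does this with its own helper rather than pandas:

    import pandas as pd

    # a simplified stand-in for an e-mission timeseries entry
    entry = {
        "metadata": {"key": "background/location", "write_ts": 1564026015},
        "data": {"ts": 1564026010, "latitude": 37.39, "longitude": -122.08},
    }

    # json_normalize flattens the nested dicts into dotted column names
    # ("metadata.key", "data.latitude", ...), which can then be written as CSV
    df = pd.json_normalize([entry])
    df.to_csv("data_background~location.csv", index=False)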

@shankari

shankari commented Mar 9, 2021

One potential solution for the JSON: concatenate the input parameters for every retrieval call to form the filename that holds the data.

So if we retrieve data for:

  • manual/evaluation_transition
  • 1111111
  • 2222222

then we create a JSON file called data_manual-evaluation_transition_1111111_2222222

We would do this for all API calls that we currently execute.
We can then rename retrieve_data_from_server to read_data_from_file and replace the implementation's REST API call with a file read.

@shankari

shankari commented Mar 9, 2021

we can then move the retrieve_data_from_server code to a script that pulls data from a server and creates files

So two main changes:

  • script that pulls data from the server (since the data will be collected on a server and lives on a server now) and saves it to files
  • change all the notebooks to use a new method that loads data from the files

@shankari

shankari commented Mar 9, 2021

the script can be in bin - I think there may be a similar script that pulls data and stores it to a local mongodb. we would want to store to a file instead.

@singhish

singhish commented Mar 10, 2021

Design Document - File-based Data Retrieval

(bin/dump_data_to_file.py)

Input

Required flags:

  • --user: the user string with which to retrieve data. This could be either [email protected] or a device name, depending on the endpoint.
  • --endpoint: the endpoint to retrieve data from, e.g. manual/evaluation_transitions.

Optional flags:

  • --datastore-url: the location where E-Mission server is running. This defaults to a local instance running at http://localhost:8080, though this could also be a public URL.
  • --start-ts: the starting timestamp from which data is queried. The default value here is 0.
  • --end-ts: the ending timestamp up to which data is queried. The default value is the result of arrow.get().timestamp.

Output

The output is a file named data_{endpoint}_{start timestamp}_{end timestamp}.json that contains the JSON data returned from the server call.

E.g. if your input flags are --user [email protected] and --endpoint config/evaluation_spec, the output file will be named data_config~evaluation-spec_0_16593840.json.

Valid Endpoints

  • manual/evaluation_transition
  • background/battery
  • background/motion_activity
  • background/location
  • background/filtered_location
  • statemachine/transition

Design Choices

It is worth noting that the original data retrieval mechanism in spec_details.py makes use of an object to maintain information about the spec ID and the datastore URL. The spec ID is not needed to retrieve data from the server, so it is omitted from this script.
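
For concreteness, a minimal argparse sketch of this interface, with defaults taken from the doc above (the wiring here is an assumption, not the final implementation):

    import argparse
    import arrow

    parser = argparse.ArgumentParser(prog="dump_data_to_file.py")
    parser.add_argument("--user", type=str, required=True)
    parser.add_argument("--endpoint", type=str, required=True)
    parser.add_argument("--datastore-url", type=str, default="http://localhost:8080")
    parser.add_argument("--start-ts", type=float, default=0)
    # .timestamp is a property in the arrow versions current at the time
    # (it became a method, .timestamp(), in arrow >= 1.0)
    parser.add_argument("--end-ts", type=float, default=arrow.get().timestamp)
    args = parser.parse_args()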

@shankari

I would be a little careful about terminology. manual/evaluation_transitions is not really an endpoint in the classic REST API sense (we don't call GET /foo/bar/endpoint), it is a timeseries key that, along with the start and end timestamps, is a parameter to the POST command.

Also, as we discussed today, please have a --all argument that retrieves all the data associated with a spec in the way the current code does.

@singhish
Copy link

Spent some time today thinking through this -- is this updated design doc satisfactory? I realized having an --all flag complicates things; we can just control the behavior of the script depending on the parameters that are passed in:


Design Document - File-based Data Retrieval

(bin/dump_data_to_file.py)

Parameters

dump_data_to_file.py can accept optional inputs as command line parameters to control the data being dumped to JSON files. They are classified into independent parameters and co-dependent parameters, the latter of which must all be specified if one is specified.

Independent parameters:

  • --datastore-url: the location where E-Mission Server is running. This defaults to a local instance running at http://localhost:8080, though this could also be a public URL.
  • --spec-id: the spec ID to pull data for. If not specified, data for every spec ID associated with the E-Mission Server instance being used is retrieved.

Co-dependent parameters:

  • --key: the specific time series key to pull data from via E-Mission Server's REST API interface. If not specified, the script will pull data from every single time series key, as done during the initialization of a PhoneView object.
  • --user: the user string to pull data with. This parameter is dependent on the time series key, and consequently must be specified if --key is specified.
  • --start-ts: the starting timestamp from which data is queried.
  • --end-ts: the ending timestamp up to which data is queried.

Output

dump_data_to_file.py outputs JSON files with the following naming format:

data_{spec id}_{time series key}_{start timestamp}_{end timestamp}.json

The number of JSON files produced depends on the parameters that are passed in. Specifically, a JSON file is output for each combination of spec ID and time series key, in conjunction with the timestamp ranges that are explored at each step of the process. The algorithms to accomplish this are the same as those used in the SpecDetails and PhoneView classes.
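
A sketch of this naming scheme, assuming the "/" in a time series key is flattened with "~" as in the earlier filename example:

    # hypothetical helper illustrating the naming format above
    def output_filename(spec_id, key, start_ts, end_ts):
        # e.g. data_train_bus_ebike_mtv_ucb_background~location_0_16593840.json
        return "data_{}_{}_{}_{}.json".format(
            spec_id, key.replace("/", "~"), start_ts, end_ts)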

@shankari

I think this is enough to be going on with. Have you already started implementing in parallel?

@singhish
Copy link

Notes regarding the decision to parse command line arguments using the following pattern:

import argparse
import sys

parser = argparse.ArgumentParser()
# require all four co-dependent flags if any one of them is present
if any(arg in sys.argv for arg in ["--key", "--user", "--start-ts", "--end-ts"]):
    parser.add_argument("--key", type=str, required=True)
    parser.add_argument("--user", type=str, required=True)
    parser.add_argument("--start-ts", type=float, required=True)
    parser.add_argument("--end-ts", type=float, required=True)

argparse does provide support for subparsers, as discussed last week. However, the issue with subparsers is that an extra flag must typically be specified for them to work (as seen here: https://docs.python.org/3/library/argparse.html). They are better suited as switches that separate different pieces of functionality within a program, rather than as a means of specifying a subset of intended functionality, as is being done here. This is why I am choosing to implement the additional parameters as done above.

This approach also eliminates the need for an --all command, as not specifying any of the above four flags implies "--all" functionality.

@singhish

A few updates regarding the current implementation of the script:

  1. Added an object called Spec to handle the information returned by the initial config/evaluation_spec REST API call. This will probably evolve to some semblance of the original SpecDetails object as I wrap up the phone_view-related calls tomorrow. The important thing to note, however, is that this Spec object decouples API calls from the flow of data, as SpecDetails by design handled both.

  2. It is worth noting that E-Mission stores every version of a spec that is uploaded to it. Thus, to handle specs with the same spec_id, I am selecting the most recently uploaded spec based on the value of ["metadata"]["write_ts"] (see the sketch below).
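
A sketch of the de-duplication in (2); the field path for the spec id is an assumption, but ["metadata"]["write_ts"] is as described above:

    # keep only the most recently written version of each spec
    def select_latest_specs(spec_entries):
        latest = {}
        for entry in spec_entries:
            spec_id = entry["data"]["id"]  # hypothetical field path for the spec id
            prev = latest.get(spec_id)
            if prev is None or entry["metadata"]["write_ts"] > prev["metadata"]["write_ts"]:
                latest[spec_id] = entry
        return list(latest.values())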

@shankari

It is good to refactor the existing spec details into two very similar classes:

  1. retrieves data from server and stores to files
  2. retrieves data from files into memory for notebooks to read from

both of these will have very similar functions, including:

  • retrieving specs
  • retrieving transitions
  • converting transitions to ranges, etc.

However, I expect that the first script can stop at the evaluation_range level. The second script will go down to the eval_section_range level.

@singhish
Copy link

Refactored the script significantly today to accommodate unit testing. Much of this has involved reorganizing the script into functions that represent individual chunks of the pipeline parallel to the function calls occurring in PhoneView. Will keep at adding functionality to it tomorrow.

@shankari

My suggestion is to heavily re-use the existing phoneview class. Concretely, phoneview already fills in data from the spec details class. The spec details class has retrieval code and a bunch of other code to find and match ground truth.

So we can create an abstract class for SpecDetails which has one function retrieve_data

  • we can then have two subclasses for ServerSpecDetails and FileSpecDetails
  • in the download script, phoneview can be passed in a ServerSpecDetails
  • in the notebooks, phoneview can be passed in FileSpecDetails

In the script, you can then use PhoneView to retrieve all the data and then just dump the entries into files
Small aside: there is a rudimentary batching solution in PhoneView (_read_until_done) to work around psf/requests#4937, which should get moved into ServerSpecDetails as part of the refactor.

What am I missing here?
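
A sketch of the proposed hierarchy; the retrieve_data signature mirrors the existing retrieve_data_from_server arguments, and all bodies are elided:

    from abc import ABC, abstractmethod

    class SpecDetails(ABC):
        # shared spec/ground-truth logic lives here; only the data source differs
        @abstractmethod
        def retrieve_data(self, user, key_list, start_ts, end_ts):
            ...

    class ServerSpecDetails(SpecDetails):
        def retrieve_data(self, user, key_list, start_ts, end_ts):
            ...  # POST to the datastore REST API (with the batching workaround)

    class FileSpecDetails(SpecDetails):
        def retrieve_data(self, user, key_list, start_ts, end_ts):
            ...  # read the matching JSON dump from disk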

@shankari

shankari commented Mar 17, 2021

there are two ways to dump the files in the script:

  • you can have ServerSpecDetails dump the files at the end of every call. This is not terrible since we do not anticipate anything other than the script using ServerSpecDetails.
  • an alternative is to dump the data from the tree once it has been created in memory
    • I just double checked and we do store the raw entries in the tree
      • search for retrieve_data_from_server in phone view
      • they are all in fillXXX and
      • the entries are stored as XXX_entries

e.g. in fill_battery_df, the entries are stored as battery_entries

    def fill_battery_df(self, storage_key):
        for phoneOS, phone_map in self.phone_view_map.items():
            print("Processing data for %s phones" % phoneOS)
            for phone_label in phone_map:
                curr_calibration_ranges = phone_map[phone_label]["{}_ranges".format(storage_key)]
                for r in curr_calibration_ranges:
                    battery_entries = self.spec_details.retrieve_data_from_server(phone_label, ["background/battery"], r["start_ts"], r["end_ts"])
                    ...
                    r["battery_entries"] = battery_entries
                    ...

you should be able to walk the tree and just dump out the XXX_entries lists
This has the minor advantage that we leave open the option of using ServerSpecDetails in the notebook if we ever go back to making the data available through a server.

@shankari
Copy link
Contributor Author

shankari commented Mar 18, 2021

Concretely:

  • you would walk the tree in the way that we do in the Data Exploration Template.
  • As we can see from the code above, each evaluation_range or calibration_range will have XXX_entries (e.g. battery_entries, location_entries, etc.) as fields in them.
  • you already have the phone_os, phone_label, evaluation_range details (including id, role, start_ts and end_ts) available
  • store the entries into a JSON file

Is there a problem with this?

    "for phone_os, phone_map in pv.map().items():\n",
    "    print(15 * \"=*\")\n",
    "    print(phone_os, phone_map.keys())\n",
    "    for phone_label, phone_detail_map in phone_map.items():\n",
    "        print(4 * ' ', 15 * \"-*\")\n",
    "        print(4 * ' ', phone_label, phone_detail_map.keys())\n",
    "        # this spec does not have any calibration ranges, but evaluation ranges are actually cooler\n",
    "        for r in phone_detail_map[\"evaluation_ranges\"]:\n",
    "            print(8 * ' ', 30 * \"=\")\n",
    "            print(8 * ' ',r.keys())\n",
    "            print(8 * ' ',r[\"trip_id\"], r[\"eval_common_trip_id\"], r[\"eval_role\"], len(r[\"evaluation_trip_ranges\"]), arrow.get(r[\"start_ts\"]))\n",

Basically, add a save_json_file line just below the print statement.
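
save_json_file is shorthand here, not an existing helper; a minimal sketch, assuming the entries are already plain JSON-serializable dicts:

    import json

    # hypothetical helper; name and signature follow the suggestion above
    def save_json_file(filename, entries):
        with open(filename, "w") as fp:
            json.dump(entries, fp, indent=2)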

@shankari

shankari commented Mar 18, 2021

@singhish I also don't see "Small aside: there is a rudimentary batching solution" (from #18 (comment)) addressed in the current PR

@shankari

shankari commented Mar 19, 2021

at least for transitions_to_ranges, we want to test (see the skeleton after this list):

  • complete ranges
  • incomplete ranges
  • invalid ranges (e.g. START, START, END, END) should trip the assert
  • the fix in line 138 (e.g. START, START, END, END with identical write_ts) succeeds
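
A minimal unittest skeleton for these cases; the bodies are elided since they depend on the transitions_to_ranges signature in the analysis scripts:

    import unittest

    class TestTransitionsToRanges(unittest.TestCase):
        def test_complete_ranges(self):
            ...

        def test_incomplete_ranges(self):
            ...

        def test_invalid_ranges_raise(self):
            # e.g. START, START, END, END should trip the assert
            ...

        def test_identical_write_ts_fix(self):
            # e.g. START, START, END, END with identical write_ts succeeds
            ...

    if __name__ == "__main__":
        unittest.main()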

@shankari

shankari commented Mar 22, 2021

I volunteered to set up the scaffolding for the test harness that @singhish can use for his code.
The e-mission server code uses unittest and it has worked pretty well so far.

In the e-mission-server repo, my structure is:

emission
 - tests
 - .....

I don't remember why the tests are not in a separate top-level directory. I double checked the repo, and this has been true since the initial push
e-mission/e-mission-server@118e5f4

Let's keep the tests in a separate top-level directory this time, and use it as a template to go back and fix the old code.

@shankari

Doesn't look too hard - just need to set the PYTHONPATH correctly
https://stackoverflow.com/questions/1896918/running-unittest-with-typical-test-directory-structure

Couple of design thoughts:

  • python -m unittest will just work in python3, which seems pretty straightforward
  • there's a plug for pytest. Should we try it out or stick to unittest?

Going to stick with unittest for now

@singhish
Copy link

As mentioned over Teams -- note to work through the design decision regarding how to name the key folders that are dumped while traversing a PhoneView map.

@shankari

ok, can you repeat the challenges here for context? Not everybody will have access to the discussion over Teams.

@singhish
Copy link

Yes -- as things currently stand, the key directory that gets dumped during the PhoneView map traversal section of the script is a key of the form XXX_entries. The issue is that these keys get passed into FileSpecDetails's retrieve_data function, which PhoneView calls to gather the data that was originally fetched via the old retrieve_data_from_server function in the original SpecDetails class. However, the key_lists that were originally passed in corresponded to time series keys, which are passed as an argument in POST calls to the e-mission server. Thus, a design decision needs to be made regarding the naming of the key folders that are dumped.

@shankari

shankari commented Mar 23, 2021

the solution is fairly simple. You just need to map the phoneview keys to the REST API keys.

The obvious way to do this is to change the keys in the phone view so that the mapping can be determined programmatically - e.g. instead of location_entries, have the key be background_location_entries.

Then you can have the file name during the load/save be something like "/".join(pv_key.split("_")[:-1])

>>> "/".join("background_location_entries".split("_")[:-1])
'background/location'

An alternative is to have a map between the PhoneView entry names and the REST API keys - e.g. location_entries -> background/location - and have the PhoneView (through a function) return the correct key depending on the type of the SpecDetails. This is suboptimal since we would need to maintain that map.

A third alternative is indeed to dump the data in the retrieve_data_from_server call. This is also suboptimal because it is not modular - we retrieve and dump as part of the same call.

@shankari

The chosen solution was the first one, with the caveat that the keys would be listed as constants at the top of the file. @singhish can you confirm?

Note that you can also add the dump code into retrieve_data_from_server as a stopgap measure to ensure that the phoneview based dump actually does dump all the entries properly.

Please remove the stopgap before checking in.

@singhish

@shankari yes. Constants still need to be added, but the first solution has been implemented. The stopgap has now been removed.

@singhish

FileSpecDetails and PhoneView mostly work now, but I am running into problems with the read_until_done function -- after double checking that both evaluation_ranges and calibration_ranges were being dumped by the script, data for start_ts=1564026015 and end_ts=1564026017 for key=background/location and user=ucb-sdb-android-1 is showing up as nonexistent for train_bus_ebike_mtv_ucb. Will address this tomorrow in our meeting @shankari

@shankari

shankari commented Mar 25, 2021

This is because the current retrieve_data_from_server retrieves one batch of data. The server limits responses to 10,000 entries. So at least for the accuracy controls, we may need to pull multiple batches from the server.

This is not a consideration for the file-based access, since we dump all the entries (after batching) to one file, and the file read does not have/require a batching mechanism. Note that this is another advantage of the principled approach to dumping data: otherwise, we would have ended up with multiple batches of files dumped, which would have been more confusing to the user.

To fix this, you need to call read_until_done only for the ServerSpecDetails as outlined in MobilityNet/mobilitynet-analysis-scripts#54 (comment)

  • Concretely, you would rename the current retrieve_data of ServerSpecDetails to be retrieve_one_batch.
  • read_until_done would then become retrieve_data and would wrap retrieve_one_batch
  • You would remove all read_until_done calls from the phoneview (see the sketch below).
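
A sketch of the resulting structure; the 10,000-entry cap is from the comment above, and the batch-advance logic is an assumption about what read_until_done does today:

    class ServerSpecDetails:  # would subclass the SpecDetails ABC sketched earlier
        MAX_BATCH = 10000  # server-side cap on entries per response

        def retrieve_one_batch(self, user, key_list, start_ts, end_ts):
            ...  # the current retrieve_data body: a single POST to the REST API

        def retrieve_data(self, user, key_list, start_ts, end_ts):
            # the old read_until_done, moved out of PhoneView
            entries = []
            while True:
                batch = self.retrieve_one_batch(user, key_list, start_ts, end_ts)
                entries.extend(batch)
                if len(batch) < self.MAX_BATCH:
                    return entries
                # assumption: entries are sorted, so resume just past the last one
                start_ts = batch[-1]["metadata"]["write_ts"]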
