Figure out the JSON file structure to export the data and make it "cloud optimized" #18

Open · shankari opened this issue Feb 19, 2021 · 27 comments

@shankari

We want to host this in OEDI, which will use AWS resources.
So they basically need an easy-to-access, machine-readable format.

We discussed two formats:

  • JSON
  • CSV

e-mission can already export data from MongoDB in JSON format.

However, our scripts currently don't load one day's worth of data at a time. Instead, we download the spec, and then use the spec to pull chunks of data across multiple days (that represent multiple repetitions of the timeline) into one notebook for comparison.

We need to:

  • design a JSON file format that is compatible with a spec-based access method
  • generate files in that format
  • change the scripts to load from the new file format instead of the old API methods
@shankari

For CSV, we would need to split out the entries for a particular day into multiple CSVs, one for each type of data.
We would also need to flatten the object structure.
Examples of flattening the object structure are here:
https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/builtin_timeseries.py#L269
which calls
https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/builtin_timeseries.py#L149

This is definitely lower priority than getting the JSON working end to end, but I wanted to document it for the record.
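
To make the flattening concrete, here is a minimal sketch using pandas json_normalize; the entry shape is a simplified stand-in, and the linked builtin_timeseries.py code does this with its own helper rather than pandas:

    import pandas as pd

    # a simplified stand-in for an e-mission timeseries entry
    entry = {
        "metadata": {"key": "background/location", "write_ts": 1564026015},
        "data": {"ts": 1564026010, "latitude": 37.39, "longitude": -122.08},
    }

    # json_normalize flattens the nested dicts into dotted column names
    # ("metadata.key", "data.latitude", ...), which can then be written as CSV
    df = pd.json_normalize([entry])
    df.to_csv("data_background~location.csv", index=False)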

@shankari

shankari commented Mar 9, 2021

One potential solution for the JSON: concatenate the input parameters for every retrieval call to form the filename that holds the data.

So if we retrieve data for:

  • manual/evaluation_transition
  • 1111111
  • 2222222

then we create a JSON file called data_manual-evaluation_transition_1111111_2222222

We would do this for all API calls that we currently execute.
We can then rename retrieve_data_from_server to read_data_from_file and replace the implementation's REST API call with a file read.

@shankari

shankari commented Mar 9, 2021

we can then move the retrieve_data_from_server code to a script that pulls data from a server and creates files

So two main changes:

  • script that pulls data from the server (since the data will be collected on a server and lives on a server now) and saves it to files
  • change all the notebooks to use a new method that loads data from the files

@shankari

shankari commented Mar 9, 2021

the script can be in bin - I think there may be a similar script that pulls data and stores it to a local mongodb. we would want to store to a file instead.

@singhish

singhish commented Mar 10, 2021

Design Document - File-based Data Retrieval

(bin/dump_data_to_file.py)

Input

Required flags:

  • --user: the user string with which to retrieve data. This could be either [email protected] or a device name, depending on the endpoint.
  • --endpoint: the endpoint to retrieve data from, e.g. manual/evaluation_transitions.

Optional flags:

  • --datastore-url: the location where E-Mission server is running. This defaults to a local instance running at http://localhost:8080, though this could also be a public URL.
  • --start-ts: the starting timestamp from which data is queried. The default value here is 0.
  • --end-ts: the ending timestamp up to which data is queried. The default value is the result of arrow.get().timestamp.

Output

The output is a file named data_{endpoint}_{start timestamp}_{end timestamp}.json that contains the JSON data returned from the server call.

E.g. if your input flags are --user [email protected] and --endpoint config/evaluation_spec, the output file will be named data_config~evaluation-spec_0_16593840.json.

Valid Endpoints

  • manual/evaluation_transition
  • background/battery
  • background/motion_activity
  • background/location
  • background/filtered_location
  • statemachine/transition

Design Choices

It is worth noting that the original data retrieval mechanism in spec_details.py makes use of an object to maintain information about the spec ID and the datastore URL. The spec ID is not needed to retrieve data from the server, so it is omitted from this script.
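
For concreteness, a minimal argparse sketch of this interface, with defaults taken from the doc above (the wiring here is an assumption, not the final implementation):

    import argparse
    import arrow

    parser = argparse.ArgumentParser(prog="dump_data_to_file.py")
    parser.add_argument("--user", type=str, required=True)
    parser.add_argument("--endpoint", type=str, required=True)
    parser.add_argument("--datastore-url", type=str, default="http://localhost:8080")
    parser.add_argument("--start-ts", type=float, default=0)
    # .timestamp is a property in the arrow versions current at the time
    # (it became a method, .timestamp(), in arrow >= 1.0)
    parser.add_argument("--end-ts", type=float, default=arrow.get().timestamp)
    args = parser.parse_args()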

@shankari

I would be a little careful about terminology. manual/evaluation_transitions is not really an endpoint in the classic REST API sense (we don't call GET /foo/bar/endpoint), it is a timeseries key that, along with the start and end timestamps, is a parameter to the POST command.

Also, as we discussed today, please have a --all argument that retrieves all the data associated with a spec in the way the current code does.

@singhish
Copy link

Spent some time today thinking through this -- is this updated design doc satisfactory? I realized having an --all flag complicates things; we can just control the behavior of the script depending on the parameters that are passed in:


Design Document - File-based Data Retrieval

(bin/dump_data_to_file.py)

Parameters

dump_data_to_file.py can accept optional inputs as command line parameters to control the data being dumped to JSON files. They are classified into independent parameters and co-dependent parameters, the latter of which must all be specified if one is specified.

Independent parameters:

  • --datastore-url: the location where E-Mission Server is running. This defaults to a local instance running at http://localhost:8080, though this could also be a public URL.
  • --spec-id: the spec ID to pull data for. If not specified, data for every spec ID associated with the E-Mission Server instance being used is retrieved.

Co-dependent parameters:

  • --key: the specific time series key to pull data from via E-Mission Server's REST API interface. If not specified, the script will pull data from every single time series key, as done during the initialization of a PhoneView object.
  • --user: the user string to pull data with. This parameter is dependent on the time series key, and consequently must be specified if --key is specified.
  • --start-ts: the starting timestamp from which data is queried.
  • --end-ts: the ending timestamp up to which data is queried.

Output

dump_data_to_file.py outputs JSON files with the following naming format:

data_{spec id}_{time series key}_{start timestamp}_{end timestamp}.json

The number of JSON files produced depends on the parameters that are passed in. Specifically, a JSON file is output for each combination of spec ID and time series key, in conjunction with the timestamp ranges that are explored at each step of the process. The algorithms to accomplish this are the same as those used in the SpecDetails and PhoneView classes.
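
A sketch of this naming scheme, assuming the "/" in a time series key is flattened with "~" as in the earlier filename example:

    # hypothetical helper illustrating the naming format above
    def output_filename(spec_id, key, start_ts, end_ts):
        # e.g. data_train_bus_ebike_mtv_ucb_background~location_0_16593840.json
        return "data_{}_{}_{}_{}.json".format(
            spec_id, key.replace("/", "~"), start_ts, end_ts)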

@shankari

I think this is enough to be going on with. Have you already started implementing in parallel?

@singhish
Copy link

Notes regarding the decision to parse command line arguments using the following pattern:

import argparse
import sys

parser = argparse.ArgumentParser()
# require all four co-dependent flags if any one of them is present
if any(arg in sys.argv for arg in ["--key", "--user", "--start-ts", "--end-ts"]):
    parser.add_argument("--key", type=str, required=True)
    parser.add_argument("--user", type=str, required=True)
    parser.add_argument("--start-ts", type=float, required=True)
    parser.add_argument("--end-ts", type=float, required=True)

argparse does provide support for subparsers, as discussed last week. However, the issue with subparsers is that an extra flag must typically be specified for them to work (as seen here: https://docs.python.org/3/library/argparse.html). They are better suited as switches that separate different pieces of functionality within a program, rather than as a means of specifying a subset of intended functionality, as is being done here. This is why I am choosing to implement the additional parameters as done above.

This approach also eliminates the need for an --all command, as not specifying any of the above four flags implies "--all" functionality.

@singhish

A few updates regarding the current implementation of the script:

  1. Added an object called Spec to handle the information returned by the initial config/evaluation_spec REST API call. This will probably evolve to some semblance of the original SpecDetails object as I wrap up the phone_view-related calls tomorrow. The important thing to note, however, is that this Spec object decouples API calls from the flow of data, as SpecDetails by design handled both.

  2. It is worth noting that E-Mission stores every version of a spec that is uploaded to it. Thus, to handle specs with the same spec_id, I am selecting the most recently uploaded spec based on the value of ["metadata"]["write_ts"] (see the sketch below).
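
A sketch of the de-duplication in (2); the field path for the spec id is an assumption, but ["metadata"]["write_ts"] is as described above:

    # keep only the most recently written version of each spec
    def select_latest_specs(spec_entries):
        latest = {}
        for entry in spec_entries:
            spec_id = entry["data"]["id"]  # hypothetical field path for the spec id
            prev = latest.get(spec_id)
            if prev is None or entry["metadata"]["write_ts"] > prev["metadata"]["write_ts"]:
                latest[spec_id] = entry
        return list(latest.values())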

@shankari

It is good to refactor the existing spec details into two very similar classes:

  1. retrieves data from server and stores to files
  2. retrieves data from files into memory for notebooks to read from

both of these will have very similar functions, including:

  • retrieving specs
  • retrieving transitions
  • converting transitions to ranges, etc.

However, I expect that the first script can stop at the evaluation_range level. The second script will go down to the eval_section_range level.

@singhish
Copy link

Refactored the script significantly today to accommodate unit testing. Much of this has involved reorganizing the script into functions that represent individual chunks of the pipeline parallel to the function calls occurring in PhoneView. Will keep at adding functionality to it tomorrow.

@shankari

My suggestion is to heavily re-use the existing phoneview class. Concretely, phoneview already fills in data from the spec details class. The spec details class has retrieval code and a bunch of other code to find and match ground truth.

So we can create an abstract class for SpecDetails which has one function retrieve_data

  • we can then have two subclasses for ServerSpecDetails and FileSpecDetails
  • in the download script, phoneview can be passed in a ServerSpecDetails
  • in the notebooks, phoneview can be passed in FileSpecDetails

In the script, you can then use PhoneView to retrieve all the data and then just dump the entries into files
Small aside: there is a rudimentary batching solution in PhoneView (_read_until_done) to work around psf/requests#4937, which should get moved into ServerSpecDetails as part of the refactor.

What am I missing here?
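
A sketch of the proposed hierarchy; the retrieve_data signature mirrors the existing retrieve_data_from_server arguments, and all bodies are elided:

    from abc import ABC, abstractmethod

    class SpecDetails(ABC):
        # shared spec/ground-truth logic lives here; only the data source differs
        @abstractmethod
        def retrieve_data(self, user, key_list, start_ts, end_ts):
            ...

    class ServerSpecDetails(SpecDetails):
        def retrieve_data(self, user, key_list, start_ts, end_ts):
            ...  # POST to the datastore REST API (with the batching workaround)

    class FileSpecDetails(SpecDetails):
        def retrieve_data(self, user, key_list, start_ts, end_ts):
            ...  # read the matching JSON dump from disk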

@shankari

shankari commented Mar 17, 2021

there are two ways to dump the files in the script:

  • you can have ServerSpecDetails dump the files at the end of every call. This is not terrible since we do not anticipate anything other than the script using ServerSpecDetails.
  • an alternative is to dump the data from the tree once it has been created in memory
    • I just double checked and we do store the raw entries in the tree
      • search for retrieve_data_from_server in phone view
      • they are all in fillXXX and
      • the entries are stored as XXX_entries

e.g. in fill_battery_df, the entries are stored as battery_entries

    def fill_battery_df(self, storage_key):
        for phoneOS, phone_map in self.phone_view_map.items():
            print("Processing data for %s phones" % phoneOS)
            for phone_label in phone_map:
                curr_calibration_ranges = phone_map[phone_label]["{}_ranges".format(storage_key)]
                for r in curr_calibration_ranges:
                    battery_entries = self.spec_details.retrieve_data_from_server(phone_label, ["background/battery"], r["start_ts"], r["end_ts"])
                    ...
                    r["battery_entries"] = battery_entries
                    ...

you should be able to walk the tree and just dump out the XXX_entries lists
This has the minor advantage that we leave open the option of using ServerSpecDetails in the notebook if we ever go back to making the data available through a server.

@shankari
Copy link
Contributor Author

shankari commented Mar 18, 2021

Concretely:

  • you would walk the tree in the way that we do in the Data Exploration Template.
  • As we can see from the code above, each evaluation_range or calibration_range will have XXX_entries (e.g. battery_entries, location_entries, etc.) as fields in them.
  • you already have the phone_os, phone_label, evaluation_range details (including id, role, start_ts and end_ts) available
  • store the entries into a JSON file

Is there a problem with this?

    "for phone_os, phone_map in pv.map().items():\n",
    "    print(15 * \"=*\")\n",
    "    print(phone_os, phone_map.keys())\n",
    "    for phone_label, phone_detail_map in phone_map.items():\n",
    "        print(4 * ' ', 15 * \"-*\")\n",
    "        print(4 * ' ', phone_label, phone_detail_map.keys())\n",
    "        # this spec does not have any calibration ranges, but evaluation ranges are actually cooler\n",
    "        for r in phone_detail_map[\"evaluation_ranges\"]:\n",
    "            print(8 * ' ', 30 * \"=\")\n",
    "            print(8 * ' ',r.keys())\n",
    "            print(8 * ' ',r[\"trip_id\"], r[\"eval_common_trip_id\"], r[\"eval_role\"], len(r[\"evaluation_trip_ranges\"]), arrow.get(r[\"start_ts\"]))\n",

Basically, add a save_json_file line just below the print statement.
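
save_json_file is shorthand here, not an existing helper; a minimal sketch, assuming the entries are already plain JSON-serializable dicts:

    import json

    # hypothetical helper; name and signature follow the suggestion above
    def save_json_file(filename, entries):
        with open(filename, "w") as fp:
            json.dump(entries, fp, indent=2)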

@shankari

shankari commented Mar 18, 2021

@singhish I also don't see "Small aside: there is a rudimentary batching solution" (from #18 (comment)) addressed in the current PR

@shankari

shankari commented Mar 19, 2021

at least for transitions_to_ranges, we want to test (see the skeleton after this list):

  • complete ranges
  • incomplete ranges
  • invalid ranges (e.g. START, START, END, END) should trip the assert
  • the fix in line 138 (e.g. START, START, END, END with identical write_ts) succeeds
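
A minimal unittest skeleton for these cases; the bodies are elided since they depend on the transitions_to_ranges signature in the analysis scripts:

    import unittest

    class TestTransitionsToRanges(unittest.TestCase):
        def test_complete_ranges(self):
            ...

        def test_incomplete_ranges(self):
            ...

        def test_invalid_ranges_raise(self):
            # e.g. START, START, END, END should trip the assert
            ...

        def test_identical_write_ts_fix(self):
            # e.g. START, START, END, END with identical write_ts succeeds
            ...

    if __name__ == "__main__":
        unittest.main()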

@shankari

shankari commented Mar 22, 2021

I volunteered to set up the scaffolding for the test harness that @singhish can use for his code.
The e-mission server code uses unittest and it has worked pretty well so far.

In the e-mission-server repo, my structure is:

emission
 - tests
 - .....

I don't remember why the tests are not in a separate top-level directory. I double checked the repo, and this has been true since the initial push
e-mission/e-mission-server@118e5f4

Let's keep the tests in a separate top-level directory this time, and use it as a template to go back and fix the old code.

@shankari

Doesn't look too hard - just need to set the PYTHONPATH correctly
https://stackoverflow.com/questions/1896918/running-unittest-with-typical-test-directory-structure

Couple of design thoughts:

  • python -m unittest will just work in python3, which seems pretty straightforward
  • there's a plug for pytest. Should we try it out or stick to unittest?

Going to stick with unittest for now

@singhish
Copy link

As mentioned over Teams -- note to work through the design decision regarding how to name the key folders that are dumped while traversing a PhoneView map.

@shankari

ok, can you repeat the challenges here for context? Not everybody will have access to the discussion over Teams.

@singhish
Copy link

Yes -- as things currently stand, the key directory that gets dumped during the PhoneView map traversal section of the script is a key of the form XXX_entries. The issue is that these keys get passed into FileSpecDetails's retrieve_data function, which PhoneView calls to gather the data that was originally fetched via the old retrieve_data_from_server function in the original SpecDetails class. However, the key_lists that were originally passed in corresponded to time series keys, which are passed as an argument in POST calls to the e-mission server. Thus, a design decision needs to be made regarding the naming of the key folders that are dumped.

@shankari

shankari commented Mar 23, 2021

the solution is fairly simple. You just need to map the phoneview keys to the REST API keys.

The obvious way to do this is to change the keys in the phone view so that the mapping can be determined programmatically - e.g. instead of location_entries, have the key be background_location_entries.

Then you can have the file name during the load/save be something like "/".join(pv_key.split("_")[:-1])

>>> "/".join("background_location_entries".split("_")[:-1])
'background/location'

An alternative is to have a map between the PhoneView entry names and the REST API keys - e.g. location_entries -> background/location - and have the PhoneView (through a function) return the correct key depending on the type of the SpecDetails. This is suboptimal since we would need to maintain that map.

A third alternative is indeed to dump the data in the retrieve_data_from_server call. This is also suboptimal because it is not modular - we retrieve and dump as part of the same call.

@shankari

The chosen solution was the first one, with the caveat that the keys would be listed as constants at the top of the file. @singhish can you confirm?

Note that you can also add the dump code into retrieve_data_from_server as a stopgap measure to ensure that the phoneview based dump actually does dump all the entries properly.

Please remove the stopgap before checking in.

@singhish

@shankari yes. Constants still need to be added, but the first solution has been implemented. The stopgap has now been removed.

@singhish

FileSpecDetails and PhoneView mostly work now, but I am running into problems with the read_until_done function -- after double checking that both evaluation_ranges and calibration_ranges were being dumped by the script, data for start_ts=1564026015 and end_ts=1564026017 for key=background/location and user=ucb-sdb-android-1 is showing up as nonexistent for train_bus_ebike_mtv_ucb. Will address this tomorrow in our meeting @shankari

@shankari

shankari commented Mar 25, 2021

This is because the current retrieve_data_from_server retrieves one batch of data. The server limits responses to 10,000 entries. So at least for the accuracy controls, we may need to pull multiple batches from the server.

This is not a consideration for the file-based access, since we dump all the entries (after batching) to one file, and the file read does not have/require a batching mechanism. Note that this is another advantage of the principled approach to dumping data: otherwise, we would have ended up with multiple batches of files dumped, which would have been more confusing to the user.

To fix this, you need to call read_until_done only for the ServerSpecDetails as outlined in MobilityNet/mobilitynet-analysis-scripts#54 (comment)

  • Concretely, you would rename the current retrieve_data of ServerSpecDetails to be retrieve_one_batch.
  • read_until_done would then become retrieve_data and would wrap retrieve_one_batch
  • You would remove all read_until_done calls from the phoneview (see the sketch below).
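
A sketch of the resulting structure; the 10,000-entry cap is from the comment above, and the batch-advance logic is an assumption about what read_until_done does today:

    class ServerSpecDetails:  # would subclass the SpecDetails ABC sketched earlier
        MAX_BATCH = 10000  # server-side cap on entries per response

        def retrieve_one_batch(self, user, key_list, start_ts, end_ts):
            ...  # the current retrieve_data body: a single POST to the REST API

        def retrieve_data(self, user, key_list, start_ts, end_ts):
            # the old read_until_done, moved out of PhoneView
            entries = []
            while True:
                batch = self.retrieve_one_batch(user, key_list, start_ts, end_ts)
                entries.extend(batch)
                if len(batch) < self.MAX_BATCH:
                    return entries
                # assumption: entries are sorted, so resume just past the last one
                start_ts = batch[-1]["metadata"]["write_ts"]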
