🗃️ Add an interface to support returning the cleaned section <-> inferred section mapping for a set of cleaned sections #970
One problem with using a range for this specific use case is that we get the list of section ids from the list of points that were within a polygon. This could span a wide date range (e.g. months) or a wide geographic range (the trajectories passed through a location but the start and end could be anywhere). An intermediate tradeoff could be that we still use the time range, but split it up so that we don't have to read too many sections at a time. Note that this is similar to the
we can then continue to use the timeseries-based data model but not have to make
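The intermediate tradeoff described above (keep the time range, but read it in bounded chunks) could look roughly like this sketch; `query_fn` and the default chunk size are assumptions for illustration, not the server's actual API:

```python
from datetime import datetime, timedelta

def query_in_chunks(query_fn, start, end, chunk=timedelta(days=7)):
    # Split [start, end) into bounded windows so that no single
    # timeseries read has to pull in months of sections at once.
    # `query_fn(window_start, window_end)` is a hypothetical stand-in
    # for the actual section query.
    results = []
    cur = start
    while cur < end:
        window_end = min(cur + chunk, end)
        results.extend(query_fn(cur, window_end))
        cur = window_end
    return results
```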
@MukuFlash03 here's the next issue for you to work on
I see that the current implementation with the loop requires a user_id and a section_id to be passed in. For the batch method, you can take in a list of user_ids and section_ids, or a list of {user_id, section_id} dictionaries. Essentially, you can go from one of those representations to the other either by doing a zip or a list comprehension that splits it out. Or take only section ids and just implement the performance optimization for now. I think we will have to tweak the interface a bit over time and polish it depending on new use cases that come in.
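The two batch representations mentioned above are interconvertible with a zip or a list comprehension; a minimal sketch (key names are illustrative):

```python
def to_pairs(user_ids, section_ids):
    # Parallel lists -> list of {user_id, section_id} dicts.
    return [{"user_id": u, "section_id": s}
            for u, s in zip(user_ids, section_ids)]

def to_lists(pairs):
    # List of dicts -> parallel lists.
    return ([p["user_id"] for p in pairs],
            [p["section_id"] for p in pairs])
```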
Since the initial code implementation was ready, I thought of first adding the required functionality before optimizing, and also worked on the tests. I saw that the functionality involved the keys
My doubt is whether this is the right file that I should be using for testing the section queries. I have this concern since the sample data format does not match the query being formed to fetch data in With respect to this data file,
Will work on the code implementation again for now, then move back to testing.
I also see that the code uses the analysis timeseries db to query for data, so I believe this is not the right way to test functionality involving the analysis timeseries db. The other functions in emission/storage/decorations/section_queries.py, like get_sections_for_trip() and get_sections_for_trip_list(), are tested by creating a new section and inserting it into the analysis timeseries db using So, I am now also considering this testing approach, but need to see how sensed_mode is to be set and accessed.
After setting up the example, you need to run the pipeline. That will create the analysis results.
The issue I am facing is that, after setting up the example and running the intake pipeline, the analysis timeseries data contains only analysis/cleaned_section keys with sections, but no analysis/inferred_section keys. I also think the function

I am unsure whether I first have to manually convert cleaned_section data to inferred_section. How would I first create my test data containing inferred_sections?

So, I have been trying to understand the entire data flow, from setting up example datasets to obtaining the analysis data in the appropriate timeseries dbs. I found the pipeline implementation in

However, comments here say that the intake pipeline for mode inference testing may not be correct. I do see that a sample dataset exists which progresses towards getting inferred_section keys, but I'm not sure how the inferred_section file was created. The 1st one is the raw data, while the 2nd one results from running the intake pipeline.
Also, I do see emission.run model pipeline for mode inference, but the code looks incomplete and is not used anywhere else. Still trying to understand how to generate inferred_sections.
We currently have two mode inference algorithms. One is based on a Random Forest trained on sensor data (speed, acceleration, ...); seed_model.json is the saved random forest model. The alternative is GIS-based: use the GIS-based testing branch, which may eventually become the master branch, with the current branch becoming the random-forest branch.
Fixed in e-mission/e-mission-server#937
This will allow us to have a generic interface for use by the dashboards while optimizing the implementation later.
This is currently needed for:
e-mission/op-admin-dashboard@6cdf8e6#diff-1c6b8e6d103286796ce21a8276c4a4d8b258e29d6b9cc6df516a92accf4674d1R201
The desired interface would be something like `cleaned2inferred_section_list`, similar to the current `cleaned2inferred_section`, but with a list passed in. The initial implementation could be the simple loop at:
e-mission/op-admin-dashboard@6cdf8e6#diff-1c6b8e6d103286796ce21a8276c4a4d8b258e29d6b9cc6df516a92accf4674d1R201-R206

A performance optimization would be the original implementation with:
e-mission/op-admin-dashboard@6cdf8e6#diff-1c6b8e6d103286796ce21a8276c4a4d8b258e29d6b9cc6df516a92accf4674d1L199-L202
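The initial, loop-based version might look like this sketch; `lookup_one` stands in for the existing single-section `cleaned2inferred_section` helper, whose real signature may differ:

```python
def cleaned2inferred_section_list(lookup_one, user_id, section_ids):
    # Initial, unoptimized implementation: one lookup per section id,
    # returning a {section_id: inferred mode} mapping.
    return {sid: lookup_one(user_id, sid) for sid in section_ids}
```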
However, given our data model, I would prefer an optimization in which we retrieve potentially matching inferred modes by time range or geo-range and then match them up in memory. In general, with the timeseries data model, we want to avoid using the linkages (the foreign keys) between collections, because they would not necessarily be searchable in a real timeseries database. They are more of a relational data model concept.
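The range-then-match idea could be sketched like this; the `cleaned_section` back-pointer field on the inferred-section entries is an assumption for illustration:

```python
def match_in_memory(cleaned_sections, inferred_sections):
    # Index the inferred sections fetched by time/geo range in one pass,
    # then do an in-memory join instead of per-section foreign-key queries.
    by_cleaned_id = {inf["cleaned_section"]: inf for inf in inferred_sections}
    return {c["_id"]: by_cleaned_id.get(c["_id"]) for c in cleaned_sections}
```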
If we did go with the timeseries approach, we could also close
e-mission/e-mission-server#934
@TTalex