Feat/Segment trip time page #61
Conversation
@TTalex I think that this is a great feature, and super well documented! We just deployed a release, but are planning a new one in a couple of weeks for the API upgrade, and it would be great to include this as well. I'll review and merge this weekend.
FYI, map matching has been postponed in favor of helping finish up "count every trip".
Since `$near` doesn't seem to work with it (e-mission/op-admin-dashboard#61 (comment)):

```
pymongo.errors.OperationFailure: $geoNear, $near, and $nearSphere are not allowed in this context, full error: {'ok': 0.0, 'errmsg': '$geoNear, $near, and $nearSphere are not allowed in this context', 'code': 2, 'codeName': 'BadValue'}
```

`$geoWithin` seems to be fine. Test passes.
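For reference, a rough sketch of what a `$geoWithin` query against the analysis timeseries can look like; the polygon coordinates and the `data.loc` field name are illustrative assumptions, not the PR's exact code:

```python
import emission.core.get_database as edb

# Illustrative polygon (GeoJSON ring, lon/lat order); real zones come from the
# shapes the user draws on the map.
start_zone = {
    "type": "Polygon",
    "coordinates": [[[2.350, 48.850], [2.360, 48.850], [2.360, 48.860],
                     [2.350, 48.860], [2.350, 48.850]]],
}
query = {
    "metadata.key": "analysis/recreated_location",
    "data.loc": {"$geoWithin": {"$geometry": start_zone}},
}
points_in_start_zone = list(edb.get_analysis_timeseries_db().find(query))
```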
…and end zones

Rewrote data fetching to use the geoquery sdk.
Quoting myself from the first message of this PR:
This could be solved by having a precomputed `distance_from_start` value on each recreated location. I believe it could be implemented in `add_dist_heading_speed`:

```diff
+ import itertools

# [...]

def add_dist_heading_speed(points_df):
    # type: (pandas.DataFrame) -> pandas.DataFrame
    """
    Returns a new dataframe with an added "speed" column.
    The speed column has the speed between each point and its previous point.
    The first row has a speed of zero.
    """
    point_list = [ad.AttrDict(row) for row in points_df.to_dict('records')]
    zipped_points_list = list(zip(point_list, point_list[1:]))
    distances = [pf.calDistance(p1, p2) for (p1, p2) in zipped_points_list]
    distances.insert(0, 0)
+   distances_from_start = list(itertools.accumulate(distances))
    speeds = [pf.calSpeed(p1, p2) for (p1, p2) in zipped_points_list]
    speeds.insert(0, 0)
    headings = [pf.calHeading(p1, p2) for (p1, p2) in zipped_points_list]
    headings.insert(0, 0)
    with_distances_df = pd.concat([points_df, pd.Series(distances, name="distance")], axis=1)
+   with_distances_from_start_df = pd.concat([with_distances_df, pd.Series(distances_from_start, name="distance_from_start")], axis=1)
-   with_speeds_df = pd.concat([with_distances_df, pd.Series(speeds, name="speed")], axis=1)
+   with_speeds_df = pd.concat([with_distances_from_start_df, pd.Series(speeds, name="speed")], axis=1)
    if "heading" in with_speeds_df.columns:
        with_speeds_df.drop("heading", axis=1, inplace=True)
    with_headings_df = pd.concat([with_speeds_df, pd.Series(headings, name="heading")], axis=1)
    return with_headings_df
```

Recreated locations would then look like the following examples:

Computing distance from the second point (idx 1) to the last one (idx 3) would only require both points, skipping fetching idx 2:

Maybe it's a bit too specific for this use case to induce a change to the Location model (and might require a patch on existing database entries for consistency 😕).
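To illustrate the point about skipping intermediate fetches, here is a tiny sketch with made-up cumulative distances (not real entries):

```python
import pandas as pd

# Made-up distance_from_start values for four recreated locations; with the
# cumulative distance stored, the distance between idx 1 and idx 3 is just a
# difference, with no need to fetch idx 2.
points_df = pd.DataFrame({"distance_from_start": [0.0, 120.0, 250.0, 400.0]})
dist_1_to_3 = (points_df.loc[3, "distance_from_start"]
               - points_df.loc[1, "distance_from_start"])
print(dist_1_to_3)  # 280.0
```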
@TTalex we have created an interface for the … We can then change the implementation at will depending on the scalability vs. reuse tradeoff.
This is an interesting thought. Adding new entries to the data model and patching existing entries is work but fairly straightforward conceptually. If we can come up with a second use case that needs this functionality (maybe map matching), I am happy to include it. Not sure if we want to do a one-off change before that though...
Sweet, thanks, I've made the swap.
I wouldn't change it either if I were you; that's why I didn't bother doing a PR :) There might be some use cases in end-user UIs for it, where the change would bring slight performance improvements. For example, with point-by-point visualization such as this one: But I'm not confident it would be an improvement at all, since the full list of points has to be loaded anyway.
While testing against my own data, I ran into the following error:

Looking at the logs, we have:

so it must be one of these sections.

Bingo!

This is almost certainly due to … My current guess is that it might be due to multiple calls to overpass failing, but in that previous issue, I see:
It's really weird that there are still exactly 807 matching entries. Maybe we can spend a little time today to investigate (at least on the side).
```python
times = pd.to_datetime(df['start_fmt_time'], errors='coerce', utc=True)
duration_per_hour = format_duration_df(
    df.groupby(times.dt.hour).agg({'duration': 'median', 'section': 'count'}),
    time_column_name='Hour',
)
```
I just tried to use this myself, and having the hours in UTC is very annoying. We already have the split-out components in local time in the `data.start_local_dt` field.

From `emission/core/wrapper/section.py`:

```python
"start_local_dt": ecwb.WrapperBase.Access.WORM, # ********searchable datatime in local time of start location
```
I will spend ~ 10 mins trying to fix this myself while merging the change, but will file a cleanup issue if I can't get that to work.
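A minimal sketch of what that fix might look like, grouping by a precomputed local-hour column instead of converting to UTC; the flattened column name `start_local_dt_hour` is an assumption about how `data.start_local_dt` would surface in this dataframe:

```python
# A minimal sketch (not the merged fix): group by the precomputed local-time
# hour instead of converting 'start_fmt_time' to UTC. 'df' and
# 'format_duration_df' are the same names used in the hunk above.
local_hours = df['start_local_dt_hour']
duration_per_hour = format_duration_df(
    df.groupby(local_hours).agg({'duration': 'median', 'section': 'count'}),
    time_column_name='Hour',
)
```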
After resetting and re-running the pipeline, we get 14k points at the start and 824 points at the end.
Ran into the duplicate entries for another user. We might want to write a check for this and run it on production before pushing it out. Let's switch to open access for a bit to see if things are better.
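A rough sketch of what such a check could look like, counting inferred sections per (user_id, cleaned_section) with a MongoDB aggregation (illustrative, not an agreed-upon implementation):

```python
import emission.core.get_database as edb

# Report any cleaned section that has more than one inferred section,
# which would indicate duplicates like the ones seen above.
pipeline = [
    {'$match': {'metadata.key': 'analysis/inferred_section'}},
    {'$group': {
        '_id': {'user_id': '$user_id', 'cleaned_section': '$data.cleaned_section'},
        'count': {'$sum': 1},
    }},
    {'$match': {'count': {'$gt': 1}}},
]
for dup in edb.get_analysis_timeseries_db().aggregate(pipeline):
    print(dup['_id'], dup['count'])
```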
@achasmita I don't see how big these zones are. How did you pick them?
While selecting zones, I checked the trip table to find the best areas, based on latitude and longitude, that had the most trips:
Expanded version, dictated to @shankari
This shows us that the trips that are within the start and end polygons are shown in the trip time table. Concretely, if the trips that you found were … My concern is not that the trip time table is inaccurate, but that it is incomplete.
I tried exploring start and end zones with different segment sizes:
@achasmita Thank you for adding additional examples with the results of your investigation. I have some more questions.
Can you also expand on what you did in "I also explored data in both zone(start/end)"? How did you explore the data, and what were the results?
The above screenshot was just to make sure I was selecting the correct size of zone; I will find the other screenshot for 25-30 trips and post it soon. For the data, I printed the top 50 and bottom 50 rows after removing duplicates and compared them with the data in the trip table to see if I could figure out whether any data was missing.
Can you expand on this? |
My main concern with this is that we were getting very few trips displayed for basically any start/end combo.
@achasmita was not able to get (1) without making the polygons really large and wasn't able to come up with (2). As a concrete example, on staging, I would expect to see a lot of trips from my house to the library, to the grocery store nearby, or to my kids' school. In particular, I would expect to see at least 100 trips from my house to the local school. Similarly, in the Denver area, you could see the locations that are hotspots in the heatmap and try to see if there are trips between them.
While trying this branch, I initially got this error:
I realized this is because this branch was using an old image of e-mission-server. Updated to the most recent (shankari/e-mission-server:master_2024-02-10--19-38) and rebuilt. Now there's a different error:
There are 807 inferred sections for one cleaned section?? I inspected the logs to see which UUID + section this is happening for. It happens on:

Sure enough, there are 807 inferred section entries for that UUID and that cleaned section.

```python
query = {
    'metadata.key': 'analysis/inferred_section',
    'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'),
    'data.cleaned_section': ObjectId('644df8edea199f1d0473e301')
}
r = edb.get_analysis_timeseries_db().find(query)
for i in r:
    print(i)
```
I removed those duplicates and tried again, but there appear to be duplicates for all the other sections too. I am unsure why there are so many duplicates.
I removed the duplicates manually with this script:

```python
import emission.core.get_database as edb

# get cleaned sections
cleaned = edb.get_analysis_timeseries_db().find({
    'metadata.key': 'analysis/cleaned_section',
})
cleaned_ct = 0
for c in cleaned:
    cleaned_ct += 1
    query_inferred = {
        'metadata.key': 'analysis/inferred_section',
        'user_id': c['user_id'],
        'data.cleaned_section': c['_id']
    }
    first_inferred = edb.get_analysis_timeseries_db().find_one(query_inferred)
    if first_inferred is None:
        print(f"cleaned section {cleaned_ct} had no inferred sections")
        continue
    # remove all of those entries unless the ID is the first inferred section
    dedup_query = {
        'metadata.key': 'analysis/inferred_section',
        'user_id': c['user_id'],
        'data.cleaned_section': c['_id'],
        '_id': {'$ne': first_inferred['_id']}
    }
    delete_result = edb.get_analysis_timeseries_db().delete_many(dedup_query)
    print(f"removed {delete_result.deleted_count} duplicates from cleaned section {cleaned_ct}")
```

It took a while to run. Now I am finally able to test the Segment Trip Time page.
For trips from home to school, I found 222 trips spanning from July 2022 to December 2023. This seems to align with expectations because there are about 180 school days in 1 year. The boxes I used were about the size of 1 block. I will follow up with smaller boxes.
Home to school

<image> <image>

The trips span from July 2022 to December 2023. This seems to align with expectations because there are about 180 school days in 1 year. So I do think this is probably a pretty comprehensive measure of this repeated trip.

Home to viola class

<image> <image> <image>

Methodology for drawing boxes

I found these usage guidelines quite helpful and accurate:

For reference, below are the heatmaps around those 3 areas of interest. I drew the boxes considering where the locus of activity seems to be for each area (and considering the guidelines above).

(I wonder if this tool could potentially be even more useful and easier to use if a heatmap was actually overlaid on the start/end selection area? I found myself switching back and forth often.)

<image>

Conclusion

Based on this, the tool does appear to work as expected. It captured a fairly comprehensive, if not fully comprehensive, picture of the above recurring trips. I also briefly validated the tool against my own travel data.

The only changes I might suggest would include a heatmap overlay to make it easier to identify places of activity while drawing boxes, and potentially a toggle to "swap" start and end locations. (I see a common use case where the user has seen the duration from A to B; now they want to see the duration from B to A, but they don't want to have to re-draw the boxes.)
@JGreenlee thanks for the comprehensive review! Given the length of time that this has been pending, I will merge the changes now for the next round and we can address the UX improvements in a subsequent round.
@shankari Great!
Actually it looks like you just did that a few days ago!
I resolved the merge conflicts for this feature and updated it to observe the global filters (which were added since this feature was created).
@shankari |
Hey,
This feature adds a new page helping users compute average trip duration between two selected points.
I wanted to experiment with a simple level of service indicator based on e-mission data.
The use case comes from feedback from a local authority in France. They expressed a need for alternative ways of gathering travel time information that don't rely on buying data from incumbent providers.
I do believe that e-mission can be pretty good for this, since trip completeness isn't required to compute average durations. We can get pretty good results with low amounts of data.
The initial idea sparked a discussion around map matching. While results would probably be more accurate with map matching, simple curve-fitting, as done when creating `analysis/recreated_locations` entries, seems to already be performing well enough.

User interaction
Here is a quick demo of the new page (the database only has one user's data, mine).
The user is asked to draw a start zone and an end zone on the map.
Using this information, queries are made to fetch recreated_locations matching either point. The resulting data is then displayed.
The proof of concept includes the following tables:
More useful stats could be added in the future, for example separating between weekdays and weekends.
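As a sketch of the weekday/weekend idea (the column names mirror the ones already used on this page; the toy dataframe is only for illustration):

```python
import pandas as pd

# Toy sections dataframe; 'start_fmt_time', 'duration', and 'section' mirror
# the columns already used by this page, the values are made up.
df = pd.DataFrame({
    'start_fmt_time': ['2023-05-01T08:10:00', '2023-05-06T09:30:00'],
    'duration': [600, 750],
    'section': ['a', 'b'],
})
times = pd.to_datetime(df['start_fmt_time'], errors='coerce', utc=True)
# Monday=0 .. Sunday=6, so 5 and above are weekend days
day_type = times.dt.dayofweek.map(lambda d: 'Weekend' if d >= 5 else 'Weekday')
duration_by_day_type = df.groupby(day_type).agg({'duration': 'median', 'section': 'count'})
print(duration_by_day_type)
```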
A commonly used statistic is average vehicle speed. This requires information on the true distance travelled. However, distance is complex to compute with recreated_locations: since we only fetch start and end points, we lose the distance information from the intermediary points. Computing it would require further queries to sum the distances of all intermediary points. The same is true for speeds. This is not complex, but could be heavy on the database / memory. In reality, the distance is most likely already known by the user, at least for the "usual" path.
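For completeness, a sketch of what those further queries could look like; the timestamp bounds, key names, and missing user filter are simplifications for illustration, not part of this PR:

```python
import emission.core.get_database as edb

# Hypothetical bounds for one matched trip segment (unix timestamps)
segment_start_ts = 1.6e9
segment_end_ts = 1.6e9 + 900

# Sum the per-point 'distance' values of the intermediary recreated locations.
# A real query would also filter by user_id and the specific trip/section.
query = {
    "metadata.key": "analysis/recreated_location",
    "data.ts": {"$gte": segment_start_ts, "$lte": segment_end_ts},
}
true_distance = sum(p["data"]["distance"]
                    for p in edb.get_analysis_timeseries_db().find(query))
```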
New project requirements
I have added the `dash_leaflet` library because built-in mapbox plots aren't great:

Permissions
Three configurations are linked to this new page:

- `segment_trip_time`: User can view this page (default `true`)
- `segment_trip_time_full_trips`: User can see the table containing non-aggregated data (default `true`)
- `segment_trip_time_min_users`: Minimal number of distinct users in the data required to display anything (value is a number, default `0`). This parameter should help with guaranteeing some kind of anonymity; otherwise a user could target a specific house as a start point and leak personal travel data that way.
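As an illustration of the `segment_trip_time_min_users` semantics (the dict and the `should_display` helper below are a sketch, not how the PR actually reads its configuration):

```python
# Example permission values as they might be configured; illustrative only.
permissions = {
    "segment_trip_time": True,
    "segment_trip_time_full_trips": True,
    "segment_trip_time_min_users": 5,
}

def should_display(results_df):
    """Only show results if enough distinct users contributed data."""
    return results_df["user_id"].nunique() >= permissions["segment_trip_time_min_users"]
```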
Dev notes

Up until now, the admin-dashboard was querying data once at startup. This PR behaves differently, with database queries on user actions. The code includes a few comments on why this was done this way, and the performance implications.
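A minimal sketch of that "query on user action" pattern with a Dash callback; the component ids and the `fetch_segment_durations` helper are hypothetical, not the PR's actual code:

```python
from dash import Input, Output, callback

def fetch_segment_durations(start_zone, end_zone):
    # stand-in for the geoquery-based fetch; a real implementation would hit
    # the database here with the drawn polygons
    return []

@callback(
    Output("segment-trip-time-table", "data"),
    Input("start-zone", "value"),
    Input("end-zone", "value"),
)
def update_table(start_zone, end_zone):
    # The database is queried only when the user changes the zones,
    # instead of once at dashboard startup.
    if not start_zone or not end_zone:
        return []
    return fetch_segment_durations(start_zone, end_zone)
```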
Hope this is useful for someone else :)