-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make localdate queries behave like canonical datetime range filters #968
Conversation
The existing implementation that was here was addressing the localdate fields separately and treating them like a set of filters rather than one date range from start -> end. The "set of filters" was good enough for some things, like filtering to a specific month, but not for the canonical datetime range filtering where you want everything between an arbitrary start and end datetime. The complexity in the logic to form the necessary NoSQL queries is more than expected, but after the dust settles it's about the same amount of code as the old implementation was. In short, this implementation finds what units are used by both start and end localdates, walks down them (from largest unit to smallest unit), and combines $and and $or blocks to determine the upper and lower bound
The failing test: def testLocalRangeRolloverQuery(self):
"""
Search for all entries between 8:18 and 9:08 local time, both inclusive
"""
start_local_dt = ecwl.LocalDate({'year': 2015, 'month': 8, 'hour': 8, 'minute': 18})
end_local_dt = ecwl.LocalDate({'year': 2015, 'month': 8, 'hour': 9, 'minute': 8})
final_query = {"user_id": self.testUUID}
final_query.update(esdl.get_range_query("data.local_dt", start_local_dt, end_local_dt))
entries = edb.get_timeseries_db().find(final_query).sort('data.ts', pymongo.ASCENDING)
self.assertEqual(448, edb.get_timeseries_db().count_documents(final_query))
entries_list = list(entries)
# Note that since this is a set of filters, as opposed to a range, this
# returns all entries between 18 and 8 in both hours.
# so 8:18 is valid, but so is 9:57
self.assertEqual(ecwe.Entry(entries_list[0]).data.local_dt.hour, 8)
self.assertEqual(ecwe.Entry(entries_list[0]).data.local_dt.minute, 18)
self.assertEqual(ecwe.Entry(entries_list[-1]).data.local_dt.hour, 9)
self.assertEqual(ecwe.Entry(entries_list[-1]).data.local_dt.minute, 57) Since the query returned from |
The most recent run of the test found 232 entries, but expected 448. That is because it was expecting entries after 9:08. I am going to modify the test to expect 232 entries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JGreenlee I am fine with merging this for now to unblock the dashboard changes, but this is fairly complex, so could you please add in some comments, and maybe examples of the comparison query outputs so we can understand the code better.
if len(gte_lte_query) > 0: | ||
query_result.update({curr_field: gte_lte_query}) | ||
def get_range_query(field_prefix, start_ld, end_ld): | ||
units = [u for u in ecwl.DATETIME_UNITS if u in start_ld and u in end_ld] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, what if u
is in start_ld
but not in end_ld
. If I specify a month in start_ld
but no month in end_ld
, I intuitively want to get all trips after the start but be open-ended wrt the end. Alternatively, you should check that there always is a matching pair as I did before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally, at a high level, this converts the local date queries into ranges, but drops the old filtering capability completely. For local dates, I think we actually want both - I want to be able to see data for a range based on local time, but I also want to be able to see only weekday trips versus weekend trips (for example).
Again, the current implementation was a weird mishmash of range and filter, so I am glad we fixed it. I am merging this now, but I do think we should add back filter support at some point, maybe when we actually need it.
Will address these comments when I get a chance – trying to finish the metrics implementation first |
I recently rewrote the local dt queries used by TimeComponentQuery (e-mission#968) to make them behave as datetime range queries. But this results in some pretty complex queries which could have nested $and and $or conditions. It felt overengineered but that was the logic required if we were to query date ranges by looking at local dt objects where year, month, day, etc are all recorded separately. Unfortunately, those queries run super slowly on production. I think this is because the $and / $or conditions prevented MongoDB from optimizing efficiently via indexing Looked for a different solution, I found something better and simpler. At first I didn't think using fmt_time fields with $lt and $gt comparisons would work becuase they're ISO strings, not numbers. But luckily MongoDB can compare strings this way. it performs a lexicographical comparison where 0-9 < A-Z < a-z (like ASCII). ISO has the standard format 0000-00-00T00:00:00 The dashes and the T will always be in the same places, so effectively, only the numbers will be compared. And the timezone info is at the end, so it doesn't get considered as long as we don't include it in the start and end inputs. The only thing that specifically needs to be handled is making the end range inclusive. If I query from "2024-06-01" to "2024-06-03", an entry like "2024-06-03T23:59:59-04:00" should match. But lexicographically, that comes after "2024-06-03". Appending a "Z" to the end range solves this. The result of all this is that the TimeQuery class (timequery.py) is now able to handle either timestamp or ISO dates, and this is what metrics.py will use for summarize_by_yyyy_mm_dd.
e-mission#970 (comment) reverts e-mission#968 back to the original implementation since we now have a faster way to do "range" queries via FmtTimeQuery. To make it clearer that FmtTimeQuery is for ranges and TimeComponentQuery is for filters, I renamed some functions and added more descriptive comments/docstrings
The existing implementation that was here was addressing the localdate fields separately and treating them like a set of filters rather than one date range from start -> end. The "set of filters" was good enough for some things, like filtering to a specific month, but not for the canonical datetime range filtering where you want everything between an arbitrary start and end datetime. The complexity in the logic to form the necessary NoSQL queries is more than expected, but after the dust settles it's about the same amount of code as the old implementation was.
In short, this implementation finds what units are used by both start and end localdates, walks down them (from largest unit to smallest unit), and combines $and and $or blocks to determine the upper and lower bound