Skip to content

Commit

Permalink
for yyyy_mm_dd metrics, query by fmt_time instead of local_dt
Browse files Browse the repository at this point in the history
I recently rewrote the local dt queries used by TimeComponentQuery (e-mission#968) to make them behave as datetime range queries. But this results in some pretty complex queries which could have nested $and and $or conditions. It felt overengineered but that was the logic required if we were to query date ranges by looking at local dt objects where year, month, day, etc are all recorded separately.
Unfortunately, those queries run super slowly on production. I think this is because the $and / $or conditions prevented MongoDB from optimizing efficiently via indexing

Looked for a different solution, I found something better and simpler.
At first I didn't think using fmt_time fields with $lt and $gt comparisons would work becuase they're ISO strings, not numbers.
But luckily MongoDB can compare strings this way. it performs a lexicographical comparison where 0-9 < A-Z < a-z (like ASCII).
ISO has the standard format 0000-00-00T00:00:00
The dashes and the T will always be in the same places, so effectively, only the numbers will be compared. And the timezone info is at the end, so it doesn't get considered as long as we don't include it in the start and end inputs.
The only thing that specifically needs to be handled is making the end range inclusive. If I query from "2024-06-01" to "2024-06-03", an entry like "2024-06-03T23:59:59-04:00" should match. But lexicographically, that comes after "2024-06-03".
Appending a "Z" to the end range solves this.

The result of all this is that the TimeQuery class (timequery.py) is now able to handle either timestamp or ISO dates, and this is what metrics.py will use for summarize_by_yyyy_mm_dd.
  • Loading branch information
JGreenlee committed Jun 7, 2024
1 parent 5c20021 commit 2da8918
Show file tree
Hide file tree
Showing 2 changed files with 20 additions and 18 deletions.
11 changes: 3 additions & 8 deletions emission/net/api/metrics.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
import emission.analysis.result.metrics.simple_metrics as earms
import emission.storage.decorations.analysis_timeseries_queries as esda
import emission.storage.decorations.local_date_queries as esdl
import emission.storage.timeseries.tcquery as esttc
import emission.storage.timeseries.timequery as esttq

import emcommon.metrics.metrics_summaries as emcms

Expand All @@ -25,14 +25,9 @@ def summarize_by_local_date(user_id, start_ld, end_ld, freq_name, metric_list, i
return _call_group_fn(earmt.group_by_local_date, user_id, start_ld, end_ld,
local_freq, metric_list, include_aggregate)

def summarize_by_yyyy_mm_dd(user_id, start_ymd, end_ymd, freq, metric_list, include_agg, app_config):
time_query = esttc.TimeComponentQuery(
"data.start_local_dt",
esdl.yyyy_mm_dd_to_local_date(start_ymd),
esdl.yyyy_mm_dd_to_local_date(end_ymd)
)
def summarize_by_yyyy_mm_dd(user_id, start_ymd, end_ymd, freq, metric_list, include_agg, app_config):
time_query = esttq.TimeQuery("data.start_fmt_time", start_ymd, end_ymd)
trips = esda.get_entries(esda.COMPOSITE_TRIP_KEY, None, time_query)
print('found ' + str([e for e in trips]))
return emcms.generate_summaries(metric_list, trips, app_config)

def _call_group_fn(group_fn, user_id, start_time, end_time, freq, metric_list, include_aggregate):
Expand Down
27 changes: 17 additions & 10 deletions emission/storage/timeseries/timequery.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,19 +8,26 @@
from builtins import object
class TimeQuery(object):
"""
Object that encapsulates a query for a particular time (read_ts, write_ts, or processed_ts)
Object that encapsulates a query for a range of time [start_time, end_time]
Can query by Unix timestamps with a '*_ts' time_type (like "metadata.write_ts", "data.ts", or "data.start_ts")
e.g. TimeQuery("metadata.write_ts", 1234567890, 1234567900)
Or, can query by ISO datetime strings with a '*_fmt_time' time_type (like "data.fmt_time" or "data.start_fmt_time")
This is useful for querying based on the local date/time at which data was collected
e.g. TimeQuery("data.fmt_time", "2024-06-03T08:00", "2024-06-03T16:59")
"""
def __init__(self, timeType, startTs, endTs):
self.timeType = timeType
self.startTs = startTs
self.endTs = endTs
def __init__(self, time_type, start_time, end_time):
self.time_type = time_type
self.start_time = start_time
# if end_time is an ISO string, append 'Z' to make the end range inclusive
# (because Z is greater than any other character that can appear in an ISO string)
self.end_time = end_time + 'Z' if isinstance(end_time, str) else end_time

def get_query(self):
time_key = self.timeType
ret_query = {time_key : {"$lte": self.endTs}}
if (self.startTs is not None):
ret_query[time_key].update({"$gte": self.startTs})
time_key = self.time_type
ret_query = {time_key : {"$lte": self.end_time}}
if (self.start_time is not None):
ret_query[time_key].update({"$gte": self.start_time})
return ret_query

def __repr__(self):
return f"TimeQuery {self.timeType} with range [{self.startTs}, {self.endTs})"
return f"TimeQuery {self.time_type} with range [{self.start_time}, {self.end_time})"

0 comments on commit 2da8918

Please sign in to comment.