Add Time-Series Support for Counting Items in the Database #933
The admin dashboard (https://github.com/e-mission/op-admin-dashboard) displays a summary of users, trips, and so on at login. The naive way to get these counts would be to retrieve all the data and count it client-side, but MongoDB supports counting on the server. Remember software engineering: avoid tight coupling, and interact through abstract interfaces. The implementation of this interface is in MongoDB calls, e.g. `...find({"metadata.key": ..., "data.ts": {"$gt": start_ts, "$lt": end_ts}})`. Later, if we want to switch to a more principled timeseries database, or to postgres, or to files, we can just build additional implementations of the abstract interface. Problem: the abstract interface doesn't currently include a count operation, so we need to add one.
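As a sketch of what that abstraction could look like (all names here are illustrative, not the server's actual API), with an in-memory backend standing in for the MongoDB implementation:

```python
from abc import ABC, abstractmethod

class TimeSeries(ABC):
    """Abstract timeseries interface; storage-specific subclasses implement it."""

    @abstractmethod
    def find_entries(self, key_list, start_ts, end_ts):
        ...

    @abstractmethod
    def find_entries_count(self, key_list, start_ts, end_ts):
        ...

class InMemoryTimeSeries(TimeSeries):
    """Stand-in backend so this sketch runs without MongoDB; the real
    implementation would issue the find()/count query quoted above."""

    def __init__(self, entries):
        self.entries = entries  # list of dicts with metadata.key / data.ts

    def find_entries(self, key_list, start_ts, end_ts):
        return [e for e in self.entries
                if e["metadata"]["key"] in key_list
                and start_ts < e["data"]["ts"] < end_ts]

    def find_entries_count(self, key_list, start_ts, end_ts):
        return len(self.find_entries(key_list, start_ts, end_ts))

ts = InMemoryTimeSeries([
    {"metadata": {"key": "background/location"}, "data": {"ts": 5}},
    {"metadata": {"key": "background/location"}, "data": {"ts": 15}},
    {"metadata": {"key": "analysis/cleaned_trip"}, "data": {"ts": 7}},
])
print(ts.find_entries_count(["background/location"], 0, 10))  # prints 1
```

Callers depend only on the abstract `TimeSeries`, so swapping MongoDB for postgres or files means adding another subclass, not touching call sites.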
There would be a few places to put this implementation:
In my opinion, the built-in timeseries would be the ideal place, since:
Additionally, while the built-in timeseries works at an individual user_id level, the aggregate timeseries works across all users.
The built-in timeseries also currently has defined query templates for building queries through a timeseries-like interface. While thinking about the interface, ask: "if I replaced MongoDB with postgres tomorrow, would this still work?" That query also handles
From my understanding, there are two timeseries DBs: 1) the regular data timeseries, and 2) the analysis timeseries. To determine which timeseries DB is required, the key can be passed in as a function parameter. _get_query() can be used to create the query template. The timeseries DB can then be fetched via edb, which connects to the MongoDB client. TODO:
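A hypothetical sketch of the query template and DB routing described above. The filter shape mirrors the `find()` call quoted earlier in the issue; the routing rule and the DB names are assumptions for illustration:

```python
def build_count_query(key, start_ts, end_ts):
    """Build a MongoDB-style filter dict, mirroring the find() filter
    quoted earlier in this issue (what a _get_query()-style helper returns)."""
    return {
        "metadata.key": key,
        "data.ts": {"$gt": start_ts, "$lt": end_ts},
    }

def get_timeseries_db_name(key):
    """Route a key to one of the two timeseries DBs.
    Assumption: analysis results use an 'analysis/' key prefix."""
    return "analysis_timeseries_db" if key.startswith("analysis/") else "timeseries_db"

query = build_count_query("background/location", 100, 200)
db = get_timeseries_db_name("analysis/cleaned_trip")
print(query, db)
```

The same filter dict can then be handed to either a `find()` or a server-side count call on whichever collection the key routes to.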
Some blockers faced while creating unit test cases:
Brief answers, after further exploration, are below:
Some test cases I’ve come up with (note: this test was incorrect; I need to pass a specific key value instead of an empty list):
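For illustration, a minimal self-contained test along those lines, with a toy count helper standing in for the real timeseries call (the key names match this issue; everything else is hypothetical):

```python
import unittest

def count_entries(entries, key):
    """Toy stand-in for a timeseries count call."""
    return sum(1 for e in entries if e["metadata"]["key"] == key)

SAMPLE = [
    {"metadata": {"key": "background/location"}},
    {"metadata": {"key": "background/location"}},
    {"metadata": {"key": "background/filtered_location"}},
]

class TestCountEntries(unittest.TestCase):
    def test_count_with_specific_key(self):
        # Pass a specific key value, not an empty list
        self.assertEqual(count_entries(SAMPLE, "background/location"), 2)
        self.assertEqual(count_entries(SAMPLE, "background/filtered_location"), 1)
```

Run it with `python -m unittest <file>`, analogous to the project's runSingleTestDebug.sh wrapper.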
I do see now that datasets are loaded using a script.
So, I seem to have understood some of the data flow for loading the input test datasets.
Testing dataset: the assertion at first gave an Error (E); I rectified this by passing two different key values:
Now, the assertion finally runs without syntax errors, but starts to Fail (F):

```
(emission) mmahadik-35383s:e-mission-server mmahadik$ ./runSingleTestDebug.sh > ../unittest_outputlog.txt
FAIL: testCountData (TestTimeSeries.TestTimeSeries)
```

Observations for the two keys:
Possible causes, in my opinion (may be wrong): there is a difference of around 180 entries between the counts for the keys background/filtered_locations and background/locations. The difference could be attributed to some extra processing on the original input, which possibly consolidates entries or removes duplicates, for instance. Checked in the docker container with `show collections`.
Actual reason (in my case): I was manually loading this specific dataset, which is also mentioned in TestTimeSeries.py. However, in addition to this, there’s also the
For the expected output, I am simply checking the count manually in the sample input data JSON file, using search-and-find in the VS Code IDE for a specific key, for instance
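The manual search-and-find check could also be done programmatically. This toy sketch also shows how loading the same dataset twice inflates every count, which is the kind of mismatch described above (the entry shape follows the server's `metadata.key` convention; the data itself is made up):

```python
import json
from collections import Counter

# Toy sample standing in for the real input JSON file.
sample = json.loads("""
[
  {"metadata": {"key": "background/location"}},
  {"metadata": {"key": "background/location"}},
  {"metadata": {"key": "background/filtered_location"}}
]
""")

counts = Counter(e["metadata"]["key"] for e in sample)
print(counts["background/location"])  # prints 2

# Loading the same dataset twice (e.g. once manually, once by the test
# setup) doubles every count, which would explain an inflated total.
doubled = Counter(e["metadata"]["key"] for e in sample + sample)
print(doubled["background/location"])  # prints 4
```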
Yes, you are running a mongodb on localhost using docker. You have full access to it. You cannot and will not have access to the production database because we are not going to run unit tests against the production database.
You don't pass in a dataset as a parameter to your function, you connect to the database to read it
You don't need to do this - it is not wrong, but you can just invoke the test python file using
We should decide whether the key_list is mandatory here or not. For
With reference to the above, I did test this out by running a test. The outputs matched the expected counts and the assertions passed. So, overall, I feel the same count functionality could be achieved by refactoring the return values from find_entries(), since it also handles both databases: timeseries and analysis_timeseries.
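A toy sketch of that refactoring idea: deriving counts from `find_entries()`-style results (stand-in functions, not the real server API). Note the trade-off it illustrates: materializing all matching documents just to count them transfers data that a server-side count would avoid:

```python
ENTRIES = [
    {"metadata": {"key": "background/location"}},
    {"metadata": {"key": "background/location"}},
    {"metadata": {"key": "analysis/cleaned_trip"}},
]

def find_entries(entries, key_list):
    """Toy stand-in for the real find_entries(): returns matching documents."""
    return [e for e in entries if e["metadata"]["key"] in key_list]

def find_entries_count(entries, key_list):
    """Derive per-key counts from find_entries() results.
    This works, but fetches every matching document; a dedicated
    server-side count (e.g. MongoDB's count_documents) avoids that."""
    return {k: len(find_entries(entries, [k])) for k in key_list}

print(find_entries_count(ENTRIES, ["background/location", "analysis/cleaned_trip"]))
```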
Good point. I don't want to delay this PR any further, because we have another, similar PR to support e-mission/op-admin-dashboard#61, which I want to get done so that we can merge it. But I am supportive of taking the time to refactor
Alright, understood. I am thinking about a few ways to go about the refactoring:
No. As I said, I want to finish and commit this PR.
Right, thank you for the clarification.
With reference to the TestTimeSeries.py file's test case for the Aggregate subclass under the testFindEntriesCount module:
Problem / possible issue: when running only the current test module, only 1 distinct user is found and the output matches the expected counts. I have verified the expected outputs by using the MongoDB terminal to fetch the count values; the code comments contain details on my validation process. When running the entire test suite, this user may not be present, or the count data may not match for that user. Failed assertion:
Now, observing the database and entries, I cannot seem to find the test users that I was observing yesterday. I'm thinking that the two fixed UUIDs I observed yesterday could be from some residual data load done by me, and/or previously set-up UUIDs whose data might not have been cleared.
This is incorrect. You should use the aggregate timeseries directly. The whole point of the aggregate timeseries was to avoid making repeated queries to the database for every user.
Where did you get these expected values from?
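The point about the aggregate timeseries can be illustrated with a toy sketch: one aggregate pass should equal the sum of per-user counts, without issuing a separate query for every user (the data and helper names are made up):

```python
ENTRIES = [
    {"user_id": "u1", "metadata": {"key": "background/location"}},
    {"user_id": "u1", "metadata": {"key": "background/location"}},
    {"user_id": "u2", "metadata": {"key": "background/location"}},
]

def user_count(entries, user_id, key):
    """Count for a single user (what repeated per-user queries would do)."""
    return sum(1 for e in entries
               if e["user_id"] == user_id and e["metadata"]["key"] == key)

def aggregate_count(entries, key):
    """One pass over all users' data, as the aggregate timeseries allows."""
    return sum(1 for e in entries if e["metadata"]["key"] == key)

users = {e["user_id"] for e in ENTRIES}
total = aggregate_count(ENTRIES, "background/location")
print(total)  # prints 3
assert total == sum(user_count(ENTRIES, u, "background/location") for u in users)
```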
As I mentioned here, I had observed some data yesterday, which seemed to persist between test runs. Update: I have simplified my aggregate test case and am now seeing "aggregated" / "summed-up" values for the two users.
I understand this, but from a process perspective, I want to make sure that you do not rely on previous results to get the expected values. Even if you had some prior data, it should not have affected your expected values, since you should have determined the expected values by looking at the data related to the test, not the data in the database. That's why I ask you "Where did you get these expected values from?"
Identified the issue mentioned in this PR comment, regarding an unaccounted entry in the analysis timeseries database. Summary of the debug process:
Following up with detailed comments for each step below.
Detailed Debug process:
Entry found:
This looks like it could resolve the unaccounted entry. |
Yup! That is the principled solution. Good job tracking this down, and I hope we are able to fix it the right way soon. |
Fixed in e-mission/e-mission-server#935 |
Currently, we need to use edb and MongoDB's count_documents method to count timeseries entries, because the timeseries module does not provide the functions needed for counting documents.
There is an example of the needed functionality here:
https://github.com/AlirezaRa94/op-admin-dashboard/blob/11f5af3e8eae8a54abfb21bc225ab88f6f86b8c7/utils/db_utils.py#L106C1-L122C46
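For reference, the call pattern in that linked code boils down to `count_documents()` on the relevant collection. Here is a stand-in sketch that runs without a MongoDB server; the fake collection mimics only the one pymongo method used here, and the data is made up:

```python
class FakeCollection:
    """Stand-in with the same count_documents() call shape as pymongo's
    Collection, so this sketch runs without a MongoDB server."""

    def __init__(self, docs):
        self.docs = docs

    def count_documents(self, flt):
        def matches(doc):
            for dotted, want in flt.items():
                cur = doc
                for part in dotted.split("."):
                    cur = cur.get(part, {}) if isinstance(cur, dict) else {}
                if cur != want:
                    return False
            return True
        return sum(1 for d in self.docs if matches(d))

timeseries_db = FakeCollection([
    {"metadata": {"key": "background/location"}},
    {"metadata": {"key": "background/location"}},
    {"metadata": {"key": "statemachine/transition"}},
])

# The call shape the linked db_utils code uses against edb's collection:
n = timeseries_db.count_documents({"metadata.key": "background/location"})
print(n)  # prints 2
```

Pushing this call behind the abstract timeseries interface, rather than reaching for edb directly, is exactly what this issue proposes.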