
Add Time-Series Support for Counting Items in the Database #933

Closed
AlirezaRa94 opened this issue Jul 10, 2023 · 24 comments

@AlirezaRa94 commented Jul 10, 2023

Currently, we need to use edb and MongoDB's count_documents method to count timeseries entries, because the timeseries module does not provide the functions needed for counting documents.

An example of the needed functionality is here:

https://github.com/AlirezaRa94/op-admin-dashboard/blob/11f5af3e8eae8a54abfb21bc225ab88f6f86b8c7/utils/db_utils.py#L106C1-L122C46

@shankari (Contributor)

The admin dashboard (https://github.com/e-mission/op-admin-dashboard) displays a summary of users, trips, and so on at login.

The naive way to get this count would be to retrieve all the data and then take its len, but that would be prohibitive performance-wise.

MongoDB supports count_documents, which returns the number of matching documents instead of the documents themselves.
However, we don't want to make mongodb calls directly in our codebase, because that ties our codebase to mongodb very tightly.

Remember software engineering: avoid tight coupling, and interact through abstract interfaces.
So I have created an abstract interface (in emission.storage.timeseries.abstract_timeseries) that represents the way in which we access data, which is essentially a timeseries with multiple datastreams. For example, we have a stream for background/location, and we ask for all entries in that stream between start_ts and end_ts.

The implementation of this interface is in mongodb calls: `...find({"metadata.key": ..., "data.ts": {"$gt": start_ts, "$lt": end_ts}})`. Later, if we want to switch to a more principled timeseries database, or to postgres, or to files, we can just build additional implementations of the abstract interface.
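Roughly, the mongodb-backed implementation issues a query like this (a sketch using names from this thread, not the exact code; `user_id`, `start_ts`, and `end_ts` are assumed to be supplied by the caller):

```python
import emission.core.get_database as edb

# Today's retrieval path: fetch the matching entries for one stream
# (metadata.key) within a time range
docs = edb.get_timeseries_db().find({
    "user_id": user_id,
    "metadata.key": "background/location",
    "data.ts": {"$gt": start_ts, "$lt": end_ts},
})
```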

Problem: the abstract interface doesn't include count; it only includes find and get, which retrieve data.
As a workaround, the team building the admin dashboard made the mongodb call directly and, at my request, filed this issue for us to fix it.

We need to:

  • add a count interface to the abstraction,
  • implement the interface, and add unit tests which exercise it, and
  • then tell them to change their implementation to call the abstraction instead of making the mongodb call directly.
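Something like the following might be the shape of the count addition to the abstraction (the find_entries_count name matches later comments in this thread; the exact signature is a sketch, not the final merged API):

```python
# In emission/storage/timeseries/abstract_timeseries.py (sketch)
class TimeSeries(object):
    def find_entries_count(self, key_list=None, time_query=None,
                           geo_query=None, extra_query_list=None):
        """
        Return the number of matching entries without retrieving them,
        so callers like the admin dashboard can avoid pulling all data.
        Concrete subclasses (builtin, aggregate) provide implementations.
        """
        raise NotImplementedError()
```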

@shankari shankari moved this to Issues being worked on in OpenPATH Tasks Overview Aug 23, 2023
@MukuFlash03 (Contributor)

There are a few places where the code for this could be implemented:

  • Abstract timeseries
  • Built-in timeseries
  • Aggregate timeseries

In my opinion, the built-in timeseries would be the ideal place for the implementation, since:

  • Abstract timeseries is just a template design, and the actual code implementations should be in the concrete classes.
  • Aggregate timeseries inherits from built-in timeseries, so repetitive code can be avoided by placing the implementation primarily in built-in timeseries itself.

Additionally, while the built-in timeseries works at an individual user_id level, the aggregate timeseries works across all users.
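For context, the two concrete flavors are obtained through the same factory (a usage sketch; get_aggregate_time_series appears later in this thread's test code, and get_time_series is assumed to be the per-user counterpart):

```python
import emission.storage.timeseries.abstract_timeseries as esta

ts_user = esta.TimeSeries.get_time_series(user_id)    # built-in: one user
ts_agg = esta.TimeSeries.get_aggregate_time_series()  # aggregate: all users
```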

@shankari (Contributor)

The builtin timeseries already defines query templates for building queries through a timeseries-like interface.
I would caution you not to accept a "query" as an input that you pass through, because in that case there is insufficient decoupling between the interface and the implementation.

While thinking about the interface, ask yourself: "if I replaced mongodb with postgres tomorrow, would this still work?"

The query template also handles user_id = None, which will make your implementation easier.

@MukuFlash03 (Contributor) commented Aug 25, 2023

From my understanding, there are two timeseries dbs: 1) the regular data timeseries, and 2) the analysis timeseries.
ts_enum_map uses edb calls to fetch these respective timeseries dbs.
ts_map takes in a key that maps to one of the two timeseries dbs.

To determine the required timeseries db, this key can be passed in as a function parameter.
This key also serves as the value of the metadata.key field in the document.

_get_query() can be used to create the query template.
It can take extra_query_list as a function parameter, so some coupling remains, but it is reduced, since other keys such as user and metadata are already set.

The timeseries db can then be fetched via edb, which connects to the MongoDB client.
It therefore has access to pymongo and count_documents(), which can be used to compute the total number of matching documents.
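Putting those pieces together, a hedged sketch of what the builtin-timeseries implementation might look like (method and helper names follow this thread; the exact code in the PR may differ):

```python
# Sketch only: names and signature follow the discussion above
def find_entries_count(self, key_list=None, time_query=None,
                       geo_query=None, extra_query_list=None):
    # _get_query() builds the filter; user_id, metadata.key, and the
    # time range are already handled there
    query = self._get_query(key_list, time_query, geo_query, extra_query_list)
    # The key routes to either the regular or the analysis timeseries db
    # (via ts_map / edb); simplified here to a single collection
    return self.timeseries_db.count_documents(query)
```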

TODO:

  • Testing procedure
  • Work on test cases
  • Improve code if required

@MukuFlash03 MukuFlash03 moved this from Issues being worked on to Ready for review in OpenPATH Tasks Overview Aug 26, 2023
@MukuFlash03 MukuFlash03 moved this from Ready for review to Issues being worked on in OpenPATH Tasks Overview Aug 27, 2023
@MukuFlash03 (Contributor) commented Aug 29, 2023

Some blockers faced while creating unit test cases:

  1. Can I get DB access somehow? At least read access, if not write?
  2. I need to get the count of documents directly on a particular connection to determine the expected output value. This is needed because the code under test goes through a series of functions and classes, forms a query, and then executes count_documents() on the filtered dataset.
  3. For instance, I see that real_examples are present; I can count entries in a particular JSON file manually to get my expected output, but how would I test it with my function, which calls edb.get_analysis_timeseries_db() and edb.get_timeseries_db()? How would I pass this sample real_example dataset as a parameter to my function?

Brief answers after further exploration (details below):

  1. The DB is accessible via the CLI of the docker container running mongodb.
  2. A sample dataset can be loaded using a script (refer to the README).
    Additionally, a sample dataset is set up for use in TestTimeSeries.py itself, so there is no need to load your own data manually.

Some test cases I've come up with:

(This test was incorrect: need to pass a specific key value instead of an empty list.)
Test 1 : Specific key parameters

  • key = "background/location", extra_query_list = []
  • Results in a query which matches all documents for a user with the specified key. (A corrected sketch follows below.)
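A hedged sketch of what the corrected Test 1 might look like, using the draft count_data() name from this thread; the exact signature and the expected count (taken from a later comment about the shankari_2015-aug-27 dataset) are assumptions:

```python
import emission.storage.timeseries.abstract_timeseries as esta

def testCountData(self):
    # Corrected Test 1: a specific key with an empty extra_query_list
    ts = esta.TimeSeries.get_time_series(self.testUUID)
    total_count = ts.count_data(key="background/location", extra_query_list=[])
    # Expected count validated later in this thread for shankari_2015-aug-27
    self.assertEqual(total_count, 555)
```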

@MukuFlash03 (Contributor)

I do see now that datasets are loaded using the script load_timeline_for_day_and_user.
I am now studying emission/core/get_database.py to see how it connects to the DB classes for fetching the generated DBs.

@MukuFlash03 (Contributor)

So, I seem to have understood the data flow for loading input test datasets.

Testing
I did not run all the test files, as that would take a lot of time. I used the runSingleTestDebug.sh script, modified to run only the TestTimeSeries.py test file.

Dataset
Initially, I tried to follow the commands in the e-mission-server repo's README.md to load the dataset:
./e-mission-py.bash bin/debug/load_timeline_for_day_and_user.py emission/tests/data/real_examples/shankari_2015-aug-21 test_aug_21

However, the assertion at first produced an Error (E).
This indicated an exception rather than a failed assertion; in this case, incorrect input was being passed to count_data().
I had gone ahead with an empty list as the value for key; however, get_timeseries_db() needs a non-null value, else it raises a KeyError.

I rectified this by passing two different key values:
"background/filtered_locations", "background/locations"

@MukuFlash03 (Contributor) commented Aug 29, 2023

Now the assertion finally runs without errors, but it starts to Fail (F).

```
(emission) mmahadik-35383s:e-mission-server mmahadik$ ./runSingleTestDebug.sh > ../unittest_outputlog.txt
.F......

FAIL: testCountData (TestTimeSeries.TestTimeSeries)
Test 1 : Specific key with empty extra_queries
Traceback (most recent call last):
  File "/Users/mmahadik/Documents/GitHub/e-mission-server/emission/tests/storageTests/TestTimeSeries.py", line 95, in testCountData
    self.assertEqual(total_count, 508)
AssertionError: 327 != 508
```


Observations for the two keys:

  1. Key = "background/location"
    AssertionError: 555 != 738
    Difference = 183

  2. Key = "background/filtered_location"
    AssertionError: 327 != 508
    Difference = 181


Possible causes, in my opinion (I may be wrong):

There is a difference of around 180 entries in the counts fetched for each of the keys background/filtered_locations and background/locations.

The difference could be attributed to some extra processing on the original input that merges some entries or removes duplicates, for instance.

Checked docker container
The next approach was to get into the running mongoDB docker container and run queries directly on the collections to fetch the total count of documents.
Initially I tried mongosh, which returned a command-not-found error.
I found that plain mongo runs.
Once inside mongo:
$ show dbs # This listed all the databases

$ show collections
This should have returned the collection containing the uploaded sample dataset, but I do not get any collections.
Perhaps, since I'm only loading data and not running my own test afterwards, the pipeline tasks are not completing, or perhaps the collections are cleared once processing is done.


Actual reason (in my case):

I was manually loading this specific dataset, which is also mentioned in TestTimeSeries.py:
$ ./e-mission-py.bash bin/debug/load_timeline_for_day_and_user.py emission/tests/data/real_examples/shankari_2015-aug-21 test_aug_21

However, in addition to this, the same python test file also loads the shankari_2015-aug-27 sample dataset, which is set up after the aug-21 dataset and is the one the code uses as its current dataset when running count_data().

For the expected output, I simply check the count manually in the sample input JSON file, searching in the VS Code IDE for a specific key, for instance "background/location".
On comparing the manual counts (using grep) with the output returned by the code, the output matches and the assertion passes,
with a total count of 555 for this key in the shankari_2015-aug-27 sample dataset.
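To make that manual validation step reproducible, here is a rough sketch of counting key occurrences in the sample file with Python (assuming the real_examples file is a JSON array of entries, each with a metadata.key field):

```python
import json

# Count entries for one key in a sample dataset file; the file path comes
# from this thread, the array-of-entries structure is an assumption
with open("emission/tests/data/real_examples/shankari_2015-aug-27") as f:
    entries = json.load(f)

count = sum(1 for e in entries if e["metadata"]["key"] == "background/location")
print(count)  # 555 per the comparison above
```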

@shankari (Contributor)

Can I get DB access somehow? At least read access if not write?

Yes, you are running mongodb on localhost using docker, and you have full access to it. You cannot and will not have access to the production database, because we are not going to run unit tests against the production database.

How would I pass this sample real_example dataset as a parameter to my function?

You don't pass a dataset as a parameter to your function; you connect to the database to read it.

I did not run all test files as it would take a lot time. I utilized the runSingleTestDebug.sh script and modified it to only run the TestTimeSeries.py test file.

You don't need to do this. It is not wrong, but you can just invoke the test python file using ./e-mission-py.bash and it will just work.

I had gone ahead with an empty list as the value for key; however, get_timeseries_db() needs a non-null value, else it raises a KeyError.

We should decide whether key_list is mandatory here or not. For find_entries it is not mandatory: we query both collections and merge the results. For get_data_df the key is in fact mandatory.
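For comparison, a sketch of the two existing behaviors (the signatures here are as I understand them from this thread, not verified against the code):

```python
# find_entries: key_list is optional; when None or empty, both the regular
# and the analysis timeseries collections are queried and results merged
entries = ts.find_entries(key_list=None, time_query=None)

# get_data_df: the key is mandatory, one key per call
df = ts.get_data_df("background/location", time_query=None)
```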

@MukuFlash03 (Contributor) commented Sep 1, 2023

With reference to key_list being mandatory or not:
I saw that find_entries() uses _get_entries_for_timeseries(), which also returns counts of documents.
If I make key/key_list optional in my function, I will also need to handle it being None or containing multiple keys, which, in my opinion, would duplicate a lot of code that is already present in the two above-mentioned functions.

I tested this by running find_entries() with the same inputs that I passed to my new function.
For test purposes in my local repo, I changed the return value of find_entries() to fetch only the counts.
Current: `return itertools.chain(orig_ts_db_result, analysis_ts_db_result)`
New: `return (orig_ts_db_count, analysis_ts_db_count)`

The outputs matched the expected counts and the assertions passed.

So, overall, I feel the same count functionality could be achieved by refactoring the return values of find_entries(), since it already handles both databases: timeseries and analysis_timeseries.

@shankari (Contributor) commented Sep 1, 2023

So, overall, I feel the same count functionality could be achieved by refactoring the return values of find_entries(), since it already handles both databases: timeseries and analysis_timeseries.

Good point. I don't want to delay this PR any further, because we have another, similar PR to support e-mission/op-admin-dashboard#61, which I want to get done so that we can merge that. But I am supportive of taking the time to refactor find_entries to avoid duplicated code in a separate cleanup PR.

@MukuFlash03 (Contributor) commented Sep 1, 2023

Alright, understood.
Would you want me to work on the find_entries() cleanup PR right away?

I am thinking about a few ways to go about the refactoring:

  1. Modify what the function returns to include counts:
  • This will require changing code wherever find_entries() is called.
  2. Simply take the length len(list(returned_value)), which would effectively return the count of documents (see the sketch below):
  • This wouldn't require changing the existing find_entries() code.
  • However, wherever find_entries() is called specifically to count entries, we must ensure that we pass keys pertaining to only one of the two timeseries databases, not both, because the current return value is a combined list of results from both timeseries collections.
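A minimal sketch of option 2, under the assumption that find_entries() keeps its current return value (an iterator over entries from both collections):

```python
# Option 2 as a stopgap: count by exhausting the iterator. This pulls every
# matching document over the wire, which is exactly the overhead that
# count_documents() avoids, so it is only reasonable for small result sets.
entries = ts.find_entries(key_list=["background/location"])
total_count = len(list(entries))
```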

@shankari (Contributor) commented Sep 1, 2023

Would you want me to work on the find_entries() cleanup PR right away?

No. As I said, I want to finish and commit this PR first.
Then I want you to work on the PR to support e-mission/op-admin-dashboard#61.
Then I want you to work on the label assist model scalability.
Then we can revisit the cleanup.

@MukuFlash03 (Contributor)

Right, thank you for the clarification.
I have committed the latest code and moved the PR for review.

@MukuFlash03 (Contributor) commented Sep 6, 2023

With reference to the TestTimeSeries.py file's test case for the Aggregate subclass under the testFindEntriesCount module,
I am getting a failed assertion for this test.

Problem:
The test passes when running only the specific test module, or even the specific test file.
However, it fails when the entire test suite is run using runAllTests.sh, both locally and through the GitHub CI.

Possible issue:
I am fetching the distinct users under the aggregate class object and, for each user, fetching the timeseries and counting documents.

When running only the current test module, only 1 distinct user is found and the output matches the expected counts. I have verified the expected outputs by fetching the count values from the mongodb terminal; the code comments contain details of my validation process.

When running the entire test suite, this user may not be present, or the count data may not match for that user.

Failed assertion:

```
File "/Users/mmahadik/Documents/GitHub/e-mission-server/emission/tests/storageTests/TestTimeSeries.py", line 156, in testFindEntriesCount
   self.assertEqual(count_ts7, ([1476, 1016], [5]))
AssertionError: Tuples differ: ([0, 0], [0]) != ([1476, 1016], [5])

First differing element 0:
[0, 0]
[1476, 1016]

- ([0, 0], [0])
+ ([1476, 1016], [5])

----------------------------------------------------------------------
Ran 345 tests in 1327.825s
FAILED (failures=1)
```

@MukuFlash03 (Contributor) commented Sep 6, 2023

Now, observing the database and entries, I cannot seem to find the test users I was observing yesterday.
To clarify, I had been observing 2 fixed user UUIDs that weren't changing across multiple runs of the test module,
while two others were changing dynamically, since two UUIDs are set up in the test case's setUp().

Now I'm thinking that the two fixed UUIDs I observed yesterday could be from some residual data load done by me, and/or from previously set-up UUIDs whose data might not have been cleared.

@shankari (Contributor) commented Sep 6, 2023

I am fetching the distinct users under the aggregate class object and, for each user, fetching the timeseries and counting documents.

This is incorrect. You should use the aggregate timeseries directly. The whole point of the aggregate timeseries is to avoid making repeated queries to the database for every user.
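For reference, a sketch of the aggregate-first approach: one call against the aggregate timeseries instead of a per-user loop (using the find_entries_count() name from this thread):

```python
import emission.storage.timeseries.abstract_timeseries as esta

# One aggregate query across all users, instead of N per-user queries
ts_agg = esta.TimeSeries.get_aggregate_time_series()
agg_count = ts_agg.find_entries_count(key_list=["background/location"])
```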

self.assertEqual(count_ts7, ([1476, 1016], [5]))

Where did you get these expected values from?

@MukuFlash03 (Contributor)

Now, observing the database and entries, I cannot seem to find the test users I was observing yesterday. To clarify, I had been observing 2 fixed user UUIDs that weren't changing across multiple runs of the test module, while two others were changing dynamically, since two UUIDs are set up in the test case's setUp().

Now I'm thinking that the two fixed UUIDs I observed yesterday could be from some residual data load done by me, and/or from previously set-up UUIDs whose data might not have been cleared.

As I mentioned here, I had observed some data yesterday which seemed to persist between test runs.
I had some doubts, so today I reset the environment and started from scratch.
I no longer see the data I was seeing; hence, my understanding of the functioning and usage of the aggregate timeseries was wrong.

Update: I have simplified my aggregate test case and am now seeing "aggregated" / "summed-up" values for the two users.
I will rectify the tests and push the latest code.

@shankari (Contributor) commented Sep 6, 2023

As I mentioned here, I had observed some data yesterday which seemed to persist between test runs.
I had some doubts, so today I reset the environment and started from scratch.
I no longer see the data I was seeing; hence, my understanding of the functioning and usage of the aggregate timeseries was wrong.

I understand this, but from a process perspective, I want to make sure that you do not rely on previous results to get the expected values. Even if you had some prior data, it should not have affected your expected values, since you should determine the expected values by looking at the data related to the test, not the data in the database. That's why I asked, "Where did you get these expected values from?"

@MukuFlash03 (Contributor) commented Sep 12, 2023

I identified the issue mentioned in this PR comment regarding an unaccounted entry in the analysis timeseries database.

Summary of the debug process:

  • Fetched the failed-assertion entry in the TestTimeSeries.testFindEntriesCount aggregate-user test case.
  • Identified the UUID in question.
  • Searched the code base for terms matching the entry.
  • Investigated the matching files by adding print() statements to see where the UUID is being created and used.
  • Determined whether the problem existed before this PR's code implementation or was introduced by this PR.
  • Found a possible cause: the absence of a tearDown() function in some test files.

Detailed comments for each step follow below.

@MukuFlash03 (Contributor) commented Sep 12, 2023

Detailed debug process:

  1. Fetched the failed-assertion entry
  • For debugging purposes, modified builtin_timeseries.findEntriesCount() to return both entries and counts:

```python
orig_tsdb_count = self._get_entries_for_timeseries(orig_tsdb, orig_tsdb_keys, time_query, geo_query, extra_query_list, None)
analysis_tsdb_count = self._get_entries_for_timeseries(analysis_tsdb, analysis_tsdb_keys, time_query, geo_query, extra_query_list, None)

# total_matching_count = orig_tsdb_count[0] + analysis_tsdb_count[0]
return (orig_tsdb_count, analysis_tsdb_count)
```
  • The failed assertion for aggregate users prints the entries:

```python
try:
    print("Starting aggregate test")
    ts_agg = esta.TimeSeries.get_aggregate_time_series()
    key_list4 = []
    count_ts8 = ts_agg.find_entries_count(key_list=key_list4)
    self.assertEqual(count_ts8[0][0] + count_ts8[1][0], 3607)
except AssertionError as e:
    print(f"Assertion failed for aggregate users...")
    for ct in count_ts8[1][1]:
        cte = ecwe.Entry(ct)
        print(cte.user_id)
        print(cte.metadata.key)
        print(cte)
```

Entry found:

```
uuid = 3f205663-671b-3333-91d8-347d74a88f83
key = segmentation/raw_stop

Entry({'_id': ObjectId('64fbb82b7b555615e043ef49'), 'user_id': UUID('3f205663-671b-3333-91d8-347d74a88f83'), 'metadata': {'key': 'segmentation/raw_stop', 'platform': 'server', 'write_ts': 1694218283.960188, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 9, 'day': 8, 'hour': 17, 'minute': 11, 'second': 23, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-09-08T17:11:23.960188-07:00'}, 'data': {'enter_ts': 5, 'exit_ts': 6, 'trip_id': 'test_trip_id'}})
```

  2. Identified the UUID in question
  • Unique data found in the entry includes these keys:
    'user_id': UUID('3f205663-671b-3333-91d8-347d74a88f83')
    'trip_id': 'test_trip_id'
  • This UUID was unchanged on every test run and was not being generated randomly.
  • Hence uuid.uuid4() wasn't being used, since it generates random UUIDs.
  • Found some files using uuid.uuid3(), which generates the same UUID for the same parameters.

  3. Searched for matching terms in the code base
  • The UUID didn't match directly, but uuid.uuid3() was found in 4 test files in emission/tests/storageTests/:
    TestAnalysisTimeseries.py, TestSectionQueries.py, TestStopQueries.py, TestPlaceQueries.py
  • A "test_trip_id" match was found in the first 3 test files mentioned above.
    • Matches were also found in:
      analysisTests/modeinferTests/TestFeatureCalc.py -> Ruled out for debugging, as it just sets dim.trip_id = "test_trip_id"
      analysisTests/intakeTests/TestSectionSegmentation.py -> Ruled out for debugging, as it sets up and runs tests for Android/iOS UUIDs, which didn't match the UUID in question.

  4. Added print() statements to the four test files in emission/tests/storageTests
  • These files use uuid.uuid3(), which produces the same constant UUID on every test run (see the determinism snippet after this list):
    self.testUserId = uuid.uuid3(uuid.NAMESPACE_URL, "mailto:[email protected]")
  • Added print() in each of the 4 test files at different places to see occurrences and usage of the fixed UUID.
  • Format of print() -> print("TestStopQueries.setUp L31", self.testUserId)
  • This confirmed that these files were creating the UUID in question.

  5. Determined whether the problem existed before this PR
  • Used two separate copies of the code:
    • My forked repo with the PR code
    • A fresh pull of the original e-mission-server master branch, without my PR's code
  • Included the same print() statements in the same files in both repos.
  • Ran all the tests and logged the output using:
    $ ./runAllTests.sh > ../unittest.log 2>&1
  • Got the same number of matches for the UUID in the output log file, matching exactly in both repos.
  • My forked repo had 2 extra prints for the UUID, since I had also printed it in the new test TestTimeSeries.testFindEntriesCount(), which does not exist on the original repo's current master branch.
  • This confirmed that the issue existed in the earlier implementation as well.

  6. Possible cause: absence of tearDown()
  • All other tests in emission/tests/storageTests/ have setUp() and tearDown() functions, which are special methods of Python's unittest module.
  • These 4 test files did not have a tearDown() method to tidy up after the test methods have run.
  • tearDown() in the other test files clears the database entries, which is not happening in these 4 tests:
    edb.get_timeseries_db().delete_many({"user_id": self.testUUID})
    edb.get_analysis_timeseries_db().delete_many({"user_id": self.testUUID})
  • Note for a possible future issue: one of the test files, TestTokenQueries.py, has a tearDown() but no setUp().
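As referenced in step 4, a quick snippet showing why uuid.uuid3() yields the same UUID on every run (the name string here is illustrative, not the one in the test files):

```python
import uuid

# uuid3 hashes (namespace, name), so the result is fully deterministic
u1 = uuid.uuid3(uuid.NAMESPACE_URL, "mailto:[email protected]")
u2 = uuid.uuid3(uuid.NAMESPACE_URL, "mailto:[email protected]")
assert u1 == u2  # the same UUID on every test run
```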

This looks like it could resolve the unaccounted entry.
Meanwhile, I will test it out by adding tearDown() and check whether the UUID is cleared and the original assertion for aggregate users passes.
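A minimal sketch of the tearDown() that could be added to those four test files, mirroring the cleanup that the other storage tests already do:

```python
import emission.core.get_database as edb

def tearDown(self):
    # Delete every entry created for this test's fixed UUID so that later
    # tests (e.g., the aggregate count test) start from a clean database;
    # self.testUserId is the uuid3-based UUID set in setUp()
    edb.get_timeseries_db().delete_many({"user_id": self.testUserId})
    edb.get_analysis_timeseries_db().delete_many({"user_id": self.testUserId})
```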

@shankari (Contributor)

Meanwhile, I will test it out by adding tearDown() and check whether the UUID is cleared and the original assertion for aggregate users passes.

Yup! That is the principled solution. Good job tracking this down, and I hope we are able to fix it the right way soon.

@shankari (Contributor)

Fixed in e-mission/e-mission-server#935

@github-project-automation github-project-automation bot moved this from Issues being worked on to Tasks completed in OpenPATH Tasks Overview Oct 21, 2023
@MukuFlash03 (Contributor)

Cleaned up some code in the tests pertaining to the try/except block mentioned here.

Fixed in the same commit in this PR for another issue.
