Pull from master #858
Closed
Conversation
Can you also file an issue to merge the emevalzephyr data types to master, so we don't have to keep doing this?
- Since we have already implemented many different smoothing algorithms, we pick POSDAP to use as the backup.
- If we still have outliers after the first round, and the max value is over MACH1, we fall back to the backup algo.
- After running the backup algo, if we don't have outliers, the backup algo has succeeded and we use its results.
- If we do have outliers, but the max value is under MACH1, the backup algo has succeeded and we use its results.
- If we have outliers, and the max is high (> MACH1), the backup algo has failed.

(A hedged sketch of this decision logic follows below.)

With this change, both the tests also change to the correctly deleted values:
- [16 17 18 19 20] for use case 1 (e-mission/e-mission-docs#843 (comment))
- [11] for use case 2 (e-mission/e-mission-docs#843 (comment))

In this commit, we also check in the CSV data files for the two test cases.
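This is only a sketch of the decision logic above, not the actual jump_smoothing.py code: the function and argument names are made up, MACH1 is the threshold mentioned in the message, and the exact quantity the real code compares against it may differ.

```
import numpy as np

MACH1 = 340.29  # m/s: anything faster is treated as a physically impossible jump

def pick_smoothing_result(speeds, first_round_outliers, backup_algo):
    """Hypothetical decision helper.

    speeds               -- speed series checked after the first round of filtering
    first_round_outliers -- outlier indices the first round still detects
    backup_algo          -- backup detector (e.g. POSDAP) returning outlier indices
    """
    # first round is good enough: no outliers left, or nothing impossibly fast
    if first_round_outliers.size == 0 or speeds.max() <= MACH1:
        return first_round_outliers, "first_round"

    # otherwise fall back to the backup algorithm
    backup_outliers = np.asarray(backup_algo(speeds))
    remaining = np.delete(speeds, backup_outliers)

    # backup succeeded: no outliers, or the remaining max speed is plausible
    if backup_outliers.size == 0 or remaining.size == 0 or remaining.max() <= MACH1:
        return backup_outliers, "backup"

    # backup also failed: the max is still above MACH1
    return backup_outliers, "backup_failed"
```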
…moothing file

This addresses a long-term TODO: https://github.com/e-mission/e-mission-server/blob/master/emission/analysis/intake/cleaning/cleaning_methods/jump_smoothing.py#L262

It also:
- ensures that the individual algorithms are clean and modular and don't depend on other algorithms
- means we can swap in any algorithm for the backup algo
- means we can support more complex backups in the future

Testing done:
- modified the test to pass in the backup algo
- tests pass
Added a new unit test for the case of `backup_algo == None`, which should return the original algo results.

While testing, found that the ZigZag algo returns a pandas Series, while the Posdap algo returns a numpy array, which means that combining them could be problematic. Changed ZigZag to also return a numpy array to unify the implementations (see the sketch below).

Testing done:
- All tests now pass
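A minimal sketch of the unification, with made-up function names standing in for the actual algorithm classes:

```
import numpy as np
import pandas as pd

def zigzag_mask(with_speeds_df: pd.DataFrame) -> np.ndarray:
    """Stand-in for the zigzag detector: it naturally builds a boolean Series
    aligned with the input frame, so it converts before returning."""
    mask = pd.Series(True, index=with_speeds_df.index)
    return mask.to_numpy()          # unify with POSDAP, which returns an ndarray

def posdap_mask(with_speeds_df: pd.DataFrame) -> np.ndarray:
    """Stand-in for POSDAP, which already returns a plain numpy array."""
    return np.ones(len(with_speeds_df), dtype=bool)

df = pd.DataFrame({"speed": [1.0, 2.0, 3.0]})
combined = zigzag_mask(df) & posdap_mask(df)    # same type, safe to combine
```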
Before this change, we only used one algorithm, so we hardcoded it into the result. However, we can now use either the main algorithm or the backup algorithm, so we also return the algo from `get_points_to_filter` and attribute it correctly.

`get_points_to_filter` is used only in `location_smoothing` and in the tests, so we also fix the tests to read both values and check the selected algo in each case.

Testing done: tests pass
- Unify algo outputs: `self.inlier_mask_ = self.inlier_mask_.to_numpy()`
- Remove `to_numpy()` from all the checks in the tests
- Return two outputs -> `return (None, None)`

Testing done:
- All tests in this file pass
When we moved the second round checks to the calling function in cebb81f, we caused a very subtle regression.

The filtering code had an early return if there were no jumps detected, so in that case we would not try the second round of checks or attempt to filter again. However, when we moved the second round checking to the outer function, we called the second round anyway, even if the first round didn't detect any jumps. And in this one case, we actually found an outlier in the second round, which caused the test to fail.

Fixed by checking whether there were no outliers in the first round and skipping the second round check in that case (see the sketch below). Everything in the `else` for the `if outlier_arr[0].shape[0] == 0:` is unchanged, just indented in a bit. The check for the length was unexpectedly complicated and took many hours to debug, so I added it as a simple use case.

Note also that it is not clear if this is the correct long-term approach. If there were no jumps, then why did using the backup change anything? Maybe we should always use the backup. But changing this to avoid the regression for now; will look at this the next time we look at smoothing.

Testing done:
- `TestPipelineRealData.testIosJumpsAndUntrackedSquishing` passes
- `TestLocationSmoothing` passes
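A rough outline of the fixed control flow; only the `outlier_arr[0].shape[0] == 0` guard comes from the change itself, everything else is a placeholder:

```
def filter_jumps(outlier_arr, run_second_round):
    """Hypothetical outline of the fixed flow: skip the second round check
    entirely when the first round found no jumps."""
    if outlier_arr[0].shape[0] == 0:
        # no jumps detected in the first round: restore the early return that
        # the pre-refactor filtering code effectively had
        return outlier_arr
    else:
        # everything previously at this level is unchanged, just indented
        # under the else; the second round only runs when there were jumps
        return run_second_round(outlier_arr)
```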
`get_filtered_points` is not used anywhere else. We don't need to print out the Series and the numpy version any more, now that we have added the unit test in 5a4ae3d.
PR to filter big jumps even if all segments are in clusters
Testing done: Test passes
…d CI

`TestExpectationPipeline` assumes that there are `mode_confirm` and `purpose_confirm` objects, because they will match and have a high confidence. If we add a new key `user_input`, then there are no matches for it, so we end up with the `-1` option (see details below). This is a regression caused by ae325ec.

We fix this by changing the config to only include mode and purpose confirm during setup, using the built-in `override_keys`, and by restoring the config at the end. Also add some code to help debug this (currently commented out).

Testing done:
- Tests pass

Expectation test failure:
- the test is failing with trip 795
- the date is `arrow.get("2021-06-08T20:00:00.000", tzinfo=tz),`
- we expect `795: {"type": "none"},`
- we get `{'type': 'randomDays', 'value': 2}`

The only config which has `randomDays` is the relaxed configuration, and we should only see that for the `-1` trigger. In our case, though, we have a label

```
[{'labels': {'mode_confirm': 'bike', 'purpose_confirm': 'work'}, 'p': 0.96},
 {'labels': {'mode_confirm': 'walk', 'purpose_confirm': 'shopping'}, 'p': 0.04}]
```

We also have

```
'expectation': {'to_label': True}, 'confidence_threshold': 0.55}
```

which means that it should have returned

```
{ "trigger": 1, "expect": { "type": "none" } }
```

Why did that not happen? The mode config is

```
{'label': 'relaxed', 'confidenceThreshold': 0.55,
 'rules': [{'trigger': -1, 'expect': {'type': 'randomDays', 'value': 2}, 'notify': {'type': 'dayEnd'}},
           {'trigger': 1, 'expect': {'type': 'none'}},
           {'trigger': 0.95, 'expect': {'type': 'randomFraction', 'value': 0.05}, 'notify': {'type': 'dayEnd'}},
           {'trigger': 0.75, 'expect': {'type': 'randomDays', 'value': 2}, 'notify': {'type': 'dayEnd'}}]}
```

But the user input key list (`for input in _keylist:`)
https://github.com/e-mission/e-mission-server/blob/b7749d0a33f579eab8362416c32d2dcebd4d0b0c/emission/analysis/configs/expectation_notification_config.py#L106
populated from
https://github.com/e-mission/e-mission-server/blob/b7749d0a33f579eab8362416c32d2dcebd4d0b0c/emission/analysis/configs/expectation_notification_config.py#L30
also includes `userinput`, which does not have any matches, so it matches the `-1` trigger.
Before this, we attempted to stash the config before changing it by using code similar to `config_stash = <module>.config`.

Alas, due to the lack of a `copy.copy` around it, the stashed value was just a reference to the original value, and any further modifications to `<module>.config` are also reflected in the stash. So restoring at the end using `<module>.config = config_stash` doesn't actually restore it.

This is not visible while running the tests individually, since we don't attempt to use the "restored" value after it is restored. However, while running multiple tests in the same process, the overridden config is retained and causes other tests to fail.

Concretely, the `preprocess_trip` option from `TestExpectationPipeline.py` was not reverted, and ended up causing an unrelated test (TestPipelineReset) to fail with the following error:

```
======================================================================
ERROR: testResetToFuture (pipelineTests.TestPipelineReset.TestPipelineReset)
- Load data for both days
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/e-mission-server/e-mission-server/emission/tests/pipelineTests/TestPipelineReset.py", line 412, in testResetToFuture
    etc.runIntakePipeline(self.testUUID)
  File "/home/runner/work/e-mission-server/e-mission-server/emission/tests/common.py", line 196, in runIntakePipeline
    eaue.populate_expectations(uuid)
  File "/home/runner/work/e-mission-server/e-mission-server/emission/analysis/userinput/expectations.py", line 23, in populate_expectations
    expected_trip = _process_and_save_trip(user_id, inferred_trip, ts)
  File "/home/runner/work/e-mission-server/e-mission-server/emission/analysis/userinput/expectations.py", line 38, in _process_and_save_trip
    if _test_options["preprocess_trip"] is not None: _test_options["preprocess_trip"](expected_trip)
  File "/home/runner/work/e-mission-server/e-mission-server/emission/tests/pipelineTests/TestExpectationPipeline.py", line 61, in <lambda>
    eaue._test_options["preprocess_trip"] = lambda trip: self.preprocess(trip)
  File "/home/runner/work/e-mission-server/e-mission-server/emission/tests/pipelineTests/TestExpectationPipeline.py", line 67, in preprocess
    trip["data"]["end_ts"] = self.contrived_dates[self.fingerprint(trip)].float_timestamp
KeyError: 566
```

Change: add `copy.copy` to all config stashes. (A toy demonstration of the reference-vs-copy behaviour follows below.)

Testing done:

```
PYTHONPATH=. python -m unittest discover -s emission/tests/pipelineTests/ -p Test*
```

results in

```
----------------------------------------------------------------------
Ran 35 tests in 294.002s

OK
```
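A toy demonstration of the underlying Python behaviour (not the server code): without `copy.copy`, the stash is just another name for the same dict, so later edits leak into it.

```
import copy

config = {"preprocess_trip": None}

# buggy stash: both names refer to the same dict object
config_stash = config
config["preprocess_trip"] = lambda trip: trip
assert config_stash["preprocess_trip"] is not None   # the stash changed too

# fixed stash: a shallow copy detaches the stash from later modifications
config = {"preprocess_trip": None}
config_stash = copy.copy(config)
config["preprocess_trip"] = lambda trip: trip
config = config_stash                                 # restore at the end
assert config["preprocess_trip"] is None              # restore now actually works
```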
✅ Enable pipeline tests
To create confirmedplace.py, I also created expectedplace.py and inferredplace.py
Because these files are not needed (places do not need inferred and expected versions), I am removing them.
…e-mission-server into conda_version_upgrade
pipelinestate.py
- Added a 17th enum value to PipelineStages named "CREATE_PLACE_OBJECTS"

reset.py
- Added the CREATE_PLACE_OBJECTS pipeline stage to the stages_with_fuzz array

common.py
- Added a "create_place_objects" function after all the other pipeline stage functions

intake_stage.py
- Added a stage for the intake to store the pipeline time for CREATE_PLACE_OBJECTS and to call the create_place_objects function

confirmedplace.py
- Removed expected_place and primary_section, because @shankari said in our last meeting that the confirmedplace object does not need these

matcher.py
- Created the create_place_objects function, which follows the same format as create_confirmed_objects, but only for trips. It uses the same existing mark_confirmed_object_creation_done/failed functions, and the same get_time_range_for_confirmed_object_creation
- Added a create_confirmed_places function which follows the same format as create_confirmed_trips; I removed any data which does not exist in the confirmed place data model (emission/core/wrapper/confirmedplace.py)
-- additions matching function genericized for both trips and places
- this new stage in the pipeline will create composite trip objects, after confirmed objects have been created
Per shankari#4 (comment) we will rename this to confirmed_object since it does not specifically need to be a trip
Per shankari#4 (comment) we will remove inferred_labels, expectation, inferred_primary_mode, and user_input since confirmed_place will not need that for now
…g.debug

Per shankari#4 (comment) we will remove logging.info and replace with logging.debug statements
In 73414b9, we added logging to track how long the section summaries take. The log prints out the trip id as the first line of the function:

```
+        logging.debug(f"get_section_summary({cleaned_trip['_id']}, {section_key}) called")
```

So the error when we pass in a fake trip is now that the `_id` doesn't exist, not that the `user_id` doesn't exist. Fixed by changing the expected exception.

Testing done:

```
----------------------------------------------------------------------
Ran 1 test in 0.247s

OK
```
In 0403ae0, we start precomputing the pipeline stage and storing it in the profile. This means that we need the user to have a profile to begin with. In the real world, users will always have profiles, since we register them during `profile/create`.

In this test, we set the `testEmail` so that we register the users as part of the test as well. This implies that we also:
- need to delete the profile_db() as part of cleanup
- use the version of the pipeline in `epi` so that it will run the pre-population code

Once we do that, the tests pass. However, when they were failing, we realized that we can't compare two tuples with `assertAlmostEqual`, since the assertion can't subtract tuples to get the difference. Instead, we separate out the start and end and compare them independently with `assertAlmostEqual` (see the illustration below).

Testing done: both tests pass

```
----------------------------------------------------------------------
Ran 2 tests in 33.884s

OK
```
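A minimal illustration of the `assertAlmostEqual` point, with made-up range values:

```
import unittest

class RangeComparisonExample(unittest.TestCase):
    def test_compare_range_endpoints(self):
        expected_range = (100.0, 200.0)               # made-up start/end values
        actual_range = (100.000000001, 200.000000001)

        # assertAlmostEqual(expected_range, actual_range) would blow up with a
        # TypeError, because it computes round(first - second, 7) and tuples
        # don't support subtraction; so compare start and end independently.
        self.assertAlmostEqual(expected_range[0], actual_range[0])
        self.assertAlmostEqual(expected_range[1], actual_range[1])

if __name__ == "__main__":
    unittest.main()
```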
⚡ Do nothing well (2023 server edition)
…ated

+ remove computing the cached diary objects using the OUTPUT_GEN pipeline state
+ the diary screen is gone; we don't even use the cached values properly any more
+ no need to waste time iterating through trips, sections and trajectories, and filling in documents that nobody uses

This is a minimal change that should help get us to production. Removing all vestiges of OUTPUT_GEN from the codebase is left as a task for a future intern.

Testing done:

```
./e-mission-py.bash emission/tests/analysisTests/intakeTests/TestPipelineCornerCases.py
----------------------------------------------------------------------
Ran 3 tests in 12.238s

OK
```
⚡ 🐛 Mark the pipeline range right after the composite trips a…
The following changes support e-mission-server-eval-private's TRB_label_assist, reducing the dependence on the custom branch.
Moved the `clusteringWay`-based decision making during binning further upstream, thus generalising the `similar` (in `similarity_metrics.py`) and `similarity` (in `od_similarity.py`) functions. They can now be used across modules without the need for a `clusteringWay` parameter.
Comment fixes for better readability.
Implemented code for issue #933 in e-mission-docs, adding functionality to count the number of documents. I've determined that the 'key' parameter can be passed to retrieve the appropriate timeseries db collection. A query is generated with an optional extra_query keys list, which returns the filtered data set. (A rough sketch of the idea follows below.)
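Roughly, the underlying idea is the one below. This is not the server implementation; the collection and field names (`Stage_timeseries`, `metadata.key`) are assumptions about the usual layout.

```
import pymongo

def count_entries(collection, key, extra_query_list=None):
    """Count documents matching a timeseries key plus optional extra clauses."""
    query = {"metadata.key": key}
    for extra in (extra_query_list or []):
        query.update(extra)                 # each extra clause narrows the filter
    return collection.count_documents(query)

# Example usage against a local MongoDB (names are assumptions):
# client = pymongo.MongoClient("localhost")
# n = count_entries(client.Stage_database.Stage_timeseries, "background/location")
# print(n)
```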
Updated variable name in definition but missed changing it in its usage in 'total_entries'.
- Fixed previous commits in the actual count_data() function. I had ignored the ordering of the arguments passed to the _get_query() template method, which takes parameters other than "key, extra_query_list". So I now use the actual keyword of the parameter, "extra_query_list", to assign it.
- Primarily, added a basic test case for the count_data() functionality, using the "background/location" key value for the "shankari_2015-aug-27" sample dataset.

TODO:
- Working on edge cases.
Tests created to confirm that the configurations for trip clustering (origin, destination, and origin-destination) work as expected inside the GreedySimilarityBinning class in the `greedy_similarity_binning.py` file. In order to upgrade the old tests, `generate_mock_trips` in `modellingTestAssets.py` was also changed. Previously, out of the n trips generated, m had both origin and destination either inside or outside the threshold, thus allowing only 2 configurations. Now, 4 configurations are possible: origin only, destination only, origin-and-destination, or neither origin nor destination. The default is set to 'origin-and-destination' since this was the old default.
Checking that `Similarity` behaves as expected when a list of size 2 (for only origin OR only destination) or size 4 (for origin AND destination) is passed. (A sketch of this contract follows below.)
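A hedged sketch of that contract, with illustrative names; the real `od_similarity.py` code may differ in details:

```
import math

def similarity(points_a, points_b, threshold):
    """points_* are flat coordinate lists: [lon, lat] to compare one trip end
    (origin-only or destination-only clustering), or
    [o_lon, o_lat, d_lon, d_lat] to compare both ends."""
    assert len(points_a) == len(points_b) and len(points_a) in (2, 4)
    for i in range(0, len(points_a), 2):
        # every requested trip end must be within the threshold distance
        if math.dist(points_a[i:i + 2], points_b[i:i + 2]) > threshold:
            return False
    return True

# e.g. origin-only check:
# similarity([0.0, 0.0], [0.0001, 0.0001], threshold=0.001)
```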
…ases

- Updated the function name and the function signature to include other common parameters like time_query and geo_query, along with key and extra_queries, to maintain consistency with existing functions.
- Modified test cases to include dataset samples with more specific keys.
- Provided code comments to clarify the validation process for inputs and outputs of the test datasets.
- Repositioned assert statements below their respective computations.
1. Improved logic based on this comment: 710d1a5#r1314065502
2. Created a utilities file for repetitive code required by multiple files.
3. Clustering threshold back to 500.
4. More in-code comments.
Code logic:
- Updated the code to return counts from both original_timeseries and analysis_timeseries as a tuple of lists pertaining to each key for each timeseries db.

Test cases:
- Updated test cases to include counts from both DBs.
- Included new test cases for edge conditions.
- Test cases include:
  1. Key_list of keys from both DBs
  2. Key_list of keys from the original timeseries DB
  3. Key_list of keys from the analysis timeseries DB
  4. Empty key_list
  5. Invalid keys
  6. Aggregate timeseries DB data as input
  7. New user without any data
- Had misunderstood the usage of the aggregate timeseries class.
- Corrected the test case for the aggregate class to check the sum of counts for each user without explicitly fetching each user.
- Added counts of grep outputs for sample inputs.
- Explained expected scenarios for blank keys.
- Removed the miniconda.sh file added mistakenly post re-setup.
- Reused existing functionality from builtin_timeseries.py to fetch the computed count of entries. (A hedged sketch follows below.)
- Corrected the return values to return the combined sum of the count of matching entries, not segregated by key or by timeseries dataset.
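A hedged sketch of the reuse described above; the `_get_query` keyword names and the db attributes follow the wording in these commits but may not match builtin_timeseries.py exactly:

```
def find_entries_count(ts, key_list=None, time_query=None, geo_query=None,
                       extra_query_list=None):
    """Build the filter with the existing _get_query template method and let
    MongoDB do the counting, summing across both timeseries collections."""
    query = ts._get_query(key_list=key_list, time_query=time_query,
                          geo_query=geo_query, extra_query_list=extra_query_list)
    # combined count, not segregated by key or by dataset
    return (ts.timeseries_db.count_documents(query) +
            ts.analysis_timeseries_db.count_documents(query))
```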
Random trips are now generated like this: if certain trips are to be binned together (by 'o', 'd', 'od', or '__', meaning none), they are generated in proximity to the previous in-bin trip. Otherwise, if they are not to be binned together, we keep generating random trips until we find one that would not bin with the previously accepted trips. (A small rejection-sampling sketch follows below.)
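The "keep generating until it does not bin" step is essentially rejection sampling; a small generic sketch with illustrative names and planar coordinates:

```
import math
import random

def sample_unbinned_point(accepted_points, threshold, max_tries=1000):
    """Draw a random point that is farther than `threshold` from every already
    accepted point, so the resulting trip will not bin with any of them."""
    for _ in range(max_tries):
        candidate = (random.uniform(-180, 180), random.uniform(-90, 90))
        if all(math.dist(candidate, p) > threshold for p in accepted_points):
            return candidate
    raise RuntimeError("could not find a point outside all existing bins")
```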
- Commented out the test case for the aggregate subclass with an empty key list as input.
- The previous commit failed when running all tests together, due to an unaccounted-for extra document entry in analysis_timeseries.
- Ignoring this test for now.
`od_similarity.py`
1. Explicitly passing 'origin', 'destination', 'origin-destination' for the similarity check in `similarity`

`similarity_metric.py`
2. Passing the clustering_way parameter

`greedy_similarity_binning.py`
3. Since this decision making is moved downstream to `similarity`, removing it from here

`modellingTestAssets.py`
4. Removing both 2-line wrappers (SetModelConfig, setTripConfig) from this file, since this was parametrised using subTest 2 commits back
5. Removed CalDistanceTest. This was introduced to keep the test's calDistance separate from the calDistance being used by `greedySimilaritybinning`. Unnecessary.
6. Using reference coordinates whenever provided to generate trip coordinates. If not provided, use randomly generated coordinates as reference points.
7. Receiving and passing origin and destination reference points in `generate_mock_trips`

`TestGreedySimilarityBinning.py`
8. Removed wrappers for trip and model generation
9. Using just a single threshold for generating trips and for binning. Removed the two thresholds.

`TestSimilarityMetric.py`
10. Removing the implicitness used in binning by passing this as a parameter
…subclass test case in TestTimeSeries.py

- Had initially removed the aggregate subclass test case in TestTimeseries.py as the overall tests were failing.
- Debugged the entire issue and found that some test files in emission/tests/storageTests did not have a tearDown() to clear up test data created during that specific test.
- These tests include: TestAnalysisTimeseries, TestPlaceQueries, TestSectionQueries, TestStopQueries.
- Added tearDown() with code to clear the database, as is done in all the other storageTests.
Added count_documents() implementation to builtin_timeseries
Generating random points from a circle (rather than a square) around the reference points. Better explanations for random point generation. Whitespace fixes. (A small sampling sketch follows below.)
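For reference, a standard way to sample uniformly from a disc rather than a square (illustrative only; it ignores the metres-to-degrees conversion a real helper would need):

```
import math
import random

def random_point_in_circle(ref_lon, ref_lat, radius):
    """Uniform sample from the disc of the given radius around the reference
    point. Taking sqrt of the uniform draw keeps the density uniform over the
    disc area; sampling from the bounding square instead can place points in
    the corners, beyond the intended threshold distance."""
    theta = random.uniform(0.0, 2.0 * math.pi)
    r = radius * math.sqrt(random.random())
    return ref_lon + r * math.cos(theta), ref_lat + r * math.sin(theta)
```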
Comments and variable names fixed
Moving dependence from custom branch's tour_model to master's trip_model
Let's see the conflicts