
Pull from master #858

Closed
wants to merge 594 commits into from

Conversation

@shankari (Contributor) commented Jun 8, 2022

Let's see the conflicts

@shankari (Contributor, Author) commented Jun 8, 2022

  • entry: keep both
  • webapp: keep both (mostly), except for the line with `@post("/datastreams/find_entries/<time_type>", method=['OPTIONS', 'POST'])`: keep the zephyr version
  • timeseries: keep both

@shankari (Contributor, Author) commented Jun 8, 2022

Can you also file an issue to merge the emevalzephyr data types to master so we don't have to keep doing this?

shankari and others added 28 commits January 23, 2023 22:33
- Since we have already implemented many different smoothing algorithms, we
  pick POSDAP to use as the backup
- If we still have outliers after the first round, and the max value is over
  MACH1, we fall back to the backup algo
- After running the backup algo, if we don't have outliers,
  the backup algo has succeeded and we use its results
- If we do have outliers, but the max value is under MACH1,
  the backup algo has succeeded and we use its results
- If we have outliers, and the max is high (> MACH1),
  the backup algo has failed

With this change, both tests now expect the correctly deleted values:
- [16 17 18 19 20] for use case 1 (e-mission/e-mission-docs#843 (comment))
- [11] for use case 2 (e-mission/e-mission-docs#843 (comment))

In this commit, we also check in the csv data files for the two test cases
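To make the fallback decision tree above concrete, here is a minimal sketch. The names are assumptions for illustration (a `points` DataFrame with a `speed` column, algo objects with a `filter()` method returning a boolean inlier mask, and a `(None, None)` failure return), not the actual e-mission API:

```
import pandas as pd

MACH1 = 340.29  # m/s; sustained ground-travel speeds above this are implausible

def smooth_with_backup(points, main_algo, backup_algo):
    # First round: run the main smoothing algorithm
    inlier_mask = main_algo.filter(points)
    outliers = points[~inlier_mask]
    if len(outliers) == 0 or outliers.speed.max() <= MACH1:
        # No outliers, or the remaining max speed is plausible: main algo wins
        return inlier_mask, main_algo
    # Outliers remain and the max speed is over MACH1: fall back to the backup
    backup_mask = backup_algo.filter(points)
    backup_outliers = points[~backup_mask]
    if len(backup_outliers) == 0 or backup_outliers.speed.max() <= MACH1:
        return backup_mask, backup_algo  # backup succeeded
    return None, None                    # backup also failed
```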
…moothing file

This addresses a long-term TODO
https://github.com/e-mission/e-mission-server/blob/master/emission/analysis/intake/cleaning/cleaning_methods/jump_smoothing.py#L262

It also:
- ensures that the individual algorithms are clean and modular and don't depend on each other
- lets us swap in any algorithm as the backup algo
- lets us support more complex backups in the future

Testing done:
- modified the test to pass in the backup algo
- tests pass
Added a new unit test for the case of `backup_algo == None`, which should
return the original algo results.

While testing, found that the ZigZag algo returns a pandas Series,
while the Posdap algo returns a numpy array, which means that combining them
could be problematic

Changed ZigZag to also return a numpy array to unify the implementations.
Testing done:
- All tests now pass
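The type mismatch is easy to reproduce standalone: boolean filtering on a pandas Series yields another Series (with an index), while the same comparison on a numpy array yields a plain array. Converting at the source, as this commit does for ZigZag, keeps downstream code uniform (illustrative values only):

```
import numpy as np
import pandas as pd

speeds = pd.Series([1.0, 500.0, 2.0])

mask_series = speeds < 100             # pandas Series of bools (ZigZag's old style)
mask_array = speeds.to_numpy() < 100   # numpy array of bools (Posdap's style)

# Unify at the source, as in the commit:
inlier_mask_ = mask_series.to_numpy()
assert isinstance(inlier_mask_, np.ndarray)
assert (inlier_mask_ == mask_array).all()
```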
Before this change, we only used one algorithm, so we hardcoded it into the
result. However, we can now use either the main algorithm or the backup
algorithm. So we return the algo also from `get_points_to_filter` and attribute
it correctly.

`get_points_to_filter` is used only in `location_smoothing` and in the tests,
so we also fix the tests to read both values and check the selected algo in each case.

Testing done: tests pass
- Unify algo outputs: `self.inlier_mask_ = self.inlier_mask_.to_numpy()`
    - remove `to_numpy()` from all the checks in the tests
- Return two outputs -> `return (None, None)`

Testing done:
- All tests in this file pass
When we moved the second round checks to the calling function in
cebb81f
we introduced a very subtle regression.

The filtering code had an early return if there were no jumps detected.
So in that case, we would not try the second round of checks, or attempt to
filter again.

However, when we moved the second round checking to the outer function, we
called the second round anyway, even if the first round didn't detect any jumps.
In this one case, we actually found an outlier in the second round, which
caused the test to fail.

Fixed by checking to see if there were no outliers in the first round and
skipping the second round check in that case.

Everything in the `else` for the
`if outlier_arr[0].shape[0] == 0:` is unchanged, just indented one level.

The length check was unexpectedly complicated and took many hours to
debug, so I added it as a simple test case.

Note also that it is not clear if this is the correct long-term approach.
If there were no jumps, then why did using the backup change anything?
Maybe we should always use the backup.

But changing this to avoid the regression for now; will look at this the next
time we look at smoothing

Testing done:
- `TestPipelineRealData.testIosJumpsAndUntrackedSquishing` passes
- `TestLocationSmoothing` passes
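The shape of the fix, as a standalone snippet built around the quoted condition (the print calls stand in for the real filtering logic):

```
import numpy as np

inlier_mask = np.array([True, True, True])  # first round detected no jumps
outlier_arr = np.nonzero(~inlier_mask)

if outlier_arr[0].shape[0] == 0:
    # No jumps in the first round: skip the second-round check entirely,
    # matching the early return the filtering code had before the move
    print("skipping second round")
else:
    # pre-existing second-round logic, unchanged, just indented one level
    print("running second round")
```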
`get_filtered_points` is not used anywhere else
we don't need to print out the series and the numpy version any more now that we have added the unit test in
5a4ae3d
PR to filter big jumps even if all segments are in clusters
Testing done: Test passes
…d CI

`TestExpectationPipeline` assumes that there are `mode_confirm` and
`purpose_confirm` objects, because they will match and have a high confidence.
If we add a new key `user_input`, then there are no matches for it, so we end
up with the `-1` option (see details below).

This is a regression caused by
ae325ec

We fix this by changing the config to only include mode and purpose confirm
during setup using the built-in `override_keys` and by restoring the config at
the end.

Also add some code to help debug this (currently commented out)

Testing done:
- Tests pass

Expectation test failure:
- test is failing with trip 795
    the date is `arrow.get("2021-06-08T20:00:00.000", tzinfo=tz),`
    we expect `795: {"type": "none"},`
    we get `{'type': 'randomDays', 'value': 2}`

The only config which has `randomDays` is the relaxed configuration, and we
should only see that for the `-1` trigger. In our case, though, we have a label

```
[{'labels': {'mode_confirm': 'bike', 'purpose_confirm': 'work'}, 'p': 0.96}, {'labels': {'mode_confirm': 'walk', 'purpose_confirm': 'shopping'}, 'p': 0.04}]
```

we also have

```
'expectation': {'to_label': True}, 'confidence_threshold': 0.55}
```

which means that it should have returned

```
                {
                    "trigger": 1,
                    "expect": {
                        "type": "none"
                    }
                }
```

Why did that not happen?

mode is

```
{'label': 'relaxed', 'confidenceThreshold': 0.55, 'rules': [{'trigger': -1, 'expect': {'type': 'randomDays', 'value': 2}, 'notify': {'type': 'dayEnd'}}, {'trigger': 1, 'expect': {'type': 'none'}}, {'trigger': 0.95, 'expect': {'type': 'randomFraction', 'value': 0.05}, 'notify': {'type': 'dayEnd'}}, {'trigger': 0.75, 'expect': {'type': 'randomDays', 'value': 2}, 'notify': {'type': 'dayEnd'}}]}
```

But the user input key list (`for input in _keylist:`)
https://github.com/e-mission/e-mission-server/blob/b7749d0a33f579eab8362416c32d2dcebd4d0b0c/emission/analysis/configs/expectation_notification_config.py#L106
which is populated from
https://github.com/e-mission/e-mission-server/blob/b7749d0a33f579eab8362416c32d2dcebd4d0b0c/emission/analysis/configs/expectation_notification_config.py#L30
also includes `userinput`, which does not have any matches, so it matches the `-1` trigger.
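One plausible reading of the trigger semantics, reconstructed from the config above and not necessarily the exact logic in `expectation_notification_config.py`: pick the rule with the smallest trigger at or above the top inferred probability, and let inputs with no label matches fall through to the `-1` rule.

```
rules = [
    {'trigger': -1, 'expect': {'type': 'randomDays', 'value': 2}},
    {'trigger': 1, 'expect': {'type': 'none'}},
    {'trigger': 0.95, 'expect': {'type': 'randomFraction', 'value': 0.05}},
    {'trigger': 0.75, 'expect': {'type': 'randomDays', 'value': 2}},
]

def match_rule(rules, top_p):
    if top_p is None:  # no matches at all for this user input key
        return next(r for r in rules if r['trigger'] == -1)
    candidates = [r for r in rules if r['trigger'] != -1 and r['trigger'] >= top_p]
    return min(candidates, key=lambda r: r['trigger'])

print(match_rule(rules, 0.96))  # the trigger-1 rule: {'type': 'none'}
print(match_rule(rules, None))  # the -1 rule: randomDays, as seen in the failure
```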
Before this, we attempted to stash the config before changing it
by using code similar to `config_stash = <module>.config`

Alas, due to the lack of a `copy.copy` around it, the stashed value was just a
reference to the original value, and any further modifications to
`<module>.config` are also reflected in the stash.

So restoring at the end using `<module>.config = config_stash` doesn't actually
restore it.

This is not visible while running the tests individually since we don't attempt
to use the "restored" value after it is restored.

However, while running multiple tests in the same process, the overridden config
is retained and causes other tests to fail.

Concretely:
- the `preprocess_trip` test option from `TestExpectationPipeline.py` was not
  reverted, and ended up causing an unrelated test (TestPipelineReset) to fail
  with the following error

```
======================================================================
ERROR: testResetToFuture (pipelineTests.TestPipelineReset.TestPipelineReset)
- Load data for both days
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/e-mission-server/e-mission-server/emission/tests/pipelineTests/TestPipelineReset.py", line 412, in testResetToFuture
    etc.runIntakePipeline(self.testUUID)
  File "/home/runner/work/e-mission-server/e-mission-server/emission/tests/common.py", line 196, in runIntakePipeline
    eaue.populate_expectations(uuid)
  File "/home/runner/work/e-mission-server/e-mission-server/emission/analysis/userinput/expectations.py", line 23, in populate_expectations
    expected_trip = _process_and_save_trip(user_id, inferred_trip, ts)
  File "/home/runner/work/e-mission-server/e-mission-server/emission/analysis/userinput/expectations.py", line 38, in _process_and_save_trip
    if _test_options["preprocess_trip"] is not None: _test_options["preprocess_trip"](expected_trip)
  File "/home/runner/work/e-mission-server/e-mission-server/emission/tests/pipelineTests/TestExpectationPipeline.py", line 61, in <lambda>
    eaue._test_options["preprocess_trip"] = lambda trip: self.preprocess(trip)
  File "/home/runner/work/e-mission-server/e-mission-server/emission/tests/pipelineTests/TestExpectationPipeline.py", line 67, in preprocess
    trip["data"]["end_ts"] = self.contrived_dates[self.fingerprint(trip)].float_timestamp
KeyError: 566
```

Change: Add `copy.copy` to all config stashes
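A standalone demonstration of the aliasing bug and the fix, with illustrative values rather than the actual config contents:

```
import copy

config = {"override_keys": ["mode_confirm", "purpose_confirm"]}

# Buggy stash: just another reference to the same dict
stash = config
config["override_keys"] = ["user_input"]
print(stash["override_keys"])  # ['user_input'] -- the "stash" changed too

# Fixed stash: a shallow copy survives later in-place modifications
config = {"override_keys": ["mode_confirm", "purpose_confirm"]}
stash = copy.copy(config)
config["override_keys"] = ["user_input"]
print(stash["override_keys"])  # ['mode_confirm', 'purpose_confirm'] -- restorable
```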

Testing done:

```
PYTHONPATH=. python -m unittest discover -s emission/tests/pipelineTests/ -p Test*
```

results in

```
----------------------------------------------------------------------
Ran 35 tests in 294.002s

OK
```
To create confirmedplace.py, I also created expectedplace.py and inferredplace.py.
Because places do not need inferred and expected variants, these files are not needed, so I am removing them.
pipelinestate.py
- Added a 17th value to the PipelineStages enum, named "CREATE_PLACE_OBJECTS"

reset.py
- Added a Pipeline stage to the stages_with_fuzz array for CREATE_PLACE_OBJECTS

common.py
- Added a "create_place_objects" function after all the other pipeline stage functions

intake_stage.py
- Added a stage for the intake to store the pipeline time for CREATE_PLACE_OBJECTS and call the create_place_objects function

confirmedplace.py
- Removed expected_place and primary_section because @shankari said in our last meeting that the confirmedplace object does not need these

matcher.py
- Created the create_place_objects function, which follows the same format as create_confirmed_objects (which only handled trips). It uses the same existing mark_confirmed_object_creation_done/failed functions, and the same get_time_range_for_confirmed_object_creation
- Added a create_confirmed_places function which follows the same format as create_confirmed_trips; I removed any data which does not exist in the emission/core/wrapper/confirmedplace.py data model
- Additions: the matching function was genericized for both trips and places
- This new stage in the pipeline will create composite trip objects after confirmed objects have been created
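The shape of the pipelinestate.py change, sketched; the stage names and numeric values other than the new one are placeholders, not the real contents of `emission/core/wrapper/pipelinestate.py`:

```
import enum

class PipelineStages(enum.Enum):
    USERCACHE = 0
    # ... the other existing stages elided ...
    CREATE_CONFIRMED_OBJECTS = 16  # placeholder value
    CREATE_PLACE_OBJECTS = 17      # the new 17th stage added here

print(PipelineStages.CREATE_PLACE_OBJECTS.name)
```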
Per shankari#4 (comment) we will rename this to confirmed_object since it does not specifically need to be a trip
Per shankari#4 (comment) we will remove inferred_labels, expectation, inferred_primary_mode, and user_input since confirmed_place will not need that for now
…g.debug

Per shankari#4 (comment) we will replace the `logging.info` statements with `logging.debug`
shankari and others added 28 commits August 14, 2023 23:17
In 73414b9, we added logging to track how long
the section summaries take. The log prints out the trip id as the first line of
the function

```
+    logging.debug(f"get_section_summary({cleaned_trip['_id']}, {section_key}) called")
```

So the error when we pass in a fake trip is now that the `_id` doesn't exist,
not that the `user_id` doesn't exist.

Fixed by changing the expected exception

Testing done:

```
----------------------------------------------------------------------
Ran 1 test in 0.247s

OK
```
In 0403ae0, we start precomputing the pipeline
stage and storing it in the user profile. This means that we need the user to have
a profile to begin with.

In the real world, users will always have profiles since we register them
during `profile/create`

In this test, we set the `testEmail` so that we register the users as part of
the test as well. This implies that we also:
- need to delete the profile_db() as part of cleanup
- use the version of the pipeline in `epi` so that it will run the pre-population code

Once we do that, the tests pass. However, when they were failing, we realized
that we can't compare two tuples with `assertAlmostEqual` since the assertion
can't subtract tuples to get the difference. Instead, we separate out the start
and end and compare them independently with `assertAlmostEqual`
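A minimal reproduction of the `assertAlmostEqual` issue and the endpoint-wise comparison used instead (illustrative values):

```
import unittest

class TestRangeComparison(unittest.TestCase):
    def test_start_end(self):
        expected = (100.0, 200.0)
        actual = (100.00000001, 200.00000001)
        # self.assertAlmostEqual(expected, actual)  # TypeError: the assertion
        # subtracts its arguments to get the difference; tuples can't be subtracted
        self.assertAlmostEqual(expected[0], actual[0])  # compare starts
        self.assertAlmostEqual(expected[1], actual[1])  # compare ends

if __name__ == "__main__":
    unittest.main()
```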

Testing done:
both tests pass

```
----------------------------------------------------------------------
Ran 2 tests in 33.884s

OK
```
⚡ Do nothing well (2023 server edition)
…ated

+ remove computing the cached diary objects using the OUTPUT_GEN pipeline state
+ diary screen is gone; we don't even use the cached values properly any more
+ no need to waste time iterating through trips, sections, and trajectories,
filling in documents that nobody uses

This is a minimal change that should help get us to production.
Removing all vestiges of OUTPUT_GEN from the codebase is left as a task for a
future intern.

Testing done:
```
./e-mission-py.bash emission/tests/analysisTests/intakeTests/TestPipelineCornerCases.py
----------------------------------------------------------------------
Ran 3 tests in 12.238s

OK
```
⚡ 🐛 Mark the pipeline range right after the composite trips a…
The following changes support e-mission-server-eval-private's TRB_label_assist, reducing its dependence on a custom branch.
Moved the `clusteringWay`-based decision making while binning further upstream, thus generalising the `similar` (in `similarity_metrics.py`) and `similarity` (in `od_similarity.py`) functions. They can now be used across modules without needing the `clusteringWay` parameter.
Comment fixes for better readability.
Implemented code for issue #933 in e-mission-docs, adding functionality to count the number of documents.
I've determined that the 'key' parameter can be passed to retrieve the appropriate timeseries db collection.
A query is generated with an optional extra_query key list, which returns a filtered data set.
Updated the variable name in the definition but missed changing its usage in 'total_entries'.
- Fixed the previous commits' issues in the actual count_data() function.
I had ignored the ordering of the function arguments passed to the _get_query() template method, which takes in parameters other than key and extra_query_list.
So I now use the actual keyword of the parameter, extra_query_list, to assign it explicitly.

- Primarily, added a basic test case for the count_data() functionality,
using the "background/location" key value for the "shankari_2015-aug-27" sample dataset.

TODO:
- Working on edge cases.
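A sketch of the keyword-argument fix; the signature here is modeled on the description above, not copied from the real `_get_query()` in builtin_timeseries.py:

```
def _get_query(key_list=None, time_query=None, geo_query=None,
               extra_query_list=None):
    query = {"metadata.key": {"$in": key_list or []}}
    for extra in (extra_query_list or []):
        query.update(extra)
    return query

# Passing extra_query_list positionally landed it in the wrong parameter;
# the keyword makes the intent explicit regardless of argument order
query = _get_query(key_list=["background/location"],
                   extra_query_list=[{"user_id": "shankari_2015-aug-27"}])
print(query)
```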
Tests created to confirm that the configurations for trip clustering (origin, destination, and origin-destination) work as expected inside the GreedySimilarityBinning class in the `greedy_similarity_binning.py` file.

In order to upgrade the old tests, `generate_mock_trips` in `modellingTestAssets.py` was also changed. Previously, out of the n trips generated, m had both origin and destination either inside or outside the threshold, thus allowing only 2 configs. Now, 4 configurations are possible: origin, destination, origin-and-destination, or neither-origin-nor-destination. The default is set to 'origin-and-destination' since this was the old default.
Checking that `Similarity` behaves as expected when a list of size 2 (for origin only or destination only) or of size 4 (for origin and destination) is passed.
…ases

- Updated the function name and signature to include other common parameters like time_query and geo_query, along with key and extra_queries, to maintain consistency with existing functions.

- Modified test cases to include dataset samples with more specific keys.

- Provided code comments to clarify the validation process for inputs and outputs of test datasets.
- Repositioned assert statements below their respective computations.
1. Improved logic based on this comment: 710d1a5#r1314065502

2. Created a utilities file for repetitive code required by multiple files.

3. Clustering threshold back to 500.

4. More in-code comments.
Code Logic
- Updated the code to return counts from both original_timeseries and analysis_timeseries as a tuple of lists pertaining to each key for each timeseries db.

Test cases:
- Updated test cases to include counts from both DBs.
- Included new test cases for edge conditions.
- Test cases include:
1. Key_list of keys from both DBs
2. Key_list of keys from Original timeseries DB.
3. Key_list of keys from Analysis timeseries DB.
4. Empty key_list
5. Invalid keys
6. Aggregate timeseries DB data as input.
7. New user without any data
- Had misunderstood usage of aggregate timeseries class.
- Corrected test case for aggregate class to check for sum of counts for each user without explicitly fetching each user.
- Added counts of grep outputs for sample inputs.
- Explained expected scenarios for blank keys.
- Removed miniconda.sh file added mistakenly post re-setup.
- Reused existing functionality from builtin_timeseries.py to fetch the computed count of entries
- Corrected return values to return the combined sum of counts of matching entries, not segregated by key or timeseries dataset.
Random trips are now generated like this:

If certain trips are to be binned together (by 'o', 'd', 'od', or '__' (meaning none)), they are generated in proximity to the previous in-bin trip. Otherwise, if they are not to be binned together, we keep generating random trips until we find one that would not bin with the previously accepted trips.
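A sketch of that strategy in planar coordinates, ignoring the geodesic distance handling the real `modellingTestAssets.py` needs:

```
import math
import random

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def near(ref, threshold):
    # a candidate well within the binning threshold of ref
    return (ref[0] + random.uniform(-threshold / 4, threshold / 4),
            ref[1] + random.uniform(-threshold / 4, threshold / 4))

def next_point(accepted, threshold, bin_with_previous):
    if bin_with_previous and accepted:
        return near(accepted[-1], threshold)  # stays in the previous bin
    while True:  # rejection sampling: retry until the candidate bins with nothing
        candidate = (random.uniform(0, 1), random.uniform(0, 1))
        if all(dist(candidate, p) > threshold for p in accepted):
            return candidate

pts = [(0.5, 0.5)]
pts.append(next_point(pts, 0.05, True))   # binned with the first point
pts.append(next_point(pts, 0.05, False))  # binned with neither
```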
- Commented out the test case for the aggregate subclass with an empty key list as input.
- The previous commit failed when running all tests together due to an unaccounted-for extra document entry in analysis_timeseries.
- Ignoring this test for now.
`od_similarity.py`
1. Explicitly passing 'origin', 'destination', 'origin-destination' for the similarity check in `similarity`

`similarity_metric.py`
2. Passing the clustering_way parameter

`greedy_similarity_binning.py`
3. Since this decision making has moved downstream to `similarity`, removing it from here.

`modellingTestAssets.py`
4. Removing both 2-line wrappers (SetModelConfig, setTripConfig) from this file, since this was parametrised using subTest 2 commits back.

5. Removed CalDistanceTest. This was introduced to keep the test's calDistance separate from the calDistance used by `greedySimilaritybinning`. Unnecessary.

6. Using ref. coordinates to generate trip coordinates whenever provided. If not, use randomly generated coordinates as reference points.

7. Receiving and passing origin and destination ref. points in `generate_mock_trips`

`TestGreedySimilarityBinning.py`

8. Removed wrappers for trip and model generation.

9. Using just a single threshold for generating trips and for binning. Removed the two thresholds.

`TestSimilarityMetric.py`

10. Removing the implicitness used in binning by passing this as a parameter.
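To illustrate item 1, a toy version of passing the clustering way explicitly; the coordinate layout and distance function are assumptions, not the real `od_similarity.py` code:

```
import math

def _dist(a, b):
    # stand-in for the geodesic calDistance used in the real code
    return math.hypot(a[0] - b[0], a[1] - b[1])

def similarity(trip_a, trip_b, clustering_way="origin-destination"):
    # trips assumed to be [o_lon, o_lat, d_lon, d_lat] for illustration
    pairs = {
        "origin": [(trip_a[:2], trip_b[:2])],
        "destination": [(trip_a[2:], trip_b[2:])],
        "origin-destination": [(trip_a[:2], trip_b[:2]),
                               (trip_a[2:], trip_b[2:])],
    }[clustering_way]
    return [_dist(a, b) for a, b in pairs]

print(similarity([0, 0, 1, 1], [0, 0, 2, 2], "destination"))  # one distance
```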
…subclass test case in TestTimeSeries.py

- Had initially removed the aggregate subclass test case in TestTimeseries.py as the overall tests were failing.
- Debugged the entire issue and found that some test files in emission/tests/storageTests did not have a tearDown() to clear up test data created during that specific test.
- These tests include:
TestAnalysisTimeseries, TestPlaceQueries, TestSectionQueries, TestStopQueries
- Added tearDown() with code to clear the database, as is done in all the other storageTests/.
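A sketch of the kind of tearDown() added, assuming the usual `edb` database helpers and a `self.testUUID` set in setUp(); not the exact code from the commit:

```
import unittest
import emission.core.get_database as edb

class TestAnalysisTimeseries(unittest.TestCase):
    def tearDown(self):
        # Clear out everything this test inserted, so that later tests in the
        # same process start from a clean database, as the other storageTests/ do
        edb.get_timeseries_db().delete_many({"user_id": self.testUUID})
        edb.get_analysis_timeseries_db().delete_many({"user_id": self.testUUID})
```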
Added count_documents() implementation to builtin_timeseries
Generating random points from a circle (rather than a square) around ref_points.

Better explanations for random point generation.

Whitespace fixes.
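For uniform density over the disc, the radius must be drawn as the square root of a uniform sample; drawing the radius uniformly over-samples the center. A standalone sketch, treating lon/lat as planar for illustration:

```
import math
import random

def random_point_in_circle(ref_lon, ref_lat, radius):
    # sqrt makes the density uniform over the disc's area
    r = radius * math.sqrt(random.random())
    theta = random.uniform(0, 2 * math.pi)
    return (ref_lon + r * math.cos(theta), ref_lat + r * math.sin(theta))

print(random_point_in_circle(-122.08, 37.39, 0.001))
```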
Comments and variable names fixed
Moving dependence from custom branch's tour_model to master's trip_model
@shankari shankari closed this Sep 23, 2023
@shankari shankari deleted the master branch September 23, 2023 04:11