
Determine how to support running analyses on the dataset #24

Open

singhish opened this issue Mar 31, 2021 · 11 comments

Comments

@singhish commented Mar 31, 2021

Specifically, the notebooks ending in `_master` are failing because of how our current analysis pipeline is set up. A decision needs to be made on how to handle the analysis that is currently performed using the analysis-related keys on e-mission.

@shankari commented Mar 31, 2021

I can think of three options for how to do this:

  • first: keep the analysis in e-mission. To implement this, we would create a "development environment": a docker-compose.yml file that launches a mongodb instance and an e-mission-server instance, loads the data, and runs the pipeline. We could then retrieve the analysis results from the newly created e-mission-server instance and either compare them directly or save them to a file. If people wanted to compare two algorithms, they would need to run two instances of the development environment (e.g., one on port 2323 and one on port 4545) and we could compare against both.

  • Pro: minimal dev work on our side

  • Con: (1) JSON file export is useless, (2) hard for users to do interactive analysis, (3) harder for users to work with a full system rather than just the ML component (they can't use notebooks, for example)

  • second: pull out the algorithms so they can run from files. Every algorithm reads input data from the database, processes it, and saves the results back to the database. The reads and writes go through the Timeseries interface (https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/abstract_timeseries.py). If we provided an alternate implementation of the Timeseries interface that worked with file data, we could instantiate the algorithms with that data source and make it easier to run them in a notebook, for example (see the sketch after this list).

  • Con: more complex dev work on our side. @shankari will probably need to do at least some heavy lifting on the Timeseries implementation; @singhish can change the references and fix bugs.

  • Pro: more modular; makes it easier for us to publish a challenge later and easier for other potential collaborators to work with our data.

  • third?: it would be good to experiment with a potential third solution. I think @singhish said that there is an embedded JSON database, similar to sqlite, that works with JSON files. Could using that database simplify the process of creating a new Timeseries interface?

    • Concretely, in that case, we would not have to create a new Timeseries implementation. Instead, we would simply load the data into the embedded DB and change the connection URL. The existing Timeseries implementation would Just Work, and we would only need to move the algorithms out into a separate repo or something like that.
      • Questions to investigate for that:
        • can it work with files?
        • is it mongodb-compatible, so that our existing mongodb queries in the implementation of the Timeseries interface work with the new database without any changes, and our change is as simple as replacing the connection URL?
        • how easy is it to load/retrieve data? Do we need a separate library?
    • Pro: it has all the pros of (2) with less work on our part?
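
A minimal sketch of what the file-backed Timeseries in option (2) could look like. This is illustrative only: the method names (`find_entries`, `insert`) and the dump-file layout are assumptions, not the actual API in abstract_timeseries.py.

```python
import json
from pathlib import Path

class FileTimeseries:
    """Hypothetical file-backed stand-in for the e-mission Timeseries
    interface. Assumes one JSON dump file per user, containing a list of
    entry documents with `metadata` and `data` sub-documents."""

    def __init__(self, dump_file):
        self.entries = json.loads(Path(dump_file).read_text())

    def find_entries(self, key_list, start_ts, end_ts):
        """Mimics a mongodb query such as
        {"metadata.key": {"$in": key_list},
         "metadata.write_ts": {"$gte": start_ts, "$lte": end_ts}}"""
        return [e for e in self.entries
                if e["metadata"]["key"] in key_list
                and start_ts <= e["metadata"]["write_ts"] <= end_ts]

    def insert(self, entry):
        # Results are only appended in memory here; a real implementation
        # would also persist them back to a file.
        self.entries.append(entry)

# An algorithm written against the interface would not care which
# implementation it receives:
# ts = FileTimeseries("dump_4aebf2e0.json")
# transitions = ts.find_entries(["statemachine/transition"],
#                               1581005599.791271, 1581048094.196096)
```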

@shankari commented Mar 31, 2021

Playing around with NEDB: an example notebook where:

  • you load the data from the dumped files into NEDB
  • you run some of the queries in the Timeseries implementation against NEDB manually
    • from the current interface, it can be a little tricky to see what the queries look like. You can look at /var/tmp/webserver.log to find query examples, e.g.:

2021-03-26 20:14:27,545:DEBUG:123145524649984:Found 14 messages in response to query {'user_id': UUID('4aebf2e0-f097-4845-8652-2ada3a76dadd'), '$or': [{'metadata.key': 'statemachine/transition'}], 'metadata.write_ts': {'$lte': 1581048094.196096, '$gte': 1581005599.791271}}


    Make sure to change the UUID and the timestamps to data that you have actually loaded.

  • you use a NEDB connection URL and read from the Timeseries directly (examples of reading from the timeseries directly are in the e-mission server `Timeseries_Sample.ipynb` notebook)
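
NEDB itself is a Node.js library, so a Python analogue of this experiment is sketched below using TinyDB, an embedded JSON document store for Python. This is an assumption-laden illustration: the dump file name and document layout are made up, and TinyDB's query language is not mongodb-compatible, which speaks directly to question 2 in the list above.

```python
# Hypothetical Python analogue of the NEDB experiment: load the dumped
# entries into an embedded JSON store, then run the logged query manually.
import json
from tinydb import TinyDB, Query

db = TinyDB("embedded_db.json")
with open("dump_4aebf2e0.json") as f:   # illustrative file name
    for entry in json.load(f):
        db.insert(entry)

# The query from the webserver.log example above; change the key and the
# timestamps to match data you have actually loaded.
E = Query()
matches = db.search(
    (E.metadata.key == "statemachine/transition")
    & (E.metadata.write_ts >= 1581005599.791271)
    & (E.metadata.write_ts <= 1581048094.196096)
)
print(f"Found {len(matches)} matching messages")
```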

@shankari commented

To use the Timeseries, you will need to add the e-mission server directory to your PYTHONPATH.
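
For example, from inside a notebook (the checkout path is illustrative):

```python
import sys

# Point at a local checkout of e-mission-server; adjust to your setup.
sys.path.append("/path/to/e-mission-server")

# Server modules now resolve, e.g. the Timeseries interface:
import emission.storage.timeseries.abstract_timeseries as esta
```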

@shankari commented Apr 1, 2021

Let's do the easy design first.

  • the pipeline code, in addition to its own implementation, relies on the storage and core modules of e-mission. So if we use the pipeline elsewhere, we need to import those modules from somewhere else. There are a couple of ways to do this:
    • ask people to check out the server and set PYTHONPATH, but this is lame
    • pull out the core and storage modules into actual pypi-style packages that you can pip install, at least from a repo. Then we can just add them as dependencies to the emissioneval setup script, and everything will just work. This is the way to go (a sketch follows below).
  • how will people get the existing algorithms? Will they just check them out? Presumably they will not be in the mobilitynet repo. We also can't put them into pypi because people probably want to look at the source code
    • how do other projects deal with their baseline implementations?
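
A sketch of what the emissioneval setup script could look like once core and storage are pulled out into installable packages. The package names and git URLs below are placeholders; no such packages exist yet:

```python
# setup.py for emissioneval -- a sketch under the assumption that the
# core and storage modules have been split into their own packages.
from setuptools import setup, find_packages

setup(
    name="emissioneval",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # PEP 508 direct references: pip can install these straight from
        # a git repo without publishing them to pypi. URLs are placeholders.
        "emission-core @ git+https://github.com/e-mission/emission-core.git@main",
        "emission-storage @ git+https://github.com/e-mission/emission-storage.git@main",
    ],
)
```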

@shankari commented Apr 2, 2021

The goal of this project is to make it easier for others to come up with their own algorithms.

I think that the two options are:

  • use the existing notebook-based naive implementations as the baseline

    • focus on getting the data structure to be easier to work with
    • document the current baseline and the run instructions a lot better
    • use the docker-compose option to run the e-mission analysis pipelines and pull the analysis results
    • score existing e-mission algorithms using analysis results
  • rework the e-mission algorithms so that they can be the baseline

    • refactoring the e-mission codebase to pull out the core and storage into pypi
    • creating an alternate implementation of the Timeseries interface
    • figuring out how to publish the e-mission algorithms so that they can be accessed both from notebooks and from production code
    • changing the algorithms to use the new publication structure

We have time to implement one option, not two.
Which should it be?

@singhish commented Apr 2, 2021

I'm leaning towards the second option -- having the e-mission algorithms published would make it easier for others to come up with their own algorithms based on our provided implementation. Refactoring the e-mission codebase so that core and storage are more compartmentalized would probably also make things easier to work with on our end in the long run.

Am I headed in the right direction?

@shankari commented Apr 2, 2021

I'm actually leaning towards the first option. The main difference in our positions is that it is by no means clear to me that anybody wants to come up with new algorithms based on our provided implementation. I think that people want to start with the data, explore it, and try out ML libraries (keras, sklearn, etc.) on it.

We have one customer request: did he ask for mongodump (which would have used the database and the existing algorithms) or files (which would be more in line with the ML library approach)?

@singhish commented Apr 2, 2021

I see, that makes sense. The customer asked for files, @shankari.

@shankari commented Apr 2, 2021

@singhish ok, let's try to get a second data point by having you pretend to be a customer, since you are not as close to the data as I am.

Let's say you want to enter a challenge in which you need to segment a trip into multiple unimodal segments.
There is an existing analysis pipeline in which that segmentation is one step:
https://github.com/e-mission/e-mission-server/blob/master/emission/pipeline/intake_stage.py
The segmentation stage is at lines 129 to 135 of that file.

Would you prefer to work with notebooks that had a simpler embedded baseline, or try to work with that code to understand and improve it?

@singhish commented Apr 2, 2021

This might be a personal thing, but as a developer, probably the latter: working with code feels a lot nicer to me than dealing with the overhead associated with running a notebook. Data scientists might prefer the former, though. @shankari

@shankari commented Apr 2, 2021

@singhish @jf87, if you are back from vacation, can you make the final call, since you have actually tried to work with MobilityNet before?
If we don't hear back from @jf87 by Monday, we will go with option (1) by maintainer fiat.
