Ml sequencing #282

Open
wants to merge 9 commits into
base: main

Conversation

Theodore-Chatziioannou
Contributor

@Theodore-Chatziioannou Theodore-Chatziioannou commented Jun 3, 2024

This is the start of a set of examples exploring the use of Neural Networks for activity scheduling. The aim of the series is to predict the duration of agents' activities, given the type and order of activities in their plan, and their characteristics.

We will start with some toy models, to review possible approaches from "first-principles" perspectives.

For example, in the 01_scheduling.ipynb notebook, we run the following checks on a sequence-to-sequence model, using in-sample predictions:

  • is the model able to replicate simple observed patterns?
  • do the predicted activity durations throughout the day add up to 24 hours?
  • do we get any non-zero durations after the end-of-sequence token?
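The last two checks in the list above can be written as small assertions on the predicted durations. This is a minimal sketch, assuming durations are in hours and each plan is a padded list of activity labels; the function names and the tolerance are illustrative and are not taken from the PR code.

```python
import math


def sums_to_day(durations, day_hours=24.0, tol=0.5):
    """Check that the predicted durations of one plan add up to a full day."""
    return math.isclose(sum(durations), day_hours, abs_tol=tol)


def leakage_after_eos(tokens, durations, eos="EOS"):
    """Total duration predicted for positions after the end-of-sequence token."""
    end = tokens.index(eos)
    return sum(durations[end + 1:])


# Hypothetical padded plan and predictions (hours), for illustration only
tokens = ["SOS", "home", "work", "home", "EOS", "NA", "NA"]
predicted = [0.0, 8.1, 8.4, 7.6, 0.0, 0.0, 0.0]

print(sums_to_day(predicted))
print(leakage_after_eos(tokens, predicted))
```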

The proposed model in the notebook (pam.planner.choice_scheduling.ScheduleModelSimple) is a sequence-to-sequence LSTM model, inspired by models used for language translation. The "encoder" part of the model trains word embeddings for each activity type, passes the activity sequences through an LSTM layer, and stores its final hidden state. The plans are encoded as sequences with the use of the pam.planner.encoder.PlansSequenceEncoder class, which maps activities to integers, adds padding, as well as tokens for the start and end of sequences.
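To illustrate the encoding step, here is a minimal sketch in plain Python. It is not the implementation of PlansSequenceEncoder; the function name `encode_plans`, the use of "NA" as the padding label, and right-padding are assumptions made for this example.

```python
def encode_plans(plans, activity_classes):
    """Map activity sequences to padded integer sequences with SOS/EOS tokens."""
    # Special labels first, so that padding ("NA") maps to 0 (assumed convention)
    labels = ["NA", "SOS", "EOS"] + sorted(activity_classes)
    token_of = {label: i for i, label in enumerate(labels)}
    max_len = max(len(p) for p in plans) + 2  # room for SOS and EOS
    encoded = []
    for plan in plans:
        seq = [token_of["SOS"]] + [token_of[a] for a in plan] + [token_of["EOS"]]
        seq += [token_of["NA"]] * (max_len - len(seq))  # right-pad to equal length
        encoded.append(seq)
    return encoded, token_of


plans = [["home", "work", "home"], ["home", "shop", "work", "home"]]
encoded, token_of = encode_plans(plans, {"home", "work", "shop"})
for seq in encoded:
    print(seq)
```

The integer sequences can then feed an embedding layer, with the padding index masked if the framework supports it.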

The "decoder" part of the model uses the hidden state and observed activity durations as inputs, and shifted activity durations as outputs ("teacher forcing").
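As a rough illustration of teacher forcing, the sketch below builds the decoder input and target for a single plan. It assumes the usual one-step shift (the input at step t is the observed duration at step t-1, with a start value at t=0); the helper name and the start value are hypothetical and may differ from the notebook.

```python
def teacher_forcing_pairs(durations, start_value=0.0):
    """Build decoder inputs and targets for one plan.

    The input at step t is the observed duration at step t-1 (a start value
    at t=0); the target at step t is the observed duration at step t.
    """
    decoder_input = [start_value] + list(durations[:-1])
    decoder_target = list(durations)
    return decoder_input, decoder_target


durations = [0.3, 0.4, 0.3]  # observed durations as fractions of the day
dec_in, dec_out = teacher_forcing_pairs(durations)
print(dec_in)
print(dec_out)
```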

To predict the durations of a plan, we start by passing the set of activities and the "start token". The duration prediction of each step is then added to the input of the next step, until we predict the durations of all activities.
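The prediction loop can be sketched as follows. `toy_step` is a hypothetical stand-in for one step of the trained decoder (which would also carry the LSTM state); only the feedback of each prediction into the next step reflects the description above.

```python
def predict_durations(activities, step_fn, start_value=0.0):
    """Autoregressive decoding: feed each predicted duration into the next step."""
    previous = start_value
    predicted = []
    for activity in activities:
        duration = step_fn(activity, previous)
        predicted.append(duration)
        previous = duration  # the prediction becomes the next step's input
    return predicted


def toy_step(activity, previous_duration):
    """Hypothetical decoder step: mixes a typical duration with the previous output."""
    typical = {"home": 7.5, "work": 9.0}
    return 0.9 * typical[activity] + 0.1 * previous_duration


pred = predict_durations(["home", "work", "home"], toy_step)
print(pred)
```

Because each step consumes the previous prediction, the two home activities in this toy plan receive different durations.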

The model seems to work well on the toy example, although it can sometimes be unstable.

However, its major weakness is the determinism of the predictions. For example, for every agent with a home->work->home plan, the model predicts the same durations. This is unlikely to be useful for simulating human travel behaviour, where we instead need distributions of durations (and possibly multimodal distributions). This has also been one of the main obstacles to achieving good validation of the aggregate time distributions against observed plans.

The second notebook (02_gaussian_mixture.ipynb) looks at this problem by employing a Mixture Density Network (MDN) for probabilistic regression. To simplify the problem, we predict the duration of a single activity (work), without any context about the rest of the activities performed in a day. We generate a PAM population with work activities, whose durations are distributed normally around 8 hours (for full-time workers) and 4 hours (for part-time workers). As expected, a simple Neural Network always outputs the same value when predicting in-sample.
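A small stand-alone sketch of why a point-estimate model collapses to a single value on such data: with no distinguishing inputs, the prediction that minimises mean squared error is the sample mean, which falls between the two modes. The 50/50 split and the 0.5 hour standard deviation are assumptions for this example, not values from the notebook.

```python
import random
import statistics


def sample_work_durations(n, seed=0, p_full_time=0.5, std=0.5):
    """Draw work durations (hours) from a two-component normal mixture."""
    rng = random.Random(seed)
    durations = []
    for _ in range(n):
        mean = 8.0 if rng.random() < p_full_time else 4.0
        durations.append(rng.gauss(mean, std))
    return durations


durations = sample_work_durations(10_000)

# With no informative inputs, the MSE-optimal prediction is the sample mean
best_constant = statistics.fmean(durations)

# Share of observed durations that are actually close to that prediction
near_prediction = sum(abs(d - best_constant) < 0.25 for d in durations) / len(durations)
print(best_constant, near_prediction)
```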

To tackle this, the MDN example (using pam.planner.choice_scheduling.ActivityDurationMixture) includes a Gaussian Mixture layer as the output, inferring the mean, variance, and weights of the mixture distribution. The predictions sample from the estimated distribution, and an in-sample simulation shows that the results follow the distribution of our data generation process.
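The two ingredients of a mixture density output, the negative log-likelihood used for training and sampling at prediction time, can be sketched in plain Python. The notebook relies on tensorflow_probability for this; the functions and parameter values below are illustrative assumptions only.

```python
import math
import random


def mixture_nll(y, weights, means, stds):
    """Negative log-likelihood of observation y under a 1-D Gaussian mixture."""
    density = sum(
        w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )
    return -math.log(density)


def sample_mixture(weights, means, stds, rng):
    """Pick a component according to its weight, then sample from that normal."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], stds[k])


# Hypothetical mixture parameters, as a network output layer might produce them
params = dict(weights=[0.5, 0.5], means=[4.0, 8.0], stds=[0.5, 0.5])

rng = random.Random(1)
samples = [sample_mixture(rng=rng, **params) for _ in range(5000)]
share_full_time = sum(s > 6.0 for s in samples) / len(samples)

print(mixture_nll(8.0, **params), mixture_nll(6.0, **params), share_full_time)
```

An observation at one of the modes gets a lower loss than one in the trough between them, and sampled predictions reproduce both modes rather than their average.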

The next step (for a future PR) will be to combine the two approaches: can we build a Recurrent Mixture Density Network? Then, we can also look into including person attributes as part of the model input.

The first PRs are meant for exploration, so the architecture and tests remain light, until we finalise an approach.

Keen to hear any ideas or suggestions, either for this PR's examples, or for future exploration / next steps.

Checklist

Any checks which are not relevant to the PR can be pre-checked by the PR creator.
All others should be checked by the reviewer(s).
You can add extra checklist items here if required by the PR.

  • CHANGELOG updated
  • Tests added to cover contribution
  • Documentation updated

@Theodore-Chatziioannou
Contributor Author

@fredshone you may be interested in this as well

@fredshone
Contributor

thanks for sharing. I have feedback 😁

The mixture model is interesting. Assume this can be stacked on top of the LSTM units to do inference with more sequence "context". The primary critique is that you have to specify the number of components and/or this assumes normal distributions? In which case I can't recommend VAEs enough to introduce variance...

Silly comment but it's very traditional in ML to size layers using powers of 2. So rather than 50, use 64 (yes I know this sounds stupid but it will matter to certain audiences).

At the moment the models are not generative (not sure in the case of normal mixtures). So would expect to use a withheld test set for evaluation. Keras should also let you implement early stopping (ideally using a validation dataset), then you don't need to worry about specifying epochs.
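For reference, Keras provides this through the keras.callbacks.EarlyStopping callback (with arguments such as monitor="val_loss", patience and restore_best_weights). The patience logic itself is simple; the framework-free sketch below is only an illustration, and the function name and loss values are made up.

```python
def early_stopping_epochs(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) given per-epoch validation losses."""
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0  # improvement: reset patience
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop; keep the weights from best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses from a held-out set
val_losses = [1.0, 0.7, 0.5, 0.45, 0.47, 0.46, 0.48, 0.5]
print(early_stopping_epochs(val_losses))
```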

It would be really interesting to see a roadmap of sorts for this work.

Engineering:

  • In the past I have found seaborn to be a dependencies maintenance pain because it lagged behind on core project dependencies like plt and np - but maybe this is old news.
  • adding keras and tensorflow as dependencies is very chonky and these models require a special skill set to maintain. I strongly suggest that they are developed in a separate project (pam-ml?) with pam as a dependency until there is some killer feature ready to add in later.

@Theodore-Chatziioannou
Contributor Author

thanks @fredshone :)

> The mixture model is interesting. Assume this can be stacked on top of the LSTM units to do inference with more sequence "context". The primary critique is that you have to specify the number of components and/or this assumes normal distributions? In which case I can't recommend VAEs enough to introduce variance...

The assumption of a Gaussian mixture distribution with a pre-defined number of components was the intention here - it follows the review of observed distributions, and it feels like a good way to bound the solution space. However, I was also wondering - am I essentially moving towards a VAE? I guess the main difference is whether you choose to sample from a latent space or not. I feel the use cases are slightly different, but can't say what works better here. Maybe one of the next steps could be to solve the same problem (in notebook 2) with a VAE, and compare the results?

> Silly comment but it's very traditional in ML to size layers using powers of 2. So rather than 50, use 64 (yes I know this sounds stupid but it will matter to certain audiences).

Good point - will update

> At the moment the models are not generative (not sure in the case of normal mixtures). So would expect to use a withheld test set for evaluation. Keras should also let you implement early stopping (ideally using a validation dataset), then you don't need to worry about specifying epochs.

The aim is to combine the two models and end up with a generative model, that is recurrent, but is also able to sample results from a distribution.
Yes, I will introduce a test set, once we try this with more realistic data. For now, these are some toys for exploration and some basic checks - I am not formally reviewing performance yet.

> Engineering:
>
>   • In the past I have found seaborn to be a dependencies maintenance pain because it lagged behind on core project dependencies like plt and np - but maybe this is old news.
>   • adding keras and tensorflow as dependencies is very chonky and these models require a special skill set to maintain. I strongly suggest that they are developed in a separate project (pam-ml?) with pam as a dependency until there is some killer feature ready to add in later.

Yeah, as you can see with the failing Windows tests, dependencies proved to be an issue here. I will see if there is a good way to resolve them with certain pinned versions. I don't have an answer yet; I am considering perhaps splitting to another repo, but we would still need to solve the problem if PAM is a dependency.

Contributor

@brynpickering brynpickering left a comment

I would love to spend more time on understanding the underlying methods. As it is, I can't really wade in on your conversation with @fredshone. I've left comments on the code itself, including a few points where explanations would be nice.

As Fred says, it might be worth moving this into its own repo. pam-ml could default to being only unix compatible since support for ML tools in Windows does seem to lag behind linux/macos. Windows users would need to work in WSL.

@@ -12,7 +12,10 @@ prettytable >= 3, < 4
python-Levenshtein >= 0.21, < 0.26
rich >= 12, < 14
Rtree >= 1, < 2
seaborn < 0.14
Contributor

This should go into requirements/dev.txt, as it is used only in the example notebooks.

import tensorflow_probability as tfp
import tf_keras as tfk

tfd = tfp.distributions
Contributor

I think you can do: from tensorflow_probability import distributions as tfpd, layers as tfpl and from tf_keras import layers as tfkl, models as tfkm. That said, for legibility, I think I prefer just using tfp.distributions and tfp.layers when they are needed.

class ScheduleModelSimple:
    def __init__(
        self, population: Population, n_units: Optional[int] = 50, dropout: Optional[float] = 0.1
    ) -> None:
Contributor

docstring

decoder_output = keras.layers.Dense(1, activation="relu", name="decoder_output")(decoder_h2)
model = keras.models.Model(inputs=[input_acts, decoder_input], outputs=[decoder_output])

model.compile(loss="mean_squared_error", optimizer="adam", metrics=["accuracy"])
Contributor

should any of these arguments be user-configurable?

Comment on lines +139 to +140
h1 = tfkl.Dense(50, activation="relu")(inputs)
h2 = tfkl.Dense(20, activation="relu")(h1)
Contributor

again, are 50 and 20 values that should be user-configurable? I can't say I understand the method well enough to know their significance.

"""

self.population = population
act_labels = ["NA", "SOS", "EOS"] + list(population.activity_classes)
Contributor

could you leave a comment / add to the docstring what SOS and EOS mean?

        self.encode_plans()

    def encode_plans(self) -> None:
        """Encode sequencies of activities and durations into numpy arrays."""
Contributor

Suggested change
- """Encode sequencies of activities and durations into numpy arrays."""
+ """Encode sequences of activities and durations into numpy arrays."""

**/tmp/
temp/
Contributor

just use tmp for temporary folders, then it would be covered by **/tmp/

"outputs": [
{
"data": {
"image/png": "
Contributor

make each of these questions a subsubheading (## ...) and give the answer in the same markdown cell, to guide readers on what the figures are telling them w.r.t. the question asked.

],
"source": [
"sns.kdeplot(y_pred[:, 0]*8)\n",
"plt.legend([\"Observed\", \"Predicted\"])\n",
Contributor

I don't see Predicted in the plot or the legend and Observed is just a spike at ~2. If this is meant to be the result, it would be worth explaining why it is producing what looks like a strange result.

Contributor Author

yes, this is intended to show that there is no variation in the results (hence the spike). It doesn't show the observed data though - need to correct this.

@Theodore-Chatziioannou
Contributor Author

@fredshone I have added a very simplistic VAE example (03_vae_multimodal) that tries to solve the same problem as notebook 02_gaussian_mixture (i.e. predicting durations for an activity that follows a bimodal distribution). In the spirit of this PR, I am trying to experiment with very minimalistic examples to build up our intuition.

After trying a few variations, my thoughts are:

  • you can get a similar outcome with a VAE as with the MDN model.
  • however, the VAE may require more training data and time
  • this probably makes sense: the VAE is essentially trying to solve a more difficult problem (with more degrees of freedom)
  • it may have more freedom to produce individual predictions that depart more from observed patterns
  • as you say, the main benefit is the ability to fit any kind of distribution without prior assumptions about its type

The comparison probably comes down to someone's use case:

  • what are the underlying temporal distributions of your data? Can they be approximated well with well-known distributions or do you want something more tightly-fit and agnostic?
  • how much do you want to constrain the solution space of your model?
  • how much do you care about interpretability?

Any thoughts appreciated @fredshone @panostsolerid @brynpickering !

@fredshone
Contributor

thanks Theo, I think this is a nice conclusion.

Agree VAE has to do more work to reproduce a Gaussian mixture, but as you say it can generalise more broadly. Related, my best models are big (0.5m params), which they use to reproduce the distributions in schedules. I am thinking about how to combine the two approaches to make my models more parsimonious, for example if distributions can be incorporated in the VAE decoder.

> The comparison probably comes down to someone's use case:
>
>   • what are the underlying temporal distributions of your data? Can they be approximated well with well-known distributions or do you want something more tightly-fit and agnostic?
>   • how much do you want to constrain the solution space of your model?
>   • how much do you care about interpretability?

My thinking is that once we get to sequence generation, the joint-distributions are gonna get pretty wild. We also need the models to incorporate less statistical rules, such as day duration, vehicle consistency and so on. Specifying and structuring all these will be tough. Expecting this to be interpretable by looking at model params is also not going to be a pleasant exercise...

If we want to chase structure then we will tackle simpler things like total day duration. If we extend to household scheduling then it is interesting to think how these models could be structured to deal well with shared trips and shared activities.

I guess my inevitable answer is that I don't care about interpretability. Instead I want us to have a rigorous evaluation framework so we can just focus on results.

@Theodore-Chatziioannou
Contributor Author

thanks @fredshone. I am taking these step-by-step :). Some of the things you mention are what I am looking at now for the next example (aiming for a summary of findings for Tuesday).

Joint distributions in plans: I am preparing an example now. It can indeed get out of hand, however, the large majority of plans tend to be quite simple with 1-2 non-home activities in the day. Trying to capture and interpret some of that interests me for scenario testing, and generally to understand what kind of signal the model is picking up.

Total day duration: I am also thinking about this. I am considering a few approaches: a) Use day duration as a metric: a good model should predict durations that roughly add up to 1. If that is true, simply adjust the durations either proportionally or according to some rule. b) The duration of the final activity is not predicted; instead it uses the remaining time budget in the day. c) The loss function penalises plans that don't add up well. Let me know if you have any other ideas.
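The three options can be sketched in a few lines each, with durations expressed as fractions of the day. The function names, the squared-error form of the penalty and its weight are assumptions for illustration, not code from the PR.

```python
def rescale_proportionally(durations, total=1.0):
    """Option (a): scale all durations so that they add up to the day length."""
    current = sum(durations)
    return [d * total / current for d in durations]


def last_takes_remainder(durations, total=1.0):
    """Option (b): the final activity absorbs whatever time budget is left."""
    head = list(durations[:-1])
    return head + [max(total - sum(head), 0.0)]


def budget_penalty(durations, total=1.0, weight=1.0):
    """Option (c): an extra loss term penalising plans that miss the day length."""
    return weight * (sum(durations) - total) ** 2


predicted = [0.35, 0.40, 0.35]  # hypothetical predictions that overshoot the day

print(rescale_proportionally(predicted))
print(last_takes_remainder(predicted))
print(budget_penalty(predicted))
```

Option (b) only adjusts the last activity, so any error accumulated earlier in the day is pushed entirely onto it, whereas option (a) spreads the correction across all activities.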

Household structuring: I am probably far from that at the moment. I would expect a hybrid model that incorporates some sort of rule set or simulation to resolve the use of shared resources? That would be quite interesting.

@fredshone
Contributor

fredshone commented Jul 3, 2024

total sequence duration seems tricky with RNNs - cannot deal with duration until first end of sequence token. I did try a combined loss function of activity type, duration and end time. The idea being that end time loss tries to keep total durations (so far) correct. But didn't see a useful result. Perhaps needs further work.

At the moment my models have an average total duration error of only a few minutes in any case - so it is easy to fix afterwards without too much stress.

3 participants