Categorical distribution and ground truth during training #18
-
Hello, I am implementing this with radar-only data obtained from a local source here in Sweden for my master's thesis. First, I'd like to thank you for the implementation! It has helped me a great deal, but I have some questions about the ground truth the network is supposed to use and how training is done.

The paper describes a 512-way categorical distribution, i.e. 512 bins in 0.2 mm/h increments. Does this mean that the ground truth for one training sample will be a tensor of shape (512, width, height), one-hot encoded from the radar data to indicate which bucket each pixel falls in? This seems like a very sparse way to represent the data: when I look at my data, there are very few occasions when more than a couple of pixels fall into any given bin at >5 mm/h. What do you think?

In my case the model will have 5-minute resolution for lead times, out to 300 minutes into the future (60 different lead times). How do we handle training with different lead times? It seems that in your code we loop through all 60 possibilities for each batch and backpropagate accordingly. It's not clear to me whether it's better to pick one random lead time (forecast_steps=1) for each training sample in a given batch, or to loop through every lead time. (I'm afraid looping might overfit the input nodes, since we'd be using almost the same input vector for 60 samples.)

I have already done the preprocessing steps for my data in NumPy: center-crop-split and space-to-depth (8 channels), plus 3 channels for elevation, longitude, and latitude, and 4 channels for "time of year" and "time of day" as periodic representations, so 15 input channels in total. I have therefore commented out the line "x = self.preprocessor(x)" inside encode_timestep(). Is there any caution I need to exercise here, or is it OK as long as the network doesn't give any error messages? Sorry, I don't have much experience with PyTorch...

Thanks again
/Valter
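For what it's worth, PyTorch's `CrossEntropyLoss` accepts integer class indices of shape (N, H, W) together with logits of shape (N, C, H, W), so the dense one-hot (512, width, height) tensor never needs to be materialized in memory. A minimal NumPy sketch of the binning step (the bin edges, clipping behaviour, and the `rain_to_class` helper are my own assumptions for illustration, not taken from the paper or this repo):

```python
import numpy as np

def rain_to_class(rain_mmh, n_bins=512, bin_width=0.2):
    """Map rain rates in mm/h to integer class indices 0..n_bins-1.

    Bin i covers [i * bin_width, (i + 1) * bin_width); the top bin
    (index 511) collects everything >= 102.2 mm/h via clipping.
    """
    idx = np.floor(np.asarray(rain_mmh) / bin_width).astype(np.int64)
    return np.clip(idx, 0, n_bins - 1)

# A tiny 2x2 "image" of rain rates:
rain = np.array([[0.0, 0.19],
                 [0.2, 102.4]])
classes = rain_to_class(rain)  # integer target, shape (2, 2)
```

The resulting integer map can be fed directly to `CrossEntropyLoss` as the target, which sidesteps most of the sparsity concern at the storage level, though the class imbalance itself of course remains.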
-
Hello,

Glad you find this useful! Just curious, is the radar data openly available? I would love to have more training datasets beyond the US one in the paper.

As for training, yes, my understanding of the paper is that the target is a tensor of shape (512, width, height). It is quite a sparse representation, similar to the very sparse temporal representation used as well. One option is to subsample the data so that the rare, high-rainfall cases are shown more often. The Deep Generative Model of Radar (DGMR) paper does that, and its dataset is public; I'm currently mirroring it on HuggingFace as well. They sampled the rare events more often to make sure the model learned about high-rainfall events. The categorical representation does make it easier for the model to learn probabilistic forecasts, though, as it can just predict the probability of each forecast "category".

The current code is somewhat predicated on training on all future lead times at the same time, but you could potentially train a separate model for each lead time, for example. Or do what MetNet-2 did: keep a checkpoint for each future lead time, so the best model weights for a specific lead time can be used during inference. Also, if you come up with a more efficient way of training on the different lead times, feel free to open a PR!

For your preprocessing, yes, as long as there are no error messages it should be fine, so I wouldn't worry about it.
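The "one random lead time per sample" alternative from the question could be sketched like this (hypothetical: `pick_lead_time` and the index convention are my own for illustration, not part of this repo):

```python
import random

def pick_lead_time(n_lead_times=60, step_minutes=5, rng=random):
    """Draw one random lead time per training sample (forecast_steps=1).

    Returns the lead-time index and the corresponding minutes ahead;
    index 0 is taken to mean the first 5-minute step.
    """
    idx = rng.randrange(n_lead_times)
    return idx, (idx + 1) * step_minutes

# One backprop pass per sample instead of 60 on a nearly identical input:
idx, minutes_ahead = pick_lead_time()
```

Over many epochs every lead time is still seen roughly equally often in expectation, while each input window contributes only one gradient step per epoch.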
-
Thanks a lot for your input and the quick answer. I will check with my supervisor to see if we can publish the dataset (or part of it), but I'm not sure it will be possible. I guess the representation makes sense, but I will have to investigate the average rainfall across the different training samples. The hardest part, I guess, will be parallelizing it before running on some GPUs. Godspeed me.
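That per-sample average-rainfall statistic plugs straight into the oversampling idea mentioned above. A simple scheme (my own assumption, not the exact DGMR recipe) is to weight each training crop by its mean rain rate plus a small floor, so dry crops are still seen occasionally:

```python
import numpy as np

def oversample_weights(crop_means, floor=0.1):
    """Sampling probability per training crop, proportional to its
    mean rain rate (mm/h) plus a floor term for dry crops.

    This is an illustrative scheme of my own, not the exact
    importance-sampling recipe used in the DGMR paper.
    """
    w = np.asarray(crop_means, dtype=np.float64) + floor
    return w / w.sum()

means = [0.0, 0.05, 2.0]      # mean mm/h over three example crops
p = oversample_weights(means)  # rainier crops get sampled more often
```

The resulting probabilities could then drive a weighted sampler (e.g. PyTorch's `WeightedRandomSampler`) when building batches.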
-
I emailed the authors of the paper and got some answers to my questions: