Reducing dataset size to improve run time #223

fluentin44 · 2022-12-02T08:03:33Z

Hi,

I have a dataset of ~25k cells and 130 samples, so computation time and memory for running fitGAM are going to be an issue for me. I have seen recommendations to restrict the genes passed to the function to the top 2k variable features. Can I clarify: does that mean reducing the whole counts matrix down to 2k features, or keeping the whole counts matrix and passing the names of the top 2k variable features to the genes argument?

Thanks,
Matt

koenvandenberge · 2022-12-08T17:23:38Z

Hi @fluentin44

We generally recommend supplying the entire count matrix (if possible given memory requirements) and then specifying the genes you would like to fit via the genes argument.
This way, the entire count matrix is still used for normalization.

Hope this helps.

fluentin44 · 2022-12-09T07:54:02Z

Ok much appreciated!

Thanks,
Matt

castaway1990 · 2023-03-07T13:00:21Z

Hi,
Great tool!
I'd like to ask a couple of related questions.

-Focusing on the 2k genes, it seems that subsetting the counts prior to fitGAM and providing the full counts with genes=2kgenes give different results. Is that possible? Is there anything else, aside from normalization, that uses information from other genes during the fitting?

-If the highly variable genes score highly precisely because they capture differences among lineages, wouldn't that be a source of bias during normalization?

Thanks a lot

koenvandenberge · 2023-12-04T15:12:46Z

Hi @castaway1990

> Focusing on the 2k genes, it seems that subsetting the counts prior to fitGAM and providing the full counts with genes=2kgenes give different results. Is that possible? Is there anything else, aside from normalization, that uses information from other genes during the fitting?

Yes, that is possible. If you first subset the 2K genes and then run fitGAM, the normalization will only use the 2K genes to estimate normalization factors. Instead, if you provide the full count matrix and use the genes argument to identify the 2K genes you would like to fit, then the normalization will still use all genes to calculate the normalization factor. This should be the only difference.
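To make the difference concrete, here is a toy sketch in Python. It is not tradeSeq code: the count values, the gene names, and the `size_factors` helper are all made up for illustration, and plain library-size scaling is used as a simple stand-in for the normalization that fitGAM performs. It only shows that per-cell factors depend on which genes are in the matrix when they are estimated.

```python
# Toy counts: keys are genes, list entries are cells (all values are made up).
counts = {
    "geneA": [10, 20, 30],
    "geneB": [5, 5, 5],
    "geneC": [100, 50, 25],
    "geneD": [0, 40, 80],
}


def size_factors(matrix):
    """Per-cell total count divided by the mean total across cells.

    Plain library-size scaling, used here only as a stand-in;
    it is not the normalization tradeSeq applies.
    """
    n_cells = len(next(iter(matrix.values())))
    totals = [sum(row[j] for row in matrix.values()) for j in range(n_cells)]
    mean_total = sum(totals) / n_cells
    return [t / mean_total for t in totals]


selected = ["geneA", "geneD"]  # hypothetical "highly variable" genes

# Subset first: only the selected genes inform the factors.
factors_subset = size_factors({g: counts[g] for g in selected})

# Full matrix: every gene informs the factors, even if only
# the selected genes would be fitted afterwards.
factors_full = size_factors(counts)

print("subset:", factors_subset)
print("full:  ", factors_full)
```

With these toy numbers the first cell gets a factor below 0.2 from the subset but above 0.9 from the full matrix, so any model fitted with those factors as offsets would see different normalized expression for the same gene.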

> If the highly variable genes score highly precisely because they capture differences among lineages, wouldn't that be a source of bias during normalization?

If there are large systematic differences between the groups you are comparing, this can indeed be an issue in normalization. In tradeSeq, we rely on TMM normalization as described here. One of its main assumptions is that the majority of genes are not differentially expressed. I would advise against providing only the subsetted count matrix to fitGAM, and would instead recommend providing the full count matrix and using the genes argument to specify the genes you're interested in.
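The role of that assumption can be sketched with a deliberately simplified version of the trimmed-mean idea, again in Python rather than R. This is not the TMM implementation: real TMM also weights genes and trims on average expression, both skipped here, and the function names and counts below are hypothetical.

```python
import math


def trimmed_mean_log_ratio(sample, reference, trim=0.3):
    """Unweighted trimmed mean of per-gene log2 ratios of count proportions.

    A simplified take on the TMM idea; the weighting and the
    expression-level trimming of real TMM are omitted.
    """
    n_s, n_r = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r)) for s, r in zip(sample, reference)
    )
    k = int(len(m_values) * trim)  # number of genes dropped from each tail
    kept = m_values[k:len(m_values) - k]
    return sum(kept) / len(kept)


def effective_library_size(sample, reference):
    """Library size of `sample` rescaled by the trimmed-mean factor."""
    return sum(sample) * 2 ** trimmed_mean_log_ratio(sample, reference)


# Toy data: 8 unchanged genes and 2 genes that are 10x higher in the sample.
reference = [100] * 10
sample = [100] * 8 + [1000] * 2

# All genes: the unchanged majority survives trimming and sets the factor.
full_ratio = effective_library_size(sample, reference) / sum(reference)

# Subset enriched for the changed genes (1 unchanged, 2 changed):
# too few genes to trim, so the changed genes drive the factor.
sub_reference = [100, 100, 100]
sub_sample = [100, 1000, 1000]
subset_ratio = effective_library_size(sub_sample, sub_reference) / sum(sub_reference)

print(full_ratio, subset_ratio)
```

The unchanged genes have identical counts in both samples, so an effective library size ratio of 1 is what leaves them looking unchanged. The full gene set gives a ratio of 1, while the subset gives a ratio of roughly 4.6, which would make the unchanged gene look several-fold lower in the sample.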

fluentin44 closed this as completed Dec 9, 2022

koenvandenberge reopened this Mar 13, 2023
