Reducing dataset size to improve run time #223

Open
fluentin44 opened this issue Dec 2, 2022 · 4 comments

Comments

@fluentin44

Hi,

I have a dataset of ~25k cells and 130 samples, so the computation time and memory needed to run fitGAM are going to be an issue for me. With that in mind, I have seen recommendations to restrict the genes passed to the function to the top 2k variable features. Can I clarify: does that mean reducing the whole counts matrix down to 2k features, or keeping the whole counts matrix and passing the names of the top 2k variable features to the genes argument?

Thanks,
Matt

@koenvandenberge
Member

Hi @fluentin44

We generally recommend supplying the entire count matrix (if memory allows) and then specifying the genes you would like to fit through the genes argument.
This way, the entire count matrix is still used for normalization.

Hope this helps.

@fluentin44
Author

Ok much appreciated!

Thanks,
Matt

@castaway1990

Hi,
Great tool!
I'd like to ask a couple of related questions.

-Focusing on 2kgenes, it seems that subsetting counts prior to fitGAM and providing full counts with genes=2kgenes give different results. Is that possible? Is there anything else happening, aside from normalization, that uses information from other genes during the fitting?

-If highly variable genes are scored as highly variable because they capture differences among lineages, wouldn't that be a source of bias during normalization?

Thanks a lot

@koenvandenberge
Member

Hi @castaway1990

> Focusing on 2kgenes, it seems that subsetting counts prior to fitGAM and providing full counts with genes=2kgenes give different results. Is that possible? Is there anything else happening, aside from normalization, that uses information from other genes during the fitting?

Yes, that is possible. If you first subset the 2K genes and then run fitGAM, the normalization will only use the 2K genes to estimate normalization factors. Instead, if you provide the full count matrix and use the genes argument to identify the 2K genes you would like to fit, then the normalization will still use all genes to calculate the normalization factor. This should be the only difference.
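
To make the point above concrete, here is a toy sketch in plain Python. It is not tradeSeq code and does not implement TMM: it assumes simple total-count scaling as a stand-in for the normalization step, and all gene names and counts are made up. It only shows that per-cell factors computed from a subsetted matrix can differ from those computed from the full matrix.

```python
# Toy counts: keys are genes, list entries are cells (all values are made up).
counts = {
    "geneA": [10, 20, 5],
    "geneB": [0, 40, 5],
    "geneC": [90, 40, 90],
    "geneD": [100, 100, 100],
}


def size_factors(matrix):
    """Per-cell total counts divided by their mean.

    Assumption: a simple stand-in for TMM, used for illustration only.
    """
    totals = [sum(column) for column in zip(*matrix.values())]
    mean_total = sum(totals) / len(totals)
    return [total / mean_total for total in totals]


# Hypothetical set of "highly variable" genes.
selected = ["geneA", "geneB"]

# Factors estimated from every gene (analogous to full matrix + genes argument).
full_factors = size_factors(counts)

# Factors estimated from the selected genes only (analogous to subsetting first).
subset_factors = size_factors({gene: counts[gene] for gene in selected})

print(full_factors)
print(subset_factors)
```

In this toy case every cell has the same total over all genes, so the full-matrix factors are all equal, while the factors computed from the two selected genes differ strongly between cells. Any model fitted with those factors as offsets would therefore give different results.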

> If highly variable genes are scored as highly variable because they capture differences among lineages, wouldn't that be a source of bias during normalization?

If there are large systematic differences between the groups you are comparing, this can indeed be an issue in normalization. In tradeSeq, we rely on TMM normalization as described here. One of its main assumptions is that the majority of genes are not differentially expressed. I would advise against providing only the subsetted count matrix to fitGAM and would instead recommend providing the full count matrix and using the genes argument to specify the genes you're interested in.
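
The role of the "majority of genes are not differentially expressed" assumption can be sketched with a heavily simplified trimmed mean of log ratios. This is an illustration under stated assumptions, not the TMM procedure itself: it is unweighted, skips library-size scaling and trimming on average expression, and the function name and counts are hypothetical.

```python
import math


def trimmed_mean_log2_ratio(sample, reference, trim=0.25):
    """Unweighted trimmed mean of per-gene log2(sample / reference) ratios.

    Assumption: a simplification of TMM with no library-size scaling,
    no trimming on average expression, and no precision weights.
    """
    ratios = sorted(
        math.log2(s / r) for s, r in zip(sample, reference) if s > 0 and r > 0
    )
    k = int(len(ratios) * trim)  # number of ratios dropped from each tail
    kept = ratios[k:len(ratios) - k]
    return sum(kept) / len(kept)


# Made-up counts for 8 genes in two cells. The sample was sequenced twice as
# deeply (log2 ratio of 1); the last two genes are also truly up-regulated.
reference = [100] * 8
sample = [200] * 6 + [1600] * 2

# All genes: the differentially expressed genes fall in the trimmed tail.
full_estimate = trimmed_mean_log2_ratio(sample, reference)

# "Variable" genes only: both up-regulated genes plus two unchanged ones.
keep = [4, 5, 6, 7]
subset_estimate = trimmed_mean_log2_ratio(
    [sample[i] for i in keep], [reference[i] for i in keep]
)

print(full_estimate)
print(subset_estimate)
```

With all eight genes the estimate is 1.0, which matches the sequencing-depth difference. With the four selected genes, half of which are differentially expressed, trimming can no longer remove them and the estimate rises to 2.5, so the biological signal leaks into the normalization factor.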
