Batch integration / Sample normalisation #97
This is a great question, but unfortunately it is something we only partly explored, and it didn't make it into our final publication simply for reasons of space.

Accounting for Batches

In the scheme of testing 3'UTR changes, comparisons will ultimately be made within genes (e.g., a Weighted Usage Index) rather than across genes (e.g., a TPM). Log-scaling is not needed for that. Since everything in APA testing is about proportions of reads, batch effects that impact gene expression levels would not be expected to be so problematic. Also, if one first uses gene expression (and/or chromatin accessibility) in a batch-integrated space to derive cell-type annotations, then subsequently using those annotations on the uncorrected 3'UTR counts would implicitly feed that equating across batches back into the model.

What is the Batch Effect in 3'UTR Counting?

Anecdotally, what I've seen as the primary effect of "batch" in 3'UTR counting comes in the form of varying rates of internal priming, which can occasionally leak into shorter 3'UTR isoforms when there are A-rich regions slightly downstream of true cleavage sites. For this reason, I think a proper solution to accounting for batch in this space would be to compute an internal priming rate for each batch and use that number as a covariate in all dWUI or similar APA tests. However, this would require an additional layer of possibly curated counting of reads specifically in what we classify as internal priming peaks, something we just don't have at this point. I'd speculate that the fraction of intronic reads might be a first-order approximation to this, but the technical work on this simply hasn't been done.

Testing for Batch Effects

For now, I can at least point you to some of the data that we had in the original preprint, where we ran pairwise tests within each cell type across batches, which showed that there were minimal significant batch effects.
While the plots here are filtered to only a few genes, these were indeed the only ones that showed as near-significant in the inter-batch testing.

Text from Preprint
Select Statistical Tests
Plots of Bootstrapped LUI per Batch-Celltype

At some point I had run statistical testing across batches for all of Tabula Muris. We found that batch effects were minimal, especially when controlling for multiple hypothesis testing. However, I would have to dig through my archived data to find this. If you are interested in that, I can perhaps track it down.
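
To make the two ideas above concrete, here is a minimal sketch in Python. It is not the package's implementation: `usage_index`, `internal_priming_rate`, the isoform weights, and all read counts are hypothetical and chosen only for illustration. The first part shows why a within-gene proportion-based index is unaffected by a batch that simply captures more reads for a gene; the second shows how a per-batch internal priming rate could be summarized for use as a covariate, assuming reads in internal priming peaks had already been counted (which, as noted, is not currently available).

```python
import numpy as np

def usage_index(counts, weights):
    """Weighted mean of within-gene isoform proportions (illustrative only)."""
    counts = np.asarray(counts, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = counts.sum()
    if total == 0:
        return float("nan")  # gene not detected, index undefined
    return float((counts / total) @ weights)

def internal_priming_rate(ip_reads, total_reads):
    """Fraction of a batch's reads falling in internal priming peaks."""
    return ip_reads / total_reads

# Hypothetical gene with three 3'UTR isoforms (short, medium, long).
weights = [0.0, 0.5, 1.0]          # assumed weights, not the published ones
batch_a = np.array([30, 50, 20])
batch_b = batch_a * 4              # same proportions, 4x the reads

wui_a = usage_index(batch_a, weights)
wui_b = usage_index(batch_b, weights)

# Hypothetical per-batch read totals for the covariate.
ip_rates = {
    "batch_a": internal_priming_rate(1_200, 40_000),
    "batch_b": internal_priming_rate(9_600, 160_000),
}
```

Here `wui_a` and `wui_b` agree because scaling every isoform count by the same factor leaves the proportions unchanged, whereas `ip_rates` differs between the two batches and is the kind of per-batch number that could enter a test as a covariate.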
Hi!
Have you explored strategies for batch integration for larger datasets, or sample normalisation in addition to the scaling/log normalisation?
Any advice?
Thanks!