Assessing prior model validity #26
Comments
Hi Nic,

Thanks for your interest in our tool! And you are unfortunately right: the fits at the end are clearly off (in the plot "validation of expression probability model"), so there seems to be a problem somewhere. With 10X data especially, we previously got very good fits. I can't say immediately what could have gone wrong; the other plots you showed look normal.

For the gamma mixed fits, it is intentional that the complete zero values are slightly "underestimated". For genes with zero expression in a single-cell experiment, we can't say whether they are truly not expressed in the cell type or whether the zeros are caused by the sparsity of the single-cell technology. For this reason, we use a left-censored gamma model that instead treats part of the zero values as lowly expressed genes. Our subsampling experiments showed that this is a good approximation. We didn't have such a deep read depth there, but I would assume that our model still works in that regime.

From our previous experience, slight deviations in the mean fits (including the differences for the moderately expressed genes) should not have such a huge impact. Unfortunately, the model is currently not flexible enough to account for dispersion parameters when there is a relationship with the read depth. I will put it on our list to look more closely into that. But again, I don't expect this to have such a huge impact that you would find less than half as many expressed genes.

What I am wondering about is your last plot: it says that the number of expressed genes is above 40,000? This seems like a very large number to me for a single-cell experiment. How did you calculate the number of expressed genes? Did you use the code from the vignette, or did you implement this part yourself?
And in general, how did you set the expression cutoffs to define an expressed gene in our functions?

Best,
Katharina
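To make the left-censoring idea above concrete, here is a minimal toy sketch (illustrative parameter values, not the actual scPower implementation): part of the observed zeros is attributed to the gamma mass that falls below a small censoring point, i.e. to lowly expressed genes.

```r
# Toy illustration of left-censoring (not scPower's actual code):
shape <- 0.5          # hypothetical gamma shape for the mean-expression distribution
rate  <- 2            # hypothetical gamma rate
censor.point <- 0.01  # hypothetical censoring threshold

# probability that a lowly expressed gene falls below the censoring point
# and is therefore observed as a zero
p.zero.from.gamma <- pgamma(censor.point, shape = shape, rate = rate)
p.zero.from.gamma
```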
Hi Katharina,

Thanks for getting back to me with such a detailed response. That makes a lot of sense, thank you. Do we expect the underestimation to become larger as a function of the downsampling scaling factor (e.g., keeping 25% of reads)?
I agree that the number of genes detected is quite high. These are cell lines sequenced very deeply (160-318K reads/cell) and with many cells (12K cells post-QC, with a median of 7.5K detected genes/cell). My primary motivation for using …

As for the calculation, I used the code you shared from the vignette, with:

```r
library(magrittr)  # provides the %<>% compound-assignment pipe used below

expressed.genes.df.f <- NULL
rownames(annot.df) <- annot.df$cell

for (name in names(count.matrices)) {
  tictoc::tic(name)
  count.matrix <- count.matrices[[name]]
  # subset annotations to the cells in this matrix (without overwriting
  # the full annot.df, which would shrink it across iterations)
  annot.sub <- annot.df[colnames(count.matrix), ]
  pseudo.bulk <- scPower::create.pseudobulk(count.matrix, annot.sub, colName = "cell_type")
  expressed.genes <- scPower::calculate.gene.counts(pseudo.bulk, min.counts = 3, perc.indiv = 0.5)
  expressed.genes.df.f %<>% rbind(data.frame(matrix = name,
                                             num.cells = ncol(count.matrix),
                                             expressed.genes = nrow(expressed.genes)))
  tictoc::toc()
}
```
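After the loop, the assembled summary table should contain one row per downsampled matrix; a quick inspection sketch, using the objects defined above:

```r
# one row per matrix: its name, number of cells, and expressed-gene count
str(expressed.genes.df.f)
head(expressed.genes.df.f)
```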
Great catch! I had aligned …

Also, I'm not sure these cause any issues, but I thought I'd mention them in case I did anything wrong:

1/ I obtained UMI count matrices resulting from downsampled reads with …
2/ I used …
That would be great. Here is a link to the code.

Thanks again for your help, it's very much appreciated.

Best,
Nic
Hi Nic,

Very sorry that I am answering so slowly at the moment; it is unfortunately a very busy time for me, but it should be better again from next week on.

First, yes, I would expect the underestimation to become larger when you downsample more, as shown in your plots.

One thing that is confusing me: you set nGenes for the gamma mixture model to 34,000. This should only work if you have at most 34,000 genes with mean values larger than 0. But that would mean you can't have over 34,000 expressed genes in your original count matrix. So either there is a problem with the …

If I understand your code correctly, you have cells for two different "cell types" in your count matrix, called "celltype" and "notcelltype". When you create the pseudobulk, have you checked that the dimensions are correct? There should be 3 dimensions: ~45,000 genes times 4 individuals times 2 cell types. And when you run then …

Another thing that you could check: how many of your mean values after the NB fit are larger than 0? Fewer than 34,000, or more?

I hope one of the suggestions helps to find the issue.

Best,
Katharina
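A sketch of how these two checks could be run (object and column names here are assumptions based on the discussion, not necessarily scPower's exact output):

```r
# 1) pseudobulk dimensions: expect ~45,000 genes x 4 individuals x 2 cell types
dim(pseudo.bulk)

# 2) number of genes with a positive mean after the negative-binomial fit;
#    `nb.fit` is assumed to be a per-gene data frame with a `mean` column
sum(nb.fit$mean > 0)  # compare this against the nGenes = 34,000 setting
```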
Hello,
Thanks for publishing this great method.
I am looking to generate new priors from a pilot data set consisting of a large number (>10K) of deeply sequenced cells (>100K reads/cell) from Chromium 10X technology, and I have some questions about how to fine-tune model training with respect to recapitulating the original data:
1. Comparison of gamma mixed fits with original means: is it normal for the gamma mixed fit to underestimate zeros and overestimate moderately expressed genes?
2. Fitting a function for UMI counts dependent on read depth: …
3. Estimation of the median dispersion function for each cell type: in the vignette example, no relation was found between dispersion and UMI counts. In my pilot data set, I am seeing one (see the sketch after this list). Is there a way to account for this?
4. Validation of the expression probability model: the predicted numbers of expressed genes are off by a significant margin.
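For point 3, a minimal sketch of how the dispersion-UMI relationship could be inspected (`disp.df` and its columns are hypothetical names, not scPower output):

```r
# illustrative check of a dispersion vs. mean-UMI relationship;
# `disp.df` is a hypothetical data frame of per-gene estimates with
# columns `mean.umi` and `dispersion`
library(ggplot2)

ggplot(disp.df, aes(x = mean.umi, y = dispersion)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +
  scale_x_log10() +
  labs(x = "Mean UMI counts", y = "Estimated dispersion")
```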
Many thanks in advance!
Nic