
Assessing prior model validity #26

Open · ndejay opened this issue Dec 1, 2023 · 5 comments


ndejay commented Dec 1, 2023

Hello,

Thanks for publishing this great method.

I am looking to generate new priors from a pilot data set consisting of a large number (>10K) of deeply sequenced cells (>100K reads/cell) from 10x Chromium technology, and I had some questions about how to fine-tune model training with respect to recapitulating the original data.

1. Comparison of gamma mixed fits with the original means. Is it normal for the gamma mixed fit to underestimate zeros and overestimate moderately expressed genes?

[screenshot: gamma mixed fits vs. original means]

2. Fit of a function for UMI counts dependent on read depth.

[screenshot: UMI counts vs. read depth fit]

3. Estimation of the median dispersion function for each cell type. In the vignette example, no relationship was found between dispersion and UMI counts, but I am seeing one in my pilot data set. Is there a way to account for this?

[screenshot: median dispersion vs. UMI counts]

4. Validation of the expression probability model. The predicted numbers of expressed genes are off by a significant margin.

[screenshot: predicted vs. observed numbers of expressed genes]

Many thanks in advance!

Nic

KatharinaSchmid (Collaborator) commented

Hi Nic,

Thanks for your interest in our tool! You are unfortunately right: the fits in the final plot ("validation of expression probability model") are clearly off, so there seems to be a problem somewhere. With 10x data especially, we have previously gotten very good fits. I can't say immediately what could have gone wrong; the other plots you showed look normal.

For the gamma mixed fits, it is intentional that the complete-zero values are slightly "underestimated". For genes with zero counts in the single-cell experiment, we cannot tell whether they are truly not expressed in the cell type or whether the zeros are caused by the sparsity of the single-cell technology. For this reason, we use a left-censored gamma model that treats part of the zero values as lowly expressed genes instead. Our subsampling experiments showed that this is a good approximation. We did not have such a deep read depth there, but I would assume that our model still works. From our previous experience, slight deviations in the mean fits (including the differences for the moderately expressed genes) should not have such a huge impact.
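For intuition, here is a small self-contained simulation of the censoring idea (not our actual estimation code, just illustrative numbers):

# Hypothetical ground truth: 30% of genes truly unexpressed, the rest with
# gamma-distributed mean expression.
set.seed(1)
n.genes   <- 10000
true.zero <- runif(n.genes) < 0.3
true.mean <- ifelse(true.zero, 0, rgamma(n.genes, shape = 0.5, rate = 2))

# Left-censoring: means below the detection limit are observed as zeros.
detect.lim <- 0.05
obs.mean   <- ifelse(true.mean < detect.lim, 0, true.mean)

mean(obs.mean == 0)   # observed zero fraction (inflated by censoring)
mean(true.mean == 0)  # true zero fraction
# A left-censored gamma fit attributes part of the observed zeros to the
# censored low-expression tail instead of to truly unexpressed genes.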

Unfortunately, the model is currently not flexible enough to account for a dispersion parameter that depends on read depth. I will put it on our list to look more closely into that. But again, I would not expect this to have such a huge impact that you find fewer than half as many expressed genes.

What I am wondering about your last plot: it says that the number of expressed genes is above 40,000? That seems like a very large number for a single-cell experiment. How did you calculate the number of expressed genes: did you use the code from the vignette, or did you implement this part yourself?

# Reformat count matrix into pseudobulk matrix
pseudo.bulk <- create.pseudobulk(count.matrix, annot.df, colName = "cell_type")
# Calculate expressed genes in the pseudobulk matrix
expressed.genes <- calculate.gene.counts(pseudo.bulk, min.counts = 3, perc.indiv = 0.5)
# Get the number of expressed genes
num.expressed.genes <- nrow(expressed.genes)

And in general, how did you set the expression cutoffs that define an expressed gene in our functions (min.counts and perc.indiv)? It is important that these match between the estimated genes and the expressed genes. Another pair of parameters that needs to match is nGenes of the function estimate.exp.prob.param and num.genes.kept of the function mixed.gamma.estimation. You can also share your code if you think that might help.
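For example, defining the gene number once avoids the two calls drifting apart (a sketch only; all other arguments of the two functions are omitted here):

# Shared setting, passed to both estimation steps so they describe the same genes.
n.genes <- 21000  # example value; use your own gene count

# scPower::mixed.gamma.estimation(...,  num.genes.kept = n.genes)
# scPower::estimate.exp.prob.param(..., nGenes = n.genes)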

Best,

Katharina


ndejay commented Dec 6, 2023

Hi Katharina,

Thanks for getting back to me with such a detailed response.

For the gamma mixed fits, it is intentional that the complete-zero values are slightly "underestimated". For genes with zero counts in the single-cell experiment, we cannot tell whether they are truly not expressed in the cell type or whether the zeros are caused by the sparsity of the single-cell technology. For this reason, we use a left-censored gamma model that treats part of the zero values as lowly expressed genes instead. Our subsampling experiments showed that this is a good approximation. We did not have such a deep read depth there, but I would assume that our model still works. From our previous experience, slight deviations in the mean fits (including the differences for the moderately expressed genes) should not have such a huge impact.

That makes a lot of sense, thank you. Do we expect the underestimation to become larger as a function of the downsampling scaling factor (e.g., at 25% of reads)?

[screenshot: gamma mixed fits on downsampled data]

What I am wondering about your last plot: it says that the number of expressed genes is above 40,000? That seems like a very large number for a single-cell experiment. How did you calculate the number of expressed genes: did you use the code from the vignette, or did you implement this part yourself?

I agree that the number of genes detected is quite high. These are cell lines sequenced very deeply (160-318K reads/cell) and with many cells (12K cells post-QC, with a median of 7.5K detected genes/cell). My primary motivation for using scPower was to assess to what extent we could reduce sequencing depth without negatively impacting DGE readouts.

As for the calculation, I used the code you shared from the vignette, with min.counts = 3 and perc.indiv = 0.5:

library(magrittr)  # provides %<>%, used below

expressed.genes.df.f <- NULL
rownames(annot.df) <- annot.df$cell

for (name in names(count.matrices)) {
  tictoc::tic(name)
  count.matrix <- count.matrices[[name]]
  # Subset into a local variable instead of overwriting annot.df on each iteration
  annot.sub <- annot.df[colnames(count.matrix), ]
  pseudo.bulk <- scPower::create.pseudobulk(count.matrix, annot.sub, colName = "cell_type")
  expressed.genes <- scPower::calculate.gene.counts(pseudo.bulk, min.counts = 3, perc.indiv = 0.5)
  expressed.genes.df.f %<>% rbind(data.frame(matrix = name,
                                             num.cells = ncol(count.matrix),
                                             expressed.genes = nrow(expressed.genes)))
  tictoc::toc()
}
[screenshot: expressed gene counts per matrix]

And in general, how did you set the expression cutoffs that define an expressed gene in our functions (min.counts and perc.indiv)? It is important that these match between the estimated genes and the expressed genes. Another pair of parameters that needs to match is nGenes of the function estimate.exp.prob.param and num.genes.kept of the function mixed.gamma.estimation.

Great catch! I had aligned min.counts and perc.indiv but not the gene-number parameter (set to 34,000 in mixed.gamma.estimation but left at its default in estimate.exp.prob.param). But this did not resolve the issue.

Also, I'm not sure whether these cause any issues, but I thought I'd mention them in case I did anything wrong.

1/ I obtained UMI count matrices resulting from downsampled reads with DropletUtils::downsampleReads, which uses CellRanger's molecule_info.h5 file, instead of the method prescribed in the vignette. I then scaled transcriptome.mapped.reads by the same factor.

2/ I used nbinom.estimation with sizeFactorMethod = "poscounts" due to a sparsity issue in the count matrix:

Error in scPower::nbinom.estimation(count.matrices[[name]]) : 
  Size factor estimation failed! Probably due to sparsity of the matrix! Try poscounts instead as variable sizeFactors.
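
So the call that ran without the error was (only the argument named in the error message added; return structure as in my pipeline):

nb.fit <- scPower::nbinom.estimation(count.matrices[[name]],
                                     sizeFactorMethod = "poscounts")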

You can also share your code if you think that might help.

That would be great. Here is a link to the code.

Thanks again for your help, it's very much appreciated.

Best,
Nic

KatharinaSchmid (Collaborator) commented

Hi Nic,

Very sorry that I am answering so slowly at the moment; it is unfortunately a very busy time for me, but things should improve from next week on.

First: yes, I would expect the underestimation to become larger when you downsample more, as shown in your plots.
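A quick toy simulation consistent with that (binomial thinning of negative binomial counts as a stand-in for read downsampling; the parameters are arbitrary):

set.seed(1)
counts <- rnbinom(10000, mu = 2, size = 0.3)  # toy per-gene counts
# Zero fraction after keeping each read with probability p:
sapply(c(1, 0.5, 0.25), function(p) {
  mean(rbinom(length(counts), counts, p) == 0)
})
# The zero fraction grows as p shrinks, so the censored region that the
# gamma fit has to extrapolate over becomes larger.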

One thing that confuses me: you set nGenes for the gamma mixture model to 34,000. This should only work if you have at most 34,000 genes with mean values larger than 0. But that would mean you cannot have over 34,000 expressed genes in your original count matrix. So either there is a problem with the create.pseudobulk / calculate.gene.counts part or with the mixed.gamma.estimation part.

If I understand your code correctly, you have cells for two different "cell types" in your count matrix, called "celltype" and "notcelltype". When you create the pseudobulk, have you checked that the dimensions are correct? It should be a 3-dimensional array with ~45,000 genes times 4 individuals times 2 cell types. And when you then run calculate.gene.counts, you should get results for both cell types; could you check what the expressed.genes data.frame looks like? Make sure that you count the genes separately for the "celltype" and "notcelltype" cells. In the tutorial I had only one cell type, so I could do nrow(expressed.genes), but if you have multiple cell types you need something else, e.g. table(expressed.genes$cell.type).

Another thing you could check: how many of your mean values after the NB fit are larger than 0? Fewer than 34,000, or more?

I hope one of the suggestions helps to find the issue.

Best,
Katharina


ndejay commented Dec 12, 2023

Hi Katharina,

Very sorry that I am answering so slowly at the moment; it is unfortunately a very busy time for me, but things should improve from next week on.

No worries, I appreciate your help.

One thing that confuses me: you set nGenes for the gamma mixture model to 34,000. This should only work if you have at most 34,000 genes with mean values larger than 0. But that would mean you cannot have over 34,000 expressed genes in your original count matrix. So either there is a problem with the create.pseudobulk / calculate.gene.counts part or with the mixed.gamma.estimation part.

The highest number I can pick is 36,601, which corresponds to the number of genes in the pseudobulk count matrix.

If I understand your code correctly, you have cells for two different "cell types" in your count matrix, called "celltype" and "notcelltype". When you create the pseudobulk, have you checked that the dimensions are correct? It should be a 3-dimensional array with ~45,000 genes times 4 individuals times 2 cell types.

I am getting a 3D array with 36,601 genes, 4 donors, and 2 cell types.

> str(pseudo.bulk)
 num [1:36601, 1:4, 1:2] 0 0 0 25 0 0 76 1 0 0 ...
 - attr(*, "dimnames")=List of 3
  ..$ gene.id : chr [1:36601] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009" ...
  ..$ indiv.id: chr [1:4] "donor1" "donor2" "donor3" "donor4"
  ..$ ct.id   : chr [1:2] "celltype" "notcelltype"

And when you then run calculate.gene.counts, you should get results for both cell types; could you check what the expressed.genes data.frame looks like? Make sure that you count the genes separately for the "celltype" and "notcelltype" cells. In the tutorial I had only one cell type, so I could do nrow(expressed.genes), but if you have multiple cell types you need something else, e.g. table(expressed.genes$cell.type).

Yes! Although there are 36,601 genes in the pseudobulk count matrix, only ~22K and ~25K are expressed in each cell type, respectively (their union has ~26K genes).

expressed.genes.df.f %<>% rbind(data.frame(
  matrix = name,
  merge(
    dplyr::count(annot.df, cell.type, name = "num.cells"),
    dplyr::count(expressed.genes, cell.type, name = "expressed.genes")
  )
))
[screenshot: cell counts and expressed gene counts per cell type]

Another thing you could check: how many of your mean values after the NB fit are larger than 0? Fewer than 34,000, or more?

> norm.mean.values %>%
  dplyr::group_by(matrix) %>%
  dplyr::summarise(sum(mean > 0))
[screenshot: counts of genes with mean > 0 per matrix]

I hope one of the suggestions helps to find the issue.

Thanks for pointing me in this direction: it does look like the issues arise from having multiple cell types. I will try to repeat the analysis on only one cell type and see whether the results make more sense. If that works, I will simply iterate the entire process for each cell type separately, roughly like the sketch below. What do you think?
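Something like this (a rough sketch only; the downstream prior-fitting steps are omitted, and the column names would need adapting to my annotation table):

ct.col <- "cell_type"  # name of the cell type column in annot.df
for (ct in unique(annot.df[[ct.col]])) {
  # Keep only the cells of one cell type, then rebuild the pseudobulk for it.
  annot.ct <- annot.df[annot.df[[ct.col]] == ct, ]
  cm.ct    <- count.matrix[, annot.ct$cell, drop = FALSE]

  pseudo.bulk.ct <- scPower::create.pseudobulk(cm.ct, annot.ct, colName = ct.col)
  expressed.ct   <- scPower::calculate.gene.counts(pseudo.bulk.ct,
                                                   min.counts = 3, perc.indiv = 0.5)
  message(ct, ": ", nrow(expressed.ct), " expressed genes")
}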

Many thanks!

Best,
Nic

KatharinaSchmid (Collaborator) commented

Hi Nic,
Yes, it might be better to repeat the analysis separately for each cell type. Thanks for reporting your issue, though: I can see that our current vignette does not show well enough how to generate priors for a data set with multiple cell types. As soon as I have time, I will adapt the vignette to explain this better (and potentially also adapt some code to make it more elegant). But please let me know if you still have issues when running each cell type separately.
Best,
Katharina
