Custom 3'utr #72

Open
aleighbrown opened this issue Mar 28, 2024 · 1 comment

Comments

@aleighbrown

Hello,

I have a set of custom novel 3'UTRs that I would like to quantify in single-cell data.

Ideally, I would quantify only the 100 or so 3'UTRs that I'm interested in, for speed's sake.

What would I need to build a minimal working UTRome kallisto index, GTF, and TSV merge annotation for my custom set of 3'UTRs?

Thank you in advance!

@mfansler
Collaborator

mfansler commented Mar 28, 2024

Thanks for the interest!

Yes, you could build a custom target that only quantifies reads in the regions of interest. To plug into this pipeline, you would indeed provide a kallisto index ("kdx"), GTF, and TSV merge annotation. You would edit extdata/targets/targets.yaml to add an entry for the new target, something like:

custom_utrs:
  path: "extdata/targets/custom_utrs/"
  genome: "hg38"
  gtf: "custom_utrs.gtf"
  kdx: "custom_utrs.kdx"
  merge_tsv: "custom_utrs.merge.tsv"
  tx_annots: null
  gene_annots: null
  download_script: null

where the path value is relative to the root of the repository (an absolute path is also fine).
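Before launching the pipeline, it can help to confirm that the files named in such an entry actually exist. Here is a minimal sketch, not part of the pipeline itself: the function name `missing_target_files` is hypothetical, and it assumes that the gtf, kdx, and merge_tsv values are file names resolved inside the path directory.

```python
from pathlib import Path

# Keys copied from the example entry above. Treating exactly these three as
# required files is an assumption of this sketch.
FILE_KEYS = ("gtf", "kdx", "merge_tsv")


def missing_target_files(entry, repo_root="."):
    """Return resolved paths of files named in a target entry that do not exist."""
    base = Path(entry["path"])
    if not base.is_absolute():
        # Relative paths are interpreted from the repository root.
        base = Path(repo_root) / base
    missing = []
    for key in FILE_KEYS:
        name = entry.get(key)
        if name is None:
            # Null values (e.g. tx_annots: null) are simply skipped.
            continue
        candidate = base / name
        if not candidate.is_file():
            missing.append(str(candidate))
    return missing


# The example entry from above, written as a plain dict.
custom_utrs = {
    "path": "extdata/targets/custom_utrs/",
    "genome": "hg38",
    "gtf": "custom_utrs.gtf",
    "kdx": "custom_utrs.kdx",
    "merge_tsv": "custom_utrs.merge.tsv",
    "tx_annots": None,
    "gene_annots": None,
    "download_script": None,
}
```

Calling `missing_target_files(custom_utrs)` from the repository root returns an empty list when all three files are in place.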

Caveats

I'll just note some caveats about taking this approach as opposed to adding the custom 3'UTRs to the full annotation.

Identifying Cells: Valid cell barcodes would need to come from previous data processing. Otherwise, the targeted regions alone may not be sufficient to discriminate high-quality cells from low-quality cells or background.

Comparing Across Cells or Samples: Normalization (size) factors would need to come from previous data processing. With only targeted regions, it would be unclear whether higher counts were due to higher expression, higher capture rate, deeper sequencing, or some mixture.

Multimapping Reads: Reads that would multimap in a full annotation might map uniquely in a targeted subset, leading to overestimation of counts. One should verify this isn't a factor before trusting the targeted results. You'd probably want to prepare a full index (full UTRome + custom novel 3'UTRs) and then inspect whether any of the k-mers from the targeted regions are shared with those in non-targeted regions. If any are shared, you may need to include the other transcripts that share those k-mers to make the assignment fair. That is, one doesn't want changes in gene expression from some other gene to show up as isoform-specific expression in a targeted isoform because the alternative loci from which the reads may have originated were excluded.
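The k-mer inspection could be sketched roughly as follows. This is only an illustration working on sequences already loaded into dicts, not a function of kallisto or of this pipeline. The names `canonical_kmers` and `shared_kmer_report` are hypothetical, k=31 is kallisto's default k-mer size, and comparing canonical k-mers (so that reverse-strand matches are flagged too) is a deliberately conservative assumption.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k=31):
    """Return the set of canonical k-mers in seq.

    A canonical k-mer is the lexicographically smaller of a k-mer and its
    reverse complement. K-mers containing N are skipped.
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue
        revcomp = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, revcomp))
    return kmers


def shared_kmer_report(targeted, background, k=31):
    """Map each targeted ID to the background IDs sharing at least one k-mer.

    targeted and background are dicts of {transcript_id: sequence}.
    Only the (small) targeted set is indexed; the background is streamed.
    """
    owners = defaultdict(set)  # k-mer -> targeted IDs containing it
    for tx_id, seq in targeted.items():
        for kmer in canonical_kmers(seq, k):
            owners[kmer].add(tx_id)

    report = defaultdict(set)
    for bg_id, seq in background.items():
        for kmer in canonical_kmers(seq, k):
            for tx_id in owners.get(kmer, ()):
                report[tx_id].add(bg_id)
    return {tx_id: sorted(ids) for tx_id, ids in report.items()}
```

Any background transcript that appears in the report is a candidate for inclusion in the targeted index; an empty report suggests the targeted k-mers are unique with respect to the background supplied.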

On the last point, you may also just do some empirical spot checks. For example, run some samples with the full UTRome + novel 3'UTRs and separately with just the targeted index, then compare the counts for the targeted 3'UTRs between the two runs. If those counts do not come out identical, that should surface multimapping issues.
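A spot check of this kind might look like the sketch below. It assumes the counts from each run have already been summed per 3'UTR into plain dicts; the function name `flag_count_differences` and the tolerance values are hypothetical choices for illustration, and setting both tolerances to zero recovers a strict identity check.

```python
import math


def flag_count_differences(full_counts, targeted_counts, rel_tol=0.05, abs_tol=1.0):
    """Return targeted 3'UTRs whose counts differ between the two runs.

    full_counts: {utr_id: count} from the full UTRome + novel 3'UTRs index.
    targeted_counts: {utr_id: count} from the targeted-only index.
    The result maps utr_id -> (full_count, targeted_count) for each 3'UTR
    whose counts are not close within the given tolerances.
    """
    flagged = {}
    for utr_id, targeted in targeted_counts.items():
        # A 3'UTR absent from the full run is treated as having zero counts.
        full = full_counts.get(utr_id, 0.0)
        if not math.isclose(full, targeted, rel_tol=rel_tol, abs_tol=abs_tol):
            flagged[utr_id] = (full, targeted)
    return flagged
```

A 3'UTR whose targeted count is clearly higher than its full-index count is the pattern expected when reads that belong to excluded transcripts are being absorbed.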

Hope that's helpful! Let me know if I can answer any more questions.
