forked from duty-machine/duty-machine
-
Notifications
You must be signed in to change notification settings - Fork 32
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Count数据转换为TPM数据方法整理-常规方法、DGEobj.utils和IOBR包
- Loading branch information
Showing
1 changed file
with
15 additions
and
0 deletions.
There are no files selected for viewing
15 changes: 15 additions & 0 deletions
15
docs/2024-08/Count数据转换为TPM数据方法整理_常规方法_DGEobj_utils和IOBR包.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
--- | ||
title: "Count数据转换为TPM数据方法整理-常规方法、DGEobj.utils和IOBR包" | ||
date: 2024-08-02T02:12:50Z | ||
draft: ["false"] | ||
tags: [ | ||
"fetched", | ||
"生信方舟" | ||
] | ||
categories: ["Acdemic"] | ||
--- | ||
Count数据转换为TPM数据方法整理-常规方法、DGEobj.utils和IOBR包 by 生信方舟 | ||
------ | ||
<div><section data-tool="mdnice编辑器" data-website="https://www.mdnice.com"><p data-tool="mdnice编辑器">在正式分析之前,对于数据的处理是至关重要的,这种重要性是体现在很多方面,其中有一点是要求分析者采用正确的数据类型。</p><p data-tool="mdnice编辑器">对于<strong>芯片数据</strong>,原始数据<strong>进行log2处理之后可以进行很多常见的分析</strong>,比如差异分析、热图、箱线图、PCA分析、生存分析、模型构建,聚类分析和相关性分析等。</p><p data-tool="mdnice编辑器">对于<strong>转录组数据</strong>,在上述的常见分析中<strong>只有差异分析时需要采用count值</strong>,<strong>其他的分析</strong>是需要<strong>采用log2后的cpm,tpm,fpkm,rpkm数据</strong>。</p><p data-tool="mdnice编辑器">除了上述的常规分析,在<strong>使用不同R包</strong>进行<strong>分析之前务必浏览一下输入数据的要求</strong>。</p><p data-tool="mdnice编辑器">那么芯片数据还好说,毕竟后续进行log2处理后就可以做很多分析。但是转录组数据的可选项就比较多了。</p><p data-tool="mdnice编辑器"><strong>但目前常用的只有tpm和cpm</strong></p><figure data-tool="mdnice编辑器"><img data-imgfileid="100000748" data-ratio="0.48520710059171596" data-src="https://mmbiz.qpic.cn/sz_mmbiz_png/0SOG4MpDAyFQUhD0iby4kMyzyJk1AgOVsxHh9tTpLjj6qEOo6dibZc6XhUWpBf5nYYz2WIcP2iaYK0a8H9EA72hIA/640?wx_fmt=png&from=appmsg" data-type="png" data-w="845" src="https://mmbiz.qpic.cn/sz_mmbiz_png/0SOG4MpDAyFQUhD0iby4kMyzyJk1AgOVsxHh9tTpLjj6qEOo6dibZc6XhUWpBf5nYYz2WIcP2iaYK0a8H9EA72hIA/640?wx_fmt=png&from=appmsg"></figure><p data-tool="mdnice编辑器"><strong>count数据转换为cpm数据非常简单</strong></p><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944932" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code><span># exprSet是count表达矩阵</span><br><span># 一句代码搞定</span><br>exprSet = log2(edgeR::cpm(exprSet)+<span>1</span>)<br></code></pre><p data-tool="mdnice编辑器"><strong>比较难的就是count数据转换为tpm数据,因此搬运了常规的流程和R包的方法,做个对比</strong></p><p data-tool="mdnice编辑器">首先要去获取基因长度文件,因为后续需要用这个数据去矫正基因长度。</p><p data-tool="mdnice编辑器">网址:https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files</p><figure data-tool="mdnice编辑器"><img data-imgfileid="100000749" data-ratio="0.7461629279811098" data-src="https://mmbiz.qpic.cn/sz_mmbiz_png/0SOG4MpDAyFQUhD0iby4kMyzyJk1AgOVs9r7zFntnEpSsBxpxRO426MPds9cwI4bAJKgYLkwRNhN7Qbrzev3EBQ/640?wx_fmt=png&from=appmsg" data-type="png" data-w="847" src="https://mmbiz.qpic.cn/sz_mmbiz_png/0SOG4MpDAyFQUhD0iby4kMyzyJk1AgOVs9r7zFntnEpSsBxpxRO426MPds9cwI4bAJKgYLkwRNhN7Qbrzev3EBQ/640?wx_fmt=png&from=appmsg"></figure><h4 data-tool="mdnice编辑器"><span></span>方法一:常规方法-来自生信技能树<span></span></h4><p data-tool="mdnice编辑器">1、获取gencode.v36.annotation并处理</p><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944933" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code><span>if</span>(!<span>require</span>(GenomicFeatures))BiocManager::install(<span>"GenomicFeatures"</span>)<br><span>library</span>(GenomicFeatures)<br>txdb <- makeTxDbFromGFF(<span>"gencode.v36.annotation.gtf.gz"</span>,format=<span>"gtf"</span>)<br><span># 获取每个基因id的外显子数据</span><br>exons.list.per.gene <- exonsBy(txdb,by=<span>"gene"</span>)<br><span># 对于每个基因,将所有外显子减少成一组非重叠外显子,计算它们的长度(宽度)并求和</span><br>exonic.gene.sizes <- sum(width(GenomicRanges::reduce(exons.list.per.gene)))<br><span># 得到geneid和长度数据</span><br>gfe <- data.frame(gene_id=names(exonic.gene.sizes),<br> length=exonic.gene.sizes)<br>head(gfe)[<span>1</span>:<span>5</span>,<span>1</span>:<span>2</span>]<br><span># gene_id length</span><br><span># ENSG00000000003.15 ENSG00000000003.15 4536</span><br><span># ENSG00000000005.6 ENSG00000000005.6 1476</span><br><span># ENSG00000000419.13 ENSG00000000419.13 1207</span><br><span># ENSG00000000457.14 ENSG00000000457.14 6883</span><br><span># ENSG00000000460.17 ENSG00000000460.17 5970</span><br>save(gfe,file = <span>"gfe.Rdata"</span>)<br></code></pre><p data-tool="mdnice编辑器">2、使用TCGA-HNSC数据</p><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944934" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code>load(<span>"hnsc_exp.Rdata"</span>)<br>head(hnsc)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span># TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># ENSG00000000003.15 2090 1680 5433</span><br><span># ENSG00000000005.6 0 0 0</span><br><span># ENSG00000000419.13 2098 3872 2240</span><br>identical(rownames(hnsc),rownames(gfe))<br><span>#[1] TRUE 行名是能够对上的</span><br></code></pre><p data-tool="mdnice编辑器">使用了TCGA-HNSC中的count数据,检测了一下count数据和下载的基因信息的顺序是一致的。</p><p data-tool="mdnice编辑器">3、曾老师写的代码进行count/tpm转化</p><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944935" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code>load(<span>"gfe.Rdata"</span>)<br><span>#提取基因长度列</span><br>effLen = gfe$length<br><span>#转化</span><br>Counts2TPM <- <span>function</span>(counts, effLen){<br> rate <- log(counts) - log(effLen)<br> denom <- log(sum(exp(rate)))<br> exp(rate - denom + log(<span>1e6</span>))<br>}<br>hnsc_tpm_raw <- apply(hnsc, <span>2</span>, Counts2TPM, effLen = effLen)<br>head(hnsc_tpm_raw)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span># TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># ENSG00000000003.15 31.13713 15.74248 72.37614</span><br><span># ENSG00000000005.6 0.00000 0.00000 0.00000</span><br><span># ENSG00000000419.13 117.46366 136.35307 112.14231</span><br></code></pre><p data-tool="mdnice编辑器">4、加载既往处理好的数据进行对比</p><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944936" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code>load(<span>"hnsc_tpm.Rdata"</span>)<br>head(hnsc_tpm)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span># TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># ENSG00000000003.15 31.13713 15.74248 72.37614</span><br><span># ENSG00000000005.6 0.00000 0.00000 0.00000</span><br><span># ENSG00000000419.13 117.46366 136.35307 112.14231</span><br></code></pre><p data-tool="mdnice编辑器">结果是一致的</p><h4 data-tool="mdnice编辑器"><span></span>方法二:R包 DGEobj.utils<span></span></h4><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944937" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code>load(<span>"gfe.Rdata"</span>)<br><span>#提取上面处理好的基因长度列</span><br>effLen = gfe$length<br><span>library</span>(DGEobj.utils)<br>CC_res <- convertCounts(<br> exp,<br> unit = <span>"TPM"</span>, <span># CPM、FPKM、FPK 或 TPM</span><br> geneLength = effLen, <span>#这里还是需要下在基因长度数据</span><br> log = <span>FALSE</span>, <span>#默认为False,设为TRUE时返回Log2值</span><br> normalize = <span>"none"</span>, <span>#默认为none,调用edgeR的calcNormFactors()进行标准化</span><br> prior.count = <span>NULL</span> <span>#为避免取0的对数,对每个观测值添加平均count。仅当 log = TRUE 时使用,</span><br>)<br>head(CC_res)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span># TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span>#ENSG00000000003.15 31.13713 15.74248 72.37614</span><br><span>#ENSG00000000005.6 0.00000 0.00000 0.00000</span><br><span>#ENSG00000000419.13 117.46366 136.35307 112.14231</span><br></code></pre><p data-tool="mdnice编辑器">与前面得到的分析结果是一致的。</p><p data-tool="mdnice编辑器">需要下载基因长度数据,但是前期处理完之后后面可以很方便的转化为各种想要的格式(CPM、FPKM、FPK 或 TPM)。</p><h4 data-tool="mdnice编辑器"><span></span>方法三:R包 IOBR<span></span></h4><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944938" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code>load(<span>"hnsc_exp.Rdata"</span>)<br>exp <- hnsc<br><span>library</span>(IOBR)<br>exp_tpm <- count2tpm(<br> exp, <span>#表达矩阵, 行为基因</span><br> idType = <span>"Ensembl"</span>, <span>#"Ensembl" "ENTREZ","SYMBOL" </span><br>)<br>head(exp_tpm)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span># TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># TSPAN6 31.89988 16.16771 73.85059</span><br><span># TNMD 0.00000 0.00000 0.00000</span><br><span># DPM1 15.65888 18.22162 14.88931</span><br><span># 可以看到这个函数还自动把gene_id给转换成了symbol</span><br><br><span># 为了更好的对比,我们也把hnsc_tpm_raw中的gene_id转换成symbol</span><br><span># 使用小洁老师的trans_exp_new函数</span><br><span>library</span>(tinyarray)<br>a <- trans_exp_new(hnsc_tpm_raw)<br>head(a)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span># TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># DDX11L1 0.000000 0.0000000 0.000000</span><br><span># WASH7P 1.150477 0.8180029 2.191641</span><br><span># MIR6859-1 0.993794 0.0000000 4.443138</span><br><br><span># 排个序再看一下</span><br>p <- identical(rownames(a),rownames(exp_tpm));p<br><span>if</span>(!p) {<br> s = intersect(rownames(a),rownames(exp_tpm))<br> a = a[s,]<br> exp_tpm = exp_tpm[s,]<br>}<br>head(a)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span># TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># DDX11L1 0.000000 0.00000000 0.000000</span><br><span># WASH7P 1.150477 0.81800290 2.191641</span><br><span># FAM138A 0.000000 0.03486849 0.000000</span><br><br>head(exp_tpm)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span># TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># DDX11L1 0.00000 0.00000000 0.00000</span><br><span># WASH7P 22.63327 5.95811827 15.88102</span><br><span># FAM138A 0.00000 0.03581036 0.00000</span><br></code></pre><p data-tool="mdnice编辑器">IOBR包中的count2tpm函数只能进行tpm转化(github上搜了这个R包内容似乎没有转化为其他数据格式的函数了)。用默认的方式进行运算得到的结果存在一定的偏差,而且我个人觉得这个偏差有点大... 但是我暂时不知道是什么原因?是内置的基因长度顺序有问题?还是我某个参数设的不对?求高手指点~</p><p data-tool="mdnice编辑器">综合上述分析,暂时还是选择<strong>常规方法</strong>和<strong>DGEobj.utils R包中的convertCounts函数</strong>吧~</p><p data-tool="mdnice编辑器">参考资料:</p><p data-tool="mdnice编辑器">1、https://hbctraining.github.io/Training-modules/planning_successful_rnaseq/lessons/sample_level_QC.html</p><p data-tool="mdnice编辑器">2、https://rdrr.io/cran/DGEobj.utils/man/convertCounts.html</p><p data-tool="mdnice编辑器">3、https://rdrr.io/github/IOBR/IOBR/man/count2tpm.html</p><p data-tool="mdnice编辑器">4、生信技能树推文:<a target="_blank" href="http://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247513933&idx=2&sn=874301305219e93c26d173560a31b076&chksm=9b4bf7f6ac3c7ee0a2ae03e8d5ec9994e7d9b563423746c06a037f56f8013bc3924effe7580a&scene=21#wechat_redirect" textvalue="Counts FPKM RPKM TPM CPM 的转化" linktype="text" imgurl="" imgdata="null" data-itemshowtype="0" tab="innerlink" data-linktype="2">Counts FPKM RPKM TPM CPM 的转化</a></p><p data-tool="mdnice编辑器"><strong>致谢</strong>:感谢曾老师,小洁老师以及生信技能树团队全体成员。</p><p data-tool="mdnice编辑器"><strong>注</strong>:若对内容有疑惑或者有发现明确错误的朋友,请联系后台(欢迎交流)。</p><span>- END -</span></section><p><br></p><p><mp-style-type data-value="3"></mp-style-type></p></div> | ||
<hr> | ||
<a href="https://mp.weixin.qq.com/s/XXpc0IBa2SJlFp5iTH1OtA",target="_blank" rel="noopener noreferrer">原文链接</a> |