Skip to content

Commit

Permalink
Count数据转换为TPM数据方法整理-常规方法、DGEobj.utils和IOBR包
Browse files Browse the repository at this point in the history
  • Loading branch information
ixxmu committed Aug 2, 2024
1 parent 7f98f6c commit 29be472
Showing 1 changed file with 15 additions and 0 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
---
title: "Count数据转换为TPM数据方法整理-常规方法、DGEobj.utils和IOBR包"
date: 2024-08-02T02:12:50Z
draft: ["false"]
tags: [
"fetched",
"生信方舟"
]
categories: ["Acdemic"]
---
Count数据转换为TPM数据方法整理-常规方法、DGEobj.utils和IOBR包 by 生信方舟
------
<div><section data-tool="mdnice编辑器" data-website="https://www.mdnice.com"><p data-tool="mdnice编辑器">在正式分析之前,对于数据的处理是至关重要的,这种重要性是体现在很多方面,其中有一点是要求分析者采用正确的数据类型。</p><p data-tool="mdnice编辑器">对于<strong>芯片数据</strong>,原始数据<strong>进行log2处理之后可以进行很多常见的分析</strong>,比如差异分析、热图、箱线图、PCA分析、生存分析、模型构建,聚类分析和相关性分析等。</p><p data-tool="mdnice编辑器">对于<strong>转录组数据</strong>,在上述的常见分析中<strong>只有差异分析时需要采用count值</strong>,<strong>其他的分析</strong>是需要<strong>采用log2后的cpm,tpm,fpkm,rpkm数据</strong>。</p><p data-tool="mdnice编辑器">除了上述的常规分析,在<strong>使用不同R包</strong>进行<strong>分析之前务必浏览一下输入数据的要求</strong>。</p><p data-tool="mdnice编辑器">那么芯片数据还好说,毕竟后续进行log2处理后就可以做很多分析。但是转录组数据的可选项就比较多了。</p><p data-tool="mdnice编辑器"><strong>但目前常用的只有tpm和cpm</strong></p><figure data-tool="mdnice编辑器"><img data-imgfileid="100000748" data-ratio="0.48520710059171596" data-src="https://mmbiz.qpic.cn/sz_mmbiz_png/0SOG4MpDAyFQUhD0iby4kMyzyJk1AgOVsxHh9tTpLjj6qEOo6dibZc6XhUWpBf5nYYz2WIcP2iaYK0a8H9EA72hIA/640?wx_fmt=png&amp;from=appmsg" data-type="png" data-w="845" src="https://mmbiz.qpic.cn/sz_mmbiz_png/0SOG4MpDAyFQUhD0iby4kMyzyJk1AgOVsxHh9tTpLjj6qEOo6dibZc6XhUWpBf5nYYz2WIcP2iaYK0a8H9EA72hIA/640?wx_fmt=png&amp;from=appmsg"></figure><p data-tool="mdnice编辑器"><strong>count数据转换为cpm数据非常简单</strong></p><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944932" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code><span># exprSet是count表达矩阵</span><br><span># 一句代码搞定</span><br>exprSet = log2(edgeR::cpm(exprSet)+<span>1</span>)<br></code></pre><p data-tool="mdnice编辑器"><strong>比较难的就是count数据转换为tpm数据,因此搬运了常规的流程和R包的方法,做个对比</strong></p><p data-tool="mdnice编辑器">首先要去获取基因长度文件,因为后续需要用这个数据去矫正基因长度。</p><p data-tool="mdnice编辑器">网址:https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files</p><figure data-tool="mdnice编辑器"><img data-imgfileid="100000749" data-ratio="0.7461629279811098" data-src="https://mmbiz.qpic.cn/sz_mmbiz_png/0SOG4MpDAyFQUhD0iby4kMyzyJk1AgOVs9r7zFntnEpSsBxpxRO426MPds9cwI4bAJKgYLkwRNhN7Qbrzev3EBQ/640?wx_fmt=png&amp;from=appmsg" data-type="png" data-w="847" src="https://mmbiz.qpic.cn/sz_mmbiz_png/0SOG4MpDAyFQUhD0iby4kMyzyJk1AgOVs9r7zFntnEpSsBxpxRO426MPds9cwI4bAJKgYLkwRNhN7Qbrzev3EBQ/640?wx_fmt=png&amp;from=appmsg"></figure><h4 data-tool="mdnice编辑器"><span></span>方法一:常规方法-来自生信技能树<span></span></h4><p data-tool="mdnice编辑器">1、获取gencode.v36.annotation并处理</p><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944933" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code><span>if</span>(!<span>require</span>(GenomicFeatures))BiocManager::install(<span>"GenomicFeatures"</span>)<br><span>library</span>(GenomicFeatures)<br>txdb &lt;- makeTxDbFromGFF(<span>"gencode.v36.annotation.gtf.gz"</span>,format=<span>"gtf"</span>)<br><span># 获取每个基因id的外显子数据</span><br>exons.list.per.gene &lt;- exonsBy(txdb,by=<span>"gene"</span>)<br><span># 对于每个基因,将所有外显子减少成一组非重叠外显子,计算它们的长度(宽度)并求和</span><br>exonic.gene.sizes &lt;- sum(width(GenomicRanges::reduce(exons.list.per.gene)))<br><span># 得到geneid和长度数据</span><br>gfe &lt;- data.frame(gene_id=names(exonic.gene.sizes),<br>                  length=exonic.gene.sizes)<br>head(gfe)[<span>1</span>:<span>5</span>,<span>1</span>:<span>2</span>]<br><span>#                               gene_id length</span><br><span># ENSG00000000003.15 ENSG00000000003.15   4536</span><br><span># ENSG00000000005.6   ENSG00000000005.6   1476</span><br><span># ENSG00000000419.13 ENSG00000000419.13   1207</span><br><span># ENSG00000000457.14 ENSG00000000457.14   6883</span><br><span># ENSG00000000460.17 ENSG00000000460.17   5970</span><br>save(gfe,file = <span>"gfe.Rdata"</span>)<br></code></pre><p data-tool="mdnice编辑器">2、使用TCGA-HNSC数据</p><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944934" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code>load(<span>"hnsc_exp.Rdata"</span>)<br>head(hnsc)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span>#                      TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># ENSG00000000003.15                         2090                         1680                         5433</span><br><span># ENSG00000000005.6                             0                            0                            0</span><br><span># ENSG00000000419.13                         2098                         3872                         2240</span><br>identical(rownames(hnsc),rownames(gfe))<br><span>#[1] TRUE 行名是能够对上的</span><br></code></pre><p data-tool="mdnice编辑器">使用了TCGA-HNSC中的count数据,检测了一下count数据和下载的基因信息的顺序是一致的。</p><p data-tool="mdnice编辑器">3、曾老师写的代码进行count/tpm转化</p><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944935" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code>load(<span>"gfe.Rdata"</span>)<br><span>#提取基因长度列</span><br>effLen = gfe$length<br><span>#转化</span><br>Counts2TPM &lt;- <span>function</span>(counts, effLen){<br>  rate &lt;- log(counts) - log(effLen)<br>  denom &lt;- log(sum(exp(rate)))<br>  exp(rate - denom + log(<span>1e6</span>))<br>}<br>hnsc_tpm_raw &lt;- apply(hnsc, <span>2</span>, Counts2TPM, effLen = effLen)<br>head(hnsc_tpm_raw)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span>#                      TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># ENSG00000000003.15                     31.13713                     15.74248                     72.37614</span><br><span># ENSG00000000005.6                       0.00000                      0.00000                      0.00000</span><br><span># ENSG00000000419.13                    117.46366                    136.35307                    112.14231</span><br></code></pre><p data-tool="mdnice编辑器">4、加载既往处理好的数据进行对比</p><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944936" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code>load(<span>"hnsc_tpm.Rdata"</span>)<br>head(hnsc_tpm)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span>#                       TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># ENSG00000000003.15                     31.13713                     15.74248                     72.37614</span><br><span># ENSG00000000005.6                       0.00000                      0.00000                      0.00000</span><br><span># ENSG00000000419.13                    117.46366                    136.35307                    112.14231</span><br></code></pre><p data-tool="mdnice编辑器">结果是一致的</p><h4 data-tool="mdnice编辑器"><span></span>方法二:R包 DGEobj.utils<span></span></h4><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944937" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code>load(<span>"gfe.Rdata"</span>)<br><span>#提取上面处理好的基因长度列</span><br>effLen = gfe$length<br><span>library</span>(DGEobj.utils)<br>CC_res &lt;- convertCounts(<br>  exp,<br>  unit = <span>"TPM"</span>, <span># CPM、FPKM、FPK 或 TPM</span><br>  geneLength = effLen, <span>#这里还是需要下在基因长度数据</span><br>  log = <span>FALSE</span>, <span>#默认为False,设为TRUE时返回Log2值</span><br>  normalize = <span>"none"</span>, <span>#默认为none,调用edgeR的calcNormFactors()进行标准化</span><br>  prior.count = <span>NULL</span> <span>#为避免取0的对数,对每个观测值添加平均count。仅当 log = TRUE 时使用,</span><br>)<br>head(CC_res)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span>#                   TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span>#ENSG00000000003.15                     31.13713                     15.74248                     72.37614</span><br><span>#ENSG00000000005.6                       0.00000                      0.00000                      0.00000</span><br><span>#ENSG00000000419.13                    117.46366                    136.35307                    112.14231</span><br></code></pre><p data-tool="mdnice编辑器">与前面得到的分析结果是一致的。</p><p data-tool="mdnice编辑器">需要下载基因长度数据,但是前期处理完之后后面可以很方便的转化为各种想要的格式(CPM、FPKM、FPK 或 TPM)。</p><h4 data-tool="mdnice编辑器"><span></span>方法三:R包 IOBR<span></span></h4><pre data-tool="mdnice编辑器"><span data-remoteid="c1720870944938" data-cacheurl="https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg"></span><code>load(<span>"hnsc_exp.Rdata"</span>)<br>exp &lt;- hnsc<br><span>library</span>(IOBR)<br>exp_tpm &lt;- count2tpm(<br>  exp, <span>#表达矩阵, 行为基因</span><br>  idType = <span>"Ensembl"</span>, <span>#"Ensembl" "ENTREZ","SYMBOL" </span><br>)<br>head(exp_tpm)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span>#         TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># TSPAN6                     31.89988                     16.16771                     73.85059</span><br><span># TNMD                        0.00000                      0.00000                      0.00000</span><br><span># DPM1                       15.65888                     18.22162                     14.88931</span><br><span># 可以看到这个函数还自动把gene_id给转换成了symbol</span><br><br><span># 为了更好的对比,我们也把hnsc_tpm_raw中的gene_id转换成symbol</span><br><span># 使用小洁老师的trans_exp_new函数</span><br><span>library</span>(tinyarray)<br>a &lt;- trans_exp_new(hnsc_tpm_raw)<br>head(a)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span>#           TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># DDX11L1                       0.000000                    0.0000000                     0.000000</span><br><span># WASH7P                        1.150477                    0.8180029                     2.191641</span><br><span># MIR6859-1                     0.993794                    0.0000000                     4.443138</span><br><br><span># 排个序再看一下</span><br>p &lt;- identical(rownames(a),rownames(exp_tpm));p<br><span>if</span>(!p) {<br>  s = intersect(rownames(a),rownames(exp_tpm))<br>  a = a[s,]<br>  exp_tpm = exp_tpm[s,]<br>}<br>head(a)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span>#          TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># DDX11L1                     0.000000                   0.00000000                     0.000000</span><br><span># WASH7P                      1.150477                   0.81800290                     2.191641</span><br><span># FAM138A                     0.000000                   0.03486849                     0.000000</span><br><br>head(exp_tpm)[<span>1</span>:<span>3</span>,<span>1</span>:<span>3</span>]<br><span>#          TCGA-UF-A7JF-01A-11R-A34R-07 TCGA-CN-4725-01A-01R-1436-07 TCGA-D6-6827-01A-11R-1915-07</span><br><span># DDX11L1                      0.00000                   0.00000000                      0.00000</span><br><span># WASH7P                      22.63327                   5.95811827                     15.88102</span><br><span># FAM138A                      0.00000                   0.03581036                      0.00000</span><br></code></pre><p data-tool="mdnice编辑器">IOBR包中的count2tpm函数只能进行tpm转化(github上搜了这个R包内容似乎没有转化为其他数据格式的函数了)。用默认的方式进行运算得到的结果存在一定的偏差,而且我个人觉得这个偏差有点大... 但是我暂时不知道是什么原因?是内置的基因长度顺序有问题?还是我某个参数设的不对?求高手指点~</p><p data-tool="mdnice编辑器">综合上述分析,暂时还是选择<strong>常规方法</strong>和<strong>DGEobj.utils R包中的convertCounts函数</strong>吧~</p><p data-tool="mdnice编辑器">参考资料:</p><p data-tool="mdnice编辑器">1、https://hbctraining.github.io/Training-modules/planning_successful_rnaseq/lessons/sample_level_QC.html</p><p data-tool="mdnice编辑器">2、https://rdrr.io/cran/DGEobj.utils/man/convertCounts.html</p><p data-tool="mdnice编辑器">3、https://rdrr.io/github/IOBR/IOBR/man/count2tpm.html</p><p data-tool="mdnice编辑器">4、生信技能树推文:<a target="_blank" href="http://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&amp;mid=2247513933&amp;idx=2&amp;sn=874301305219e93c26d173560a31b076&amp;chksm=9b4bf7f6ac3c7ee0a2ae03e8d5ec9994e7d9b563423746c06a037f56f8013bc3924effe7580a&amp;scene=21#wechat_redirect" textvalue="Counts FPKM RPKM TPM CPM 的转化" linktype="text" imgurl="" imgdata="null" data-itemshowtype="0" tab="innerlink" data-linktype="2">Counts FPKM RPKM TPM CPM 的转化</a></p><p data-tool="mdnice编辑器"><strong>致谢</strong>:感谢曾老师,小洁老师以及生信技能树团队全体成员。</p><p data-tool="mdnice编辑器"><strong>注</strong>:若对内容有疑惑或者有发现明确错误的朋友,请联系后台(欢迎交流)。</p><span>- END -</span></section><p><br></p><p><mp-style-type data-value="3"></mp-style-type></p></div>
<hr>
<a href="https://mp.weixin.qq.com/s/XXpc0IBa2SJlFp5iTH1OtA",target="_blank" rel="noopener noreferrer">原文链接</a>

0 comments on commit 29be472

Please sign in to comment.