forked from duty-machine/duty-machine
-
Notifications
You must be signed in to change notification settings - Fork 32
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
15 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
--- | ||
title: "转录组测序的表达量的两个归一化方向会影响差异分析吗" | ||
date: 2024-07-19T08:52:43Z | ||
draft: ["false"] | ||
tags: [ | ||
"fetched", | ||
"生信技能树" | ||
] | ||
categories: ["Acdemic"] | ||
--- | ||
转录组测序的表达量的两个归一化方向会影响差异分析吗 by 生信技能树 | ||
------ | ||
<div><section data-tool="mdnice编辑器" data-website="https://www.mdnice.com"><blockquote data-tool="mdnice编辑器"><span></span><p>众所周知,转录组测序后拿到的表达量矩阵通常是基因在样品的reads的数量,就是最原始的整数的counts矩阵啦。它有两个归一化方向,首先是样品方向的就是抹去各个样品的文库大小这个变量,然后是基因方向的就是抹去基因长度对表达量的影响。</p></blockquote><p data-tool="mdnice编辑器">如果是使用deseq2这样的包进行转录组测序的表达量的差异分析需要的是最原始的整数的counts矩阵即可,如果是做表达量热图,通常是使用归一化后的矩阵,可以是两个方向都做。如果仅仅是考虑文库大小就是cpm和rpm,如果同时考虑基因长度就是 FPKM(Fragments Per Kilobase of transcript per Million mapped reads),以及tpm,让我们来理解一下:</p><h3 data-tool="mdnice编辑器"><span></span><span>cpm和rpm是同一个概念</span><span></span></h3><p data-tool="mdnice编辑器">CPM和RPM是同一种基因表达量标准化方法,它们用于调整测序深度的差异,以便在不同样本之间进行比较,英文全称是:</p><ol data-tool="mdnice编辑器"><li><section>**CPM (Counts Per Million)**:</section></li><li><section>**RPM (Reads Per Million)**:</section></li></ol><p data-tool="mdnice编辑器">其实就是就是最原始的整数的counts矩阵除以每个样品的文库大小(以1M为单位),但是目前转录组测序非常标准化了其实文库大小统一是20M附近,如果不做这个cpm或者rpm,问题也不大,但是就怕碰到极端值情况。</p><h3 data-tool="mdnice编辑器"><span></span><span>tpm不一定是转录本定量</span><span></span></h3><p data-tool="mdnice编辑器">本来呢,应该是先理解 FPKM(Fragments Per Kilobase of transcript per Million mapped reads),就是上面的cpm或者rpm矩阵再除以每个基因的长度(以1kb为单位)情况。但是这样的FPKM表达量有一个弊端就是每个样品的所有的基因的FPKM加和并不是固定的,所以就引入了tpm概念,就是继续除以FPKM表达量的文库(以1M为单位)大小,这个时候就不一定是20M附近,因为每个样品的FPKM加和并不是固定的。但是TPM(Transcripts Per Million)看起来很容易让人误解是针对转录本的定量。</p><h3 data-tool="mdnice编辑器"><span></span><span>最原始的整数的counts矩阵的差异分析</span><span></span></h3><p data-tool="mdnice编辑器">只需要在你的r里面加载两个包,就可以完成下面的分析啦:</p><pre data-tool="mdnice编辑器"><span></span><code><span># 魔幻操作,一键清空</span><br>rm(list = ls()) <br>options(stringsAsFactors = <span>F</span>)<br><span># BiocManager::install('airway')</span><br><span># 加载airway数据集并转换为表达矩阵</span><br><span>library</span>(airway,quietly = <span>T</span>) <br>data(airway)<br>rawcount <- assay(airway) <br>group_list <- colData(airway)$dex <br><span># 过滤在至少在75%的样本中都有表达的基因 (可选步骤,也可以修改)</span><br>keep <- rowSums(rawcount><span>0</span>) >= floor(<span>0.75</span>*ncol(rawcount))<br>table(keep) <br>filter_count <- rawcount[keep,]<br>filter_count[<span>1</span>:<span>4</span>,<span>1</span>:<span>4</span>]<br>dim(filter_count)<br><br>run_deseq2 <- <span>function</span>(exprSet,group_list){<br> <span>library</span>(DESeq2) <br> <span># 第一步,构建DESeq2的DESeq对象</span><br> colData <- data.frame(row.names=colnames(exprSet),group_list=group_list)<br> dds <- DESeqDataSetFromMatrix(countData = exprSet,colData = colData,design = ~ group_list)<br> <span># 第二步,进行差异表达分析</span><br> dds2 <- DESeq(dds)<br> <span># 提取差异分析结果,trt组对untrt组的差异分析结果</span><br> tmp <- results(dds2,contrast=c(<span>"group_list"</span>,<span>"trt"</span>,<span>"untrt"</span>))<br> DEG_DESeq2 <- as.data.frame(tmp[order(tmp$padj),])<br> head(DEG_DESeq2) <br> <span># 去除差异分析结果中包含NA值的行</span><br> DEG_DESeq2 = na.omit(DEG_DESeq2)<br>}<br><br>deg_raw = run_deseq2(filter_count,group_list)<br><br></code></pre><p data-tool="mdnice编辑器">上面的代码里面,我定义了一个 run_deseq2 函数,方便后续调用:</p><h3 data-tool="mdnice编辑器"><span></span><span>针对cpm或者rpm矩阵的差异分析</span><span></span></h3><p data-tool="mdnice编辑器">假如极端情况下,你拿到了的转录组测序的表达量矩阵就是cpm或者rpm,你可以直接把矩阵乘以20后向上取整,如下所示的代码:</p><pre data-tool="mdnice编辑器"><span></span><code>ct2 = floor(<span>20</span>*edgeR::cpm(filter_count))<br>deg_cpm = run_deseq2(ct2,group_list)<br><br>save(deg_raw,deg_cpm,file = <span>'deg.Rdata'</span>)<br></code></pre><p data-tool="mdnice编辑器">可以看到之前的整数的counts矩阵里面每个样品的文库大小确实是不一样的,但是都是在20M附近,而如果你拿到了的转录组测序的表达量矩阵就是cpm或者rpm意味着你没办法知道每个样品的真实文库大小,因为被抹除了 。直接把矩阵乘以20后向上取整的后果就是每个样品很整齐,就是20M的文库大小;</p><pre data-tool="mdnice编辑器"><span></span><code>> colSums(filter_count)/1e6<br>SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 <br> 20.63292 18.80417 25.34134 15.16004 24.44175 30.81030 <br>SRR1039520 SRR1039521 <br> 19.11741 21.15675 <br>> colSums(ct2)/1e6<br>SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 <br> 19.98724 19.99079 19.98892 19.98928 19.98938 19.98891 <br>SRR1039520 SRR1039521 <br> 19.99101 19.98787 <br></code></pre><h3 data-tool="mdnice编辑器"><span></span><span>比较两次差异分析结果:</span><span></span></h3><p data-tool="mdnice编辑器">两次都是同样的 run_deseq2 函数,所以结果矩阵的格式是一致的:</p><pre data-tool="mdnice编辑器"><span></span><code>rm(list = ls())<br><span>library</span>(data.table)<br>load(<span>'deg.Rdata'</span>) <br>ids=intersect(rownames(deg_cpm),<br> rownames(deg_raw))<br>df= data.frame(<br> deg_cpm = deg_cpm[ids,<span>'log2FoldChange'</span>],<br> deg_raw = deg_raw[ids,<span>'log2FoldChange'</span>]<br>)<br><span>library</span>(ggpubr)<br>ggscatter(df, x = <span>"deg_cpm"</span>, y = <span>"deg_raw"</span>,<br> color = <span>"black"</span>, shape = <span>21</span>, size = <span>3</span>, <span># Points color, shape and size</span><br> add = <span>"reg.line"</span>, <span># Add regressin line</span><br> add.params = list(color = <span>"blue"</span>, fill = <span>"lightgray"</span>), <span># Customize reg. line</span><br> conf.int = <span>TRUE</span>, <span># Add confidence interval</span><br> cor.coef = <span>TRUE</span>, <span># Add correlation coefficient. see ?stat_cor</span><br> cor.coeff.args = list(method = <span>"pearson"</span>, label.sep = <span>"\n"</span>)<br>)<br></code></pre><p data-tool="mdnice编辑器">可以看到虽然是两次计算的logFC略微有差异,但是相关性几乎是完美的:</p><p><img data-galleryid="" data-imgfileid="100048325" data-ratio="0.7363770250368189" data-s="300,640" data-src="https://mmbiz.qpic.cn/mmbiz_png/cZNhZQ6j4wxff2CmdQVTwImotMOuvArTkjibibvxxrXjsDibKhA49V0blW8gc1EUQhK5T8Y1BUfLJ0licaBKJd2QcA/640?wx_fmt=png&from=appmsg" data-type="png" data-w="1358" src="https://mmbiz.qpic.cn/mmbiz_png/cZNhZQ6j4wxff2CmdQVTwImotMOuvArTkjibibvxxrXjsDibKhA49V0blW8gc1EUQhK5T8Y1BUfLJ0licaBKJd2QcA/640?wx_fmt=png&from=appmsg"></p><figure data-tool="mdnice编辑器"><figcaption>相关性几乎是完美的</figcaption></figure><p data-tool="mdnice编辑器">也可以看看,两次差异分析后的统计学显著的上下调基因的一致性情况,代码如下所示:</p><pre data-tool="mdnice编辑器"><span></span><code>modify_deg<-<span>function</span>(DEG_DESeq2){<br> <br> <span># 筛选上下调,设定阈值</span><br> fc_cutoff <- <span>1</span><br> fdr <- <span>0.05</span><br> <br> DEG_DESeq2$regulated <- <span>"normal"</span><br> <br> loc_up <- intersect(which(DEG_DESeq2$log2FoldChange>log2(fc_cutoff)),<br> which(DEG_DESeq2$padj<fdr))<br> loc_down <- intersect(which(DEG_DESeq2$log2FoldChange< (-log2(fc_cutoff))),<br> which(DEG_DESeq2$padj<fdr))<br> <br> DEG_DESeq2$regulated[loc_up] <- <span>"up"</span><br> DEG_DESeq2$regulated[loc_down] <- <span>"down"</span><br> <br> table(DEG_DESeq2$regulated)<br> <br> head(DEG_DESeq2)<br> <span>library</span>(AnnoProbe)<br> ag=annoGene(rownames(DEG_DESeq2),<br> ID_type = <span>'ENSEMBL'</span>,species = <span>'human'</span><br> )<br> head(ag)<br> DEG_DESeq2$ENSEMBL=rownames(DEG_DESeq2)<br> <br> deg_anno=merge(ag,DEG_DESeq2,by=<span>'ENSEMBL'</span>)<br> deg_anno=deg_anno[!duplicated(deg_anno$SYMBOL),]<br> rownames(deg_anno)=deg_anno$SYMBOL<br> <span>return</span>(deg_anno)<br>}<br>deg_cpm=modify_deg(deg_cpm)<br>deg_raw=modify_deg(deg_raw)<br>colnames(deg_cpm)<br><br>ids=intersect(rownames(deg_cpm),<br> rownames(deg_raw))<br>df= data.frame(<br> deg_cpm = deg_cpm[ids,<span>'regulated'</span>],<br> deg_raw = deg_raw[ids,<span>'regulated'</span>]<br>)<br>table(df)<br>gplots::balloonplot(table(df))<br></code></pre><p data-tool="mdnice编辑器">可以看到的是两次的差异分析误差几乎是可以忽略不计的 :</p><pre data-tool="mdnice编辑器"><span></span><code>> table(df)<br> deg_raw<br>deg_cpm down normal up<br> down 1111 14 0<br> normal 32 14683 13<br> up 0 8 1511<br></code></pre><p data-tool="mdnice编辑器">也可以进一步的看看两次差异分析的冲突的基因列表的功能情况:</p><pre data-tool="mdnice编辑器"><span></span><code>symbols_list = split(ids,paste(df[,<span>1</span>],df[,<span>2</span>]))<br><span>library</span>(clusterProfiler)<br><span>library</span>(org.Hs.eg.db)<br><span>library</span>(ReactomePA)<br><span>library</span>(ggplot2)<br><span>library</span>(stringr) <br><span># 首先全部的symbol 需要转为 entrezID</span><br>gcSample = lapply(symbols_list, <span>function</span>(y){ <br> y=as.character(na.omit(AnnotationDbi::select(org.Hs.eg.db,<br> keys = y,<br> columns = <span>'ENTREZID'</span>,<br> keytype = <span>'SYMBOL'</span>)[,<span>2</span>])<br> )<br> y<br>})<br>gcSample<br>pro=<span>'test'</span><br><span># 第1个注释是 KEGG </span><br>xx <- compareCluster(gcSample, fun=<span>"enrichKEGG"</span>,<br> organism=<span>"hsa"</span>, pvalueCutoff=<span>0.3</span>)<br>dotplot(xx) + theme(axis.text.x=element_text(angle=<span>45</span>,hjust = <span>1</span>)) + <br> scale_y_discrete(labels=<span>function</span>(x) str_wrap(x, width=<span>50</span>)) <br>ggsave(paste0(pro,<span>'_kegg.pdf'</span>),width = <span>10</span>,height = <span>8</span>)<br><br></code></pre><p data-tool="mdnice编辑器">蛮有意思的, 这些基因都是代谢相关的:</p><p><img data-galleryid="" data-imgfileid="100048326" data-ratio="0.7987477638640429" data-s="300,640" data-src="https://mmbiz.qpic.cn/mmbiz_png/cZNhZQ6j4wxff2CmdQVTwImotMOuvArTznwhRxOBNtMqggkGqEaEzoVy9BAxBicV9S39ZnyeNYQibEJSNTbLODOg/640?wx_fmt=png&from=appmsg" data-type="png" data-w="2236" src="https://mmbiz.qpic.cn/mmbiz_png/cZNhZQ6j4wxff2CmdQVTwImotMOuvArTznwhRxOBNtMqggkGqEaEzoVy9BAxBicV9S39ZnyeNYQibEJSNTbLODOg/640?wx_fmt=png&from=appmsg"></p><figure data-tool="mdnice编辑器"><figcaption>基因都是代谢相关的</figcaption></figure><p data-tool="mdnice编辑器">其实是可以深入探索一下,如果你的生物学背景足够这些基因看一下就知道是代谢了没必要做kegg的富集分析了:</p><pre data-tool="mdnice编辑器"><span></span><code>$`down normal`<br> [1] <span>"HDAC9"</span> <span>"NFYC"</span> <span>"NBN"</span> <span>"ALMS1"</span> <span>"MTHFD1L"</span> <span>"ACSBG2"</span> <br> [7] <span>"YIPF2"</span> <span>"PI4KB"</span> <span>"LIMS1"</span> <span>"INSIG1"</span> <span>"ZSCAN23"</span> <span>"PLPP4"</span> <br>[13] <span>"SP5"</span> <span>"GUSBP16"</span><br><br>$`normal down`<br> [1] <span>"ABCB5"</span> <span>"CDH10"</span> <span>"SLC6A16"</span> <span>"PTGS2"</span> <span>"MOK"</span> <br> [6] <span>"PITPNM3"</span> <span>"C14orf93"</span> <span>"ISYNA1"</span> <span>"TBC1D30"</span> <span>"MCEE"</span> <br>[11] <span>"ZNF436"</span> <span>"RSAD2"</span> <span>"MDH1B"</span> <span>"DEPTOR"</span> <span>"LRRC56"</span> <br>[16] <span>"STOX2"</span> <span>"PODN"</span> <span>"RIN1"</span> <span>"GUSBP1"</span> <span>"CDNF"</span> <br>[21] <span>"SAMD11"</span> <span>"ZNF682"</span> <span>"FANK1"</span> <span>"EIF2AK3-DT"</span> <span>"H3P6"</span> <br>[26] <span>"AC108488.1"</span> <span>"ZNF512"</span> <span>"ZNF286B"</span> <span>"KLRA1P"</span> <span>"AL355102.1"</span><br></code></pre><p data-tool="mdnice编辑器">为什么看起来基本上没有修改表达量矩阵的操作,也会有一些冲突的基因呢?作为学徒作业给大家吧!</p><p data-tool="mdnice编辑器">下一期我们说一下如果你的矩阵被fpkm了或者tpm了,该如何最佳差异分析呢?</p></section><section data-tool="mdnice编辑器" data-website="https://www.mdnice.com"><h3 data-tool="mdnice编辑器"><span>写在文末</span></h3></section><p>如果你也想做单细胞转录组数据分析,<span>最好是有自己的计算机资源哦,比如我们的</span><a href="https://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247528363&idx=1&sn=5e02f3e9b2e148191e23ebc2c0d780e7&scene=21#wechat_redirect" data-linktype="2">2024的共享服务器交个朋友福利价仍然是800</a><span>,而且还需要有基本的生物信息学基础,也可以看看我们的</span><a target="_blank" href="http://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247531929&idx=1&sn=f6f16b7bf6b907360d6d0052e3d10cf6&chksm=9b4b3d22ac3cb434b6aa7753a4cf0f266578147ccf10b49cc834e46af578ee6de99be0accb30&scene=21#wechat_redirect" textvalue="生物信息学马拉松授课(买一得五)" linktype="text" imgurl="" imgdata="null" data-itemshowtype="0" tab="innerlink" data-linktype="2" hasload="1">生物信息学马拉松授课(买一得五)</a><span> ,你的生物信息学入门课。</span></p><p><br></p><p><br></p><p><mp-style-type data-value="3"></mp-style-type></p></div> | ||
<hr> | ||
<a href="https://mp.weixin.qq.com/s/HPB3P2crOd9hrTtU1PjEVA",target="_blank" rel="noopener noreferrer">原文链接</a> |