forked from duty-machine/duty-machine
Commit
Showing 1 changed file with 15 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
--- | ||
title: "R Data Mining | Handling Batch Effects" | ||
date: 2023-09-14T16:00:20Z | ||
draft: ["false"] | ||
tags: [ | ||
"fetched", | ||
"生信小白要知道" | ||
] | ||
categories: ["Acdemic"] | ||
--- | ||
R Data Mining | Handling Batch Effects by 生信小白要知道 | ||
------ | ||
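<div><section data-tool="mdnice编辑器" data-website="https://www.mdnice.com" data-mpa-powered-by="yiban.io"><blockquote data-tool="mdnice编辑器"><blockquote><p>When analyzing bulk-tissue RNA-seq data from multiple GEO datasets, you will often run into <strong>batch effects</strong>. This post shares two methods for removing batch effects from bulk-tissue data.</p></blockquote></blockquote><h2 data-tool="mdnice编辑器"><span></span>What is a batch effect?</h2><p data-tool="mdnice编辑器">A batch effect (<code>Batch Effect</code>) is <strong>non-biological technical or experimental variation</strong> introduced during an experiment. It can come from different run batches, different operators, different instruments, or changes in experimental conditions.</p><h4 data-tool="mdnice编辑器"><span></span>Characteristics of batch effects<span></span></h4><ul data-tool="mdnice编辑器"><li><section><code>Data inconsistency</code>: batch effects cause the same biological sample to give inconsistent measurements in different batches.</section></li><li><section><code>Masked biological signal</code>: batch effects can hide real biological differences.</section></li><li><section><code>Added noise</code>: batch effects add noise to the data and reduce the sensitivity and accuracy of the experiment.</section></li><li><section><code>Reduced reproducibility</code>: if batch effects are not properly controlled and corrected, the reproducibility and comparability of the experiment suffer.</section></li></ul><h4 data-tool="mdnice编辑器"><span></span>Dealing with batch effects<span></span></h4><ul data-tool="mdnice编辑器"><li><section><code>Batch randomization</code>: at the experimental design stage, assign samples to batches as randomly as possible to reduce the impact of batch effects.</section></li><li><section><code>Quality control</code>: run quality control on the data from each batch, including checks for abnormal values and outliers, to make sure the data are sound.</section></li><li><section><code>Correction methods</code>: <strong>use statistical methods to correct for batch effects. These include ComBat, SVA, and PCA-based correction, which can be applied to the data before downstream analysis.</strong></section></li><li><section><code>Data integration</code>: for data from multiple batches, data integration methods can be used to merge information across batches, for example the integration of single-cell RNA-seq data.</section></li></ul><h2 data-tool="mdnice编辑器"><span></span>Removing batch effects in R</h2><h4 data-tool="mdnice编辑器"><span></span>Download two different datasets from the GEO database: <strong>GSE211167</strong> and <strong>GSE229119</strong>.<span></span></h4><p data-tool="mdnice编辑器">For how to download GEO data, see this earlier post:</p><blockquote data-tool="mdnice编辑器"><p><a href="https://mp.weixin.qq.com/s?__biz=Mzg2NjYzNjQ4Ng==&mid=2247486294&idx=1&sn=b70aaa7ab76ec5c27ddf7afbf740b8ba&chksm=ce468cfff93105e9f60e5c304c2625a8f26ad0832c2f27cb9a8079bdf4e8121e537fad30aac3&token=560068309&lang=zh_CN&scene=21#wechat_redirect" data-linktype="2">Introduction to the GEO database | Three ways to download GEO data</a></p></blockquote><h4 data-tool="mdnice编辑器"><span></span>Alternatively, use the data I have already downloaded<span></span></h4><p 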
data-tool="mdnice编辑器">Baidu Cloud link: https://pan.baidu.com/s/1iVOmz_DyjroTqbZPt48frQ Extraction code: b91f</p><h3 data-tool="mdnice编辑器"><span></span>Loading the data<span></span></h3><pre data-tool="mdnice编辑器"><span></span><code><span>library</span>(dplyr)<br><span>library</span>(data.table)<br>GSE211167_tpm <- fread(<span>"./GSE211167_tpm_matrix.txt"</span>,data.table = <span>FALSE</span>)<br>GSE229119_log2TPM <- fread(<span>"./GSE229119_log2TPM.txt"</span>,data.table = <span>FALSE</span>)<br></code></pre><figure data-tool="mdnice编辑器"><img data-ratio="0.29130434782608694" data-src="https://mmbiz.qpic.cn/mmbiz_png/FiciaODWjVz620sK3Bwfs9vY69Y8o6lzAGwGafVw436Htug4IT2gYbYdfttFxhPqAXAU0m4aoPaVPAx42EqiaQlEw/640?wx_fmt=png" data-type="png" data-w="460" src="https://mmbiz.qpic.cn/mmbiz_png/FiciaODWjVz620sK3Bwfs9vY69Y8o6lzAGwGafVw436Htug4IT2gYbYdfttFxhPqAXAU0m4aoPaVPAx42EqiaQlEw/640?wx_fmt=png"><figcaption>GSE211167</figcaption></figure><figure data-tool="mdnice编辑器"><img data-ratio="0.24858757062146894" data-src="https://mmbiz.qpic.cn/mmbiz_png/FiciaODWjVz620sK3Bwfs9vY69Y8o6lzAGmxOlTcBcJqUlHkP1ndfiaGxdCmzeM9V8st45w3Hk5PJ0x9X81pau23A/640?wx_fmt=png" data-type="png" data-w="531" src="https://mmbiz.qpic.cn/mmbiz_png/FiciaODWjVz620sK3Bwfs9vY69Y8o6lzAGmxOlTcBcJqUlHkP1ndfiaGxdCmzeM9V8st45w3Hk5PJ0x9X81pau23A/640?wx_fmt=png"><figcaption>GSE229119</figcaption></figure><h3 data-tool="mdnice编辑器"><span></span>Data preprocessing<span></span></h3><p data-tool="mdnice编辑器">Because the two downloaded datasets were normalized differently, the data must be preprocessed first. This mainly involves <strong>putting the data on a common scale, renaming the samples, and merging the datasets</strong>.</p><p data-tool="mdnice编辑器">For more on data handling in R, see these earlier posts.</p><blockquote data-tool="mdnice编辑器"><p><a href="https://mp.weixin.qq.com/s?__biz=Mzg2NjYzNjQ4Ng==&mid=2247486354&idx=1&sn=ef696ddffb7c1b2ee3fda6e96227ec90&chksm=ce468c3bf931052d5583e7f0ee8776bbe6ae5189ce38ec0e0a5f1d2e5d95dc88f0d9164a1af5&token=560068309&lang=zh_CN&scene=21#wechat_redirect" data-linktype="2">R data processing | Using the dplyr package (1)</a><br><a 
href="https://mp.weixin.qq.com/s?__biz=Mzg2NjYzNjQ4Ng==&mid=2247486354&idx=1&sn=ef696ddffb7c1b2ee3fda6e96227ec90&chksm=ce468c3bf931052d5583e7f0ee8776bbe6ae5189ce38ec0e0a5f1d2e5d95dc88f0d9164a1af5&token=560068309&lang=zh_CN&scene=21#wechat_redirect" data-linktype="2">R data processing | Using the dplyr package (2)</a><br><a href="https://mp.weixin.qq.com/s?__biz=Mzg2NjYzNjQ4Ng==&mid=2247486453&idx=1&sn=79303287c1c2d66c8edf435e4224ca83&chksm=ce468c5cf931054a8478231f88885e2eb3708d50514d72e390ebcd5cbad164ac183bb565dbde&token=560068309&lang=zh_CN&scene=21#wechat_redirect" data-linktype="2">R data processing | Differences between the apply family of functions and how to use them</a></p></blockquote><pre data-tool="mdnice编辑器"><span></span><code>GSE211167_tpm <- data.frame(t(GSE211167_tpm)) <span>#transpose rows and columns</span><br>colnames(GSE211167_tpm) <- c(paste0(<span>"Data1_"</span>,<span>1</span>:<span>26</span>)) <span>#rename samples</span><br>GSE211167_tpm$gene <- rownames(GSE211167_tpm) <span>#gene names</span><br><br>GSE229119_tpm <- GSE229119_log2TPM %>% select(-gene) <span>#drop the gene column</span><br>GSE229119_tpm <- data.frame(apply(GSE229119_tpm, <span>2</span>, <span>function</span>(x){<span>2</span>^(x)})) <span>#convert back to TPM (assumes the file stores log2(TPM); use 2^x - 1 if it stores log2(TPM + 1))</span><br>colnames(GSE229119_tpm) <- c(paste0(<span>"Data2_"</span>,<span>1</span>:<span>28</span>))<span>#rename samples</span><br>GSE229119_tpm$gene <- GSE229119_log2TPM$gene <span>#gene names</span><br><br>combine_data <- inner_join(GSE211167_tpm,GSE229119_tpm, by = <span>"gene"</span>) <span>#merge on gene; Data1 columns come first so the column order matches the batch vector used later</span><br><br>norm_data <- combine_data %>% select(-gene) <span>#drop the gene names</span><br>norm_data <- data.frame(apply(norm_data, <span>2</span>, <span>function</span>(x){round(log2(x+<span>1</span>),<span>2</span>)})) <span>#put both datasets on the same log2(TPM + 1) scale</span><br>rownames(norm_data) <- combine_data$gene <span>#set row names</span><br></code></pre><h3 data-tool="mdnice编辑器"><span></span>Checking for batch effects<span></span></h3><p data-tool="mdnice编辑器">A box plot is a quick way to check for batch effects. PCA plots and heatmaps also work, but box plots are the most commonly used.</p><pre data-tool="mdnice编辑器"><span></span><code><span>## Check for batch effects</span><br><span>## Take 10 samples from each dataset and inspect them with a box plot</span><br>box_draw <- norm_data %>% 
select((paste0(<span>"Data1_"</span>,<span>1</span>:<span>10</span>)),(paste0(<span>"Data2_"</span>,<span>1</span>:<span>10</span>)))<br>boxplot(box_draw, col = <span>"lightblue"</span>,las = <span>2</span>)<br></code></pre><figure data-tool="mdnice编辑器"><img data-ratio="0.6175925925925926" data-src="https://mmbiz.qpic.cn/mmbiz_png/FiciaODWjVz620sK3Bwfs9vY69Y8o6lzAGib8K8ibh6AyItiaNszvuCeJdhIUbObyicgF8zic3nibDPAmAJhw02j8Oby2Q/640?wx_fmt=png" data-type="png" data-w="1080" src="https://mmbiz.qpic.cn/mmbiz_png/FiciaODWjVz620sK3Bwfs9vY69Y8o6lzAGib8K8ibh6AyItiaNszvuCeJdhIUbObyicgF8zic3nibDPAmAJhw02j8Oby2Q/640?wx_fmt=png"></figure><p data-tool="mdnice编辑器">The two datasets show a clear batch effect.</p><h3 data-tool="mdnice编辑器"><span></span>Removing batch effects with limma<span></span></h3><p data-tool="mdnice编辑器">The removeBatchEffect function in the limma package can be used to remove batch effects.</p><pre data-tool="mdnice编辑器"><span></span><code><span>## Use limma's removeBatchEffect function</span><br><span>library</span>(limma)<br><span># Sample group labels; assigned arbitrarily here, adjust them to your own data</span><br>group_list <- c(rep(<span>"tumor"</span>,<span>13</span>),rep(<span>"normal"</span>,<span>13</span>),rep(<span>"tumor"</span>,<span>14</span>),rep(<span>"normal"</span>,<span>14</span>)) <br>batch <- c(rep(<span>"Data1"</span>,<span>26</span>),rep(<span>"Data2"</span>,<span>28</span>)) <span>#batch labels</span><br>design=model.matrix(~group_list)<br>Batch_limma <- removeBatchEffect(norm_data,batch = batch,design = design)<br><br><span>## Box plot after batch-effect removal</span><br>Batch_limma_draw <- Batch_limma %>% data.frame() %>% select((paste0(<span>"Data1_"</span>,<span>1</span>:<span>10</span>)),(paste0(<span>"Data2_"</span>,<span>1</span>:<span>10</span>)))<br>boxplot(Batch_limma_draw, col = <span>"lightblue"</span>,las = <span>2</span>)<br></code></pre><figure data-tool="mdnice编辑器"><img data-ratio="0.6584234930448223" data-src="https://mmbiz.qpic.cn/mmbiz_png/FiciaODWjVz620sK3Bwfs9vY69Y8o6lzAG8BeKwHBHeNmM8gPibDx3Tx6QicvWTEcgd8Q6LAmOfIVvSO0qyWOTgkYg/640?wx_fmt=png" data-type="png" data-w="647" 
src="https://mmbiz.qpic.cn/mmbiz_png/FiciaODWjVz620sK3Bwfs9vY69Y8o6lzAG8BeKwHBHeNmM8gPibDx3Tx6QicvWTEcgd8Q6LAmOfIVvSO0qyWOTgkYg/640?wx_fmt=png"><figcaption>After batch-effect removal</figcaption></figure><p data-tool="mdnice编辑器"><strong>With limma, in addition to the batch labels you should also supply the sample grouping (through the design matrix), so that differences between groups are preserved while the batch effect is removed.</strong></p><h3 data-tool="mdnice编辑器"><span></span>Removing batch effects with sva<span></span></h3><p data-tool="mdnice编辑器">The ComBat function in the sva package is designed specifically for removing batch effects. <strong>It needs only the batch labels; grouping information is optional</strong> (it can be passed through the <code>mod</code> argument). This is the one I personally recommend.</p><pre data-tool="mdnice编辑器"><span></span><code><span>library</span>(sva)<br>batch <- c(rep(<span>"Data1"</span>,<span>26</span>),rep(<span>"Data2"</span>,<span>28</span>)) <span>#batch labels</span><br>combat_data <- ComBat(as.matrix(norm_data), batch = batch) <span>#ComBat works on a numeric matrix</span><br><br><span>## Box plot after batch-effect removal</span><br>combat_data_draw <- combat_data %>% data.frame() %>% select((paste0(<span>"Data1_"</span>,<span>1</span>:<span>10</span>)),(paste0(<span>"Data2_"</span>,<span>1</span>:<span>10</span>)))<br>boxplot(combat_data_draw, col = <span>"lightblue"</span>,las = <span>2</span>)<br></code></pre><figure data-tool="mdnice编辑器"><img data-ratio="0.6584234930448223" data-src="https://mmbiz.qpic.cn/mmbiz_png/FiciaODWjVz620sK3Bwfs9vY69Y8o6lzAG8BeKwHBHeNmM8gPibDx3Tx6QicvWTEcgd8Q6LAmOfIVvSO0qyWOTgkYg/640?wx_fmt=png" data-type="png" data-w="647" src="https://mmbiz.qpic.cn/mmbiz_png/FiciaODWjVz620sK3Bwfs9vY69Y8o6lzAG8BeKwHBHeNmM8gPibDx3Tx6QicvWTEcgd8Q6LAmOfIVvSO0qyWOTgkYg/640?wx_fmt=png"></figure><h4 data-tool="mdnice编辑器"><span></span>Those are the two ways to remove batch effects. The advantage of limma is convenience for differential analysis: the batch can be handled directly within the limma workflow, for example by including it in the model. With sva, batch correction is run as a separate step before the differential analysis.</h4></section><p><mp-style-type data-value="3"></mp-style-type></p></div> | ||
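
To make the idea behind batch correction concrete, here is a minimal base-R sketch on simulated data. It is not the limma or ComBat algorithm: the helper `center_by_batch` is a hypothetical name introduced only for illustration, and it removes a purely additive per-gene batch shift by re-centering each batch on the overall gene mean. The simulated matrix, the batch sizes, and the size of the shift are all assumptions. The sketch also uses PCA (mentioned above as an alternative to box plots) to measure how far apart the two batches sit before and after correction.

```r
# Simulated data (all values are made up): 200 genes x 12 samples, two batches of 6.
set.seed(1)
n_genes <- 200
batch <- factor(rep(c("Data1", "Data2"), each = 6))
expr <- matrix(rnorm(n_genes * 12, mean = 5, sd = 1), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0(batch, "_", rep(1:6, 2))))

# Add a gene-specific additive shift to every sample in the second batch.
shift <- rnorm(n_genes, mean = 2, sd = 0.5)
expr[, batch == "Data2"] <- expr[, batch == "Data2"] + shift

# Hypothetical helper: for each gene, subtract the batch mean and add back
# the overall mean, so every batch ends up with the same per-gene mean.
center_by_batch <- function(x, batch) {
  overall <- rowMeans(x)
  for (b in levels(batch)) {
    idx <- batch == b
    x[, idx] <- x[, idx] - rowMeans(x[, idx, drop = FALSE]) + overall
  }
  x
}
corrected <- center_by_batch(expr, batch)

# Distance between the two batch centroids along the first principal component.
pc_gap <- function(x, batch) {
  pc1 <- prcomp(t(x))$x[, 1]
  abs(mean(pc1[batch == "Data1"]) - mean(pc1[batch == "Data2"]))
}
gap_before <- pc_gap(expr, batch)
gap_after <- pc_gap(corrected, batch)

# Largest remaining per-gene difference between the two batch means.
max_mean_diff <- max(abs(rowMeans(corrected[, batch == "Data1"]) -
                         rowMeans(corrected[, batch == "Data2"])))
```

Comparing `gap_before` with `gap_after` shows the batches collapsing onto each other along PC1. Note that this simple centering has no notion of biological groups, so on real data with unbalanced groups it can also remove real signal; that is why the limma call above passes a design matrix.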
<hr> | ||
<a href="https://mp.weixin.qq.com/s/e4eW_Blu4zMNP0gGbmT_jw" target="_blank" rel="noopener noreferrer">Original article</a> |