<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 &#187; Enrichment Analysis</title>
	<atom:link href="http://www.bio-info-trainee.com/tag/%e5%af%8c%e9%9b%86%e5%88%86%e6%9e%90/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
	<description>Join the discussion on the forum at biotrainee.com, or follow the WeChat official account of the same name, biotrainee.</description>
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>Gene Set Enrichment Analysis with GSEA</title>
		<link>http://www.bio-info-trainee.com/1282.html</link>
		<comments>http://www.bio-info-trainee.com/1282.html#comments</comments>
		<pubDate>Wed, 30 Dec 2015 01:17:43 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[GSEA]]></category>
		<category><![CDATA[Enrichment Analysis]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1282</guid>
		<description><![CDATA[How to use GSEA? It is somewhat similar to pathway (GO, KEGG, etc. &#8230; <a href="http://www.bio-info-trainee.com/1282.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div><span style="color: #ff0000; font-family: FangSong_GB2312;"><b>how to use GSEA?</b></span></div>
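<div>GSEA is somewhat similar to pathway (GO, KEGG, etc.) enrichment analysis. The difference is that a gene set, drawn from curated, literature-based databases, is a broader concept than a pathway.</div>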
<p><b><span style="color: #ff0000; font-family: FangSong_GB2312;">How to download GSEA?</span></b></p>
<div>Software download page: <a href="http://software.broadinstitute.org/gsea/downloads.jsp" target="_blank">http://software.broadinstitute.org/gsea/downloads.jsp</a></p>
<div>Tutorial: <a href="http://software.broadinstitute.org/gsea/doc/desktop_tutorial.jsp" target="_blank">http://software.broadinstitute.org/gsea/doc/desktop_tutorial.jsp</a></div>
<div>You need to install a Java runtime yourself first.</div>
</div>
<p><b><span style="color: #ff0000; font-family: FangSong_GB2312;">What is the input for GSEA?</span></b></p>
<div>The documentation describes the input data as follows: GSEA supported data files are simply tab delimited ASCII text files, which have special file extensions that identify them. For example, expression data usually has the extension *.gct, phenotypes *.cls, gene sets *.gmt, and chip annotations *.chip. Click the <b>More on file formats</b> help button to view detailed descriptions of all the data file formats.</div>
<div>
<div>Test datasets are also provided: <a href="http://software.broadinstitute.org/gsea/datasets.jsp" target="_blank">http://software.broadinstitute.org/gsea/datasets.jsp</a></div>
<div>In practice it is not that complicated: an expression matrix plus a cls file describing the sample groups is enough.</div>
<div>The main work is reading the documentation and producing files in the required formats: <a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats" target="_blank">http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats</a></div>
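<div>As a rough illustration of turning an expression matrix into a gct file, here is a minimal base R sketch. The helper name <code>write_gct</code>, the toy matrix, and the "na" description column are assumptions made for this example, not part of GSEA; the layout follows the GCT v1.2 description on the Data formats page (a version line, a dimensions line, then a tab-delimited table whose first two columns are NAME and Description).</div>

```r
# Toy expression matrix: 4 probes x 6 samples (values are made up).
set.seed(1)
expr = matrix(round(runif(24, 5, 12), 2), nrow = 4,
              dimnames = list(paste0("probe", 1:4), paste0("S", 1:6)))

# Hypothetical helper: write a numeric matrix in GCT v1.2 layout.
write_gct = function(mat, file) {
  con = file(file, "w")
  on.exit(close(con))
  writeLines("#1.2", con)                                   # version line
  writeLines(paste(nrow(mat), ncol(mat), sep = "\t"), con)  # rows, samples
  body = data.frame(NAME = rownames(mat), Description = "na", mat,
                    check.names = FALSE)
  # Header row plus one row per probe, tab-delimited, no quoting.
  write.table(body, con, sep = "\t", quote = FALSE, row.names = FALSE)
}

gct_file = file.path(tempdir(), "toy.gct")
write_gct(expr, gct_file)
gct_lines = readLines(gct_file)
```

<div>With a real dataset, the matrix would come from the series matrix file instead of <code>runif</code>, and the resulting file can be checked by loading it in the GSEA desktop application.</div>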
<div>For the expression matrix, I downloaded the GSE1009 dataset as a test case.</div>
<div>
<div><a href="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse1009">http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse1009</a></div>
<div><a href="ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1009/matrix/GSE1009_series_matrix.txt.gz" target="_blank">ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1009/matrix/GSE1009_series_matrix.txt.gz</a></div>
</div>
<div>
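<div>The cls file describes the sample groups, and a simple one will do. Here is an example:</div>
<div>6 2 1</div>
<div># good bad</div>
<div>good good good bad bad bad</div>
<div>The expression file is shown below: six samples with probe-level expression values, split into two groups of three (the first three and the last three).</div>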
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/clipboard7.png"><img class="alignnone size-full wp-image-1283" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/clipboard7.png" alt="clipboard" width="538" height="427" /></a></div>
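<div>The three-line cls example above can also be generated from a vector of group labels. This is only a sketch: <code>write_cls</code> is a hypothetical helper name, and it assumes a categorical phenotype with labels listed in sample order.</div>

```r
# Hypothetical helper: write a categorical CLS file from a vector of labels.
write_cls = function(groups, file) {
  classes = unique(groups)  # class names in order of first appearance
  writeLines(c(
    paste(length(groups), length(classes), 1),   # samples, classes, always 1
    paste("#", paste(classes, collapse = " ")),  # class names
    paste(groups, collapse = " ")                # one label per sample
  ), file)
}

groups = rep(c("good", "bad"), each = 3)
cls_file = file.path(tempdir(), "toy.cls")
write_cls(groups, cls_file)
cls_lines = readLines(cls_file)
```

<div>Here <code>cls_lines</code> holds the same three lines as the hand-written example, so the labels line up with the six sample columns of the expression file.</div>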
</div>
</div>
<p><b><span style="color: #ff0000; font-family: FangSong_GB2312;">Start running GSEA</span></b></p>
<div>
<div></div>
<div>First, load the data.</div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/clipboard8.png"><img class="alignnone size-full wp-image-1284" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/clipboard8.png" alt="clipboard" width="520" height="389" /></a></div>
<div>Once you have confirmed that the data loaded correctly, start the run; a few parameters need to be set first.</div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/clipboard9.png"><img class="alignnone size-full wp-image-1285" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/clipboard9.png" alt="clipboard" width="759" height="533" /></a></div>
</div>
<p><b><span style="color: #ff0000; font-family: FangSong_GB2312;">What is the output?</span></b></p>
<div>
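<div>GSEA produces a lot of output. Every set in the gene set collection you selected is tested against the enrichment criteria, and a report is generated for each enriched set.</div>
<div></div>
<div>Click success to open the report index page; the links there lead to each individual report.</div>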
<div></div>
<div>Its most distinctive feature is the large collection of gene sets it provides: You can browse the MSigDB from the <a href="http://software.broadinstitute.org/gsea/msigdb/index.jsp" target="_blank">Molecular Signatures Database</a> page of the GSEA web site or the Browse MSigDB page of the GSEA application.</div>
<div></div>
<div>The project also maintains its own wiki documentation: <span style="color: #000000; font-family: Verdana,Arial,Helvetica,sans-serif;"><a href="http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page" target="_blank">http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page</a></span></div>
<div>Some publications based on GSEA:</div>
<div><a href="http://www.ncbi.nlm.nih.gov/pubmed/16199517" target="_blank">www.ncbi.nlm.nih.gov/pubmed/16199517</a></div>
<div><a href="http://stke.sciencemag.org/highwire/filestream/4681053/field_highwire_adjunct_files/1/2001966_Slides.zip" target="_blank">http://stke.sciencemag.org/highwire/filestream/4681053/field_highwire_adjunct_files/1/2001966_Slides.zip</a></div>
<div><a href="http://www.ingentaconnect.com/content/ben/cbio/2007/00000002/00000002/art00003" target="_blank">http://www.ingentaconnect.com/content/ben/cbio/2007/00000002/00000002/art00003</a></div>
<div><a href="http://www.nature.com/articles/ng0704-663a" target="_blank">http://www.nature.com/articles/ng0704-663a</a></div>
<div><a href="http://bioinformatics.oxfordjournals.org/content/23/23/3251.short" target="_blank">http://bioinformatics.oxfordjournals.org/content/23/23/3251.short</a></div>
<div><a href="http://link.springer.com/article/10.1007/s00335-011-9359-x" target="_blank">http://link.springer.com/article/10.1007/s00335-011-9359-x</a></div>
<div>
<h3><a href="http://link.springer.com/article/10.1007/s00335-011-9359-x" target="_blank">Identification of high-copper-responsive target pathways in Atp7b knockout mouse liver by GSEA on microarray data sets</a></h3>
</div>
<div>
<h3><a href="http://synapse.koreamed.org/search.php?where=aview&amp;id=10.4110/in.2011.11.6.406&amp;code=0078IN&amp;vmode=FULL" target="_blank">Comparison of invariant NKT cells with conventional T cells by using gene set enrichment analysis (GSEA)</a></h3>
</div>
</div>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1282.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Enrichment Analysis with the Hypergeometric Test</title>
		<link>http://www.bio-info-trainee.com/1225.html</link>
		<comments>http://www.bio-info-trainee.com/1225.html#comments</comments>
		<pubDate>Tue, 15 Dec 2015 13:07:09 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Essays and Notes]]></category>
		<category><![CDATA[Enrichment Analysis]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Hypergeometric Distribution]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1225</guid>
		<description><![CDATA[We can use functions from GOstats, a Bioconductor package for R &#8230; <a href="http://www.bio-info-trainee.com/1225.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
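				<content:encoded><![CDATA[<p>We can use functions from GOstats, a Bioconductor package for R, to run a hypergeometric test and see whether each pathway is enriched.</p>
<p>We start by reading in differential expression results already produced with the limma package.</p>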
<p>setwd("D:\\my_tutorial\\补\\用limma包对芯片数据做差异分析")</p>
<p>DEG=read.table("GSE63067.diffexp.NASH-normal.txt",stringsAsFactors = F)</p>
<p>View(DEG)</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0015.png"><img class="alignnone size-full wp-image-1227" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0015.png" alt="image001" width="552" height="319" /></a></p>
<p>We select genes with an absolute logFC greater than 0.5 and a P-value below 0.05 as differentially expressed, then convert them to Entrez IDs.</p>
<p>probeset=rownames(DEG[abs(DEG[,1])&gt;0.5 &amp; DEG[,4]&lt;0.05,])</p>
<p>library(hgu133plus2.db)</p>
<p>library(annotate)</p>
<p>platformDB="hgu133plus2.db";</p>
<p>EGID &lt;- as.numeric(lookUp(probeset, platformDB, "ENTREZID"))</p>
<p>length(unique(EGID))</p>
<p>#[1] 775</p>
<p>diff_gene_list &lt;- unique(EGID)</p>
<p>This gives us a list of 775 differentially expressed genes.</p>
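<p>The filtering and ID conversion can be tried without the annotation packages. The sketch below uses a made-up limma-style table and a hypothetical probe-to-Entrez lookup vector (<code>probe2entrez</code>) in place of hgu133plus2.db; it also drops probes that have no Entrez ID, which is an extra step assumed here so that NA is not counted as a gene.</p>

```r
# Toy limma-style result table; probe IDs and values are made up.
DEG = data.frame(
  logFC   = c(1.20, -0.80, 0.30, -0.10, 0.90, 0.70),
  P.Value = c(0.001, 0.020, 0.010, 0.500, 0.030, 0.040),
  row.names = paste0("p", 1:6)
)

# Keep probes with |logFC| above 0.5 and P-value below 0.05.
keep = abs(DEG$logFC) > 0.5 & DEG$P.Value < 0.05
probeset = rownames(DEG)[keep]

# Hypothetical lookup: p1 and p6 hit the same gene, p5 has no Entrez ID.
probe2entrez = c(p1 = "1017", p2 = "7157", p3 = "672",
                 p4 = "675", p5 = NA, p6 = "1017")
EGID = unname(probe2entrez[probeset])

# One entry per gene, with unmapped probes removed.
diff_gene_list = unique(EGID[!is.na(EGID)])
```

<p>Four probes pass the filter here, but they collapse to two genes once the duplicate and the unmapped probe are removed.</p>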
<p>First, we run the hypergeometric test with GOstats to see whether each pathway is enriched.</p>
<p>library(GOstats)</p>
<p>library(org.Hs.eg.db)</p>
<p># then run the KEGG pathway enrichment</p>
<p>hyperG.params = new("KEGGHyperGParams", geneIds=diff_gene_list, universeGeneIds=NULL, annotation="org.Hs.eg.db",</p>
<p>categoryName="KEGG", pvalueCutoff=1, testDirection = "over")</p>
<p>KEGG.hyperG.results = hyperGTest(hyperG.params);</p>
<p>htmlReport(KEGG.hyperG.results, file="kegg.enrichment.html", summary.args=list("htmlLinks"=TRUE))</p>
<p>The results are as follows:</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0025.png"><img class="alignnone size-full wp-image-1228" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0025.png" alt="image002" width="858" height="395" /></a></p>
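<p>But this way we skip over the underlying principle: we do not know how these numbers were computed, and have merely obtained results from a package someone else wrote.</p>
<p>In fact, the hyperGTest function in this package is nothing more than a wrapper around a hypergeometric test.</p>
<p>Once we understand the statistics behind it, we can write our own function to do the same thing.</p>
<div>The hypergeometric distribution is simple: the balls are either black or white, in known numbers, and you draw a fixed number of them at random without replacement. The question is how many white balls you should expect to draw.</div>
<div>The formula is exp_count=n*M/N, where N is the total number of balls, M is the number of white balls, and n is the number of balls drawn.</div>
<div>Given the number of white balls you actually drew, you can then compute a probability.</div>
<div>In pathway enrichment terms: how many genes there are in total, how many genes belong to the pathway, and how many of the pathway's genes were drawn (that is, how many differentially expressed genes belong to the pathway). These numbers are enough to compute every value in the table above.</div>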
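<div>To make the urn picture concrete, here is a small worked example in base R. All four numbers are made up for illustration; the symbols N, M, n, and k follow the usage in this post. The p-value is the probability of drawing k or more white balls, computed once with <code>phyper</code> and once by summing the point probabilities from <code>dhyper</code> as a cross-check.</div>

```r
N = 20000  # total balls: background genes (assumed value)
M = 100    # white balls: genes in the pathway (assumed value)
n = 500    # balls drawn: differentially expressed genes (assumed value)
k = 8      # white balls drawn: DE genes in the pathway (assumed value)

# Expected number of white balls among the n drawn.
exp_count = n * M / N

# Upper tail: probability of k or more white balls.
# phyper() returns P(X > q) with lower.tail = FALSE, hence k - 1.
p_upper = phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# Same tail, summed term by term.
p_check = sum(dhyper(k:min(M, n), M, N - M, n))
```

<div>With these numbers the expected count is 2.5, so observing 8 pathway genes is well above expectation and the upper-tail probability is small.</div>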
<div></div>
<div><span style="font-family: Tahoma;">tmp=toTable(org.Hs.egPATH)</span></div>
<div>GeneID2Path=tapply(tmp[,2],as.factor(tmp[,1]),function(x) x)</div>
<div><span style="font-family: Tahoma;">Path2GeneID=tapply(tmp[,1],as.factor(tmp[,2]),function(x) x)</span></div>
<div><span style="font-family: Tahoma;">#phyper(k-1,M, N-M, n, lower.tail=F)</span></div>
<div><span style="font-family: Tahoma;">#n*M/N</span></div>
<div><span style="font-family: Tahoma;">diff_gene_has_path=intersect(diff_gene_list,names(GeneID2Path))</span></div>
<div><span style="font-family: Tahoma;">n=length(diff_gene_has_path) #321 # the total number of balls drawn</span></div>
<div><span style="font-family: Tahoma;">N=length(GeneID2Path) #5870  ## the total number of balls<span style="color: #ff0000;"><strong><span style="text-decoration: underline;"> (this is wrong: the total number of balls depends on the background genes, usually about 20,000)</span></strong></span></span></div>
<div><span style="font-family: Tahoma;">options(digits = 4)</span></div>
<div><span style="font-family: Tahoma;">for (i in names(Path2GeneID)){</span></div>
<div><span style="font-family: Tahoma;"> M=length(Path2GeneID[[i]])  ## the number of white balls among all the balls</span></div>
<div><span style="font-family: Tahoma;"> exp_count=n*M/N  ### the expected number of white balls among those drawn</span></div>
<div><span style="font-family: Tahoma;"> k=0         ## k is the number of white balls actually drawn</span></div>
<div><span style="font-family: Tahoma;"> for (j in diff_gene_has_path){</span></div>
<div><span style="font-family: Tahoma;"> if (i %in% GeneID2Path[[j]]) k=k+1</span></div>
<div><span style="font-family: Tahoma;"> }</span></div>
<div><span style="font-family: Tahoma;"> OddsRatio=k/exp_count  ## note: this is observed/expected (fold enrichment), not an odds ratio from a 2x2 table</span></div>
<div><span style="font-family: Tahoma;"> p=phyper(k-1,M, N-M, n, lower.tail=F)  ## the enrichment p-value, from the number of white balls actually drawn</span></div>
<div><span style="font-family: Tahoma;"> print(paste(i,p,OddsRatio,exp_count,k,M,sep="    "))</span></div>
<div><span style="font-family: Tahoma;">}</span></div>
<div>A quick spot check shows that the results match the GOstats report.</div>
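<div>The loop above can also be wrapped in a reusable function. The following is a minimal sketch on toy data, not the GOstats implementation: <code>enrich_hyper</code> and its arguments are hypothetical names, and the background is passed in explicitly as <code>universe</code> so that the choice of N (the point flagged in red above) is visible rather than implied.</div>

```r
# Hypothetical helper: hypergeometric over-representation test per gene set.
enrich_hyper = function(diff_genes, path2gene, universe) {
  diff_genes = intersect(diff_genes, universe)
  N = length(universe)    # all balls
  n = length(diff_genes)  # balls drawn
  rows = lapply(names(path2gene), function(id) {
    members = intersect(path2gene[[id]], universe)
    M = length(members)                         # white balls
    k = length(intersect(diff_genes, members))  # white balls drawn
    data.frame(pathway = id, size = M, count = k, exp_count = n * M / N,
               p = phyper(k - 1, M, N - M, n, lower.tail = FALSE))
  })
  res = do.call(rbind, rows)
  res[order(res$p), ]  # most significant first
}

# Toy data: 20 background genes and two made-up gene sets.
universe = paste0("g", 1:20)
path2gene = list(setA = paste0("g", 1:5), setB = paste0("g", 11:16))
diff_genes = c("g1", "g2", "g3", "g4", "g11")
res = enrich_hyper(diff_genes, path2gene, universe)
```

<div>In this toy case setA contains four of the five drawn genes and comes out on top, while setB contains one and is not enriched. For real data, <code>path2gene</code> would be the Path2GeneID list built above, and <code>universe</code> whichever background gene set you decide is appropriate.</div>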
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1225.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
