<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 &#187; Statistics</title>
	<atom:link href="http://www.bio-info-trainee.com/tag/%e7%bb%9f%e8%ae%a1/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
	<description>Join the discussion on the forum at biotrainee.com, or follow the WeChat official account of the same name, biotrainee</description>
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>Logistic Regression Analysis in R</title>
		<link>http://www.bio-info-trainee.com/1574.html</link>
		<comments>http://www.bio-info-trainee.com/1574.html#comments</comments>
		<pubDate>Thu, 14 Apr 2016 12:50:00 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Bioinformatics Basics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Logistic Regression]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1574</guid>
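		<description><![CDATA[At its core, regression builds a model for prediction; what sets logistic regression apart is that the predicted outcome can take only two values &#8230; <a href="http://www.bio-info-trainee.com/1574.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>At its core, regression builds a model for prediction. What sets logistic regression apart is that the predicted outcome can take only two values: true or false.</p>
<p>Logistic regression is easy to do in R: prepare the data set, fit the model with the glm function (generalized linear model), and make predictions with the predict function.</p>
<p>Here I walk through a simple example taken from <a href="http://www.ats.ucla.edu/stat/r/dae/logit.htm">a UCLA course</a>.</p>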
<p><span id="more-1574"></span></p>
<p>I wrote this one in R Markdown; <a href="http://www.bio-info-trainee.com/tmp/tutorial_for_logical_analysis.html">the full tutorial is here</a>.</p>
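<p>The glm/predict workflow described above can be sketched as follows. This is a minimal illustration on simulated data, not the UCLA example itself: the variable names (score, admit, new_obs) and the simulated coefficients are assumptions made up for this sketch.</p>

```r
# Simulated data: a binary outcome driven by one numeric predictor
# (all names and coefficients here are hypothetical)
set.seed(1)
score = rnorm(200)
admit = rbinom(200, size = 1, prob = plogis(-0.5 + 1.5 * score))
dat = data.frame(admit = admit, score = score)

# family = binomial selects the logit link, i.e. logistic regression
fit = glm(admit ~ score, data = dat, family = binomial)

# type = "response" returns predicted probabilities rather than log-odds
new_obs = data.frame(score = c(-2, 0, 2))
prob = predict(fit, newdata = new_obs, type = "response")

# Turn probabilities into TRUE/FALSE calls with a 0.5 cutoff
pred = prob > 0.5
print(data.frame(new_obs, prob = prob, pred = pred))
```

<p>The 0.5 cutoff is only one possible choice; in practice it would be tuned to the cost of false positives versus false negatives.</p>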
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1574.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Enrichment Analysis with Resampling + Principal Components</title>
		<link>http://www.bio-info-trainee.com/1237.html</link>
		<comments>http://www.bio-info-trainee.com/1237.html#comments</comments>
		<pubDate>Mon, 21 Dec 2015 14:20:40 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Bioinformatics Basics]]></category>
		<category><![CDATA[Principal Components]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Resampling]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1237</guid>
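		<description><![CDATA[Previously we ran an enrichment analysis with the hypergeometric test, using GSE63067.diffe &#8230; <a href="http://www.bio-info-trainee.com/1237.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Previously we ran an enrichment analysis with the <strong>hypergeometric test</strong>: from GSE63067.diffexp.NASH-normal.txt we took the genes with an absolute logFC greater than 0.5 and a P-value below 0.05 as the <strong>differentially expressed genes</strong>, and tested which KEGG pathways were enriched.</p>
<p>The result looked like this:</p>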
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0017.png"><img class="alignnone size-full wp-image-1238" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0017.png" alt="image001" width="858" height="395" /></a></p>
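<p>Next we do the enrichment analysis with a different method, and along the way check whether the hypergeometric test really is the best approach to enrichment analysis.</p>
<p><strong>The method is resampling + principal component analysis (PCA).</strong></p>
<p><strong>The rough idea: take pathway 04380 from the table above, which contains 128 genes.</strong> From the original expression matrix I <strong>randomly draw 128 genes</strong>, <strong>run PCA</strong> on their expression matrix, and repeat this 1,000 times; each PCA yields the <strong>proportion of variance explained by the first principal component</strong>. Then, instead of drawing at random, I take exactly the 128 genes of pathway 04380, run PCA on them, and compute the same value for the first principal component. Finally we check how far this value lies from the 1,000 values obtained by random drawing.</p>
<p>This is where the expression matrix comes in!</p>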
<p>setwd("D:\\my_tutorial\\补\\用limma包对芯片数据做差异分析")</p>
<p>exprSet=read.table("GSE63067_series_matrix.txt.gz",comment.char = "!",stringsAsFactors=F,header=T)</p>
<p>rownames(exprSet)=exprSet[,1]</p>
<p>exprSet=exprSet[,-1]</p>
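<p>From the NCBI description of GSE63067 we can tell which sample IDs correspond to NASH and normal, so we can extract just the expression matrix we need.</p>
<p>Simply drop the first two samples, which belong to the Steatosis group:</p>
<p>exprSet=exprSet[,-c(1:2)]</p>
<p>Then convert the array probe IDs to Entrez IDs:</p>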
<p>library(hgu133plus2.db)</p>
<p>library(annotate)</p>
<p>platformDB="hgu133plus2.db";</p>
<p>probeset &lt;- rownames(exprSet)</p>
<p>rowMeans &lt;- rowMeans(exprSet)</p>
<p>EGID &lt;- as.numeric(lookUp(probeset, platformDB, "ENTREZID"))</p>
<p>match_row=aggregate(rowMeans,by=list(EGID),max)</p>
<p>colnames(match_row)=c("EGID","rowMeans")</p>
<p>dat=data.frame(EGID,rowMeans,probeset)</p>
<p>tmp_prob=merge(dat,match_row,by=c("EGID","rowMeans"))</p>
<p>relevantProbesets=as.character(tmp_prob$probeset)</p>
<p>length(relevantProbesets) #hgu133plus2.db  20156</p>
<p>exprSet=exprSet[relevantProbesets,]</p>
<p>EGID_name=as.numeric(lookUp(relevantProbesets, platformDB, "ENTREZID"))</p>
<p>rownames(exprSet)=as.character(EGID_name)</p>
<p>d=exprSet</p>
<p>This finally gives us the expression matrix:</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0027.png"><img class="alignnone size-full wp-image-1239" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0027.png" alt="image002" width="666" height="304" /></a></p>
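<p>First we run PCA on 1,000 random draws of 128 genes and record the proportion of variance explained by the first principal component each time.</p>
<p>library(gmodels) # provides fast.prcomp</p>
<p>gene128=sapply(1:1000,function(y) {</p>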
<p>dat=t(d[<strong>sample</strong>(row.names(d), 128, replace=TRUE), ]);</p>
<p>round(100*summary(fast.prcomp(dat))$importance[2,1],2)</p>
<p>}</p>
<p>)</p>
<p>The result comes back quickly; the values are summarized below:</p>
<p>&gt;  summary(gene128)</p>
<p>Min. 1st Qu.  Median    Mean 3rd Qu.    Max.</p>
<p>19.1    25.8    28.8    29.8    32.5    59.7</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0034.png"><img class="alignnone size-full wp-image-1240" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0034.png" alt="image003" width="680" height="255" /></a></p>
<p>&nbsp;</p>
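<p>Next we take the <strong>128 genes that belong to pathway 04380</strong> and compute the proportion of variance explained by the first principal component (Path2GeneID is the pathway-to-gene list built with tapply in the hypergeometric-test post):</p>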
<p>path_04380_gene=intersect(rownames(d),as.character(Path2GeneID[['04380']]))</p>
<p>dat=t(d[path_04380_gene,]);</p>
<p>round(100*summary(fast.prcomp(dat))$importance[2,1],2)</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0043.png"><img class="alignnone size-full wp-image-1241" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0043.png" alt="image004" width="703" height="265" /></a></p>
<p>The value is <strong>38.83</strong>. Now we check whether 38.83 is unusual among the 1,000 random values obtained earlier, using a normal approximation:</p>
<p>1-pnorm((38.83-mean(gene128))/sd(gene128))</p>
<p>[1] <strong>0.0625</strong></p>
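<p>The value lies well into the upper tail of the random distribution, which suggests that this pathway is enriched, although p = 0.0625 falls just short of the conventional 0.05 cutoff.</p>
<p>At the very least, the enrichment result from the <strong>hypergeometric test</strong> agrees with the result of <strong>resampling + PCA</strong> here. <strong>Of course, the two methods will also disagree in places; otherwise there would be no point in inventing a new one.</strong></p>
<p>Wrapping this in a loop would test every pathway in the same way, and it does not require a list of differentially expressed genes prepared in advance. That is the distinguishing feature of this method!</p>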
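<p>The normal approximation used above assumes the 1,000 random values are roughly bell-shaped. A possible alternative is an empirical p-value: the fraction of random draws at least as large as the observed value. The sketch below illustrates both on simulated stand-in values; null_vals and its mean and standard deviation are assumptions for illustration, not the real gene128 vector.</p>

```r
set.seed(42)
# Stand-in for gene128: 1000 simulated null PC1 percentages (not the real data)
null_vals = rnorm(1000, mean = 30, sd = 6)
observed = 38.83

# Normal-approximation p-value, computed the same way as in the text
p_normal = 1 - pnorm((observed - mean(null_vals)) / sd(null_vals))

# Empirical p-value: share of null values at least as large as the observed one;
# the +1 in numerator and denominator keeps it from being exactly zero
p_empirical = (sum(null_vals >= observed) + 1) / (length(null_vals) + 1)

print(c(normal = p_normal, empirical = p_empirical))
```

<p>The empirical version makes no distributional assumption, but its resolution is limited by the number of random draws.</p>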
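<p>A loop over all pathways could look roughly like the sketch below. It is self-contained and runs on a simulated matrix, so every name in it (d_sim, gene_sets, pc1_pct, test_set) is hypothetical. It also swaps in base R's prcomp for fast.prcomp, draws random genes without replacement, and uses an empirical p-value; all three are choices of this sketch rather than the post's exact procedure.</p>

```r
set.seed(7)
# Simulated expression matrix: 500 genes x 12 samples (stand-in for d)
d_sim = matrix(rnorm(500 * 12), nrow = 500,
               dimnames = list(paste0("g", 1:500), paste0("s", 1:12)))
# Plant a shared signal in the first 30 genes so that one set is coherent
signal = rnorm(12)
d_sim[1:30, ] = d_sim[1:30, ] + matrix(2 * signal, nrow = 30, ncol = 12, byrow = TRUE)

gene_sets = list(coherent = paste0("g", 1:30),
                 random   = paste0("g", 301:330))

# Percentage of variance explained by PC1 for a set of genes
pc1_pct = function(mat, genes) {
  sdev = prcomp(t(mat[genes, , drop = FALSE]))$sdev
  100 * sdev[1]^2 / sum(sdev^2)
}

# Compare one gene set against size-matched random draws
test_set = function(mat, genes, n_perm = 200) {
  obs = pc1_pct(mat, genes)
  null = replicate(n_perm, pc1_pct(mat, sample(rownames(mat), length(genes))))
  c(pc1 = obs, p = (sum(null >= obs) + 1) / (n_perm + 1))
}

# One column per gene set, rows "pc1" and "p"
res = sapply(gene_sets, function(g) test_set(d_sim, g))
print(round(res, 3))
```

<p>Because each gene set gets its own size-matched null distribution, sets of different sizes can be tested in the same loop.</p>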
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1237.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Brief Look at Principal Component Analysis</title>
		<link>http://www.bio-info-trainee.com/1232.html</link>
		<comments>http://www.bio-info-trainee.com/1232.html#comments</comments>
		<pubDate>Mon, 21 Dec 2015 14:16:42 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Bioinformatics Basics]]></category>
		<category><![CDATA[Principal Components]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1232</guid>
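		<description><![CDATA[Principal component analysis serves to reduce the number of variables. Without invoking any advanced statistics, I will briefly explain principal &#8230; <a href="http://www.bio-info-trainee.com/1232.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>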
				<content:encoded><![CDATA[<div class="markdown-here-wrapper" data-md-url="http://www.bio-info-trainee.com/wp-admin/post.php?post=1232&amp;action=edit">
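<p style="margin: 0px 0px 1.2em !important;">Principal component analysis (PCA) serves to reduce the number of variables.<br />
Without invoking any advanced statistics, I will give a quick explanation of PCA. First, we generate a random matrix with the code below:</p>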
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code class="hljs language-shell" style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block; overflow: auto; overflow-x: auto; color: #333333; background: #f8f8f8; text-size-adjust: none;">options(digits = 2)
x=c(rnorm(5),rnorm(5)+4)
y=3*c(rnorm(5),rnorm(5)+4)
dat=rbind(x,y,a=0.1*x,b=0.2*x,c=0.3*x,o=0.1*y,p=0.2*y,q=0.3*y)
colnames(dat)=paste('s',1:10,sep="")
dat
library(gmodels)
pca=fast.prcomp(t(dat))
pca
summary(pca)$importance
biplot(pca, cex=c(1.3, 1.2));
</code></pre>
<p style="margin: 0px 0px 1.2em !important;"><span id="more-1232"></span></p>
<p style="margin: 0px 0px 1.2em !important;">By construction, the variables a, b, c, o, p, q are all related to x and y, and PCA is what uncovers that relationship: a=0.1*x, b=0.2*x, c=0.3*x, o=0.1*y, p=0.2*y, q=0.3*y<br />
<a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0016.png"><img class="alignnone size-full wp-image-1233" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0016.png" alt="image001" width="658" height="233" /></a></p>
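<p style="margin: 0px 0px 1.2em !important;">Without PCA we would have to examine every pairwise combination of variables, which is far too much computation.<br />
We simply use the fast.prcomp function from the gmodels package to run PCA on the dat matrix; the result is:<br />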
<a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0026.png"><img class="alignnone size-full wp-image-1234" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0026.png" alt="image002" width="655" height="413" /></a><br />
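PCA is dimensionality reduction: we started with 8 variables and now have 8 principal components, but in general the first few components already explain nearly all of the data.<br />
Take PC1 and PC2, for example: the plot below, drawn from the result, separates the structure of our matrix very clearly.<br />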
<a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0033.png"><img class="alignnone size-full wp-image-1235" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0033.png" alt="image003" width="618" height="600" /></a><br />
If you want to dig into the underlying statistics, see the materials below:<br />
<a href="http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf">http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf</a><br />
<a href="https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf">https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf</a><br />
<a href="http://www.cs.umd.edu/~samir/498/PCA.pdf">http://www.cs.umd.edu/~samir/498/PCA.pdf</a><br />
<a href="http://www.yale.edu/ceo/Documentation/PCA_Outline.pdf">http://www.yale.edu/ceo/Documentation/PCA_Outline.pdf</a><br />
<a href="http://people.tamu.edu/~alawing/materials/ESSM689/pca.pdf">http://people.tamu.edu/~alawing/materials/ESSM689/pca.pdf</a> (R-related)<br />
<a href="http://www2.dc.ufscar.br/~cesar.souza/publications/pca-tutorial.pdf">http://www2.dc.ufscar.br/~cesar.souza/publications/pca-tutorial.pdf</a> (2012)<br />
There are plenty more in Chinese, so I won't list them here.</p>
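<p style="margin: 0px 0px 1.2em !important;">To make the "first few components explain the data" claim concrete, the proportion of variance can be computed by hand from the component standard deviations. The sketch below rebuilds the same kind of matrix and uses base R's prcomp in place of fast.prcomp; the seed and the name prop_var are assumptions added for this illustration.</p>

```r
set.seed(123)
x = c(rnorm(5), rnorm(5) + 4)
y = 3 * c(rnorm(5), rnorm(5) + 4)
dat = rbind(x, y, a = 0.1 * x, b = 0.2 * x, c = 0.3 * x,
            o = 0.1 * y, p = 0.2 * y, q = 0.3 * y)

# prcomp is base R's PCA; samples must be rows, hence t(dat)
pca = prcomp(t(dat))

# Proportion of variance per component = squared sdev over their sum
prop_var = pca$sdev^2 / sum(pca$sdev^2)
print(round(prop_var, 4))

# Eight variables, but only two independent sources (x and y),
# so the first two components carry essentially all the variance
print(sum(prop_var[1:2]))
```

<p style="margin: 0px 0px 1.2em !important;">These proportions correspond to the "Proportion of Variance" row that summary(pca)$importance reports.</p>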
</div>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1232.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Enrichment Analysis with the Hypergeometric Test</title>
		<link>http://www.bio-info-trainee.com/1225.html</link>
		<comments>http://www.bio-info-trainee.com/1225.html#comments</comments>
		<pubDate>Tue, 15 Dec 2015 13:07:09 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Miscellaneous Notes]]></category>
		<category><![CDATA[Enrichment Analysis]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Hypergeometric Distribution]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1225</guid>
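		<description><![CDATA[We can directly use GOstats, a Bioconductor package for R, and its &#8230; <a href="http://www.bio-info-trainee.com/1225.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>We can directly use the functions in GOstats, a Bioconductor package for R, to run the hypergeometric test and see whether each pathway is enriched.</p>
<p>We read in the differential expression results already produced with the limma package:</p>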
<p>setwd("D:\\my_tutorial\\补\\用limma包对芯片数据做差异分析")</p>
<p>DEG=read.table("GSE63067.diffexp.NASH-normal.txt",stringsAsFactors = F)</p>
<p>View(DEG)</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0015.png"><img class="alignnone size-full wp-image-1227" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0015.png" alt="image001" width="552" height="319" /></a></p>
<p>We select the genes with an absolute logFC greater than 0.5 and a P-value below 0.05 as differentially expressed genes, and convert them to Entrez IDs:</p>
<p>probeset=rownames(DEG[abs(DEG[,1])&gt;0.5 &amp; DEG[,4]&lt;0.05,])</p>
<p>library(hgu133plus2.db)</p>
<p>library(annotate)</p>
<p>platformDB="hgu133plus2.db";</p>
<p>EGID &lt;- as.numeric(lookUp(probeset, platformDB, "ENTREZID"))</p>
<p>length(unique(EGID))</p>
<p>#[1] 775</p>
<p>diff_gene_list &lt;- unique(EGID)</p>
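<p>This gives us a list of 775 differentially expressed genes.</p>
<p>First, we use the functions in GOstats, a Bioconductor package for R, to run the hypergeometric test and see whether each pathway is enriched:</p>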
<p>library(GOstats)</p>
<p>library(org.Hs.eg.db)</p>
<p>#then do kegg pathway enrichment !</p>
<p>hyperG.params = new("KEGGHyperGParams", geneIds=diff_gene_list, universeGeneIds=NULL, annotation="org.Hs.eg.db",</p>
<p>categoryName="KEGG", pvalueCutoff=1, testDirection = "over")</p>
<p>KEGG.hyperG.results = hyperGTest(hyperG.params);</p>
<p>htmlReport(KEGG.hyperG.results, file="kegg.enrichment.html", summary.args=list("htmlLinks"=TRUE))</p>
<p>The result:</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0025.png"><img class="alignnone size-full wp-image-1228" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/12/image0025.png" alt="image002" width="858" height="395" /></a></p>
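<p>Used this way, however, the underlying principle stays hidden: we don't know how these numbers were computed, we merely got a result out of a package someone else wrote.</p>
<p>In fact, the hyperGTest function in this package is nothing more than a wrapper around a hypergeometric test.</p>
<p>Once we understand the statistics behind it, we can easily write our own function that does the same thing.</p>
<div>The hypergeometric distribution is simple: the balls come in two colors, black and white, in known numbers. If you draw a fixed number of balls at random, how many white balls should you expect?</div>
<div>The formula is exp_count=n*M/N, where N is the total number of balls, M the number of white balls, and n the number of balls drawn.</div>
<div>Then, from the number of white balls you actually drew, you can compute a probability!</div>
<div>Translated to pathway enrichment: how many genes there are in total, how many genes are in your pathway, and how many genes of your pathway were drawn (the differentially expressed genes that belong to the pathway). These numbers are enough to compute everything in the table above!</div>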
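<div>As a worked illustration of the ball-drawing argument above, the sketch below computes the expected count and the upper-tail probability for one pathway. N and n reuse the counts quoted in this post's code and M is the size of pathway 04380 from the related post, but k is a made-up number chosen only for illustration, so the output is not a result from the real data.</div>

```r
N = 5870  # all balls: background genes
M = 128   # white balls: genes in the pathway
n = 321   # balls drawn: differentially expressed genes
k = 15    # white balls drawn: DE genes in the pathway (hypothetical)

# Expected number of white balls under purely random drawing
exp_count = n * M / N

# phyper(q, ..., lower.tail = FALSE) gives P(X > q),
# so q = k - 1 yields P(X >= k), the enrichment p-value
p_tail = phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same tail, summed term by term from the probability mass function
p_sum = sum(dhyper(k:min(M, n), M, N - M, n))

print(c(expected = exp_count, p = p_tail, check = p_sum))
```

<div>The two p-values agree, which shows why k-1 rather than k is passed to phyper.</div>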
<div></div>
<div><span style="font-family: Tahoma;">tmp=toTable(org.Hs.egPATH)</span></div>
<div>GeneID2Path=tapply(tmp[,2],as.factor(tmp[,1]),function(x) x)</div>
<div><span style="font-family: Tahoma;">Path2GeneID=tapply(tmp[,1],as.factor(tmp[,2]),function(x) x)</span></div>
<div><span style="font-family: Tahoma;">#phyper(k-1,M, N-M, n, lower.tail=F)</span></div>
<div><span style="font-family: Tahoma;">#n*M/N</span></div>
<div><span style="font-family: Tahoma;">diff_gene_has_path=intersect(diff_gene_list,names(GeneID2Path))</span></div>
<div><span style="font-family: Tahoma;">n=length(diff_gene_has_path) #321 # how many balls you drew in total</span></div>
<div><span style="font-family: Tahoma;">N=length(GeneID2Path) #5870  ## how many balls there are in total <span style="color: #ff0000;"><strong><span style="text-decoration: underline;">(this is wrong: the total number of balls depends on the background genes, typically about 20,000!)</span></strong></span></span></div>
<div><span style="font-family: Tahoma;">options(digits = 4)</span></div>
<div><span style="font-family: Tahoma;">for (i in names(Path2GeneID)){</span></div>
<div><span style="font-family: Tahoma;"> M=length(Path2GeneID[[i]])  ## how many of all the balls are white</span></div>
<div><span style="font-family: Tahoma;"> exp_count=n*M/N  ### how many of the balls you drew are expected to be white</span></div>
<div><span style="font-family: Tahoma;"> k=0         ## k counts how many white balls you actually drew</span></div>
<div><span style="font-family: Tahoma;"> for (j in diff_gene_has_path){</span></div>
<div><span style="font-family: Tahoma;"> if (i %in% GeneID2Path[[j]]) k=k+1</span></div>
<div><span style="font-family: Tahoma;"> }</span></div>
<div><span style="font-family: Tahoma;"> OddsRatio=k/exp_count</span></div>
<div><span style="font-family: Tahoma;"> p=phyper(k-1,M, N-M, n, lower.tail=F)  ## the enrichment probability follows from the number of white balls actually drawn</span></div>
<div><span style="font-family: Tahoma;"> print(paste(i,p,OddsRatio,exp_count,k,M,sep="    "))</span></div>
<div><span style="font-family: Tahoma;">}</span></div>
<div>A quick spot check shows that the results are exactly the same!</div>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1225.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
