<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 &#187; bioconductor</title>
	<atom:link href="http://www.bio-info-trainee.com/tag/bioconductor/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
	<description>Leave a comment and join the discussion on the forum at biotrainee.com, or follow the WeChat official account of the same name, biotrainee</description>
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>Finding the maximal-scoring subgraph with the Bioconductor package BioNet</title>
		<link>http://www.bio-info-trainee.com/2071.html</link>
		<comments>http://www.bio-info-trainee.com/2071.html#comments</comments>
		<pubDate>Fri, 25 Nov 2016 14:54:20 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[基础软件]]></category>
		<category><![CDATA[bioconductor]]></category>
		<category><![CDATA[BioNet]]></category>
		<category><![CDATA[网络分析]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2071</guid>
		<description><![CDATA[## This package tackles a hard problem: the maximal-scoring subgraph &#8230; <a href="http://www.bio-info-trainee.com/2071.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div>## This package tackles a hard problem, the maximal-scoring subgraph (MSS) problem: finding significantly differentially expressed subnetworks inside a huge, complex network. In practice you obtain a few hundred differentially expressed genes, build a network from a PPI database, find that it is still enormous, and use this package to prune it.</div>
<div>"Heuristically" here means using a rule-of-thumb search that approximates the optimum without guaranteeing it.</div>
<div>## The package can also integrate several kinds of results to score a network.</div>
<div>Package home page: <a href="https://www.bioconductor.org/packages/release/bioc/html/BioNet.html">https://www.bioconductor.org/packages/release/bioc/html/BioNet.html</a></div>
<div>Paper: <a href="http://bioinformatics.oxfordjournals.org/content/early/2010/02/25/bioinformatics.btq089">BioNet: an R-Package for the Functional Analysis of ... - Bioinformatics</a></div>
<div>It combines PPI network analysis with the search for functional modules.</div>
<div>Script: <a href="https://www.bioconductor.org/packages/release/bioc/vignettes/BioNet/inst/doc/Tutorial.R">https://www.bioconductor.org/packages/release/bioc/vignettes/BioNet/inst/doc/Tutorial.R</a></div>
<div>Tutorial: <a href="https://www.bioconductor.org/packages/release/bioc/vignettes/BioNet/inst/doc/Tutorial.pdf">https://www.bioconductor.org/packages/release/bioc/vignettes/BioNet/inst/doc/Tutorial.pdf</a></div>
<div>The key step is to find the maximal-scoring subgraph from an "igraph" or "graphNEL" object plus a set of node scores:</div>
<div>subnet &lt;- subNetwork(dataLym$label, interactome)</div>
<div>module &lt;- runFastHeinz(subnet, scores)</div>
<div>plotModule(module, scores=scores, diff.expr=logFC) # this is our pruned network plot</div>
<div>Another function offers similar functionality, dNetFind: <a href="https://rdrr.io/cran/dnet/man/dNetFind.html">https://rdrr.io/cran/dnet/man/dNetFind.html</a></div>
<div></div>
<p><span id="more-2071"></span></p>
<div>## The networks used throughout are graph objects: A graph object, either in graphNEL or igraph format.</div>
<div>## First load the packages and the built-in data</div>
<div></div>
<div>library(BioNet)</div>
<div>library(DLBCL)</div>
<div>data(dataLym)</div>
<div>data(interactome)</div>
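<div>## dataLym holds per-gene results, not samples: the columns used below, t.pval and s.pval, are p-values for each gene (the second example in this post calls the latter surv.pval)</div>
<div>## interactome is a built-in PPI network object; a sub-network can be extracted from it for a given gene list</div>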
<div></div>
<div>pvals &lt;- cbind(t=dataLym$t.pval, s=dataLym$s.pval)</div>
<div>rownames(pvals) &lt;- dataLym$label</div>
<div>pval &lt;- aggrPvals(pvals, order=2, plot=FALSE)</div>
<div></div>
<div>## Take the t and s p-values and combine them into a single p-value per gene with aggrPvals</div>
<div></div>
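<div>To make the aggregation step concrete, here is a minimal base-R sketch of combining p-values through an order statistic. It assumes the p-values for a gene are independent and uniform under the null, so the k-th smallest of n p-values follows a Beta(k, n - k + 1) distribution. The function name aggregate_pvalues and the toy numbers are hypothetical; this illustrates the idea and is not the aggrPvals source.</div>

```r
# Sketch only: combine the columns of a p-value matrix by the k-th order statistic.
# Assumption: p-values are independent and uniform under the null hypothesis.
aggregate_pvalues <- function(pmat, order) {
  n <- ncol(pmat)
  stopifnot(order >= 1, order <= n)
  # k-th smallest p-value for every gene (row)
  kth <- apply(pmat, 1, function(p) sort(p)[order])
  # Under the null, the k-th order statistic of n uniforms is Beta(k, n - k + 1)
  pbeta(kth, shape1 = order, shape2 = n - order + 1)
}

# Hypothetical toy input: two p-value columns for two genes
toy_p <- cbind(t = c(0.01, 0.50), s = c(0.04, 0.90))
rownames(toy_p) <- c("geneA", "geneB")

agg <- aggregate_pvalues(toy_p, order = 2)
print(agg)
```

<div>With two columns and order = 2, the combined value is simply the square of the larger p-value, so a gene needs a small p-value in both columns to score well.</div>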
<div>subnet &lt;- subNetwork(dataLym$label, interactome)</div>
<div>subnet &lt;- rmSelfLoops(subnet)</div>
<div>subnet</div>
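<div>## Extract the network for the genes in dataLym$label. The labels look a little odd, e.g. TP53(7157), which appears to be the gene symbol combined with the Entrez ID.</div>
<div>## rmSelfLoops is a standard step: whenever you work with a network, remove the self-loops first</div>
<div>## Because dataLym$label is a limited gene set, the extracted network is usually on the order of a thousand nodes and some ten thousand edges</div>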
<div></div>
<div>fb &lt;- fitBumModel(pval, plot=FALSE)</div>
<div>## Fit a Beta-Uniform-Mixture (BUM) model to the aggregated per-gene p-values.</div>
<div>scores &lt;- scoreNodes(subnet, fb, fdr=0.001)</div>
<div></div>
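<div>The scoring step can be sketched in base R. Under a BUM model with mixture weight lambda and shape a, the FDR at a p-value threshold tau can be written as pi_hat / (lambda + (1 - lambda) * tau^(a - 1)) with pi_hat = lambda + (1 - lambda) * a; solving for tau gives the threshold, and a node score is then the log ratio of the signal density at p and at tau. The function names and parameter values below are hypothetical, and this is a sketch of the published scoring idea under those assumptions, not the scoreNodes source.</div>

```r
# Sketch only: FDR-derived threshold and node score under a beta-uniform mixture.
# lambda = weight of the uniform (noise) component, a = shape of the Beta(a, 1) signal.
bum_threshold <- function(lambda, a, fdr) {
  pi_hat <- lambda + (1 - lambda) * a   # upper bound on the fraction of noise
  ((pi_hat - fdr * lambda) / (fdr * (1 - lambda)))^(1 / (a - 1))
}

# Score = log(a * p^(a - 1)) - log(a * tau^(a - 1)) = (a - 1) * (log(p) - log(tau))
bum_score <- function(p, a, tau) (a - 1) * (log(p) - log(tau))

# Hypothetical fitted parameters and FDR
tau <- bum_threshold(lambda = 0.5, a = 0.3, fdr = 0.01)
toy_scores <- bum_score(c(1e-6, tau, 0.5), a = 0.3, tau = tau)
print(tau)
print(toy_scores)
```

<div>P-values below the threshold get positive scores and p-values above it get negative scores, which is why a stricter fdr leaves fewer positive nodes and yields a smaller module.</div>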
<div>module &lt;- runFastHeinz(subnet, scores)</div>
<div>## Here we use a fast heuristic approach to calculate an approximation to the optimal scoring subnetwork.</div>
<div>logFC &lt;- dataLym$diff</div>
<div>names(logFC) &lt;- dataLym$label</div>
<div></div>
<div>plotModule(module, scores=scores, diff.expr=logFC)</div>
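<div>## diff.expr sets the node colors</div>
<div>## scores sets the node shapes</div>
<div>## The function is a customized plot built on the graphNEL or igraph format; you can also plot directly with the igraph package.</div>
<div>## The network can also be exported to a Cytoscape format and drawn in Cytoscape</div>
<div>## In general, red marks up-regulated genes and green marks down-regulated genes; circles are positive scores and diamonds are negative scores</div>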
<div></div>
<div></div>
<div>## Below is a practical example of network analysis with the BioNet package</div>
<div>library(BioNet)</div>
<div>library(DLBCL)</div>
<div>data(exprLym)</div>
<div>data(interactome)</div>
<div>exprLym ## built-in object, so its gene labels already match what interactome expects</div>
<div>interactome</div>
<div>network &lt;- subNetwork(featureNames(exprLym), interactome)</div>
<div>network</div>
<div>network &lt;- largestComp(network)</div>
<div>## The function extracts the largest component of a network</div>
<div>network</div>
<div></div>
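<div>What largestComp and rmSelfLoops do to a network can be illustrated on a plain edge list. The sketch below uses only base R and a breadth-first search; the function names and the toy edges are hypothetical, and BioNet itself works on graphNEL or igraph objects instead of data frames.</div>

```r
# Sketch only: drop self-loops, then keep the nodes of the largest connected component.
remove_self_loops <- function(edges) edges[edges$from != edges$to, , drop = FALSE]

largest_component <- function(edges) {
  nodes <- unique(c(edges$from, edges$to))
  # Build an undirected adjacency list keyed by node name
  adj <- setNames(vector("list", length(nodes)), nodes)
  for (i in seq_len(nrow(edges))) {
    a <- edges$from[i]
    b <- edges$to[i]
    adj[[a]] <- c(adj[[a]], b)
    adj[[b]] <- c(adj[[b]], a)
  }
  # Label every node with a component id using breadth-first search
  comp <- setNames(rep(NA_integer_, length(nodes)), nodes)
  current <- 0L
  for (start in nodes) {
    if (!is.na(comp[[start]])) next
    current <- current + 1L
    comp[[start]] <- current
    queue <- start
    while (length(queue) > 0) {
      node <- queue[1]
      queue <- queue[-1]
      for (nb in adj[[node]]) {
        if (is.na(comp[[nb]])) {
          comp[[nb]] <- current
          queue <- c(queue, nb)
        }
      }
    }
  }
  sizes <- table(comp)
  biggest <- as.integer(names(sizes)[which.max(sizes)])
  names(comp)[comp == biggest]
}

# Hypothetical toy network: one self-loop and two components
toy_edges <- data.frame(
  from = c("A", "A", "B", "D"),
  to   = c("A", "B", "C", "E"),
  stringsAsFactors = FALSE
)
clean_edges <- remove_self_loops(toy_edges)
main_nodes <- largest_component(clean_edges)
print(main_nodes)
```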
<div>library(genefilter)</div>
<div>library(impute)</div>
<div>expressions &lt;- impute.knn(exprs(exprLym))$data</div>
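<div>## The matrix returned by exprs contains missing values, so impute.knn is used to impute the missing expression data</div>
<div>## Differential analysis is done here with the rowttests function from the genefilter package</div>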
<div>t.test &lt;- rowttests(expressions, fac=exprLym$Subgroup)</div>
<div>t.test[1:10, ]</div>
<div>data(dataLym)</div>
<div></div>
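<div>For readers without genefilter installed, the row-wise test above can be approximated in base R. This sketch runs an equal-variance two-sample t-test on every row and reports a statistic, a difference of group means and a p-value. The column names mirror the ones used in this post, the assumption that dm is the first-level mean minus the second-level mean is mine, and the data and group labels are simulated.</div>

```r
# Sketch only: a slow base-R stand-in for a row-wise two-group t-test.
row_ttests <- function(mat, groups) {
  groups <- factor(groups)
  stopifnot(nlevels(groups) == 2, length(groups) == ncol(mat))
  g1 <- groups == levels(groups)[1]
  res <- t(apply(mat, 1, function(x) {
    tt <- t.test(x[g1], x[!g1], var.equal = TRUE)
    c(statistic = unname(tt$statistic),
      dm = mean(x[g1]) - mean(x[!g1]),   # assumed sign convention
      p.value = tt$p.value)
  }))
  as.data.frame(res)
}

# Simulated expression matrix: 5 genes, 8 samples, gene1 shifted in the second group
set.seed(42)
toy_expr <- matrix(rnorm(5 * 8), nrow = 5,
                   dimnames = list(paste0("gene", 1:5), paste0("s", 1:8)))
toy_expr["gene1", 5:8] <- toy_expr["gene1", 5:8] + 10
toy_groups <- rep(c("groupA", "groupB"), each = 4)

toy_tt <- row_ttests(toy_expr, toy_groups)
print(toy_tt)
```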
<div>ttest.pval &lt;- t.test[, "p.value"]</div>
<div>surv.pval &lt;- dataLym$s.pval</div>
<div>names(surv.pval) &lt;- dataLym$label</div>
<div>pvals &lt;- cbind(ttest.pval, surv.pval)</div>
<div>pval &lt;- aggrPvals(pvals, order=2, plot=FALSE)</div>
<div>fb &lt;- fitBumModel(pval, plot=FALSE)</div>
<div>fb</div>
<div>## Plot what the fitBumModel function actually fitted</div>
<div>dev.new(width=13, height=7)</div>
<div>par(mfrow=c(1,2))</div>
<div>hist(fb)</div>
<div>plot(fb)</div>
<div>dev.off()</div>
<div></div>
<div>## The plot below shows how the log-likelihood varies over the two parameters of the Beta-Uniform-Mixture (BUM) model</div>
<div>plotLLSurface(pval, fb)</div>
<div></div>
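<div>To see what fitting a BUM model involves, here is a minimal base-R sketch that maximizes the mixture likelihood on simulated p-values. The density is lambda + (1 - lambda) * a * x^(a - 1), that is, a uniform noise component plus a Beta(a, 1) signal component. The function names, starting values and simulated data are hypothetical, and the optimizer choice (Nelder-Mead on logit-transformed parameters) is a swapped-in simplification, not what fitBumModel does internally.</div>

```r
# Sketch only: maximum-likelihood fit of a beta-uniform mixture to p-values.
bum_density <- function(x, lambda, a) lambda + (1 - lambda) * a * x^(a - 1)

# Negative log-likelihood on unconstrained parameters; plogis keeps lambda and a in (0, 1)
bum_nll <- function(par, x) {
  -sum(log(bum_density(x, plogis(par[1]), plogis(par[2]))))
}

fit_bum <- function(pvals) {
  pvals <- pmax(pvals, .Machine$double.xmin)   # guard against exact zeros
  start <- c(0, 0)                             # lambda = a = 0.5
  fit <- optim(start, bum_nll, x = pvals)
  list(lambda = plogis(fit$par[1]), a = plogis(fit$par[2]),
       nll = fit$value, start_nll = bum_nll(start, pvals))
}

# Simulated p-values: 70% noise (uniform), 30% signal (Beta(0.2, 1))
set.seed(1)
toy_p <- c(runif(700), rbeta(300, 0.2, 1))
toy_fit <- fit_bum(toy_p)
print(toy_fit)
```

<div>The two fitted numbers, lambda and a, are the same two parameters that the likelihood-surface plot is drawn over.</div>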
<div>scores &lt;- scoreNodes(network=network, fb=fb, fdr=0.001)</div>
<div>## Score each node from its p-value (scoreNodes scores nodes, not edges)</div>
<div></div>
<div>network &lt;- rmSelfLoops(network)</div>
<div></div>
<div>## Write the network to text files; these are the edge and node inputs for heinz, and they can also be imported into Cytoscape</div>
<div>writeHeinzEdges(network=network, file="lymphoma_edges_001", use.score=FALSE)</div>
<div>writeHeinzNodes(network=network, file="lymphoma_nodes_001", node.scores = scores)</div>
<div></div>
<div>datadir &lt;- file.path(path.package("BioNet"), "extdata")</div>
<div>dir(datadir)</div>
<div>## The algorithm changes here: the heinz algorithm is used to calculate the maximum-scoring subnetwork</div>
<div>## The file read below has to be generated with the heinz.py script; this example uses the copy shipped with the package</div>
<div>## The command is: heinz.py -e lymphoma_edges_001.txt -n lymphoma_nodes_001.txt -N True -E False</div>
<div></div>
<div>module &lt;- readHeinzGraph(node.file=file.path(datadir, "lymphoma_nodes_001.txt.0.hnz"), network=network)</div>
<div>diff &lt;- t.test[, "dm"]</div>
<div>names(diff) &lt;- rownames(t.test)</div>
<div></div>
<div>plotModule(module, diff.expr=diff, scores=scores)</div>
<div></div>
<div>sum(scores[nodes(module)])</div>
<div>sum(scores[nodes(module)]&gt;0)</div>
<div>sum(scores[nodes(module)]&lt;0)</div>
<div></div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 27: Tutorial.Rnw:375-380</div>
<div>###################################################</div>
<div>library(BioNet)</div>
<div>library(DLBCL)</div>
<div>library(ALL)</div>
<div>data(ALL)</div>
<div>data(interactome)</div>
<div>## ALL comes from another package. Its features are probe IDs, not gene IDs, so they must be mapped to identifiers that BioNet recognizes</div>
<div>mapped.eset &lt;- mapByVar(ALL, network=interactome, attr="geneID")</div>
<div>mapped.eset[1:5,1:5]</div>
<div>length(intersect(rownames(mapped.eset), nodes(interactome)))</div>
<div>network &lt;- subNetwork(rownames(mapped.eset), interactome)</div>
<div>network</div>
<div>network &lt;- largestComp(network)</div>
<div>network &lt;- rmSelfLoops(network)</div>
<div>network</div>
<div></div>
<div>## Differential analysis is done with limma here</div>
<div>library(limma)</div>
<div>design &lt;- model.matrix(~ -1+ factor(c(substr(unlist(ALL$BT), 0, 1))))</div>
<div>colnames(design)&lt;- c("B", "T")</div>
<div>contrast.matrix &lt;- makeContrasts(B-T, levels=design)</div>
<div>contrast.matrix</div>
<div>fit &lt;- lmFit(mapped.eset, design)</div>
<div>fit2 &lt;- contrasts.fit(fit, contrast.matrix)</div>
<div>fit2 &lt;- eBayes(fit2)</div>
<div>pval &lt;- fit2$p.value[,1]</div>
<div>fb &lt;- fitBumModel(pval, plot=FALSE)</div>
<div>fb</div>
<div>dev.new(width=13, height=7)</div>
<div>par(mfrow=c(1,2))</div>
<div>hist(fb)</div>
<div>plot(fb)</div>
<div>scores &lt;- scoreNodes(network=network, fb=fb, fdr=1e-14)</div>
<div>## Again write the network to local files, as input for heinz and for import into Cytoscape</div>
<div>writeHeinzEdges(network=network, file="ALL_edges_001", use.score=FALSE)</div>
<div>writeHeinzNodes(network=network, file="ALL_nodes_001", node.scores = scores)</div>
<div>## Again, the heinz algorithm is used to calculate the maximum-scoring subnetwork</div>
<div>## A new implementation Heinz v2.0 is also available at https://software.cwi.nl/software/heinz ,</div>
<div></div>
<div>datadir &lt;- file.path(path.package("BioNet"), "extdata")</div>
<div>module &lt;- readHeinzGraph(node.file=file.path(datadir, "ALL_nodes_001.txt.0.hnz"), network=network)</div>
<div></div>
<div>nodeDataDefaults(module, attr="diff") &lt;- ""</div>
<div>nodeData(module, n=nodes(module), attr="diff") &lt;- fit2$coefficients[nodes(module),1]</div>
<div>nodeDataDefaults(module, attr="score") &lt;- ""</div>
<div>nodeData(module, n=nodes(module), attr="score") &lt;- scores[nodes(module)]</div>
<div>nodeData(module)[1]</div>
<div></div>
<div>## Save as an XGMML file for use in Cytoscape</div>
<div>saveNetwork(module, file="ALL_module", type="XGMML")</div>
<div></div>
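<div><span style="color: #ff0000;">## In general, red marks up-regulated genes and green marks down-regulated genes; circles are positive scores and diamonds are negative scores</span></div>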
<div></div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2071.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>PPI analysis with the Bioconductor R package STRINGdb</title>
		<link>http://www.bio-info-trainee.com/2041.html</link>
		<comments>http://www.bio-info-trainee.com/2041.html#comments</comments>
		<pubDate>Wed, 23 Nov 2016 11:37:37 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[基础数据库]]></category>
		<category><![CDATA[bioconductor]]></category>
		<category><![CDATA[PPI]]></category>
		<category><![CDATA[string]]></category>
		<category><![CDATA[stringDB]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2041</guid>
		<description><![CDATA[PPI analysis starts from a set of proteins or genes of interest (a few hundred or even more than a thousand) and looks them up in a PP &#8230; <a href="http://www.bio-info-trainee.com/2041.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
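				<content:encoded><![CDATA[<p>PPI analysis starts from a set of proteins or genes of interest (a few hundred or even more than a thousand) and looks up, in a PPI database, the interactions among them.</p>
<div>The tool this time is STRINGdb which, as the name suggests, uses the well-known STRING database.</div>
<div>Paper: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383874/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383874/</a></div>
<div>Home page: <a href="http://string-db.org/cgi/input.pl">http://string-db.org/cgi/input.pl</a></div>
<div>I had assumed I would need to upload my genes to the website for analysis, but the authors also provide an R package: <a href="http://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html">http://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html</a>. I prefer solving problems programmatically, so I learned the package, and it works very well.</div>
<div>It only needs a three-column data.frame with logFC, p.value and gene ID, which is the standard output of a differential expression analysis.</div>
<div>string_db$map then adds a column with the STRING protein ID, and string_db$add_diff_exp_color adds a color column.</div>
<div>string_db$plot_network draws the network and needs only STRING protein IDs. To color the proteins, first use string_db$post_payload to attach a color to each protein, then draw the network.</div>
<div><strong><span style="color: #ff0000;">You can also call get_interactions directly to retrieve all the PPI data</span></strong>, write it to disk, and import it into Cytoscape for plotting.</div>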
<div></div>
<p><span id="more-2041"></span></p>
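<div>There are also a few small features that I may not need but that suit beginners: with just STRING protein IDs you can run GO/KEGG enrichment analysis, look up the interaction between two proteins, find papers reporting a direct interaction between two proteins, and find the homologs of a protein in other species.</div>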
<div>The package needs to download several files while it runs. Annoyingly, it downloads them again in every session, because by default it stores them in the system's temporary directory. Pointing input_directory at a persistent folder when creating the object should avoid this.</div>
<div>All the network plots are essentially heavily customized igraph plots, as are the clustering methods that come later; you may also want Cytoscape's MCODE plugin to find hub genes.</div>
<div><a href="http://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html">Running the code below once is basically enough to understand the package: </a><a href="http://www.bioconductor.org/packages/release/bioc/vignettes/STRINGdb/inst/doc/STRINGdb.R">http://www.bioconductor.org/packages/release/bioc/vignettes/STRINGdb/inst/doc/STRINGdb.R</a></div>
<div></div>
<div>library(STRINGdb)</div>
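<div>## The help pages are not written with roxygen2. All functions live in the string_db object and are called with the $ operator, which is also how you view each function's help.</div>
<div></div>
<div>## First choose the species and the database version</div>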
<div>string_db &lt;- STRINGdb$new( version="10", species=9606,</div>
<div>score_threshold=0, input_directory="" )</div>
<div></div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 3: help</div>
<div>###################################################</div>
<div>STRINGdb$methods() # To list all the methods available.</div>
<div>STRINGdb$help("get_graph") # To visualize their documentation.</div>
<div>## List all the methods in the package and view the help for a particular one.</div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 4: load_data</div>
<div>###################################################</div>
<div>data(diff_exp_example1)</div>
<div>head(diff_exp_example1)</div>
<div>## A test data set with three columns, like this:</div>
<div># pvalue logFC gene</div>
<div># 0.0001018 3.333461 VSTM2L</div>
<div># 0.0001392 3.822383 TBC1D2</div>
<div># This is typically the output of a differential expression analysis</div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 5: map</div>
<div>###################################################</div>
<div>example1_mapped &lt;- string_db$map( diff_exp_example1, "gene", removeUnmappedRows = TRUE )</div>
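<div>## Our differential analysis is keyed by gene, so the genes must be mapped to STRING protein IDs</div>
<div>STRINGdb$help("map")</div>
<div># Read the help to see how map is used and what it returns</div>
<div># In essence it looks up the STRING protein ID for the gene column of the input data.frame and returns the data.frame with one extra column</div>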
<div></div>
<div>###################################################</div>
<div>### code chunk number 6: STRINGdb.Rnw:118-121</div>
<div>###################################################</div>
<div>options(SweaveHooks=list(fig=function()</div>
<div>par(mar=c(2.1, 0.1, 4.1, 2.1))))</div>
<div>#par(mar=c(1.1, 0.1, 4.1, 2.1))))</div>
<div>## Set the plotting parameters; nothing much to explain here</div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 7: get_hits</div>
<div>###################################################</div>
<div>hits &lt;- example1_mapped$STRING_id[1:200]</div>
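<div># Simply take the first 200 proteins for the next step</div>
<div>## Note that taking the first 200 rows is an arbitrary choice for illustration; in practice you should pick your own differentially expressed genes</div>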
<div>###################################################</div>
<div>### code chunk number 8: plot_network</div>
<div>###################################################</div>
<div>string_db$plot_network( hits )</div>
<div></div>
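<div>## Protein IDs are all you need to draw the network; the more IDs, the longer it takes</div>
<div>## The function finds all the PPI data in the STRING database for the given IDs and then draws the network</div>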
<div>## STRINGdb$help("plot_network")</div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 9: add_diff_exp_color</div>
<div>###################################################</div>
<div># filter by p-value and add a color column</div>
<div># (i.e. green down-regulated gened and red for up-regulated genes)</div>
<div>example1_mapped_pval05 &lt;- string_db$add_diff_exp_color( subset(example1_mapped, pvalue&lt;0.05),</div>
<div>logFcColStr="logFC" )</div>
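<div>## The plain network plot above is usually not enough. For instance, we want to show whether each gene is up- or down-regulated and how strongly, which can be encoded as shades of red and green.</div>
<div>## add_diff_exp_color still returns a data.frame, with an added color column</div>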
<div>STRINGdb$help("add_diff_exp_color")</div>
<div></div>
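<div>The idea of encoding fold change as shades of red and green can be sketched in a few lines of base R. The function name logfc_to_color and the linear scaling are hypothetical choices made for illustration; the actual palette and scaling used by add_diff_exp_color may differ.</div>

```r
# Sketch only: map log fold changes to hex colors.
# Up-regulated genes shade from white to red, down-regulated from white to green.
logfc_to_color <- function(logfc) {
  scale <- max(abs(logfc))
  if (scale == 0) scale <- 1            # avoid dividing by zero when nothing changes
  intensity <- abs(logfc) / scale       # 0 = no change, 1 = strongest change
  ifelse(logfc >= 0,
         rgb(1, 1 - intensity, 1 - intensity),
         rgb(1 - intensity, 1, 1 - intensity))
}

# Hypothetical differential expression table
toy_de <- data.frame(gene = c("G1", "G2", "G3"), logFC = c(2, -2, 0),
                     stringsAsFactors = FALSE)
toy_de$color <- logfc_to_color(toy_de$logFC)
print(toy_de)
```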
<div></div>
<div>###################################################</div>
<div>### code chunk number 10: post_payload</div>
<div>###################################################</div>
<div># post payload information to the STRING server</div>
<div>payload_id &lt;- string_db$post_payload( example1_mapped_pval05$STRING_id,</div>
<div>colors=example1_mapped_pval05$color )</div>
<div></div>
<div>## add_diff_exp_color added a color column to our data.frame; post_payload then associates the STRING protein IDs with those colors and returns a payload_id for the plotting function.</div>
<div>STRINGdb$help("post_payload")</div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 11: plot_halo_network</div>
<div>###################################################</div>
<div># display a STRING network png with the "halo"</div>
<div>string_db$plot_network( hits, payload_id=payload_id )</div>
<div></div>
<div>## The same network plot, now with a color attribute.</div>
<div>## As you can see, with this many genes the plot gets very crowded</div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 13: plot_ppi_enrichment</div>
<div>###################################################</div>
<div># plot the enrichment for the best 1000 genes</div>
<div>string_db$plot_ppi_enrichment( example1_mapped$STRING_id[1:1000], quiet=TRUE )</div>
<div>STRINGdb$help("plot_ppi_enrichment")</div>
<div>## I have not worked out what this code does</div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 14: enrichment</div>
<div>###################################################</div>
<div>enrichmentGO &lt;- string_db$get_enrichment( hits, category = "Process", methodMT = "fdr", iea = TRUE )</div>
<div>enrichmentKEGG &lt;- string_db$get_enrichment( hits, category = "KEGG", methodMT = "fdr", iea = TRUE )</div>
<div>head(enrichmentGO, n=7)</div>
<div>head(enrichmentKEGG, n=7)</div>
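<div>### Run enrichment analysis directly from STRING protein IDs; the function downloads some data automatically. By default the whole human proteome is the background, which usually needs to be changed, otherwise the p-values will be inaccurate</div>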
<div></div>
<div>#################################################</div>
<div># code chunk number 15: background (eval = FALSE)</div>
<div>#################################################</div>
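<div># Change the background here: humans have more than twenty thousand genes, but the background now contains only 2000</div>
<div>backgroundV &lt;- example1_mapped$STRING_id[1:2000] # as an example, we use the first 2000 genes</div>
<div>string_db$set_background(backgroundV)</div>
<div>## string_db is a global variable. It was created for human, version 10, and has now been modified. This is only a test, so remember to change it back!</div>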
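<div>Why the background matters can be shown with a plain hypergeometric test in base R. The sketch below assumes enrichment is assessed as an over-representation test, which is the usual approach but not something this post confirms for get_enrichment; the function name and all counts are hypothetical.</div>

```r
# Sketch only: over-representation p-value from the hypergeometric distribution.
# hits_in_term:     genes in our list annotated with the term
# hits_total:       size of our gene list
# term_total:       genes in the background annotated with the term
# background_total: size of the background
enrichment_pvalue <- function(hits_in_term, hits_total, term_total, background_total) {
  phyper(hits_in_term - 1, term_total, background_total - term_total,
         hits_total, lower.tail = FALSE)
}

# Same gene list and term, two different backgrounds (hypothetical counts)
p_genome <- enrichment_pvalue(20, 200, 500, 20000)  # whole-genome background
p_small  <- enrichment_pvalue(20, 200, 500, 2000)   # restricted background
print(c(genome = p_genome, small = p_small))
```

<div>With the large background, 20 annotated genes out of 200 is far above expectation; with the small background the same count is below expectation, so the term is no longer enriched.</div>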
<div></div>
<div>###################################################</div>
<div>### code chunk number 16: new_background_inst (eval = FALSE)</div>
<div>###################################################</div>
<div>string_db &lt;- STRINGdb$new( score_threshold=0, backgroundV = backgroundV )</div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 17: enrichmentHeatmap (eval = FALSE)</div>
<div>###################################################</div>
<div>eh &lt;- string_db$enrichment_heatmap( list( hits[1:100], hits[101:200]),</div>
<div>list("list1","list2"), title="My Lists" )</div>
<div></div>
<div>## Now set string_db back to what it was</div>
<div>string_db &lt;- STRINGdb$new( version="10", species=9606,</div>
<div>score_threshold=0, input_directory="" )</div>
<div></div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 18: clustering1</div>
<div>###################################################</div>
<div># get clusters</div>
<div>clustersList &lt;- string_db$get_clusters(example1_mapped$STRING_id[1:600])</div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 19: STRINGdb.Rnw:254-256</div>
<div>###################################################</div>
<div>options(SweaveHooks=list(fig=function()</div>
<div>par(mar=c(2.1, 0.1, 4.1, 2.1))))</div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 20: clustering2</div>
<div>###################################################</div>
<div># plot first 4 clusters</div>
<div>par(mfrow=c(2,2))</div>
<div>for(i in seq(1:4)){</div>
<div>string_db$plot_network(clustersList[[i]])</div>
<div>}</div>
<div>## Draw the 4 clusters on one canvas</div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 21: proteins</div>
<div>###################################################</div>
<div>string_proteins &lt;- string_db$get_proteins()</div>
<div></div>
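<div>## Below are some other small utilities: looking up the interaction between two proteins, finding papers on a direct interaction between two proteins, and finding the homologs of a protein in other species</div>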
<div></div>
<div>###################################################</div>
<div>### code chunk number 22: atmtp</div>
<div>###################################################</div>
<div>tp53 = string_db$mp( "tp53" )</div>
<div>atm = string_db$mp( "atm" )</div>
<div></div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 23: neighbors (eval = FALSE)</div>
<div>###################################################</div>
<div>## string_db$get_neighbors( c(tp53, atm) )</div>
<div></div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 24: interactions</div>
<div>###################################################</div>
<div>string_db$get_interactions( c(tp53, atm) )</div>
<div></div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 25: pubmedInteractions (eval = FALSE)</div>
<div>###################################################</div>
<div>## string_db$get_pubmed_interaction( tp53, atm )</div>
<div></div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 26: homologs (eval = FALSE)</div>
<div>###################################################</div>
<div>## # get the reciprocal best hits of the following protein in all the STRING species</div>
<div>## string_db$get_homologs_besthits(tp53, symbets = TRUE)</div>
<div></div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 27: homologs2 (eval = FALSE)</div>
<div>###################################################</div>
<div>## # get the homologs of the following two proteins in the mouse (i.e. species_id=10090)</div>
<div>## string_db$get_homologs(c(tp53, atm), target_species_id=10090, bitscore_threshold=60 )</div>
<div></div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 28: benchmark1</div>
<div>###################################################</div>
<div>data(interactions_example)</div>
<div></div>
<div>interactions_benchmark = string_db$benchmark_ppi(interactions_example, pathwayType = "KEGG",</div>
<div>max_homology_bitscore = 60, precision_window = 400, exclude_pathways = "blacklist")</div>
<div></div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 29: STRINGdb.Rnw:391-393</div>
<div>###################################################</div>
<div>options(SweaveHooks=list(fig=function()</div>
<div>par(mar=c(4.1, 4.1, 4.1, 2.1))))</div>
<div></div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 30: benchmark2</div>
<div>###################################################</div>
<div>plot(interactions_benchmark$precision, ylim=c(0,1), type="l", xlim=c(0,700),</div>
<div>xlab="interactions", ylab="precision")</div>
<div></div>
<div></div>
<div>###################################################</div>
<div>### code chunk number 31: benchmark3</div>
<div>###################################################</div>
<div>interactions_pathway_view = string_db$benchmark_ppi_pathway_view(interactions_benchmark, precision_threshold=0.2, pathwayType = "KEGG")</div>
<div>head(interactions_pathway_view)</div>
<div></div>
<div></div>
<div></div>
<div></div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2041.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A handy R trick: looking up the functions that operate on an object</title>
		<link>http://www.bio-info-trainee.com/1951.html</link>
		<comments>http://www.bio-info-trainee.com/1951.html#comments</comments>
		<pubDate>Sat, 15 Oct 2016 13:44:09 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[基础软件]]></category>
		<category><![CDATA[bioconductor]]></category>
		<category><![CDATA[对象]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1951</guid>
		<description><![CDATA[For bioinformatics engineers with a biology background, many computing concepts are a headache, especially programming-language &#8230; <a href="http://www.bio-info-trainee.com/1951.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
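				<content:encoded><![CDATA[<p>For bioinformatics engineers with a biology background, many computing concepts are a headache, especially the higher-level objects in programming languages. When I first learned to program, a variable, some data and a hash were all I needed to handle most of my data processing, but things turned out to be far more complicated. Many experts like encapsulation and code reuse, and they like higher-level objects. This is especially true in R's Bioconductor, where you constantly meet packaged S3 and S4 objects. After reading the manuals I did learn what some of these objects contain and how to process them to extract the information I wanted. For example, I wrote a series of posts:</p>
<div><a href="http://www.bio-info-trainee.com/886.html">Bioconductor series: GenomicAlignments</a></div>
<div><a href="http://www.bio-info-trainee.com/883.html">Bioconductor series: GenomicFeatures</a></div>
<div><a href="http://www.bio-info-trainee.com/831.html">The Bioconductor package TxDb.Hsapiens.UCSC.hg19.knownGene explained</a></div>
<div><a href="http://www.bio-info-trainee.com/828.html">TxDb and GRanges objects in Bioconductor packages explained</a></div>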
<p><span id="more-1951"></span></p>
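<p>Back then I naively collected and summarized the functions for each object by hand, which was exhausting. I kept wondering whether there was somewhere to look up which functions apply to a given object. Nobody can memorize them all (for example seqnames(), strand(), cigar(), qwidth(), start(), end(), width() and njunc(), which all operate on a GAlignments object).</p>
<p>Today I ran into a LumiBatch object, which is also complex. I knew it contained genes and probes, but I could not get at them:</p>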
<blockquote><p>Summary of data information:<br />
Illumina Inc. BeadStudio version 1.4.0.1<br />
Normalization = none<br />
Array Content = 11188230_100CP_MAGE-ML.XML<br />
Error Model = none<br />
DateTime = 2/3/2005 3:21 PM<br />
Local Settings = en-US</p>
<p>Major Operation History:<br />
submitted finished command lumiVersion<br />
1 2007-04-22 00:08:36 2007-04-22 00:10:36 lumiR("../data/Barnes_gene_profile.txt") 1.1.6<br />
2 2007-04-22 00:10:36 2007-04-22 00:10:38 lumiQ(x.lumi = x.lumi) 1.1.6<br />
3 2007-04-22 00:13:06 2007-04-22 00:13:10 addNuId2lumi(x.lumi = x.lumi, lib = "lumiHumanV1") 1.1.6<br />
4 2007-04-22 00:59:20 2007-04-22 00:59:36 Subsetting 8000 features and 4 samples. 1.1.6</p>
<p>Object Information:<br />
LumiBatch (storageMode: lockedEnvironment)<br />
assayData: 8000 features, 4 samples<br />
element names: beadNum, detection, exprs, se.exprs<br />
protocolData: none<br />
phenoData<br />
sampleNames: A01 A02 B01 B02<br />
varLabels: sampleID label<br />
varMetadata: labelDescription<br />
featureData<br />
featureNames: oZsQEQXp9ccVIlwoQo 9qedFRd_5Cul.ueZeQ ... 33KnLHy.RFaieogAF4 (8000 total)<br />
fvarLabels: TargetID<br />
fvarMetadata: labelDescription<br />
experimentData: use 'experimentData(object)'<br />
Annotation: lumiHumanAll.db<br />
Control Data: Available<br />
QC information: Please run summary(x, 'QC') for details!</p></blockquote>
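<p>It looks extremely complicated. The tutorial mentions some functions for plotting and extracting data from this object, but they did not cover what I needed. After a long search I finally found the solution:</p>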
<div><a href="https://www.rdocumentation.org/packages/Biobase/versions/2.26.0/topics/AnnotatedDataFrame?">https://www.rdocumentation.org/packages/Biobase/versions/2.26.0/topics/AnnotatedDataFrame?</a></div>
<div><a href="https://www.rdocumentation.org/packages/Biobase/versions/2.26.0/topics/ExpressionSet?">https://www.rdocumentation.org/packages/Biobase/versions/2.26.0/topics/ExpressionSet?</a></div>
<div><a href="https://www.rdocumentation.org/packages/Biobase/versions/2.26.0/topics/eSet?">https://www.rdocumentation.org/packages/Biobase/versions/2.26.0/topics/eSet?</a></div>
<div><a href="https://www.rdocumentation.org/packages/lumi/versions/2.24.0/topics/LumiBatch-class">https://www.rdocumentation.org/packages/lumi/versions/2.24.0/topics/LumiBatch-class</a></div>
<div><a href="https://www.rdocumentation.org/packages/GenomicFeatures/versions/1.24.4/topics/TxDb-class">https://www.rdocumentation.org/packages/GenomicFeatures/versions/1.24.4/topics/TxDb-class</a></div>
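<div>These functions follow a pattern, and the site also has a search interface, so it is easy to see how each class is defined, which slots it has, and which functions are defined to operate on it.</div>
<div></div>
<div>I had to combine the calls myself as pData(featureData(x.lumi)) to extract the ProbeID and TargetID I wanted from the x.lumi object:</div>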
<blockquote>
<div>&gt; head(pData(featureData(x.lumi)))<br />
<span style="color: #ff0000;">ProbeID TargetID</span><br />
6450255 6450255 7A5<br />
2570615 2570615 A1BG<br />
6370619 6370619 A1BG<br />
2600039 2600039 A1CF<br />
2650615 2650615 A1CF<br />
5340672 5340672 A1CF</div>
</blockquote>
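<div>Previously I could not find this even after reading the manual to pieces!</div>
<div>In fact, you only need to call class() on your object to learn its class name, and then methods() lists every function available to operate on it:</div>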
<div>
<div>&gt; class(x.lumi)</div>
<div>[1] "LumiBatch"</div>
<div>attr(,"package")</div>
<div>[1] "lumi"</div>
<div>&gt; methods(class='LumiBatch')</div>
<div>[1] $ $&lt;- [ [[ [[&lt;- abstract annotation annotation&lt;-</div>
<div>[9] as.matrix asBigMatrix assayData assayData&lt;- beadNum beadNum&lt;- boxplot classVersion</div>
<div>[17] classVersion&lt;- coerce combine controlData controlData&lt;- density description description&lt;-</div>
<div>[25] detection detection&lt;- dim dimnames dimnames&lt;- dims esApply experimentData</div>
<div>[33] experimentData&lt;- exprs exprs&lt;- fData fData&lt;- featureData featureData&lt;- featureNames</div>
<div>[41] featureNames&lt;- fvarLabels fvarLabels&lt;- fvarMetadata fvarMetadata&lt;- getHistory hist initialize</div>
<div>[49] isCurrent isVersioned makeDataPackage MAplot notes notes&lt;- pairs pData</div>
<div>[57] pData&lt;- phenoData phenoData&lt;- plot preproc preproc&lt;- protocolData protocolData&lt;-</div>
<div>[65] pubMedIds pubMedIds&lt;- rowMedians rowQ sampleNames sampleNames&lt;- se.exprs se.exprs&lt;-</div>
<div>[73] show storageMode storageMode&lt;- summary updateObject updateObjectTo varLabels varLabels&lt;-</div>
<div>[81] varMetadata varMetadata&lt;- write.exprs</div>
<div>see '?methods' for accessing help and source code</div>
<div>&gt;</div>
</div>
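<div>The same lookup works on any S4 class, which is easy to try without installing lumi. The sketch below defines a small hypothetical class, ToyBatch, with one accessor, probeTable, and then uses class() and methods() in the same way as above. Both names are invented for illustration and do not exist in any package.</div>

```r
# Sketch only: a toy S4 class to show how class() and methods() reveal what an object supports.
library(methods)

# Hypothetical class with an expression matrix and a probe annotation table
setClass("ToyBatch", representation(exprs = "matrix", probes = "data.frame"))

# Hypothetical accessor generic and its method for ToyBatch
setGeneric("probeTable", function(object) standardGeneric("probeTable"))
setMethod("probeTable", "ToyBatch", function(object) object@probes)

toy <- new("ToyBatch",
           exprs = matrix(1:4, nrow = 2),
           probes = data.frame(ProbeID = c("6450255", "2570615"),
                               TargetID = c("7A5", "A1BG"),
                               stringsAsFactors = FALSE))

class(toy)                    # the class name to look up
methods(class = "ToyBatch")   # functions with a method for this class
probeTable(toy)               # use the accessor that the listing revealed
```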
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1951.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Processing Illumina bead expression arrays with the lumi package</title>
		<link>http://www.bio-info-trainee.com/1944.html</link>
		<comments>http://www.bio-info-trainee.com/1944.html#comments</comments>
		<pubDate>Sat, 15 Oct 2016 12:01:03 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[芯片数据处理]]></category>
		<category><![CDATA[bioconductor]]></category>
		<category><![CDATA[illumina]]></category>
		<category><![CDATA[lumi]]></category>
		<category><![CDATA[芯片]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1944</guid>
		<description><![CDATA[The expression arrays people know best are of course the Affymetrix series, and the analysis routine is simple: &#8230; <a href="http://www.bio-info-trainee.com/1944.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
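				<content:encoded><![CDATA[<p>The expression arrays people know best are of course the Affymetrix series, and the analysis routine is simple: the R affy package turns CEL files into an expression matrix with the RMA or MAS5 method. Illumina arrays are slightly different. Their raw data come at 3 levels, and what you usually get is <span style="color: #ff0000;">Processed data</span> (<a href="ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE30nnn/GSE30669/suppl/GSE30669_HEK_Sample_Probe_Profile.txt.gz" target="_blank">example</a>), but a series of statistical steps is still needed to obtain an expression matrix. I like Bioconductor, so below I describe how to process this array data with the lumi package.</p>
<div>The lumi package comes with both example code and a manual; working through them step by step is enough.</div>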
<div><a href="http://www.bioconductor.org/packages/release/bioc/vignettes/lumi/inst/doc/lumi.R">http://www.bioconductor.org/packages/release/bioc/vignettes/lumi/inst/doc/lumi.R</a></div>
<div><a href="http://www.bioconductor.org/packages/release/bioc/vignettes/lumi/inst/doc/lumi.pdf">http://www.bioconductor.org/packages/release/bioc/vignettes/lumi/inst/doc/lumi.pdf</a></div>
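<div>If all you want is to analyse data, it is not hard. But every analysis step hides a set of statistical methods, and understanding them thoroughly is much harder.</div>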
<p><span id="more-1944"></span></p>
<div>data(example.lumi)</div>
<div>lumi.N.Q &lt;- <span style="color: #ff0000;">lumiExpresso</span>(example.lumi)</div>
<div>dataMatrix &lt;- <span style="color: #ff0000;">exprs</span>(lumi.N.Q)</div>
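<div>The key goal is the expression matrix. The package wraps this in one function: lumiExpresso works directly on a LumiBatch object and combines the four steps N, T, B and Q (normalization, transformation, background correction, quality control); the Q step alone includes 8 kinds of statistical plots. The package's paper describes this in detail: <a href="http://bioinformatics.oxfordjournals.org/content/24/13/1547.full" target="_blank">http://bioinformatics.oxfordjournals.org/content/24/13/1547.full</a></div>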
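<div>To make the N (normalization) step less of a black box, here is a minimal base-R sketch of quantile normalization on a toy matrix. It is an illustration only, not lumi's code: as far as I know lumiExpresso normalizes with a quantile method by default, but its implementation and tie handling may differ. The names quantile_normalize and toy are hypothetical.</div>

```r
# Quantile normalization: force every sample (column) to share one distribution.
quantile_normalize = function(mat) {
  # rank of each value within its own column (ties broken by order of appearance)
  ranks = apply(mat, 2, rank, ties.method = "first")
  # reference distribution: mean of the k-th smallest values across columns
  ref = unname(rowMeans(apply(mat, 2, sort)))
  # replace each value by the reference value that has the same rank
  out = apply(ranks, 2, function(r) ref[r])
  dimnames(out) = dimnames(mat)
  out
}

# toy data (assumption): 10 probes x 4 samples with skewed intensities
set.seed(1)
toy = matrix(rexp(40, rate = 0.1), nrow = 10,
             dimnames = list(paste0("probe", 1:10), paste0("s", 1:4)))
toy_norm = quantile_normalize(toy)

# after normalization every column holds the same sorted values
apply(toy_norm, 2, sort)
```

<div>Within each sample the ordering of the probes is unchanged; only the values are mapped onto a common scale, which is what makes samples comparable.</div>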
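<div>A LumiBatch object is what you get when <span style="color: #ff0000;">lumiR.batch reads</span> the output that the Illumina Bead Studio toolkit produced from the chip files, which is the kind of data we usually <span style="color: #ff0000;">receive from a vendor or download from GEO (level 3 processed data)</span>, as shown below:</div>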
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/10/illumina-microarray-level3-data-example.png"><img class="alignnone size-full wp-image-1945" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/10/illumina-microarray-level3-data-example.png" alt="illumina-microarray-level3-data-example" width="704" height="666" /></a></div>
<div></div>
<div>The <span style="color: #ff0000;">test file Barnes_gene_profile.txt</span> used by the package can be downloaded from <a href="http://www.chibi.ubc.ca/wp-content/uploads/2013/02/">http://www.chibi.ubc.ca/wp-content/uploads/2013/02/</a>.</div>
<div>
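<div>If you download public data from GEO, every study also provides chip annotation files (such as the two GPL files listed next), which are of little use here; you only need a file like <span style="color: #ff0000;">non-normalized.txt.gz</span>.</div>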
<div>GPL10558_HumanHT-12_V4_0_R1_15002873_B.txt.gz 13.1 Mb</div>
<div>GPL10558_HumanHT-12_V4_0_R2_15002873_B.txt.gz 13.1 Mb</div>
<div>For example, I downloaded <a href="ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE30nnn/GSE30669/suppl/GSE30669_HEK_Sample_Probe_Profile.txt.gz">ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE30nnn/GSE30669/suppl/GSE30669_HEK_Sample_Probe_Profile.txt.gz</a>. The lumiR.batch function of the lumi package reads this file directly into a LumiBatch object, lumiExpresso then processes it, and exprs extracts the expression matrix.</div>
<blockquote>
<div>rm(list=ls())</div>
<div>library(lumi)</div>
<div># setwd('G:/array/illumina-beadseed-v4/lumi_example')</div>
<div># fileName &lt;- 'Barnes_gene_profile.txt' # Not Run</div>
<div>## First, build the expression matrix yourself from the Illumina result file with the lumi package.</div>
<div>setwd('G:/array/illumina-beadseed-v4/GSE30669')</div>
<div>fileName &lt;- 'GSE30669_HEK_Sample_Probe_Profile.txt' # the file downloaded from GEO</div>
<div>x.lumi &lt;- lumiR.batch(fileName) ##, sampleInfoFile='sampleInfo.txt')</div>
<div>pData(phenoData(x.lumi))</div>
<div>## Do all the default preprocessing in one step</div>
<div>lumi.N.Q &lt;- lumiExpresso(x.lumi)</div>
<div>### retrieve normalized data</div>
<div>dataMatrix &lt;- exprs(lumi.N.Q)</div>
<div>## Next, download the processed expression matrix from GEO.</div>
<div>rm(list=ls())</div>
<div>library(GEOquery)</div>
<div>library(limma)</div>
<div>GSE30669 &lt;- getGEO('GSE30669', destdir=".",getGPL = F)</div>
<div>exprSet=exprs(GSE30669[[1]])</div>
<div>GSE30669[[1]]</div>
<div>pdata=pData(GSE30669[[1]])</div>
<div>exprSet=exprs(GSE30669[[1]])</div>
<div>Both dataMatrix from the first part and exprSet from the second part are clearly expression matrices of the kind we want.</div>
</blockquote>
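<div>One way to check how closely two such matrices agree is to align them on shared probe and sample names and correlate them sample by sample. A minimal sketch on simulated data: compare_matrices and the toy matrices are hypothetical, and with real data the sample names of the two matrices may first have to be mapped onto each other.</div>

```r
compare_matrices = function(a, b) {
  probes = intersect(rownames(a), rownames(b))
  samples = intersect(colnames(a), colnames(b))
  # per-sample Pearson correlation over the shared probes
  vapply(samples, function(s) cor(a[probes, s], b[probes, s]), numeric(1))
}

# toy data (assumption): 20 probes x 3 samples
set.seed(2)
a = matrix(rnorm(60, mean = 8), nrow = 20,
           dimnames = list(paste0("ILMN_", 1:20), c("GSM1", "GSM2", "GSM3")))
# b: the same values plus small noise, rows shuffled, one probe missing
b = (a + rnorm(60, sd = 0.05))[sample(2:20), ]

compare_matrices(a, b)
```

<div>Correlations close to 1 suggest that the two processing routes agree up to noise; clearly lower values point to a different normalization or a different probe set.</div>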
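<div>## Processing the raw file yourself matters because an expression matrix prepared by someone else may not meet your normalization requirements.</div>
<div>This chip usually processes 12 samples, and GEO makes it easy to see how the samples are grouped.</div>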
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/10/tmp.png"><img class="alignnone size-full wp-image-1946" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/10/tmp.png" alt="tmp" width="514" height="308" /></a></div>
<div>
<div>The lumi package even provides a function, produceGEOSubmissionFile, that converts our array data directly into the format required by NCBI GEO.</div>
<div></div>
<div><strong><span style="color: #ff0000;">Finally, the official download page is important: https://support.illumina.com/array/array_kits/humanht-12_v4_expression_beadchip_kit/downloads.html </span></strong></div>
<div></div>
<div></div>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1944.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Deriving mutation signatures from MAF mutation data with the SomaticSignatures package</title>
		<link>http://www.bio-info-trainee.com/1623.html</link>
		<comments>http://www.bio-info-trainee.com/1623.html#comments</comments>
		<pubDate>Fri, 06 May 2016 12:26:19 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[basic software]]></category>
		<category><![CDATA[bioconductor]]></category>
		<category><![CDATA[mutation]]></category>
		<category><![CDATA[signature]]></category>
		<category><![CDATA[SomaticSignatures]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1623</guid>
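		<description><![CDATA[The concept of a mutation signature is fairly recent; from what I found in the literature, it first appeared in &#8230; <a href="http://www.bio-info-trainee.com/1623.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>The concept of a mutation signature is fairly recent. From what I found in the literature, it first appeared in a 2013 <a href="http://www.nature.com/nature/journal/v500/n7463/full/nature12477.html">Nature paper</a>, and it is mainly used to describe the somatic mutation patterns of cancer patients.</p>
<p>You first have to analyse cancer sample data yourself to obtain somatic mutations. <a href="https://wiki.nci.nih.gov/display/TCGA/TCGA+MAF+Files">The TCGA project has by now produced a great many somatic mutation results</a>, so you can pick a cancer type you are interested in and work out its mutation signatures.</p>
<p>Here I recommend one tool, a package from the Bioconductor collection for R: <a href="http://www.bioconductor.org/packages/3.3/bioc/vignettes/SomaticSignatures/inst/doc/SomaticSignatures-vignette.html">SomaticSignatures</a></p>
<p>Its manual is already very detailed. Once you understand the concept of a mutation signature, the package is easy to use. Writing your own script is also quite easy: for each mutation, look up the base immediately before and after it in the genome to form a trinucleotide mutation pattern, then tally how the mutations are distributed over the 96 patterns.</p>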
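<p>The "write your own script" idea can be sketched in a few lines of base R. This is an illustration under assumptions, not the package's code: the flanking bases are given directly in a toy table instead of being looked up in a reference genome, mutations with a purine reference base are folded onto the opposite strand (the usual convention that yields 96 classes), and the label layout "CT G.A" and the names mutation_motif and snv are my own choices.</p>

```r
bases = c("A", "C", "G", "T")
comp = c(A = "T", C = "G", G = "C", T = "A")

# 6 substitution classes, written with a pyrimidine (C or T) reference base
subs = c("CA", "CG", "CT", "TA", "TC", "TG")
# 16 flanking contexts written as 5'base.3'base
contexts = as.vector(outer(bases, bases, paste, sep = "."))
# 6 x 16 = 96 motif labels such as "CT G.A"
all_motifs = as.vector(outer(subs, contexts, paste))

mutation_motif = function(ref, alt, five, three) {
  purine = ref %in% c("A", "G")
  # a purine reference is reported on the opposite strand:
  # complement ref and alt, then swap and complement the two flanks
  ref2 = ifelse(purine, comp[ref], ref)
  alt2 = ifelse(purine, comp[alt], alt)
  five2 = ifelse(purine, comp[three], five)
  three2 = ifelse(purine, comp[five], three)
  paste0(ref2, alt2, " ", five2, ".", three2)
}

# toy SNVs (assumption): reference, alternative, 5' neighbour, 3' neighbour
snv = data.frame(ref = c("C", "G", "T", "C"), alt = c("T", "A", "C", "T"),
                 five = c("A", "T", "G", "A"), three = c("G", "C", "A", "G"),
                 stringsAsFactors = FALSE)
motifs = mutation_motif(snv$ref, snv$alt, snv$five, snv$three)

# tally over all 96 classes, then show the classes that occur
counts = table(factor(motifs, levels = all_motifs))
counts[counts != 0]
```

<p>The second toy mutation (G to A, flanked by T and C) ends up as "CT G.A", because on the opposite strand it is a C to T change flanked by G and A.</p>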
<p>Here is a quick walkthrough of how to use the package.</p>
<p>First, install and load the required packages:</p>
<div>library(SomaticSignatures)  ## the program itself</div>
<div>library(SomaticCancerAlterations) ## bundled test data</div>
<div>library(BSgenome.Hsapiens.1000genomes.hs37d5)  ## our reference genome</div>
<div>library(VariantAnnotation)</div>
<div>## This object is important: the GRanges class of the GenomicRanges package</div>
<div>
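<div>## The SomaticCancerAlterations package provides the test data, which come from exome sequencing projects on 8 different cancer types.</div>
<div>sca_metadata = scaMetadata()</div>
<div>### This shows a description of the 8 projects; each of them sequenced several hundred samples. We only care about the mutation data, and only about the somatic mutations.</div>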
<div>sca_data = unlist(scaLoadDatasets())</div>
</div>
<p>Then build a GRanges object from the mutation data; see my earlier blog posts for details.</p>
<div>sca_data$study = factor(gsub("(.*)_(.*)", "\\1", toupper(names(sca_data))))</div>
<div>sca_data = unname(subset(sca_data, Variant_Type %in% "SNP"))</div>
<div>sca_data = keepSeqlevels(sca_data, hsAutosomes())</div>
<div>## This object is the input to the package</div>
<div>sca_vr = VRanges(</div>
<div>    seqnames = seqnames(sca_data),</div>
<div>    ranges = ranges(sca_data),</div>
<div>    ref = sca_data$Reference_Allele,</div>
<div>    alt = sca_data$Tumor_Seq_Allele2,</div>
<div>    sampleNames = sca_data$Patient_ID,</div>
<div>    seqinfo = seqinfo(sca_data),</div>
<div>    study = sca_data$study</div>
<div>)</div>
<div>## You can also read a local somatic mutation file directly with readVcf or readMutect</div>
<div>## Extract the mutation data and build a ranges object.</div>
<div>sca_vr</div>
<div></div>
<div>
<div>### A quick look at how many somatic mutations each study has</div>
<div>sort(table(sca_vr$study), decreasing = TRUE)</div>
<div>    LUAD   SKCM   HNSC   LUSC   KIRC    GBM   THCA     OV</div>
<div>   208724 200589  67125  61485  24158  19938   6716   5872</div>
<div>## Use mutationContext with the ranges object and the downloaded reference genome to get the sequence context of each mutation.</div>
<div>sca_motifs = mutationContext(sca_vr, BSgenome.Hsapiens.1000genomes.hs37d5)</div>
<div>head(sca_motifs)</div>
<div>## The ranges object gains two new columns: alteration and context</div>
<div></div>
<div>## Next, from the mutation context data, build the matrix M of the form {motifs × studies}</div>
<div>sca_mm = motifMatrix(sca_motifs, group = "study", normalize = TRUE)</div>
<div>## The matrix holds the frequencies of the 96 mutation types, not their counts</div>
<div>head(round(sca_mm, 4))</div>
<div>## Then plot the mutation spectrum of each study directly</div>
<div>plotMutationSpectrum(sca_motifs, "study")</div>
<div> <a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/05/mutation-spectrum.png"><img class="alignnone wp-image-1625 size-medium" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/05/mutation-spectrum-260x260.png" alt="mutation spectrum" width="260" height="260" /></a></div>
<div>## The spectrum still has to be decomposed into signatures!</div>
<div>## The package offers two methods for this, NMF and PCA</div>
<div>n_sigs = 5</div>
<div>sigs_nmf = identifySignatures(sca_mm, n_sigs, nmfDecomposition)</div>
<div>sigs_pca = identifySignatures(sca_mm, n_sigs, pcaDecomposition)</div>
<div></div>
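<div>What "decomposing the spectrum into signatures" means for NMF can be shown in base R. This is a bare-bones sketch using the classic multiplicative update rules; it is not what identifySignatures runs internally (as far as I know it hands the factorization to a dedicated NMF implementation), and nmf_sketch and the toy matrix are hypothetical.</div>

```r
# NMF by multiplicative updates: M (motifs x studies) is approximated by W %*% H
nmf_sketch = function(M, k, n_iter = 500, seed = 1) {
  set.seed(seed)
  W = matrix(runif(nrow(M) * k), nrow(M), k)   # motifs x signatures
  H = matrix(runif(k * ncol(M)), k, ncol(M))   # signatures x studies
  eps = 1e-9                                   # guards against division by zero
  for (i in seq_len(n_iter)) {
    H = H * (t(W) %*% M) / (t(W) %*% W %*% H + eps)
    W = W * (M %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  list(signatures = W, samples = H, rss = sum((M - W %*% H)^2))
}

# toy spectrum (assumption): 96 motifs x 8 studies built from 2 hidden signatures
set.seed(3)
true_W = matrix(runif(96 * 2), 96, 2)
true_H = matrix(runif(2 * 8), 2, 8)
toy_mm = true_W %*% true_H

fit = nmf_sketch(toy_mm, k = 2)
dim(fit$signatures)   # 96 motifs x 2 signatures
dim(fit$samples)      # 2 signatures x 8 studies
fit$rss               # residual sum of squares of the approximation
```

<div>Both factors stay non-negative throughout, which is why the columns of the first factor can be read as signatures and the rows of the second as their contributions to each study.</div>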
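<div>## Many functions are also provided for exploring the result: signatures, samples, observed and fitted.</div>
<div>The one we need to master is assessNumberSignatures, which explores how many signatures the spectrum should be decomposed into</div>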
<div>n_sigs = 2:8</div>
<div>gof_nmf = assessNumberSignatures(sca_mm, n_sigs, nReplicates = 5)</div>
<div>gof_pca = assessNumberSignatures(sca_mm, n_sigs, pcaDecomposition)</div>
<div>plotNumberSignatures(gof_nmf)  ## visualize the result</div>
<div></div>
<div>## Next, visualize how much each signature contributes to each sample (here the samples are the cancer types)</div>
<div>library(ggplot2)</div>
<div>plotSignatureMap(sigs_nmf) + ggtitle("Somatic Signatures: NMF - Heatmap")</div>
<div>plotSignatures(sigs_nmf) + ggtitle("Somatic Signatures: NMF - Barchart")</div>
<div>plotObservedSpectrum(sigs_nmf)</div>
<div>plotFittedSpectrum(sigs_nmf)</div>
<div>plotSampleMap(sigs_nmf)</div>
<div>plotSamples(sigs_nmf)</div>
<div></div>
<div>Likewise, the PCA results can be visualized in the same way:</div>
<div>plotSignatureMap(sigs_pca) + ggtitle("Somatic Signatures: PCA - Heatmap")</div>
<div>plotSignatures(sigs_pca) + ggtitle("Somatic Signatures: PCA - Barchart")</div>
<div>plotFittedSpectrum(sigs_pca)</div>
<div>plotObservedSpectrum(sigs_pca)</div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/05/mutation-signature-NMF.png"><img class="alignnone  wp-image-1624" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/05/mutation-signature-NMF.png" alt="mutation signature NMF" width="608" height="608" /></a></div>
<div>It is worth noting that all the plot functions are based on ggplot, so the figures can be customized further.</div>
<div>p = plotSamples(sigs_nmf)</div>
<div></div>
<div>## (re)move the legend</div>
<div>p = p + theme(legend.position = "none")</div>
<div>## (re)label the axis</div>
<div>p = p + xlab("Studies")</div>
<div>## add a title</div>
<div>p = p + ggtitle("Somatic Signatures in TCGA WES Data")</div>
<div>## change the color scale</div>
<div>p = p + scale_fill_brewer(palette = "Blues")</div>
<div>## decrease the size of x-axis labels</div>
<div>p = p + theme(axis.text.x = element_text(size = 9))</div>
<div></div>
<div>### The mutation context matrix can of course also be clustered</div>
<div>clu_motif = clusterSpectrum(sca_mm, "motif")</div>
<div>library(ggdendro)</div>
<div>p = ggdendrogram(clu_motif, rotate = TRUE)</div>
<div>p</div>
<div></div>
<div></div>
<div>## Finally, since we combined 8 different studies, batch effects are inevitable and should be removed where possible.</div>
</div>
<div></div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1623.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Differential analysis of microarray data with the samr package</title>
		<link>http://www.bio-info-trainee.com/1608.html</link>
		<comments>http://www.bio-info-trainee.com/1608.html#comments</comments>
		<pubDate>Thu, 05 May 2016 11:43:04 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[basic databases]]></category>
		<category><![CDATA[basic software]]></category>
		<category><![CDATA[bioconductor]]></category>
		<category><![CDATA[limma]]></category>
		<category><![CDATA[samr]]></category>
		<category><![CDATA[differential analysis]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1608</guid>
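		<description><![CDATA[There are already plenty of tools and packages for differential analysis, and limma is very mature, so I was not planning &#8230; <a href="http://www.bio-info-trainee.com/1608.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<blockquote><p>There are already plenty of tools and packages for differential analysis, and limma is very mature, so I was not planning to cover this one. But a classmate happened to ask about the package, so I gave it a quick test, checked whether it differs from limma, and wrote down the workflow.</p></blockquote>
<blockquote><p>Learning a package is really simple: find its official page and read the manual. <a href="https://cran.r-project.org/web/packages/samr/samr.pdf">Manual link</a></p>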
<p>&nbsp;</p></blockquote>
<p><span id="more-1608"></span></p>
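<p>The samr package is even simpler. It centres on one function, <strong>SAM</strong>, which is wrapped as two functions depending on the kind of data: <strong>SAMseq</strong> for high-throughput sequencing data and <strong>samr</strong> for microarray data. Here I only cover microarray data and then make a simple comparison with limma.</p>
<p>So we only need to prepare the data and learn how to use the samr function.</p>
<p>We again use the test data from the CLL package to demonstrate. The first step is to build the expression matrix and the group labels.</p>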
<blockquote>
<pre class="r"><code class="r"><span class="identifier">suppressPackageStartupMessages</span><span class="paren">(</span><span class="keyword">library</span><span class="paren">(</span><span class="identifier">CLL</span><span class="paren">)</span><span class="paren">)</span>
<span class="identifier">data</span><span class="paren">(</span><span class="identifier">sCLLex</span><span class="paren">)</span>
<span class="identifier">exprSet</span><span class="operator">=</span><span class="identifier">exprs</span><span class="paren">(</span><span class="identifier">sCLLex</span><span class="paren">)</span>   <span class="comment">## sCLLex is an object provided by the CLL package</span>
<span class="identifier">samples</span><span class="operator">=</span><span class="identifier">sampleNames</span><span class="paren">(</span><span class="identifier">sCLLex</span><span class="paren">)</span>
<span class="identifier">pdata</span><span class="operator">=</span><span class="identifier">pData</span><span class="paren">(</span><span class="identifier">sCLLex</span><span class="paren">)</span>
<span class="identifier">group_list</span><span class="operator">=</span><span class="identifier">as.character</span><span class="paren">(</span><span class="identifier">pdata</span><span class="paren">[</span>,<span class="number">2</span><span class="paren">]</span><span class="paren">)</span>
<span class="identifier">group_list</span></code></pre>
<pre><code>##  [1] "progres." "stable"   "progres." "progres." "progres." "progres."
##  [7] "stable"   "stable"   "progres." "stable"   "progres." "stable"  
## [13] "progres." "stable"   "stable"   "progres." "progres." "progres."
## [19] "progres." "progres." "progres." "stable"</code></pre>
<pre class="r"><code class="r"><span class="identifier">as.numeric</span><span class="paren">(</span><span class="identifier">as.factor</span><span class="paren">(</span><span class="identifier">group_list</span><span class="paren">)</span><span class="paren">)</span></code></pre>
<pre><code>##  [1] 1 2 1 1 1 1 2 2 1 2 1 2 1 2 2 1 1 1 1 1 1 2</code></pre>
</blockquote>
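<p>The expression matrix exprSet and the group labels group_list can now be used directly for differential analysis. The package's requirement for group labels is somewhat unusual: it needs a numeric vector such as 1,1,1,2,2,2, so I used as.numeric(as.factor(group_list)) when building the input list for samr.</p>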
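<p>Which group ends up as 1 and which as 2 depends on how factor() orders its levels. A short base-R sketch with a shortened label vector (the five labels are made up for illustration):</p>

```r
group_list = c("progres.", "stable", "progres.", "progres.", "stable")

# factor() sorts the levels alphabetically, so "progres." becomes 1 and "stable" 2
f = factor(group_list)
levels(f)
as.integer(f)   # 1 2 1 1 2

# to choose the coding yourself, state the level order explicitly
y = as.integer(factor(group_list, levels = c("stable", "progres.")))
y               # 2 1 2 2 1
```

<p>Knowing the coding matters when reading the results, because the sign of a group difference depends on which group is labelled 1 and which 2.</p>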
<blockquote>
<pre class="r"><code class="r"><span class="identifier">suppressPackageStartupMessages</span><span class="paren">(</span><span class="keyword">library</span><span class="paren">(</span><span class="identifier">samr</span><span class="paren">)</span><span class="paren">)</span>
<span class="identifier">data</span><span class="operator">=</span><span class="identifier">list</span><span class="paren">(</span><span class="identifier">x</span><span class="operator">=</span><span class="identifier">exprSet</span>,<span class="identifier">y</span><span class="operator">=</span><span class="identifier">as.numeric</span><span class="paren">(</span><span class="identifier">as.factor</span><span class="paren">(</span><span class="identifier">group_list</span><span class="paren">)</span><span class="paren">)</span>, 
          <span class="identifier">geneid</span><span class="operator">=</span><span class="identifier">as.character</span><span class="paren">(</span><span class="number">1</span><span class="operator">:</span><span class="identifier">nrow</span><span class="paren">(</span><span class="identifier">exprSet</span><span class="paren">)</span><span class="paren">)</span>,
          <span class="identifier">genenames</span><span class="operator">=</span><span class="identifier">rownames</span><span class="paren">(</span><span class="identifier">exprSet</span><span class="paren">)</span>, 
          <span class="identifier">logged2</span><span class="operator">=</span><span class="literal">TRUE</span>
<span class="paren">)</span>
<span class="identifier">samr.obj</span><span class="operator">&lt;-</span><span class="identifier">samr</span><span class="paren">(</span><span class="identifier">data</span>, <span class="identifier">resp.type</span><span class="operator">=</span><span class="string">"Two class unpaired"</span>, <span class="identifier">nperms</span><span class="operator">=</span><span class="number">100</span><span class="paren">)</span></code></pre>
</blockquote>
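<p>That is really all it takes. The important parts are tuning the function's parameters and understanding the result it returns (the samr.obj object matters a lot; it decides whether you can really make use of samr).</p>
<p>The genenames here are actually probe IDs; change them for a real analysis. I set nperms to 100, which can also be changed; 1000 is typical.</p>
<p>Besides finding differential genes directly, the package has a few standalone functions.</p>
<p>The first one normalizes the expression matrix:</p>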
<blockquote>
<pre class="r"><code class="r"><span class="identifier">x.norm</span> <span class="operator">&lt;-</span> <span class="identifier">samr.norm.data</span><span class="paren">(</span><span class="identifier">data</span><span class="operator">$</span><span class="identifier">x</span><span class="paren">)</span>
<span class="identifier">par</span><span class="paren">(</span><span class="identifier">mfrow</span><span class="operator">=</span><span class="identifier">c</span><span class="paren">(</span><span class="number">1</span>,<span class="number">2</span><span class="paren">)</span><span class="paren">)</span>
<span class="identifier">boxplot</span><span class="paren">(</span><span class="identifier">exprSet</span>, <span class="identifier">col</span> <span class="operator">=</span> <span class="identifier">rainbow</span><span class="paren">(</span><span class="identifier">ncol</span><span class="paren">(</span><span class="identifier">exprSet</span><span class="paren">)</span><span class="paren">)</span>,<span class="identifier">main</span><span class="operator">=</span><span class="string">"before normalization"</span>,<span class="identifier">las</span><span class="operator">=</span><span class="number">2</span><span class="paren">)</span>
<span class="identifier">boxplot</span><span class="paren">(</span><span class="identifier">x.norm</span>,  <span class="identifier">col</span> <span class="operator">=</span> <span class="identifier">rainbow</span><span class="paren">(</span><span class="identifier">ncol</span><span class="paren">(</span><span class="identifier">exprSet</span><span class="paren">)</span><span class="paren">)</span>,<span class="identifier">main</span><span class="operator">=</span><span class="string">"after normalization"</span>,<span class="identifier">las</span><span class="operator">=</span><span class="number">2</span><span class="paren">)</span></code></pre>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/05/QQ截图20160505194154.png"><img class="alignnone size-full wp-image-1609" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/05/QQ截图20160505194154.png" alt="QQ截图20160505194154" width="720" height="503" /></a></p>
</blockquote>
<p>&nbsp;</p>
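<p>Judging from the plots, there seems to be little difference.</p>
<p>I will not go through the other functions one by one; you can explore them yourself.</p>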
<p>* samr.plot(samr.obj, del, min.foldchange=0)</p>
<p>* samr.plot(samr.obj, del=.3)</p>
<p>* samr.assess.samplesize.obj&lt;- samr.assess.samplesize(samr.obj, data, log2(1.5))</p>
<p>* samr.assess.samplesize.plot(samr.assess.samplesize.obj)</p>
<p>Let us focus on how the differences found by samr compare with those found by limma.</p>
<blockquote>
<pre class="r"><code class="r"><span class="comment">## first, get the p-values of the samr differential test</span>
<span class="identifier">pv</span><span class="operator">=</span><span class="identifier">samr.pvalues.from.perms</span><span class="paren">(</span><span class="identifier">samr.obj</span><span class="operator">$</span><span class="identifier">tt</span>, <span class="identifier">samr.obj</span><span class="operator">$</span><span class="identifier">ttstar</span><span class="paren">)</span>
<span class="comment">## then get the p-values of the limma differential test</span>
<span class="keyword">library</span><span class="paren">(</span><span class="identifier">limma</span><span class="paren">)</span> 
<span class="identifier">design</span><span class="operator">=</span><span class="identifier">model.matrix</span><span class="paren">(</span><span class="operator">~</span><span class="identifier">factor</span><span class="paren">(</span><span class="identifier">sCLLex</span><span class="operator">$</span><span class="identifier">Disease</span><span class="paren">)</span><span class="paren">)</span>
<span class="identifier">fit</span><span class="operator">=</span><span class="identifier">lmFit</span><span class="paren">(</span><span class="identifier">sCLLex</span>,<span class="identifier">design</span><span class="paren">)</span>
<span class="identifier">fit</span><span class="operator">=</span><span class="identifier">eBayes</span><span class="paren">(</span><span class="identifier">fit</span><span class="paren">)</span>
<span class="identifier">options</span><span class="paren">(</span><span class="identifier">digits</span> <span class="operator">=</span> <span class="number">4</span><span class="paren">)</span>
<span class="identifier">DEG_limma</span><span class="operator">=</span><span class="identifier">topTable</span><span class="paren">(</span><span class="identifier">fit</span>,<span class="identifier">coef</span><span class="operator">=</span><span class="number">2</span>,<span class="identifier">adjust</span><span class="operator">=</span><span class="string">'BH'</span>,<span class="identifier">n</span><span class="operator">=</span><span class="literal">Inf</span><span class="paren">)</span> 
<span class="identifier">pv_limma</span><span class="operator">=</span><span class="identifier">DEG_limma</span><span class="operator">$</span><span class="identifier">P.Value</span>
<span class="identifier">names</span><span class="paren">(</span><span class="identifier">pv_limma</span><span class="paren">)</span><span class="operator">=</span><span class="identifier">rownames</span><span class="paren">(</span><span class="identifier">DEG_limma</span><span class="paren">)</span>
<span class="identifier">head</span><span class="paren">(</span><span class="identifier">pv</span><span class="paren">[</span><span class="identifier">sort</span><span class="paren">(</span><span class="identifier">names</span><span class="paren">(</span><span class="identifier">pv</span><span class="paren">)</span><span class="paren">)</span><span class="paren">]</span><span class="paren">)</span></code></pre>
<pre><code>##  100_g_at   1000_at   1001_at 1002_f_at 1003_s_at   1004_at 
##    0.2531    0.4144    0.5671    0.5686    0.4687    0.6340</code></pre>
<pre class="r"><code class="r"><span class="identifier">head</span><span class="paren">(</span><span class="identifier">pv_limma</span><span class="paren">[</span><span class="identifier">sort</span><span class="paren">(</span><span class="identifier">names</span><span class="paren">(</span><span class="identifier">pv_limma</span><span class="paren">)</span><span class="paren">)</span><span class="paren">]</span><span class="paren">)</span></code></pre>
<pre><code>##  100_g_at   1000_at   1001_at 1002_f_at 1003_s_at   1004_at 
##    0.2497    0.4312    0.5349    0.5498    0.4361    0.6473</code></pre>
<pre class="r"><code class="r"><span class="identifier">cor</span><span class="paren">(</span><span class="identifier">pv</span><span class="paren">[</span><span class="identifier">sort</span><span class="paren">(</span><span class="identifier">names</span><span class="paren">(</span><span class="identifier">pv</span><span class="paren">)</span><span class="paren">)</span><span class="paren">]</span>,<span class="identifier">pv_limma</span><span class="paren">[</span><span class="identifier">sort</span><span class="paren">(</span><span class="identifier">names</span><span class="paren">(</span><span class="identifier">pv_limma</span><span class="paren">)</span><span class="paren">)</span><span class="paren">]</span><span class="paren">)</span></code></pre>
<pre><code>## [1] 0.9976</code></pre>
</blockquote>
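<p>Numerically there is no real difference, and the correlation coefficient is as high as 0.9976.</p>
<p>So the conclusion is that there is no need for so many packages: limma is enough, and even a plain t-test is fine.</p>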
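<p>The remark about the plain t-test is easy to try out. Below is a minimal base-R sketch on simulated data; row_ttest_p and the toy matrix are hypothetical, and t.test uses the Welch variant by default. On the real data, the same function could be applied to exprSet and group_list and its p-values correlated with pv and pv_limma.</p>

```r
# ordinary two-sample t-test, one probe (row) at a time
row_ttest_p = function(mat, group) {
  group = factor(group)
  apply(mat, 1, function(v) t.test(v ~ group)$p.value)
}

# toy data (assumption): 50 probes, 6 samples per group
set.seed(4)
group = rep(c("progres.", "stable"), each = 6)
toy = matrix(rnorm(50 * 12), nrow = 50,
             dimnames = list(paste0("probe", 1:50), NULL))
# make the first probe truly differential between the two groups
toy[1, group == "stable"] = toy[1, group == "stable"] + 5

pv_t = row_ttest_p(toy, group)
head(sort(pv_t), 3)
```

<p>With only a handful of samples per group the per-probe variance estimates are noisy, which is the situation in which moderated or permutation-based statistics are usually argued to help.</p>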
<p>plot and summary can also be applied directly to samr.obj, the result object returned by samr.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1608.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reading Affymetrix gene expression array data (CEL format) with the oligo package</title>
		<link>http://www.bio-info-trainee.com/1586.html</link>
		<comments>http://www.bio-info-trainee.com/1586.html#comments</comments>
		<pubDate>Sat, 23 Apr 2016 14:58:31 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[basic software]]></category>
		<category><![CDATA[affymetrix]]></category>
		<category><![CDATA[bioconductor]]></category>
		<category><![CDATA[oligo]]></category>
		<category><![CDATA[microarray data]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1586</guid>
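		<description><![CDATA[As mentioned earlier, the affy package supports only a limited set of chip platforms, mainly the hgu 95 and 133 series; [H &#8230; <a href="http://www.bio-info-trainee.com/1586.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>As mentioned earlier, the affy package supports only a limited set of chip platforms, mainly the hgu 95 and 133 series. The [HuGene-1_1-st] Affymetrix Human Gene 1.1 ST Array platform is also made by Affymetrix, but affy cannot process it, and this is where the oligo package comes in.</p>
<p>oligo is one of the Bioconductor packages for R. For our purposes it does one job: it reads Affymetrix gene expression array data in CEL format and turns it into an expression matrix.</p>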
<p><span id="more-1586"></span></p>
<p>As before, we need to download the raw data. One example: <a href="ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE48nnn/GSE48452/suppl/GSE48452_RAW.tar">GSE48452</a></p>
<p>After downloading, extract the archive into a directory and oligo can be used directly.</p>
<blockquote>
<div>library(oligo)</div>
<div>geneCELs=list.celfiles('<span style="color: #ff0000;"><strong>/path/GSE48452/cel_files/</strong></span>',listGzipped=T,full.names=T)</div>
<div># use full paths; CEL files usually come gzipped and do not need to be decompressed</div>
<div>affyGeneFS &lt;- read.celfiles(geneCELs)  ## read the CEL files</div>
<div>geneCore &lt;- rma(affyGeneFS, target = "core")  ## this step does the normalization and takes a while</div>
<div>genePS &lt;- rma(affyGeneFS, target = "probeset")</div>
<div># two summarization levels; we usually pick the transcript-level one (core)</div>
<div>## on this platform you also have to attach the probe annotation to the expression set yourself</div>
<div>featureData(genePS) &lt;- getNetAffx(genePS, "probeset")</div>
<div>featureData(geneCore) &lt;- getNetAffx(geneCore, "transcript")</div>
<div>## the probe IDs still need to be mapped to gene IDs, which is not covered here</div>
</blockquote>
<p>The resulting expression matrix should agree with the one on the GEO site; you can check this yourself against:</p>
<p>ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE48nnn/GSE48452/matrix/GSE48452_series_matrix.txt.gz</p>
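<p>For such a check the series matrix file first has to be read into R. A minimal base-R sketch, assuming the usual layout in which every metadata line starts with "!" and the data table sits between them; read_series_matrix is a hypothetical helper, and the tiny demo file is made up so that the example runs without a download.</p>

```r
read_series_matrix = function(path) {
  # metadata lines start with "!", so treat "!" as the comment character
  tab = read.delim(path, comment.char = "!", check.names = FALSE, row.names = 1)
  as.matrix(tab)
}

# made-up miniature file in the same layout
demo = tempfile(fileext = ".txt")
writeLines(c(
  "!Series_title\ttoy series",
  "!series_matrix_table_begin",
  "ID_REF\tGSM1\tGSM2",
  "probe_a\t7.1\t7.3",
  "probe_b\t5.0\t4.8",
  "!series_matrix_table_end"
), demo)

geo_mat = read_series_matrix(demo)
geo_mat
```

<p>read.delim also reads gzipped files directly, so the downloaded .txt.gz should not need to be unpacked first. Once both matrices are in memory they can be aligned by probe ID and compared.</p>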
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1586.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reading Affymetrix gene expression array data (CEL format) with the affy package</title>
		<link>http://www.bio-info-trainee.com/1580.html</link>
		<comments>http://www.bio-info-trainee.com/1580.html#comments</comments>
		<pubDate>Sat, 23 Apr 2016 14:50:46 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[basic software]]></category>
		<category><![CDATA[affy]]></category>
		<category><![CDATA[bioconductor]]></category>
		<category><![CDATA[microarray data]]></category>
		<category><![CDATA[expression matrix]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1580</guid>
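		<description><![CDATA[Affymetrix probes are typically 25-base oligonucleotides; probes always &#8230; <a href="http://www.bio-info-trainee.com/1580.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Affymetrix probes are typically 25-base oligonucleotides. Probes always come in perfect match and mismatch pairs, whose signal values are called PM and MM; each perfect match/mismatch pair shares a common affyID.<br />
CEL file: signal values and position information.<br />
CDF file: the positions of the probe pairs on the chip.</p>
<p>affy is one of the Bioconductor packages for R. For our purposes it does one job: it reads Affymetrix gene expression array data in CEL format and turns it into an expression matrix.</p>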
<p><span id="more-1580"></span></p>
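<p>We usually go to the GEO database to find the download address of the CEL files. Take <a href="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1428">GSE1428</a>, which profiled samples from 10 young (19-25 years old) and 12 older (70-80 years old) males in order to find differential genes. From GEO, the download address of the CEL files is:</p>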
<p>ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1428/suppl/<span style="color: #ff0000;">GSE1428_RAW.tar</span></p>
<p>We download the raw data only to demonstrate affy; GEO also provides a processed expression matrix for download.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/04/1.png"><img class="alignnone size-full wp-image-1581" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/04/1.png" alt="1" width="290" height="201" /></a></p>
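<p>After downloading, extract the archive into a directory of your choice.</p>
<p>Once the files are on disk, they can be read with the following code.</p>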
<blockquote><p>library(affy)<br />
dir_cels='D:\\test_analysis\\TNBC\\cel_files'<br />
affy_data = ReadAffy(celfile.path=dir_cels)<br />
eset.mas5 = mas5(affy_data)</p></blockquote>
<div>Reading the files takes quite a while. <span style="color: #ff0000;">You can also normalize the expression data with the rma function instead of mas5.</span></div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/04/2.png"><img class="alignnone size-full wp-image-1582" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/04/2.png" alt="2" width="449" height="251" /></a></div>
<div>The expression matrix after reading looks like this:</div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/04/3.png"><img class="alignnone size-full wp-image-1583" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/04/3.png" alt="3" width="727" height="318" /></a></div>
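<div>In theory, the processed data should match the expression values downloaded directly from the GEO site; the download links all follow a regular pattern.</div>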
<p>ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1428/matrix/<span style="color: #ff0000;">GSE1428_series_matrix.txt.gz</span></p>
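<p>Of course, the affy package supports only a limited set of chip platforms.</p>
<p>These are mainly the hgu 95 and 133 series.</p>
<p>Strictly speaking, the expression matrix obtained from this kind of chip should also be filtered.</p>
<p>For example, with code like the following:</p>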
<p>setwd('../')<br />
library(affy)<br />
dir_cels='GSE34824_RAW'<br />
data &lt;- ReadAffy(celfile.path=dir_cels)<br />
eset &lt;- rma(data)<br />
calls &lt;- mas5calls(data) # get PMA calls<br />
calls &lt;- exprs(calls)<br />
absent &lt;- rowSums(calls == 'A') # number of samples in which each probeset is called absent<br />
absent &lt;- which (absent == ncol(calls)) # probesets called absent in every sample<br />
rmaFiltered &lt;- eset[-absent,] # drop those probesets (assumes at least one was found)</p>
<p>Of the 54675 features, 42482 remain after filtering.</p>
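<p>The filtering logic can be tried on a toy matrix of detection calls. The sketch below uses a logical mask instead of negative indices, because x[-which(...), ] returns zero rows when nothing matches; the matrices calls and expr are made up for illustration.</p>

```r
# toy detection calls: P = present, M = marginal, A = absent
calls = matrix(c("P", "A", "A", "M",
                 "P", "A", "P", "A",
                 "A", "A", "P", "A"),
               nrow = 4,
               dimnames = list(c("g1", "g2", "g3", "g4"), c("s1", "s2", "s3")))
# toy expression values with the same row and column names
expr = matrix(1:12, nrow = 4, dimnames = dimnames(calls))

# keep a probeset unless it is called absent in every sample
keep = rowSums(calls == "A") != ncol(calls)
expr_filtered = expr[keep, , drop = FALSE]
rownames(expr_filtered)   # "g1" "g3" "g4"
```

<p>Only g2 is absent in all three samples, so it is the only row removed; g4 stays because one of its calls is marginal rather than absent.</p>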
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1580.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>R packages in depth, part 4: four ways to install R packages</title>
		<link>http://www.bio-info-trainee.com/1565.html</link>
		<comments>http://www.bio-info-trainee.com/1565.html#comments</comments>
		<pubDate>Tue, 12 Apr 2016 15:45:07 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[perl]]></category>
		<category><![CDATA[bioconductor]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[packages]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1565</guid>
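		<description><![CDATA[Please read first: R packages in depth, part 1: how to see which R packages you have installed and which you can install. The first way &#8230; <a href="http://www.bio-info-trainee.com/1565.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Please read first: <a title="详细阅读 R包精讲第一篇：如何查看你已经安装了和可以安装哪些R包？" href="http://www.bio-info-trainee.com/1537.html" rel="bookmark">R packages in depth, part 1: how to see which R packages you have installed and which you can install</a></p>
<p>The first way is, of course, to install with R's built-in function. It is the simplest, and you do not have to think about the dependencies between packages.</p>
<p>For ordinary R packages, install.packages() is all you need. If the download fails, the usual cause is a misspelled package name or an R version that is too old. If the package downloads but will not install, usually a dependency is broken or your computer lacks some system libraries. If the package cannot be found or the download is slow, switch mirrors with repos=.</p>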
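<p>A small helper makes this first method repeatable in scripts. This is a sketch: install_if_missing is a hypothetical name and the mirror URL is only an example; any CRAN mirror can be passed through repos.</p>

```r
install_if_missing = function(pkgs, repos = "https://cloud.r-project.org") {
  # TRUE for every package that can already be loaded
  installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  todo = pkgs[!installed]
  # only contact the mirror when something is actually missing
  if (length(todo)) install.packages(todo, repos = repos)
  invisible(todo)
}

# "stats" and "utils" ship with R, so this call installs nothing
install_if_missing(c("stats", "utils"))
```

<p>For a real package, call it as install_if_missing("ape"), optionally with a closer mirror given in repos.</p>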
<table class="GEM3DMTCOFB ace_text-layer ace_line GEM3DMTCKT" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td align="left">
<blockquote>
<pre id="rstudio_console_output" class="GEM3DMTCFGB" tabindex="0"><span class="GEM3DMTCLGB ace_keyword">&gt; </span><span class="GEM3DMTCLFB ace_keyword">install.packages("ape")  ## just type the package name
</span><span class="GEM3DMTCPFB  ace_constant ace_language">Installing package into ‘C:/Users/jmzeng/Documents/R/win-library/3.1’
(as ‘lib’ is unspecified)  ## lib is usually left unspecified unless you know exactly where your library is
</span><span class="GEM3DMTCPFB  ace_constant ace_language">trying URL '<span style="color: #ff0000;">http://mirror.bjtu.edu.cn/cran</span>/bin/windows/contrib/3.1/ape_3.4.zip'
</span><span class="GEM3DMTCPFB  ace_constant ace_language">Content type 'application/zip' length 1418322 bytes (1.4 Mb)
</span><span class="GEM3DMTCPFB  ace_constant ace_language">opened URL   ## 根据你选择的镜像，程序会自动拼接好下载链接url
</span><span class="GEM3DMTCPFB  ace_constant ace_language">downloaded 1.4 Mb

</span>package ‘ape’ successfully unpacked and MD5 sums checked  ##表明你已经安装好包啦

The downloaded binary packages are in  ##程序自动下载的原始文件一般放在临时目录，会自动删除
	C:\Users\jmzeng\AppData\Local\Temp\Rtmpy0OivY\downloaded_packages
</pre>
</blockquote>
</td>
</tr>
</tbody>
</table>
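<p>The check-then-install routine described above can be wrapped in a small helper. This is only a minimal sketch: the name install_if_missing and its installer argument are made up for illustration and are not part of R; only requireNamespace() and install.packages() are real functions.</p>

```r
# Hypothetical helper (not part of R or of the original post):
# install only the packages that cannot be loaded yet.
install_if_missing <- function(pkgs,
                               repos = getOption("repos"),
                               installer = utils::install.packages) {
  # requireNamespace() returns FALSE when a package is not installed
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!installed]
  if (length(to_install) > 0) {
    # repos= is where you would plug in a faster mirror
    installer(to_install, repos = repos)
  }
  invisible(to_install)
}

# Usage (needs network access, so it is left commented out):
# install_if_missing(c("ape", "ggplot2"), repos = "https://cloud.r-project.org")
```

<p>The installer argument exists only so the helper can be tried without touching the network; in normal use you would leave it at its default.</p>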
<p>For Bioconductor packages, we usually run:</p>
<blockquote><p>source("http://bioconductor.org/biocLite.R") ## installs BiocInstaller</p>
<p>#options(BioC_mirror="<a href="http://mirrors.ustc.edu.cn/bioc/">http://mirrors.ustc.edu.cn/bioc/</a>") ## uncomment to switch mirrors<br />
biocLite("ggbio")</p>
<p>Or call BiocInstaller::biocLite('ggbio') directly, provided BiocInstaller is already installed.</p>
<p>Sometimes you also need to uninstall it with remove.packages("BiocInstaller") and then install a fresh copy.</p></blockquote>
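<p>A note for readers on newer setups: from Bioconductor 3.8 onward (R 3.5.0 and later), biocLite was replaced by the BiocManager package from CRAN. The sketch below assumes such a setup; bioc_install and bioc_installer_name are hypothetical helper names, while BiocManager::install() and the BioC_mirror option are real.</p>

```r
# Hypothetical helper (not from the original post): install Bioconductor
# packages through BiocManager, optionally via a mirror.
bioc_install <- function(pkgs, mirror = NULL) {
  if (!is.null(mirror)) options(BioC_mirror = mirror)
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")   # BiocManager itself lives on CRAN
  }
  BiocManager::install(pkgs)
}

# Which installer applies to a given R version?
bioc_installer_name <- function(r_version = getRversion()) {
  if (package_version(r_version) >= "3.5.0") "BiocManager" else "BiocInstaller"
}

# Usage (needs network access, so it is left commented out):
# bioc_install("ggbio", mirror = "https://mirrors.ustc.edu.cn/bioc/")
```

<p>On an old R such as the 3.1 or 3.2 used in this post, the biocLite route above is still the one that matches the installed Bioconductor release.</p>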
<p>The second way is to use the package's download URL directly, which you find on the package's home page.</p>
<blockquote><p>## each assignment overwrites the previous one, so keep exactly one packageurl active<br />
packageurl &lt;- "<span style="text-decoration: underline; color: #ff00ff;">http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_0.9.1.tar.gz</span>"<br />
#packageurl &lt;- "http://cran.r-project.org/src/contrib/Archive/gridExtra/gridExtra_0.9.1.tar.gz"<br />
#packageurl &lt;- "http://www.bioconductor.org/packages/2.11/bioc/src/contrib/ggbio_1.6.6.tar.gz"<br />
#packageurl &lt;- "http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_1.0.1.tar.gz"<br />
install.packages(packageurl, repos=NULL, type="source")</p></blockquote>
<p>Installing this way requires no mirror selection, and it also bypasses the installer's version restrictions!</p>
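<p>The CRAN archive URLs above all follow the same pattern, so they can be built instead of typed by hand. A minimal sketch, assuming that pattern holds for the package you want; cran_archive_url is a hypothetical helper name, not an R function.</p>

```r
# Hypothetical helper: build the URL of an archived CRAN source tarball.
# The path layout is copied from the URLs shown in this post.
cran_archive_url <- function(pkg, version,
                             base = "https://cran.r-project.org") {
  sprintf("%s/src/contrib/Archive/%s/%s_%s.tar.gz", base, pkg, pkg, version)
}

packageurl <- cran_archive_url("ggplot2", "0.9.1")
print(packageurl)

# Usage (needs network access and build tools, so it is left commented out):
# install.packages(packageurl, repos = NULL, type = "source")
```

<p>Only older versions live under Archive/; the current release of a package sits directly in src/contrib/, so this helper does not cover it.</p>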
<p>The third way is to download the package to your machine first and then install it:</p>
<blockquote>
<pre><b>download.file</b>("<a href="http://bioconductor.org/packages/release/bioc/src/contrib/BiocInstaller_1.20.1.tar.gz">http://bioconductor.org/packages/release/bioc/src/contrib/BiocInstaller_1.20.1.tar.gz</a>","BiocInstaller_1.20.1.tar.gz")
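## You can also download the package with a web browser
<b>install.packages</b>("BiocInstaller_1.20.1.tar.gz", repos = NULL)
## If you use an IDE such as RStudio, you can do this with the mouse
## Or use choose.files() (Windows only; file.choose() works everywhere) to pick interactively where you saved the BiocInstaller_1.20.1.tar.gz source tarball.</pre>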
</blockquote>
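<p>Most installs of this kind fail, because with repos = NULL R does not fetch dependencies, and R packages depend heavily on one another!</p>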
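<p>Before installing a local tarball, it can help to check which of its dependencies are missing. The following is only a sketch under the assumption that you have already extracted the package's DESCRIPTION file; parse_deps and missing_deps are hypothetical helper names, while read.dcf() and installed.packages() are real R functions.</p>

```r
# Hypothetical helpers (not part of R): list dependencies named in a
# DESCRIPTION file that are not installed yet.

# Turn a field such as "R (>= 3.1), methods" into bare package names.
parse_deps <- function(field) {
  if (is.null(field) || is.na(field) || !nzchar(field)) return(character(0))
  parts <- strsplit(field, ",", fixed = TRUE)[[1]]
  pkg_names <- trimws(sub("\\(.*\\)", "", parts))  # drop version constraints
  setdiff(pkg_names[nzchar(pkg_names)], "R")       # "R" itself is not a package
}

missing_deps <- function(desc_file,
                         installed = rownames(installed.packages())) {
  desc <- read.dcf(desc_file, fields = c("Depends", "Imports", "LinkingTo"))
  deps <- unique(unlist(lapply(desc[1, ], parse_deps), use.names = FALSE))
  setdiff(deps, installed)
}

# Usage: missing_deps("path/to/DESCRIPTION") returns the packages to
# install first, for example with install.packages().
```

<p>One way to get at the DESCRIPTION file is untar() with its files= argument, pointing at the DESCRIPTION entry inside the tarball.</p>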
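<p>The fourth way: installing from the command line</p>
<blockquote>
<pre>## On Linux, download and install a package from the command line:
sudo su - -c \
<span class="pl-s"><span class="pl-pds">"</span>R -e <span class="pl-cce">\"</span>install.packages('shiny', repos='https://cran.rstudio.com/')<span class="pl-cce">\"</span><span class="pl-pds">"</span></span>
## On Linux, install a local package file from a shell terminal:
sudo R CMD INSTALL package.tar.gz
## On Windows or Mac the command line is generally not recommended; the graphical tools are pleasant enough, so why make life harder?</pre>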
</blockquote>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1565.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>R Packages in Depth, Part 3: How to Switch Mirrors?</title>
		<link>http://www.bio-info-trainee.com/1561.html</link>
		<comments>http://www.bio-info-trainee.com/1561.html#comments</comments>
		<pubDate>Tue, 12 Apr 2016 13:11:53 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[bioconductor]]></category>
		<category><![CDATA[mirrors]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1561</guid>
		<description><![CDATA[This trick matters. In general, when R's built-in install.packages function installs &#8230; <a href="http://www.bio-info-trainee.com/1561.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>This trick matters: in general, when you install a package with R's built-in install.packages function, the default mirror is used!</p>
<p>If you use the RStudio IDE, your default mirror is: <a href="https://cran.rstudio.com/bin/windows/contrib/3.2/">https://cran.rstudio.com/ </a></p>
<p>If you use plain R, it is "http://cran.us.r-project.org", although R usually prompts you to choose a mirror at install time.</p>
<p><span id="more-1561"></span></p>
<p>We usually want to change it to whichever mirror is most convenient for us:</p>
<blockquote><p>  install.packages(pkgs, lib,<strong> repos = getOption("repos"),</strong><br />
<strong> contriburl = contrib.url(repos, type),</strong><br />
method, available = NULL, destdir = NULL,<br />
dependencies = NA, type = getOption("pkgType"),<br />
configure.args = getOption("configure.args"),<br />
configure.vars = getOption("configure.vars"),<br />
clean = FALSE, Ncpus = getOption("Ncpus", 1L),<br />
verbose = getOption("verbose"),<br />
libs_only = FALSE, INSTALL_opts, quiet = FALSE,<br />
keep_outputs = FALSE, ...)</p></blockquote>
<p>If you are in China, install.packages("ABC",repos="<span class="GEM3DMTCPFB  ace_constant ace_language">http://mirror.bjtu.edu.cn/cran/</span>") switches to the Beijing Jiaotong University mirror, and downloads fly!</p>
<p>To make the setting permanent, change it with options(), for example in your .Rprofile.</p>
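<p>As a rough sketch of the options() route: the helper below changes only the CRAN entry of the repos option and leaves any other repositories alone. The name set_cran_mirror is made up for illustration, and https://cloud.r-project.org is just one example mirror.</p>

```r
# Hypothetical helper: point the CRAN entry of the "repos" option at a mirror.
set_cran_mirror <- function(url) {
  repos <- getOption("repos")
  repos["CRAN"] <- url      # keep other entries (e.g. Bioconductor) intact
  options(repos = repos)
  invisible(repos)
}

set_cran_mirror("https://cloud.r-project.org")
print(getOption("repos")["CRAN"])
```

<p>options() only lasts for the current session. Putting the same lines into the .Rprofile file in your home directory makes R run them at every startup.</p>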
<p>If you use the RStudio IDE, go to Global Options and pick a mirror once and for all!</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/04/tmp.png"><img class="alignnone size-full wp-image-1562" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/04/tmp.png" alt="tmp" width="488" height="406" /></a></p>
<p>You can check whether every mirror offers the same set of packages:</p>
<p>dim(available.packages(contriburl = "<span style="color: #ff0000;"><strong>http://cran.rstudio.com/</strong></span>bin/windows/contrib/<span style="color: #ff0000;">3.2</span>/"))</p>
<p>Change the mirror's base URL and the R version in the path to see which packages each mirror provides!</p>
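<p>To go beyond comparing row counts, you can diff the package names themselves. A minimal sketch: mirror_packages and compare_package_sets are hypothetical helper names, while available.packages() and contrib.url() are real functions from utils.</p>

```r
# Hypothetical helper: names of all packages a mirror offers (needs network).
mirror_packages <- function(repo, type = "source") {
  rownames(available.packages(contriburl = contrib.url(repo, type)))
}

# Hypothetical helper: which packages are unique to each side, which shared?
compare_package_sets <- function(a, b) {
  list(only_in_a = setdiff(a, b),
       only_in_b = setdiff(b, a),
       shared    = intersect(a, b))
}

# Offline illustration with made-up package lists:
result <- compare_package_sets(c("ape", "ggplot2", "shiny"),
                               c("ape", "shiny", "gridExtra"))
print(result)

# Usage against real mirrors (left commented out):
# compare_package_sets(mirror_packages("https://cran.rstudio.com/"),
#                      mirror_packages("https://cloud.r-project.org"))
```

<p>Mirrors sync on their own schedule, so small differences right after a new release are to be expected.</p>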
<p>Of course, Bioconductor has mirrors too; most people simply do not know about them and never use them!</p>
<blockquote>
<div>source("<a href="http://bioconductor.org/biocLite.R">http://bioconductor.org/biocLite.R</a>")</div>
<div>options(BioC_mirror="<a href="http://mirrors.ustc.edu.cn/bioc/">http://mirrors.ustc.edu.cn/bioc/</a>")</div>
<div>biocLite("RGalaxy")## this downloads the package from the USTC mirror</div>
</blockquote>
<div>## Bioconductor has many other mirrors: <a href="https://www.bioconductor.org/about/mirrors/">https://www.bioconductor.org/about/mirrors/</a></div>
<div>##<a href="https://stat.ethz.ch/R-manual/R-devel/library/utils/html/chooseBioCmirror.html">https://stat.ethz.ch/R-manual/R-devel/library/utils/html/chooseBioCmirror.html</a></div>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1561.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
