<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 &#187; Pipelines</title>
	<atom:link href="http://www.bio-info-trainee.com/tag/%e6%b5%81%e7%a8%8b/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
	<description>Join the discussion on the forum at biotrainee.com, or follow the WeChat official account of the same name, biotrainee.</description>
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>A worked example of expression microarray data processing</title>
		<link>http://www.bio-info-trainee.com/1024.html</link>
		<comments>http://www.bio-info-trainee.com/1024.html#comments</comments>
		<pubDate>Fri, 25 Sep 2015 14:53:39 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[R sharing]]></category>
		<category><![CDATA[Hands-on]]></category>
		<category><![CDATA[Pipelines]]></category>
		<category><![CDATA[Microarray data]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1024</guid>
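		<description><![CDATA[The first part of this example covers: how to download GEO data with an R package (single platform only; other platforms need changes to &#8230; <a href="http://www.bio-info-trainee.com/1024.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div>The first part of this example covers:</div>
<div>how to download GEO data with an R package (single platform only; other platforms need changes to the code below),</div>
<div>how to normalize GEO microarray data and obtain an expression matrix,</div>
<div>how to run a differential expression analysis with the limma package,</div>
<div>how to annotate the resulting differential genes with GO and KEGG.</div>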
<p><span id="more-1024"></span></p>
<div></div>
<div>First, download two GEO datasets:</div>
<div>Platform: Affymetrix U133 gene chips</div>
<div>67 diseased triple negative breast cancer samples (GSE31519) and 42 control samples (GSE20437)</div>
<div>Both are expression data from the same type of chip. They fall into two groups, which is exactly what a differential expression analysis needs.</div>
<div>The source publication for the data is:</div>
<div>Article title: A clinically relevant gene signature in triple negative and basal-like breast cancer</div>
<div>Conclusion: We describe a ratio of high B-cell presence and low IL-8 activity as a powerful new prognostic marker for TNBC.</div>
<div>URL: <a href="http://www.breast-cancer-research.com/content/13/5/R97">http://www.breast-cancer-research.com/content/13/5/R97</a></div>
<div><span style="font-size: medium;">GEO data URL: </span><a href="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE31519">http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE31519</a></div>
<div><span style="font-size: medium;">Platform: GPL96 67 Samples</span></div>
<div><span style="font-size: medium;">Download data: GEO (CEL, TXT)</span></div>
<div><span style="font-size: medium;">SeriesAccession: GSE31519 ID: 200031519</span></div>
<div></div>
<div>Title: Histologically normal epithelium from breast cancer patients and cancer-free prophylactic mastectomy patients</div>
<div><span style="font-size: medium;">GEO data URL: </span><a href="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20437">http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20437</a></div>
<div>Platform: GPL96 Series: GSE20437 42 Samples</div>
<div>Download data: GEO (CEL)</div>
<div>DataSetAccession: GDS3716 ID: 3716</div>
<div></div>
<div></div>
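<div>I first download the data with the R package GEOquery. (You could just as well download the data directly from the GEO website and unpack it yourself.)</div>
<div>suppressMessages(library(GEOquery))</div>
<div>setwd("D:\\test_analysis\\TNBC")</div>
<div># series matrix plus supplementary (raw CEL) files for the cancer samples</div>
<div>gse31519=getGEO("GSE31519",GSEMatrix = TRUE,destdir = "./")</div>
<div>getGEOSuppFiles("GSE31519",baseDir = "./")</div>
<div># the same for the control samples, stored under its own variable name</div>
<div>gse20437=getGEO("GSE20437",GSEMatrix = TRUE,destdir = "./")</div>
<div>getGEOSuppFiles("GSE20437",baseDir = "./")</div>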
<div></div>
<div><span style="color: #ff0000;"><strong>After this, all the downloaded data are stored under D:\test_analysis\TNBC</strong></span></div>
<div>Next, we use the affy and limma packages for the differential expression analysis:</div>
<div></div>
<div><span style="font-size: medium;">library(affy)</span></div>
<div><span style="font-size: medium;">library(limma)</span></div>
<div><span style="font-size: medium;">affy.data=ReadAffy(celfile.path="<span style="color: #ff0000;">./cel_files</span>")</span></div>
<div>Make sure you understand how the ReadAffy function works first!</div>
<div><span style="color: #ff0000;">Is there a cel_files folder under the current working directory?</span></div>
<div><span style="color: #ff0000;">Does the cel_files folder actually contain files?</span></div>
<div>The supplementary archives downloaded above have to be unpacked first, so that the CEL files of both series end up in ./cel_files.</div>
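<div>The two checks above can be sketched in base R before calling ReadAffy. This is only a sketch: the helper name check_cel_dir and the placeholder file names are hypothetical, and it assumes the CEL files sit directly in the folder, optionally gzip-compressed.</div>
<pre>
```r
# Hypothetical helper: verify that a folder exists and contains CEL files
# before passing it to ReadAffy(celfile.path = ...).
check_cel_dir = function(path) {
  if (!dir.exists(path)) {
    stop("folder not found: ", path)
  }
  # Match .CEL or .CEL.gz, ignoring case
  cel = list.files(path, pattern = "\\.cel(\\.gz)?$", ignore.case = TRUE)
  if (length(cel) == 0) {
    stop("no CEL files in: ", path)
  }
  cel
}

# Demo on a temporary folder holding two empty placeholder files
demo_dir = file.path(tempdir(), "cel_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("GSM000001.CEL", "GSM000002.cel.gz"))))
found = check_cel_dir(demo_dir)
print(found)
```
</pre>
<div>In the real analysis the call would be check_cel_dir("./cel_files"); for this post it should list 109 files.</div>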
<div><span style="font-size: medium;">eset.rma=rma(affy.data)</span></div>
<div><span style="font-size: medium;">exprSet=exprs(eset.rma)</span></div>
<div><span style="font-size: medium;">write.table(exprSet,"expr_rma_matrix.txt",quote=F,sep="\t")</span></div>
<div><span style="font-size: medium;">group=factor(c(rep("control",42),rep("case",67)))</span></div>
<div><span style="font-size: medium;">design = model.matrix(~0+group)</span></div>
<div><span style="font-size: medium;">colnames(design)=c("case","control")</span></div>
<div><span style="font-size: medium;">rownames(design)=sampleNames(affy.data)</span></div>
<div><span style="font-size: medium;">fit=lmFit(exprSet,design)</span></div>
<div><span style="font-size: medium;">cont.matrix = makeContrasts(contrasts="case-control",levels=design)</span></div>
<div><span style="font-size: medium;">fit2=contrasts.fit(fit,cont.matrix)</span></div>
<div><span style="font-size: medium;">fit2=eBayes(fit2)</span></div>
<div><span style="font-size: medium;">diff_dat=topTable(fit2,coef=1,n=Inf)</span></div>
<div><span style="font-size: medium;">write.table(diff_dat,"diff_dat.txt",quote=F)</span></div>
<div><span style="font-size: medium;"> </span></div>
<div><span style="font-size: medium;">The resulting diff_dat is the output of our differential expression analysis.</span></div>
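<div>The group labels above rely on two things: model.matrix orders the design columns by factor level (alphabetically, so case comes before control), and the first 42 arrays are the controls. Here is a minimal base R sketch of the first point, using only the sample counts from this post. The second point depends on the order in which the CEL files are read, which is an assumption worth checking against sampleNames(affy.data).</div>
<pre>
```r
# Same grouping as above: 42 controls followed by 67 cases
group = factor(c(rep("control", 42), rep("case", 67)))
design = model.matrix(~0 + group)

# Factor levels are sorted alphabetically, so the columns come out as
# groupcase, groupcontrol; renaming them to case, control keeps the meaning.
print(levels(group))
print(colnames(design))

colnames(design) = c("case", "control")
# Each column should sum to the size of its group
print(colSums(design))
```
</pre>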
<div>we choose the log fold cut off change to be “2” to get a manageable set of genes.<br />
The original text says: we were able to get a list of 2567 genes after removing the duplicates and the not available genes</div>
<div>Selecting differential genes by this single criterion alone (the log fold cut off change to be “2”), I picked out only 782 probes.</div>
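<div>The cut-off can be applied directly to the logFC column that topTable returns. Here is a small sketch on a made-up table; the probe names and values are invented for illustration and are not taken from this dataset.</div>
<pre>
```r
# Toy stand-in for the topTable output; all values are invented
toy = data.frame(
  logFC = c(2.6, -3.1, 0.4, 1.9, -2.0),
  adj.P.Val = c(0.001, 0.0004, 0.6, 0.02, 0.01),
  row.names = c("probe_a", "probe_b", "probe_c", "probe_d", "probe_e")
)

# Keep probes whose absolute log fold change exceeds the cut-off
# (a value of exactly 2 or -2 is not kept)
cutoff = 2
diff_probe = rownames(toy)[abs(toy$logFC) > cutoff]
print(diff_probe)
```
</pre>
<div>The post filters on fold change alone. A threshold on adj.P.Val could be combined with it in the same subsetting expression.</div>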
<div>Next, annotate these probes to obtain gene names; here I use the biomaRt package for the annotation.</div>
<div>Our platform is the Affymetrix U133 gene chip: it has 22283 probes, but they map to only 13908 genes.</div>
<div>So the code is as follows:</div>
<div>library(biomaRt)</div>
<div>ensembl=useMart("ensembl",dataset="hsapiens_gene_ensembl")</div>
<div># map every probe ID in the result table to an HGNC gene symbol</div>
<div>gene_probe=getBM(attributes=c("hgnc_symbol","affy_hg_u133a"),filters="affy_hg_u133a",values=rownames(diff_dat),mart=ensembl)</div>
<div># probes with an absolute log fold change above 2</div>
<div>diff_probe=rownames(diff_dat[abs(diff_dat$logFC)&gt;2,])</div>
<div>diff_gene=gene_probe[match(diff_probe,gene_probe[,2]),1]</div>
<div>diff_gene=na.omit(diff_gene)</div>
<div>diff_gene=unique(diff_gene)</div>
<div>length(diff_gene)</div>
<div>This yields 604 differentially expressed genes.</div>
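<div>How match, na.omit and unique collapse probes into genes can be seen on a toy mapping table (all IDs invented). The sketch rests on two assumptions about this approach: match keeps only the first gene listed for a probe, and a probe may come back from biomaRt with an empty symbol rather than NA, so the sketch drops empty strings explicitly, a step the code above does not take.</div>
<pre>
```r
# Toy probe-to-symbol table shaped like the getBM result (invented IDs)
gene_probe = data.frame(
  hgnc_symbol = c("GENE1", "GENE1", "GENE2", ""),
  affy_hg_u133a = c("p1_at", "p2_at", "p3_at", "p4_at"),
  stringsAsFactors = FALSE
)
diff_probe = c("p1_at", "p2_at", "p4_at", "p9_at")

# match() gives the row of the first hit, or NA for an unmapped probe
diff_gene = gene_probe[match(diff_probe, gene_probe[, 2]), 1]
diff_gene = as.character(na.omit(diff_gene))  # drops the unmapped p9_at
diff_gene = diff_gene[diff_gene != ""]        # drops the probe without a symbol
diff_gene = unique(diff_gene)                 # p1_at and p2_at share GENE1
print(diff_gene)
```
</pre>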
<div>Then I run GO and KEGG enrichment analysis.</div>
<div>gene_entrez=getBM(attributes=c("hgnc_symbol","entrezgene"),filters="hgnc_symbol",values=diff_gene,mart=ensembl)</div>
<div>require(DOSE)</div>
<div>require(clusterProfiler)</div>
<div>gene_entrez=na.omit(gene_entrez)</div>
<div>gene=as.character(gene_entrez[,2])</div>
<div>ego &lt;- enrichGO(gene=gene,organism="human",ont="CC",pvalueCutoff=0.01,readable=TRUE)</div>
<div>ekk &lt;- enrichKEGG(gene=gene,organism="human",pvalueCutoff=0.01,readable=TRUE)</div>
<div>write.csv(summary(ekk),"KEGG-enrich.csv",row.names =F)</div>
<div>write.csv(summary(ego),"GO-enrich.csv",row.names =F)</div>
<div>Note: these calls match the package versions available when this post was written. In more recent releases the biomaRt attribute is named entrezgene_id, and enrichGO takes an OrgDb argument (for example org.Hs.eg.db) instead of organism.</div>
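<div>For intuition about what enrichGO and enrichKEGG report: over-representation analysis of this kind rests on a hypergeometric test for each term. Here is a base R sketch with invented counts. Only the 604 comes from this post; the background size, term size and overlap are hypothetical.</div>
<pre>
```r
# Invented counts for a single GO or KEGG term
N = 13000  # genes in the background (hypothetical)
M = 200    # background genes annotated to the term (hypothetical)
n = 604    # genes in our differential list
k = 25     # genes in our list annotated to the term (hypothetical)

# Probability of seeing k or more annotated genes when drawing
# n genes from the background without replacement
p_hyper = phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same number from a one-sided Fisher exact test on the 2x2 table
tab = matrix(c(k, n - k, M - k, N - M - n + k), nrow = 2)
p_fisher = fisher.test(tab, alternative = "greater")$p.value
print(c(p_hyper, p_fisher))
```
</pre>
<div>The packages additionally adjust these p-values for multiple testing across all terms, which this sketch leaves out.</div>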
<p>I did not upload the figures; you can reproduce the whole workflow yourself with the same code.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1024.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>RNA-seq pipelines need to evolve!</title>
		<link>http://www.bio-info-trainee.com/1022.html</link>
		<comments>http://www.bio-info-trainee.com/1022.html#comments</comments>
		<pubDate>Fri, 25 Sep 2015 14:46:21 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Transcriptome software]]></category>
		<category><![CDATA[RNA]]></category>
		<category><![CDATA[Pipelines]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1022</guid>
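		<description><![CDATA[TopHat was first published six years ago, and Cufflinks five years ago. St &#8230; <a href="http://www.bio-info-trainee.com/1022.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>TopHat was first published six years ago.</p>
<p>Cufflinks is five years old as well.</p>
<p>STAR aligns 50 times faster than TopHat, and HISAT is another 1.2 times faster than STAR.</p>
<p>StringTie assembles 25 times faster than Cufflinks while using less than half the memory.</p>
<p>Ballgown is more specific and more accurate than Cuffdiff for differential analysis, and takes less than one-thousandth of Cuffdiff's running time.</p>
<p>Bowtie2+eXpress gives better quantification than TopHat2+Cufflinks and Bowtie2+RSEM.</p>
<p>Sailfish skips the alignment step altogether and quantifies directly by k-mer counting; its specificity and accuracy are acceptable, and it is 25 times faster.</p>
<p>kallisto likewise needs no alignment, and is another 5 times faster than Sailfish!</p>
<p>Reference: <a href="https://speakerdeck.com/stephenturner/rna-seq-qc-and-data-analysis-using-the-tuxedo-suite">https://speakerdeck.com/stephenturner/rna-seq-qc-and-data-analysis-using-the-tuxedo-suite</a></p>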
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1022.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Finding and learning from another researcher's RNA-seq data processing pipeline (raw data, scripts, and intermediate files)</title>
		<link>http://www.bio-info-trainee.com/32.html</link>
		<comments>http://www.bio-info-trainee.com/32.html#comments</comments>
		<pubDate>Sat, 07 Mar 2015 11:52:20 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Basic databases]]></category>
		<category><![CDATA[RNA-seq]]></category>
		<category><![CDATA[trinity]]></category>
		<category><![CDATA[Pipelines]]></category>
		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=32</guid>
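		<description><![CDATA[Finding another researcher's RNA-seq data processing pipeline (raw data, scripts, and intermediate files). 1: Raw data &#8230; <a href="http://www.bio-info-trainee.com/32.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p style="text-align: center;"><b>Finding another researcher's RNA-seq data processing pipeline (raw data, scripts, and intermediate files)</b></p>
<p><b>1: Raw data</b></p>
<p><b>I stumbled on this through a Google search.</b> It is RNA-seq data for a single species. The dataset is not large, but it comes with the entire analysis workflow, which is very convenient: the raw reads were assembled and annotated.</p>
<p><a href="http://moana.dnsalias.org/~sgeib/Anth_RNAseq/Run2.1/RawData/">http://moana.dnsalias.org/~sgeib/Anth_RNAseq/Run2.1/RawData/</a></p>
<p>Opening the URL shows the download links for the raw data.</p>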
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/QQ截图20150309220349.png"><img class="alignnone size-full wp-image-61" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/QQ截图20150309220349.png" alt="QQ截图20150309220349" width="352" height="315" /></a></p>
<p>&nbsp;</p>
<p><span id="more-32"></span></p>
<p><b>2: Intermediate files</b></p>
<p><b>The step-by-step manual for the whole workflow is clearly visible.</b></p>
<p><img class="alignnone size-full wp-image-34" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/搜索其他学者的RNA数据处理流程328.png" alt="搜索其他学者的RNA数据处理流程328" width="554" height="364" /></p>
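<p>If you have time, you can rerun the workflow to verify it; it does not need much disk space, a little over 40 GB is enough.</p>
<p>The reads were filtered with two Perl programs from the SolexaQA suite.</p>
<p>FastQC was run both before and after filtering to produce the quality-control plots.</p>
<p>The amount of data left after filtering is shown in the figure.</p>
<p>Assembling these reads with Trinity yields the transcripts, about 312M of data.</p>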
<p><img class="alignnone size-full wp-image-36" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/搜索其他学者的RNA数据处理流程870.png" alt="搜索其他学者的RNA数据处理流程870" width="411" height="79" /></p>
<p>The transcript statistics are as follows:</p>
<p><img class="alignnone size-full wp-image-37" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/搜索其他学者的RNA数据处理流程1079.png" alt="搜索其他学者的RNA数据处理流程1079" width="339" height="451" /></p>
<p>&nbsp;</p>
<p>3: Processing workflow</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/搜索其他学者的RNA数据处理流程1092.png"><img class="alignnone size-full wp-image-39" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/搜索其他学者的RNA数据处理流程1092.png" alt="搜索其他学者的RNA数据处理流程1092" width="743" height="103" /></a></p>
<p>4: All the scripts; interested readers can download them and study them at their own pace.</p>
<p><img class="alignnone size-full wp-image-38" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/搜索其他学者的RNA数据处理流程1082.png" alt="搜索其他学者的RNA数据处理流程1082" width="532" height="502" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/32.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
