<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 &#187; miRNA-seq</title>
	<atom:link href="http://www.bio-info-trainee.com/tag/mirna-seq/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
	<description>Join the discussion on the biotrainee.com forum, or follow the WeChat official account of the same name, biotrainee</description>
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>Self-Study miRNA-seq Analysis, Part 8: Downstream Analysis of miRNA-mRNA Expression Correlation</title>
		<link>http://www.bio-info-trainee.com/1719.html</link>
		<comments>http://www.bio-info-trainee.com/1719.html#comments</comments>
		<pubDate>Sun, 03 Jul 2016 03:31:07 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[tutorial]]></category>
		<category><![CDATA[heatmap]]></category>
		<category><![CDATA[miRNA-seq]]></category>
		<category><![CDATA[normalization]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1719</guid>
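		<description><![CDATA[In the previous parts we quantified miRNA and mRNA expression in cells before and after ET1 stimulation &#8230; <a href="http://www.bio-info-trainee.com/1719.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>In the previous parts we quantified miRNA and mRNA expression in cells before and after ET1 stimulation and used well-established statistical methods to obtain the differentially expressed miRNAs and mRNAs. At this point we need a different reference paper, because the one used so far does not go into enough detail. I chose a TCGA data-mining paper from Zhejiang University, <a href="http://www.nature.com/articles/srep12995%20">Identifying miRNA/mRNA negative regulation pairs in colorectal cancer</a>. Its first step is to find miRNA-mRNA pairs: since miRNAs mainly regulate mRNA expression negatively, a correlation analysis on our two expression matrices readily yields statistically significant miRNA-mRNA pairs. The analysis goes as follows:</p>
<blockquote><p>Plot a heatmap of the expression of the differential miRNAs and check whether it separates the groups clearly<br />
Use databases such as miRWalk2.0 to obtain the validated target genes of these differential miRNAs<br />
Then compute the <strong>expression correlation coefficients of these miRNA-target gene pairs</strong> and keep the significantly positively or negatively correlated pairs<br />
Run <strong>enrichment analysis</strong> on the selected miRNA-target gene pairs<br />
Finally, run <strong>PPI network analysis</strong> on these miRNA-target gene pairs</p></blockquote>
<p>First, let's see how to produce the heatmap from step one:</p>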
<blockquote><p>## drop rows containing NA (e.g. miRNAs without an adjusted p-value)<br />
resOrdered=na.omit(resOrdered)<br />
## keep miRNAs with |log2FC| &gt; log2(1.5) and padj &lt; 0.01<br />
DEmiRNA=resOrdered[abs(resOrdered$log2FoldChange)&gt;log2(1.5) &amp; resOrdered$padj &lt;0.01 ,]<br />
write.csv(resOrdered,"deseq2.results.csv",quote = F)<br />
DEmiRNAexprSet=exprSet[rownames(DEmiRNA),]<br />
write.csv(DEmiRNAexprSet,'DEmiRNAexprSet.csv')</p>
<p>## column 1 of the csv holds the miRNA names, columns 2-7 the six samples<br />
DEmiRNAexprSet=read.csv('<span style="color: #ff0000;"><strong>DEmiRNAexprSet.csv</strong></span>',stringsAsFactors = F)<br />
exprSet=as.matrix(DEmiRNAexprSet[,2:7])<br />
rownames(exprSet)=DEmiRNAexprSet[,1]<br />
heatmap(exprSet)<br />
gplots::heatmap.2(exprSet)<br />
library(pheatmap)<br />
##<span style="color: #ff0000;"> http://biit.cs.ut.ee/clustvis/</span></p></blockquote>
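<p>The expression values I saved earlier are raw counts, so they need to be normalized before drawing a heatmap. I did not bother to do that here and simply used a web tool that produces the heatmap automatically: <span style="color: #ff0000;">http://biit.cs.ut.ee/clustvis/</span></p>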
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/07/miRNA-heatmap.png"><img class="alignnone  wp-image-1721" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/07/miRNA-heatmap.png" alt="miRNA-heatmap" width="696" height="520" /></a></p>
<p>It looks quite good: the change in miRNA expression in the cells before and after ET1 stimulation is clearly visible.</p>
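<p>For anyone who would rather normalize the counts themselves than rely on the web tool, here is a minimal Python sketch of one common recipe: log2 counts-per-million followed by per-row z-scores. It is not necessarily what ClustVis does internally; the function names, the pseudocount of 1 and the toy matrix are all my own assumptions.</p>

```python
import math


def log2_cpm(counts, pseudo=1.0):
    """Convert a features x samples count matrix to log2 counts-per-million."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [math.log2(row[j] / lib_sizes[j] * 1e6 + pseudo) for j in range(n_samples)]
        for row in counts
    ]


def row_zscore(matrix):
    """Center and scale every row (sample standard deviation)."""
    scaled = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / (len(row) - 1))
        scaled.append([(x - mean) / sd if sd > 0 else 0.0 for x in row])
    return scaled


# Toy counts: 3 miRNAs x 4 samples (invented numbers, not from the study).
toy_counts = [
    [10, 12, 40, 44],
    [500, 520, 480, 510],
    [0, 1, 0, 2],
]
normalized = row_zscore(log2_cpm(toy_counts))
for row in normalized:
    print([round(x, 2) for x in row])
```

<p>Each row ends up centered on zero, so the heatmap colors reflect the pattern across samples rather than the absolute abundance of each miRNA.</p>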
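<p>Next we examine the target genes of the significantly differential miRNAs we are interested in. There are two approaches: take validated miRNA target genes from a database, or predict them from the correlation between miRNA and mRNA expression.</p>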
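<p>The correlation-based approach can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the expression values are invented, the cutoff of -0.8 is arbitrary, and no p-value is computed, which a real analysis with only six samples would certainly need.</p>

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def negative_pairs(mirna_expr, mrna_expr, cutoff=-0.8):
    """Return (miRNA, gene, r) for every pair whose r is at or below the cutoff."""
    hits = []
    for mirna, m_values in mirna_expr.items():
        for gene, g_values in mrna_expr.items():
            r = pearson(m_values, g_values)
            if r > cutoff:
                continue
            hits.append((mirna, gene, round(r, 3)))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical normalized expression measured on the same six samples.
mirna_expr = {"miR-A": [1.0, 5.0, 1.2, 5.5, 0.9, 5.1]}
mrna_expr = {
    "GENE1": [9.0, 4.0, 8.8, 3.5, 9.2, 4.1],  # falls when miR-A rises
    "GENE2": [2.0, 6.1, 2.2, 6.0, 1.9, 6.3],  # rises together with miR-A
}
print(negative_pairs(mirna_expr, mrna_expr))
```

<p>Only the GENE1 pair survives here; GENE2 is positively correlated and is dropped.</p>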
<p>There are many tools for looking up miRNA target genes in databases; the most commonly used are <span style="color: #ff0000;"><strong>TargetScan/miRTarBase</strong> </span><br />
### http://nar.oxfordjournals.org/content/early/2015/11/19/nar.gkv1258.full<br />
### http://mirtarbase.mbc.nctu.edu.tw/<br />
### http://mirtarbase.mbc.nctu.edu.tw/cache/download/6.1/hsa_MTI.xlsx<br />
### http://www.targetscan.org/vert_71/ (version 7.1 (June 2016))<br />
I have also come across an integrated resource, miRecords (DIANA-microT, MicroInspector, miRanda, MirTarget2, miTarget, NBmiRTar, PicTar, PITA, RNA22, RNAhybrid and TargetScan/TargetScanS). It points out that miRNA target prediction has a high false-positive rate, so a target is only considered real when at least 5 tools support it.<br />
There are many similar tools, such as miRWalk2 and the psRNATarget web tool. Finally, one worth mentioning comes from Sun Yat-sen University: <a href="http://starbase.sysu.edu.cn/panCancer.php"> starBase  </a>Pan-Cancer Analysis Platform is designed for deciphering Pan-Cancer Networks of lncRNAs, miRNAs, ceRNAs and RNA-binding proteins (RBPs) by mining clinical and expression profiles of 14 cancer types (&gt;6000 samples) from The Cancer Genome Atlas (TCGA) Data Portal (all data available without limitations). I have not used it in depth, but the description looks impressive. There is also an R package, miRLAB, which I played with for a while: it first computes the <strong>expression correlation coefficient of every miRNA-gene pair</strong>, selects the significantly positively or negatively correlated pairs, and then validates them against known databases.</p>
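<p>The "supported by at least 5 tools" rule mentioned above is easy to apply once the predictions of each tool are in hand. Below is a rough Python sketch; the tool names come from the list above, but the predicted pairs are invented and the input layout (a dict of tool name to pairs) is my own assumption, not a format that miRecords provides.</p>

```python
from collections import defaultdict


def consensus_targets(predictions, min_tools=5):
    """Keep miRNA-gene pairs predicted by at least `min_tools` distinct tools.

    `predictions` maps a tool name to an iterable of (miRNA, gene) tuples.
    """
    support = defaultdict(set)
    for tool, pairs in predictions.items():
        for pair in pairs:
            support[pair].add(tool)
    return {
        pair: sorted(tools)
        for pair, tools in support.items()
        if len(tools) >= min_tools
    }


# Hypothetical predictions: GENE1 is backed by six tools, GENE2 by three.
predictions = {
    "TargetScan": [("miR-A", "GENE1"), ("miR-A", "GENE2")],
    "miRanda": [("miR-A", "GENE1"), ("miR-A", "GENE2")],
    "PicTar": [("miR-A", "GENE1")],
    "PITA": [("miR-A", "GENE1"), ("miR-A", "GENE2")],
    "RNA22": [("miR-A", "GENE1")],
    "RNAhybrid": [("miR-A", "GENE1")],
}
for pair, tools in consensus_targets(predictions).items():
    print(pair, len(tools))
```

<p>Lowering <code>min_tools</code> trades fewer missed targets for more false positives, so the threshold is worth reporting alongside the results.</p>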
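<p>I will not cover the remaining steps. What you can do mainly depends on whether you have enough other biological data alongside the miRNA data: for cancer patients with survival data you can run a survival analysis, and if you also measured methylation you can run methylation-related analyses.</p>
<p>If all you have is miRNA sequencing data, you can go back and study the steps of de novo miRNA prediction, which is also an important research topic.</p>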
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1719.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Self-Study miRNA-seq Analysis, Part 7: Obtaining mRNA Expression for the Samples Paired with the miRNA Data</title>
		<link>http://www.bio-info-trainee.com/1716.html</link>
		<comments>http://www.bio-info-trainee.com/1716.html#comments</comments>
		<pubDate>Fri, 01 Jul 2016 15:57:59 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[tutorial]]></category>
		<category><![CDATA[hgu133plus2]]></category>
		<category><![CDATA[limma]]></category>
		<category><![CDATA[miRNA-seq]]></category>
		<category><![CDATA[differential expression analysis]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1716</guid>
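		<description><![CDATA[This part hardly counts as self-study of miRNA-seq analysis; in essence it is Affymetrix mR &#8230; <a href="http://www.bio-info-trainee.com/1716.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>This part hardly counts as self-study of miRNA-seq analysis. In essence it is an analysis of Affymetrix mRNA expression array data, on the most common platform at that: GPL570    HG-U133_Plus_2. But the arrays were run on samples paired with the miRNA samples, and the two sets of results will be used together later for co-expression network analysis and so on, so I post the analysis of the array data here. The paper also states: Messenger RNA expression analysis identified 731 probe sets with significant differential expression. The authors' list of significant genes from their differential analysis is here: <span id="more-1716"></span>## <a href="http://journals.plos.org/plosone/article/asset?unique&amp;id=info:doi/10.1371/journal.pone.0108051.s002">http://journals.plos.org/plosone/article/asset?unique&amp;id=info:doi/10.1371/journal.pone.0108051.s002</a><br />
## mRNA expression array - GSE60291  (Affymetrix Human Genome U133 Plus 2.0 Array)</p>
<p>hgu133plus2 array data are very common. You can download the study's raw data from GEO and analyze them with the affy and limma packages, or use the GEOquery package to download the expression matrix already processed by the authors and go straight to differential analysis. I chose the latter. My approach differs from the authors' in one respect: I first annotate every probe with its gene and, for each gene, keep only the probe with the highest expression, whereas the authors ran the differential analysis on the probe-level expression matrix and annotated the probes in the results afterwards. I cannot say which approach is definitively better. The code is as follows:</p>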
<blockquote><p>rm(list=ls())<br />
library(GEOquery)<br />
library(limma)<br />
GSE60291 &lt;- getGEO('GSE60291', destdir=".",getGPL = F)</p>
<p># the expression matrix<br />
<strong><span style="color: #ff0000;">exprSet</span></strong>=exprs(GSE60291[[1]])<br />
library("annotate")<br />
GSE60291[[1]]<br />
## sample grouping information<br />
pdata=pData(GSE60291[[1]])<br />
<span style="color: #ff0000;"><strong>treatment</strong></span>=factor(unlist(lapply(pdata$title,function(x) strsplit(as.character(x),"-")[[1]][1])))<br />
#treatment=relevel(treatment,'control')<br />
## annotate the probes with gene symbols<br />
platformDB='hgu133plus2.db'<br />
library(platformDB, character.only=TRUE)<br />
probeset &lt;- featureNames(GSE60291[[1]])<br />
#EGID &lt;- as.numeric(lookUp(probeset, platformDB, "ENTREZID"))<br />
SYMBOL &lt;-  lookUp(probeset, platformDB, "SYMBOL")<br />
## for each gene keep the probe with the highest mean expression<br />
a=cbind(SYMBOL,exprSet)<br />
## remove the duplicated probeset<br />
rmDupID &lt;-function(a=matrix(c(1,1:5,2,2:6,2,3:7),ncol=6)){<br />
exprSet=a[,-1]<br />
rowMeans=apply(exprSet,1,function(x) mean(as.numeric(x),na.rm=T))<br />
a=a[order(rowMeans,decreasing=T),]<br />
exprSet=a[!duplicated(a[,1]),]<br />
#<br />
exprSet=exprSet[!is.na(exprSet[,1]),]<br />
rownames(exprSet)=exprSet[,1]<br />
exprSet=exprSet[,-1]<br />
return(exprSet)<br />
}<br />
exprSet=rmDupID(a)<br />
rn=rownames(exprSet)<br />
exprSet=apply(exprSet,2,as.numeric)<br />
rownames(exprSet)=rn<br />
exprSet[1:4,1:4]<br />
#exprSet=log(exprSet) ## based on e<br />
boxplot(exprSet,las=2)<br />
## differential expression analysis of the array data with limma<br />
design=model.matrix(~ treatment)<br />
fit=lmFit(exprSet,design)<br />
fit=eBayes(fit)<br />
#vennDiagram(decideTests(fit))<br />
DEG=topTable(fit,coef=2,n=Inf,adjust='BH')<br />
dim(DEG[abs(DEG[,1])&gt;1.2 &amp; DEG[,5]&lt;0.05,])  ## 806 genes<br />
write.csv(DEG,"ET1-normal.DEG.csv")</p></blockquote>
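<p>The resulting file ET1-normal.DEG.csv is our differential analysis result. Compared with the list provided in the paper, it is almost identical!</p>
<p>Selecting by |logFC| &gt; 1.2 and adjusted P value &lt; 0.05 yields 806 genes.</p>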
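<p>The step that differs from the authors' workflow, collapsing several probes to one row per gene, may be easier to follow outside of R. The Python sketch below mirrors the idea of the <code>rmDupID</code> function above (rank probes by mean expression, keep the top one per symbol, drop unannotated probes); the function name and the toy values are my own and purely illustrative.</p>

```python
def collapse_probes(symbols, expr_rows):
    """For each gene symbol keep the probe row with the highest mean expression.

    `symbols[i]` annotates `expr_rows[i]`; probes without a symbol (None) are dropped.
    """
    best = {}
    for symbol, row in zip(symbols, expr_rows):
        if symbol is None:
            continue
        mean = sum(row) / len(row)
        if symbol not in best or mean > best[symbol][0]:
            best[symbol] = (mean, row)
    return {symbol: row for symbol, (mean, row) in best.items()}


# Toy data: invented log2 intensities for four probes on three arrays.
symbols = ["TP53", "TP53", None, "EDN1"]
expr_rows = [
    [5.0, 5.2, 5.1],
    [8.0, 8.3, 7.9],   # brighter TP53 probe, this one is kept
    [6.0, 6.0, 6.0],   # unannotated probe, dropped
    [9.1, 11.4, 9.0],
]
gene_matrix = collapse_probes(symbols, expr_rows)
print(gene_matrix)
```

<p>Keeping the brightest probe is only one convention; averaging the probes of a gene is another, and the two can give slightly different gene lists.</p>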
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1716.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Self-Study miRNA-seq Analysis, Part 6: Differential Expression Analysis of miRNAs</title>
		<link>http://www.bio-info-trainee.com/1714.html</link>
		<comments>http://www.bio-info-trainee.com/1714.html#comments</comments>
		<pubDate>Fri, 01 Jul 2016 15:11:26 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[tutorial]]></category>
		<category><![CDATA[DESeq]]></category>
		<category><![CDATA[DESeq2]]></category>
		<category><![CDATA[miRNA-seq]]></category>
		<category><![CDATA[differential expression analysis]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1714</guid>
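		<description><![CDATA[This part is the watershed of miRNA-seq data analysis. The previous five parts covered reading the paper, downloading the data, aligning and then &#8230; <a href="http://www.bio-info-trainee.com/1714.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>This part is the watershed of miRNA-seq data analysis. The previous five parts covered reading the paper, downloading the data, aligning the reads and computing expression levels. That is routine pipeline work: after sequencing at a company you usually receive these results, or the paper provides them for download. Analyzing a single sample on its own means little, though. Research generally focuses on differences in miRNA expression between conditions, followed by annotation, functional analysis and network analysis; that is the key part, and also the hard part. Here I take the miRNA expression values already processed by the paper to show how to do the downstream analysis, starting with differential expression: <span id="more-1714"></span>According to the paper, the samples are grouped as follows:</p>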
<blockquote>
<div>GSM1470353: control-CM, experiment1; Homo sapiens; miRNA-Seq   SRR1542714</div>
<div>GSM1470354: ET1-CM, experiment1; Homo sapiens; miRNA-Seq  SRR1542715</div>
<div>GSM1470355: control-CM, experiment2; Homo sapiens; miRNA-Seq SRR1542716</div>
<div>GSM1470356: ET1-CM, experiment2; Homo sapiens; miRNA-Seq SRR1542717</div>
<div>GSM1470357: control-CM, experiment3; Homo sapiens; miRNA-Seq SRR1542718</div>
<div>GSM1470358: ET1-CM, experiment3; Homo sapiens; miRNA-Seq SRR1542719</div>
<div>So there are 6 sequenced samples in two groups: simply a comparison of the CM cell line before and after ET1 stimulation!</div>
</blockquote>
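<div>We also have the expression matrix of these 6 samples, measured in read counts, so we would normally use a common package such as DESeq2 or edgeR for the differential analysis. There are a dozen or so other tools for this; I simply use the one I am most comfortable with, DESeq2, as the example.</div>
<div>The code below is somewhat long. Because I have covered the usage of DESeq2 many times in my Bioconductor tutorial series, I only paste the code here. The point is simply that we run the differential analysis and obtain the list of differential miRNAs.</div>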
<blockquote>
<div>### step8: differential expression analysis by R package for miRNA expression patterns:<br />
## The result reported in the paper:<br />
## MicroRNA sequencing revealed over 250 known and 34 predicted novel miRNAs to be differentially expressed between ET-1 stimulated and unstimulated control hiPSC-CMs.<br />
## (FDR &lt; 0.1 and 1.5 fold change)<br />
rm(list=ls())<br />
setwd('J:\\miRNA_test\\paper_results')  ## put the paper's results downloaded from GEO here<br />
sampleIDs=c()<br />
groupList=c()<br />
allFiles=list.files(pattern = '.txt')<br />
i=allFiles[1]<br />
sampleID=strsplit(i,"_")[[1]][1]<br />
treat=strsplit(i,"_")[[1]][4]<br />
dat=read.table(i,stringsAsFactors = F)<br />
colnames(dat)=c('miRNA',sampleID)<br />
groupList=c(groupList,treat)<br />
for (i in allFiles[-1]){<br />
sampleID=strsplit(i,"_")[[1]][1]<br />
treat=strsplit(i,"_")[[1]][4]<br />
a=read.table(i,stringsAsFactors = F)<br />
colnames(a)=c('miRNA',sampleID)<br />
dat=merge(dat,a,by='miRNA')<br />
groupList=c(groupList,treat)<br />
}</div>
<div>### the code above just merges the 6 separate expression files into one expression matrix<br />
## we need to filter the low expression level miRNA<br />
exprSet=dat[,-1]<br />
rownames(exprSet)=dat[,1]<br />
suppressMessages(library(DESeq2))<br />
exprSet=ceiling(exprSet)<br />
(colData &lt;- data.frame(row.names=colnames(exprSet), groupList=groupList))</div>
<div>## using DESeq2 really is this simple<br />
dds &lt;- DESeqDataSetFromMatrix(countData = exprSet,<br />
colData = colData,<br />
design = ~ groupList)<br />
dds &lt;- DESeq(dds)<br />
png("qc_dispersions.png", 1000, 1000, pointsize=20)<br />
plotDispEsts(dds, main="Dispersion plot")<br />
dev.off()<br />
res &lt;- results(dds)<br />
## draw a few plots, which serve as QC<br />
png("RAWvsNORM.png")<br />
rld &lt;- rlogTransformation(dds)<br />
exprSet_new=assay(rld)<br />
par(cex = 0.7)<br />
n.sample=ncol(exprSet)<br />
if(n.sample&gt;40) par(cex = 0.5)<br />
cols &lt;- rainbow(n.sample*1.2)<br />
par(mfrow=c(2,2))<br />
boxplot(exprSet,  col = cols,main="expression value",las=2)<br />
boxplot(exprSet_new, col = cols,main="expression value",las=2)<br />
hist(exprSet[,1])<br />
hist(exprSet_new[,1])<br />
dev.off()<br />
library(RColorBrewer)<br />
(mycols &lt;- brewer.pal(8, "Dark2")[1:length(unique(groupList))])</p>
<p># Sample distance heatmap<br />
sampleDists &lt;- as.matrix(dist(t(exprSet_new)))<br />
#install.packages("gplots",repos = "http://cran.us.r-project.org")<br />
library(gplots)<br />
png("qc-heatmap-samples.png", w=1000, h=1000, pointsize=20)<br />
heatmap.2(as.matrix(sampleDists), key=F, trace="none",<br />
col=colorpanel(100, "black", "white"),<br />
ColSideColors=mycols[factor(groupList)], RowSideColors=mycols[factor(groupList)],<br />
margin=c(10, 10), main="Sample Distance Matrix")<br />
dev.off()</p>
<p>png("MA.png")<br />
DESeq2::plotMA(res, main="DESeq2", ylim=c(-2,2))<br />
dev.off()<br />
## this is the key step: the differential analysis results<br />
resOrdered &lt;- res[order(res$padj),]<br />
resOrdered=as.data.frame(resOrdered)<br />
write.csv(resOrdered,"<span style="color: #ff0000;"><strong>deseq2.results.csv</strong></span>",quote = F)</p>
<p>## a few more plots, mainly to look at the differences between samples<br />
library(limma)<br />
plotMDS(log(counts(dds, normalized=TRUE) + 1))<br />
plotMDS(log(counts(dds, normalized=TRUE) + 1) - log(t( t(assays(dds)[["mu"]]) / sizeFactors(dds) ) + 1))<br />
plotMDS( assays(dds)[["counts"]] )  ## raw count<br />
plotMDS( assays(dds)[["mu"]] ) ##- fitted values.</p>
</div>
</blockquote>
<div>The final differential analysis result, deseq2.results.csv, can then be filtered by FDR and fold change to select the differential miRNAs that meet our criteria.</div>
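<div>As a rough illustration of that filtering step, here is a Python sketch using the thresholds quoted from the paper (FDR &lt; 0.1 and 1.5 fold change). The CSV excerpt is invented, and the name of the first column ("miRNA") is an assumption: the file written by write.csv above leaves that header cell empty.</div>

```python
import csv
import io
import math


def select_de(rows, fold_change=1.5, fdr=0.1):
    """Return the names of rows passing both the fold-change and FDR thresholds."""
    min_lfc = math.log2(fold_change)
    selected = []
    for row in rows:
        if row["padj"] in ("", "NA"):
            continue  # no adjusted p-value available for this row
        if float(row["padj"]) >= fdr:
            continue
        if abs(float(row["log2FoldChange"])) > min_lfc:
            selected.append(row["miRNA"])
    return selected


# A made-up excerpt shaped like deseq2.results.csv.
example_csv = """miRNA,baseMean,log2FoldChange,padj
miR-A,250.0,1.40,0.0004
miR-B,90.0,-0.90,0.03
miR-C,1200.0,0.20,0.0001
miR-D,15.0,2.10,0.45
miR-E,3.0,1.80,NA
"""
rows = list(csv.DictReader(io.StringIO(example_csv)))
print(select_de(rows))
```

<div>Note that the fold-change threshold is applied on the log2 scale, so a 1.5-fold change corresponds to an absolute log2FoldChange above about 0.585.</div>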
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1714.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Self-Study miRNA-seq Analysis, Part 5: Obtaining miRNA Expression Levels</title>
		<link>http://www.bio-info-trainee.com/1712.html</link>
		<comments>http://www.bio-info-trainee.com/1712.html#comments</comments>
		<pubDate>Sat, 25 Jun 2016 09:34:46 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[tutorial]]></category>
		<category><![CDATA[HTseq]]></category>
		<category><![CDATA[miRNA-seq]]></category>
		<category><![CDATA[expression level]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1712</guid>
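		<description><![CDATA[The sam/bam files obtained after alignment only count as level 2 data. Usually, when we &#8230; <a href="http://www.bio-info-trainee.com/1712.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>The sam/bam files obtained after alignment only count as level 2 data; when we share results with others we usually hand over the expression matrix directly. miRNA analysis is similar to mRNA analysis, but its expression matrix is somewhat easier to obtain. For mRNA we usually align against the genome, which consists of just the 24 reference chromosomes, so to find out which gene a read hit we have to extract the expression information with a program driven by the genome annotation file. The popular tool for this is htseq; I have written a tutorial on installing and using it before, so I will not repeat that here. For miRNA, however, I aligned against the 1881 precursor miRNA sequences themselves, so the expression of each reference miRNA sequence can be read directly off the sam/bam files. <span id="more-1712"></span></p>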
<blockquote>
<div>## step6: counts the reads which mapping to each miRNA reference.</div>
<div></div>
<div>## we need to exclude unmapped as well as multiple-mapped  reads</div>
<div></div>
<div>## XS:i:&lt;n&gt; Alignment score for second-best alignment. Can be negative. Can be greater than 0 in --local mode</div>
<div>## NM:i:1   ## NM i Edit distance to the reference, including ambiguous bases but excluding clipping</div>
<div>#The following command exclude unmapped (-F 4) as well as multiple-mapped (grep -v “XS:”) reads</div>
<div>#samtools view -F 4 input.bam | grep -v "XS:" | wc -l</div>
<div></div>
<div>## 180466//1520320</div>
<div></div>
<div>##cat &gt;<a href="http://count.hairpin.sh/">count.hairpin.sh</a></div>
<div></div>
<div>ls *hairpin.sam  | while read id</div>
<div>do</div>
<div><strong>samtools view  -SF 4 $id |perl -alne '{$h{$F[2]}++}END{print "$_\t$h{$_}" foreach sort keys %h }'  &gt; ${id%%_*}.hairpin.counts</strong></div>
<div>done</div>
<div></div>
<div>## bash <a href="http://count.hairpin.sh/">count.hairpin.sh</a></div>
<div></div>
<div>##cat &gt;<a href="http://count.mature.sh/">count.mature.sh</a></div>
<div></div>
<div>ls *mature.sam  | while read id</div>
<div>do</div>
<div><strong>samtools view  -SF 4 $id |perl -alne '{$h{$F[2]}++}END{print "$_\t$h{$_}" foreach sort keys %h }'  &gt; ${id%%_*}.mature.counts</strong></div>
<div>done</div>
<div></div>
<div>## bash <a href="http://count.mature.sh/">count.mature.sh</a></div>
</blockquote>
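<div>The code above is my own script for computing expression. It is very simple because I ignored the details and just wanted the expression values for each sample. If you aligned to the reference genome instead, you would compute expression from the miRNA gff annotation file with a tool such as htseq.</div>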
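<div>For readers who do not read Perl one-liners easily, the counting logic can be sketched in Python as below. It works on SAM text lines (for example the output of samtools view), skips unmapped reads via the 0x4 FLAG bit, and can optionally skip reads carrying bowtie2's XS:i: tag, which my script above does not do. The function name and the toy records are my own assumptions.</div>

```python
from collections import Counter


def count_references(sam_lines, skip_multimapped=False):
    """Count aligned reads per reference name (SAM column 3).

    Header lines start with '@'. Bit 0x4 of the FLAG column marks an unmapped read.
    With skip_multimapped=True, reads that carry an XS:i: tag are skipped as well.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:
            continue
        if skip_multimapped and any(tag.startswith("XS:i:") for tag in fields[11:]):
            continue
        counts[fields[2]] += 1
    return counts


# Invented SAM records: two hits on hsa-toy-1 (one with XS), one unmapped, one on hsa-toy-2.
toy_sam = [
    "@SQ\tSN:hsa-toy-1\tLN:72",
    "r1\t0\thsa-toy-1\t8\t42\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:-3",
    "r2\t16\thsa-toy-1\t8\t1\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:0\tXS:i:0",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t0\thsa-toy-2\t6\t42\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:0",
]
for name, n in sorted(count_references(toy_sam).items()):
    print(name, n, sep="\t")
```

<div>Whether to drop multi-mapped reads is a judgment call for miRNA data, because members of the same miRNA family share very similar sequences.</div>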
<div>With the expression values in hand, we can compare them with the paper:</div>
<blockquote>
<div>### step7: compare the results with paper's</div>
<div>GSM1470353: control-CM, experiment1; Homo sapiens; miRNA-Seq   SRR1542714</div>
<div>GSM1470354: ET1-CM, experiment1; Homo sapiens; miRNA-Seq  SRR1542715</div>
<div>GSM1470355: control-CM, experiment2; Homo sapiens; miRNA-Seq SRR1542716</div>
<div>GSM1470356: ET1-CM, experiment2; Homo sapiens; miRNA-Seq SRR1542717</div>
<div>GSM1470357: control-CM, experiment3; Homo sapiens; miRNA-Seq SRR1542718</div>
<div>GSM1470358: ET1-CM, experiment3; Homo sapiens; miRNA-Seq SRR1542719</div>
<div>### Below I use R to check how my results differ from the published ones.</div>
<div> <strong>a=read.table("bowtie_bam/SRR1542714.mature.counts")</strong></div>
<div><strong> b=read.table("paper_results/GSM1470353_iPS_010313_Unstim_known_miRNA_counts.txt")</strong></div>
<div> tmp=merge(a,b,by='V1')  ## join my counts and the paper's counts on the miRNA name</div>
<div> plot(log(tmp[,2]),log(tmp[,3]))</div>
<div> cor(tmp[,2],tmp[,3])</div>
<div><strong>##[1] 0.8413439</strong></div>
</blockquote>
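<div>The correlation is decent, so at least the analysis did not go wrong.</div>
<div>I wrote this code based on my own understanding of the paper. Since miRNA data analysis is not my specialty, the parameters I chose for the alignment may not be ideal. If an expert can point out mistakes, all the better: call me directly or send me an email; the mailbox user name is jmzeng1314, at 163 mail.</div>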
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1712.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Self-Study miRNA-seq Analysis, Part 4: Aligning the Sequencing Reads</title>
		<link>http://www.bio-info-trainee.com/1709.html</link>
		<comments>http://www.bio-info-trainee.com/1709.html#comments</comments>
		<pubDate>Sat, 25 Jun 2016 09:25:10 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[tutorial]]></category>
		<category><![CDATA[hairpin]]></category>
		<category><![CDATA[miRBase]]></category>
		<category><![CDATA[miRNA-seq]]></category>
		<category><![CDATA[alignment]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1709</guid>
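		<description><![CDATA[Sequence alignment is at the core of most types of data analysis. To make good use of sequencing data, the alignment details matter a great deal. Here I &#8230; <a href="http://www.bio-info-trainee.com/1709.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Sequence alignment is at the core of most types of data analysis, and to make good use of sequencing data the alignment details matter a great deal. Since I am only working through one paper here, I did not think too hard about those details; I just list my code and a few thoughts, aiming to reproduce the authors' results. There are two alignment strategies for miRNA-seq data: align to the known miRNA sequences downloaded from miRBase, or align directly to the reference genome (hg19/hg38 for human). The former is very simple and makes it easy to count the expression of all known miRNA sequences. The latter is more time-consuming and makes expression counting less convenient, but it has the advantage that it can predict novel miRNAs, so most papers take both routes.<span id="more-1709"></span></p>
<p>The paper used SHRiMP, a niche tool. I did not pay attention to that at first and simply used bowtie2. Because of server constraints, my reference is the set of human sequences downloaded from miRBase. In the current miRBase release, human has 2588 known mature miRNA sequences and 1881 precursor miRNA sequences. My download code (downloaded in June 2016) is in <a href="http://www.bio-info-trainee.com/1697.html"> Self-Study miRNA-seq Analysis, Part 2: Gathering Learning Materials</a>, and the software and sequences used for the alignment below are covered in <a href="http://www.bio-info-trainee.com/1703.html">Self-Study miRNA-seq Analysis, Part 3: Downloading Public Sequencing Data</a></p>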
<blockquote>
<div>## step5 : alignment to miRBase v21 (hairpin.human.fa/mature.human.fa )</div>
<div>#### step5.1 using bowtie2 to do alignment</div>
<div></div>
<div>mkdir  bowtie2_index &amp;&amp;  cd bowtie2_index</div>
<div>~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ../hairpin.human.fa hairpin_human</div>
<div>~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ../mature.human.fa  mature_human</div>
<div>ls *_clean.fq.gz | while read id ; do  ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -x miRBase/bowtie2_index/hairpin_human -U $id   -S ${id%%.*}.hairpin.sam ; done</div>
<div><strong>## overall alignment rate:  10.20% / 5.71%/ 10.18%/ 4.36% / 10.02% / 4.95%  (before convert U to T )</strong></div>
<div><strong>## overall alignment rate:  51.77% / 70.38%/51.45% /61.14%/ 52.20% / 65.85% (after convert U to T )</strong></div>
<div></div>
<div>ls *_clean.fq.gz | while read id ; do  ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -x miRBase/bowtie2_index/mature_human  -U $id   -S ${id%%.*}.mature.sam ; done</div>
<div><strong>## overall alignment rate:  6.67% / 3.78% / 6.70% / 2.80%/ 6.55% / 3.23%    (before convert U to T )</strong></div>
<div><strong>## overall alignment rate:  34.94% / 46.16%/ 35.00%/ 38.50% / 35.46% /42.41%(after convert U to T )</strong></div>
<div></div>
<div>#### step5.2 using SHRiMP to do alignment</div>
<div>##    <a href="http://compbio.cs.toronto.edu/shrimp/README">http://compbio.cs.toronto.edu/shrimp/README</a></div>
<div>##    3.5 Mapping cDNA reads against a miRNA database</div>
<div>cd ~/biosoft/SHRiMP/SHRiMP_2_2_3</div>
<div>export SHRIMP_FOLDER=$PWD</div>
<div>cd -</div>
<div>##　　We project the database with:</div>
<div>$SHRIMP_FOLDER/utils/project-db.py --seed 00111111001111111100,00111111110011111100,00111111111100111100,00111111111111001100,00111111111111110000 \</div>
<div> --h-flag --shrimp-mode ls miRBase/hairpin.human.fa</div>
<div>##</div>
<div>$SHRIMP_FOLDER/bin/gmapper-ls -L  hairpin.human-ls SRR1542716.fastq  --qv-offset 33   \</div>
<div>-o 1 -H -E -a -1 -q -30 -g -30 --qv-offset 33 --strata -N 8  &gt;map.out 2&gt;map.log</div>
</blockquote>
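<p>As you can see, aligning the reads to precursor miRNAs and to mature miRNAs gives different results. One precursor miRNA can give rise to several mature miRNAs, and not every mature form is recorded in the database, so aligning to the precursor miRNA database is generally recommended; it also allows predicting new mature miRNAs, which is very worthwhile.</p>
<p>One more very important point: the alignment rate differs enormously before and after I converted U to T. This was a really silly mistake, and I will say no more about it. At this stage, though, we can already check our results against the paper, which reports the alignment rate and the aligned sequences.</p>
<p>I learned about this issue from a blog post as well:</p>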
<p>Thank you so  much!. Yes I contacted the lab-guy and he just said that trimmed the first 4 bp and last 4bp. ( as you found)</p>
<p>So  I firstly <strong>trimmed the adapter sequences</strong>(TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC)</p>
<p>And then, <strong>trimmed the first 4bp and last 4bp</strong> from reads, which <strong>leads to the 22bp peak of read-length distribution(instead of 24bp)</strong></p>
<p>Anyhow, I tried to map with bowtie2 again.</p>
<p><strong>&gt; </strong><strong>bowtie2 --local -N 1 -L 16</strong></p>
<p><strong>-x ../miRNA_reference/<span style="color: #ff00ff;">hairpin_UtoT.fa</span></strong></p>
<p><strong>-U first4bptrimmed_A1-SmallRNA_S1_L001_R1_001_Illuminaadpatertrim.fastq</strong></p>
<p><strong>-S f4_trimmed.sam</strong></p>
<p><strong>I also changed hairpin.fa file (U to T) </strong></p>
<p>Oh.. thank you David,</p>
<p>Finally, I got</p>
<p>2565353 reads; of these:<br />
2565353 (100.00%) were unpaired; of these:<br />
479292 (18.68%) aligned 0 times<br />
11959 (0.47%) aligned exactly 1 time<br />
2074102 (80.85%) aligned &gt;1 times<br />
<strong>81.32% overall alignment rate</strong></p>
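<p>The percentages in a summary like the one quoted above can be recomputed from the raw counts, which is a handy sanity check when comparing runs. A small Python sketch (the function name and dictionary keys are my own):</p>

```python
def alignment_rates(total, unaligned, unique, multi):
    """Recompute the percentages of a single-end alignment summary from its counts."""
    if unaligned + unique + multi != total:
        raise ValueError("the three categories must add up to the total read count")

    def pct(n):
        return round(100.0 * n / total, 2)

    return {
        "aligned 0 times": pct(unaligned),
        "aligned exactly 1 time": pct(unique),
        "aligned >1 times": pct(multi),
        "overall": pct(unique + multi),
    }


# Counts copied from the summary quoted above.
rates = alignment_rates(total=2565353, unaligned=479292, unique=11959, multi=2074102)
print(rates)
```

<p>The overall rate is simply the uniquely aligned plus the multiply aligned reads over the total. In this quoted run almost all aligned reads hit more than one hairpin, which is worth keeping in mind before filtering out multi-mapped reads.</p>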
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1709.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Self-Study miRNA-seq Analysis, Part 3: Downloading Public Sequencing Data</title>
		<link>http://www.bio-info-trainee.com/1703.html</link>
		<comments>http://www.bio-info-trainee.com/1703.html#comments</comments>
		<pubDate>Sat, 25 Jun 2016 09:08:43 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[miRNA-seq]]></category>
		<category><![CDATA[ncbi]]></category>
		<category><![CDATA[SHRiMP]]></category>
		<category><![CDATA[sratoolkit]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1703</guid>
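		<description><![CDATA[As mentioned earlier, the paper's data have been uploaded to NCBI's SRA, so we can download them directly by accession number &#8230; <a href="http://www.bio-info-trainee.com/1703.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>As mentioned earlier, the paper's data have been uploaded to NCBI's SRA, so we download them by accession number and use SRAtoolkit to convert them into the fastq files we need. Downloaded data usually go through quality control: visualize the quality first, then apply simple filtering based on the overall sequencing quality. A few tools therefore need to be installed in advance, including: sratoolkit /fastx_toolkit /fastqc/bowtie2/hg19/miRBase/SHRiMP</p>
<p>Below is my record of downloading and installing the software on a new server. I had already installed fastx_toolkit /fastqc, so their code is not listed. The miRBase download was covered in Part 2: <a href="http://www.bio-info-trainee.com/1697.html">Self-Study miRNA-seq Analysis, Part 2: Gathering Learning Materials</a><span id="more-1703"></span></p>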
<blockquote>
<div>## pre-step: download sratoolkit /fastx_toolkit_0.0.13/fastqc/bowtie2/hg19/miRBase/SHRiMP</div>
<div>## <a href="http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software">http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software</a></div>
<div>## <a href="http://www.ncbi.nlm.nih.gov/books/NBK158900/">http://www.ncbi.nlm.nih.gov/books/NBK158900/</a></div>
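<div> ## I deliberately downloaded the binary releases so they can be used right after unpacking; be sure to pick the build that matches your operating system.</div>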
<div>cd ~/biosoft</div>
<div>mkdir sratoolkit &amp;&amp;  cd sratoolkit</div>
<div>wget <a href="http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.6.3/sratoolkit.2.6.3-centos_linux64.tar.gz">http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.6.3/sratoolkit.2.6.3-centos_linux64.tar.gz</a></div>
<div>##</div>
<div>##  Length: 63453761 (61M) [application/x-gzip]</div>
<div>##  Saving to: "sratoolkit.2.6.3-centos_linux64.tar.gz"</div>
<div>tar zxvf <strong>sratoolkit.2.6.3-centos_linux64.tar.gz</strong></div>
<div></div>
<div>cd ~/biosoft</div>
<div>mkdir bowtie &amp;&amp;  cd bowtie</div>
<div>wget <a href="https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.9/bowtie2-2.2.9-linux-x86_64.zip/download">https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.9/bowtie2-2.2.9-linux-x86_64.zip/download</a></div>
<div>#Length: 27073243 (26M) [application/octet-stream]</div>
<div>#Saving to: "download"</div>
<div> mv download  bowtie2-2.2.9-linux-x86_64.zip</div>
<div> unzip <strong>bowtie2-2.2.9-linux-x86_64.zip</strong></div>
<div></div>
<div>## <a href="http://compbio.cs.toronto.edu/shrimp/">http://compbio.cs.toronto.edu/shrimp/</a></div>
<div>cd ~/biosoft</div>
<div>mkdir SHRiMP &amp;&amp;  cd SHRiMP</div>
<div>wget <a href="http://compbio.cs.toronto.edu/shrimp/releases/SHRiMP_2_2_3.lx26.x86_64.tar.gz">http://compbio.cs.toronto.edu/shrimp/releases/SHRiMP_2_2_3.lx26.x86_64.tar.gz</a></div>
<div>tar zxvf<strong> SHRiMP_2_2_3.lx26.x86_64.tar.gz </strong></div>
<div>cd SHRiMP_2_2_3</div>
<div>export SHRIMP_FOLDER=$PWD  ## this tool is a bit unusual: it needs this environment variable to be set and cannot simply be called by its full path</div>
</blockquote>
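<div>SHRiMP is a niche tool that I had never heard of before. I originally planned to get by with bowtie and spare myself the trouble, but my first alignment hit a bug: the U's in the downloaded miRNA sequences had not been converted to T, so the alignment rate was extremely low. That pushed me to run the alignment with SHRiMP, the tool recorded in the paper, only to find that the alignment rate did not improve at all, which made me wonder whether the authors had been careless.</div>
<div>Below is the code for downloading the data and for quality control; I hope you will run it yourself:</div>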
<div>
<blockquote>
<div>## step1 : download raw data</div>
<div>mkdir miRNA_test &amp;&amp; cd miRNA_test</div>
<div>echo {14..19} |sed 's/ /\n/g' |while read id; \</div>
<div>do  wget "<a href="ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP045/SRP045420/SRR15427$id/SRR15427$id.sra">ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP045/SRP045420/SRR15427$id/SRR15427$id.sra</a>"  ;\</div>
<div>done</div>
<div></div>
<div>## step2 :  change sra data to fastq files.</div>
<div>## mainly a shell loop for batch processing</div>
<div>ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump $id;done</div>
<div>rm *sra</div>
<div></div>
<div>##  33M --&gt; 247M</div>
<div>#Read 1866654 spots for SRR1542714.sra</div>
<div>#Written 1866654 spots for SRR1542714.sra</div>
<div></div>
<div></div>
<div>## step3 : download the results from paper</div>
<div>## <a href="http://www.bio-info-trainee.com/1571.html">http://www.bio-info-trainee.com/1571.html</a></div>
<div>## <a href="ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1009/suppl/GSE1009_RAW.tar">ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1009/suppl/GSE1009_RAW.tar</a></div>
<div></div>
<div>mkdir paper_results &amp;&amp; cd paper_results</div>
<div>wget <a href="ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60292/suppl/GSE60292_RAW.tar">ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60292/suppl/GSE60292_RAW.tar</a></div>
<div>## tar xvf GSE60292_RAW.tar</div>
<div>ls *gz |while read id ; do (echo $id;zcat $id | cut -f 2 |perl -alne '{$t+=$_;}END{print $t}');done</div>
<div>ls *gz |xargs gunzip</div>
<div></div>
<div></div>
<div></div>
<div>## step4 : quality assessment</div>
<div></div>
<div>ls *fastq | while read id ; do ~/biosoft/fastqc/FastQC/fastqc $id;done</div>
<div>## Sequence length 8-109</div>
<div>## %GC 52</div>
<div>## Adapter Content passed</div>
<div></div>
<div>## write a script : :: cat &gt;filter.sh</div>
<div></div>
<div>ls *fastq |while read id</div>
<div>do</div>
<div>echo $id</div>
<div>~/biosoft/fastx_toolkit_0.0.13/bin/fastq_quality_filter<strong> -v -q 20 -p 80 -Q33</strong>  -i $id -o tmp ;</div>
<div>~/biosoft/fastx_toolkit_0.0.13/bin/fastx_trimmer <strong>-v -f 1 -l 27</strong> <strong>-i tmp  -Q33 -z</strong> -o ${id%%.*}_clean.fq.gz ;</div>
<div>done</div>
<div>rm tmp</div>
<div></div>
<div>##<strong> discarded 12%~~49%%</strong></div>
<div>ls *_clean.fq.gz | while read id ; do ~/biosoft/fastqc/FastQC/fastqc $id;done</div>
<div></div>
<div>mkdir QC_results</div>
<div>mv *zip *html QC_results</div>
</blockquote>
</div>
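<div>I wrote this code based on my own understanding of the paper. Since miRNA data analysis is not my specialty, the parameters I chose for QC may not be ideal. If an expert can point out mistakes, all the better: call me directly or send me an email; the mailbox user name is jmzeng1314, at 163 mail.</div>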
<div>
<div>~/biosoft/fastx_toolkit_0.0.13/bin/fastq_quality_filter<strong> -v -q 20 -p 80 -Q33</strong>  -i $id -o tmp ;</div>
<div>~/biosoft/fastx_toolkit_0.0.13/bin/fastx_trimmer <strong>-v -f 1 -l 27</strong> <strong>-i tmp  -Q33 -z</strong> -o ${id%%.*}_clean.fq.gz ;</div>
<div>The resulting *_clean.fq.gz files are the sequences I will align.</div>
</div>
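<div>To make the meaning of those two commands concrete, here is a small Python sketch of what I understand the options to do: -q 20 -p 80 keeps a read only when at least 80% of its bases have quality 20 or higher (with -Q33 meaning Phred+33 encoding), and -f 1 -l 27 keeps bases 1 through 27. This is my reading of the option descriptions, not the FASTX-Toolkit source code, and the function names and toy reads are invented.</div>

```python
def passes_quality(qual, min_q=20, min_percent=80, offset=33):
    """True when at least `min_percent` percent of the bases reach quality `min_q`."""
    good = sum(1 for ch in qual if ord(ch) - offset >= min_q)
    return good * 100 >= min_percent * len(qual)


def trim(seq, qual, first=1, last=27):
    """Keep bases first..last (1-based, inclusive) of a read and of its quality string."""
    return seq[first - 1:last], qual[first - 1:last]


def clean_reads(records):
    """Yield (name, seq, qual) for reads that pass the filter, trimmed to 27 nt at most."""
    for name, seq, qual in records:
        if passes_quality(qual):
            trimmed_seq, trimmed_qual = trim(seq, qual)
            yield name, trimmed_seq, trimmed_qual


# Toy reads: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
records = [
    ("r1", "ACGT" * 8, "I" * 32),              # long, high quality: kept and trimmed
    ("r2", "ACGTACGTAC", "IIIII#####"),        # only 50% good bases: discarded
    ("r3", "ACGTACGTAC", "IIIIIIIII#"),        # 90% good bases: kept, already short
]
cleaned = list(clean_reads(records))
for name, seq, qual in cleaned:
    print(name, len(seq))
```

<div>Trimming to 27 nt is a blunt substitute for adapter removal; reads shorter than that pass through unchanged.</div>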
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1703.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Self-Study miRNA-seq Analysis, Part 2: Gathering Learning Materials</title>
		<link>http://www.bio-info-trainee.com/1697.html</link>
		<comments>http://www.bio-info-trainee.com/1697.html#comments</comments>
		<pubDate>Sat, 25 Jun 2016 08:51:07 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[tutorial]]></category>
		<category><![CDATA[miRNA-seq]]></category>
		<category><![CDATA[self-study]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1697</guid>
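		<description><![CDATA[Because I am also starting miRNA-seq analysis completely from scratch, the materials I collected are fairly complete. I first &#8230; <a href="http://www.bio-info-trainee.com/1697.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Because I am also starting miRNA-seq analysis completely from scratch, the materials I collected are fairly complete. I first read some Chinese-language material to understand what miRNA sequencing is and what there is to analyze, and then searched for material mainly around the analysis steps of the paper mentioned in the previous post: <a href="http://www.bio-info-trainee.com/1693.html">Self-Study miRNA-seq Analysis, Part 1: Paper Selection and Reading</a></p>
<p>I first got hold of a definition of miRNA: <a href="http://nar.oxfordjournals.org/content/34/suppl_1/D135.full">http://nar.oxfordjournals.org/content/34/suppl_1/D135.full </a>. Of course, practically every miRNA paper gives one in its introduction; I just picked one at random.<span id="more-1697"></span></p>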
<p>MicroRNAs (miRNAs) are <strong>small RNA molecules</strong>, which are<strong> ∼22 nt sequences</strong> that have an important role in the translational regulation and degradation of mRNA by the base's pairing to the 3′-untranslated regions (3′-UTR) of the mRNAs. The miRNAs are derived from the <strong>precursor transcripts of ∼70–120 nt sequences</strong>, which fold to <strong>form as stem–loop structures</strong>, which are thought to be highly conserved in the evolution of genomes. Previous analyses have suggested that<strong> ∼1% of all human genes are miRNA genes,</strong> which regulate the production of protein for 10% or more of all human coding genes.</p>
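<p>The question I then struggled with was which reference sequences to use. miRNA sequences are short and few, so mapping them against the roughly 3 Gb human genome seemed a waste of computing resources. My server also happened to be down, and I wanted to get through this exercise on my own PC with as little hassle as possible. Many posts I found align the reads with bowtie against the reference miRNA database miRBase (miRNA count: 28645 entries): <a href="http://www.mirbase.org/">http://www.mirbase.org/</a>. From this database I learned the difference between precursor and mature miRNAs. A precursor miRNA is typically <strong>∼70–120</strong> bases long and forms a stem-loop structure, that is, a hairpin, hence the name hairpin. A mature miRNA is typically <strong>∼22 bases</strong> long. Both sets are easy to download from miRBase. In the current miRBase release, human has 2588 known mature miRNA sequences and 1881 precursor sequences. The code I used to download them (in June 2016) is:</p>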
<blockquote>
<div> wget <a href="ftp://mirbase.org/pub/mirbase/CURRENT/hairpin.fa.gz">ftp://mirbase.org/pub/mirbase/CURRENT/hairpin.fa.gz</a>   <strong>## 28645 sequences</strong></div>
<div> wget <a href="ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.zip">ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.zip</a>   <strong>## 35828 sequences</strong></div>
<div> wget <a href="ftp://mirbase.org/pub/mirbase/CURRENT/hairpin.fa.zip">ftp://mirbase.org/pub/mirbase/CURRENT/hairpin.fa.zip</a></div>
<div> wget <a href="ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3">ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3</a> ##</div>
<div> wget <a href="ftp://mirbase.org/pub/mirbase/CURRENT/miFam.dat.zip">ftp://mirbase.org/pub/mirbase/CURRENT/miFam.dat.zip</a></div>
<div></div>
<div> grep sapiens mature.fa |wc  　<strong># 2588 </strong></div>
<div> grep sapiens hairpin.fa |wc      <strong> # 1881 </strong></div>
<div>## Homo sapiens</div>
<div>perl -alne '{if(/^&gt;/){if(/Homo/){$tmp=1}else{$tmp=0}};next if $tmp!=1;<span style="color: #ff00ff;"><strong>s/U/T/g</strong> </span>if !/&gt;/;print }' hairpin.fa &gt;hairpin.human.fa</div>
<div>perl -alne '{if(/^&gt;/){if(/Homo/){$tmp=1}else{$tmp=0}};next if $tmp!=1;<span style="color: #ff00ff;"><strong>s/U/T/g</strong></span> if !/&gt;/;print }' mature.fa &gt;mature.human.fa</div>
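<div>It is worth noting that the sequences downloaded from miRBase are written with U. In other words, they are the miRNA (RNA) sequences themselves rather than the sequences of the genes they are transcribed from, whereas our sequencing reads are written in the DNA alphabet with T. That is why the one-liners above replace U with T.</div>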
</blockquote>
<p>The hairpin.human.fa and mature.human.fa files produced by this code are the reference sequences for this analysis.</p>
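<p>For readers who prefer Python, the filtering done by the two perl one-liners can be sketched roughly as follows. This is only an illustration, not the code used in this post: the function name <code>human_dna_records</code> and the toy records are hypothetical, and the sketch assumes that each FASTA header contains the species name, as miRBase headers do.</p>
<pre>
```python
from typing import Iterable, Iterator


def human_dna_records(lines: Iterable[str], species: str = "Homo") -> Iterator[str]:
    """Yield FASTA lines of records whose header mentions `species`.

    Sequence lines of kept records have U replaced by T, mirroring the
    s/U/T/g step of the perl one-liners. Hypothetical helper.
    """
    keep = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # Header line: decide whether this record is kept.
            keep = species in line
            if keep:
                yield line
        elif keep:
            # Sequence line of a kept record: RNA alphabet to DNA alphabet.
            yield line.replace("U", "T")


# Toy records that only imitate the miRBase header layout (made-up data).
demo = [
    ">cel-demo-1 MI0000000 Caenorhabditis elegans demo stem-loop",
    "UACACUGUGGAUCCGG",
    ">hsa-demo-1 MI0000000 Homo sapiens demo stem-loop",
    "UGGGAUGAGGUAGUAGG",
]
filtered = list(human_dna_records(demo))
print("\n".join(filtered))
```
</pre>
<p>On a real file one would pass an open file handle for hairpin.fa or mature.fa instead of the toy list and write the yielded lines to hairpin.human.fa or mature.human.fa.</p>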
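<p>While collecting material, I came across a paper that mines 1000 Genomes data to find SNPs located in miRNAs: <a href="https://genomemedicine.biomedcentral.com/articles/10.1186/gm363">https://genomemedicine.biomedcentral.com/articles/10.1186/gm363</a>. It looks interesting but is unrelated to this exercise; I am just noting it here so I can come back to it when I have time.</p>
<p>I also found some blog posts explaining how to analyze miRNA data: <a href="http://genomespot.blogspot.com/2013/08/quick-alignment-of-microrna-seq-data-to.html">http://genomespot.blogspot.com/2013/08/quick-alignment-of-microrna-seq-data-to.html</a></p>
<p>Many companies and research groups also describe their data analysis pipelines:</p>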
<blockquote><p><a href="http://bioinfo5.ugr.es/miRanalyzer/miRanalyzer_tutorial.html">http://bioinfo5.ugr.es/miRanalyzer/miRanalyzer_tutorial.html</a></p>
<p><a href="http://www.partek.com/sites/default/files/Assets/UserGuideMicroRNAPipeline.pdf">http://www.partek.com/sites/default/files/Assets/UserGuideMicroRNAPipeline.pdf</a></p>
<p><a href="http://partek.com/Tutorials/microarray/microRNA/miRNA_tutorial.pdf">http://partek.com/Tutorials/microarray/microRNA/miRNA_tutorial.pdf</a></p>
<p><a href="http://www.arraystar.com/reviews/microrna-sequencing-data-analysis-guideline/">http://www.arraystar.com/reviews/microrna-sequencing-data-analysis-guideline/</a></p>
<p><a href="http://bioinfo5.ugr.es/sRNAbench/sRNAbench_tutorial.pdf">http://bioinfo5.ugr.es/sRNAbench/sRNAbench_tutorial.pdf</a></p>
<p><a href="http://seqcluster.readthedocs.io/mirna_annotation.html">http://seqcluster.readthedocs.io/mirna_annotation.html</a></p></blockquote>
<p>Yale University seems to do this well: <a href="http://www.yale.edu/giraldezlab/miRNA.html">http://www.yale.edu/giraldezlab/miRNA.html</a></p>
<p>In China there is Southgene: <a href="http://www.southgene.com/newsshow.php?cid=55&amp;id=73">http://www.southgene.com/newsshow.php?cid=55&amp;id=73</a></p>
<p>A complete miRNA research plan: <a href="http://wenku.baidu.com/view/5f38577a31b765ce05081429.html?re=view">http://wenku.baidu.com/view/5f38577a31b765ce05081429.html?re=view</a></p>
<p>Biostars discussion threads:</p>
<p><a href="https://www.biostars.org/p/3344/">https://www.biostars.org/p/3344/</a></p>
<p><a href="https://www.biostars.org/p/98486/">https://www.biostars.org/p/98486/</a></p>
<p>A practical guide to miRNA-seq data processing: <a href="http://bib.oxfordjournals.org/content/early/2015/04/17/bib.bbv019.full">http://bib.oxfordjournals.org/content/early/2015/04/17/bib.bbv019.full</a></p>
<p>A single package can also do the job: <a href="http://bioconductor.org/packages/release/bioc/html/easyRNASeq.html">http://bioconductor.org/packages/release/bioc/html/easyRNASeq.html</a></p>
<p>GitHub pipeline, miRNA Analysis Pipeline v0.2.7: <a href="https://github.com/bcgsc/mirna/tree/master/v0.2.7">https://github.com/bcgsc/mirna/tree/master/v0.2.7</a></p>
<p><a href="https://tools.thermofisher.com/content/sfs/manuals/CO25176_0512.pdf">https://tools.thermofisher.com/content/sfs/manuals/CO25176_0512.pdf</a></p>
<p>miRNA annotation: <a href="http://seqcluster.readthedocs.io/mirna_annotation.html">http://seqcluster.readthedocs.io/mirna_annotation.html</a></p>
<p>A web-based analysis tool: <a href="https://wiki.uio.no/projects/clsi/images/2/2f/HTS_2014_miRNA_analysis_Lifeportal_14_final.pdf">https://wiki.uio.no/projects/clsi/images/2/2f/HTS_2014_miRNA_analysis_Lifeportal_14_final.pdf</a></p>
<p>An R package that also works well: <a href="http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf">http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf</a></p>
<p>A training course: <a href="http://www.training.prace-ri.eu/uploads/tx_pracetmo/NGSdataAnalysisWithChipster.pdf">http://www.training.prace-ri.eu/uploads/tx_pracetmo/NGSdataAnalysisWithChipster.pdf</a></p>
<p>Visualization, IGV User Guide: <a href="http://www.broadinstitute.org/igv/book/export/html/6">http://www.broadinstitute.org/igv/book/export/html/6</a></p>
<p>Novel miRNA prediction and miRNA target-gene prediction are special cases: there is a great deal of software for them, but no established pipeline or standard.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1697.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Self-study miRNA-seq analysis, part 1: choosing and reading the paper</title>
		<link>http://www.bio-info-trainee.com/1693.html</link>
		<comments>http://www.bio-info-trainee.com/1693.html#comments</comments>
		<pubDate>Sat, 25 Jun 2016 08:29:11 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[tutorial]]></category>
		<category><![CDATA[bioStar]]></category>
		<category><![CDATA[miRNA-seq]]></category>
		<category><![CDATA[literature]]></category>
		<category><![CDATA[self-study]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1693</guid>
		<description><![CDATA[A few days ago, while browsing the Biostars forum, I saw a question about miRNA analysis. The asker, from N &#8230; <a href="http://www.bio-info-trainee.com/1693.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
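				<content:encoded><![CDATA[<p>A few days ago, while browsing the Biostars forum, I saw a question about miRNA analysis. The asker had downloaded a paper's raw data from NCBI SRA and got stuck while processing it. The data he listed came from an Ion Torrent sequencer, and since I had never worked with miRNA-seq data before, I took some time to teach myself. With a background in RNA-seq, I found it fairly easy to pick up. I am recording my learning process here in the hope that it helps those who come after.<span id="more-1693"></span></p>
<p>The paper chosen here was published in 2014. The authors stimulated human iPSC-derived cardiomyocytes (hiPSC-CMs) with ET-1 and compared miRNA and mRNA expression before and after stimulation. I did not look closely at the biology and only read the paper from a data-analysis perspective. mRNA expression was measured with the Affymetrix Human Genome U133 Plus 2.0 Array, which is easy to analyze: build the expression matrix, then find differentially expressed genes with the limma package. The miRNA data take a bit more work. The authors used an Ion Torrent sequencer, but the data downloaded from SRA are FASTQ reads with adapters already removed, so sequencer-specific processing is not really needed here.</p>
<p>Some resources for this paper are collected below:</p>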
<blockquote>
<div>## paper : <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0108051">http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0108051</a></div>
<div>## Aggarwal P, Turner A, Matter A, Kattman SJ et al. RNA expression profiling of human iPSC-derived cardiomyocytes in a cardiac hypertrophy model. PLoS One 2014;9(9):e108051. PMID: 25255322</div>
<div>## The accession numbers are 1. SuperSeries (mRNA+miRNA) - GSE60293</div>
<div>## 2. mRNA expression array - GSE60291  (Affymetrix Human Genome U133 Plus 2.0 Array)</div>
<div>## 3. miRNA-Seq - GSE60292  (Ion Torrent)</div>
<div>## GEO   : <a href="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60292">http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60292</a></div>
<div>## FTP   : <a href="ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP045/SRP045420">ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP045/SRP045420</a></div>
</blockquote>
<div>Read carefully which analyses the paper performed; only then can we imitate them ourselves and obtain the same results.</div>
<div>
<p>The paper's data-processing workflow is:<br />
Ion Torrent's Torrent Suite version 3.6 was used for basecalling<br />
Raw sequencing reads were aligned using the <strong>SHRiMP2 aligner</strong> and were aligned against the human reference genome <strong>(hg19)</strong> for novel miRNA prediction and then against a custom reference sequence file containing <strong>miRBase v.20 known human miRNA hairpins, tRNA, rRNA,</strong> adapter sequences and predicted novel miRNA sequences.(Genome_build: <strong>hg19, miRBase v.20 human miRNA hairpins</strong>)</p>
<p>The <strong>miRDeep2 package (default parameters)</strong> was used to predict novel (as yet undescribed) miRNAs</p>
<p>Alignments with less than 17 bp matches and a custom 3′ end phred q-score threshold of 17 were filtered out.</p>
<p>miRNA quantification was done using <strong>HTSeq v0.5.3p3</strong> using the default union parameter.<br />
Differential miRNA expression was analyzed using the <strong>DESeq (v.1.12.1) R/Bioconductor package</strong></p>
<p>In this study, differentially expressed genes that had a false discovery rate cutoff at 10% (FDR &lt;= 0.1) and a log<sub>2</sub> fold change greater than 1.5 or less than −1.5 were considered significant.</p>
<p>Target gene prediction was performed using the <strong>TargetScan (version 6.2)</strong> database</p>
<p>We also used <strong>miRTarBase (version 4.3),</strong> to identify targets that have been experimentally validated</p>
<p>## miRDeep2 and miReap: predict the exact precursor sequence from the mature sequence.</p>
</div>
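<p>The significance rule quoted above (FDR at most 0.1 and an absolute log<sub>2</sub> fold change above 1.5) can be sketched in Python as a simple filter. This is a minimal sketch under stated assumptions: the <code>DeResult</code> record, its field names, the miRNA labels and all numbers are made up for illustration and are not DESeq output from the paper.</p>
<pre>
```python
from typing import List, NamedTuple


class DeResult(NamedTuple):
    """One row of a differential-expression table (hypothetical layout)."""
    mirna: str     # made-up miRNA label
    log2fc: float  # log2 fold change, treated vs. control
    fdr: float     # multiple-testing adjusted p-value


def is_significant(r: DeResult, fdr_cutoff: float = 0.1, lfc_cutoff: float = 1.5) -> bool:
    # Keep rows whose FDR does not exceed the cutoff and whose fold change
    # is large in either direction (up- or down-regulated).
    return fdr_cutoff >= r.fdr and abs(r.log2fc) > lfc_cutoff


# Toy values only, chosen to exercise each branch of the rule.
toy: List[DeResult] = [
    DeResult("mir-A", 2.3, 0.01),   # strongly up, low FDR: kept
    DeResult("mir-B", -1.8, 0.08),  # strongly down, low FDR: kept
    DeResult("mir-C", 0.9, 0.001),  # fold change too small: dropped
    DeResult("mir-D", -2.5, 0.30),  # FDR too high: dropped
]
significant = [r.mirna for r in toy if is_significant(r)]
print(significant)
```
</pre>
<p>Taking the absolute value of the fold change is what lets one threshold cover both up- and down-regulated miRNAs.</p>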
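<div>The paper covers the FASTQ quality-control criteria, the alignment tool, the alignment references (two alignment routes), miRNA quantification, novel miRNA prediction, and miRNA target-gene prediction. This is also the standard routine for learning miRNA-seq data analysis. The authors provide all of their results, so we can reproduce their analysis through our own study.</div>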
<div>
<p>Supplementary_files_format_and_content: <strong>tab-delimited text files containing raw read counts for known mature human miRNAs (the expression matrix).</strong></p>
<p>We detected<strong> 836 known human mature miRNAs</strong> in the control-CMs and <strong>769 in the ET1-CMs</strong></p>
<p>Based on our miRNA-Seq data, we predicted <strong>506 sequences to be potentially novel, as yet undescribed miRNAs.</strong></p>
<p>In order to validate the expression profiles of the miRNAs detected, <strong>we performed RT-qPCR on a subset of five known human mature and five of our predicted novel miRNAs.</strong></p>
<p>we obtained a total of <strong>1,922 predicted miRNA-mRNA pairs</strong> represented by 309 genes and 174 known mature human miRNAs.</p>
</div>
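<div>Of course, routine analysis alone is not enough to publish, so the authors combined the miRNA and mRNA data in a network analysis, validated a few results with wet-lab experiments, and finally discussed some biological significance. Naturally, a purely computational analysis like this cannot claim much in the way of grand goals such as curing disease.</div>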
<div></div>
<div>In the next post I will cover the reference materials I collected while teaching myself miRNA-seq analysis.</div>
<div></div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1693.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
