<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 &#187; alignment</title>
	<atom:link href="http://www.bio-info-trainee.com/tag/alignment/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
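	<description>You are welcome to leave comments and join the discussion on the forum biotrainee.com, or to follow the WeChat official account of the same name, biotrainee</description>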
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>Self-Taught ChIP-seq Analysis, Lecture 5: Aligning the Sequencing Reads</title>
		<link>http://www.bio-info-trainee.com/1742.html</link>
		<comments>http://www.bio-info-trainee.com/1742.html#comments</comments>
		<pubDate>Tue, 05 Jul 2016 00:42:39 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[tutorial]]></category>
		<category><![CDATA[alignment]]></category>
		<category><![CDATA[bowtie2]]></category>
		<category><![CDATA[CHIP-seq]]></category>
		<category><![CDATA[mapping]]></category>
		<category><![CDATA[unique]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1742</guid>
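		<description><![CDATA[Alignment itself is conceptually simple, and new mapping tools keep appearing; the ones we most commonly use are BWA &#8230; <a href="http://www.bio-info-trainee.com/1742.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Alignment itself is conceptually simple, and new mapping tools keep appearing. The ones we most commonly use are BWA and bowtie; I picked bowtie2 here, since others have already compared how the various tools perform and we can simply use one. The code is as follows:</p>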
<blockquote><p>## step5 : alignment to hg19/ using bowtie2 to do alignment<br />
## build the index once (left commented out because it only has to run a single time):<br />
## ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ~/biosoft/bowtie/hg19_index/hg19.fa ~/biosoft/bowtie/hg19_index/hg19<br />
## cat &gt;run_bowtie2.sh<br />
ls *.fastq | while read id ;<br />
do<br />
echo $id<br />
## align the single-end reads with 8 threads; the alignment summary goes to the .align.log file<br />
~/biosoft/bowtie/bowtie2-2.2.9/<span style="color: #ff0000;">bowtie2 -p 8 -x</span> ~/biosoft/bowtie/hg19_index/hg19 -U $id -S ${id%%.*}.sam 2&gt;${id%%.*}.align.log;<br />
## keep alignments with MAPQ of at least 30 and write them as BAM<br />
<span style="color: #ff0000;">samtools view -bhS -q 30</span> ${id%%.*}.sam &gt; ${id%%.*}.bam ## -F 1548 https://broadinstitute.github.io/picard/explain-flags.html<br />
# -F 0x4 remove the reads that didn't match<br />
samtools sort ${id%%.*}.bam ${id%%.*}.sorted ## output prefix (legacy samtools 0.1.x syntax; samtools 1.x uses -o instead)<br />
# samtools view -bhS a.sam | samtools sort -o - ./ &gt; a.bam<br />
samtools index ${id%%.*}.sorted.bam<br />
done</p></blockquote>
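<p>The index ~/biosoft/bowtie/hg19_index/hg19 has to be built by yourself beforehand; see the earlier post.</p>
<p>How exactly the SAM files from the initial alignment should be filtered is something I could not pin down. I went through many papers and none of them gave a clear rationale; each does it its own way, so I cannot offer a standard either. I tested several filters and the called peaks looked much the same; I just could not reproduce the results reported in the paper!</p>
<p>Generally speaking, only the uniquely mapped reads should be kept from the initial SAM file, which is why I used samtools view -bhS -q 30, but it made no noticeable difference to the results. Some people say that peak callers do this filtering themselves, so it depends on the tool you choose for the downstream analysis.</p>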
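<p>To make the filtering step concrete, here is a minimal sketch of what a MAPQ threshold such as -q 30 amounts to, written with plain awk on SAM text so it can be tried without samtools. The function name filter_sam_by_mapq is made up for this illustration; it is not part of samtools or bowtie2. Note also that a high MAPQ is only a common proxy for "uniquely mapped", not the same thing as the "aligned exactly 1 time" count in the bowtie2 log.</p>

```shell
# Hypothetical helper: keep SAM header lines plus records whose MAPQ
# (column 5 of a SAM record) is at least the given threshold.
# $1: minimum MAPQ, defaulting to 30 as used in the post.
filter_sam_by_mapq() {
  awk -F '\t' -v min="${1:-30}" '
    /^@/ { print; next }          # header lines pass through unchanged
    $5 + 0 >= min + 0 { print }   # unmapped reads have MAPQ 0 and are dropped too
  '
}
```

<p>Assuming a SAM file named sample.sam (a placeholder), it could be used as: cat sample.sam | filter_sam_by_mapq 30. On real data samtools view -q is the tool to use; the sketch only shows which records survive.</p>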
<p>Here are the alignment logs:</p>
<blockquote><p>SRR1042593.fastq<br />
16902907 reads; of these:<br />
16902907 (100.00%) were unpaired; of these:<br />
667998 (3.95%) aligned 0 times<br />
12467095 (73.76%) aligned exactly 1 time<br />
3767814 (22.29%) aligned &gt;1 times<br />
96.05% overall alignment rate<br />
[samopen] SAM header is present: 93 sequences.<br />
SRR1042594.fastq<br />
60609833 reads; of these:<br />
60609833 (100.00%) were unpaired; of these:<br />
9165487 (15.12%) aligned 0 times<br />
39360173 (64.94%) aligned exactly 1 time<br />
12084173 (19.94%) aligned &gt;1 times<br />
84.88% overall alignment rate<br />
[samopen] SAM header is present: 93 sequences.<br />
SRR1042595.fastq<br />
14603295 reads; of these:<br />
14603295 (100.00%) were unpaired; of these:<br />
918028 (6.29%) aligned 0 times<br />
10403045 (71.24%) aligned exactly 1 time<br />
3282222 (22.48%) aligned &gt;1 times<br />
93.71% overall alignment rate<br />
[samopen] SAM header is present: 93 sequences.<br />
SRR1042596.fastq<br />
65911151 reads; of these:<br />
65911151 (100.00%) were unpaired; of these:<br />
10561790 (16.02%) aligned 0 times<br />
42271498 (64.13%) aligned exactly 1 time<br />
13077863 (19.84%) aligned &gt;1 times<br />
83.98% overall alignment rate<br />
[samopen] SAM header is present: 93 sequences.<br />
SRR1042597.fastq<br />
22210858 reads; of these:<br />
22210858 (100.00%) were unpaired; of these:<br />
1779568 (8.01%) aligned 0 times<br />
15815218 (71.20%) aligned exactly 1 time<br />
4616072 (20.78%) aligned &gt;1 times<br />
91.99% overall alignment rate<br />
[samopen] SAM header is present: 93 sequences.<br />
SRR1042598.fastq<br />
58068816 reads; of these:<br />
58068816 (100.00%) were unpaired; of these:<br />
8433671 (14.52%) aligned 0 times<br />
37527468 (64.63%) aligned exactly 1 time<br />
12107677 (20.85%) aligned &gt;1 times<br />
85.48% overall alignment rate<br />
[samopen] SAM header is present: 93 sequences.<br />
SRR1042599.fastq<br />
24019489 reads; of these:<br />
24019489 (100.00%) were unpaired; of these:<br />
1411095 (5.87%) aligned 0 times<br />
17528479 (72.98%) aligned exactly 1 time<br />
5079915 (21.15%) aligned &gt;1 times<br />
94.13% overall alignment rate<br />
[samopen] SAM header is present: 93 sequences.<br />
SRR1042600.fastq<br />
76361026 reads; of these:<br />
76361026 (100.00%) were unpaired; of these:<br />
8442054 (11.06%) aligned 0 times<br />
50918615 (66.68%) aligned exactly 1 time<br />
17000357 (22.26%) aligned &gt;1 times<br />
88.94% overall alignment rate<br />
[samopen] SAM header is present: 93 sequences.</p></blockquote>
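<p>As you can see, the alignment was very successful! I am not going to present this as a table; I am not writing a report for a client, after all, so please make do with the raw log.</p>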
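<p>If a table is wanted anyway, the log can be condensed with a few lines of awk. This is only a sketch: it assumes the combined log format shown above (a line ending in .fastq from the echo, followed by the bowtie2 summary), and the function name summarize_bowtie2_log is invented for this example.</p>

```shell
# Hypothetical helper: turn a combined bowtie2 log read from stdin into
# one tab-separated row per sample, preceded by a header row.
summarize_bowtie2_log() {
  awk '
    BEGIN { print "sample\ttotal\tunaligned\tunique\tmulti\toverall" }
    /\.fastq$/               { sample = $1 }      # sample name line
    /reads; of these:/       { total = $1 }       # total number of reads
    /aligned 0 times/        { unaligned = $1 }
    /aligned exactly 1 time/ { unique = $1 }
    /aligned >1 times/       { multi = $1 }
    /overall alignment rate/ {
      # the rate line closes one sample, so emit its row here
      printf "%s\t%s\t%s\t%s\t%s\t%s\n", sample, total, unaligned, unique, multi, $1
    }
  '
}
```

<p>Assuming the log was saved to a file named bowtie2_run.log (a placeholder), the call would be: cat bowtie2_run.log | summarize_bowtie2_log. Lines that match none of the patterns, such as the [samopen] messages, are ignored.</p>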
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1742.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MOSAIK, a New Alignment Tool</title>
		<link>http://www.bio-info-trainee.com/1457.html</link>
		<comments>http://www.bio-info-trainee.com/1457.html#comments</comments>
		<pubDate>Tue, 15 Mar 2016 10:55:20 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[basic software]]></category>
		<category><![CDATA[alignment]]></category>
		<category><![CDATA[MOSAIK]]></category>
		<category><![CDATA[sam]]></category>
		<category><![CDATA[read alignment]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1457</guid>
		<description><![CDATA[Purpose: sequence alignment, similar to BWA and Bowtie. Strengths: works with every sequencing platform, even PacBio's &#8230; <a href="http://www.bio-info-trainee.com/1457.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div>Purpose: sequence alignment, similar to BWA and Bowtie</div>
<div>Strengths: works with every sequencing platform, even the long third-generation reads from PacBio</div>
<div>Algorithm: a hash index, unlike the BWT-based algorithms of the other tools</div>
<div>Official site: <a href="https://github.com/wanpinglee/MOSAIK" target="_blank">https://github.com/wanpinglee/MOSAIK</a></div>
<div>Paper: <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090581">http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090581</a></div>
<div>
<div>
<div>Author: WP Lee - 2014 - <a href="https://scholar.google.com/scholar?um=1&amp;ie=UTF-8&amp;lr&amp;cites=8963892741176779202">Cited by: 70</a> - <a href="https://scholar.google.com/scholar?um=1&amp;ie=UTF-8&amp;lr&amp;q=related:wmGLkvQkZnzaqM:scholar.google.com/">Related articles</a></div>
</div>
</div>
<p><span id="more-1457"></span></p>
<div>
<pre>Overview:

MOSAIK is a stable, sensitive and open-source program for mapping second and 
third-generation sequencing reads to a reference genome. Uniquely among current 
mapping tools, MOSAIK can align reads generated by all the major sequencing 
technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, 
Ion Torrent and Pacific BioSciences SMRT.</pre>
</div>
<h1><span style="color: #ff0000;">1. Installing the software</span></h1>
<div>
<div>Download link: <a href="https://github.com/wanpinglee/MOSAIK/archive/master.zip">https://github.com/wanpinglee/MOSAIK/archive/master.zip</a></div>
</div>
<div>Download the archive, unpack it, change into the src source directory, and run make. That is all.</div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/03/11.png"><img class="alignnone size-full wp-image-1458" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/03/11.png" alt="1" width="389" height="153" /></a></div>
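<div>These programs are then ready to use.</div>
<div>There are four programs, so a complete alignment takes four steps:</div>
<div>MosaikBuild and MosaikJump index the reference genome (MosaikBuild converts it to MOSAIK's binary format, MosaikJump builds the hash-based jump database)</div>
<div>MosaikBuild is also run on the sequencing reads to index them</div>
<div>MosaikAligner aligns the two indexed inputs against each other</div>
<div>MosaikText converts the alignment result into another readable format, usually SAM</div>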
<h1><span style="color: #ff0000;">2. Preparing the input data</span></h1>
<div>Alignment of course requires the sequencing reads in fastq format and the reference genome in fasta format.</div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/03/21.png"><img class="alignnone size-full wp-image-1459" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/03/21.png" alt="2" width="554" height="202" /></a></div>
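<div>I used the data from <a href="http://odin.mdacc.tmc.edu/~xsu1/VirusSeq.html" target="_blank">http://odin.mdacc.tmc.edu/~xsu1/VirusSeq.html</a>, because the reason I turned to this tool in the first place was the need to find viral integration in the human genome.</div>
<div>The reads are paired-end (PE), and the reference genomes are viral and human.</div>
<h1><span style="color: #ff0000;">3. Running the commands</span></h1>
<div>Below is a complete script.</div>
<div><span style="color: #4f81bd; font-size: medium;"><b>First, build the index for the reference genome</b></span></div>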
<div>
<blockquote>
<div><span style="font-family: Monaco,Consolas,Courier,Lucida Console,monospace;">Mosaik_bin=~/bio-soft/MOSAIK/bin  # set this to the directory where the programs are installed</span></div>
<div><span style="font-family: Monaco,Consolas,Courier,Lucida Console,monospace;">##for gib virus reference genome</span></div>
<div><span style="font-family: Monaco,Consolas,Courier,Lucida Console,monospace;">$Mosaik_bin/MosaikBuild -fr gibVirus.fa -oa gibVirus.fa.bin -st illumina -assignQual 40</span></div>
<div><span style="font-family: Monaco,Consolas,Courier,Lucida Console,monospace;">$Mosaik_bin/MosaikJump -ia gibVirus.fa.bin -out gibVirus.JumpDb -hs 15</span></div>
</blockquote>
</div>
<blockquote>
<div>These two steps build the hash index. For this collection of viral genomes (a 60M compressed archive), the timings were:</div>
<div>
<div>MosaikBuild CPU time: 15.660 s, wall time: 18.146 s</div>
</div>
<div>
<div>MosaikJump CPU time: 329.031 s, wall time: 331.672 s</div>
<div>That is acceptable, but the size of the index files it writes is harder to accept:</div>
</div>
<div><span style="font-family: Monaco,Consolas,Courier,Lucida Console,monospace;">333M Mar 11 19:55 gibVirus.fa.bin</span></div>
<div><span style="font-family: Monaco,Consolas,Courier,Lucida Console,monospace;">60M Aug 13  2013 gibVirus.fa.gz</span></div>
<div><span style="font-family: Monaco,Consolas,Courier,Lucida Console,monospace;">5.0G Mar 11 20:04 gibVirus.JumpDb_keys.jmp</span></div>
<div><span style="font-family: Monaco,Consolas,Courier,Lucida Console,monospace;">1 Mar 11 19:59 gibVirus.JumpDb_meta.jmp</span></div>
<div><span style="font-family: Monaco,Consolas,Courier,Lucida Console,monospace;">1.3G Mar 11 20:04 gibVirus.JumpDb_positions.jmp</span></div>
<div>For the human hg19 genome, the time taken was:</div>
<div><span style="font-family: Monaco,Consolas,Courier,Lucida Console,monospace;">MosaikBuild CPU time: 183.642 s, wall time: 184.658 s</span></div>
<div><span style="font-family: Monaco,Consolas,Courier,Lucida Console,monospace;">MosaikJump CPU time: 3985.608 s, wall time: 3995.323 s</span></div>
<div><span style="font-family: Monaco,Consolas,Courier,Lucida Console,monospace;">A little over an hour, which is tolerable.</span></div>
</blockquote>
<p><span style="color: #4f81bd; font-size: medium;"><b>With the reference genome indexed, the sequencing reads need to be indexed as well</b></span></p>
<div>
<blockquote>
<div>$Mosaik_bin/MosaikBuild  -q L526401A_1.fq.gz -q2 L526401A_2.fq.gz -out L526401A.bin -st illumina</div>
</blockquote>
</div>
<blockquote>
<div>The data are paired-end, roughly 1.6G per file; building the index took:</div>
</blockquote>
<div>
<blockquote>
<div># reads written:          53060622</div>
<div># bases written:        5304891143</div>
<div></div>
<div>MosaikBuild CPU time: 388.969 s, wall time: 391.149 s</div>
</blockquote>
</div>
<p><span style="color: #4f81bd; font-size: medium;"><b>Next comes the alignment itself</b></span></p>
<div>
<blockquote>
<div>ANN_PATH=~/bio-soft/MOSAIK/src/networkFile</div>
<div>$Mosaik_bin/MosaikAligner -in L526401A.bin  \</div>
<div>-out L526401A.bin.aligned \</div>
<div>-ia ../Mosaik_JumpDb/hg19Virus.fa.bin \</div>
<div>-j ../Mosaik_JumpDb/hg19Virus.JumpDb \</div>
<div>-annpe $ANN_PATH/2.1.26.pe.100.0065.ann -annse $ANN_PATH/2.1.26.se.100.005.ann</div>
</blockquote>
</div>
<p><span style="color: #4f81bd; font-size: medium;"><b>The alignment result is the file L526401A.bin.aligned, which still has to be converted to SAM format with MosaikText to make it readable</b></span></p>
<div>
<blockquote>
<div>$Mosaik_bin/MosaikText -in L526401A.bin.aligned  -sam L526401A.bin.aligned.sam -u</div>
</blockquote>
</div>
<blockquote>
<div>The GitHub repository actually ships test data; run through it once and everything becomes clear.</div>
<div></div>
</blockquote>
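<p>The commands from this section can be gathered into one dry-run helper that only prints what would be executed for a given sample, which makes the order of the four programs easy to check before committing hours of CPU time. This is a sketch under assumptions: the function name mosaik_plan and its four arguments are invented here, the flags are copied from the commands above and nothing else is implied about MOSAIK's options, and one reference prefix is used for both indexing and alignment.</p>

```shell
# Hypothetical dry-run helper: print the MOSAIK commands for one sample.
# $1: directory holding the MOSAIK binaries
# $2: reference prefix (expects $2.fa to exist when the commands are really run)
# $3: sample prefix (expects $3_1.fq.gz and $3_2.fq.gz)
# $4: directory holding the neural-network .ann files
mosaik_plan() {
  bin_dir=$1; ref=$2; sample=$3; ann_dir=$4
  # steps 1 and 2: index the reference
  echo "$bin_dir/MosaikBuild -fr $ref.fa -oa $ref.fa.bin -st illumina -assignQual 40"
  echo "$bin_dir/MosaikJump -ia $ref.fa.bin -out $ref.JumpDb -hs 15"
  # step 3: index the paired-end reads
  echo "$bin_dir/MosaikBuild -q ${sample}_1.fq.gz -q2 ${sample}_2.fq.gz -out $sample.bin -st illumina"
  # step 4: align the reads against the reference
  echo "$bin_dir/MosaikAligner -in $sample.bin -out $sample.bin.aligned -ia $ref.fa.bin -j $ref.JumpDb -annpe $ann_dir/2.1.26.pe.100.0065.ann -annse $ann_dir/2.1.26.se.100.005.ann"
  # step 5: convert the result to SAM
  echo "$bin_dir/MosaikText -in $sample.bin.aligned -sam $sample.bin.aligned.sam -u"
}
```

<p>For example, mosaik_plan ~/bio-soft/MOSAIK/bin hg19Virus L526401A ~/bio-soft/MOSAIK/src/networkFile prints five command lines; once they look right, the output can be piped to sh to run them.</p>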
<h1><span style="color: #ff0000;">4. Interpreting the results</span></h1>
<div>The output is ordinary SAM format, so it needs no further explanation.</div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1457.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
