为什么使用bowtie而不是bowtie2

昨天，我们在生信技能树讨论了miRNA-seq数据分析流程，并且提出来了一个问题，就是为什么现在很多流程仍然是使用bowtie而不是bowtie2，见：miRNAseq数据分析这么多年了它的流程也没有固定

如果你不想看后面的探索过程，只需要记住，如果你不想允许任何碱基错配，而且每个序列的reads也只需要一个最好的比对情况的输出，那么选择下面的参数即可：

-n 0 -m1 --best --strata

我确实懒得去探索bowtie2有没有可能通过调整参数达到同样的目的，或者其它一千多个比对工具是否提供了这样的选项，但是这个 -n 0 -m1 —best —strata 参数组合已经满足了miRNA数据分析啦。

查看测试数据

# 去fastq的前10万行，意味着2.5万条reads
zcat SRR1542719.fastq.gz |head -100000 > tmp.fq
fastq_to_fasta -i tmp.fq -o tmp.fa

前几行如下所示：

==> tmp.fa <==
>SRR1542719.1 A38I2:00333:01555 length=24
TGGGGTGAAGTAGGTTGTATAGTT
>SRR1542719.2 A38I2:02156:01418 length=23
TGGGATGTAGTAGGTTGTATAGT

==> tmp.fq <==
@SRR1542719.1 A38I2:00333:01555 length=24
TGGGGTGAAGTAGGTTGTATAGTT
+SRR1542719.1 A38I2:00333:01555 length=24
:;;;,7:6/6:::/8277B<AAA<
@SRR1542719.2 A38I2:02156:01418 length=23
TGGGATGTAGTAGGTTGTATAGT
+SRR1542719.2 A38I2:02156:01418 length=23
///(////8///92:4:>BBBBB

我已经构建好了bowtie的index文件，而且存放在了上层目录；

4.2M Apr 29 15:41 ../hsa-mature-bowtie-index.1.ebwt
7.1K Apr 29 15:41 ../hsa-mature-bowtie-index.2.ebwt
 24K Apr 29 15:41 ../hsa-mature-bowtie-index.3.ebwt
 15K Apr 29 15:41 ../hsa-mature-bowtie-index.4.ebwt
4.2M Apr 29 15:41 ../hsa-mature-bowtie-index.rev.1.ebwt
7.1K Apr 29 15:41 ../hsa-mature-bowtie-index.rev.2.ebwt

关于如何构建bowtie的index文件，大家可以参考五年前我在生信菜鸟团博客分享的一篇文章学会miRNA-seq分析。

fasta和fastq格式测序数据均可使用bowtie比对

首先使用默认参数，命令如下：

bowtie ../hsa-mature-bowtie-index -f tmp.fa -S tmp.fa.sam
bowtie ../hsa-mature-bowtie-index tmp.fq -S tmp.fq.sam

效果如下：

$bowtie ../hsa-mature-bowtie-index -f tmp.fa -S tmp.fa.sam
# reads processed: 25000
# reads with at least one reported alignment: 20393 (81.57%)
# reads that failed to align: 4607 (18.43%)
Reported 20393 alignments

$bowtie ../hsa-mature-bowtie-index tmp.fq -S tmp.fq.sam
# reads processed: 25000
# reads with at least one reported alignment: 20393 (81.57%)
# reads that failed to align: 4607 (18.43%)
Reported 20393 alignments

简单查看输出的sam文件的第2列：

 20364 0
 29 16
 4607 4

单端测序数据的sam文件里面的flag值为0代表正向比对成功到参考基因组，而16代表负向比对到参考基因组，4代表比对失败！有意思的是仅仅是29条序列的flag是16，而且我看了看，它们的reads长度都实在是太短了。

输入的fastq序列是2.5万条，所以输出的sam里面的序列比对情况也是2.5万条记录。由此可见，bowtie比对的时候，默认就是哪怕你的read比对到参考基因组多个位置，也只是输出一个比对记录。

至于它是否允许错配，可以进去看sam文件。

然后仔细检查bowtie的参数

软件参数详解必须看官网：http://bowtie-bio.sourceforge.net/manual.shtml 有几个重要的参数组合：

Example 1: -a
Example 2: -k 3
Example 3: -k 6
Example 4: default (-k 1)
Example 5: -a --best
Example 6: -a --best --strata
Example 7: -a -m 3
Example 8: -a -m 5
Example 9: -a -m 3 --best --strata

需要3选一的参数：

-a ，在允许的错配碱基数量下的全部可能的比对情况都输出
-k，如果出现多比对，输出可能的比对情况小于k的那些
-m，如果出现多比对，输出可能的比对情况大于m的那些

需要2选1的参数

-v或者-n
-v或者-n 指定允许最大错配数，为[0-3] ，因为允许错配，所以有更大的概率出现多比对情况

需要组合的参数

—best —strata 必须连起来使用，而且必须在 a,k,m 三选一

对miRNA来说的最佳参数

因为miRNA的测序片段很小，通常我们是不允许错配，参数选择如下：

bowtie -a --best --strata ../hsa-mature-bowtie-index -f tmp.fa -S tmp.fa.sam
bowtie -a --best --strata ../hsa-mature-bowtie-index tmp.fq -S tmp.fq.sam

$ bowtie -a --best --strata ../hsa-mature-bowtie-index -f tmp.fa -S tmp.fa.sam
# reads processed: 25000
# reads with at least one reported alignment: 20393 (81.57%)
# reads that failed to align: 4607 (18.43%)
Reported 21836 alignments

$bowtie -a --best --strata ../hsa-mature-bowtie-index tmp.fq -S tmp.fq.sam
# reads processed: 25000
# reads with at least one reported alignment: 20393 (81.57%)
# reads that failed to align: 4607 (18.43%)
Reported 21836 alignments

bowtie -n 0 -m1 --best --strata ../hsa-mature-bowtie-index -f tmp.fa -S tmp.fa.sam
bowtie -n 0 -m1 --best --strata ../hsa-mature-bowtie-index tmp.fq -S tmp.fq.sam

$bowtie -n 0 -m1 --best --strata ../hsa-mature-bowtie-index -f tmp.fa -S tmp.fa.sam
# reads processed: 25000
# reads with at least one reported alignment: 19157 (76.63%)
# reads that failed to align: 5554 (22.22%)
# reads with alignments suppressed due to -m: 289 (1.16%)
Reported 19157 alignments

$bowtie -n 0 -m1 --best --strata ../hsa-mature-bowtie-index tmp.fq -S tmp.fq.sam
# reads processed: 25000
# reads with at least one reported alignment: 19157 (76.63%)
# reads that failed to align: 5554 (22.22%)
# reads with alignments suppressed due to -m: 289 (1.16%)
Reported 19157 alignments

比较两次结果

bowtie ../hsa-mature-bowtie-index tmp.fq -S tmp.default.sam
bowtie -n 0 -m1 --best --strata ../hsa-mature-bowtie-index tmp.fq -S tmp.miRNA.sam

$cat tmp.miRNA.sam| grep -v '^@' |cut -f 2|sort |uniq -c
 20470 0
 54 16
 5554 4

$cat tmp.default.sam | grep -v '^@' |cut -f 2|sort |uniq -c
 20364 0
 29 16
 4607 4

之前使用bowtie默认参数，比对是20393 (81.57%)，现在换成miRNA最佳参数，是 19157 (76.63%)。

也就是说，添加参数，使得结果更为严苛，一些reads本来是可以比对成功的，现在失败了，我们就探索一下具体的几个reads看看。

paste tmp.default.sam tmp.miRNA.sam |grep -v '^@' |cut -f 1,2,3,6,16,17,18,21|awk '{if($2==0 && $6==4)print}'|wc

# 查看其中一个：
tmp.default.sam:SRR1542719.24997 0 hsa-let-7c-5p 1 255 22M * 0 0 TGAGATAGTAGGTTGTATGGTT ?CABAABBA::2:>BAAAC>C= XA:i:1 MD:Z:4G17 NM:i:1 XM:i:2
tmp.miRNA.sam:SRR1542719.24997 4 * 0 0 * * 0 0 TGAGATAGTAGGTTGTATGGTT ?CABAABBA::2:>BAAAC>C= XM:i:0

虽然都是22M，但是使用默认参数，会有一个错配，所以显示 MD:Z:4G17，但是加上了这个 -n 0 -m1 —best —strata 参数组合后，很明显它就不会被认为是成功比对啦！

文末友情宣传

强烈建议你推荐我们生信技能树给身边的博士后以及年轻生物学PI，帮助他们多一点数据认知，让科研更上一个台阶：

生信爆款入门-全球听（买一得五）（第4期），你的生物信息学入门课
数据挖掘第2期（两天变三周，实力加量），医学生/临床医师首选技能提高课
生信技能树的2019年终总结，你的生物信息学成长宝藏
2020学习主旋律，B站74小时免费教学视频为你领路，还等什么，看啊！！！

2025年8月
一	二	三	四	五	六	日
« 九
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31