unique | 生信菜鸟团

很多时候，我们都要选取unique mapped的reads，尤其是在RNA-seq和CHIP-seq的时候，但是如何保留，各种教程都不一致，我稍微总结了一下，是因为使用的比对工具不一样导致的！但是主要都反应在sam文件的一系列tag里面~

首先对bwa来说，如果它遇到一个reads可以比对到参考基因在的多个序列，只会随机的选取一个位置来输出到sam文件，但是会加上一个tag是XS:I:<N>来告诉我们第二好的比对情况的比对得分是多少，bowtie也是一样。但是它们都有参数来决定是否只对每个reads输出一条信息，还是输出全部的信息，在bwa是-a的参数，在bowtie里面是-m参数。

但是bowtie2里面取消了这个参数，它们都必须用XS:I:<N>这个tag来挑选unique mapped的reads

但是如果是用hisat来比对的话，决定是否是唯一比对的却是NH这个tag信息。默认情况下一条reads可以输出多条比对结果。

我想起了再补充吧，其实应该找几个例子用IGV看看，就明白了，可是我暂时没有时间了，只是觉得这个很重要，就提一下。

比对本质是是很简单的了，各种mapping工具层出不穷，我们一般常用的就是BWA和bowtie了，我这里就挑选bowtie2吧，反正别人已经做好了各种工具效果差异的比较，我们直接用就好了，代码如下：

## step5 : alignment to hg19/ using bowtie2 to do alignment
## ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ~/biosoft/bowtie/hg19_index /hg19.fa ~/biosoft/bowtie/hg19_index/hg19
## cat >run_bowtie2.sh
ls *.fastq | while read id ;
do
echo $id
#~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 8 -x ~/biosoft/bowtie/hg19_index/hg19 -U $id -S ${id%%.*}.sam 2>${id%%.*}.align.log;
#samtools view -bhS -q 30 ${id%%.*}.sam > ${id%%.*}.bam ## -F 1548 https://broadinstitute.github.io/picard/explain-flags.html
# -F 0x4 remove the reads that didn't match
samtools sort ${id%%.*}.bam ${id%%.*}.sort ## prefix for the output
# samtools view -bhS a.sam | samtools sort -o - ./ > a.bam
samtools index ${id%%.*}.sorted.bam
done

这个索引~/biosoft/bowtie/hg19_index/hg19需要自己提取建立好，见前文

初步比对的sam文件到底该如何过滤，我查了很多文章都没有给出个子丑寅卯，各执一词，我也没办法给大家一个标准，反正我测试了好几种，看起来call peaks的差异不大，就是得不到文章给出的那些结果！！

一般来说，初步比对的sam文件只能选取unique mapping的结果，所以我用了#samtools view -bhS -q 30，但是结果并没什么改变，有人说是peak caller这些工具本身就会做这件事，所以取决于你下游分析所选择的工具。

给大家看比对的日志吧：

SRR1042593.fastq
16902907 reads; of these:
16902907 (100.00%) were unpaired; of these:
667998 (3.95%) aligned 0 times
12467095 (73.76%) aligned exactly 1 time
3767814 (22.29%) aligned >1 times
96.05% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042594.fastq
60609833 reads; of these:
60609833 (100.00%) were unpaired; of these:
9165487 (15.12%) aligned 0 times
39360173 (64.94%) aligned exactly 1 time
12084173 (19.94%) aligned >1 times
84.88% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042595.fastq
14603295 reads; of these:
14603295 (100.00%) were unpaired; of these:
918028 (6.29%) aligned 0 times
10403045 (71.24%) aligned exactly 1 time
3282222 (22.48%) aligned >1 times
93.71% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042596.fastq
65911151 reads; of these:
65911151 (100.00%) were unpaired; of these:
10561790 (16.02%) aligned 0 times
42271498 (64.13%) aligned exactly 1 time
13077863 (19.84%) aligned >1 times
83.98% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042597.fastq
22210858 reads; of these:
22210858 (100.00%) were unpaired; of these:
1779568 (8.01%) aligned 0 times
15815218 (71.20%) aligned exactly 1 time
4616072 (20.78%) aligned >1 times
91.99% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042598.fastq
58068816 reads; of these:
58068816 (100.00%) were unpaired; of these:
8433671 (14.52%) aligned 0 times
37527468 (64.63%) aligned exactly 1 time
12107677 (20.85%) aligned >1 times
85.48% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042599.fastq
24019489 reads; of these:
24019489 (100.00%) were unpaired; of these:
1411095 (5.87%) aligned 0 times
17528479 (72.98%) aligned exactly 1 time
5079915 (21.15%) aligned >1 times
94.13% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042600.fastq
76361026 reads; of these:
76361026 (100.00%) were unpaired; of these:
8442054 (11.06%) aligned 0 times
50918615 (66.68%) aligned exactly 1 time
17000357 (22.26%) aligned >1 times
88.94% overall alignment rate
[samopen] SAM header is present: 93 sequences.

可以看到比对非常成功！！！我这里就不用表格的形式来展现了，毕竟我又不是给客户写报告，大家就将就着看吧。

生信菜鸟团

欢迎去论坛biotrainee.com留言参与讨论，或者关注同名微信公众号biotrainee

Tag Archives: unique

关于multiple mapping我想说的

自学CHIP-seq分析第五讲~测序数据比对

2025年10月
一	二	三	四	五	六	日
« 九
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31