十一 23

关于multiple mapping我想说的

很多时候,我们都要选取unique mapped的reads,尤其是在RNA-seq和CHIP-seq的时候,但是如何保留,各种教程都不一致,我稍微总结了一下,是因为使用的比对工具不一样导致的!但是主要都反应在sam文件的一系列tag里面~

首先对bwa来说,如果它遇到一个reads可以比对到参考基因在的多个序列,只会随机的选取一个位置来输出到sam文件,但是会加上一个tag是XS:I:<N>来告诉我们第二好的比对情况的比对得分是多少,bowtie也是一样。但是它们都有参数来决定是否只对每个reads输出一条信息,还是输出全部的信息,在bwa是-a的参数,在bowtie里面是-m参数。

但是bowtie2里面取消了这个参数,它们都必须用XS:I:<N>这个tag来挑选unique mapped的reads

但是如果是用hisat来比对的话,决定是否是唯一比对的却是NH这个tag信息。默认情况下一条reads可以输出多条比对结果。

我想起了再补充吧,其实应该找几个例子用IGV看看,就明白了,可是我暂时没有时间了,只是觉得这个很重要,就提一下。

 

05

自学CHIP-seq分析第五讲~测序数据比对

比对本质是是很简单的了,各种mapping工具层出不穷,我们一般常用的就是BWA和bowtie了,我这里就挑选bowtie2吧,反正别人已经做好了各种工具效果差异的比较,我们直接用就好了,代码如下:

## step5 : alignment to hg19/ using bowtie2 to do alignment
## ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ~/biosoft/bowtie/hg19_index /hg19.fa ~/biosoft/bowtie/hg19_index/hg19
## cat >run_bowtie2.sh
ls *.fastq | while read id ;
do
echo $id
#~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 8 -x ~/biosoft/bowtie/hg19_index/hg19 -U $id -S ${id%%.*}.sam 2>${id%%.*}.align.log;
#samtools view -bhS -q 30 ${id%%.*}.sam > ${id%%.*}.bam ## -F 1548 https://broadinstitute.github.io/picard/explain-flags.html
# -F 0x4 remove the reads that didn't match
samtools sort ${id%%.*}.bam ${id%%.*}.sort ## prefix for the output
# samtools view -bhS a.sam | samtools sort -o - ./ > a.bam
samtools index ${id%%.*}.sorted.bam
done

这个索引~/biosoft/bowtie/hg19_index/hg19需要自己提取建立好,见前文

初步比对的sam文件到底该如何过滤,我查了很多文章都没有给出个子丑寅卯,各执一词,我也没办法给大家一个标准,反正我测试了好几种,看起来call peaks的差异不大,就是得不到文章给出的那些结果!!

一般来说,初步比对的sam文件只能选取unique mapping的结果,所以我用了#samtools view -bhS -q 30,但是结果并没什么改变,有人说是peak caller这些工具本身就会做这件事,所以取决于你下游分析所选择的工具。

给大家看比对的日志吧:

SRR1042593.fastq
16902907 reads; of these:
16902907 (100.00%) were unpaired; of these:
667998 (3.95%) aligned 0 times
12467095 (73.76%) aligned exactly 1 time
3767814 (22.29%) aligned >1 times
96.05% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042594.fastq
60609833 reads; of these:
60609833 (100.00%) were unpaired; of these:
9165487 (15.12%) aligned 0 times
39360173 (64.94%) aligned exactly 1 time
12084173 (19.94%) aligned >1 times
84.88% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042595.fastq
14603295 reads; of these:
14603295 (100.00%) were unpaired; of these:
918028 (6.29%) aligned 0 times
10403045 (71.24%) aligned exactly 1 time
3282222 (22.48%) aligned >1 times
93.71% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042596.fastq
65911151 reads; of these:
65911151 (100.00%) were unpaired; of these:
10561790 (16.02%) aligned 0 times
42271498 (64.13%) aligned exactly 1 time
13077863 (19.84%) aligned >1 times
83.98% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042597.fastq
22210858 reads; of these:
22210858 (100.00%) were unpaired; of these:
1779568 (8.01%) aligned 0 times
15815218 (71.20%) aligned exactly 1 time
4616072 (20.78%) aligned >1 times
91.99% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042598.fastq
58068816 reads; of these:
58068816 (100.00%) were unpaired; of these:
8433671 (14.52%) aligned 0 times
37527468 (64.63%) aligned exactly 1 time
12107677 (20.85%) aligned >1 times
85.48% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042599.fastq
24019489 reads; of these:
24019489 (100.00%) were unpaired; of these:
1411095 (5.87%) aligned 0 times
17528479 (72.98%) aligned exactly 1 time
5079915 (21.15%) aligned >1 times
94.13% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042600.fastq
76361026 reads; of these:
76361026 (100.00%) were unpaired; of these:
8442054 (11.06%) aligned 0 times
50918615 (66.68%) aligned exactly 1 time
17000357 (22.26%) aligned >1 times
88.94% overall alignment rate
[samopen] SAM header is present: 93 sequences.

可以看到比对非常成功!!!我这里就不用表格的形式来展现了,毕竟我又不是给客户写报告,大家就将就着看吧。