Self-taught ChIP-seq analysis, part 5: aligning the sequencing reads

ulwvfje — Tue, 05 Jul 2016 00:42:39 +0000

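Alignment itself is simple in principle. New mapping tools keep appearing, but the ones most commonly used are BWA and bowtie. I picked bowtie2 here; other people have already benchmarked these tools against each other, so we can simply use it. The code is as follows: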

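## step5 : alignment to hg19 using bowtie2
## build the index once beforehand:
## ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ~/biosoft/bowtie/hg19_index/hg19.fa ~/biosoft/bowtie/hg19_index/hg19
## cat >run_bowtie2.sh
ls *.fastq | while read id ;
do
echo $id
# align single-end reads; bowtie2 prints its summary to stderr, saved here as the alignment log
~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 8 -x ~/biosoft/bowtie/hg19_index/hg19 -U $id -S ${id%%.*}.sam 2>${id%%.*}.align.log
# convert SAM to BAM, keeping only alignments with MAPQ >= 30
# optional: -F 1548, see https://broadinstitute.github.io/picard/explain-flags.html
# -F 0x4 alone removes the reads that didn't align
samtools view -bhS -q 30 ${id%%.*}.sam > ${id%%.*}.bam
# old samtools (0.1.x) syntax: the second argument is the output prefix, so this writes ${id%%.*}.sorted.bam
# with samtools 1.x use: samtools sort -o ${id%%.*}.sorted.bam ${id%%.*}.bam
samtools sort ${id%%.*}.bam ${id%%.*}.sorted
# samtools view -bhS a.sam | samtools sort -o - ./ > a.bam
samtools index ${id%%.*}.sorted.bam
done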

The index ~/biosoft/bowtie/hg19_index/hg19 has to be built beforehand; see the previous post.

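How exactly should the SAM file from the initial alignment be filtered? I went through many papers and none of them gave a clear rationale; each does it differently, so I cannot offer a standard either. I tested several filters and the called peaks differed very little, yet none of them reproduced the results reported in the paper!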

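In general, only uniquely mapped reads should be kept from the initial SAM file, so I used samtools view -bhS -q 30 (a MAPQ cutoff, which is a common proxy for unique mapping), but the results barely changed. Some people say peak callers already do this filtering themselves, so it depends on the tools you choose for the downstream analysis.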
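To make the -q 30 filter more concrete, here is a minimal sketch of what it selects, assuming a SAM text stream on stdin and a standard awk. The name count_mapq_pass is my own and not part of samtools; the function only counts records, and treating MAPQ >= 30 as "unique" is a convention rather than a guarantee.

```shell
# Hypothetical helper (not a samtools command): count mapped SAM records
# on stdin whose MAPQ is at least $1 (default 30).
count_mapq_pass() {
    awk -F'\t' -v min="${1:-30}" '
        /^@/ { next }                      # skip SAM header lines
        int($2 / 4) % 2 == 1 { next }      # FLAG bit 0x4 set: read is unmapped
        ($5 + 0) >= (min + 0) { n++ }      # column 5 of a SAM record is MAPQ
        END { print n + 0 }
    '
}

# Usage sketch (file name taken from the script above):
#   count_mapq_pass 30 < SRR1042593.sam
```

Comparing this count at a few thresholds (0, 10, 30) with the "aligned exactly 1 time" line in the bowtie2 log is a quick way to see how much a given cutoff actually removes.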

Here are the alignment logs:

SRR1042593.fastq
16902907 reads; of these:
16902907 (100.00%) were unpaired; of these:
667998 (3.95%) aligned 0 times
12467095 (73.76%) aligned exactly 1 time
3767814 (22.29%) aligned >1 times
96.05% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042594.fastq
60609833 reads; of these:
60609833 (100.00%) were unpaired; of these:
9165487 (15.12%) aligned 0 times
39360173 (64.94%) aligned exactly 1 time
12084173 (19.94%) aligned >1 times
84.88% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042595.fastq
14603295 reads; of these:
14603295 (100.00%) were unpaired; of these:
918028 (6.29%) aligned 0 times
10403045 (71.24%) aligned exactly 1 time
3282222 (22.48%) aligned >1 times
93.71% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042596.fastq
65911151 reads; of these:
65911151 (100.00%) were unpaired; of these:
10561790 (16.02%) aligned 0 times
42271498 (64.13%) aligned exactly 1 time
13077863 (19.84%) aligned >1 times
83.98% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042597.fastq
22210858 reads; of these:
22210858 (100.00%) were unpaired; of these:
1779568 (8.01%) aligned 0 times
15815218 (71.20%) aligned exactly 1 time
4616072 (20.78%) aligned >1 times
91.99% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042598.fastq
58068816 reads; of these:
58068816 (100.00%) were unpaired; of these:
8433671 (14.52%) aligned 0 times
37527468 (64.63%) aligned exactly 1 time
12107677 (20.85%) aligned >1 times
85.48% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042599.fastq
24019489 reads; of these:
24019489 (100.00%) were unpaired; of these:
1411095 (5.87%) aligned 0 times
17528479 (72.98%) aligned exactly 1 time
5079915 (21.15%) aligned >1 times
94.13% overall alignment rate
[samopen] SAM header is present: 93 sequences.
SRR1042600.fastq
76361026 reads; of these:
76361026 (100.00%) were unpaired; of these:
8442054 (11.06%) aligned 0 times
50918615 (66.68%) aligned exactly 1 time
17000357 (22.26%) aligned >1 times
88.94% overall alignment rate
[samopen] SAM header is present: 93 sequences.

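As you can see, the alignment went well: the overall alignment rates range from 83.98% to 96.05%. I am not going to present this as a table; I am not writing a report for a client, so please make do with the raw logs.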
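For anyone who does want a table, a sketch like the following could pull the numbers out of the logs. It assumes the input looks exactly like the output above, that is, a line ending in .fastq followed by the bowtie2 single-end summary; bowtie2_log_to_tsv is a hypothetical name, not an existing tool.

```shell
# Hypothetical helper: turn the concatenated log shown above into a TSV table.
# Reads the log on stdin and prints one row per sample.
bowtie2_log_to_tsv() {
    awk '
        BEGIN { OFS = "\t"; print "sample", "reads", "unaligned", "unique", "multi", "overall" }
        /\.fastq$/               { sample = $1 }   # line printed by "echo $id"
        /reads; of these:/       { reads  = $1 }
        /aligned 0 times/        { zero   = $1 }
        /aligned exactly 1 time/ { uniq   = $1 }
        /aligned >1 times/       { multi  = $1 }
        /overall alignment rate/ { print sample, reads, zero, uniq, multi, $1 }
    '
}

# Usage sketch (all_samples.log is a placeholder for wherever the log was saved):
#   bowtie2_log_to_tsv < all_samples.log
```

The [samopen] lines are ignored because they match none of the patterns.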

A new alignment tool: MOSAIK

ulwvfje — Tue, 15 Mar 2016 10:55:20 +0000

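Function: sequence alignment, similar to BWA and Bowtie

Advantage: works with every platform, and even supports long third-generation reads from PacBio

Algorithm: a hash index, unlike the BWT-based aligners

Website: https://github.com/wanpinglee/MOSAIK

Paper: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090581

Author: WP Lee, 2014 (cited 70 times)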

Overview:

MOSAIK is a stable, sensitive and open-source program for mapping second and 
third-generation sequencing reads to a reference genome. Uniquely among current 
mapping tools, MOSAIK can align reads generated by all the major sequencing 
technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, 
Ion Torrent and Pacific BioSciences SMRT.

1. Installing the software

Download: https://github.com/wanpinglee/MOSAIK/archive/master.zip

Download the archive, unpack it, change into the src directory, and run make.

The programs are then ready to use.

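There are four programs, so an alignment takes four steps:

MosaikBuild and MosaikJump build the index for the reference genome.

MosaikBuild is also needed to index the sequencing reads.

MosaikAligner aligns the two indexed data sets against each other.

MosaikText converts the alignment result into another, readable format, usually SAM.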

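2. Preparing the input data

Alignment needs the sequencing reads in FASTQ format and the reference genome in FASTA format.

I downloaded the data from http://odin.mdacc.tmc.edu/~xsu1/VirusSeq.html ; the reason I needed this software in the first place was to look for viral integration in the human genome.

The reads are paired-end (PE); the reference genome is virus plus human.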

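3. Running the commands

Below is a complete script.

First, build the index for the reference genome:

Mosaik_bin=~/bio-soft/MOSAIK/bin # set this to the directory where the programs were installed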

##for gib virus reference genome

$Mosaik_bin/MosaikBuild -fr gibVirus.fa -oa gibVirus.fa.bin -st illumina -assignQual 40

$Mosaik_bin/MosaikJump -ia gibVirus.fa.bin -out gibVirus.JumpDb -hs 15

These two steps build the hash index. For this collection of viral genomes (60M as a compressed file), the timings were:

MosaikBuild CPU time: 15.660 s, wall time: 18.146 s

MosaikJump CPU time: 329.031 s, wall time: 331.672 s

The time is acceptable, but the size of the index files it writes is harder to accept:

333M Mar 11 19:55 gibVirus.fa.bin

60M Aug 13 2013 gibVirus.fa.gz

5.0G Mar 11 20:04 gibVirus.JumpDb_keys.jmp

1 Mar 11 19:59 gibVirus.JumpDb_meta.jmp

1.3G Mar 11 20:04 gibVirus.JumpDb_positions.jmp
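To put a number on that, here is a small back-of-the-envelope calculation using only the sizes in the listing above. The sizes are the rounded, human-readable values from the listing, so the result is approximate; index_expansion is just a name I made up for the sketch.

```shell
# Rough arithmetic on the listing above; sizes are rounded values in MB.
index_expansion() {
    awk 'BEGIN {
        input_mb = 60                               # gibVirus.fa.gz
        index_mb = 333 + 5.0 * 1024 + 1.3 * 1024    # .fa.bin + _keys.jmp + _positions.jmp
        printf "%.1fG of index files, about %.0fx the %dM compressed input\n",
               index_mb / 1024, index_mb / input_mb, input_mb
    }'
}

index_expansion
```

The 1-byte _meta.jmp file is left out because it does not change the total.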

For the human hg19 genome, the timings were:

MosaikBuild CPU time: 183.642 s, wall time: 184.658 s

MosaikJump CPU time: 3985.608 s, wall time: 3995.323 s

A little over an hour in total; not bad.

With the reference genome indexed, the sequencing reads need to be indexed as well:

$Mosaik_bin/MosaikBuild -q L526401A_1.fq.gz -q2 L526401A_2.fq.gz -out L526401A.bin -st illumina

The data are paired-end, about 1.6G per file; building the index took:

# reads written: 53060622

# bases written: 5304891143

MosaikBuild CPU time: 388.969 s, wall time: 391.149 s
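As a quick sanity check on those two counters, the mean read length can be worked out by dividing bases by reads. This is a minimal sketch using only the numbers printed above; mean_read_length is a hypothetical helper name.

```shell
# Mean read length = bases written / reads written.
# $1 = number of bases, $2 = number of reads
mean_read_length() {
    awk -v bases="$1" -v reads="$2" 'BEGIN { printf "%.2f\n", bases / reads }'
}

mean_read_length 5304891143 53060622   # prints 99.98
```

So the reads are essentially 100 bases long, with a few shorter ones pulling the mean slightly below 100.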

Next comes the alignment itself:

ANN_PATH=~/bio-soft/MOSAIK/src/networkFile

$Mosaik_bin/MosaikAligner -in L526401A.bin \
-out L526401A.bin.aligned \
-ia ../Mosaik_JumpDb/hg19Virus.fa.bin \
-j ../Mosaik_JumpDb/hg19Virus.JumpDb \
-annpe $ANN_PATH/2.1.26.pe.100.0065.ann -annse $ANN_PATH/2.1.26.se.100.005.ann

The alignment result is the file L526401A.bin.aligned, but it still has to be converted to SAM format with MosaikText before it can be read:

$Mosaik_bin/MosaikText -in L526401A.bin.aligned -sam L526401A.bin.aligned.sam -u

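The GitHub repository also includes test data; run through it once and everything will make sense.

4. Interpreting the results

The output is in SAM format, so no further explanation is needed.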