生信菜鸟团 » bowtie

Bowtie算法第六讲-tally法对bwt索引进行搜索

ulwvfje — Sat, 18 Apr 2015 14:46:04 +0000

因为要讲搜索，所以我选择了一个长一点的字符串来演示多种情况的搜索

perl rotation_one_by_one.pl atgtgtcgtagctcgtnncgt

程序运行的结果如下

$ATGTGTCGTAGCTCGTNNCGT 21

AGCTCGTNNCGT$ATGTGTCGT 9

ATGTGTCGTAGCTCGTNNCGT$ 0

CGT$ATGTGTCGTAGCTCGTNN 18

CGTAGCTCGTNNCGT$ATGTGT 6

CGTNNCGT$ATGTGTCGTAGCT 13

CTCGTNNCGT$ATGTGTCGTAG 11

GCTCGTNNCGT$ATGTGTCGTA 10

GT$ATGTGTCGTAGCTCGTNNC 19

GTAGCTCGTNNCGT$ATGTGTC 7

GTCGTAGCTCGTNNCGT$ATGT 4

GTGTCGTAGCTCGTNNCGT$AT 2

GTNNCGT$ATGTGTCGTAGCTC 14

NCGT$ATGTGTCGTAGCTCGTN 17

NNCGT$ATGTGTCGTAGCTCGT 16

T$ATGTGTCGTAGCTCGTNNCG 20

TAGCTCGTNNCGT$ATGTGTCG 8

TCGTAGCTCGTNNCGT$ATGTG 5

TCGTNNCGT$ATGTGTCGTAGC 12

TGTCGTAGCTCGTNNCGT$ATG 3

TGTGTCGTAGCTCGTNNCGT$A 1

TNNCGT$ATGTGTCGTAGCTCG 15

它的BWT及索引是

T 21

T 9

$ 0

N 18

T 6

T 13

G 11

A 10

C 19

C 7

T 4

T 2

C 14

N 17

T 16

G 20

G 8

G 5

C 12

G 3

A 1

G 15

然后得到它的tally文件如下

接下来用我们的perl程序在里面找字符串

第一次我测试 GTGTCG 这个字符串，程序可以很清楚的看到它的查找过程。

perl search_char.pl GTGTCG tm.tally

your last char is G

start is 7 ; and end is 13

now it is number 5 and the char is C

start is 3 ; and end is 6

now it is number 4 and the char is T

start is 17 ; and end is 19

now it is number 3 and the char is G

start is 10 ; and end is 11

now it is number 2 and the char is T

start is 19 ; and end is 20

now it is number 1 and the char is G

start is 11 ; and end is 12

It is just one perfect match !

The index is 2

第二次我测试一个多重匹配的字符串GT，在原字符串出现了五次的

perl search_char.pl GT tm.tally

your last char is T

start is 15 ; and end is 22

now it is number 1 and the char is G

start is 8 ; and end is 13

we find more than one perfect match!!!

8 13

One of the index is 11

One of the index is 10

One of the index is 19

One of the index is 7

One of the index is 4

One of the index is 2

One of the index is 14

惨了，这个是很严重的bug，不知道为什么，对于多个匹配总是会多出那么一点点的结果。

去转换矩阵里面查看，可知，前面两个结果11和10是错误的。

CTCGTNNCGT$ATGTGTCGTAG 11

GCTCGTNNCGT$ATGTGTCGTA 10

GT$ATGTGTCGTAGCTCGTNNC 19

GTAGCTCGTNNCGT$ATGTGTC 7

GTCGTAGCTCGTNNCGT$ATGT 4

GTGTCGTAGCTCGTNNCGT$AT 2

GTNNCGT$ATGTGTCGTAGCTC 14

最后我们测试未知字符串的查找。

perl search_char.pl ACATGTGT tm.tally

your last char is T

start is 15 ; and end is 22

now it is number 7 and the char is G

start is 8 ; and end is 13

now it is number 6 and the char is T

start is 19 ; and end is 21

now it is number 5 and the char is G

start is 11 ; and end is 12

now it is number 4 and the char is T

start is 20 ; and end is 21

now it is number 3 and the char is A

start is 2 ; and end is 3

now it is number 2 and the char is C

start is 3 ; and end is 3

we can just find the last 6 char ,and it is ATGTGT

原始字符串是ATGTGTCGTAGCTCGTNNCGT，所以查找的挺对的！！！

[perl]

$a=$ARGV[0];

$a=uc $a;

open FH,"<$ARGV[1]";

while(){

chomp;

@F=split;

$hash_count_atcg{$F[0]}++;

$hash{$.}=$_;

# the first line is $ and the last char and the last index !

}

$all_a=$hash_count_atcg{'A'};

$all_c=$hash_count_atcg{'C'};

$all_g=$hash_count_atcg{'G'};

$all_n=$hash_count_atcg{'N'};

$all_t=$hash_count_atcg{'T'};

#print "$all_a\t$all_c\t$all_g\t$all_t\n";

$len_a=length $a;

$end_a=$len_a-1;

#print "your query is $a\n";

#print "and the length of your query is $len_a \n";

$after=substr($a,$end_a,1);

#we fill search your query from the last char !

if ($after eq 'A') {

$start=2;

$end=$all_a+1;

}

elsif ($after eq 'C') {

$start=$all_a+1;

$end=$all_a+$all_c+1;

}

elsif ($after eq 'G') {

$start=$all_a+$all_c+1;

$end=$all_a+$all_c+$all_g+1;

}

elsif ($after eq 'T'){

$start=$all_a+$all_c+$all_g+$all_n+1;

$end=$all_a+$all_c+$all_g+$all_t+$all_n+1;

}

else {die "error !!! we just need A T C G !!!\n"}

print "your last char is $after\n ";

print "start is $start ; and end is $end \n";

foreach (reverse (1..$end_a)){

$after=substr($a,$_,1);

$before=substr($a,$_-1,1);

($start,$end)=&find_level($after,$before,$start,$end);

print "now it is number $_ and the char is $before \n ";

print "start is $start ; and end is $end \n";

if ($_ > 1 && $start == $end) {

$find_char=substr($a,$_);

$find_len=length $find_char;

print "we can just find the last $find_len char ,and it is $find_char \n";

#return "miss";

last;

}

if ($_ == 1) {

if (($end-$start)==1) {

print "It is just one perfect match ! \n";

my @F_start=split/\s+/,$hash{$end};

print "The index is $F_start[1]\n";

#return $F_start[1];

last;

}

else {

print "we find more than one perfect match!!!\n";

print "$start\t$end\n";

foreach (($start-1)..$end) {

my @F_start=split/\s+/,$hash{$_};

print "One of the index is $F_start[1]\n";

}

#return "multiple";

last;

}

sub find_level{

my($after,$before,$start,$end)=@_;

my @F_start=split/\s+/,$hash{$start};

my @F_end=split/\s+/,$hash{$end};

if ($before eq 'A') {

return ($F_start[2]+1,$F_end[2]+1);

}

elsif ($before eq 'C') {

return ($all_a+$F_start[3]+1,$all_a+$F_end[3]+1);

}

elsif ($before eq 'G') {

return ($all_a+$all_c+1+$F_start[4],$all_a+$all_c+1+$F_end[4]);

}

elsif ($before eq 'T') {

return ($all_a+$all_c+$all_g+$all_n+1+$F_start[5],$all_a+$all_c+$all_g+1+$all_n+$F_end[5]);

}

else {die "error !!! we just need A T C G !!!\n"}

}

[/perl]

原始字符串是atgtgtcgtagctcgtnncgt

Bowtie 算法第四讲

ulwvfje — Sat, 18 Apr 2015 11:51:18 +0000

由于之前就简单的看了看bowtie作者的ppt，没有完全吃透就开始敲代码了，写了十几个程序最后我自己都搞不清楚进展到哪一步了，所以我现在整理一下，从新开始！！！

首先，bowtie的作用就是在一个大字符串里面搜索一个小字符串！那么本身就有一个非常笨的复杂方法来搜索，比如，大字符串长度为100万，小字符串为10，那么就依次取出大字符串的10个字符来跟小字符串比较即可，这样的算法是非常不经济的，我简单用perl代码实现一下。

[perl]

#首先读取大字符串的fasta文件

open FH ,"<$ARGV[0]";

$i=0;

while () {

next if /^>/;

chomp;

$a.=(uc);

}

#print "$a\n";

#然后接受我们的小的查询字符串

$query=uc $ARGV[1];

$len=length $a;

$len_query=length $query;

$a=$a.'$'.$a;

#然后依次循环取大字符串来精确比较！

foreach (0..$len-1){

if (substr($a,$_,$len_query) eq $query){

print "$_\n";

#last;

}

[/perl]

这样在时间复杂度非常恐怖，尤其是对人的30亿碱基。

正是因为这样的查询效率非常低，所以我们才需要用bwt算法来构建索引，然后根据tally来进行查询

其中构建索引有三种方式，我首先讲最效率最低的那种索引构造算法，就是依次取字符串进行旋转，然后排序即可。

[perl]

$a=uc $ARGV[0];

$len=length $a;

$a=$a.'$'.$a;

foreach (0..$len){

$hash{substr($a,$_,$len+1)}=$_;

}

#print "$_\t$hash{$_}\n" foreach sort keys %hash;

print substr($_,-1),"\t$hash{$_}\n" foreach sort keys %hash;

[/perl]

这个算法从时间复杂度来讲是非常经济的，对小字符串都是瞬间搞定！！！

perl rotation_one_by_one.pl atgcgtanngtc 这个字符串的BWT矩阵索引如下！

C 12

T 6

$ 0

T 11

G 3

T 2

C 4

N 9

N 8

A 7

G 5

G 10

A 1

但同样的，它也有一个无法避免的弊端，就是内存消耗太恐怖。对于30亿的人类碱基来说，这样旋转会生成30亿乘以30亿的大矩阵，一般的服务器根本hold不住的。

最后我讲一下，这个BWT矩阵索引如何还原成原字符串，这个没有算法的差别，因为就是很简单的原理。

[perl]

#first read the tally !!!

#首先读取上面输出的BWT矩阵索引文件。

open FH,"<$ARGV[0]";

$hash_count{'A'}=0;

$hash_count{'C'}=0;

$hash_count{'G'}=0;

$hash_count{'T'}=0;

while(){

chomp;

@F=split;

$hash_count{$F[0]}++;

$hash{$.}="$F[0]\t$F[1]\t$hash_count{$F[0]}";

#print "$hash{$.}\n";

}

$all_a=$hash_count{'A'};

$all_c=$hash_count{'C'};

$all_g=$hash_count{'G'};

$all_t=$hash_count{'T'};

$all_n=$hash_count{'N'};

#start from the first char !

$raw='';

&restore(1);

sub restore{

my($num)=@_;

my @F=split/\t/,$hash{$num};

$raw.=$F[0];

my $before=$F[0];

if ($before eq 'A') {

$new=$F[2]+1;

}

elsif ($before eq 'C') {

$new=1+$all_a+$F[2];

}

elsif ($before eq 'G') {

$new=1+$all_a+$all_c+$F[2];

}

elsif ($before eq 'N') {

$new =1+$all_a+$all_c+$all_g+$F[2];

}

elsif ($before eq 'T') {

$new=1+$all_a+$all_c+$all_g+$all_n+$F[2];

}

elsif ($before eq '$') {

chop $raw;

$raw = reverse $raw;

print "$raw\n";

exit;

}

else {die "error !!! we just need A T C N G !!!\n"}

#print "$F[0]\t$new\n";

&restore($new);

}

[/perl]

自己动手写bowtie第三讲：序列查询。

ulwvfje — Sat, 28 Mar 2015 02:24:31 +0000

查询需要根据前面建立的索引来做。

这是一个比较复杂的过程，我也是看了bowtie的作者的ppt才慢慢弄懂的，感觉自己也不可能三言两语就说清楚，一般都是辅助图片，动画，再经过多方交流才能慢慢理解。

所以大家呢，就自己去看ppt，看懂那个查询算法。（ppt及代码在我的群里面有共享，欢迎大家加群交流）

这里我简单讲讲我的程序

首先读取索引文件，统计好A,C,G,T的总数

然后把查询序列从最后一个字符往前面回溯。

我创建了一个子函数，专门来处理回溯的问题

每次接受四个参数（左右两端的碱基，上下的阈值），并返回两个参数（新的上下两个阈值）

大家要看懂阈值是如何更新迭代，这样动态的一个个回溯字符串，一个个迭代阈值。

直到四种临界情况的出现。

第一是上下阈值已经相等了，但是我们还没有回溯完全，那就说明字符串只能查找后几个字符，前面还有字符是无法匹配的

第二种情况是上下阈值已经相等了，正巧我们也回溯到了最后一个字符串，那么我们就找到了精确匹配。

第三种情况是已经进行到了最后一个字符串，但是上下阈值还有差值，那么就找到了多个精确匹配点。

最后一种情况是各种非法字符。

然后我简单的测序了一下在病毒的5K基因组里面的精确匹配情况，好像效果还挺好的

但是在酵母里面还有一个问题没有解决，就是取前二十个字符串排序的问题，不够精确，需要重新审视排序结果进行局部优化，可能是需要用堆排序发，具体我还得考虑一个星期，只能等下周上课再看看了，平时太忙了，基本没时间码代码。

这里贴上我的代码给大家看看，

[perl]

$a='CGCTATGTACTGGATGCGCTGGCAAACGAGCCTGCCGTAAG';

while(<>){

chomp;

@F=split;

$hash_count_atcg{$F[0]}++;

$hash{$.}=$_;

}

$all_a=$hash_count_atcg{'A'};

$all_c=$hash_count_atcg{'C'};

$all_g=$hash_count_atcg{'G'};

$all_t=$hash_count_atcg{'T'};

#print "$all_a\t$all_c\t$all_g\t$all_t\n";

$len_a=length $a;

$end_a=$len_a-1;

print "your query is $a\n";

print "and the length of your query is $len_a \n";

foreach (reverse (0..$end_a)){

$after=substr($a,$_,1);

$before=substr($a,$_-1,1);

#对第一个字符进行找阈值的时候，我们需要人为的定义起始点！

if($_ == $end_a){

if ($after eq 'A') {

$start=1;

$end=$all_a;

}

elsif ($after eq 'C') {

$start=$all_a+1;

$end=$all_a+$all_c;

}

elsif ($after eq 'G') {

$start=$all_a+$all_c+1;

$end=$all_a+$all_c+$all_g;

}

elsif ($after eq 'T'){

$start=$all_a+$all_c+$all_g+1;

$end=$all_a+$all_c+$all_g+$all_t;

}

else {print "error !!! we just need A T C G !!!\n";exit;}

}

#如果阈值已经无法继续分割，但是字符串还未查询完

if ($_ > 0 && $start == $end) {

$find_char=substr($a,$_);

$find_len=length $find_char;

#这里需要修改，但是不影响完全匹配了

print "we can just find the last $find_len char ,and it is $find_char \n";

exit;

}

#如果进行到了最后一个字符

if ($_ == 0) {

if ($start == $end) {

print "It is just one perfect match ! \n";

my @F_start=split/\s+/,$hash{$start};

print "The index is $F_start[1]\n";

exit;

}

else {

print "we find more than one perfect match!!!\n";

#print "$start\t$end\n";

foreach ($start..$end) {

my @F_start=split/\s+/,$hash{$_};

print "One of the index is $F_start[1]\n";

}

exit;

}

($start,$end)=&find_level($after,$before,$start,$end);

}

sub find_level{

my($after,$before,$start,$end)=@_;

my @F_start=split/\s+/,$hash{$start};

my @F_end=split/\s+/,$hash{$end};

if ($before eq 'A') {

return ($F_start[2],$F_end[2]);

}

elsif ($before eq 'C') {

return ($all_a+$F_start[3],$all_a+$F_end[3]);

}

elsif ($before eq 'G') {

return ($all_a+$all_c+$F_start[4],$all_a+$all_c+$F_end[4]);

}

elsif ($before eq 'T') {

return ($all_a+$all_c+$all_g+$F_start[5],$all_a+$all_c+$all_g+$F_end[5]);

}

else {print "sorry , I can't find the right match!!!\n";}

}

#perl -alne '{next if />/;$all.=$_;}END{print substr($all,308,10)}' lambda_virus.fa

[/perl]

Snp-calling流程（BWA+SAMTOOLS+BCFTOOLS）

ulwvfje — Mon, 23 Mar 2015 12:20:25 +0000

比对可以选择BWA或者bowtie，测序数据可以是单端也可以是双端，我这里简单讲一个，但是脚本都列出来了。而且我选择的是bowtie比对，然后单端数据。

首先进入hg19的目录，对它进行两个索引

samtools faidx hg19.fa

Bowtie2-build hg19.fa hg19

我这里随便从26G的测序数据里面选取了前1000行做了一个tmp.fa文件，进入tmp.fa这个文件的目录进行操作

Bowtie的使用方法详解见http://www.bio-info-trainee.com/?p=398

bowtie2 -x ../../../ref-database/hg19 -U tmp1.fa -S tmp1.sam

samtools view -bS tmp1.sam > tmp1.bam

samtools sort tmp1.bam tmp1.sorted

samtools index tmp1.sorted.bam

samtools mpileup -d 1000 -gSDf ../../../ref-database/hg19.fa tmp1.sorted.bam |bcftools view -cvNg - >tmp1.vcf

然后就能看到我们产生的vcf变异格式文件啦！

当然，我们可能还需要对VCF文件进行再注释！

要看懂以上流程及命令，需要掌握BWA，bowtie，samtools，bcftools，

数据格式fasta，fastq，sam，vcf，pileup

如果是bwa把参考基因组索引化，然后aln得到后缀树，然后sampe对双端数据进行比对

首先bwa index 然后选择算法，进行索引。

然后aln脚本批量处理

==> bwa_aln.sh <==

while read id

echo $id

bwa aln hg19.fa $id >$id.sai

done <$1

然后sampe脚本批量处理

==> bwa_sampe.sh <==

while read id

echo $id

bwa sampe hg19.fa $id*sai $id*single >$id.sam

done <$1

然后是samtools的脚本

==> samtools.sh <==

while read id

echo $id

samtools view -bS $id.sam > $id.bam

samtools sort $id.bam $id.sorted

samtools index $id.sorted.bam

done <$1

然后是bcftools的脚本

==> bcftools.sh <==

while read id

echo $id

samtools mpileup -d 1000 -gSDf ref.fa $id*sorted.bam |bcftools view -cvNg - >$id.vcf

done <$1

==> mpileup.sh <==

while read id

echo $id

samtools mpileup -d 100000 -f hg19.fa $id*sorted.bam >$id.mpileup

done <$1

自己动手写bowtie第一讲：BWT算法详解并建立索引

ulwvfje — Fri, 20 Mar 2015 12:27:42 +0000

首先，什么是BWT，可以参考博客

http://www.cnblogs.com/xudong-bupt/p/3763814.html

他讲的非常好。

一个长度为n的串A1A2A3...An经过旋转可以得到

A1A2A3...An

A2A3...AnA1

A3...AnA1A2

...

AnA1A2A3...

n个串，每个字符串的长度都是n。

对这些字符串进行排序，这样它们之前的顺序就被打乱了，打乱的那个顺序就是index，需要输出。

首先我们测试一个简单的字符串acaacg$,总共六个字符，加上一个$符号，下次再讲$符号的意义。

实现以上功能是比较简单的，代码如下

但是这是对于6个字符串等小片段字符串，如果是是几千万个字符的字符串，这样转换就会输出千万的平方个字符串组成的正方形数组，是很恐怖的数据量。所以在转换的同时就不能把整个千万字符储存在内存里面。

在生物学领域，是这样的，这千万个千万个碱基的方阵，我们取每个字符串的前20个字符串就足以对它们进行排序，当然这只是近视的，我后面会讲精确排序，而且绕过内存的方法。

Perl程序如下

[perl]

while (<>){

next if />/;

chomp;

$a.=$_;

}

$a.='$';

$len=length $a;

$i=0;

print "first we transform it !!!\n";

foreach (0..$len-1){

$up=substr($a,0,$_);

$down=substr($a,$_);

#print "$down$up\n";

#$hash{"$down$up"}=$i;

$key=substr("$down$up",0,20);

$key=$key.”\t”.substr("$down$up",$len-1);

$hash{$key}=$i;

$i++;

}

print "then we sort it\n";

foreach (sort keys %hash){

$first=substr($_,0,1);

$len=length;

$last=substr($_,$len-1,1);

#print "$first\t$last\t$hash{$_}\n";

print "$_\t$hash{$_}\n";

}

[/perl]

运行的结果如下

个人觉得这样排序是极好的，但是暂时还没想到如何解决不够精确的问题！！！

参考：

http://tieba.baidu.com/p/1504205984

http://www.cnblogs.com/xudong-bupt/p/3763814.html

bowtie简单使用

ulwvfje — Fri, 20 Mar 2015 07:49:25 +0000

首先进入bowtie的主页，千万要谷歌！！！

http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

主页里面有下载链接，也有索引文件，当然索引文件只有人类等模式生物

下载之后是个压缩包，解压即可使用

可以看到绿色的就是命令，可以添加到环境变量使用，也可以直接用全路径使用它！！！

然后example文件夹里面有所有的测试文件。

二、准备数据

我们就用软件自带的测试数据

三、运行命令

分为两步，首先索引，然后比对！！！

索引，bowtie2-build your-fastq-file.fa your-index-name

然后你的目录就产生了六个索引文件，我给索引取名是tmp，你们可以随便取名字

然后比对，分两种，一是单端测序数据，二是双端数据

重点参数的-x 和 –S ，单端是-U 双端是-1 -2

Bowtie –x tmp –U reads.fa –S hahahhha.sam

Bowtie –x tmp -1 reads1.fa -2 reads2.fa –S hahahha.sam

四：输出文件解读

就是输出了sam文件咯，这个就看我的sam文件格式讲解哈

转录组比对软件tophat的使用

ulwvfje — Sat, 14 Mar 2015 12:52:49 +0000

转录组比对软件tophat的使用

为什么要用这个软件？：因为转录组reads比对到基因组reads用bwa和bowtie的效果都不够好，所以我们选择tophat

它做了什么？：tophat把测序的转录组的原始reads比对到了参考基因组上面，并且输出了bam（二进制的sam）文件比对结果给我们。（fastq--->bam）

一：下载安装该软件

其实一般的生信服务器自然会有高手给安装好了，你只需调用即可，这里我给大家演示一下如何安装。

wget http://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.13.Linux_x86_64.tar.gz

我这里简单下载二进制版本的，直接解压即可使用，里面除了tophat之外，还有好几个小工具，也是挺实用的。

二：准备数据

数据当然是测序的转录组的原始reads啦，fastq格式的

还有人的参考基因组，这里选择人的hg19，很容易就下载了，bowtie官网里面有下载并且索引好了所有文件。如果你是自己下载的hg19，需要用bowtie先进行索引，见我的bowtie简单使用教程

三：运行命令

不需要看那么多的参数，首先要能用

/home/jmzeng/bio-soft/tophat-2.0.13.Linux_x86_64/tophat2 /home/jmzeng/ref-database/hg19 case1.fq

拆开看就是tophat2 hg19 case1.fq 即可，不需要设置任何参数，等需要优化的时候再去思考参数的意义，我这里是单端测序的命令。

不过我的习惯是用脚本解决问题，一定要批量化运行的，我设置了30个CPU，也设置了输出目录，还加上了一个转录本

也许是要等明天才能看结果啦

好像用了30个cpu跑的很快，每个两个多小时就跑完啦

四：输出文件解读

这三个文件夹里面就输出文件，每个样本一个输出文件夹，

任意进入一个文件夹可以看到

通常，我们只需要那个accepted_hits.bam文件，是我们的测序reads成功比对到hg19什么的比对情况。

接下来就可以用很多软件来处理这个bam文件了