Bioinformatics Rookies (生信菜鸟团) » samtools

Don't take bioinformatics tools for granted: read the docs and search often!

ulwvfje — Mon, 06 Feb 2017 02:35:42 +0000

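I've recently been writing a fun post: one figure that makes clear how WGS, WES, RNA-seq and ChIP-seq are alike and how they differ.

That needs some test data. I plan to use the region chr17:40437407-40486397 (roughly 49 kb) as the example, so I need to extract the alignments in that region from each BAM.

I dug out public WGS, WES, RNA-seq and ChIP-seq datasets I had processed before. The original BAMs are huge, the WGS one especially (a 45 GB file), so extracting just the chr17:40437407-40486397 region is the only practical option. With mpileup and other subcommands I had always used the -r option for this, so I assumed the same would work here and wrote: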

samtools view -h -r chr17:40437407-40486397 your.sorted.merge.bam |samtools view -bS - >wes.bam

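It never gave the right result, which was really frustrating, so I Googled it and found https://www.biostars.org/p/48719/

That's when I understood: in samtools view, the -r option does not specify coordinates at all. It selects a read group, and the region is given as a positional argument after the input file: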

samtools view -h control_1.sort.bam "chr17:40437407-40486397" |samtools view -bS - >RNA-seq.bam

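So I changed the command accordingly, and that did the job of extracting a BAM containing only the reads aligned to the specified region.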
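As a minimal sketch of the corrected call, the snippet below checks the size of the region and then extracts it in a single step. The helper name `region_length` and the output name `region.bam` are made up for illustration, and the samtools part is guarded so it only runs when samtools and an indexed BAM (assumed here to have a `.bam.bai` index next to it) are actually present.

```shell
# Length in bp of a "chr:start-end" region (1-based, inclusive coordinates).
region_length() {
    coords=${1#*:}          # drop the "chr17:" prefix
    start=${coords%-*}      # text before the dash
    end=${coords#*-}        # text after the dash
    echo $(( end - start + 1 ))
}

region="chr17:40437407-40486397"
echo "$region spans $(region_length "$region") bp"

# Hypothetical input name; this part only runs if samtools and an indexed BAM exist.
in_bam="your.sorted.merge.bam"
if command -v samtools >/dev/null 2>&1 && [ -f "$in_bam" ] && [ -f "$in_bam.bai" ]; then
    # The region goes AFTER the input file; -b writes BAM directly,
    # so piping through a second "samtools view -bS -" is not needed.
    samtools view -b -o region.bam "$in_bam" "$region"
fi
```

Writing BAM directly with -b keeps the header automatically, which is why the -h plus second view step from the commands above can be dropped.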

samtools view -h

Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options:
-b output BAM
-C output CRAM (requires -T)
-1 use fast BAM compression (implies -b)
-u uncompressed BAM output (implies -b)
-h include header in SAM output
-H print SAM header only (no alignments)
-c print only the count of matching records
-o FILE output file name [stdout]
-U FILE output reads not selected by filters to FILE [null]
-t FILE FILE listing reference names and lengths (see long help) [null]
-L FILE only include reads overlapping this BED FILE [null]
-r STR only include reads in read group STR [null]
-R FILE only include reads with read group listed in FILE [null]
-q INT only include reads with mapping quality >= INT [0]
-l STR only include reads in library STR [null]
-m INT only include reads with number of CIGAR operations consuming
query sequence >= INT [0]
-f INT only include reads with all bits set in INT set in FLAG [0]
-F INT only include reads with none of the bits set in INT set in FLAG [0]
-x STR read tag to strip (repeatable) [null]
-B collapse the backward CIGAR operation
-s FLOAT integer part sets seed of random number generator [0];
rest sets fraction of templates to subsample [no subsampling]
-@, --threads INT
number of BAM/CRAM compression threads [0]
-? print long help, including note about region specification
-S ignored (input format is auto-detected)
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
-T, --reference FILE
Reference sequence FASTA FILE [null]

Calling variants only for a gene of interest

ulwvfje — Mon, 14 Nov 2016 07:20:18 +0000

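The need arises because we often engineer targeted mutations into cell lines, such as R1064K in BAF155 or K27 in H3F3A, and once the high-throughput sequencing data comes back we naturally want a quick check of whether the gene was mutated successfully. Alignment takes hardly any time these days, but sorting the resulting SAM is still fairly slow. Assuming we already have sorted BAM files for all samples, we can check whether the designed mutation worked by calling variants for just that one gene.

The code is simple: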

grep H3F3A ~/reference/gtf/gencode/protein_coding.hg19.position
samtools mpileup -r chr1:226249552-226259702 -ugf ~/reference/genome/hg19/hg19.fa *sorted.bam | bcftools call -vmO z -o H3F3A.vcf.gz
gunzip H3F3A.vcf.gz
~/biosoft/ANNOVAR/annovar/convert2annovar.pl -format vcf4old H3F3A.vcf >H3F3A.annovar
~/biosoft/ANNOVAR/annovar/annotate_variation.pl -buildver hg19 --geneanno --outfile H3F3A.anno H3F3A.annovar ~/biosoft/ANNOVAR/annovar/humandb/
~/biosoft/ANNOVAR/annovar/annotate_variation.pl -buildver hg19 --dbtype knownGene --geneanno --outfile H3F3A.anno H3F3A.annovar ~/biosoft/ANNOVAR/annovar/humandb/

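You need to prepare a file of gene start and end coordinates yourself so that you can look up where your gene is; my H3F3A, for example, is at chr1:226249552-226259702. A simple bcftools variant call is enough. Annotate the resulting VCF with ANNOVAR and check whether a variant falls on the amino acid position of the protein that you designed.

PS: you need to install ANNOVAR yourself; see my earlier blog posts.

Simple, isn't it?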
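The layout of the protein_coding.hg19.position file is not shown in this post, so as a rough sketch here is one way to get the same `chr:start-end` string straight from a GENCODE-style GTF instead. The function name `gene_region` and the two-line toy GTF are assumptions for illustration; only the H3F3A coordinates come from the post.

```shell
# Print "chr:start-end" for the "gene" feature of a given gene symbol.
# $1 = gene symbol, $2 = GTF file (tab-delimited, attributes in column 9)
gene_region() {
    awk -F'\t' -v gene="$1" '
        $3 == "gene" && $9 ~ ("gene_name \"" gene "\";") {
            print $1 ":" $4 "-" $5
            exit
        }' "$2"
}

# Toy GTF for the demo (the second line is a made-up decoy with a similar name).
gtf=$(mktemp)
printf 'chr1\ttoy\tgene\t226249552\t226259702\t.\t+\t.\tgene_id "toy1"; gene_name "H3F3A";\n' > "$gtf"
printf 'chr2\ttoy\tgene\t1000\t2000\t.\t-\t.\tgene_id "toy2"; gene_name "H3F3AP4";\n' >> "$gtf"

region=$(gene_region H3F3A "$gtf")
echo "H3F3A is at $region"
# The result can then be passed on, e.g.: samtools mpileup -r "$region" ...
```

Matching on `gene_name "H3F3A";` including the closing quote and semicolon avoids picking up genes whose names merely start with the same letters, which a plain grep would also return.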

A closer look at how samtools rmdup removes PCR duplicate reads

ulwvfje — Sat, 12 Nov 2016 01:51:30 +0000

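Before removing PCR duplicate reads, you have to understand why you are doing it. WGS? WES? RNA-seq? ChIP-seq? Do all of them need it? Is it only needed for random-fragmentation sequencing, and not for targeted capture?

Once that is clear, let's get started, first testing on a small single-end alignment file.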

samtools rmdup -s tmp.sorted.bam tmp.rmdup.bam

[bam_rmdupse_core] 25 / 53 = 0.4717 in library

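Our test data contains 53 records, of which the software judged 25 reads to be PCR duplicates and removed them.

The official documentation for samtools rmdup is at http://www.htslib.org/doc/samtools.html

samtools rmdup [-sS] <input.srt.bam> <out.bam>

Just turn on the -s flag to remove PCR duplicates from single-end data. For single-end data this is actually simple: the only possible FLAG values are 0, 4 and 16, so reads only need to map to the same start and end coordinates on the chromosome, and their flags match easily. Paired-end data is a bit more complicated.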

Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality. In the paired-end mode, this command ONLY works with FR orientation and requires ISIZE is correctly set. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads).
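To make the single-end idea concrete, here is a deliberately simplified sketch: it groups mapped reads by chromosome, leftmost position and strand, keeps one read per group, and reports the rest as duplicates. This is an illustration of the principle only, not samtools' actual algorithm (for example, it does not adjust reverse-strand reads to their far end). The function name and the five toy records are made up.

```shell
# Count single-end "duplicates" in SAM-like text: QNAME FLAG RNAME POS MAPQ ...
count_single_end_dups() {
    awk -F'\t' '
        /^@/ { next }                          # skip header lines
        int($2 / 4) % 2 == 1 { next }          # FLAG bit 4 set: unmapped, ignore
        {
            strand = (int($2 / 16) % 2) ? "-" : "+"
            key = $3 ":" $4 ":" strand
            # keep the read with the highest mapping quality in each group
            if (!(key in best) || ($5 + 0) > mapq[key]) { best[key] = $1; mapq[key] = $5 + 0 }
            total++
        }
        END {
            kept = 0
            for (k in best) kept++
            printf "%d / %d = %.4f\n", total - kept, total, (total ? (total - kept) / total : 0)
        }' "$1"
}

# Toy records: r1 and r2 share position and strand; r3 is on the other strand.
sam=$(mktemp)
printf 'r1\t0\tchr10\t94741\t30\n'   > "$sam"
printf 'r2\t0\tchr10\t94741\t42\n'  >> "$sam"
printf 'r3\t16\tchr10\t94741\t42\n' >> "$sam"
printf 'r4\t0\tchr10\t95000\t10\n'  >> "$sam"
printf 'r5\t4\t*\t0\t0\n'           >> "$sam"
count_single_end_dups "$sam"
```

Even in this toy version, a read on the opposite strand at the same coordinate is not treated as a duplicate, which hints at why the many flag combinations of paired-end data make the problem harder.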

Next, let's test with a small paired-end dataset:

samtools rmdup tmp.sorted.bam tmp.rmdup.bam

[bam_rmdup_core] processing reference chr10...

[bam_rmdup_core] 2 / 12 = 0.1667 in library

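Clearly, removing PCR duplicates takes more than identical start and end coordinates on the chromosome: the flag matters too, and paired-end data has a whole range of flag values. That is why none of the 5 reads at coordinate 94741 were removed.

As a result samtools rmdup works poorly on paired-end data, so many people recommend Picard's MarkDuplicates instead.

The optimal solution depends on many factors - the consensus seems to be that the picard markduplicates could be the best current solution.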

The appropriateness of duplicate removal depends on coverage - one would want to only remove artificial duplicates and keep the natural duplicates.

MarkDuplicates is "more correct" in the strict sense. Rmdup is more efficient simply because it does not handle those tough cases. Rmdup works for single-end, too, but it cannot do paired-end and single-end at the same time. It does not work properly for mate-pair reads if read lengths are different.

Using samtools idxstats to compute expression levels from de novo transcriptome data

ulwvfje — Mon, 31 Oct 2016 09:16:48 +0000

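For de novo transcriptome data, the reference used for alignment is usually your own assembled trinity.fasta (the transcripts selected as encoding the longest protein), and the raw reads are aligned directly with a tool such as bowtie2. The resulting SAM/BAM file therefore already carries the ID of every transcript in the reference, and counting directly on the SAM/BAM file gives the expression level of each gene.

I was going to write my own script to count from the SAM file, but then found that samtools ships a tool for exactly this: http://www.htslib.org/doc/samtools.html

When the reference is a genome this feature is not much use, but against transcript sequences the numbers it reports are exactly the per-transcript expression levels we want.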

samtools idxstats tmp.bowtie2.sorted.bam |head
TR3|c0_g1_i1 1276 418 0
TR6|c0_g1_i1 1271 10 0
TR6|c0_g1_i2 944 5 0
TR6|c0_g1_i3 1281 4 0
TR6|c0_g1_i4 1224 53 0
TR6|c0_g1_i5 855 16 0
TR19|c0_g1_i2 1428 19 0
TR19|c0_g1_i3 2536 624 0
TR19|c0_g1_i4 3072 105 0
TR19|c0_g1_i5 1685 0 0

The official manual explains it clearly:

samtools idxstats in.sam|in.bam|in.cram

Retrieve and print stats in the index file corresponding to the input file. Before calling idxstats, the input BAM file must be indexed by samtools index.

The output is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads. It is written to stdout.

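The third column is the expression data we want: the number of reads aligned to each transcript.

If you can spot any problems from my transcript IDs, feel free to get in touch; just email me at jmzeng1314@163.com

Now that we know the expression of each transcript, running this for every sample gives the expression matrix, and differential analysis becomes straightforward. What you get, though, is a list of differential transcripts whose IDs mean nothing by themselves, so they need to be annotated before the next step of the analysis.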

ls *sorted.bam |while read id
do
echo $id ${id%%.*}.t.counts
nohup samtools idxstats $id 1>${id%%.*}.t.counts 2>/dev/null &
done
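Once each sample has its own `.t.counts` file, the per-sample tables still have to be joined into one matrix. A possible sketch with awk is shown below; the function name `make_matrix`, the sample names A and B, and the counts for sample B are invented for the demo (the sample A rows reuse two lines from the idxstats output above).

```shell
# Merge idxstats outputs (name, length, mapped, unmapped) into a
# transcript x sample matrix of mapped-read counts (column 3).
make_matrix() {
    awk -F'\t' '
        FNR == 1 {                        # first line of each input file
            n++
            sample[n] = FILENAME
            sub(/.*\//, "", sample[n])    # drop the directory part
            sub(/\..*/, "", sample[n])    # drop everything after the first dot
        }
        $1 == "*" { next }                # the "*" line holds unplaced reads
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++m] = $1 }
            count[$1, n] = $3
        }
        END {
            printf "transcript"
            for (j = 1; j <= n; j++) printf "\t%s", sample[j]
            printf "\n"
            for (i = 1; i <= m; i++) {
                printf "%s", order[i]
                for (j = 1; j <= n; j++) printf "\t%d", count[order[i], j]
                printf "\n"
            }
        }' "$@"
}

# Toy inputs for two hypothetical samples.
demo=$(mktemp -d)
printf 'TR3|c0_g1_i1\t1276\t418\t0\nTR6|c0_g1_i1\t1271\t10\t0\n*\t0\t0\t7\n' > "$demo/A.t.counts"
printf 'TR3|c0_g1_i1\t1276\t400\t0\nTR6|c0_g1_i1\t1271\t12\t0\n*\t0\t0\t3\n' > "$demo/B.t.counts"
make_matrix "$demo/A.t.counts" "$demo/B.t.counts"
```

The column names are derived the same way as `${id%%.*}` in the loop above, so the matrix header lines up with the file prefixes.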

Visualizing peak regions from the aligned BAM files

ulwvfje — Tue, 02 Aug 2016 13:52:53 +0000

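I had analyzed several public projects before and the peaks always looked weird, which kept me wondering whether my analysis was wrong. Persistence finally paid off: I analyzed a dataset whose peaks are perfect, which shows that my analysis pipeline and peak-plotting code are not at fault. The data comes from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE74311, H3K27ac_ChIP-Seq_LOUCY, a histone-modification ChIP-seq dataset. The sequencing data uploaded by the authors was easy to download, and I then ran it through my pipeline: https://github.com/jmzeng1314/NGS-pipeline/tree/master/CHIPseq

The point of this post is how to check whether your peaks are right. I run samtools depth directly on the aligned BAM files to get the sequencing depth across each peak region and plot from that; the code is in step5-peaks-view-samtools-depth.R

The plotting script is called from the terminal like this: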

Rscript ~/scripts/peakView.R ../unique_peaks.bed ../../SRR2774675.unique.sorted.bam ../../SRR2774676.unique.sorted.bam

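Looking at two arbitrary peaks, both clearly follow the bimodal model, and the IP depth is far higher than INPUT. Very nice data!

I also have to point out that when a ChIP-seq experiment fails the peaks look weird: most of the sequencing depth is below 10, and even where some regions have high depth, IP and INPUT show no difference at all. That is what a typical failed case looks like.

Data from a failed experiment like that simply cannot be analyzed. Transcription-factor ChIP-seq has a fairly high failure rate, so always include a control; otherwise, however you analyze it, it is rubbish in, rubbish out.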

Notes on using GATK

ulwvfje — Mon, 06 Jul 2015 23:27:05 +0000

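GATK is very widely used for SNP calling. Until now I had only taken quick, rough looks at SNPs, so I never studied it properly.

Lately I have been working on some exome projects where finding SNPs is the focus, so I decided to use it after all. It threw a lot of errors and took a long time to debug before it worked.

So here are some notes.

GATK itself is protected by copyright, so you have to apply before you can download and use it; apply at the Broad Institute yourself.

Once downloaded it can be used directly: it is a Java program and needs no installation, but Java must be available on your machine. The software is only the start, though. You also have to download a lot of companion data, https://software.broadinstitute.org/gatk/download/bundle (PS: this link may go dead; for the files listed below, please find the addresses via Google), and at this point you must be clear about your reference genome version: b36/b37/hg18/hg19/hg38. Remember that b37 and hg19 are not exactly the same; there are small differences.

For example, I chose hg19.

The first thing is downloading hg19. There are many download sites, the common ones being NCBI, Ensembl and UCSC, but here I recommend this script: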

for i in $(seq 1 22) X Y M;
do echo $i;
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz;
done
gunzip *.gz
for i in $(seq 1 22) X Y M;
do cat chr${i}.fa >> hg19.fasta;
done
rm -f chr*.fa

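Anyone who can read shell will see that this downloads the hg19 chromosomes one by one and then concatenates them with cat in chromosome order. Some later GATK steps are extremely picky about chromosome order, and if you download hg19 as a whole it is hard to guarantee the order 1-22, X, Y, M.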
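Since the order matters so much, it can be worth checking it after the fact. The sketch below compares the first column of a `.fai` index against the expected chr1..chr22, chrX, chrY, chrM order. The function names and the two toy index files are made up for the demo; on real data you would point `check_fai_order` at hg19.fasta.fai.

```shell
# Expected contig order: chr1..chr22, chrX, chrY, chrM (one name per line).
expected_order() {
    for i in $(seq 1 22) X Y M; do echo "chr${i}"; done
}

# Succeeds only if column 1 of the .fai file ($1) matches the expected order exactly.
check_fai_order() {
    [ "$(cut -f1 "$1")" = "$(expected_order)" ]
}

# Toy index files: one in the expected order, one sorted alphabetically.
demo=$(mktemp -d)
expected_order | awk '{ print $1 "\t1000\t0\t50\t51" }' > "$demo/good.fai"
LC_ALL=C sort "$demo/good.fai" > "$demo/bad.fai"

for f in "$demo/good.fai" "$demo/bad.fai"; do
    if check_fai_order "$f"; then
        echo "$(basename "$f"): order OK"
    else
        echo "$(basename "$f"): unexpected contig order"
    fi
done
```

Alphabetical sorting puts chr10 before chr2 and chrM before chrX, which is exactly the kind of ordering the concatenation script above avoids.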

Then index the downloaded hg19 (with bwa and samtools) and build the dict file (with picard).

bwa index -a bwtsw hg19.fasta

samtools faidx hg19.fasta

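Then there are a few more reference files to download; which ones is up to you.

For my hg19, that means downloading them from ftp://ftp.broadinstitute.org/bundle/hg19/

In the end, all the required files are: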

231M Jul 2 05:14 1000G_phase1.indels.hg19.sites.vcf
1.2M Jul 2 10:45 1000G_phase1.indels.hg19.sites.vcf.idx
11G Jul 2 08:05 dbsnp_138.hg19.vcf
2.5K Jul 1 04:31 hg19.dict
3.0G Jun 30 21:29 hg19.fasta
6.6K Jun 30 22:54 hg19.fasta.amb
944 Jun 30 22:54 hg19.fasta.ann
2.9G Jun 30 22:54 hg19.fasta.bwt
788 Jul 2 01:53 hg19.fasta.fai
739M Jun 30 22:54 hg19.fasta.pac
1.5G Jun 30 23:23 hg19.fasta.sa
87M Jul 2 05:37 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
2.3M Jul 2 10:45 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx

Now let's run the programs.

Step 1 is generating the SAM file:

bwa mem -t 12 -M hg19.fasta tmp*fq >tmp.sam

Step 2 is sorting, for which I use picard:

java -Xmx100g -jar AddOrReplaceReadGroups.jar I=tmp.sam O=tmp.sorted.bam \
SORT_ORDER=coordinate \
CREATE_INDEX=true \
RGID=tmp \
RGLB="pe" \
RGPU="HiSeq-2000" \
RGSM=PC3-2 \
RGCN="Human Genetics of Infectious Disease" \
RGDS=hg19 RGPL=illumina \
VALIDATION_STRINGENCY=SILENT

Step 3 is removing PCR duplicates, again with picard:

java -Xmx100g -jar MarkDuplicates.jar \
CREATE_INDEX=true REMOVE_DUPLICATES=True \
ASSUME_SORTED=True VALIDATION_STRINGENCY=LENIENT \
I=tmp.sorted.bam OUTPUT=tmp.dedup.bam METRICS_FILE=tmp.metrics

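Step 4 is where GATK finally comes in, mainly to determine the regions that need realignment. This step splits into three sub-steps.

First, use RealignerTargetCreator to find the regions that need realignment; it outputs an intervals file. The reference genome passed with -R matters a lot here, and its DICT must also be generated as described.

java -Xmx200g -jar ~/apps/gatk/GenomeAnalysisTK.jar \
-R hg19.fasta \
-T RealignerTargetCreator \
-I tmp.dedup.bam -o tmp.intervals \
-known /home/ldzeng/EXON/ref/1000G_phase1.indels.hg19.sites.vcf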

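This step seems to be very time-consuming.

I tested only 5014 reads in total, yet it took nearly half an hour to finish, and only 947 reads were filtered.

The output tmp.intervals is a file of 1404946 lines: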

chr1:13957-13958

chr1:46402-46403

chr1:47190-47191

chr1:52185-52188

chr1:53234-53236

chr1:55249-55250

chr1:63735-63738

Humans only have two or three hundred thousand exons, so for now I am not sure what this file represents.

Next, feed the tmp.intervals output back in as input for the realignment itself, i.e. use IndelRealigner to realign within these regions:

java -Xmx150g -jar ~/apps/gatk/GenomeAnalysisTK.jar \
-R hg19.fasta \
-T IndelRealigner \
-targetIntervals tmp.intervals \
-I tmp.dedup.bam -o tmp.dedup.realgn.bam \
-known /home/ldzeng/EXON/ref/1000G_phase1.indels.hg19.sites.vcf

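Realignment was all I needed, so I have hardly used the later functions: one is calling SNPs, the other is recalculating the quality scores.

java -Xmx200g -jar ~/apps/gatk/GenomeAnalysisTK.jar \
-nct 20 -T HaplotypeCaller -R hg19.fasta \
-I tmp.dedup.realgn.bam \
-o tmp.gatk.vcf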

The final output files are:

639K Jul 5 10:17 tmp1.fq
639K Jul 5 10:19 tmp2.fq
1.5M Jul 5 10:26 tmp.dedup.bai
403K Jul 5 10:26 tmp.dedup.bam
12K Jul 5 12:02 tmp.gatk.vcf
3.4K Jul 5 12:02 tmp.gatk.vcf.idx
32M Jul 5 11:24 tmp.intervals
950 Jul 5 10:26 tmp.metrics
1.5M Jul 5 11:31 tmp.realgn.bai
409K Jul 5 11:31 tmp.realgn.bam
1.6M Jul 5 10:20 tmp.sam
1.5M Jul 5 10:23 tmp.sorted.bai
399K Jul 5 10:23 tmp.sorted.bam

Note: GATK requires a dictionary file for the genome.

Generate it with CreateSequenceDictionary.jar from the picard toolkit. Taking hg19.fa as an example, the command is:

java -Xmx2g -jar /path_to_picard/CreateSequenceDictionary.jar R=hg19.fa O=hg19.dict

Samtools cannot produce mpileup-format and bcftools-format data at the same time

ulwvfje — Mon, 01 Jun 2015 01:47:15 +0000

From: https://www.biostars.org/p/63429/

I'm using samtools mpileup and would like to generate both a pileup file and a vcf file as output. I can see how to generate one or the other, but not both (unless I run mpileup twice). I suspect I am missing something simple.

Specifically, calling mpileup with the -g or -u flag causes it to compute genotype likelihoods and output a bcf. Leaving these flags off just gives a pileup. Is there any way to get both, without redoing the work of producing the pileup file? Can I get samtools to generate the bcf _from_ the pileup file in some way? Generating the bcf from the bam file, when I already have the pileup, seems wasteful.

Thanks for any help!

When I wrote a script to run this, I found I really did need two redundant steps to get both the mpileup-format data and the bcftools-format data, which is obviously duplicated work and a waste of time:

for i in *sam
do
echo $i
samtools view -bS $i >${i%.*}.bam
samtools sort ${i%.*}.bam ${i%.*}.sorted
samtools index ${i%.*}.sorted.bam
samtools mpileup -f /home/jmzeng/ref-database/hg19.fa ${i%.*}.sorted.bam >${i%.*}.mpileup
samtools mpileup -guSDf /home/jmzeng/ref-database/hg19.fa ${i%.*}.sorted.bam | bcftools view -cvNg - > ${i%.*}.vcf
done
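The loop leans heavily on the `${i%.*}` expansion to derive output names, so here is a small sketch of what it does and how it differs from `${i%%.*}`. The two wrapper functions are only there to make the behaviour easy to try out; their names are invented.

```shell
# "%"  removes the SHORTEST suffix matching the pattern (only the last extension);
# "%%" removes the LONGEST suffix matching the pattern (everything after the first dot).
strip_last_ext() { echo "${1%.*}"; }
strip_all_ext()  { echo "${1%%.*}"; }

for f in sample1.sam tmp.bowtie2.sorted.bam; do
    echo "$f -> $(strip_last_ext "$f") / $(strip_all_ext "$f")"
done
```

With single-extension names like sample1.sam the two forms agree; they only diverge once a file name contains more than one dot, which is worth keeping in mind when choosing prefixes for intermediate files.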

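I want the mpileup format because downstream tools such as VarScan need that file to call SNPs,

while the BCF output can be passed straight to bcftools for SNP calling.

Once -g or -u is given, samtools mpileup outputs only BCF.

To get pileup-format data, leave those flags off and pass only -f with the reference.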

bcftools doesn't work on pileup format data. It works on bcf/vcf files.
samtools provides a script called sam2vcf.pl, which works on the output of "samtools pileup". However, this command is deserted in newer versions. The output of "samtools mpileup" does not satisfy the requirement of sam2vcf.pl. You can check the required pileup format on lines 95-99, which is different from output of "samtools mpileup".

Installing and using Samtools

ulwvfje — Sun, 29 Mar 2015 13:45:27 +0000

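Part 1: Download and install the software.

The download address is easy to find online; after unpacking, just run make.

It usually fails with an error: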

In file included from bam_cat.c:41:0:

htslib-1.1/htslib/bgzf.h:34:18: fatal error: zlib.h: No such file or directory

#include <zlib.h>

compilation terminated.

make: *** [bam_cat.o] Error 1

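And then, unexpectedly, it just went through. Sometimes I really cannot figure out the underlying workings of a Linux system, but it works anyway. Learn to search and learn by trial and error.

It was only two years later that I understood (installing software on Linux requires specifying an install path, one you have permission to write to; 2016-11-23 10:12:11). For example, software can be installed like this: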

mkdir -p ~/biosoft/myBin
echo 'export PATH=/home/jianmingzeng/biosoft/myBin/bin:$PATH' >>~/.bashrc
source ~/.bashrc
cd ~/biosoft
mkdir cmake && cd cmake
wget http://cmake.org/files/v3.3/cmake-3.3.2.tar.gz
tar xvfz cmake-3.3.2.tar.gz
cd cmake-3.3.2
./configure --prefix=/home/jianmingzeng/biosoft/myBin ## this part is very important
make
make install

Some machines, however, report a different error:

#include

compilation terminated.

make: *** [bam_tview_curses.o] Error 1

I'll cover that one too, since my server ran into it before and it was just as much of a headache.

sudo apt-get install libncurses5-dev

Part 2: Prepare the data and use the tool; see my SNP-calling pipeline

http://www.bio-info-trainee.com/?p=439

samtools view -bS tmp1.sam > tmp1.bam

samtools sort tmp1.bam tmp1.sorted

samtools index tmp1.sorted.bam

samtools mpileup -d 1000 -gSDf ../../../ref-database/hg19.fa tmp1.sorted.bam |bcftools view -cvNg - >tmp1.vcf

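This is because the software is only useful in combination with tools that produce SAM files, such as bwa and bowtie.

It has quite a lot of options, but only a handful are used regularly, and tutorials are easy to find online.

Here is some brief reference material.

samtools is a collection of tools for manipulating SAM and BAM files and contains many commands. The commonly used ones are introduced below.

1. view

The main functions of the view command are: converting SAM files to BAM; then performing operations on the BAM file, such as sorting (not actually part of this command) and extraction (these operate on BAM files, so they cannot be done when the input is a SAM file); and finally writing the sorted or extracted data in BAM or SAM (the default) format.

Advantages of BAM: it is a binary file and takes less disk space than the SAM text file, and operations on the binary BAM are faster.

In the view command, input of the SAM header (-t or -T) and its output (-h) are controlled by separate options.

Usage: samtools view [options] <in.bam>|<in.sam> [region1 [...]]

When no region is given, all regions are output by default.

Options:
-b output BAM. Output is SAM by default; this option switches it to BAM.
-h print header for the SAM output. By default the SAM output has no header; this option includes it.
-H print header only (no alignments)
-S input is SAM. Input is assumed to be BAM by default; if the input is SAM it is best to add this option, otherwise errors can occur.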

Examples:

# convert a SAM file to BAM
$ samtools view -bS abc.sam > abc.bam
$ samtools view -b -S abc.sam -o abc.bam

# extract alignments that mapped to the reference
$ samtools view -bF 4 abc.bam > abc.F.bam
# for paired reads, extract pairs where both reads mapped; just use 12 (4+8) as the filter value
$ samtools view -bF 12 abc.bam > abc.F12.bam
# extract records that did not map to the reference
$ samtools view -bf 4 abc.bam > abc.f.bam
# extract alignments on scaffold1 from the BAM and save them as SAM
$ samtools view abc.bam scaffold1 > scaffold1.sam
# extract alignments within 30k to 100k on scaffold1
$ samtools view abc.bam scaffold1:30000-100000 > scaffold1_30k-100k.sam
# add a header to a SAM or BAM file, based on the fasta file
$ samtools view -T genome.fasta -h scaffold1.sam > scaffold1.h.sam
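The -f and -F examples above come down to simple bit tests on the FLAG field. As a sketch, the two helper functions below (names invented for illustration) mirror the documented rules: -F keeps a read when none of the mask bits are set, and -f keeps it when all of them are.

```shell
# $1 = FLAG value of a read, $2 = mask given on the command line
keeps_F() { [ $(( $1 & $2 )) -eq 0 ]; }      # -F mask: none of the bits set
keeps_f() { [ $(( $1 & $2 )) -eq "$2" ]; }   # -f mask: all of the bits set

# 4 = read unmapped, 8 = mate unmapped, so 12 = either end unmapped.
# 73 = 1 + 8 + 64 (paired, mate unmapped, first in pair); 99 = 1 + 2 + 32 + 64.
for flag in 0 4 16 73 99; do
    if keeps_F "$flag" 12; then
        echo "FLAG $flag is kept by -F 12"
    else
        echo "FLAG $flag is dropped by -F 12"
    fi
done
```

A read with FLAG 73 is itself mapped, so -F 4 keeps it, but -F 12 drops it because its mate is unmapped; that is the difference between the two extraction examples above.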

2. sort

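sort sorts a BAM file.

Usage: samtools sort [-n] [-m <maxMem>] <in.bam> <out.prefix>

-m defaults to 500,000,000, i.e. 500M (suffixes such as K, M, G are not supported). When processing large data, set a larger value if memory allows, to save time.
-n sorts by read ID. By default, records are sorted by the order of the sequences in the fasta file (i.e. the header) and by position from left to right within each sequence.

Example:

$ samtools sort abc.bam abc.sort
$ samtools view abc.sort.bam | less -S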

3. merge

Merges two or more already-sorted BAM files into one BAM file. The merged file is itself already sorted and does not need sorting again.

Usage: samtools merge [-nr] [-h inh.sam] <out.bam> <in1.bam> <in2.bam> [...]

Options:
-n sort by read names
-r attach RG tag (inferred from file names)
-u uncompressed BAM output
-f overwrite the output BAM if exist
-1 compress level 1
-R STR merge file in the specified region STR [all]
-h FILE copy the header in FILE to <out.bam> [in1.bam]

Note: Samtools' merge does not reconstruct the @RG dictionary in the header. Users must provide the correct header with -h, or uses Picard which properly maintains the header dictionary in merging.

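4. index

A BAM file must be sorted in the default (coordinate) order before it can be indexed; otherwise an error is reported.

Indexing produces a file with the .bai suffix, used for fast random access. The bai file is needed in many situations, especially when displaying alignments: the samtools tview command needs it, and so does gbrowse2 when drawing read alignments.

Usage: samtools index <in.bam> [out.index]

Example:

# the following two commands give the same result
$ samtools index abc.sort.bam
$ samtools index abc.sort.bam abc.sort.bam.bai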

5. faidx

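Builds an index for a fasta file; the index file ends with the .fai suffix. The command can also use the index to quickly extract a (sub)sequence from the fasta file.

Usage: samtools faidx <in.fasta> [<reg> [...]]

# build an index for the genome file
$ samtools faidx genome.fasta
# This creates the index file genome.fasta.fai, a text file with 5 columns. Column 1 is the sequence name and column 2 the sequence length. I believe column 3 is the position of the sequence in the file, since the number grows from top to bottom and the last value is close to the size of genome.fasta; I am not sure what columns 4 and 5 mean. With this file the on-disk location of each sequence in the fasta can be found, so subsequences can be pulled out directly and quickly.

# thanks to the index, a subsequence can be extracted from the genome in fasta format very quickly
$ samtools faidx genome.fasta scaffold_10 > scaffold_10.fasta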
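For reference, the faidx documentation names the five columns NAME, LENGTH, OFFSET (byte offset of the first base), LINEBASES (bases per line) and LINEWIDTH (bytes per line including the newline). The sketch below shows how those numbers are enough to jump straight to one base; the toy fasta, its hand-written index and the function name `fetch_base` are assumptions made for the demo rather than output from a real run.

```shell
# Toy genome: two sequences, 10 bases per line.
demo=$(mktemp -d)
printf '>scaffold_1\nACGTACGTAC\nGTACGTACGT\nAC\n>scaffold_2\nTTTTGGGGCC\nAA\n' > "$demo/genome.fasta"
# Matching index, written by hand: NAME LENGTH OFFSET LINEBASES LINEWIDTH
printf 'scaffold_1\t22\t12\t10\t11\nscaffold_2\t12\t49\t10\t11\n' > "$demo/genome.fasta.fai"

# Print the base at a 1-based position using only the index and one seek.
# $1 = fasta file (with "$1.fai" next to it), $2 = sequence name, $3 = position
fetch_base() {
    fb_offset=$(awk -F'\t' -v name="$2" -v pos="$3" '
        $1 == name {
            p = pos - 1                               # 0-based position
            print $3 + int(p / $4) * $5 + p % $4      # full lines skipped + column in line
        }' "$1.fai")
    dd if="$1" bs=1 skip="$fb_offset" count=1 2>/dev/null
}

echo "scaffold_1:11 is $(fetch_base "$demo/genome.fasta" scaffold_1 11)"
echo "scaffold_2:5  is $(fetch_base "$demo/genome.fasta" scaffold_2 5)"
```

This is why the fasta must have a uniform line length within each sequence: the arithmetic only works when every full line has the same number of bases and bytes.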

6. tview

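tview shows intuitively how reads align to the genome, somewhat like a genome browser.

Usage: samtools tview <aln.bam> [ref.fasta]

When a reference genome is given, its sequence is shown on the first row; otherwise the first row is all N.

Press g to be prompted for a genome position to jump to; for example, "scaffold_10:1000" jumps to base 1000 of scaffold 10.
Use H (left), J (down), K (up) and L (right) to move the view. Uppercase letters move fast, lowercase letters move slowly.
Space scrolls right quickly (similar to L); Backspace scrolls left quickly (similar to H).
Ctrl+H moves 1 kb to the left; Ctrl+L moves 1 kb to the right.
Colors can mark mapping quality, base quality, nucleotides and so on. Base or mapping qualities of 30~40 are shown in white, 20~30 in yellow, 10~20 in green and 0~10 in blue.
Use the period '.' to toggle between showing bases and dots; use r to toggle showing read names.
There are many more usage notes; press ? to see them.

Reference: the samtools documentation: http://samtools.sourceforge.net/samtools.shtml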

http://www.plob.org/2014/01/26/7112.html

SNP-calling pipeline (BWA+SAMTOOLS+BCFTOOLS)

ulwvfje — Mon, 23 Mar 2015 12:20:25 +0000

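Alignment can be done with either BWA or bowtie, and the sequencing data can be single-end or paired-end. I'll walk through just one combination here, but the scripts for the others are all listed. I chose bowtie alignment with single-end data.

First go to the hg19 directory and build two indexes for it: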

samtools faidx hg19.fa

bowtie2-build hg19.fa hg19

Here I simply took the first 1000 lines of 26 GB of sequencing data to make a tmp.fa file, then worked in the directory containing it.

For details on using Bowtie see http://www.bio-info-trainee.com/?p=398

bowtie2 -x ../../../ref-database/hg19 -U tmp1.fa -S tmp1.sam

samtools view -bS tmp1.sam > tmp1.bam

samtools sort tmp1.bam tmp1.sorted

samtools index tmp1.sorted.bam

samtools mpileup -d 1000 -gSDf ../../../ref-database/hg19.fa tmp1.sorted.bam |bcftools view -cvNg - >tmp1.vcf

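And then we can see the VCF variant file we produced!

Of course, the VCF file may still need to be annotated.

To follow the pipeline and commands above you need to know BWA, bowtie, samtools and bcftools,

as well as the data formats fasta, fastq, sam, vcf and pileup.

With bwa, you index the reference genome, run aln to get the suffix-array (.sai) files, and then run sampe to align the paired-end data.

First run bwa index, choosing the algorithm, to build the index.

Then process in batch with the aln script: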

==> bwa_aln.sh <==

while read id
do
echo $id
bwa aln hg19.fa $id >$id.sai
done <$1

Then process in batch with the sampe script:

==> bwa_sampe.sh <==

while read id
do
echo $id
bwa sampe hg19.fa $id*sai $id*single >$id.sam
done <$1

Then the samtools script:

==> samtools.sh <==

while read id
do
echo $id
samtools view -bS $id.sam > $id.bam
samtools sort $id.bam $id.sorted
samtools index $id.sorted.bam
done <$1

Then the bcftools script:

==> bcftools.sh <==

while read id
do
echo $id
samtools mpileup -d 1000 -gSDf ref.fa $id*sorted.bam |bcftools view -cvNg - >$id.vcf
done <$1

==> mpileup.sh <==

while read id
do
echo $id
samtools mpileup -d 100000 -f hg19.fa $id*sorted.bam >$id.mpileup
done <$1
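All of these scripts share the same pattern: read one sample ID per line from the file given as `$1` and run a fixed set of commands on it. As a small sketch of that pattern, the function below only prints the commands (a dry run) instead of executing them, which is a cheap way to check an ID list before committing hours of compute. The function name and sample IDs are made up; the printed sort command uses the `-o` form that current samtools releases expect, whereas the scripts above use the older output-prefix form.

```shell
# Dry run: print the sort/index commands for every sample ID listed in file $1.
emit_commands() {
    while IFS= read -r id; do
        [ -n "$id" ] || continue                       # skip empty lines
        printf 'samtools sort -o %s.sorted.bam %s.sam\n' "$id" "$id"
        printf 'samtools index %s.sorted.bam\n' "$id"
    done < "$1"
}

# Toy ID list with a blank line in the middle.
ids=$(mktemp)
printf 's1\n\ns2\n' > "$ids"
emit_commands "$ids"
```

Piping the output into `sh` (or into a job scheduler) then runs the commands for real once the list looks right.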