不要想当然的使用生信软件，读文档，勤搜索！

ulwvfje — Mon, 06 Feb 2017 02:35:42 +0000

最近在写一篇很有趣的文章，一张图说清楚wgs,wes,rna-seq,chip-seq的异同点！

需要用到一些测试数据，我准备拿17号染色体的40437407-40486397这约48Kb碱基区域来举例子，就需要把这个区域的bam提取出来。

我分别找了以前处理的wgs,wes,rna-seq,chip-seq公共数据，原始bam非常大，尤其是WGS的，45G的bam文件，所以只能抽取17号染色体的40437407-40486397这约48Kb碱基区域，以前我做mpileup或者其它都是用的-r 参数，所以我想当然的使用下面的代码：

samtools view -h -r chr17:40437407-40486397 your.sorted.merge.bam |samtools view -bS - >wes.bam

发现始终不对，让我着实郁闷，我就Google了一下，https://www.biostars.org/p/48719/

才明白，samtools的view命令的-r参数不再是用来指定坐标了！

samtools view -h control_1.sort.bam "chr17:40437407-40486397" |samtools view -bS - >RNA-seq.bam

所以我修改了命令，完成了提取指定区域比对的reads的bam文件这个需求！

samtools view -h

Usage: samtools view [options] || [region ...]

Options:
-b output BAM
-C output CRAM (requires -T)
-1 use fast BAM compression (implies -b)
-u uncompressed BAM output (implies -b)
-h include header in SAM output
-H print SAM header only (no alignments)
-c print only the count of matching records
-o FILE output file name [stdout]
-U FILE output reads not selected by filters to FILE [null]
-t FILE FILE listing reference names and lengths (see long help) [null]
-L FILE only include reads overlapping this BED FILE [null]
-r STR only include reads in read group STR [null]
-R FILE only include reads with read group listed in FILE [null]
-q INT only include reads with mapping quality >= INT [0]
-l STR only include reads in library STR [null]
-m INT only include reads with number of CIGAR operations consuming
query sequence >= INT [0]
-f INT only include reads with all bits set in INT set in FLAG [0]
-F INT only include reads with none of the bits set in INT set in FLAG [0]
-x STR read tag to strip (repeatable) [null]
-B collapse the backward CIGAR operation
-s FLOAT integer part sets seed of random number generator [0];
rest sets fraction of templates to subsample [no subsampling]
-@, --threads INT
number of BAM/CRAM compression threads [0]
-? print long help, including note about region specification
-S ignored (input format is auto-detected)
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
-T, --reference FILE
Reference sequence FASTA FILE [null]

Bioconductor系列之pasillaBamSubset

ulwvfje — Sun, 19 Jul 2015 02:46:35 +0000

这个包主要有bam文件测试数据
> biocLite("pasillaBamSubset")
BioC_mirror: http://bioconductor.orgUsing Bioconductor version 3.0 (BiocInstaller 1.16.5), R version 3.1.2.
Installing package(s) 'pasillaBamSubset'
trying URL 'http://bioconductor.org/packages/3.0/data/experiment/bin/windows/contrib/3.1/pasillaBamSubset_0.3.1.zip'
Content type 'application/zip' length 31514402 bytes (30.1 Mb)
打开pasillaBamSubset包的安装地址就可以看到里面有几个bam文件
Several functions are available for reading BAM files into R:
而且加载包的同时也引入了几个读取bam文件的函数
readGAlignments()
readGAlignmentPairs()
readGAlignmentsList()
scanBam()

加载包就可以看到用两个函数得到包自带的数据文件的地址，主要是有很多人不一定把包安装在C盘，所以用函数来定位文件更加安全一点
> library(pasillaBamSubset)
> untreated1_chr4()
[1] "C:/Program Files/R/R-3.1.2/library/pasillaBamSubset/extdata/untreated1_chr4.bam"
> untreated3_chr4()
[1] "C:/Program Files/R/R-3.1.2/library/pasillaBamSubset/extdata/untreated3_chr4.bam"

接下来我们就看看如何读取这些bam文件的
library(pasillaBamSubset)
un1 <- untreated1_chr4() # single-end reads library(GenomicAlignments) reads1 <- readGAlignments(un1) cvg1 <- coverage(reads1) 查看reads1这个结果，可以看到把这个bam文件都读成了一个数据对象GAlignments object，
针对着个数据对象有很多操作，其中一个coverage操作是来自于GenomicFeatures
或者GenomicAlignments函数的，可以算出测序覆盖情况。
可以看到这个bam文件里面的比对情况大多几种在4号染色体里面
> cvg1$chr4
integer-Rle of length 1351857 with 122061 runs
Lengths: 891 27 5 12 13 45 5 12 13 ... 5 1 1 3 10
Values : 0 1 2 3 4 5 4 3 2 ... 12 11 10 6
> mean(cvg1$chr4)
[1] 11.33746
> max(cvg1$chr4)[1] 5627
可以看到平均测序深度是11.3X，最大测序深度是5627X

生信菜鸟团 » bam

不要想当然的使用生信软件，读文档，勤搜索！

Bioconductor系列之pasillaBamSubset