用R包BayesPeak来对CHIP-seq数据call peaks

BayesPeak也是peaks caller家族一员,用的人也不少,我这次也试了一下,因为是R的bioconductor系列包,所以直接在R里面安装就好,但是有几个点需要注意,我比对的基因组不只是Chr1~22,X,Y,M,还有一些contig和scaffold,需要在bam文件里面去除的,而且BayesPeak比较支持读取BED文件,可以直接转为GRanges对象,虽然它号称可以使用多核,但是计算速度还是非常慢。

### step6.7 peak calling by BayesPeak(R bioconductor package)
# Bayesian Analysis of ChIP-seq data
## BayesPeak fits a Markov model to the data (the aligned reads) via Markov Chain Monte Carlo (MCMC) techniques.

# 有博客里面提到I've used BayesPeak running in R. It is much easier to install than MACS (failed for me), which require some (strange to me) files.
# 首先要把bowtie2比对好的alignment文件bam格式转换为bed格式: http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html
## In particular, the chromosome, start position, end position and DNA strand appear in the 1st, 2nd, 3rd and 6th columns respectively.
#### software : http://bioconductor.org/packages/release/bioc/html/BayesPeak.html
#### readme: http://bioconductor.org/packages/release/bioc/vignettes/BayesPeak/inst/doc/BayesPeak.pdf
学习R的bioconductor系列包很容易,先看看例子即可examples:

library(BayesPeak) ## 一般例子都会读取包自带的测试文件
tFile=file.path(system.file(package='BayesPeak'),'extdata','H3K4me3reduced.bed')
cFile=file.path(system.file(package='BayesPeak'),'extdata','Inputreduced.bed')

raw.output <- bayespeak(tFile, cFile, chr = "chr16", start = 9.2E7, end = 9.5E7, job.size = 6E6)
output <- summarize.peaks(raw.output, method = "lowerbound")
## the function summarize.peaks will do : Filtering of unenriched jobs/Filtering of unenriched bins/Assembly of enriched bins/Conversion of bins to peaks

write.table(as.data.frame(output), file = "H3K4me3output.txt", quote = FALSE)
## write.csv(as.data.frame(output), file = "H3K4me3output.csv", quote = FALSE)
## 可以借助多线程来加快运行速度:
library(parallel)
## 还需要 检查pp这个阈值的选取 # A “potentially enriched” bin is defined as any bin with PP > 0.01.
The output of the algorithm is the Posterior Probability (often abbreviated to PP) of each bin being enriched.
The PP value is useful not only for calling the peaks, but could also be used in downstream analyses - for
example, to weight observations when searching for a novel transcription factor motif. The PP value is not
to be confused with the p value from hypothesis testing

> min.job <- min(raw.output$peaks$job)
> max.job <- max(raw.output$peaks$job)
> par(mfrow = c(2,2), ask = TRUE)
> for(i in min.job:max.job) {plot.PP(raw.output, job = i, ylim = c(0,50))}
When the coverage is sparse and therefore less information is available, the PP values tend to be more
uniformly spread over the interval [0,1], as above. This means that the distinction between peaks and
background is harder to make, which is usually a result of poor enrichment,

raw.output <- bayespeak(tFile, cFile,use.multicore = TRUE, mc.cores = 4)
i <- 324
plot.PP(raw.output, job = i, ylim = c(0,50))

看完了例子,就可以开始处理自己的数据啦:

############ first change bam files to bed files :
ls *sorted.bam |while read id ;do ~/biosoft/bedtools/bedtools2/bin/bedtools bamtobed -i $id > ${id%%.*}.bed ;done
但是要过滤掉特殊染色体(chr6_cox_hap2,chrUn_gl000214),仅仅保留CHR1-22,X,Y,M
ls *bed |while read id ;do grep -v "_" $id >${id%%.*}.clean_bed;done

下面是我处理自己的数据的完整代码,很简单:

############ Then do peak calling in R by BayesPeak
library(BayesPeak)
library(parallel)
workdir=getwd()
tFile=file.path(workdir,'SRR1042593.clean_bed')
cFile=file.path(workdir,'SRR1042594.clean_bed')
raw.output <- bayespeak(tFile, cFile,use.multicore = TRUE, mc.cores = 8)
output <- summarize.peaks(raw.output, method = "lowerbound")
write.table(as.data.frame(output), file = "Xu_MUT_rep1.txt", quote = FALSE)

Comments are closed.