Calling peaks on ChIP-seq data with the R package BayesPeak

ulwvfje — Tue, 05 Jul 2016 15:25:46 +0000

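BayesPeak is another member of the peak-caller family and has a fair number of users, so I gave it a try this time. Because it is an R Bioconductor package, it can be installed directly from within R. A few points need attention, though. The genome I aligned against contains not only chr1~22, X, Y and M but also some contigs and scaffolds, and reads on those need to be removed from the alignment files. BayesPeak works well with BED files, which can be converted directly into GRanges objects. Although it claims to support multiple cores, it is still very slow.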

### step6.7 peak calling by BayesPeak(R bioconductor package)
# Bayesian Analysis of ChIP-seq data
## BayesPeak fits a Markov model to the data (the aligned reads) via Markov Chain Monte Carlo (MCMC) techniques.

# One blog post mentions: "I've used BayesPeak running in R. It is much easier to install than MACS (failed for me), which require some (strange to me) files."
# First convert the bowtie2 alignments from BAM format to BED format: http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html
## In particular, the chromosome, start position, end position and DNA strand appear in the 1st, 2nd, 3rd and 6th columns respectively.
#### software : http://bioconductor.org/packages/release/bioc/html/BayesPeak.html
#### readme: http://bioconductor.org/packages/release/bioc/vignettes/BayesPeak/inst/doc/BayesPeak.pdf
Learning a Bioconductor package is easy: start by going through its examples.

library(BayesPeak) ## examples usually read the test files bundled with the package
tFile=file.path(system.file(package='BayesPeak'),'extdata','H3K4me3reduced.bed')
cFile=file.path(system.file(package='BayesPeak'),'extdata','Inputreduced.bed')

raw.output <- bayespeak(tFile, cFile, chr = "chr16", start = 9.2E7, end = 9.5E7, job.size = 6E6)
output <- summarize.peaks(raw.output, method = "lowerbound")
## the function summarize.peaks will do : Filtering of unenriched jobs/Filtering of unenriched bins/Assembly of enriched bins/Conversion of bins to peaks

write.table(as.data.frame(output), file = "H3K4me3output.txt", quote = FALSE)
## write.csv(as.data.frame(output), file = "H3K4me3output.csv", quote = FALSE)
## Multiple cores can be used to speed things up:
library(parallel)
## The choice of PP threshold also needs checking. # A “potentially enriched” bin is defined as any bin with PP > 0.01.
The output of the algorithm is the Posterior Probability (often abbreviated to PP) of each bin being enriched.
The PP value is useful not only for calling the peaks, but could also be used in downstream analyses - for
example, to weight observations when searching for a novel transcription factor motif. The PP value is not
to be confused with the p value from hypothesis testing.

min.job <- min(raw.output$peaks$job)
max.job <- max(raw.output$peaks$job)
par(mfrow = c(2,2), ask = TRUE)
for(i in min.job:max.job) {plot.PP(raw.output, job = i, ylim = c(0,50))}
When the coverage is sparse and therefore less information is available, the PP values tend to be more
uniformly spread over the interval [0,1], as above. This means that the distinction between peaks and
background is harder to make, which is usually a result of poor enrichment.

raw.output <- bayespeak(tFile, cFile,use.multicore = TRUE, mc.cores = 4)
i <- 324
plot.PP(raw.output, job = i, ylim = c(0,50))

With the examples covered, we can move on to our own data:

############ first convert the bam files to bed files :
ls *sorted.bam |while read id ;do ~/biosoft/bedtools/bedtools2/bin/bedtools bamtobed -i $id > ${id%%.*}.bed ;done
The special chromosomes (e.g. chr6_cox_hap2, chrUn_gl000214) then have to be filtered out, keeping only chr1-22, X, Y and M:
ls *bed |while read id ;do grep -v "_" $id >${id%%.*}.clean_bed;done
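One caveat with `grep -v "_"`: it drops every line that contains an underscore anywhere, including in the read-name column, not only lines on unwanted chromosomes. The sketch below shows one way to restrict the test to column 1. The toy file, its read names and the helper name `keep_main_chroms` are made up for illustration and are not part of the original pipeline.

```shell
# Toy BED file: columns are chrom, start, end, read name, score, strand.
# The read name on the second line contains "_" on purpose.
tmpdir=$(mktemp -d)
printf 'chr1\t100\t150\tread1\t42\t+\n'            >  "$tmpdir/toy.bed"
printf 'chr2\t200\t250\tread_2\t42\t-\n'           >> "$tmpdir/toy.bed"
printf 'chrUn_gl000214\t300\t350\tread3\t42\t+\n'  >> "$tmpdir/toy.bed"
printf 'chr6_cox_hap2\t400\t450\tread4\t42\t-\n'   >> "$tmpdir/toy.bed"

# Keep a line only if column 1 (the chromosome name) has no underscore.
keep_main_chroms() {
    awk -F '\t' '$1 !~ /_/' "$1"
}

keep_main_chroms "$tmpdir/toy.bed" > "$tmpdir/toy.clean_bed"
cat "$tmpdir/toy.clean_bed"
```

On this toy input the column-based filter keeps both the chr1 and chr2 reads, whereas `grep -v "_"` would also discard the chr2 read because of its name. On real data the helper could be called as `keep_main_chroms SRR1042593.bed > SRR1042593.clean_bed`.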

Below is the complete code I used on my own data; it is very simple:

############ Then do peak calling in R by BayesPeak
library(BayesPeak)
library(parallel)
workdir=getwd()
tFile=file.path(workdir,'SRR1042593.clean_bed')
cFile=file.path(workdir,'SRR1042594.clean_bed')
raw.output <- bayespeak(tFile, cFile,use.multicore = TRUE, mc.cores = 8)
output <- summarize.peaks(raw.output, method = "lowerbound")
write.table(as.data.frame(output), file = "Xu_MUT_rep1.txt", quote = FALSE)

Self-taught ChIP-seq analysis, part 6: finding peaks

ulwvfje — Tue, 05 Jul 2016 12:17:31 +0000

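At its core, ChIP-seq is still target-capture sequencing. Unlike WES, it does not use a fixed set of array probes to capture fixed sequences on the genome; what gets captured varies greatly with the IP you choose and with the state of your cells or organism. Those differences are exactly what we study. Finding peaks in ChIP-seq data essentially means aligning all reads to the whole genome and then looking for the places where the sequencing depth stands out across the genome. For WES data, the depth over each exon, or from the 5' end to the 3' end of an exon, should in theory be uniform, somewhere in 50X~200X, so a depth curve should be close to a flat line. For ChIP-seq data, the depth over the captured regions is in theory anything but uniform and should look like a mountain peak. The high-coverage places, the summits, are the hotspots bound by our IP target, i.e. the peaks we are looking for, and this is roughly what you see when viewing the alignments in IGV.

The reads are clearly distributed very unevenly! The IP in a ChIP-seq experiment can be an antibody against any of the various modification sites on the histones, or an antibody against any of a range of transcription factors, and so on.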

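How do we define a hotspot? Informally, hotspots are positions covered many times by sequenced reads (we are sequencing a population of cells, so a read appearing many times means the position has a high chance of being bound by the TF). But how many reads count as "many"? This is where statistical testing comes in. Assume the TF is distributed over the genome with no pattern at all; then the sequenced reads are also distributed randomly over the genome, and the number of reads covering a given base should follow a binomial distribution.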
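The argument can be written out as a short sketch under simplifying assumptions: $n$ reads of length $l$ are placed uniformly at random on a genome of size $G$, and edge effects are ignored. These symbols are introduced here for illustration and do not come from the linked article. $X$ is the number of reads covering a given base and $k$ is the observed count.

```latex
X \sim \mathrm{Binomial}(n, p), \qquad p = \frac{l}{G}

P(X \ge k) = 1 - \sum_{i=0}^{k-1} \binom{n}{i}\, p^{i} (1-p)^{n-i}

% For large n and small p, with \lambda = np:
P(X \ge k) \approx 1 - \sum_{i=0}^{k-1} \frac{e^{-\lambda} \lambda^{i}}{i!}
```

A position is called a hotspot when this tail probability falls below the chosen cutoff. The second form is the usual Poisson approximation to the binomial.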

For the statistical details, go straight to the original post: http://www.plob.org/2014/05/08/7227.html

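To reproduce the results in the authors' paper, I went through 8 tools: MACS2/HOMER/SICERpy/PePr/SWEMBL/SISSRs/BayesPeak/PeakRanger. I will not walk through the installation and usage of each peak caller here. Since MACS2 is the most commonly used, I will just paste my MACS2 learning code: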

## step6 : peak calling
### step6.1: with MACS2
## I first read the manual:
macs2 callpeak -t TF_1.bam -c Input.bam -n mypeaks
We used the following options:
-t: This is the only required parameter for MACS, refers to the name of the file with the ChIP-seq data
-c: The control or mock data file
-n: The name string of the experiment
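MACS2 creates 4 files (mypeaks_peaks.narrowPeak, mypeaks_summits.bed, mypeaks_peaks.xls and mypeaks_model.r)
# MACS first builds a model whose key parameter is d, the estimated fragment size (the distance between the forward-strand and reverse-strand read peaks). Note that d is not the same thing as bw (band width), which is only the window used to scan the genome while building the model; half of d is the shiftsize.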

### Then, based on the paper, I worked out which downloaded run corresponds to which sample
GSM1278641 Xu_MUT_rep1_BAF155_MUT SRR1042593
GSM1278642 Xu_MUT_rep1_Input SRR1042594
GSM1278643 Xu_MUT_rep2_BAF155_MUT SRR1042595
GSM1278644 Xu_MUT_rep2_Input SRR1042596
GSM1278645 Xu_WT_rep1_BAF155 SRR1042597
GSM1278646 Xu_WT_rep1_Input SRR1042598
GSM1278647 Xu_WT_rep2_BAF155 SRR1042599
GSM1278648 Xu_WT_rep2_Input SRR1042600
## Something odd here: the input sequencing data is actually larger than the IP sequencing data???
848M Jun 28 14:31 SRR1042593.bam
2.7G Jun 28 14:52 SRR1042594.bam
716M Jun 28 14:58 SRR1042595.bam
2.9G Jun 28 15:20 SRR1042596.bam
1.1G Jun 28 15:28 SRR1042597.bam
2.6G Jun 28 15:48 SRR1042598.bam
1.2G Jun 28 15:58 SRR1042599.bam
3.5G Jun 28 16:26 SRR1042600.bam
## I have not figured out why
## http://www2.uef.fi/documents/1698400/2466431/Macs2/f4d12870-34f9-43ef-bf0d-f5d087267602
## http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3120977/
## I first ran the following commands:
nohup time ~/.local/bin/macs2 callpeak -c SRR1042594.bam -t SRR1042593.bam -f BAM -B -g hs -n Xu_MUT_rep1 2>Xu_MUT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042596.bam -t SRR1042595.bam -f BAM -B -g hs -n Xu_MUT_rep2 2>Xu_MUT_rep2.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042598.bam -t SRR1042597.bam -f BAM -B -g hs -n Xu_WT_rep1 2>Xu_WT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042600.bam -t SRR1042599.bam -f BAM -B -g hs -n Xu_WT_rep2 2>Xu_WT_rep2.masc2.log &
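The resulting peaks were pitifully few. My first guess was that this was because I had not sorted the aligned bam files.
## forgot to sort the bam files:
## So I sorted the bam files, built the index, and ran it again!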
nohup time ~/.local/bin/macs2 callpeak -c SRR1042594.sorted.bam -t SRR1042593.sorted.bam -f BAM -B -g hs -n Xu_MUT_rep1 2>Xu_MUT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042596.sorted.bam -t SRR1042595.sorted.bam -f BAM -B -g hs -n Xu_MUT_rep2 2>Xu_MUT_rep2.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042598.sorted.bam -t SRR1042597.sorted.bam -f BAM -B -g hs -n Xu_WT_rep1 2>Xu_WT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042600.sorted.bam -t SRR1042599.sorted.bam -f BAM -B -g hs -n Xu_WT_rep2 2>Xu_WT_rep2.masc2.log &
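## The peaks were exactly the same as those from the unsorted bam files above, so that was not the cause

## Next I suspected that the authors had swapped the input and IP labels when uploading the data, so I manually swapped them back

## Then swap the control and treatment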
nohup time ~/.local/bin/macs2 callpeak -t SRR1042594.sorted.bam -c SRR1042593.sorted.bam -f BAM -B -g hs -n Xu_MUT_rep1 2>Xu_MUT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -t SRR1042596.sorted.bam -c SRR1042595.sorted.bam -f BAM -B -g hs -n Xu_MUT_rep2 2>Xu_MUT_rep2.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -t SRR1042598.sorted.bam -c SRR1042597.sorted.bam -f BAM -B -g hs -n Xu_WT_rep1 2>Xu_WT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -t SRR1042600.sorted.bam -c SRR1042599.sorted.bam -f BAM -B -g hs -n Xu_WT_rep2 2>Xu_WT_rep2.masc2.log &

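## Result: no peaks at all!!!! So it seems the authors did not get it wrong

## Next I suspected my own samtools view -bhS -q 30 filtering of the sam files; that cutoff might be too strict!!

## then just use the sam files.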
nohup time ~/.local/bin/macs2 callpeak -c SRR1042594.sam -t SRR1042593.sam -f SAM -B -g hs -n Xu_MUT_rep1 2>Xu_MUT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042596.sam -t SRR1042595.sam -f SAM -B -g hs -n Xu_MUT_rep2 2>Xu_MUT_rep2.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042598.sam -t SRR1042597.sam -f SAM -B -g hs -n Xu_WT_rep1 2>Xu_WT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042600.sam -t SRR1042599.sam -f SAM -B -g hs -n Xu_WT_rep2 2>Xu_WT_rep2.masc2.log &
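## That did not yield many more peaks either; in the end the only thing I could think of was that my p-value cutoff was too strict
## then change the criteria for p values :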

https://github.com/taoliu/MACS/

nohup time ~/.local/bin/macs2 callpeak -c SRR1042594.sam -t SRR1042593.sam -f SAM -p 0.01 -g hs -n Xu_MUT_rep1 2>Xu_MUT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042596.sam -t SRR1042595.sam -f SAM -p 0.01 -g hs -n Xu_MUT_rep2 2>Xu_MUT_rep2.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042598.sam -t SRR1042597.sam -f SAM -p 0.01 -g hs -n Xu_WT_rep1 2>Xu_WT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042600.sam -t SRR1042599.sam -f SAM -p 0.01 -g hs -n Xu_WT_rep2 2>Xu_WT_rep2.masc2.log &
## I relaxed the p-value cutoff considerably, and the result was a huge pile of peaks
18919 Xu_MUT_rep1_peaks.xls
36277 Xu_MUT_rep2_peaks.xls
32494 Xu_WT_rep1_peaks.xls
56080 Xu_WT_rep2_peaks.xls
The problem is that these peaks are basically all false positives!!!
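If the numbers above come from `wc -l`, they count every line in the `.xls` file, not just the peak rows. As far as I know, MACS2 writes `#` comment lines, a blank line and a column header before the peaks, so a stricter count could be sketched as follows. The toy file only imitates that assumed layout, and `count_peaks` is a hypothetical helper.

```shell
# Toy file imitating the assumed layout of a MACS2 *_peaks.xls file:
# "#" comment lines, one blank line, a header row, then one row per peak.
tmpdir=$(mktemp -d)
xls="$tmpdir/toy_peaks.xls"
{
    printf '# This file is generated by MACS\n'
    printf '# name = toy\n'
    printf '\n'
    printf 'chr\tstart\tend\tlength\tabs_summit\tpileup\t-log10(pvalue)\tfold_enrichment\t-log10(qvalue)\tname\n'
    printf 'chr1\t1000\t1200\t201\t1100\t25.0\t12.3\t6.1\t9.8\ttoy_peak_1\n'
    printf 'chr2\t5000\t5300\t301\t5150\t40.0\t20.1\t8.4\t17.2\ttoy_peak_2\n'
} > "$xls"

# Count peak rows only: skip comments, blank lines and the header row,
# whose first field is the literal string "chr".
count_peaks() {
    awk -F '\t' '!/^#/ && NF > 0 && $1 != "chr"' "$1" | wc -l | tr -d ' '
}

count_peaks "$xls"
```

On real output the call would be, for example, `count_peaks Xu_MUT_rep1_peaks.xls`. The difference from `wc -l` is only a handful of lines per file, so it does not change the conclusion here.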
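I manually checked several of the peaks called under the earlier, strict filtering, and there the sequencing depth really does form two mountain-shaped curves
## manually check some peaks ## chr1 121484235 121485608
## macs results :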
samtools depth -r chr10:42385331-42385599 SRR1042593.sorted.bam
samtools depth -r chr10:42385331-42385599 SRR1042594.sorted.bam
samtools depth -r chr20:45810382-45810662 SRR1042593.sorted.bam
samtools depth -r chr20:45810382-45810662 SRR1042594.sorted.bam
## I also checked the peaks reported in the paper, but in my alignments they do not look like peaks at all by eye, so I am rather torn~~~~
paper results:
chr20 45796362 46384917
chr1 121482722 121485861
samtools depth -r chr1:121482722-121485861 SRR1042593.sorted.bam
samtools depth -r chr1:121482722-121485861 SRR1042594.sorted.bam
samtools depth -r chr20:45796362-46384917 SRR1042593.sorted.bam
samtools depth -r chr20:45796362-46384917 SRR1042594.sorted.bam
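`samtools depth` prints one line per position (chromosome, position, depth), which is hard to judge by eye over long regions such as the chr20 interval above. A minimal sketch for condensing that output into the number of reported positions, the mean depth and the maximum depth follows. The depth values are invented toy data, and `summarize_depth` is a hypothetical helper, not part of samtools.

```shell
# Toy input in the three-column format printed by `samtools depth`:
# chromosome, 1-based position, depth.
tmpdir=$(mktemp -d)
printf 'chr10\t42385331\t3\n' >  "$tmpdir/ip.depth"
printf 'chr10\t42385332\t9\n' >> "$tmpdir/ip.depth"
printf 'chr10\t42385333\t6\n' >> "$tmpdir/ip.depth"

# Print: number of reported positions, mean depth, maximum depth (tab-separated).
summarize_depth() {
    awk -F '\t' '
        { sum += $3; if ($3 > max) max = $3 }
        END {
            if (NR > 0) printf "%d\t%.2f\t%d\n", NR, sum / NR, max
            else        print "0\t0.00\t0"
        }
    ' "$1"
}

summarize_depth "$tmpdir/ip.depth"
```

With real data the helper can read from a pipe by passing `-` as the file name, e.g. `samtools depth -r chr10:42385331-42385599 SRR1042593.sorted.bam | summarize_depth -`. Running it on the IP and the input BAM for the same region gives two rows that can be compared directly. Note that, by default, `samtools depth` omits positions with zero coverage, so the mean is taken over reported positions only.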

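Unfortunately, in the end I still could not reproduce the authors' results, and I have not worked out why. I also tried BayesPeak and PeakRanger, and the results were not great either.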

A comprehensive list of peak finders: http://wodaklab.org/nextgen/data/peakfinders.html

Peak Calling for ChIP-Seq: http://epigenie.com/guide-peak-calling-for-chip-seq/
