生信菜鸟团 » peaks

A hands-on ChIP-seq walkthrough: super simple, done in 2 hours!

ulwvfje — Tue, 10 Jan 2017 03:14:22 +0000

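Please do not copy my code directly. Work through it, type it out yourself, and think about why I wrote it this way.

Use the latest version of each tool, especially tools such as samtools that I keep on my system PATH. Because so many people read this, for the other tools I generally include the version number in the path.

It took me two hours, but that does not mean you will learn it in two hours. Some readers have told me it took them two weeks, which is perfectly normal. Do not expect to reach my level in two hours.

The paper chosen for this walkthrough studies protein complexes such as PRC1 and PRC2. It is not a transcription factor or histone ChIP-seq, so keep that difference in mind!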

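This post is part of a series. You may want to read these first:

A worked example of expression microarray data processing

A hands-on RNA-seq walkthrough: super simple, done in 2 hours!

WES (7): looking at de novo variants

[Live] My Genome 22: using IGV to check whether a specific site is mutated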

The paper is: RYBP and Cbx7 define specific biological functions of polycomb complexes in mouse embryonic stem cells

https://www.ncbi.nlm.nih.gov/pubmed/23273917

RYBP and Cbx7 are both components of Polycomb repressive complex 1 (PRC1).

All the data are at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42466

So a script can batch-download everything from the FTP site:

ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP017/SRP017311

The download URLs are easy to work out!

for ((i=204;i<=209;i++)) ;do wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP017/SRP017311/SRR620$i/SRR620$i.sra;done

ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 $id;done


I checked the data quality with fastqc, using this code:

ls *fastq |xargs ~/biosoft/fastqc/FastQC/fastqc -t 10

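The 3' ends showed some quality problems, so I used the -3 5 --local options (trim 5 bases from the 3' end of each read and align in local mode).

First, align the FASTQ files to the mm10 reference genome with bowtie2: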

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -3 5 --local -x ~/reference/index/bowtie/mm10 -U SRR620204.fastq| samtools sort -O bam -o ring1B.bam

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -3 5 --local -x ~/reference/index/bowtie/mm10 -U SRR620205.fastq| samtools sort -O bam -o cbx7.bam

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -3 5 --local -x ~/reference/index/bowtie/mm10 -U SRR620206.fastq| samtools sort -O bam -o suz12.bam

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -3 5 --local -x ~/reference/index/bowtie/mm10 -U SRR620207.fastq| samtools sort -O bam -o RYBP.bam

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -3 5 --local -x ~/reference/index/bowtie/mm10 -U SRR620208.fastq| samtools sort -O bam -o IgGold.bam

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -3 5 --local -x ~/reference/index/bowtie/mm10 -U SRR620209.fastq| samtools sort -O bam -o IgG.bam

Next, the BAM files should get some simple filtering to remove unaligned and multi-mapped reads, but I was lazy and went straight to calling peaks with MACS2!

nohup ~/.local/bin/macs2 callpeak -c ../IgGold.bam -t ../suz12.bam -m 10 30 -p 1e-5 -f BAM -g mm -n suz12 2>suz12.masc2.log &

nohup ~/.local/bin/macs2 callpeak -c ../IgGold.bam -t ../ring1B.bam -m 10 30 -p 1e-5 -f BAM -g mm -n ring1B 2>ring1B.masc2.log &

nohup ~/.local/bin/macs2 callpeak -c ../IgG.bam -t ../cbx7.bam -m 10 30 -p 1e-5 -f BAM -g mm -n cbx7 2>cbx7.masc2.log &

nohup ~/.local/bin/macs2 callpeak -c ../IgG.bam -t ../RYBP.bam -m 10 30 -p 1e-5 -f BAM -g mm -n RYBP 2>RYBP.masc2.log &

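As you can see, I got almost no peaks for the RYBP ChIP-seq, even after switching to a different control, unless I used no control at all! I looked at it in IGV and the RYBP data really do look strange. I suspect the authors made a mistake when uploading the data!

Also, the peak counts in the files the authors deposited in GEO are: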

2754 GSE42466_Cbx7_peaks_10.txt
6982 GSE42466_Ring1b_peaks_10.txt
6872 GSE42466_RYBP_peaks_5.txt
8054 GSE42466_Suz12_peaks_10.txt

First, batch-convert the BAM files to bw (bigWig) files, then batch-plot them.

ls ../*bam |while read id; do

file=$(basename $id )

sample=${file%%.*}

echo $sample

bamCoverage -b $id -o $sample.bw ## options worth considering here: -p 10 --normalizeUsingRPKM (deepTools 3 and later spell it --normalizeUsing RPKM)

computeMatrix reference-point --referencePoint TSS -b 10000 -a 10000 -R ~/annotation/CHIPseq/mm10/ucsc.refseq.bed -S $sample.bw --skipZeros -o matrix1_${sample}_TSS.gz --outFileSortedRegions regions1_${sample}_genes.bed

plotHeatmap -m matrix1_${sample}_TSS.gz -out ${sample}.png

done

Then combine the signal tracks (bw files) from all the ChIP-seq BAM files and plot the profile and heatmap around gene TSSs:

computeMatrix reference-point -p 10 --referencePoint TSS -b 2000 -a 2000 -S ../*bw -R ~/annotation/CHIPseq/mm10/ucsc.refseq.bed --skipZeros -o tmp4.mat.gz

plotHeatmap -m tmp4.mat.gz -out tmp4.merge.png

plotProfile --dpi 720 -m tmp4.mat.gz -out tmp4.profile.pdf --plotFileFormat pdf --perGroup

plotHeatmap --dpi 720 -m tmp4.mat.gz -out tmp4.merge.pdf --plotFileFormat pdf

Finally, combine the signal tracks from all the ChIP-seq BAM files and plot the profile and heatmap over gene bodies:

computeMatrix scale-regions -p 10 -S ../*bw -R ~/annotation/CHIPseq/mm10/ucsc.refseq.bed -b 3000 -a 3000 -m 5000 --skipZeros -o tmp5.mat.gz

plotHeatmap -m tmp5.mat.gz -out tmp5.merge.png

plotProfile --dpi 720 -m tmp5.mat.gz -out tmp5.profile.pdf --plotFileFormat pdf --perGroup

plotHeatmap --dpi 720 -m tmp5.mat.gz -out tmp5.merge.pdf --plotFileFormat pdf

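Below is an example of the output figures. I only show the ones around the TSS!

The figure shows that RYBP peaks are centered on the TSS, whereas the other peaks sit slightly downstream of the TSS.

Sequential ChIP (re-ChIP) experiments do show that RYBP and CBX7 peaks overlap.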

The paper goes over the overlaps between the peaks of these ChIP-seq experiments again and again:

PRC1 has an unusually complex composition, including Cbx (Cbx2, Cbx4, Cbx6, Cbx7, or Cbx8); Ring1A or Ring1B; PHC (PHC1, PHC2, or PHC3); PCGF (PCGF1, PCGF2, PCGF3, PCGF4, PCGF5, or PCGF6); and RYBP or YAF2.
Among these is a Ring1A/B E3 ligase subunit that monoubiquitinates histone H2A at lysine 119 (H2AK119ub).
Not all of these components have to be present. Different combinations form different kinds of PRC1, but they are all collectively called PRC1.
For example, mouse ESCs have two kinds of PRC1, in which Cbx7 and RYBP cannot coexist! We can call them Cbx7-PRC1 and RYBP-PRC1.
Cbx7 recruits Ring1B to chromatin and is required for this. The genes it binds are mostly involved in early-lineage commitment of ESCs.
RYBP enhances the enzymatic activity of PRC1. The genes it binds are mostly involved in regulation of metabolism and cell-cycle progression.
Genes bound by RYBP are expressed at higher levels than genes bound by CBX7, because CBX7 binding also recruits the repressive PRC2.
PRC2 deposits the histone H3 lysine 27 trimethyl repressive mark (H3K27me3) through the Ezh1/2 histone methyltransferase enzymes.

How does the paper describe the overlaps between these peaks?
We observed an overlap of RYBP peaks (3,918 in total) with 14%, 42%, and 37% of Cbx7, Ring1B, and Suz12 peaks, respectively
Moreover, although more than 90% of Cbx7 peaks contained Ring1B and Suz12, 20% were also bound by RYBP
Although RYBP and Cbx7 are mutually exclusive in most cases, they do colocalize in a small fraction of genomic regions.

The Ring1B / Suz12 peaks can be explained by the Cbx7 and RYBP peaks:
Where both RYBP and Cbx7 are present, Ring1B/Suz12 levels are high.
Where Cbx7 is present without RYBP, Ring1B/Suz12 are slightly lower.
Where RYBP is present without Cbx7, Ring1B/Suz12 are lower still.
Where neither RYBP nor Cbx7 is present, Ring1B/Suz12 are lowest!

RYBP also has some peaks that the other PRC1 components lack, which suggests it can act independently of PRC1.
H2AK119ub correlates positively with Ring1B/Suz12, but it overlaps only 25.7% with RYBP versus 72% with CBX7, so the PRC1 target genes can be split into 3 classes:
a first set with Cbx7/Ring1B/H2AK119ub;
a second that contains RYBP and lower levels of Ring1B/H2AK119ub
a third set cobound by RYBP/Cbx7/Ring1B and that also contains H2AK119ub.
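
Overlap percentages like the ones quoted from the paper can be estimated from two peak files. Here is a minimal sketch in plain shell and awk, assuming both inputs are tab-separated BED-like files (chrom, start, end). The function name `overlap_fraction` is made up for illustration; in practice a dedicated tool such as bedtools would normally do this, and this is not necessarily how the authors computed their numbers.

```shell
# overlap_fraction A.bed B.bed
# Prints how many intervals in A overlap at least one interval in B.
# Simple all-against-all scan: fine for a sketch, slow for very large files.
# Assumes B is not empty.
overlap_fraction() {
    awk 'BEGIN { FS = "\t" }
         NR == FNR { n++; chr[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; next }   # load B
         { total++
           for (i = 1; i <= n; i++)
               if (chr[i] == $1 && s[i] < $3 + 0 && $2 + 0 < e[i]) { hit++; break }
         }
         END { if (total) printf "%d/%d (%.1f%%)\n", hit, total, 100 * hit / total }' "$2" "$1"
}

# Toy demo with made-up intervals (not peaks from the paper).
a=$(mktemp); b=$(mktemp)
printf 'chr1\t100\t200\nchr1\t500\t600\n' > "$a"
printf 'chr1\t150\t250\n' > "$b"
overlap_fraction "$a" "$b"
rm -f "$a" "$b"
```

Swapping the two arguments gives the fraction measured from the other file's side, which is why a paper can report different percentages for the same pair of factors.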

All of these gene lists can then be used for GO/KEGG analysis to see whether they carry any biological meaning!
genes co-occupied by Ring1B/Cbx7/RYBP and H2AK119ub are involved in system development.
genes containing RYBP/Ring1B/H2AK119ub, but not Cbx7, have a strong association with the M phase of the meiotic cycle and cellular metabolism
genes with Cbx7/Ring1B/H2AK119ub are involved in developmental processes and mesoderm specification,
those containing RYBP/Cbx7/Ring1B/H2AK119ub predominantly represent the ectodermal fate and, to a lesser extent, mesoderm and endoderm fates

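More than 700 genes have RYBP/Cbx7/Ring1B peaks, so the authors knocked out Cbx7 to see whether the RYBP peaks would change. They did not do ChIP-seq for this, though, only ChIP-qPCR.

The following conclusion is important: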
Overall, our ChIP-seq analysis allowed us to identify five types of genes according to the occupancy of PRC1 and PRC2: those with
(1) Ring1B/Cbx7/RYBP and Suz12 (725 genes);
(2) Ring1B/Cbx7/Suz12, but not RYBP (1,527 genes);
(3) Ring1B/RYBP/Suz12, but not Cbx7 (861 genes);
(4) only Ring1B and Suz12 (1,694 genes); or
(5) RYBP but no Polycomb proteins (1,674)

ChIP-Seq literature data re-analysis and interpretation: example 2

ulwvfje — Thu, 14 Jul 2016 12:26:58 +0000

paper:2014-BRCA1-PALB2-CHIP-seq:http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4194113/

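This paper was recommended by a friend. I think it makes ideal ChIP-seq learning material, so I am recommending it to everyone. It explores the transcriptional co-activation mechanism of BRCA1 and PALB2 genome-wide. Be sure to read my ChIP-seq self-study tutorial series first and work through it carefully! The data are as follows: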

GSM997540 BRCA1 SRR553473.sra Read 18878514 spots

GSM997541 PALB2 SRR553474.sra Read 17615498 spots

GSM997542 P_Ser2 SRR553475.sra Read 35396009 spots

There is no input to serve as a control, but the sequencing depth is sufficient. First, download the data the authors uploaded to NCBI:

nohup wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX182/SRX182682/SRR553473/SRR553473.sra &

nohup wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX182/SRX182683/SRR553474/SRR553474.sra &

nohup wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX182/SRX182684/SRR553475/SRR553475.sra &

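Since there are only 3 datasets, I did not write a batch script. I needed to look at each dataset's description anyway. I then batch-extracted the data, ran QC, and did the alignment. The code is as follows: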

ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump $id;done

rm *sra

ls *.fastq | while read id ; do ~/biosoft/fastqc/FastQC/fastqc $id;done

### 36 bp   45 GC%

## cat >runBowtie2.sh

ls *.fastq | while read id ;

do

echo $id

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 8 -x ~/biosoft/bowtie/hg19_index/hg19 -U $id   -S ${id%%.*}.sam  2>${id%%.*}.align.log;

samtools view -bhS -q 30  ${id%%.*}.sam > ${id%%.*}.bam

#  -F 1548 https://broadinstitute.github.io/picard/explain-flags.html

#  -F 0x4  remove the reads that didn't match

samtools sort -o ${id%%.*}.sorted.bam ${id%%.*}.bam  ## samtools >= 1.3 syntax; the old 0.1.x releases took an output prefix instead

#samtools view  -bhS     a.sam | samtools sort -o  -  ./ > a.bam

samtools index ${id%%.*}.sorted.bam

done
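
The `${id%%.*}` expansion used throughout these loops is worth a closer look. The sketch below shows what it does; the helper name `sample_name` is made up for illustration. One thing to watch: `%%.*` removes everything from the first dot onward, so it only works on bare file names, not on relative paths such as `../suz12.bam`.

```shell
# sample_name PATH : strip the directory, then everything from the first dot on.
sample_name() {
    f=$(basename "$1")
    printf '%s\n' "${f%%.*}"
}

id=SRR553473.sorted.bam
echo "${id%%.*}"    # longest ".*" suffix removed  -> SRR553473
echo "${id%.*}"     # shortest ".*" suffix removed -> SRR553473.sorted
echo "${id##*.}"    # longest "*." prefix removed  -> bam

# Pitfall: a leading "../" contains dots, so the whole string is removed.
p=../suz12.bam
echo "[${p%%.*}]"   # -> []
sample_name "$p"    # basename first -> suz12
```

This is why a loop that lists files with `ls ../*bam` needs `basename` before the `%%.*` expansion.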

Reference: http://cbsu.tc.cornell.edu/lab/doc/CHIPseq_workshop_20150504_lecture1.pdf

There is an interesting discussion worth following on whether two IPs can share the same input: http://seqanswers.com/forums/showthread.php?t=35377

The authors also uploaded a BigWiggle file to NCBI, which they describe as follows: BigWiggle files for every ChIP-Seq were generated using Bed Tools and the utility bedGraphToBigWig (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/); these tracks were then uploaded into the UCSC Genome Browser.

Paper: 2012, Peak identification for ChIP-seq data with no controls: http://www.ncbi.nlm.nih.gov/pubmed/23266983. I downloaded the file as well but did not open it, because the authors say it was aligned to the hg18 reference genome. I find it very odd for a 2014 paper to use hg18, but I will not go on about it. I processed the data myself anyway, and next used MACS2 to call peaks:

nohup time ~/.local/bin/macs2 callpeak -t SRR553473.sorted.bam -f BAM -g hs -n BRCA1    2>BRCA1.masc2.log &

nohup time ~/.local/bin/macs2 callpeak -t SRR553474.sorted.bam -f BAM -g hs -n PALB2    2>PALB2.masc2.log &

nohup time ~/.local/bin/macs2 callpeak -t SRR553475.sorted.bam -f BAM -g hs -n P_Ser2    2>P_Ser2.masc2.log &

I actually prefer PeakRanger, but there is no control here, so its tools cannot be used: ranger / ccat / bcp

## this command fails (no control)
~/biosoft/PeakRanger/bin/peakranger ranger --format bam SRR553473.sorted.bam \
--report --gene_annot_file hg19refGene.txt -q 0.05 -t 4

There are some discussions comparing the various peak callers that are worth a look.

Some peak callers work without control data and assume an even background signal, others make use of blacklist tools, that mask regions of the genome e.g. RepeatMasker and the “Duke excluded regions” list that was developed for the ENCODE project.

http://epigenie.com/guide-peak-calling-for-chip-seq/

The next step uses CEAS, which needs wig files:

## change sort bam files to wig files

nohup samtools depth SRR553473.sorted.bam | perl -ne 'BEGIN{ print "track type=wiggle_0 name=SRR553473 description=SRR553473\n"}; ($c, $start, $depth) = split; if ($c ne $lastC) { print "variableStep chrom=$c span=10\n"; };$lastC=$c; next unless $. % 10 ==0;print "$start\t$depth\n" unless $depth<3' > SRR553473.wig    &

nohup samtools depth SRR553474.sorted.bam | perl -ne 'BEGIN{ print "track type=wiggle_0 name=SRR553474 description=SRR553474\n"}; ($c, $start, $depth) = split; if ($c ne $lastC) { print "variableStep chrom=$c span=10\n"; };$lastC=$c; next unless $. % 10 ==0;print "$start\t$depth\n" unless $depth<3' > SRR553474.wig    &

nohup samtools depth SRR553475.sorted.bam | perl -ne 'BEGIN{ print "track type=wiggle_0 name=SRR553475 description=SRR553475\n"}; ($c, $start, $depth) = split; if ($c ne $lastC) { print "variableStep chrom=$c span=10\n"; };$lastC=$c; next unless $. % 10 ==0;print "$start\t$depth\n" unless $depth<3' > SRR553475.wig    &
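
The perl one-liners are dense, so here is the same logic spelled out step by step in awk. This is only a sketch: the function name `depth_to_wig` is made up, the input is assumed to be the three-column output of `samtools depth` (chrom, position, depth), and the track line is written in the standard `track type=wiggle_0` form.

```shell
# depth_to_wig NAME : read "chrom<TAB>pos<TAB>depth" on stdin, write a wig track.
depth_to_wig() {
    awk -v name="$1" '
        BEGIN { print "track type=wiggle_0 name=" name " description=" name }
        # start a new variableStep block whenever the chromosome changes
        $1 != last { print "variableStep chrom=" $1 " span=10"; last = $1 }
        # keep every 10th input line, and only if the depth is at least 3
        NR % 10 == 0 && $3 >= 3 { print $2 "\t" $3 }'
}

# Toy demo: 30 consecutive positions on a made-up chromosome, depth 5 each.
awk 'BEGIN { for (i = 1; i <= 30; i++) printf "chr1\t%d\t5\n", i }' | depth_to_wig demo
```

Sampling by line number is a shortcut. By default `samtools depth` skips positions with zero coverage, so every 10th line is not always every 10th base, and `span=10` is only an approximation of the signal.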

CEAS also needs the peaks in BED format, whereas MACS2 writes its own format, so I wrote a script to convert them:

## cat >xls2bed.sh

ls *.xls | while read id ;

do

echo $id

grep '^chr\S' $id |perl -alne '{print "$F[0]\t$F[1]\t$F[2]\t$F[9]\t$F[7]\t+"}' >${id%%.*}.bed

done

bash xls2bed.sh
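
For comparison, here is an awk version of the same conversion. It is a sketch under stated assumptions: it expects the usual MACS2 callpeak column order in the xls file (chr, start, end, length, abs_summit, pileup, -log10(pvalue), fold_enrichment, -log10(qvalue), name), and the function name `macs2_xls_to_bed` is made up. Unlike the perl line above, it subtracts 1 from the start, because MACS2 xls coordinates are 1-based while BED starts are 0-based.

```shell
# Read a MACS2 *_peaks.xls on stdin, write 6-column BED on stdout.
macs2_xls_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ || NF == 0 || $1 == "chr" { next }   # skip comments, blanks, header
         { print $1, $2 - 1, $3, $10, $8, "+" }'   # name in col 4, fold enrichment in col 5
}

# Demo on one made-up record (not real data).
printf 'chr\tstart\tend\tlength\tabs_summit\tpileup\t-log10(pvalue)\tfold_enrichment\t-log10(qvalue)\tname\nchr1\t1001\t1200\t200\t1100\t25\t12.5\t6.3\t9.8\tdemo_peak_1\n' | macs2_xls_to_bed
```

A one-base shift rarely matters for plotting, but it does matter when the BED file is compared against other interval files.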

After that it is easy: use CEAS to draw some plots:

cd ~/CHIPseq_test/annotation

nohup ~/.local/bin/ceas --name=BRCA1_ceas --pf-res=20 --gn-group-names='Top 10%,Bottom 10%'  -g hg19.refGene \
-b ~/CHIPseq_test/BRAC1-PALB2/raw/BRCA12_peaks.bed -w  ~/CHIPseq_test/BRAC1-PALB2/raw/SRR553473.wig  2>BRCA1.ceas.log &

nohup ~/.local/bin/ceas --name=PALB2_ceas --pf-res=20 --gn-group-names='Top 10%,Bottom 10%'  -g hg19.refGene \
-b ~/CHIPseq_test/BRAC1-PALB2/raw/PALB22_peaks.bed -w  ~/CHIPseq_test/BRAC1-PALB2/raw/SRR553474.wig  2>PALB2.ceas.log &

nohup ~/.local/bin/ceas --name=P_Ser2_ceas --pf-res=20 --gn-group-names='Top 10%,Bottom 10%'  -g hg19.refGene \
-b ~/CHIPseq_test/BRAC1-PALB2/raw/P_Ser22_peaks.bed -w  ~/CHIPseq_test/BRAC1-PALB2/raw/SRR553475.wig  2>P_Ser2.ceas.log &

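I also used the web tool ChIPseek to visualize the ChIP-seq peak results.

The results stay available for one month, so feel free to click through and take a look (started 12 July 2016): http://chipseek.cgu.edu.tw/main_menu.py?job_id=1468305524.156

Alternatively, you may use the job ID: 1468305524.156 to visit ChIPseek later.

This is basically my ChIP-seq self-study tutorial series put into practice!!!

ChIP-seq self-study, lecture 9: a complete guide to ChIP-seq visualization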

ChIP-Seq literature data re-analysis and interpretation: example 1

ulwvfje — Wed, 13 Jul 2016 14:50:22 +0000

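The paper is: Genome-wide maps of H3K4me2/3 in prostate cancer cell line LNCaP. The data can be downloaded from GEO under GSE20042. All of the analysis below needs 26 GB of disk space.

The authors wanted to see how histone methylation changes after treating the cancer cell line LNCaP with dihydrotestosterone (an androgen), mainly comparing the two marks H3K4me2 and H3K4me3. There are three groups: before treatment, and 4 h and 16 h after treatment. That gives 5 conditions in total, but 7 FASTQ files.

The sequencer is: Illumina Genome Analyzer (Homo sapiens)

The main goal is to analyze differences in nucleosome positioning: Model for identifying differential transcription factor binding locations

The software the authors used for the analysis (NPS) is quite old. It also comes from Xiaole Liu's lab at Harvard, and I will not use it here.

The data processing details are as follows: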

Bed: Sequence reads were obtained and mapped to the human genome (March, 2006) using the Illumina Genome Analyzer Pipeline.
Peaks: Peak detection was performed with the "Nucleosome Positioning from Sequencing (NPS)" algorithm (http://liulab.dfci.harvard.edu/NPS/)
Processed data file build: hg18

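So I redid the analysis myself, using hg19 and MACS2.

The authors also generated microarray data (Affymetrix U133 Plus 2.0 microarray data), but they do not seem to give an accession for it, so we will ignore it for now.

First, download the data: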

cd ~/CHIPseq_test/

mkdir GSE20042_H3K4me2_3 && cd GSE20042_H3K4me2_3

mkdir rawData && cd rawData

for ((i=146;i<153;i++)) ;do wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP002/SRP002077/SRR037$i/SRR037$i.sra;done

GSM503903 H3K4me2_Vehicle_ChIPSeq SRR037146(Read 7,530,267 spots )/SRR037147(Read 6,215,981 spots )

GSM503904 H3K4me2_DHT_4h_ChIPSeq SRR037148(Read 6,510,159 spots )/SRR037149(Read 6,246,716 spots )

GSM503905 H3K4me2_DHT_16h_ChIPSeq SRR037150 Read 9,685,845 spots

GSM503906 H3K4me3_Vehicle_ChIPSeq SRR037151 Read 6,755,854 spots

GSM503907 H3K4me3_DHT_4h_ChIPSeq SRR037152 Read 4,761,769 spots

## The sequencing depth is low because the paper is fairly old; nowadays about 20M reads is typical

ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump $id;done

rm *sra

ls *.fastq | while read id ; do ~/biosoft/fastqc/FastQC/fastqc $id;done

mkdir QC_results

mv *zip *html QC_results/

## Next, do the alignment

## cat >run_bowtie2.sh   run this script to batch the alignments

ls *.fastq | while read id ;

do

echo $id

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -3 5 -p 8 -x ~/biosoft/bowtie/hg19_index/hg19 -U $id   -S ${id%%.*}.sam  2>${id%%.*}.align.log;

samtools view -bhS -q 30  ${id%%.*}.sam > ${id%%.*}.bam  ## -F 1548 https://broadinstitute.github.io/picard/explain-flags.html

samtools sort -o ${id%%.*}.sorted.bam ${id%%.*}.bam  ## samtools >= 1.3 syntax; the old 0.1.x releases took an output prefix instead

samtools index ${id%%.*}.sorted.bam

done

Then download the nucleosome positioning (peak) results from GEO:

tar xvf GSE20042_RAW.tar

ls *gz |xargs gunzip

wc -l *txt

  235639 GSM503903_LNCaP_H3K4me2_Vehicle_2Lanes_normalized_peak.txt

  248570 GSM503904_LNCaP_H3K4me2_DHT_4h_2Lanes_normalized_peak.txt

  185892 GSM503905_LNCaP_H3K4me2_DHT_16h_peak.txt

   74491 GSM503906_LNCaP_H3K4me3_Vehicle_normalized_peak.txt

  104022 GSM503907_LNCaP_H3K4me3_DHT_4h_normalized_peak.txt

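Then I visualized these nucleosome peaks against my aligned BAM files. It looked very strange: I could not tell how the authors had found these peaks, because nothing was visible when I plotted them. Only later did I realize that the peak coordinates are hg18 while I was plotting from my own hg19 BAM files, so the two do not line up.

The plotting code is as follows: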

Rscript ~/CHIPseq_test/peakView.R GSM503907_LNCaP_H3K4me3_DHT_4h_normalized_peak.bed ../rawData/SRR037152.sorted.bam

peakView.R is very simple: it uses samtools depth to pull out the per-position depth for each peak region and then draws the curve.
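
peakView.R itself is not shown in this post, so the following is only a sketch of the first step such a script needs: turning each BED interval into a `chrom:start-end` region string that `samtools depth -r` accepts. The function name `bed_to_regions` is made up, and the input is assumed to be true 0-based BED (hence the `+ 1` on the start).

```shell
# Read BED on stdin, print one samtools-style region per interval.
bed_to_regions() {
    awk 'BEGIN { FS = "\t" }
         !/^(#|track|browser)/ && NF >= 3 { printf "%s:%d-%d\n", $1, $2 + 1, $3 }'
}

# Demo with one made-up interval.
printf 'chr10\t42385330\t42385599\n' | bed_to_regions

# Typical use (needs samtools and a sorted, indexed BAM):
# bed_to_regions < peaks.bed | while read -r region; do
#     samtools depth -r "$region" sample.sorted.bam
# done
```

The depth values printed for each region are what would then be plotted as one curve per peak.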

Then I called peaks with MACS2 to see what I would get:

# http://www2.uef.fi/documents/1698400/2466431/Macs2/f4d12870-34f9-43ef-bf0d-f5d087267602

ls *sorted.bam |while read id;do ( nohup time ~/.local/bin/macs2 callpeak -t $id -f BAM -g hs -n ${id%%.*} 2>${id%%.*}.masc2.log &) ;done

## batch peak calling on the 7 sequencing files

mkdir ../MACS2results

mv *bed *xls *Peak *r ../MACS2results

cd ../MACS2results

ls *.xls | while read id ; do

echo $id

grep '^chr\S' $id |perl -alne '{print "$F[0]\t$F[1]\t$F[2]\t$F[9]\t$F[7]\t+"}' >${id%%.*}.bed

done

Then browse the peaks again:

Rscript ~/CHIPseq_test/peakView.R SRR037152_peaks.bed ../rawData/SRR037152.sorted.bam

The peaks I called look fairly reliable. I will upload the figures later!

ChIP-seq self-study, lecture 6: finding peaks

ulwvfje — Tue, 05 Jul 2016 12:17:31 +0000

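At its core, ChIP-seq is still targeted capture sequencing. Unlike WES, it does not capture specific genomic sequences with a fixed set of array probes. Instead, what gets captured varies widely with the IP you choose and with the state of your cells or organism, and those differences are exactly what we study. Finding peaks in ChIP-seq data means aligning all the reads to the genome and then looking for places where the sequencing depth stands out across the whole genome. For WES data, the depth across exons, or from the 5' end to the 3' end of an exon, should in theory be uniform, somewhere around 50X to 200X, so a depth curve should be close to a flat line. For ChIP-seq data, the depth over a captured region is in theory far from uniform and should look like a mountain peak. The high-coverage summits are the hot spots where our IP target binds, that is, the peaks we are looking for. In IGV they look roughly like this:

You can see that the read distribution is anything but uniform! The IP in ChIP-seq can be an antibody against any of the histone modifications, or against a transcription factor, and so on.

How do we define a hot spot? Informally, hot spots are positions covered by many sequenced reads (we sequence a population of cells, so many reads at a position means the TF is likely to be bound there). How many reads count as many? This is where a statistical test comes in. Suppose the TF were distributed across the genome with no pattern at all. Then the reads would also be randomly distributed across the genome, and the number of reads covering a given base would follow a binomial distribution.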

For the statistical details, see the original post: http://www.plob.org/2014/05/08/7227.html
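
To make the binomial idea concrete, here is a small sketch that computes the upper tail P(X >= K) for X ~ Binomial(N, P), where N is the number of reads, P is the chance that a random read covers a given position, and K is the observed count. The function name `binom_tail` and the demo numbers are made up for illustration. Real peak callers are more refined (MACS, for example, models the background with a Poisson distribution and a local rate), so treat this purely as a teaching aid.

```shell
# binom_tail N P K : print P(X >= K) for X ~ Binomial(N, P), with 0 < P < 1
binom_tail() {
    awk -v n="$1" -v p="$2" -v k="$3" 'BEGIN {
        term = exp(n * log(1 - p))        # P(X = 0)
        cdf = 0
        for (i = 0; i < k; i++) {
            cdf += term
            term *= (n - i) / (i + 1) * p / (1 - p)   # P(X = i+1) from P(X = i)
        }
        tail = 1 - cdf
        if (tail < 0) tail = 0            # guard against rounding error
        printf "%.4g\n", tail
    }'
}

# Small demo: chance of at least 8 heads in 10 fair coin flips.
binom_tail 10 0.5 8
```

Because the tail is computed as 1 minus the cumulative sum, this sketch cannot resolve p-values much below about 1e-15. That is enough to see the principle, but not to rank strong peaks.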

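To reproduce the results in the authors' paper I tried 8 tools: MACS2/HOMER/SICERpy/PePr/SWEMBL/SISSRs/BayesPeak/PeakRanger. I will not go through the installation and usage of each peak caller here. Since MACS2 is the most widely used, I will just paste my MACS2 learning code: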

## step6 : peak calling
### step6.1: with MACS2
## I read the manual first:
macs2 callpeak -t TF_1.bam -c Input.bam -n mypeaks
We used the following options:
-t: This is the only required parameter for MACS, refers to the name of the file with the ChIP-seq data
-c: The control or mock data file
-n: The name string of the experiment
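MACS2 creates 4 files (mypeaks_peaks.narrowPeak, mypeaks_summits.bed, mypeaks_peaks.xls and mypeaks_model.r)
# MACS first builds a model. Its key parameter is d, the estimated fragment size (the distance between the forward-strand and reverse-strand read pileups). Note that d is not the same thing as bw (band width), which is the window size used when scanning for regions to build the model. Half of d is the shiftsize, the distance reads are shifted.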

### Then I worked out the grouping of the downloaded sequencing data from the paper
GSM1278641 Xu_MUT_rep1_BAF155_MUT SRR1042593
GSM1278642 Xu_MUT_rep1_Input SRR1042594
GSM1278643 Xu_MUT_rep2_BAF155_MUT SRR1042595
GSM1278644 Xu_MUT_rep2_Input SRR1042596
GSM1278645 Xu_WT_rep1_BAF155 SRR1042597
GSM1278646 Xu_WT_rep1_Input SRR1042598
GSM1278647 Xu_WT_rep2_BAF155 SRR1042599
GSM1278648 Xu_WT_rep2_Input SRR1042600
## There is something odd here: the input data are actually larger than the IP data???
848M Jun 28 14:31 SRR1042593.bam
2.7G Jun 28 14:52 SRR1042594.bam
716M Jun 28 14:58 SRR1042595.bam
2.9G Jun 28 15:20 SRR1042596.bam
1.1G Jun 28 15:28 SRR1042597.bam
2.6G Jun 28 15:48 SRR1042598.bam
1.2G Jun 28 15:58 SRR1042599.bam
3.5G Jun 28 16:26 SRR1042600.bam
## I could not work out why
## http://www2.uef.fi/documents/1698400/2466431/Macs2/f4d12870-34f9-43ef-bf0d-f5d087267602
## http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3120977/
## I first used the following code
nohup time ~/.local/bin/macs2 callpeak -c SRR1042594.bam -t SRR1042593.bam -f BAM -B -g hs -n Xu_MUT_rep1 2>Xu_MUT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042596.bam -t SRR1042595.bam -f BAM -B -g hs -n Xu_MUT_rep2 2>Xu_MUT_rep2.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042598.bam -t SRR1042597.bam -f BAM -B -g hs -n Xu_WT_rep1 2>Xu_WT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042600.bam -t SRR1042599.bam -f BAM -B -g hs -n Xu_WT_rep2 2>Xu_WT_rep2.masc2.log &
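## I got pitifully few peaks. On my first check I thought it was because I had not sorted the aligned BAM files
## forgot to sort the bam files:
## So I sorted the BAM files first, built the index, and ran it again!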
nohup time ~/.local/bin/macs2 callpeak -c SRR1042594.sorted.bam -t SRR1042593.sorted.bam -f BAM -B -g hs -n Xu_MUT_rep1 2>Xu_MUT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042596.sorted.bam -t SRR1042595.sorted.bam -f BAM -B -g hs -n Xu_MUT_rep2 2>Xu_MUT_rep2.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042598.sorted.bam -t SRR1042597.sorted.bam -f BAM -B -g hs -n Xu_WT_rep1 2>Xu_WT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042600.sorted.bam -t SRR1042599.sorted.bam -f BAM -B -g hs -n Xu_WT_rep2 2>Xu_WT_rep2.masc2.log &
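## The peaks were identical to the ones from the unsorted BAM files, so that was not the cause

## Then I wondered whether the authors had swapped the input and IP labels when uploading, so I manually swapped them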

## Then change the control and treatment
nohup time ~/.local/bin/macs2 callpeak -t SRR1042594.sorted.bam -c SRR1042593.sorted.bam -f BAM -B -g hs -n Xu_MUT_rep1 2>Xu_MUT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -t SRR1042596.sorted.bam -c SRR1042595.sorted.bam -f BAM -B -g hs -n Xu_MUT_rep2 2>Xu_MUT_rep2.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -t SRR1042598.sorted.bam -c SRR1042597.sorted.bam -f BAM -B -g hs -n Xu_WT_rep1 2>Xu_WT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -t SRR1042600.sorted.bam -c SRR1042599.sorted.bam -f BAM -B -g hs -n Xu_WT_rep2 2>Xu_WT_rep2.masc2.log &

## Result: no peaks at all!!!! So it seems the authors had not mixed them up

## Next I suspected that filtering the SAM files with samtools view -bhS -q 30 was too strict!!

##

## then just use the sam files.
nohup time ~/.local/bin/macs2 callpeak -c SRR1042594.sam -t SRR1042593.sam -f SAM -B -g hs -n Xu_MUT_rep1 2>Xu_MUT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042596.sam -t SRR1042595.sam -f SAM -B -g hs -n Xu_MUT_rep2 2>Xu_MUT_rep2.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042598.sam -t SRR1042597.sam -f SAM -B -g hs -n Xu_WT_rep1 2>Xu_WT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042600.sam -t SRR1042599.sam -f SAM -B -g hs -n Xu_WT_rep2 2>Xu_WT_rep2.masc2.log &
## That did not add many peaks either. In the end the only thing I could think of was that my p-value cutoff was too strict
## then change the criteria for p values :

https://github.com/taoliu/MACS/

nohup time ~/.local/bin/macs2 callpeak -c SRR1042594.sam -t SRR1042593.sam -f SAM -p 0.01 -g hs -n Xu_MUT_rep1 2>Xu_MUT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042596.sam -t SRR1042595.sam -f SAM -p 0.01 -g hs -n Xu_MUT_rep2 2>Xu_MUT_rep2.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042598.sam -t SRR1042597.sam -f SAM -p 0.01 -g hs -n Xu_WT_rep1 2>Xu_WT_rep1.masc2.log &
nohup time ~/.local/bin/macs2 callpeak -c SRR1042600.sam -t SRR1042599.sam -f SAM -p 0.01 -g hs -n Xu_WT_rep2 2>Xu_WT_rep2.masc2.log &
## I relaxed the p-value cutoff a lot (-p 0.01), and the result was a huge pile of peaks
18919 Xu_MUT_rep1_peaks.xls
36277 Xu_MUT_rep2_peaks.xls
32494 Xu_WT_rep1_peaks.xls
56080 Xu_WT_rep2_peaks.xls
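
The counts above look like `wc -l` output on the xls files (an assumption on my part). If so, they are slightly inflated, because a MACS2 xls file starts with `#` comment lines, a blank line, and a column header. A small sketch that counts only the peak rows follows; the function name `count_macs2_peaks` is made up, and it assumes the header row begins with `chr`.

```shell
# count_macs2_peaks FILE : number of peak rows in a MACS2 *_peaks.xls file.
count_macs2_peaks() {
    awk -F '\t' '!/^#/ && NF > 0 && $1 != "chr" { n++ } END { print n + 0 }' "$1"
}

# Demo on a tiny made-up file: two comment lines, a blank line, a header, two peaks.
tmp=$(mktemp)
printf '# command line\n# parameters\n\nchr\tstart\tend\nchr1\t1\t10\nchr2\t5\t50\n' > "$tmp"
count_macs2_peaks "$tmp"
rm -f "$tmp"
```

The difference is only a couple of dozen lines per file, so it does not change the conclusion drawn from these numbers.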
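## The problem is that these peaks are essentially all false positives!!!
## I manually checked a few of the peaks from the earlier strict filtering, and the depth there does form a curve shaped like two mountain peaks
## check some peaks by hand ## chr1 121484235 121485608
## macs2 results :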
samtools depth -r chr10:42385331-42385599 SRR1042593.sorted.bam
samtools depth -r chr10:42385331-42385599 SRR1042594.sorted.bam
samtools depth -r chr20:45810382-45810662 SRR1042593.sorted.bam
samtools depth -r chr20:45810382-45810662 SRR1042594.sorted.bam
## I also checked the peaks reported in the paper, but in my alignments they do not look like peaks at all by eye, so I am quite torn
paper results:
chr20 45796362 46384917
chr1 121482722 121485861
samtools depth -r chr1:121482722-121485861 SRR1042593.sorted.bam
samtools depth -r chr1:121482722-121485861 SRR1042594.sorted.bam
samtools depth -r chr20:45796362-46384917 SRR1042593.sorted.bam
samtools depth -r chr20:45796362-46384917 SRR1042594.sorted.bam

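Unfortunately, in the end I still could not reproduce the authors' results, and I have not figured out why. I also tried BayesPeak and PeakRanger, and the results were not great either.

A comprehensive list of peak finders: http://wodaklab.org/nextgen/data/peakfinders.html

Peak Calling for ChIP-Seq: http://epigenie.com/guide-peak-calling-for-chip-seq/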