ChIP-Seq文献数据重新分析解读第二例

paper:2014-BRCA1-PALB2-CHIP-seq:http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4194113/

这篇文章是朋友推荐的，我觉得作为CHIP-seq学习材料再好不过了，所以推荐给大家。是全基因组范围的BRCA1和PALB2的转录共激活机制的探究。请务必先看我的CHIP-seq自学系列教程，跟着好好学习！数据如下：

GSM997540 BRCA1 SRR553473.sra Read 18878514 spots

GSM997541 PALB2 SRR553474.sra Read 17615498 spots

GSM997542 P_Ser2 SRR553475.sra Read 35396009 spots

没有input作为control，但是数据量是足够了的，首先从NCBI里面把作者上传的数据下载回来：

nohup wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX182/SRX182682/SRR553473/SRR553473.sra &

nohup wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX182/SRX182683/SRR553474/SRR553474.sra &

nohup wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX182/SRX182684/SRR553475/SRR553475.sra &

因为就3个数据，我就没有写批处理了，反正也要具体进去看看每个数据的描述信息。然后我就批量解压了数据，做了质控，然后做了比对，代码如下：

ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump $id;done

rm *sra

ls *.fastq | while read id ; do ~/biosoft/fastqc/FastQC/fastqc $id;done

### 36 bp   45 GC%

## cat >runBowtie2.sh

ls *.fastq | while read id ;

do

echo $id

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 8 -x ~/biosoft/bowtie/hg19_index/hg19 -U $id   -S ${id%%.*}.sam  2>${id%%.*}.align.log;

samtools view -bhS -q 30  ${id%%.*}.sam > ${id%%.*}.bam

#  -F 1548 https://broadinstitute.github.io/picard/explain-flags.html

#  -F 0x4  remove the reads that didn't match

samtools sort   ${id%%.*}.bam ${id%%.*}.sorted  ## prefix for the output

#samtools view  -bhS     a.sam | samtools sort -o  -  ./ > a.bam

samtools index ${id%%.*}.sorted.bam

done

参考：http://cbsu.tc.cornell.edu/lab/doc/CHIPseq_workshop_20150504_lecture1.pdf

有一个讨论很有意思，大家可以关注一下，就是两个ＩＰ是否可以共用同一个ｉｎｐｕｔ的问题：http://seqanswers.com/forums/showthread.php?t=35377

作者在NCBI还上传了一个BigWiggle 格式文件，他是这样描述这个文件的，BigWiggle files for every ChIP-Seq were generated using Bed Tools and the utility bedGraphToBigWig (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/); these tracks were then uploaded into the UCSC Genome Browser.

paper:2012-Peak identification for ChIP-seq data with no controls.: http://www.ncbi.nlm.nih.gov/pubmed/23266983，我也下载了，但是没有打开看，因为他说的是hg18参考基因组比对的，我觉得很诡异，一个2014年的文章，居然用hg18，我懒得多说了，反正我自己把数据处理一下，接下来用MACS2来call peaks

nohup time ~/.local/bin/macs2 callpeak 　-t SRR553473.sorted.bam -f BAM -g hs -n BRCA1    2>BRCA1.masc2.log &

nohup time ~/.local/bin/macs2 callpeak 　-t SRR553474.sorted.bam -f BAM -g hs -n PALB2    2>PALB2.masc2.log &

nohup time ~/.local/bin/macs2 callpeak 　-t SRR553475.sorted.bam -f BAM -g hs -n P_Ser2    2>P_Ser2.masc2.log &

本来我其实比较喜欢peakranger这个软件的，但是## there's no control , we can't use this tool: ranger /ccat /bcp

~/biosoft/PeakRanger/bin/peakranger ranger --format bam SRR553473.sorted.bam \ ##错误啦

--report --gene_annot_file hg19refGene.txt -q 0.05 -t 4

有一些关于各个peaks caller工具的讨论，大家可以瞧瞧。

Some peak callers work without control data and assume an even background signal, others make use of blacklist tools, that mask regions of the genome e.g. RepeatMasker and the “Duke excluded regions” list that was developed for the ENCODE project.

http://epigenie.com/guide-peak-calling-for-chip-seq/

因为要接下来使用CEAS这个软件，需要wig格式的文件：

## change sort bam files to wig files

nohup samtools depth SRR553473.sorted.bam | perl -ne 'BEGIN{ print "track type=print wiggle_0 name=SRR553473 description=SRR553473\n"}; ($c, $start, $depth) = split; if ($c ne $lastC) { print "variableStep chrom=$c span=10\n"; };$lastC=$c; next unless $. % 10 ==0;print "$start\t$depth\n" unless $depth<3' > SRR553473.wig    &

nohup samtools depth SRR553474.sorted.bam | perl -ne 'BEGIN{ print "track type=print wiggle_0 name=SRR553474 description=SRR553474\n"}; ($c, $start, $depth) = split; if ($c ne $lastC) { print "variableStep chrom=$c span=10\n"; };$lastC=$c; next unless $. % 10 ==0;print "$start\t$depth\n" unless $depth<3' > SRR553474.wig    &

nohup samtools depth SRR553475.sorted.bam | perl -ne 'BEGIN{ print "track type=print wiggle_0 name=SRR553475 description=SRR553475\n"}; ($c, $start, $depth) = split; if ($c ne $lastC) { print "variableStep chrom=$c span=10\n"; };$lastC=$c; next unless $. % 10 ==0;print "$start\t$depth\n" unless $depth<3' > SRR553475.wig    &

CEAS文件还需要bed格式的peaks，而MACS2改的是自定义格式，所以我写了一个脚本来转换

## cat >xls2bed.sh

ls *.xls | while read id ;

do

echo $id

grep '^chr\S' $id |perl -alne '{print "$F[0]\t$F[1]\t$F[2]\t$F[9]\t$F[7]\t+"}' >${id%%.*}.bed

done

bash xls2bed.sh

接下来就很简单啦，用CEAS来画一些图：

cd ~/CHIPseq_test/annotation

nohup ~/.local/bin/ceas --name=BRCA1_ceas --pf-res=20 --gn-group-names='Top 10%,Bottom 10%'  -g hg19.refGene \

-b ~/CHIPseq_test/BRAC1-PALB2/raw/BRCA12_peaks.bed -w  ~/CHIPseq_test/BRAC1-PALB2/raw/SRR553473.wig  2>BRCA1.ceas.log &

nohup ~/.local/bin/ceas --name=PALB2_ceas --pf-res=20 --gn-group-names='Top 10%,Bottom 10%'  -g hg19.refGene \

-b ~/CHIPseq_test/BRAC1-PALB2/raw/PALB22_peaks.bed -w  ~/CHIPseq_test/BRAC1-PALB2/raw/SRR553474.wig  2>PALB2.ceas.log &

nohup ~/.local/bin/ceas --name=P_Ser2_ceas --pf-res=20 --gn-group-names='Top 10%,Bottom 10%'  -g hg19.refGene \

-b ~/CHIPseq_test/BRAC1-PALB2/raw/P_Ser22_peaks.bed -w  ~/CHIPseq_test/BRAC1-PALB2/raw/SRR553475.wig  2>P_Ser2.ceas.log &

然后我还用了用网页版工具ChIPseek来可视化CHIP-seq的peaks结果

结果一个月内是有效的，大家可以点进去瞧瞧(开始时间2016年7月12)http://chipseek.cgu.edu.tw/main_menu.py?job_id=1468305524.156

Alternatively, You may use the job ID: 1468305524.156 to visit ChIPseek latter.

基本就是我前面写的CHIP-seq数据自学系列教程的实践！！！

生信菜鸟团

欢迎去论坛biotrainee.com留言参与讨论，或者关注同名微信公众号biotrainee

ChIP-Seq文献数据重新分析解读第二例

自学CHIP-seq分析第九讲~CHIP-seq可视化大全

2025年7月
一	二	三	四	五	六	日
« 九
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31