生信菜鸟团 » CHIP-seq

ngsplot for ChIP-seq data analysis: visualization

ulwvfje — Sun, 01 Jan 2017 02:18:17 +0000

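I have been busy with several ChIP-seq analysis projects lately. Visualizing this kind of data is fairly involved, and writing my own plotting code would take a long time, so I looked for existing tools. I tried deeptools first; it works well, but it is a Python program and somewhat slow. See the earlier post:

deeptools for ChIP-seq data analysis: visualization

This time I am switching to an R-based program, ngsplot. It is very fast, and I personally find its plots a little nicer. Give it a try.

The source code is on GitHub, and the authors have also given talks promoting the package. The slides below describe very concisely what it does and why it is powerful.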

github: https://github.com/shenlab-sinai/ngsplot

ppt:http://jura.wi.mit.edu/bio/education/hot_topics/ngsplot/ngsplot_Apr2014.pdf

example:https://drive.google.com/drive/folders/0B1PVLadG_dCKN1liNFY0MVM1Ulk

Installation is very simple: just download the software and the test data from Google Drive.

cd ~/biosoft

mkdir ngsplot && cd ngsplot

## download by yourself :https://drive.google.com/drive/folders/0B1PVLadG_dCKN1liNFY0MVM1Ulk

tar -zxvf ngsplot-2.61.tar.gz

tar zxvf ngsplot.eg.bam.tar.gz ## The test data is very helpful: it shows clearly what input ChIP-seq visualization needs.

cp ../ngsplot/example/config.example.txt ./ ## needed by the test commands below

echo 'export PATH=/home/jianmingzeng/biosoft/ngsplot/ngsplot/bin:$PATH' >>~/.bashrc

echo 'export NGSPLOT=/home/jianmingzeng/biosoft/ngsplot/ngsplot' >>~/.bashrc

source ~/.bashrc

## R must be installed on your server, and you need to install these packages manually.

install.packages("doMC", dep=T)

install.packages("caTools", dep=T)

install.packages("utils", dep=T)

source("http://bioconductor.org/biocLite.R")

biocLite( "BSgenome" )

biocLite( "Rsamtools" )

biocLite( "ShortRead" )

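Usage is simple: once you understand how ngs.plot.r is called, a single command produces the figure. If you are not happy with the result, use replot.r to redraw it with different parameters.

You need the genome files first. hg19 ships with the software; the other supported genomes are listed at https://github.com/shenlab-sinai/ngsplot/wiki/SupportedGenomes , but they are all hosted on Google Drive, so from mainland China you need a proxy to download them: https://drive.google.com/drive/folders/0B1PVLadG_dCKNEsybkh5TE9XZ1E

The test data looks like this:

With this test data in hand, you can run the test commands that come with the software: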

ngs.plot.r -G hg19 -R tss -C hesc.H3k4me3.1M.bam -O k4.test

ngs.plot.r -G hg19 -R tss -C config.example.txt -O encode1M.k4k27

To plot several BAM files together, set up config.example.txt following the format the authors define.
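As a rough illustration of that format, the sketch below writes a config file for the -C option. It assumes the three tab-separated columns described in the ngsplot README (BAM file, gene list with -1 meaning the whole genome, legend title). The helper name write_ngsplot_config, the output name my.config.txt and the file hesc.H3k27me3.1M.bam are placeholders chosen for illustration, not copied from the bundled config.example.txt.

```shell
# Hypothetical helper: write an ngs.plot.r multi-sample config file.
# Assumed layout, one sample per line: BAM file <TAB> gene list <TAB> "title"
# A gene list of -1 stands for the whole genome.
write_ngsplot_config() {
  out="$1"; shift
  : > "$out"                      # start with an empty file
  for pair in "$@"; do            # each remaining argument is bam=title
    printf '%s\t-1\t"%s"\n' "${pair%%=*}" "${pair#*=}" >> "$out"
  done
}

# Example call (file names are placeholders):
# write_ngsplot_config my.config.txt hesc.H3k4me3.1M.bam=H3K4me3 hesc.H3k27me3.1M.bam=H3K27me3
# ngs.plot.r -G hg19 -R tss -C my.config.txt -O my.k4k27
```

Keeping the samples in one small text file makes it easy to rerun the same comparison later with replot.r or with a different region type.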

If you are not satisfied with the figures above, replot.r can redraw them from the results of those runs.

replot.r prof -I k4.test.zip -O k4.replot -SE 0 -MW 9 -H 0.3

replot.r heatmap -I encode1M.k4k27.zip -O k4k27.replot -GO hc -RR 80

Besides TSS, you can plot over gene bodies or other region types: tss, tes, genebody, exon, cgi, enhancer, dhs or bed.

ngs.plot.r -G hg19 -R genebody -F rnaseq -C hesc.RNAseq.1M.bam -O encode1M.rnaseq

ngs.plot.r -G hg19 -R tss -C hesc.H3k4me3.1M.bam:hesc.Input.500K.bam -O k4vsInp

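It is easy to see what the authors intend from these examples, and then you can make the same kinds of figures with your own data.

If anything is unclear, read the README on GitHub; it really is simple enough that there is little left for me to explain.

It also runs very fast!

Although that may just be because the test files are small.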

The genome files can be found in this Google drive folder: ngs.plot genome folder. A list of the available genomes is listed in this Wiki: SupportedGenomes. A brief list is here (not all): "human (hg18, hg19), chimpanzee (panTro4), rhesus macaque (rheMac2), mouse (mm9, mm10), rat (rn4, rn5), cow (bosTau6), chicken (galGal4), zebrafish (Zv9), drosophila (dm3), Caenorhabditis elegans (ce6, ceX), Saccharomyces cerevisiae (sacCer2, sacCer3), Schizosaccharomyces pombe (Asm294), Arabidopsis thaliana (TAIR10), Zea mays (AGPv3), rice (IRGSP-1.0)".

A bioinformatics analysis paper is a story written from figures

ulwvfje — Wed, 28 Dec 2016 07:14:39 +0000

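First, derive summary data from the raw sequencing data.

Then visualize the summary statistics as figures and tables.

Finally, write the story from those figures.

Source: Genome-wide Mapping of HATs and HDACs Reveals Distinct Functions in Active and Inactive Genes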

http://www.sciencedirect.com/science/article/pii/S0092867409008411

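Take the figure below as an example. It is ChIP-seq data: after alignment, read density is summarized over the regions of all genes genome-wide:

How should the story be written?

Start with the figure legend: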

A. Profiles of HATs binding across 5’ gene ends, 3’ gene ends and gene body regions of the 1000 most active, intermediately active and least active genes were examined using ChIP-Seq. txStart: transcription start site. txEnd: transcription end site.

B. Profiles of HATs binding across intergenic (5kb away from any gene) or promoter (defined as +/− 1kb surrounding TSS) DNase HS sites. DNase HS sites were obtained from (Boyle et al., 2008).

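The authors generated ChIP-seq data for five HATs. Based on the figure above, they fall into three groups: CBP and p300; PCAF (p300/CBP associated factor) and GCN5; MOF and Tip60. All of them are protein acetyltransferases, yet their ChIP-seq profiles differ, as a close look at the figure shows. The differences need an explanation, and the explanation requires biological background. For example, CBP and p300 are highly homologous in structure, and earlier work indicates that they mainly take part in transcription initiation. PCAF and GCN5 form a second highly homologous pair, which earlier work links to transcription elongation. MOF and Tip60 belong to the MYST family of HATs, which differs from the HATs above; earlier work shows that their functions are especially diverse, so their binding density over genes differs from the others. After that, you discuss how these proteins function in other species and compare with human, cross-validate against a few published ChIP-seq datasets, and mention that you validated a random subset experimentally, because high-throughput sequencing is not a gold standard.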

The next figure relates ChIP-seq read density to gene expression level, which is also simple.

How should the story be written?

Start with the figure legend:

C. Correlation between HAT binding and gene expression levels. Genes were grouped to 100 gene (one dot in the figure) sets according to expression level. The HAT binding level in promoter region was calculated for the same 100 gene sets. The y-axis indicates the HAT binding level and the x-axis indicates the expression level.

D. Correlation between HAT binding and RNA Pol II binding levels among the 100 gene sets grouped according to expression levels as defined in panel C. The y-axis indicates the HAT binding level and the x-axis indicates the Pol II level.

E. Correlation between HAT binding and histone acetylation levels among the 100 gene sets grouped according to expression levels as defined in panel C. The acetylation level was calculated by pooling all reads for 18 histone acetylations mapped previously (Wang et al., 2008). The y-axis indicates the HAT binding level and the x-axis indicates the acetylation level.

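The legend is complicated, but it carries little information. Genes are binned by expression level using transcriptome data, and for each bin the ChIP-seq density over the corresponding genes is shown: a simple correlation plot, meant to show that HAT binding is positively correlated with gene expression. Expression level is closely tied to Pol II binding density, so you can also correlate Pol II binding density with the binding density of these ChIP-seq IPs and reach the same conclusion.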
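The binning described in the legend can be sketched in a few lines of shell. This assumes a hypothetical three-column, tab-separated table (gene, expression level, promoter binding level) that you have prepared yourself; it is not a file provided by the paper, and bin_genes is an illustrative name. Genes are sorted by expression and averaged in consecutive sets, giving one point per set.

```shell
# Hypothetical helper: group genes into consecutive sets by expression level.
# $1    = number of genes per set (the legend uses 100)
# stdin = gene <TAB> expression <TAB> binding
# Output: mean expression <TAB> mean binding, one line per set.
bin_genes() {
  sort -t "$(printf '\t')" -k2,2n | awk -F'\t' -v n="$1" '
    { e += $2; b += $3; k++ }                       # accumulate the current set
    k == n { printf "%.3f\t%.3f\n", e / k, b / k    # set is full: print means
             e = 0; b = 0; k = 0 }
    END { if (k > 0) printf "%.3f\t%.3f\n", e / k, b / k }   # last, partial set
  '
}

# Example call (file names are placeholders):
# bin_genes 100 < gene_expr_binding.tsv > sets.tsv
```

The two output columns can be plotted directly as the x and y values of a scatter plot like panel C.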

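The authors of this paper generated ChIP-seq data for the whole HAT series, and for the HDAC series as well!

When people start out in bioinformatics, most of their questions are about how to obtain plottable data, because plotting itself is easy: even without R you can do it in Excel. The story-from-figures step that follows mainly tests your biological background.

Finally, a word about the figure below: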

A. Distribution profiles of HDAC6, Tip60, Pol II and H3K36me3 across the active genes were plotted. The left y-axis indicates tag densities for HDAC6, Tip60 and Pol II. The right axis indicates tag densities for H3K36me3.

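There is not much to add here. HATs, HDACs and Pol II clearly share the same pattern, all marking active transcription, and differ markedly from the H3K36me3 pattern. This observation is novel and interesting; add a discussion of its biological significance. Why do HATs, HDACs and Pol II share the same pattern? Offer your own hypotheses and conjectures, which again requires biological background.

How to obtain the data behind such a plot, though, takes rather longer to explain.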

Should you keep only uniquely mapped reads when calling peaks on ChIP-seq data?

ulwvfje — Sun, 07 Aug 2016 13:13:18 +0000

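I am entirely self-taught in ChIP-seq data processing, so there are many details I have to pick up along the way. This note covers one of them: after aligning the sequencer's fastq data to the reference genome, how should the alignment files be processed before handing them to a peak caller? One blog I read keeps only alignments with mapping quality above 30; another keeps only uniquely mapped reads. Here I use a public dataset to test the difference. The data are: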

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE74311

Reference pipeline: https://github.com/jmzeng1314/NGS-pipeline/tree/master/CHIPseq

GSM1916974	H3K27ac ChIP-seq	SRR2774675
GSM1916975	input DNA	SRR2774676

First download SRR2774675.sra and SRR2774676.sra from the SRA database:

http://www.ncbi.nlm.nih.gov/sra?term=SRP065184

With the pipeline on my GitHub the alignment is quick. I called peaks on the alignments processed by each of the two methods and obtained two peak files.

39709 highQuaily_peaks.bed
39709 highQuaily_summits.bed

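The two runs give peak counts that are not noticeably different. Let us take a quick look at the first few lines!

bedtools could be used to intersect the two files, but here I use the ready-made functions in the R package ChIPpeakAnno to get the result directly.

For ChIPpeakAnno itself, read its manual; here is my code: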

library(ChIPpeakAnno)
highPeak <- readPeakFile( 'highQuaily_peaks.bed' )
uniquePeak <- readPeakFile( 'unique_peaks.bed' )
ol <- findOverlapsOfPeaks(highPeak, uniquePeak)
png('overlapVenn.png')
makeVennDiagram(ol)
dev.off()

Then open the resulting Venn diagram:

The two approaches give almost identical results. If you are interested, look into what causes the few peaks unique to each set.
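A quick cross-check without R is also possible. The sketch below counts how many peaks in one BED file overlap at least one peak in another, in plain awk. It is an illustration only: count_overlapping_peaks is a hypothetical helper, it assumes plain tab-separated BED files without track or header lines, it compares every pair of peaks per chromosome (slow for very large files), and its count need not match the Venn diagram exactly, since tools differ in how they merge overlapping peaks.

```shell
# Hypothetical helper: count peaks in BED file A that overlap >= 1 peak in BED file B.
# BED intervals are half-open, so two intervals overlap when
# startA < endB and startB < endA.
count_overlapping_peaks() {
  awk -F'\t' '
    NR == FNR {                         # first file on the command line: B
      k = ++n[$1]
      s[$1, k] = $2 + 0; e[$1, k] = $3 + 0
      next
    }
    {                                   # second file: A
      for (i = 1; i <= n[$1]; i++)
        if ($2 + 0 < e[$1, i] && s[$1, i] < $3 + 0) { hits++; break }
    }
    END { print hits + 0 }
  ' "$2" "$1"
}

# Example call, using the file names from this post:
# count_overlapping_peaks highQuaily_peaks.bed unique_peaks.bed
```

Running it in both directions (A against B, then B against A) gives the two overlap counts to compare with the totals from wc -l.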

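The conclusion: at the peak-calling step of a ChIP-seq analysis, keeping only alignments with mapping quality above 30 and keeping only uniquely mapped reads make little difference from a data-processing point of view; which to use depends mainly on what your experiment is about.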
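For reference, the two filters compared here could look roughly like this on SAM text. This is a sketch, not the code from the pipeline linked above: filter_sam is a hypothetical helper, the MAPQ threshold of at least 30 mirrors samtools view -q 30, and "unique" is taken to mean a bowtie2 alignment without an XS:i tag (no second-best hit reported), which is one common convention and an assumption here.

```shell
# Hypothetical helper: filter SAM records read from stdin.
# $1 = "mapq"   keep mapped reads with MAPQ >= 30
# $1 = "unique" keep mapped reads without an XS:i tag (assumed bowtie2 output)
filter_sam() {
  awk -F'\t' -v mode="$1" '
    /^@/ { print; next }                       # always keep header lines
    int($2 / 4) % 2 == 1 { next }              # FLAG bit 0x4 set: unmapped, drop
    mode == "mapq"   { if ($5 + 0 >= 30) print; next }
    mode == "unique" {
      keep = 1
      for (i = 12; i <= NF; i++) if ($i ~ /^XS:i:/) keep = 0
      if (keep) print
    }
  '
}

# Example call (file names are placeholders):
# samtools view -h aln.bam | filter_sam unique | samtools view -b -o aln.unique.bam -
```

Note that the two criteria are not nested: a read with MAPQ below 30 can still lack an XS:i tag, so the two filtered files can differ in either direction.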

Visualizing peak regions from the aligned BAM files

ulwvfje — Tue, 02 Aug 2016 13:52:53 +0000

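I had analyzed several public projects before, and the peaks always looked odd, which made me suspect my own analysis. Persistence paid off: I finally analyzed a dataset whose peaks look perfect, which shows that my pipeline and my peak-plotting code are fine. The data come from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE74311 (H3K27ac_ChIP-Seq_LOUCY), a histone-modification ChIP-seq dataset. The sequencing data uploaded by the authors were easy to download, and I ran them through my pipeline: https://github.com/jmzeng1314/NGS-pipeline/tree/master/CHIPseq

This post is about how to check whether your peaks are right. I run samtools depth directly on the aligned BAM files to get the sequencing depth across each peak region and plot it; the code is in step5-peaks-view-samtools-depth.R

The plotting script is called from the terminal like this: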

Rscript ~/scripts/peakView.R ../unique_peaks.bed ../../SRR2774675.unique.sorted.bam ../../SRR2774676.unique.sorted.bam

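Two peaks picked at random are shown below. They clearly follow the bimodal model, and the IP depth is far above INPUT: very good data!

I also have to point out that when a ChIP-seq experiment fails, the peaks look odd: depth is mostly below 10, and where depth is high, IP and INPUT show no difference at all. Below is a typical failed case.

Data from a failed experiment like this simply cannot be analyzed. Transcription-factor ChIP-seq fails fairly often, so always include a control; otherwise, whatever analysis you do is garbage in, garbage out.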

Re-analysis of published ChIP-seq data: case 2

ulwvfje — Thu, 14 Jul 2016 12:26:58 +0000

paper:2014-BRCA1-PALB2-CHIP-seq:http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4194113/

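A friend recommended this paper, and I think it makes excellent ChIP-seq study material, so I am recommending it here. It is a genome-wide study of the transcriptional co-activation mechanism of BRCA1 and PALB2. Please read my self-study ChIP-seq tutorial series first and follow along. The data: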

GSM997540 BRCA1 SRR553473.sra Read 18878514 spots

GSM997541 PALB2 SRR553474.sra Read 17615498 spots

GSM997542 P_Ser2 SRR553475.sra Read 35396009 spots

There is no input control, but the amount of data is sufficient. First download the data the authors uploaded to NCBI:

nohup wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX182/SRX182682/SRR553473/SRR553473.sra &

nohup wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX182/SRX182683/SRR553474/SRR553474.sra &

nohup wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX182/SRX182684/SRR553475/SRR553475.sra &

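With only three datasets I did not script the download; I wanted to read each dataset's description anyway. I then unpacked the data in batch, ran QC, and aligned. The code: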

ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump $id;done

rm *sra

ls *.fastq | while read id ; do ~/biosoft/fastqc/FastQC/fastqc $id;done

### 36 bp   45 GC%

## cat >runBowtie2.sh

ls *.fastq | while read id ;

do

echo $id

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 8 -x ~/biosoft/bowtie/hg19_index/hg19 -U $id   -S ${id%%.*}.sam  2>${id%%.*}.align.log;

samtools view -bhS -q 30  ${id%%.*}.sam > ${id%%.*}.bam

#  -F 1548 https://broadinstitute.github.io/picard/explain-flags.html

#  -F 0x4  remove the reads that didn't match

samtools sort   ${id%%.*}.bam ${id%%.*}.sorted  ## prefix for the output

#samtools view  -bhS     a.sam | samtools sort -o  -  ./ > a.bam

samtools index ${id%%.*}.sorted.bam

done

Reference: http://cbsu.tc.cornell.edu/lab/doc/CHIPseq_workshop_20150504_lecture1.pdf

There is an interesting discussion worth following on whether two IPs can share the same input: http://seqanswers.com/forums/showthread.php?t=35377

The authors also uploaded a BigWiggle file to NCBI, which they describe as follows: BigWiggle files for every ChIP-Seq were generated using Bed Tools and the utility bedGraphToBigWig (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/); these tracks were then uploaded into the UCSC Genome Browser.

Related paper: 2012, Peak identification for ChIP-seq data with no controls: http://www.ncbi.nlm.nih.gov/pubmed/23266983 . I downloaded the BigWiggle file as well but did not open it, because the authors say it was aligned to hg18. I find it odd that a 2014 paper uses hg18, but I will say no more; I process the data myself anyway. Next, call peaks with MACS2:

nohup time ~/.local/bin/macs2 callpeak -t SRR553473.sorted.bam -f BAM -g hs -n BRCA1    2>BRCA1.masc2.log &

nohup time ~/.local/bin/macs2 callpeak -t SRR553474.sorted.bam -f BAM -g hs -n PALB2    2>PALB2.masc2.log &

nohup time ~/.local/bin/macs2 callpeak -t SRR553475.sorted.bam -f BAM -g hs -n P_Ser2    2>P_Ser2.masc2.log &

I actually prefer PeakRanger, but there is no control here, so none of its tools (ranger / ccat / bcp) can be used:

## this command fails without a control
~/biosoft/PeakRanger/bin/peakranger ranger --format bam SRR553473.sorted.bam \
--report --gene_annot_file hg19refGene.txt -q 0.05 -t 4

There are discussions comparing the various peak callers that are worth a look.

Some peak callers work without control data and assume an even background signal, others make use of blacklist tools, that mask regions of the genome e.g. RepeatMasker and the “Duke excluded regions” list that was developed for the ENCODE project.

http://epigenie.com/guide-peak-calling-for-chip-seq/

The next step uses CEAS, which needs a wig file:

## change sort bam files to wig files

nohup samtools depth SRR553473.sorted.bam | perl -ne 'BEGIN{ print "track type=wiggle_0 name=SRR553473 description=SRR553473\n"}; ($c, $start, $depth) = split; if ($c ne $lastC) { print "variableStep chrom=$c span=10\n"; };$lastC=$c; next unless $. % 10 ==0;print "$start\t$depth\n" unless $depth < 3' > SRR553473.wig    &

nohup samtools depth SRR553474.sorted.bam | perl -ne 'BEGIN{ print "track type=wiggle_0 name=SRR553474 description=SRR553474\n"}; ($c, $start, $depth) = split; if ($c ne $lastC) { print "variableStep chrom=$c span=10\n"; };$lastC=$c; next unless $. % 10 ==0;print "$start\t$depth\n" unless $depth < 3' > SRR553474.wig    &

nohup samtools depth SRR553475.sorted.bam | perl -ne 'BEGIN{ print "track type=wiggle_0 name=SRR553475 description=SRR553475\n"}; ($c, $start, $depth) = split; if ($c ne $lastC) { print "variableStep chrom=$c span=10\n"; };$lastC=$c; next unless $. % 10 ==0;print "$start\t$depth\n" unless $depth < 3' > SRR553475.wig    &
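The one-liners above keep every tenth line of the samtools depth output. A position-based alternative is sketched below in awk: it averages depth over fixed-width bins and writes a variableStep wig. This is only a possible variant under stated assumptions, not something CEAS requires; depth_to_wig and its three arguments are hypothetical names, and positions missing from the depth output are counted as zero coverage.

```shell
# Hypothetical helper: turn `samtools depth` output (chrom, 1-based pos, depth)
# read from stdin into a variableStep wig with fixed-width bins.
# $1 = track name, $2 = bin width in bp, $3 = minimum mean depth to report
depth_to_wig() {
  awk -v name="$1" -v span="$2" -v min="$3" '
    function emit_bin() {                  # print the finished bin, then reset
      if (n > 0 && sum / span >= min) printf "%d\t%.2f\n", bin_start, sum / span
      n = 0; sum = 0
    }
    BEGIN { print "track type=wiggle_0 name=" name " description=" name }
    {
      b = int(($2 - 1) / span) * span + 1  # 1-based start of the bin holding this position
      if ($1 != chrom || b != bin_start) {
        emit_bin()
        if ($1 != chrom) { print "variableStep chrom=" $1 " span=" span; chrom = $1 }
        bin_start = b
      }
      sum += $3; n++
    }
    END { emit_bin() }
  '
}

# Example call (file names as used in this post):
# samtools depth SRR553473.sorted.bam | depth_to_wig SRR553473 10 3 > SRR553473.wig
```

Binning by position keeps the bins from overlapping even where samtools depth skips uncovered bases.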

CEAS also needs the peaks in BED format, while MACS2 writes its own format, so I wrote a script to convert:

## cat >xls2bed.sh

ls *.xls | while read id ;

do

echo $id

grep '^chr\S' $id |perl -alne '{print "$F[0]\t$F[1]\t$F[2]\t$F[9]\t$F[7]\t+"}' >${id%%.*}.bed

done

bash xls2bed.sh
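One detail worth knowing: coordinates in the MACS2 _peaks.xls file are 1-based, while BED starts are 0-based. The awk sketch below does the same column selection as the script above (peak name in column 4, fold enrichment in column 5) but subtracts 1 from the start. macs2_xls_to_bed is a hypothetical name, and the column positions assume the standard ten-column MACS2 xls layout.

```shell
# Hypothetical helper: convert a MACS2 *_peaks.xls (stdin) to 6-column BED (stdout).
# Assumed xls columns: chr, start, end, length, abs_summit, pileup,
#                      -log10(pvalue), fold_enrichment, -log10(qvalue), name
macs2_xls_to_bed() {
  awk -F'\t' '
    /^#/ || NF < 10 || $1 == "chr" { next }     # skip comments, blank lines, header row
    { printf "%s\t%d\t%d\t%s\t%s\t+\n", $1, $2 - 1, $3, $10, $8 }
  '
}

# Example call (file names are placeholders):
# macs2_xls_to_bed < BRCA1_peaks.xls > BRCA1_peaks.bed
```

A one-base shift rarely changes the plots, but it matters when the BED file is later used to extract sequences or to intersect with other annotations.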

The rest is easy: use CEAS to draw some figures:

cd ~/CHIPseq_test/annotation

nohup ~/.local/bin/ceas --name=BRCA1_ceas --pf-res=20 --gn-group-names='Top 10%,Bottom 10%'  -g hg19.refGene \

-b ~/CHIPseq_test/BRAC1-PALB2/raw/BRCA12_peaks.bed -w  ~/CHIPseq_test/BRAC1-PALB2/raw/SRR553473.wig  2>BRCA1.ceas.log &

nohup ~/.local/bin/ceas --name=PALB2_ceas --pf-res=20 --gn-group-names='Top 10%,Bottom 10%'  -g hg19.refGene \

-b ~/CHIPseq_test/BRAC1-PALB2/raw/PALB22_peaks.bed -w  ~/CHIPseq_test/BRAC1-PALB2/raw/SRR553474.wig  2>PALB2.ceas.log &

nohup ~/.local/bin/ceas --name=P_Ser2_ceas --pf-res=20 --gn-group-names='Top 10%,Bottom 10%'  -g hg19.refGene \

-b ~/CHIPseq_test/BRAC1-PALB2/raw/P_Ser22_peaks.bed -w  ~/CHIPseq_test/BRAC1-PALB2/raw/SRR553475.wig  2>P_Ser2.ceas.log &

I also used the web tool ChIPseek to visualize the ChIP-seq peaks.

The results stay available for one month (starting 12 July 2016); have a look: http://chipseek.cgu.edu.tw/main_menu.py?job_id=1468305524.156

Alternatively, You may use the job ID: 1468305524.156 to visit ChIPseek latter.

This is basically my self-study ChIP-seq tutorial series put into practice!

Self-study ChIP-seq analysis, lecture 9: a roundup of ChIP-seq visualization

Re-analysis of published ChIP-seq data: case 1

ulwvfje — Wed, 13 Jul 2016 14:50:22 +0000

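The paper is: Genome-wide maps of H3K4me2/3 in prostate cancer cell line LNCaP. The data can be downloaded from GEO under GSE20042. All of the analysis below needs about 26 GB of disk space.

The authors treated the cancer cell line LNCaP with dihydrotestosterone (an androgen) and looked at changes in histone methylation, mainly the difference between H3K4me2 and H3K4me3. There are three groups (untreated, 4 h and 16 h after treatment), five conditions in total, but seven fastq files.

Sequencer: Illumina Genome Analyzer (Homo sapiens)

The main goal was to analyze differential nucleosome positioning: Model for identifying differential transcription factor binding locations

The software the authors used for the analysis (NPS) is quite old. It also comes from Xiaole Liu's lab at Harvard; I will not use it here.

Data processing details: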

Bed: Sequence reads were obtained and mapped to the human genome (March, 2006) using the Illumina Genome Analyzer Pipeline.
Peaks: Peak detection was performed with the "Nucleosome Positioning from Sequencing (NPS)" algorithm (http://liulab.dfci.harvard.edu/NPS/)
Processed data file build: hg18

So I redo the analysis with hg19 and MACS2.

The authors also generated array data (Affymetrix U133 Plus 2.0 microarray data), but no accession seems to be given, so I leave that aside.

First download the data:

cd ~/CHIPseq_test/

mkdir GSE20042_H3K4me2_3 && cd GSE20042_H3K4me2_3

mkdir rawData && cd rawData

for ((i=146;i<153;i++)) ;do wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP002/SRP002077/SRR037$i/SRR037$i.sra;done

GSM503903 H3K4me2_Vehicle_ChIPSeq SRR037146(Read 7,530,267 spots )/SRR037147(Read 6,215,981 spots )

GSM503904 H3K4me2_DHT_4h_ChIPSeq SRR037148(Read 6,510,159 spots )/SRR037149(Read 6,246,716 spots )

GSM503905 H3K4me2_DHT_16h_ChIPSeq SRR037150 Read 9,685,845 spots

GSM503906 H3K4me3_Vehicle_ChIPSeq SRR037151 Read 6,755,854 spots

GSM503907 H3K4me3_DHT_4h_ChIPSeq SRR037152 Read 4,761,769 spots

## Sequencing depth is low because the paper is fairly old; nowadays about 20M reads is typical

ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump $id;done

rm *sra

ls *.fastq | while read id ; do ~/biosoft/fastqc/FastQC/fastqc $id;done

mkdir QC_results

mv *zip *html QC_results/

## next, alignment

## cat >run_bowtie2.sh  (run this script to align in batch)

ls *.fastq | while read id ;

do

echo $id

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -3 5 -p 8 -x ~/biosoft/bowtie/hg19_index/hg19 -U $id   -S ${id%%.*}.sam  2>${id%%.*}.align.log;

samtools view -bhS -q 30  ${id%%.*}.sam > ${id%%.*}.bam  ## -F 1548 https://broadinstitute.github.io/picard/explain-flags.html

samtools sort   ${id%%.*}.bam ${id%%.*}.sorted  ## prefix for the output

samtools index ${id%%.*}.sorted.bam

done

Then download the nucleosome positions (peaks) from GEO:

tar xvf GSE20042_RAW.tar

ls *gz |xargs gunzip

wc -l *txt

  235639 GSM503903_LNCaP_H3K4me2_Vehicle_2Lanes_normalized_peak.txt

  248570 GSM503904_LNCaP_H3K4me2_DHT_4h_2Lanes_normalized_peak.txt

  185892 GSM503905_LNCaP_H3K4me2_DHT_16h_peak.txt

   74491 GSM503906_LNCaP_H3K4me3_Vehicle_normalized_peak.txt

  104022 GSM503907_LNCaP_H3K4me3_DHT_4h_normalized_peak.txt

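I then visualized these nucleosome peaks using my aligned BAM files. They looked odd: I could not tell how the authors had found them, because nothing was visible in the plots. I later realized that the peak coordinates are hg18 while I was plotting against my own BAM files (aligned to hg19), hence the mismatch.

The plotting command: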

Rscript ~/CHIPseq_test/peakView.R GSM503907_LNCaP_H3K4me3_DHT_4h_normalized_peak.bed ../rawData/SRR037152.sorted.bam

peakView.R is simple: it uses samtools depth to extract the depth over each peak region's coordinates and then draws a curve.
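peakView.R itself is not reproduced here, so the following is only a guess at its core step, written in shell. It turns BED peak lines into region strings that samtools depth -r accepts. bed_to_regions is a hypothetical helper; the one substantive detail is the coordinate shift, since BED starts are 0-based and samtools regions are 1-based and inclusive.

```shell
# Hypothetical helper: convert BED lines (stdin) to samtools region strings.
# BED "chr1 99 200" covers bases 100..200 in 1-based coordinates.
bed_to_regions() {
  awk -F'\t' '!/^(#|track|browser)/ && NF >= 3 { printf "%s:%d-%d\n", $1, $2 + 1, $3 }'
}

# Example call (file names as used in this post): depth over the first peak only
# region=$(bed_to_regions < SRR037152_peaks.bed | head -n 1)
# samtools depth -r "$region" ../rawData/SRR037152.sorted.bam
```

The three-column output of samtools depth (chromosome, position, depth) can then be read into R and drawn as a curve per peak.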

Then I called peaks with MACS2 to see:

# http://www2.uef.fi/documents/1698400/2466431/Macs2/f4d12870-34f9-43ef-bf0d-f5d087267602

ls *sorted.bam |while read id;do ( nohup time ~/.local/bin/macs2 callpeak -t $id -f BAM -g hs -n ${id%%.*} 2>${id%%.*}.masc2.log &) ;done

## peak calling in batch on the 7 sequencing files

mkdir ../MACS2results

mv *bed *xls *Peak *r ../MACS2results

cd ../MACS2results

ls *.xls | while read id ; do

echo $id

grep '^chr\S' $id |perl -alne '{print "$F[0]\t$F[1]\t$F[2]\t$F[9]\t$F[7]\t+"}' >${id%%.*}.bed

done

Then browse the peaks again:

Rscript ~/CHIPseq_test/peakView.R SRR037152_peaks.bed ../rawData/SRR037152.sorted.bam

The peaks I called look fairly reliable. I will upload the figures later!

Using CEAS on ChIP-seq peaks

ulwvfje — Thu, 07 Jul 2016 13:09:10 +0000

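This software comes from Xiaole Liu's lab at Harvard. It connects seamlessly with the peak files called by MACS, and annotates and visualizes the peaks.

Its home page describes clearly which figures it can draw and what each one means: http://liulab.dfci.harvard.edu/CEAS/usermanual.html

Here I briefly cover how to install and use it, again with the test data from our ChIP-seq tutorial series.

## Install the software first; it is a Python program and easy to install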

## Download and install CEAS

## http://liulab.dfci.harvard.edu/CEAS/download.html

cd ~/biosoft

mkdir CEAS  &&  cd CEAS

wget  http://liulab.dfci.harvard.edu/CEAS/src/CEAS-Package-1.0.2.tar.gz

tar zxvf CEAS-Package-1.0.2.tar.gz

cd  CEAS-Package-1.0.2

python setup.py install --user

## http://liulab.dfci.harvard.edu/CEAS/usermanual.html

~/.local/bin/ceas --help

## Then test with the data bundled with the software

cd ~/biosoft/CEAS

mkdir testData && cd testData

## Human CD4T+ H3K36me3 ChIP-Seq data

wget http://liulab.dfci.harvard.edu/CEAS/src/H3K36me3_MACS_pval1e-5_peaks.bed.zip

wget http://liulab.dfci.harvard.edu/CEAS/src/H3K36me3.wig.zip

#########################run CEAS in the default mode##########################

$ ceas -g gdb -b bed -w wig

where gdb, bed, and wig stands for a sqlite3 file with a gene annotation table and genome background annotation, a BED file with ChIP regions, and a WIG file with ChIP enrichment signal file, respectively.

#########################example for test data : ####################

wget http://liulab.dfci.harvard.edu/CEAS/src/hg18.refGene.gz

The data files:

300M May 20 2009 H3K36me3.wig

68M May 20 2009 H3K36me3.wig.zip

110K May 20 2009 H3K36me3_MACS_pval1e-5_peaks.bed

43K May 20 2009 H3K36me3_MACS_pval1e-5_peaks.bed.zip

11M Dec 2 2010 hg18.refGene

Run CEAS on the test data:

~/.local/bin/ceas --name=H3K36me3_ceas --pf-res=20 \

--gn-group-names='Top 10%,Bottom 10%' -g hg18.refGene -b H3K36me3_MACS_pval1e-5_peaks.bed -w H3K36me3.wig

One key point is how to get the wig file: https://github.com/crazyhottommy/ChIP-seq-analysis

because here I only aligned with bowtie to get BAM files and did not run MACS.

GSM1278641 Xu_MUT_rep1_BAF155_MUT SRR1042593

GSM1278642 Xu_MUT_rep1_Input SRR1042594

I convert BAM to wig with a perl one-liner, using the two samples above as the example:

samtools depth SRR1042593.sorted.bam | perl -ne 'BEGIN{ print "track type=wiggle_0 name=SRR1042593 description=SRR1042593\n"}; ($c, $start, $depth) = split; if ($c ne $lastC) { print "variableStep chrom=$c span=10\n"; };$lastC=$c; next unless $. % 10 ==0;print "$start\t$depth\n" unless $depth < 3' > SRR1042593.wig

samtools depth SRR1042594.sorted.bam | perl -ne 'BEGIN{ print "track type=wiggle_0 name=SRR1042594 description=SRR1042594\n"}; ($c, $start, $depth) = split; if ($c ne $lastC) { print "variableStep chrom=$c span=10\n"; };$lastC=$c; next unless $. % 10 ==0;print "$start\t$depth\n" unless $depth < 3' > SRR1042594.wig

Having learned how to use the software, we can now try it on our own data.

## then process our own data

mkdir annotation  &&  cd annotation

wget http://liulab.dfci.harvard.edu/CEAS/src/hg19.refGene.gz ; gunzip hg19.refGene.gz

~/.local/bin/ceas --name=H3K36me3_ceas --pf-res=20 --gn-group-names='Top 10%,Bottom 10%'  \

-g hg19.refGene -b  ../paper_results/GSM1278641_Xu_MUT_rep1_BAF155_MUT.peaks.bed -w ../rawData/SRR1042593.wig

Downstream functional analysis of ChIP-seq peaks with the web tool GREAT

ulwvfje — Thu, 07 Jul 2016 12:57:16 +0000

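After a ChIP-seq run, if the experimental design is sound and the sequencing quality is fine, it is easy to call peaks that meet your criteria; you can also download many meaningful peak files from published papers or Roadmap, usually in BED format. The next step is to annotate and visualize these peaks, and to run downstream analyses on the genes associated with them: enrichment against pathway databases, MSigDB annotation, gene ontology annotation, and so on. For this I strongly recommend a web tool developed at Stanford: GREAT.

The tool was created mainly to address the lack of annotation for non-coding regions of the genome, which is exactly where ChIP-seq peaks usually fall.

Start at the tool's home page: http://bejerano.stanford.edu/great/public/html/

You upload one file at a time, namely the peak file you called; BED format is supported:

Results usually come back quickly!

First come three common plots; browse them as you like: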

Number of associated genes per region

Binned by orientation and distance to TSS

Binned by absolute distance to TSS

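Then come the pathway and GO annotations.

The site offers a great many pathway resources and is quite comprehensive, including KEGG, BioCarta, Reactome, MSigDB and more, plus some signatures and gene families, so most downstream analysis is done in one stop.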

GO Molecular Function (no terms)

GO Biological Process (no terms)

GO Cellular Component (no terms)

The test set of 5,225 genomic regions picked 2,992 (17%) of all 18,041 genes.
GO Molecular Function has 3,688 terms covering 15,090 (84%) of all 18,041 genes, and 189,388 term - gene associations.

3,688 ontology terms (100%) were tested using an annotation count range of [1, Inf].

The test set of 5,225 genomic regions picked 2,992 (17%) of all 18,041 genes.
GO Biological Process has 10,440 terms covering 15,441 (86%) of all 18,041 genes, and 950,065 term - gene associations.

10,440 ontology terms (100%) were tested using an annotation count range of [1, Inf].

Mouse Phenotype (no terms)

Human Phenotype (no terms)

Disease Ontology (no terms)

MSigDB Cancer Neighborhood (no terms)

Placenta Disorders (no terms)

PANTHER Pathway (no terms)

BioCyc Pathway (no terms)

MSigDB Pathway (no terms)

MGI Expression: Detected (no terms)

MSigDB Perturbation (no terms)

MSigDB Predicted Promoter Motifs (no terms)

MSigDB miRNA Motifs (no terms)

InterPro (no terms)

HGNC Gene Families (no terms)

MSigDB Oncogenic Signatures (no terms)

MSigDB Immunologic Signatures (no terms)

The test set of 5,225 genomic regions picked 2,992 (17%) of all 18,041 genes.
MSigDB Immunologic Signatures has 1,910 terms covering 16,609 (92%) of all 18,041 genes, and 363,333 term - gene associations.

1,910 ontology terms (100%) were tested using an annotation count range of [1, Inf].

Self-study ChIP-seq analysis, lecture 9: a roundup of ChIP-seq visualization

ulwvfje — Thu, 07 Jul 2016 12:53:47 +0000

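With this, our self-study ChIP-seq tutorial series comes to a pause. I will keep filling gaps and update the series based on reader feedback. Visualization is a fairly complex topic in its own right, not only for ChIP-seq data. It is a prerequisite for publishing, and figures that are clear at a glance show how well the analyst understands the data. Here I list a table of contents, some tools, and slides. This is mostly left for self-study; my blog has limited space, so I will not upload a pile of images. Pick any classic paper and you will find plenty of visual analyses.

First, two strongly recommended web tools for visualizing called peaks: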

http://chipseek.cgu.edu.tw/

http://bejerano.stanford.edu/great/public/html/

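Next, software from Xiaole Liu's lab at Harvard, also built specifically for plotting: http://liulab.dfci.harvard.edu/CEAS/usermanual.html

There is also a Java tool that can visualize ChIP-seq peaks: EXPANDER (EXpression Analyzer and DisplayER) is a java-based tool for analysis of gene expression data. http://acgt.cs.tau.ac.il/expander/help/ver7.0Help/html/Input_Data_.htm

Here is one figure picked at random:

The kinds of figures I know of are roughly the following. All have dedicated software, and can also be made with your own scripts: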

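Histogram of peak length distribution

Coverage view of each individual peak (IGV, Sushi)

Distribution of sequencing reads across the chromosomes genome-wide (chromosome ideograms)

Distribution of reads relative to gene position

Distribution of peaks relative to gene position

Distribution of reads by genomic position (one plot per chromosome)

Distribution of peaks by genomic position (one plot per chromosome)

Distribution of peaks across genomic features (upstream and downstream of genes, 5' and 3' UTRs, promoters, introns, exons, intergenic regions, microRNA regions), as bar charts or pie charts

Distance from peaks to the transcription start site (line plots and heatmaps)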

Average ChIP-Seq Gene Profile

ChIP-Seq Browser Tracks with Peak Calling

visualizes how ChIP regions are distributed over the genome along with their scores or peak heights.

Visual comparison of the two kinds of regions (whole tiled or mappable regions + whole regions): their percentage on each chromosome genome-wide (horizontal percentage bar chart) and their distribution over genomic features (percentage bar chart)

display the average ChIP enrichment signals around TSS and TTS of genes, respectively (genes are usually split into TOP 10%, BOTTOM 10% and ALL)

Since exon and intron lengths highly vary from gene to gene, CEAS groups exons (or introns) into multiple classes by length, to show how the ChIP enrichment signals are distributed over them

the average ChIP signal profiles on top 10 % , middle 10 %, and bottom 10 % of expressed genes

To wrap up

Someone abroad has written a similar self-study tutorial:

A worked ChIP-seq analysis example: http://www.biologie.ens.fr/~mthomas/other/chip-seq-training/

ChIP-seq pipeline: http://www.slideshare.net/COST-events/chipseq-data-analysis

And be sure to read this: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. http://www.ncbi.nlm.nih.gov/pubmed/22955991

Self-study ChIP-seq analysis, lecture 8: finding motifs

ulwvfje — Thu, 07 Jul 2016 12:45:38 +0000

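A motif is a short, characteristic sequence that occurs repeatedly and is generally considered biologically important, so motif finding usually follows a ChIP-seq analysis. There are two approaches. One is de novo discovery, which takes FASTA sequences as input, usually extracted from the coordinates of the peak regions. The other is matching against databases; many groups integrate existing ChIP-seq data to provide more complete and more accurate motif databases.

A motif is defined as follows: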

motif: recurring pattern. eg, sequence motif, structure motif or network motif

DNA sequence motif: short, recurring patterns in DNA that are presumed to have a biological function.

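As the definitions show, the word motif describes a recurring pattern, and a sequence motif is a recurring pattern in DNA that is presumed to have a biological function. Often it is the binding site of a sequence-specific protein (such as a transcription factor), or is involved in an important biological process (such as RNA initiation, RNA termination or RNA splicing).

Excerpted from: http://blog.163.com/zju_whw/blog/static/225753129201532104815301/

Motifs were first found experimentally. In other words, motif analysis did not begin with ChIP-seq; people were studying motifs long before. For example, the 'TATAAT' box was discovered by Pribnow in 1975; together with the upstream 'TTGACA' motif it forms the sequence specifically recognized at RNA polymerase binding sites. Even then it was known that not every binding site matches the motif perfectly: most match only 7-9 of the 12 bases. How well a binding site matches the motif is often related to how strongly the protein binds the DNA. More and more motifs are being identified; the TRANSFAC and JASPAR databases both hold large numbers of transcription-factor motifs. As ChIP-seq data accumulate, motif research will go deeper, and some groups integrate existing ChIP-seq data to provide more complete and more accurate motif databases.

Algorithmically this is a complex topic and I will not go into it; here I focus on best practice:

One paper lists nearly all the well-known tools up to 2014: A survey of motif finding Web tools for detecting binding site motifs in ChIP-Seq data. Link: https://biologydirect.biomedcentral.com/articles/10.1186/1745-6150-9-4

The most commonly used is the MEME suite:

http://meme-suite.org/ The input is FASTA sequence, so the peaks must be converted first: extract the sequences from the genome using the genomic coordinates in the BED file: http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

It bundles four motif-finding tools. Each has its own paper describing the method in detail, but the web page as a whole feels too busy and leaves beginners unsure where to start.

Just upload your own FASTA sequences. Again I use the data from this tutorial series: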

$ ls -lh *fasta

-rw-r--r-- 1 Jimmy 197121 18M Jul 7 19:40 GSM1278641_Xu_MUT_rep1_BAF155_MUT_sequence.fasta

-rw-r--r-- 1 Jimmy 197121 9.9M Jul 7 19:38 GSM1278643_Xu_MUT_rep2_BAF155_MUT_sequence.fasta

-rw-r--r-- 1 Jimmy 197121 26M Jul 7 19:41 GSM1278645_Xu_WT_rep1_BAF155_sequence.fasta

-rw-r--r-- 1 Jimmy 197121 14M Jul 7 19:41 GSM1278647_Xu_WT_rep2_BAF155_sequence.fasta

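Then you can see all the results; try it yourself.

Another common motif-finding tool is findMotifsGenome.pl, a perl script that comes with HOMER. It is not easy to install and places some demand on server resources, so I do not recommend it here.

Installation and usage: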

## Download and install homer (Hypergeometric Optimization of Motif EnRichment)
## // http://homer.salk.edu/homer/
## // http://blog.qiubio.com:8080/archives/3024
## pre-install: Ghostscript，seqlogo,blat
cd ~/biosoft
mkdir homer && cd homer
wget http://homer.salk.edu/homer/configureHomer.pl
perl configureHomer.pl -install
perl configureHomer.pl -install hg19

For a peak file from MACS, you also need to extract the right columns as HOMER's input:

awk '{print $4"\t"$1"\t"$2"\t"$3"\t+"}' sample_peaks.bed >sample_homer.bed

findMotifsGenome.pl sample_homer.bed hg19 motifDir -len 8,10,12

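The output folder contains a detailed HTML report, which is why many people like this software. HOMER is also a grab-bag that can handle the analysis of nearly all kinds of high-throughput sequencing data.

Finally, the R/Bioconductor packages that are popular now can also find motifs:

Most of these R packages can find motifs directly from the genomic coordinates in a BED file; some need FASTA input, in which case you extract the sequences from the genome using the BED coordinates yourself: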

rGADEM (motif discovery): http://bioconductor.org/packages/devel/bioc/html/rGADEM.html

MotIV (motif validation): http://bioconductor.org/packages/devel/bioc/html/MotIV.html

http://lgsun.grc.nia.nih.gov/CisFinder/

http://bioinfo.cs.technion.ac.il/drim/

http://www.ncbi.nlm.nih.gov/pubmed/20736340

There is also PICS (ChIP-seq), although it is not a Bioconductor package: http://www.rglab.org/pics-probabilistic-inference-for-chip-seq/ . It seems to be blocked in mainland China and I cannot open it.
