
A MeDIP-seq walkthrough: super simple, done in 2 hours!

ulwvfje — Wed, 15 Feb 2017 06:34:38 +0000

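Please do not copy and paste my code. Work out what each command does, type it yourself, and think about why I wrote it that way.

Use the latest version of each tool, especially the ones (such as samtools) that I call straight from my PATH. Because there are many readers, most other commands are written with a path that shows the version I used.

It took me two hours, but that does not mean you will learn it in two hours. Some readers have told me it took them two weeks, which is perfectly normal. Do not expect to reach my level in two hours.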

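MeDIP-seq is analysed in essentially the same way as ChIP-seq, and the same goes for hMeDIP-seq, caMeDIP-seq and so on. There is no fundamental difference; only the antibody changes. Please look up the background yourself; I only cover the data analysis.

A ChIP-seq walkthrough: super simple, done in 2 hours!

An RNA-seq walkthrough: super simple, done in 2 hours!

Please read these earlier posts in the series first. This is easy for me because the software is installed, the data are downloaded and I can read all of the code. It may not be easy for you: some readers needed two weeks to work through it, but they did get there in the end!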

The paper is "Dnmt3L antagonizes DNA methylation at bivalent promoters and favors DNA methylation at gene bodies in ESCs" (https://www.ncbi.nlm.nih.gov/pubmed/24074865), published in Cell in 2013 and well worth reproducing!

The MeDIP-seq data are at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44642

First download the raw data:

wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP018/SRP018845/SRR764931/SRR764931.sra

wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP018/SRP018845/SRR764932/SRR764932.sra

ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 $id;done
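The download step above can be sketched as a small dry-run script. The helper name `sra_url` is made up for illustration, and the URL layout is simply copied from the wget lines in this post; NCBI may no longer serve this FTP layout, so treat it as a pattern rather than a guaranteed address.

```shell
# sra_url is a hypothetical helper (not part of sratoolkit). It rebuilds the
# URL layout seen in the wget commands above:
#   .../SRP/<study minus its last 3 characters>/<study>/<run>/<run>.sra
# It assumes a 9-character study accession such as SRP018845.
sra_url() {
  _study=$1
  _run=$2
  _base=ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra
  # ${_study%???} drops the last three characters: SRP018845 -> SRP018
  printf '%s/SRP/%s/%s/%s/%s.sra\n' "$_base" "${_study%???}" "$_study" "$_run" "$_run"
}

# Dry run: print the commands for both runs instead of executing them.
for run in SRR764931 SRR764932; do
  echo "wget $(sra_url SRP018845 "$run")"
  echo "fastq-dump --split-3 $run.sra"
done
```

Printing the commands first makes it easy to check the accessions before anything is downloaded; remove the `echo` wrappers to run them for real.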

I checked read quality with FastQC and it looked excellent, so I did not need to filter the reads. The command:

ls *fastq |xargs ~/biosoft/fastqc/FastQC/fastqc -t 10

If you do need to filter, use the following:

ls *.fastq | while read id
do
~/biosoft/sickle/sickle-master/sickle se -t sanger -g -f $id -o ${id%%.*}.trimmed.fq.gz
done
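The output name in the sickle loop relies on the shell expansion `${id%%.*}`. A minimal sketch of what it does follows; `trimmed_name` is an illustrative helper, not something from the original pipeline.

```shell
# trimmed_name derives the sickle output name from a FASTQ file name,
# using the same ${id%%.*} expansion as the loop above.
trimmed_name() {
  id=$1
  # %%.* removes the longest suffix that starts with a dot,
  # so everything from the first dot onward is dropped.
  echo "${id%%.*}.trimmed.fq.gz"
}

trimmed_name SRR764931.fastq      # SRR764931.trimmed.fq.gz
trimmed_name sample.R1.fastq.gz   # sample.trimmed.fq.gz (".R1" is lost too)

# A single % removes only the shortest matching suffix:
id=sample.R1.fastq
echo "${id%.*}"                   # sample.R1
```

For file names that contain extra dots (for example paired-end `.R1`/`.R2` labels), the single-`%` form is usually the safer choice.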

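First align the FASTQ files to the mm10 reference genome with bowtie2. There are only two samples, so I did not bother writing a loop.

This dataset has no control, so the four steps of peak calling can all be done in one go.

Once the BAM file is sorted, MACS2 can call peaks on it directly.

Then convert the BAM file to a bigWig (bw) file and plot it.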

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -x ~/reference/index/bowtie/mm10 -U SRR764931.fastq | samtools sort -O bam -o shDnmt3L.bam

## Alignment rates are high: 96.67% (shDnmt3L) and 96.59% (shGFP). Nothing to complain about!

samtools index shDnmt3L.bam

~/.local/bin/macs2 callpeak -t shDnmt3L.bam -m 10 30 -p 1e-5 -f BAM -g mm -n shDnmt3L 2>shDnmt3L.masc2.log

bamCoverage -b shDnmt3L.bam -o shDnmt3L.bw ## optional extras: -p 10 --normalizeUsingRPKM (deepTools 3 and later spell this --normalizeUsing RPKM)

computeMatrix reference-point --referencePoint TSS -b 10000 -a 10000 -R ~/annotation/CHIPseq/mm10/ucsc.refseq.bed -S shDnmt3L.bw --skipZeros -o matrix1_shDnmt3L_TSS.gz

plotHeatmap -m matrix1_shDnmt3L_TSS.gz -out shDnmt3L.png

There are only two samples, so I did not write a loop. By now you should be able to read these commands!
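For readers who do want a loop, here is a dry-run sketch that prints the same four commands for each sample. `run_sample` is a made-up helper, the tools are assumed to be on your PATH, and pairing SRR764932 with the name shGFP is my assumption (the post only shows the shDnmt3L commands); check the GEO record before relying on it.

```shell
# run_sample prints (does not run) the commands for one sample.
# Arguments: SRR accession, sample name.
run_sample() {
  _srr=$1
  _name=$2
  echo "bowtie2 -p 6 -x ~/reference/index/bowtie/mm10 -U $_srr.fastq | samtools sort -O bam -o $_name.bam -"
  echo "samtools index $_name.bam"
  echo "macs2 callpeak -t $_name.bam -m 10 30 -p 1e-5 -f BAM -g mm -n $_name 2> $_name.macs2.log"
  echo "bamCoverage -b $_name.bam -o $_name.bw"
}

# Sample table: accession and name. The second row is an assumption.
while read -r srr name; do
  run_sample "$srr" "$name"
done <<'EOF'
SRR764931 shDnmt3L
SRR764932 shGFP
EOF
```

Pipe the output to `sh` once the printed commands look right.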

That is all for the analysis!

Reference: http://crazyhottommy.blogspot.com/search/label/MeDIP-seq

A ChIP-seq walkthrough: super simple, done in 2 hours!

ulwvfje — Tue, 10 Jan 2017 03:14:22 +0000


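The paper chosen for this walkthrough studies protein complexes such as PRC1 and PRC2, so this is not ChIP-seq of a transcription factor or a histone mark. Keep the difference in mind!

This post is part of a series. You may want to read these first:

A worked example of expression microarray data processing

An RNA-seq walkthrough: super simple, done in 2 hours!

WES (7): looking at de novo variants

[Live] My genome 22: using IGV to check whether a specific site is mutated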

The paper is: RYBP and Cbx7 define specific biological functions of polycomb complexes in mouse embryonic stem cells

https://www.ncbi.nlm.nih.gov/pubmed/23273917

RYBP and Cbx7 are both components of Polycomb repressive complex 1 (PRC1).

All the data are at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42466

so they can be downloaded in bulk from the FTP site with a script:

ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP017/SRP017311

The download addresses are easy to work out!

for ((i=204;i<=209;i++)) ;do wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP017/SRP017311/SRR620$i/SRR620$i.sra;done

ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 $id;done


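I checked read quality with FastQC:

ls *fastq |xargs ~/biosoft/fastqc/FastQC/fastqc -t 10

The 3' ends looked a little poor, so I added the bowtie2 options -3 5 --local (trim 5 bases from the 3' end of every read and align in local mode).

First align the FASTQ files to the mm10 reference genome with bowtie2: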

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -3 5 --local -x ~/reference/index/bowtie/mm10 -U SRR620204.fastq| samtools sort -O bam -o ring1B.bam

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -3 5 --local -x ~/reference/index/bowtie/mm10 -U SRR620205.fastq| samtools sort -O bam -o cbx7.bam

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -3 5 --local -x ~/reference/index/bowtie/mm10 -U SRR620206.fastq| samtools sort -O bam -o suz12.bam

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -3 5 --local -x ~/reference/index/bowtie/mm10 -U SRR620207.fastq| samtools sort -O bam -o RYBP.bam

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -3 5 --local -x ~/reference/index/bowtie/mm10 -U SRR620208.fastq| samtools sort -O bam -o IgGold.bam

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -3 5 --local -x ~/reference/index/bowtie/mm10 -U SRR620209.fastq| samtools sort -O bam -o IgG.bam

The BAM files should really get some simple filtering next, to remove unaligned and multi-mapped reads, but I was lazy and called peaks with MACS2 directly!

nohup ~/.local/bin/macs2 callpeak -c ../IgGold.bam -t ../suz12.bam -m 10 30 -p 1e-5 -f BAM -g mm -n suz12 2>suz12.masc2.log &

nohup ~/.local/bin/macs2 callpeak -c ../IgGold.bam -t ../ring1B.bam -m 10 30 -p 1e-5 -f BAM -g mm -n ring1B 2>ring1B.masc2.log &

nohup ~/.local/bin/macs2 callpeak -c ../IgG.bam -t ../cbx7.bam -m 10 30 -p 1e-5 -f BAM -g mm -n cbx7 2>cbx7.masc2.log &

nohup ~/.local/bin/macs2 callpeak -c ../IgG.bam -t ../RYBP.bam -m 10 30 -p 1e-5 -f BAM -g mm -n RYBP 2>RYBP.masc2.log &
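The four calls above differ only in the sample name and in which IgG control is used. A minimal sketch that makes that pairing explicit follows; `control_for` is an illustrative helper, and the sample-to-control mapping is read off the commands in this post, not from the GEO metadata.

```shell
# control_for returns the control BAM prefix used for each IP sample above.
control_for() {
  case $1 in
    suz12|ring1B) echo IgGold ;;
    cbx7|RYBP)    echo IgG ;;
    *) echo "unknown sample: $1" >&2; return 1 ;;
  esac
}

# Dry run: print one macs2 command per sample.
for s in suz12 ring1B cbx7 RYBP; do
  c=$(control_for "$s")
  echo "macs2 callpeak -c ../$c.bam -t ../$s.bam -m 10 30 -p 1e-5 -f BAM -g mm -n $s 2> $s.macs2.log"
done
```

Keeping the pairing in one place makes it harder to call a sample against the wrong control when the list grows.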

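As you can see, I get almost no peaks for the RYBP ChIP-seq, even after switching to the other control, unless I use no control at all! I looked at it in IGV and the RYBP data really are odd; I suspect the authors uploaded the wrong file.

For comparison, the peak counts the authors deposited in GEO are: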

2754 GSE42466_Cbx7_peaks_10.txt
6982 GSE42466_Ring1b_peaks_10.txt
6872 GSE42466_RYBP_peaks_5.txt
8054 GSE42466_Suz12_peaks_10.txt

First convert all of these BAM files to bigWig (bw) files in a batch, then plot them in a batch.

ls ../*bam |while read id
do
file=$(basename $id )
sample=${file%%.*}
echo $sample
bamCoverage -b $id -o $sample.bw ## optional extras: -p 10 --normalizeUsingRPKM (deepTools 3 and later spell this --normalizeUsing RPKM)
computeMatrix reference-point --referencePoint TSS -b 10000 -a 10000 -R ~/annotation/CHIPseq/mm10/ucsc.refseq.bed -S $sample.bw --skipZeros -o matrix1_${sample}_TSS.gz --outFileSortedRegions regions1_${sample}_genes.bed
plotHeatmap -m matrix1_${sample}_TSS.gz -out ${sample}.png
done

Next, combine the bigWig files of all the ChIP-seq samples and draw profile and heatmap plots around the TSS of every gene:

computeMatrix reference-point -p 10 --referencePoint TSS -b 2000 -a 2000 -S ../*bw -R ~/annotation/CHIPseq/mm10/ucsc.refseq.bed --skipZeros -o tmp4.mat.gz

plotHeatmap -m tmp4.mat.gz -out tmp4.merge.png

plotProfile --dpi 720 -m tmp4.mat.gz -out tmp4.profile.pdf --plotFileFormat pdf --perGroup

plotHeatmap --dpi 720 -m tmp4.mat.gz -out tmp4.merge.pdf --plotFileFormat pdf

Finally, combine the bigWig files of all the ChIP-seq samples and draw profile and heatmap plots over the gene bodies:

computeMatrix scale-regions -p 10 -S ../*bw -R ~/annotation/CHIPseq/mm10/ucsc.refseq.bed -b 3000 -a 3000 -m 5000 --skipZeros -o tmp5.mat.gz

plotHeatmap -m tmp5.mat.gz -out tmp5.merge.png

plotProfile --dpi 720 -m tmp5.mat.gz -out tmp5.profile.pdf --plotFileFormat pdf --perGroup

plotHeatmap --dpi 720 -m tmp5.mat.gz -out tmp5.merge.pdf --plotFileFormat pdf

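Below are examples of the output figures; I only show the ones around the TSS.

The figure shows that RYBP peaks are centred on the TSS, while the other peaks sit slightly downstream of it.

The Sequential ChIP (re-ChIP) experiment does indeed show that RYBP and CBX7 peaks overlap.

The paper keeps coming back to how the peaks from these ChIP-seq experiments overlap: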

PRC1 has a very complex composition: Cbx (Cbx2, Cbx4, Cbx6, Cbx7, or Cbx8); Ring1A or Ring1B; PHC (PHC1, PHC2, or PHC3); PCGF (PCGF1, PCGF2, PCGF3, PCGF4, PCGF5, or PCGF6); and RYBP or YAF2.
Among these, Ring1A/B is the E3 ligase subunit that monoubiquitinates histone H2A at lysine 119 (H2AK119ub).
Not all of them have to be present at once. Different combinations form different PRC1 complexes, which are all simply called PRC1.
In mouse ESCs, for example, there are two kinds of PRC1, in which Cbx7 and RYBP cannot coexist. We can call them Cbx7-PRC1 and RYBP-PRC1.
Cbx7 recruits Ring1B to chromatin and is required for that. The genes it binds are mostly involved in early-lineage commitment of ESCs.
RYBP enhances the enzymatic activity of PRC1. The genes it binds are mostly involved in regulation of metabolism and cell-cycle progression.
Genes bound by RYBP are expressed at higher levels than genes bound by CBX7, because wherever CBX7 binds, the repressive PRC2 complex is recruited as well.
PRC2 deposits the histone H3 lysine 27 trimethyl repressive mark (H3K27me3) through the Ezh1/2 histone methyltransferase enzymes.

How does the paper describe the overlap between these peaks?
We observed an overlap of RYBP peaks (3,918 in total) with 14%, 42%, and 37% of Cbx7, Ring1B, and Suz12 peaks, respectively
Moreover, although more than 90% of Cbx7 peaks contained Ring1B and Suz12, 20% were also bound by RYBP
Although RYBP and Cbx7 are mutually exclusive in most cases, they do co-localise in a small part of the genome. The Ring1B/Suz12 signal can be explained by the Cbx7 and RYBP peaks:
RYBP and Cbx7 both present: Ring1B/Suz12 is high
Cbx7 but not RYBP: Ring1B/Suz12 is somewhat lower
RYBP but not Cbx7: Ring1B/Suz12 is lower still
neither RYBP nor Cbx7: Ring1B/Suz12 is lowest

RYBP also has some peaks that the other PRC1 components lack, which suggests it can act independently of PRC1.
H2AK119ub correlates positively with Ring1B/Suz12, but it overlaps only 25.7% with RYBP versus 72% with CBX7, so PRC1 target genes can be divided into three classes:
a first set with Cbx7/Ring1B/H2AK119ub;
a second that contains RYBP and lower levels of Ring1B/H2AK119ub
a third set cobound by RYBP/Cbx7/Ring1B and that also contains H2AK119ub.
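Overlap percentages like the ones quoted above are normally computed with bedtools. To show the underlying logic without any extra tools, here is a small awk sketch on toy intervals; `overlap_fraction` and the toy coordinates are made up for illustration, and the quadratic scan is only suitable for small files.

```shell
# overlap_fraction A.bed B.bed
# Prints the fraction of intervals in A that overlap at least one interval
# in B on the same chromosome (BED coordinates, half-open).
overlap_fraction() {
  # B ($2) is read first and stored; A ($1) is then scanned line by line.
  awk -F'\t' '
    NR == FNR { n[$1]++; s[$1, n[$1]] = $2 + 0; e[$1, n[$1]] = $3 + 0; next }
    {
      total++
      for (i = 1; i <= n[$1]; i++)
        if ($2 + 0 < e[$1, i] && $3 + 0 > s[$1, i]) { hit++; break }
    }
    END { if (total) printf "%.2f\n", hit / total; else print "NA" }
  ' "$2" "$1"
}

# Toy data: 4 intervals in A, 2 in B.
tmp=$(mktemp -d)
printf 'chr1\t100\t200\nchr1\t300\t400\nchr2\t50\t60\nchr1\t1000\t1100\n' > "$tmp/a.bed"
printf 'chr1\t150\t250\nchr2\t10\t55\n' > "$tmp/b.bed"
overlap_fraction "$tmp/a.bed" "$tmp/b.bed"   # 2 of the 4 A intervals overlap B
```

Note that the fraction is not symmetric: swapping the two files answers a different question (what share of B overlaps A), which is why the paper reports separate percentages for each factor.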

All of these gene lists can then go into GO/KEGG analysis to see whether they carry any biological meaning:
genes co-occupied by Ring1B/Cbx7/RYBP and H2AK119ub are involved in system development.
genes containing RYBP/Ring1B/H2AK119ub, but not Cbx7, have a strong association with the M phase of the meiotic cycle and cellular metabolism
genes with Cbx7/Ring1B/H2AK119ub are involved in developmental processes and mesoderm specification,
those containing RYBP/Cbx7/Ring1B/H2AK119ub predominantly represent the ectodermal fate and, to a lesser extent, mesoderm and endoderm fates

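More than 700 genes have RYBP/Cbx7/Ring1B peaks, so the authors knocked out Cbx7 to see whether the RYBP peaks change. They did not do ChIP-seq for this, only ChIP-qPCR.

The following conclusion is important: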
Overall, our ChIP-seq analysis allowed us to identify five types of genes according to the occupancy of PRC1 and PRC2: those with
(1) Ring1B/Cbx7/RYBP and Suz12 (725 genes);
(2) Ring1B/Cbx7/Suz12, but not RYBP (1,527 genes);
(3) Ring1B/RYBP/Suz12, but not Cbx7 (861 genes);
(4) only Ring1B and Suz12 (1,694 genes); or
(5) RYBP but no Polycomb proteins (1,674)

ngs.plot for ChIP-seq data analysis: visualization

ulwvfje — Sun, 01 Jan 2017 02:18:17 +0000

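I have recently been busy with some ChIP-seq analysis projects. Visualizing this kind of data is a bit involved, and writing my own code would take quite a while, so I looked for ready-made tools. I tried deepTools first; it is good but a little slow. It is a Python program, described in this post:

deepTools for ChIP-seq data analysis: visualization

Now I am switching to an R program. It is very fast, and I personally find its plots a little nicer. Try both and see.

The source code is on GitHub, and the authors have given talks about the package. The slides below describe its features concisely.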

github: https://github.com/shenlab-sinai/ngsplot

ppt:http://jura.wi.mit.edu/bio/education/hot_topics/ngsplot/ngsplot_Apr2014.pdf

example:https://drive.google.com/drive/folders/0B1PVLadG_dCKN1liNFY0MVM1Ulk

Installation is very simple: just download the software and the test data from Google Drive.

cd ~/biosoft

mkdir ngsplot && cd ngsplot

## download by yourself :https://drive.google.com/drive/folders/0B1PVLadG_dCKN1liNFY0MVM1Ulk

tar -zxvf ngsplot-2.61.tar.gz

tar zxvf ngsplot.eg.bam.tar.gz ## the test data show clearly what kind of input ChIP-seq visualization needs

cp ../ngsplot/example/config.example.txt ./ ## needed by the test commands later

echo 'export PATH=/home/jianmingzeng/biosoft/ngsplot/ngsplot/bin:$PATH' >>~/.bashrc

echo 'export NGSPLOT=/home/jianmingzeng/biosoft/ngsplot/ngsplot' >>~/.bashrc

source ~/.bashrc

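## R must already be installed on your server, and you have to install these packages yourself from inside R.

install.packages("doMC", dep=T)

install.packages("caTools", dep=T)

## utils ships with base R, so it needs no separate install

source("http://bioconductor.org/biocLite.R") ## biocLite is retired on current Bioconductor releases; use BiocManager::install() there

biocLite( "BSgenome" )

biocLite( "Rsamtools" )

biocLite( "ShortRead" )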

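Usage is simple: once you understand ngs.plot.r, one command produces the figure. If you are not happy with it, use replot.r to redraw it with different parameters.

You need the genome files first. The software ships with hg19; the other genomes are listed at https://github.com/shenlab-sinai/ngsplot/wiki/SupportedGenomes , but they are all hosted on Google Drive, so some readers will need a proxy to download them: https://drive.google.com/drive/folders/0B1PVLadG_dCKNEsybkh5TE9XZ1E

The test data look like this:

With these test data in place, the software also comes with test commands: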

ngs.plot.r -G hg19 -R tss -C hesc.H3k4me3.1M.bam -O k4.test

ngs.plot.r -G hg19 -R tss -C config.example.txt -O encode1M.k4k27

To plot several BAM files together, fill in config.example.txt following the rules the authors define.
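As a rough sketch of such a configuration file: as I understand the ngs.plot documentation, each row is tab-delimited with a BAM file, a gene list ("-1" for the whole genome) and a legend title. Treat that column layout as an assumption and compare it with the bundled config.example.txt. `write_config` and the output prefix `my.k4.input` are made-up names; the BAM files are the test files used in this post.

```shell
# write_config writes a two-row ngs.plot configuration to the given path.
# Assumed columns: BAM file <TAB> gene list (-1 = whole genome) <TAB> "title"
write_config() {
  printf '%s\t%s\t"%s"\n' \
    hesc.H3k4me3.1M.bam -1 H3K4me3 \
    hesc.Input.500K.bam -1 Input > "$1"
}

cfg=$(mktemp)
write_config "$cfg"
cat "$cfg"

# Dry run: the plotting command that would use this file.
echo "ngs.plot.r -G hg19 -R tss -C $cfg -O my.k4.input"
```

Generating the file from a script keeps the tab separators intact, which is easy to get wrong when editing by hand.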

If you are not happy with the figures above, replot.r can redraw them from the saved results with new parameters.

replot.r prof -I k4.test.zip -O k4.replot -SE 0 -MW 9 -H 0.3

replot.r heatmap -I encode1M.k4k27.zip -O k4k27.replot -GO hc -RR 80

Besides the TSS, you can plot around the gene body or other regions: tss, tes, genebody, exon, cgi, enhancer, dhs or bed

ngs.plot.r -G hg19 -R genebody -F rnaseq -C hesc.RNAseq.1M.bam -O encode1M.rnaseq

ngs.plot.r -G hg19 -R tss -C hesc.H3k4me3.1M.bam:hesc.Input.500K.bam -O k4vsInp

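It is easy to see what the authors intend, and then you can make the same plots with your own data!

If it does not click right away, read the README on GitHub slowly. It really is simple, and I am not sure what more I could explain.

It also runs very fast!

Of course, that may be because the test files are small to begin with.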

The genome files can be found in this Google drive folder: ngs.plot genome folder. A list of the available genomes is listed in this Wiki: SupportedGenomes. A brief list is here (not all): "human (hg18, hg19), chimpanzee (panTro4), rhesus macaque (rheMac2), mouse (mm9, mm10), rat (rn4, rn5), cow (bosTau6), chicken (galGal4), zebrafish (Zv9), drosophila (dm3), Caenorhabditis elegans (ce6, ceX), Saccharomyces cerevisiae (sacCer2, sacCer3), Schizosaccharomyces pombe (Asm294), Arabidopsis thaliana (TAIR10), Zea mays (AGPv3), rice (IRGSP-1.0)".

Bioinformatics analysis papers are an exercise in "look at the figure, write the essay"

ulwvfje — Wed, 28 Dec 2016 07:14:39 +0000

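First, derive summary data from the raw sequencing data.

Then turn the summary statistics into figures and tables.

Finally, write the story from the figures.

Source: Genome-wide Mapping of HATs and HDACs Reveals Distinct Functions in Active and Inactive Genes

http://www.sciencedirect.com/science/article/pii/S0092867409008411

Take the figure below. It is ChIP-seq data: after alignment, read density is summarized over the gene regions of all genes in the genome.

How should the story be written?

Start with the figure legend: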

A. Profiles of HATs binding across 5’ gene ends, 3’ gene ends and gene body regions of the 1000 most active, intermediately active and least active genes were examined using ChIP-Seq. txStart: transcription start site. txEnd: transcription end site.

B. Profiles of HATs binding across intergenic (5kb away from any gene) or promoter (defined as +/− 1kb surrounding TSS) DNase HS sites. DNase HS sites were obtained from (Boyle et al., 2008).

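The authors generated ChIP-seq data for five HATs. Based on the figure above they fall into three groups: CBP and p300; PCAF (p300/CBP associated factor) and GCN5; MOF and Tip60. All of them are protein acetyltransferases, yet their ChIP-seq profiles differ, as a close look at the figure shows. The differences need explaining, and explaining takes biological background. CBP and p300 are highly homologous in structure, and earlier work shows they act mainly in transcription initiation. PCAF and GCN5 form a second highly homologous pair, which earlier work links to transcription elongation. MOF and Tip60 belong to the MYST family of HATs, which is rather different from the HATs above; earlier work shows their functions are very diverse, so their binding density over genes differs from the others. After that you discuss what these proteins do in other species and how that compares with human, cross-check against a few published ChIP-seq datasets, and mention that you validated a random subset experimentally, because high-throughput sequencing is not a gold standard.

The next figure relates ChIP-seq read density to gene expression, which is also simple.

How should the story be written?

Start with the figure legend: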

C. Correlation between HAT binding and gene expression levels. Genes were grouped to 100 gene (one dot in the figure) sets according to expression level. The HAT binding level in promoter region was calculated for the same 100 gene sets. The y-axis indicates the HAT binding level and the x-axis indicates the expression level.

D. Correlation between HAT binding and RNA Pol II binding levels among the 100 gene sets grouped according to expression levels as defined in panel C. The y-axis indicates the HAT binding level and the x-axis indicates the Pol II level.

E. Correlation between HAT binding and histone acetylation levels among the 100 gene sets grouped according to expression levels as defined in panel C. The acetylation level was calculated by pooling all reads for 18 histone acetylations mapped previously (Wang et al., 2008). The y-axis indicates the HAT binding level and the x-axis indicates the acetylation level.

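The legend is long, but it carries little information. Genes are ranked by expression using the transcriptome data and split into groups, and for each group the ChIP-seq density over its genes is plotted against expression: a simple correlation plot. The point is that HAT binding correlates positively with gene expression. Expression level is closely tied to Pol II binding density, so you can also correlate Pol II binding density with the binding density of each ChIP-seq IP and reach the same conclusion.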
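The grouping described in the legend (rank genes by expression, cut the ranking into fixed-size sets, average binding within each set) can be sketched with sort and awk. `bin_means` and the toy table are made up for illustration; the paper uses sets of 100 genes, while the toy example uses sets of 2 so the numbers are easy to check by hand.

```shell
# bin_means SIZE FILE
# FILE is tab-delimited: gene <TAB> expression <TAB> binding.
# Genes are sorted by expression (high to low) and cut into sets of SIZE;
# one line per set is printed: mean expression <TAB> mean binding.
bin_means() {
  sort -t "$(printf '\t')" -k2,2nr "$2" |
    awk -F'\t' -v size="$1" '
      { e += $2; b += $3; k++ }
      k == size { printf "%.2f\t%.2f\n", e / k, b / k; e = b = k = 0 }
      END { if (k) printf "%.2f\t%.2f\n", e / k, b / k }'
}

# Toy table with five genes.
tbl=$(mktemp)
printf 'g1\t10\t5\ng2\t8\t3\ng3\t6\t4\ng4\t2\t1\ng5\t1\t0\n' > "$tbl"
bin_means 2 "$tbl"
```

Each output line corresponds to one dot in a plot like panel C; the last set may hold fewer genes than the others when the gene count is not a multiple of the set size.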

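The authors of this paper generated ChIP-seq data for the whole series of HATs, and for the HDACs as well!

Beginners in bioinformatics mostly get stuck on how to obtain data that can be plotted. The plotting itself is easy; even without R you can do it in Excel. The "write the essay from the figure" part that follows is mostly a test of biological knowledge.

One last figure: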

A. Distribution profiles of HDAC6, Tip60, Pol II and H3K36me3 across the active genes were plotted. The left y-axis indicates tag densities for HDAC6, Tip60 and Pol II. The right axis indicates tag densities for H3K36me3.

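Not much to add here: HATs, HDACs and Pol II clearly share one pattern, all marking active transcription, and it differs markedly from the H3K36me3 pattern. The observation is novel and interesting, so you add some biological interpretation: why do HATs, HDACs and Pol II share a pattern? Offer your own hypotheses and speculation, which again requires biological background.

How to obtain the data behind a plot like this takes longer to explain.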

deepTools for ChIP-seq data analysis: visualization

ulwvfje — Thu, 15 Dec 2016 11:20:00 +0000

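Many readers have written in to ask: once ChIP-seq reads are aligned, how do you draw heatmaps and profile plots from the BAM files over all genes in the genome?

I strongly recommend deepTools for this:

The first function converts a BAM file to bigWig (bw) format: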

bamCoverage -b tmp.sorted.bam -o tmp.bw

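One very important parameter is --extendReads. MACS has the same idea, for example macs2 pileup --extsize 200. Reads may differ in length, and the question is whether to extend them all to one uniform length. A sequenced read means there is signal at that location; it is shorter than the underlying fragment only because the sequencer reads a limited number of bases, or because low-quality bases at the ends were trimmed during QC. bamCoverage can also normalRize sequencing depth according to the effective genome size.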
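To make the link between read extension and effective-genome-size normalization concrete, here is a back-of-the-envelope sketch of the 1x scaling factor as I understand it from the deepTools documentation: (mapped reads × fragment length) / effective genome size. `rpgc_scale` is a made-up helper, not a deepTools command, and 2.65e9 is only an approximate mouse genome size used for illustration.

```shell
# rpgc_scale READS FRAGMENT_LENGTH EFFECTIVE_GENOME_SIZE
# Prints the expected average coverage. Dividing per-bin coverage by this
# number rescales a track to 1x average coverage.
rpgc_scale() {
  awk -v reads="$1" -v frag="$2" -v genome="$3" \
    'BEGIN { printf "%.3f\n", (reads * frag) / genome }'
}

# 20 million reads, each extended to 200 bp, on a ~2.65 Gb genome (assumed):
rpgc_scale 20000000 200 2650000000

# The same reads without extension (36 bp) give a much smaller factor:
rpgc_scale 20000000 36 2650000000
```

The comparison shows why the extension length matters: it changes how much of the genome each read is counted as covering, and therefore the scale of the whole track.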

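The second function draws a heatmap of signal around all genes. Tools: computeMatrix, then plotHeatmap

http://deeptools.readthedocs.io/en/latest/content/example_step_by_step.html#heatmaps-and-summary-plots

You need to download a suitable gene-coordinate file in BED format yourself. Then chain the two commands above. Example code and figure:

The third function draws profile plots!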

use computeMatrix for all of the signal files (bigWig format) at once

use plotProfile on the resulting file

There are many more small features to explore. deepTools is a Python package and installs easily:

## Download and install deepTools

## https://github.com/fidelram/deepTools

## http://deeptools.readthedocs.io/en/latest/content/example_usage.html

pip install pyBigWig --user

cd ~/biosoft

mkdir deepTools && cd deepTools

git clone https://github.com/fidelram/deepTools ## 130M,

cd deepTools

python setup.py install --user

## 17 tools in ~/.local/bin/

~/.local/bin/deeptools

After installation, the individual tools are placed in ~/.local/bin/:

If you have a large batch of BAM files to process, use the script below:

ls ../bamFiles/*bam |while read id

do

file=$(basename $id )

sample=${file%%.*}

echo $sample

bamCoverage -b $id -o $sample.bw

computeMatrix reference-point --referencePoint TSS -b 10000 -a 10000 -R ~/annotation/CHIPseq/mm10/ucsc.refseq.bed -S $sample.bw --skipZeros -o matrix1_${sample}_TSS.gz --outFileSortedRegions regions1_${sample}_genes.bed

plotHeatmap -m matrix1_${sample}_TSS.gz -out ${sample}.png

done

Should peak calling on ChIP-seq data use only uniquely mapped reads?

ulwvfje — Sun, 07 Aug 2016 13:13:18 +0000

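I taught myself ChIP-seq analysis, so there are many details I am still picking up. This post records one of them: after aligning the sequencer's FASTQ data to the reference genome, how should the alignments be processed before they are handed to a peak caller? Some blog posts say to keep only alignments with mapping quality above 30; others say to keep only uniquely mapped reads. I tested the difference on a public dataset, described here: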

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE74311

Reference pipeline: https://github.com/jmzeng1314/NGS-pipeline/tree/master/CHIPseq

GSM1916974	H3K27ac ChIP-seq	SRR2774675
GSM1916975	input DNA	SRR2774676

First download SRR2774675.sra and SRR2774676.sra from the SRA database:

http://www.ncbi.nlm.nih.gov/sra?term=SRP065184

With my GitHub pipeline the alignment is quick. I ran peak calling on the alignments filtered in each of the two ways and obtained two sets of peak files.

39709 highQuaily_peaks.bed
39709 highQuaily_summits.bed

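The number of peaks from the two runs does not differ noticeably. Let's take a quick look at the first few lines!

bedtools could show how the two files intersect, but here I used ready-made functions from the R package ChIPpeakAnno, which give the result directly.

For ChIPpeakAnno itself, please read its manual. Here is my code: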

library(ChIPpeakAnno)
highPeak <- readPeakFile( 'highQuaily_peaks.bed' )
uniquePeak <- readPeakFile( 'unique_peaks.bed' )
ol <- findOverlapsOfPeaks(highPeak, uniquePeak)
png('overlapVenn.png')
makeVennDiagram(ol)
dev.off()

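Then open the Venn diagram that was drawn:

The two approaches give almost identical results. If you are curious, look into the peaks that are unique to each set and work out where they come from!

The conclusion: for the peak-calling step of ChIP-seq analysis, keeping only alignments with mapping quality above 30 and keeping only uniquely mapped reads make little difference from a data-processing point of view. Which one to use depends mainly on the purpose of your experiment.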
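The two filtering rules compared in this post can be sketched on a few toy SAM records. On real data the mapping-quality filter is usually `samtools view -b -q 30 in.bam > out.bam`. For "unique" I assume the common bowtie2 convention of dropping reads that carry an `XS:i:` tag (a second-best alignment exists); that definition, the helper names and the toy reads are my assumptions, not necessarily what the author's pipeline does.

```shell
# filter_mapq Q: keep header lines and alignments with MAPQ (column 5) >= Q.
filter_mapq() {
  awk -F'\t' -v q="$1" '/^@/ || $5 + 0 >= q'
}

# filter_unique: keep header lines and aligned reads without an XS:i: tag.
filter_unique() {
  awk -F'\t' '/^@/ || ($3 != "*" && $0 !~ /\tXS:i:/)'
}

# Four toy reads: r1 confident, r2 multi-mapped, r3 unique but MAPQ 23,
# r4 unaligned.
toy_sam() {
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'r1\t0\tchr1\t10\t42\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:0\n'
  printf 'r2\t0\tchr1\t20\t1\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:0\tXS:i:0\n'
  printf 'r3\t0\tchr1\t30\t23\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:-5\n'
  printf 'r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
}

toy_sam | filter_mapq 30 | grep -vc '^@'   # reads kept by the MAPQ rule
toy_sam | filter_unique | grep -vc '^@'    # reads kept by the unique rule
```

The two rules disagree only on reads like r3, which align to a single place but with modest quality; on good data such reads are a small minority, which fits the near-identical peak sets reported above.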

Visualizing peak regions from the aligned BAM files

ulwvfje — Tue, 02 Aug 2016 13:52:53 +0000

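I had analysed several public projects before, and the peaks always looked odd, to the point that I kept wondering whether my analysis was wrong. Persistence paid off: I finally analysed a dataset whose peaks look perfect, which suggests that my pipeline and my peak-plotting code are sound. The data come from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE74311 (H3K27ac_ChIP-Seq_LOUCY, ChIP-seq of a histone modification). The authors' sequencing data were easy to download, and I then ran my pipeline: https://github.com/jmzeng1314/NGS-pipeline/tree/master/CHIPseq

This post is about how to check whether your own peaks are correct. I run samtools depth directly on the aligned BAM files to get the sequencing depth over each peak region and then plot it. The code is in step5-peaks-view-samtools-depth.R

The plotting script is called from the terminal like this: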

Rscript ~/scripts/peakView.R ../unique_peaks.bed ../../SRR2774675.unique.sorted.bam ../../SRR2774676.unique.sorted.bam
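The R script itself is not shown here, but the idea of summarizing `samtools depth` output per peak can be sketched with awk. `peak_mean_depth` and the toy numbers are made up for illustration. The input is assumed to be the three-column output of something like `samtools depth -b peaks.bed ip.bam` (chromosome, 1-based position, depth), where positions with zero coverage may be absent.

```shell
# peak_mean_depth PEAKS.bed DEPTH.txt
# Prints one line per peak: chr:start-end <TAB> mean depth over the peak.
# Positions missing from DEPTH.txt count as zero depth.
peak_mean_depth() {
  awk -F'\t' '
    NR == FNR { n++; chr[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; next }
    {
      # BED start is 0-based, depth positions are 1-based.
      for (i = 1; i <= n; i++)
        if ($1 == chr[i] && $2 + 0 > s[i] && $2 + 0 <= e[i]) sum[i] += $3
    }
    END {
      for (i = 1; i <= n; i++)
        printf "%s:%d-%d\t%.2f\n", chr[i], s[i], e[i], sum[i] / (e[i] - s[i])
    }' "$1" "$2"
}

# Toy example: one 4-bp peak, three covered positions inside it.
tmp=$(mktemp -d)
printf 'chr1\t0\t4\n' > "$tmp/peaks.bed"
printf 'chr1\t1\t10\nchr1\t2\t20\nchr1\t3\t30\nchr1\t5\t99\n' > "$tmp/depth.txt"
peak_mean_depth "$tmp/peaks.bed" "$tmp/depth.txt"
```

Running this once on the IP BAM and once on the INPUT BAM gives two numbers per peak that can be compared directly, which is the same check the plots below make by eye.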

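Here are two peaks picked at random. They clearly follow the bimodal (two-peak) model, and the IP depth is far above INPUT. Very nice data!

I also have to point out that when a ChIP-seq experiment fails, the peaks look strange: depth is mostly below 10, and where depth is high, IP and INPUT do not differ at all. Below is a typical failed case!

Data from a failed experiment like this simply cannot be analysed. Transcription-factor ChIP-seq fails fairly often, so always include a control; otherwise no amount of analysis helps: rubbish in, rubbish out.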

4 ways to download all the data from the Roadmap project

ulwvfje — Thu, 28 Jul 2016 14:52:16 +0000

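The Roadmap website is: http://www.roadmapepigenomics.org/

It covers 129 selected cell types, described here: http://www.broadinstitute.org/~anshul/projects/roadmap/metadata/EID_metadata.tab

For every cell type, at least five core histone modifications were profiled, along with a number of other transcription-factor datasets.

The official site describes the project in detail, so I will not translate it: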

The NIH Roadmap Epigenomics Mapping Consortium was launched with the goal of producing a public resource of human epigenomic data to catalyze basic biology and disease-oriented research. The Consortium leverages experimental pipelines built around next-generation sequencing technologies to map DNA methylation, histone modifications, chromatin accessibility and small RNA transcripts in stem cells and primary ex vivo tissues selected to represent the normal counterparts of tissues and organ systems frequently involved in human disease. The Consortium expects to deliver a collection of normal epigenomes that will provide a framework or reference for comparison and integration within a broad array of future studies. The Consortium also aims to close the gap between data generation and its public dissemination by rapid release of raw sequence data, profiles of epigenomics features and higher-level integrated maps to the scientific community. The Consortium is also committed to the development, standardization and dissemination of protocols, reagents and analytical tools to enable the research community to utilize, integrate and expand upon this body of data.

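The first route is this site:

http://www.encode-roadmap.org/

The matrix makes it easy to see which cell types Roadmap covers and which assays were run, and the data can be downloaded directly.

Next, the route I recommend most is the Broad Institute download site.

It also lists the peak callers they used:

http://www.broadinstitute.org/~anshul/projects/encode/preprocessing/peakcalling/ These mainly include MACS, peakranger, quest, sicer, peakseq, hotspot and others.

Going straight to the peak results processed at Broad: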

Parent Directory		-
broadPeak/	08-Feb-2015 21:00	-
gappedPeak/	08-Feb-2015 21:00	-
lowq/	31-Aug-2014 20:42	-
narrowPeak/	08-Feb-2015 20:59	-

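There are three kinds of peaks here (broadPeak, gappedPeak and narrowPeak); I have not yet worked out what each one means.

Next come the data hosted by IHEC:

http://epigenomesportal.ca/ihec/download.html

This is the first time I have seen this data portal. It can also be browsed as plain folders and files, so download whatever you need:

Besides ENCODE data, Blueprint and Roadmap data can be downloaded there too.

NIH Roadmap	2014-05-29	Click here for policies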

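Finally, the data can be downloaded from Washington University in St. Louis.

Washington University in St. Louis (Wash U, WU) is named after George Washington. It was founded on 22 February 1853 in St. Louis, Missouri, is one of the best known of the American universities that carry the Washington name, and ranked 14th in the 2014 US News & World Report national university rankings.

It has a very detailed page describing the various Roadmap data: http://egg2.wustl.edu/roadmap/web_portal/processed_data.html

If you already know the Roadmap project, it is easy to find the data you want and download them in the browser or with wget.

First, the alignment files: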

Consolidated Epigenomes: 36 bp mappability filtered, pooled and subsampled read alignment files:
http://egg2.wustl.edu/roadmap/data/byFileType/alignments/consolidated/

Unconsolidated Epigenomes (Uniform mappability): 36 bp mappability filtered primary alignment files:
http://egg2.wustl.edu/roadmap/data/byFileType/alignments/unconsolidated/

Peak files of several kinds can also be downloaded:

Narrow contiguous regions of enrichment (peaks) for histone ChIP-seq and DNase-seq
- Data format: NarrowPeak
- http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/

Broad domains of enrichment for histone ChIP-seq and DNase-seq
- Data format: BroadPeak
- http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/

Gapped peaks (subset of domains containing at least one narrow peak)
- Data format: GappedPeak
- http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/gappedPeak/
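Because these are plain directories, a batch download is just a matter of building URLs. The sketch below assumes files are named `<epigenome ID>-<mark>.<format>.gz` (for example `E003-H3K27ac.narrowPeak.gz`). That naming is from my memory of the directory listings, so check one URL in a browser first. `roadmap_peak_url` is a made-up helper and the IDs and marks are only examples.

```shell
# roadmap_peak_url EID MARK [FORMAT]
# Builds a download URL under the consolidated peaks directory listed above.
# FORMAT defaults to narrowPeak. The file naming is an assumption.
roadmap_peak_url() {
  _kind=${3:-narrowPeak}
  printf 'http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/%s/%s-%s.%s.gz\n' \
    "$_kind" "$1" "$2" "$_kind"
}

# Dry run: print wget commands for two epigenomes and two marks.
for eid in E003 E017; do
  for mark in H3K4me3 H3K27ac; do
    echo "wget -c $(roadmap_peak_url "$eid" "$mark")"
  done
done
```

`wget -c` resumes interrupted downloads, which helps when fetching many files in one go.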

6 ways to download all the data from the ENCODE project

ulwvfje — Thu, 28 Jul 2016 14:50:00 +0000

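I will not dwell on the importance of the Encyclopedia of DNA Elements (ENCODE) project. If you are not familiar with it, skip to the end of this post, download the ENCODE tutorials and study them. The project used the following high-throughput sequencing assays to map gene regulatory elements genome-wide in more than 100 different cell lines and tissues. It started with human only; later the same kinds of data were generated and analysed for mouse (Mouse ENCODE) and for model organisms such as fly and worm (modENCODE).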

chromatin structure (5C)

open chromatin (DNase-seq and FAIRE-seq)

histone modifications and DNA-binding of over 100 transcription factors (ChIP-seq)

RNA transcription (RNAseq and CAGE)

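All the data are now public (http://genome.ucsc.edu/ENCODE/ ); ENCODE results from 2007 and later are available from the ENCODE Project Portal, encodeproject.org. The results were published simultaneously as 30 papers in Nature, Science, Cell, JBC, Genome Biol and Genome Research (http://www.nature.com/encode ).

Everything can be downloaded, from the raw sequencing data to the aligned signal files and the final sets of meaningful peaks.

Based on what I have learned so far, here is a quick tour of the ways to download ENCODE data: the ENCODE portal, UCSC, ENSEMBL, the Broad Institute, IHEC, and GEO, six in all!

First, UCSC:

The address is http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/ Files are browsed directly, so you can pick whatever interests you from the folder structure and file names, which is why this route suits me best.

Many people are used to visualizing ChIP-seq results with the UCSC Genome Browser, which has many options for showing online resources next to your own data for comparison. It therefore needs an FTP server to hold those resources, and the ENCODE data are among the best known of them, as shown below:

I am mostly interested in the ENCODE histone data, so I click through to those!

Typically you find

the data for each histone mark in each cell line, and everything from the sequencing reads to the aligned BAM files and the called peaks can be downloaded!

Next, the official ENCODE portal: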

The ENCODE portal also documents the data-processing pipelines: https://www.encodeproject.org/pipelines/

RNA-seq pipelines

RAMPAGE pipeline

Chromatin pipelines(Histone ChIP-seq Pipeline/Transcription Factor ChIP-seq Pipeline)

Methylation pipeline(WGBS Pipeline Overview)

Downloading from the portal works like an online shop: add the datasets you need to a cart, then download them all together.

This document describes what data are available at the ENCODE Portal, ways to get started searching and downloading data, and an overview to how the metadata describing the assays and reagents are organized. ENCODE data can be visualized and accessed from other resources, including the UCSC Genome Browser and ENSEMBL.

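At https://www.encodeproject.org/matrix/?type=Experiment you can see 173 cell lines, 148 tissues and a number of cancer samples, with a dozen or so high-throughput assay types including ChIP-seq and DNase-seq.

Next, the GEO database:

It lists every GSE study related to ENCODE: http://www.ncbi.nlm.nih.gov/geo/info/ENCODE.html

There is not much to say about GEO: go to the study page and download the data. This is a route I like, because GEO describes each experiment in detail.

Then the ENCODE data hosted by the Broad Institute:

The well-known Broad Institute appears to be one of the most complete resource sites in bioinformatics. It hosts not only all the ENCODE data but also the software and tools it used to analyse them.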

http://www.broadinstitute.org/~anshul/projects/encode

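The raw data are at: http://www.broadinstitute.org/~anshul/projects/encode/rawdata/

Next come the data hosted by IHEC:

http://epigenomesportal.ca/ihec/download.html

This is the first time I have seen this data portal. It can also be browsed as plain folders and files, so download whatever you need:

Besides ENCODE data, Blueprint and Roadmap data can be downloaded there too.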


CEEHRC	2014-09-18	Click here for policies
Blueprint	2014-08-11	Click here for policies
ENCODE	2011-01	Click here for policies
NIH Roadmap	2014-05-29	Click here for policies
DEEP	2014-08-15	Click here for policies
CREST JST	2014-09-12	Click here for policies
KNIH	2015-07-15	Click here for policies

Last is the ENSEMBL database:

I could not find a direct download address; see http://asia.ensembl.org/info/website/tutorials/encode.html

The full ENCODE datasets that were used in the Ensembl regulatory build can also be viewed in the Ensembl GRCh37 archive, by attaching a track hub to Region in Detail - the link below will do this automatically:

Link to add ENCODE integrative analysis hub

This creates a menu in the Control Panel on Region in Detail, from which you can add individual tracks or groups of tracks using matrix selectors. Cell type and experimental factor are the two principal axes; other dimensions can be selected by clicking on a box to open an additional submenu (see below).

If you are not familiar with the ENCODE project, start with some tutorials:

ENCODE tutorials provided by NIH: https://www.genome.gov/27553900/encode-tutorials/

https://www.genome.gov/27562350/encode-workshop-april-2015-keystone-symposia/

https://www.genome.gov/27561253/encode-workshop-tutorial-october-2014-ashg/

https://www.genome.gov/27553901/encode-tutorial-may-2013-biology-of-genomes-cshl/

https://www.genome.gov/27563006/encoderoadmap-epigenomics-tutorial-october-2015-ashg/

https://www.genome.gov/27555330/encoderoadmap-epigenomics-tutorial-october-2013-ashg/

https://www.genome.gov/27551933/encoderoadmap-epigenomics-tutorial-nov-2012-ashg/

http://useast.ensembl.org/info/website/tutorials/encode.html

https://www.encodeproject.org/tutorials/

https://www.encodeproject.org/tutorials/encode-meeting-2016/

https://www.encodeproject.org/tutorials/encode-users-meeting-2015/

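The Encyclopedia of DNA Elements (ENCODE) project aims to describe all the functional sequence elements encoded in the human genome. It was launched in September 2003 and drew more than 440 researchers from 32 institutions in five countries (the United States, the United Kingdom, Spain, Japan and Singapore). Over nine years they studied 147 tissue types, ran 1,478 experiments, and generated and analysed more than 15 terabytes of raw data. They identified some 4 million gene switches, showed which DNA segments can turn particular genes on or off, and described how these switches differ between cell types. The consortium reported biochemical activity for a large share of what used to be called "junk DNA", suggesting that much of it takes part in gene regulation; how much of that activity is truly functional is still debated.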

Visualizing a customTrack with the UCSC Genome Browser

ulwvfje — Tue, 26 Jul 2016 14:59:09 +0000

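A customTrack is a user-defined track file. It lets you trace where your reads align on the reference genome, or follow coverage and sequencing depth across regions of the reference. This post is translated from http://genome.ucsc.edu/goldenPath/help/customTrack.html and it is very useful!

The UCSC Genome Browser makes it easy to browse how sequencing data align to the reference genome. Because a set of track file formats is already defined, users can upload their own track files without much effort. If a track is not viewed for more than 48 hours, UCSC deletes it by default, unless the user has saved it in a session. Users can also share their custom tracks.

UCSC provides a set of customTrack examples: click the Custom Tracks link

Custom tracks are private by default. If you are interested, follow the steps below (four core steps plus two optional ones):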

Step 1. Format the data set

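The browser supports many track file formats, notably standard GFF, and also: bedGraph, GTF, PSL, BED, bigBed, WIG, bigGenePred, bigMaf, bigChain, bigPsl, bigWig, BAM, CRAM, VCF, MAF, BED detail, Personal Genome SNP, broadPeak, narrowPeak, and microarray (BED15).

Chromosome names must use the chrN form and are case-sensitive! A single file may also hold several annotation tracks, of the same or different types.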

Step 2. Define the Genome Browser display characteristics

Set the browser options: whether the Genome Browser shows UCSC's other data types (with the hide/dense/pack/squish/full display modes), and whether public data such as ENCODE are shown. Add one or more optional browser lines to the beginning of your formatted data file to configure the overall display of the Genome Browser when it initially shows your annotation data.

The full set of options is large, but in practice you only set a handful of attributes.

Step 3. Define the annotation track display characteristics

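Set how your own data are displayed, including the colour, the track name and its description. Following the browser lines--and immediately preceding the formatted data--add a track line to define the display attributes for your annotation data set.

The colour, shape and labels of the tracks in the figure below can all be configured; for the exact rules you will have to read the documentation carefully.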
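Steps 1 to 3 together produce one small text file: optional browser lines, then a track line, then the data. Here is a minimal sketch that writes such a file for two BED intervals. `make_track`, the track name and description, and the intervals are made up for illustration; `browser position` and the `name`, `description`, `visibility` and `color` attributes are standard UCSC settings.

```shell
# make_track writes a tiny BED custom track to the given path.
make_track() {
  {
    # Step 2: browser line, sets the region shown when the track loads.
    echo 'browser position chr7:127471196-127495720'
    # Step 3: track line, sets name, description, display mode and colour.
    echo 'track name="myPeaks" description="Example peaks" visibility=2 color=0,0,255'
    # Step 1: the data, tab-delimited BED with chrN-style names.
    printf 'chr7\t127471196\t127472363\tpeak1\n'
    printf 'chr7\t127472363\t127473530\tpeak2\n'
  } > "$1"
}

track=$(mktemp)
make_track "$track"
cat "$track"
```

The resulting file can be pasted into the text box or uploaded on the Add Custom Tracks page described in Step 4.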

Step 4. Display your annotation track in the Genome Browser

The key part is uploading your own file. The steps are:

open the Genome Browser home page ,click the Genome Browser link in the top menu bar.

On the Gateway page that displays, select the genome and assembly on which your annotation data is based, then click the "add custom tracks" button.

Find the link shown in the picture below and click it.

On the Add Custom Tracks page, load the annotation track data or URL for your custom track into the upper text box and the track documentation (optional) into the lower text box, then click the Submit button. Tracks may be loaded by entering text, a URL, or a pathname on your local computer.

Custom track files can be submitted in several formats

see Loading a Custom Track into the Genome Browser.

After submitting, go back to the Genome Browser page yourself to see the track; the tool does not redirect there automatically.

Step 5. (Optional) Add details pages for individual track features

Step 6. (Optional) Share your annotation track with others

These steps are optional; explore them yourself: read the section Sharing Your Annotation Track with Others.

As a test I added a wig file that UCSC itself provides: http://genome.ucsc.edu/goldenPath/help/examples/wiggleExample.txt It displays as follows: