生信菜鸟团 » Peak

Downstream functional analysis of ChIP-seq peaks with the web tool GREAT

ulwvfje — Thu, 07 Jul 2016 12:57:16 +0000

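After a ChIP-seq run, provided the experimental design is sound and the sequencing quality is acceptable, it is easy to call peaks that meet your criteria from the reads. Alternatively, many meaningful peak files can be downloaded from published papers or from Roadmap, usually in BED format. At that point the peaks need to be annotated and visualized, and the genes associated with the peaks can feed into all kinds of downstream analyses, including enrichment against pathway databases, MSigDB annotation, Gene Ontology annotation and so on. For this I strongly recommend GREAT, a web tool developed by researchers at Stanford University.

GREAT was created mainly to address the lack of annotation for non-coding regions of the genome, and the peaks from a ChIP-seq experiment usually fall in exactly those non-coding regions.

Start at the tool's home page: http://bejerano.stanford.edu/great/public/html/

The tool accepts one upload at a time, namely the file of called peaks; BED format is supported.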
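Since the upload is a BED file, a peak file that carries comment lines or extra columns can be cut down to the three essential columns first. The following is a minimal sketch, assuming whitespace-separated rows whose first three fields are chromosome, start and end; the function name `to_bed3` and the sample rows are illustrative and are not part of GREAT.

```python
def to_bed3(lines):
    """Keep only chromosome, start and end from peak rows; skip comments and blanks."""
    out = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, '#' comments and UCSC track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start >= end:
            raise ValueError(f"start must be smaller than end: {line}")
        out.append(f"{chrom}\t{start}\t{end}")
    return out


rows = [
    "#PeakID chr start end strand",
    "chr20 52221388 52856380 chr20-8088 41141 +",
    "chr1 121482722 121485861 chr1-995 9070 +",
]
for bed_line in to_bed3(rows):
    print(bed_line)
```

For a real file, pass an open file handle instead of the list and write the returned lines to a new `.bed` file.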

Results usually come back quickly.

First come three plots, all very standard; have a quick look:

Number of associated genes per region

Binned by orientation and distance to TSS

Binned by absolute distance to TSS

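Then come the pathway and GO annotations.

The site offers a very large and fairly comprehensive set of pathway resources, including KEGG, BioCarta, Reactome and MSigDB, plus some signatures and gene families, so most downstream analysis is done in one stop.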

GO Molecular Function (no terms)

GO Biological Process (no terms)

GO Cellular Component (no terms)

The test set of 5,225 genomic regions picked 2,992 (17%) of all 18,041 genes.
GO Molecular Function has 3,688 terms covering 15,090 (84%) of all 18,041 genes, and 189,388 term - gene associations.

3,688 ontology terms (100%) were tested using an annotation count range of [1, Inf].

The test set of 5,225 genomic regions picked 2,992 (17%) of all 18,041 genes.
GO Biological Process has 10,440 terms covering 15,441 (86%) of all 18,041 genes, and 950,065 term - gene associations.

10,440 ontology terms (100%) were tested using an annotation count range of [1, Inf].

Mouse Phenotype (no terms)

Human Phenotype (no terms)

Disease Ontology (no terms)

MSigDB Cancer Neighborhood (no terms)

Placenta Disorders (no terms)

PANTHER Pathway (no terms)

BioCyc Pathway (no terms)

MSigDB Pathway (no terms)

MGI Expression: Detected (no terms)

MSigDB Perturbation (no terms)

MSigDB Predicted Promoter Motifs (no terms)

MSigDB miRNA Motifs (no terms)

InterPro (no terms)

HGNC Gene Families (no terms)

MSigDB Oncogenic Signatures (no terms)

MSigDB Immunologic Signatures (no terms)

The test set of 5,225 genomic regions picked 2,992 (17%) of all 18,041 genes.
MSigDB Immunologic Signatures has 1,910 terms covering 16,609 (92%) of all 18,041 genes, and 363,333 term - gene associations.

1,910 ontology terms (100%) were tested using an annotation count range of [1, Inf].

Visualizing ChIP-seq peaks with the web tool ChIPseek

ulwvfje — Thu, 07 Jul 2016 12:56:10 +0000

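After a ChIP-seq run, provided the experimental design is sound and the sequencing quality is acceptable, it is easy to call peaks that meet your criteria from the reads. Alternatively, many meaningful peak files can be downloaded from published papers or from Roadmap, usually in BED format. At that point the peaks need to be annotated and visualized. For this I strongly recommend ChIPseek, a web tool developed by researchers in Taiwan:

The tool's home page shows eight figures that summarize what the software does: http://chipseek.cgu.edu.tw/index_show.py

Under the hood the tool calls HOMER and BEDTools in the background, so that biologists who do not program can understand their ChIP-seq results more quickly and easily. Its features include: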

annotate the peaks
link to UCSC genome browser
provide pie charts, histograms and bar charts for peak location distribution
apply filter criteria by peak length to get a subset of peaks
apply filter criteria by distance to nearest TSS to get a subset of peaks
apply filter criteria by location of the peaks
apply filter criteria by list(s) of genes
apply filter criteria by GO terms
apply filter criteria by KEGG pathway annotations
compare two datasets
compare dataset with ENCODE transcription factor dataset
identify enriched motif
plot peaks on chromosome ideograms
allow users to download figures or tables

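Most of these features can also be implemented with your own scripts, so I won't go into detail.

Usage is very simple:

First open the analysis page: http://chipseek.cgu.edu.tw/analysis_form.php

Then upload the peak files you want to analyze,

for example those in GSE50177_RAW.tar from GSE50177: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50177

I tested it with four peak files:

After the job is submitted the files are uploaded and the page gives you a job ID. If you read this within a month of posting, you can view the results with my job ID without uploading anything yourself; of course, in the end you will want to analyze your own peaks.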

ChIPseek is annotating your file(s).

This page will automatically refresh every 60 seconds.

Alternatively, You may use the job ID: 1467890358.407 to visit ChIPseek latter.

Results appear after a short while. Because the web server has limited capacity, they remain available for only one month.

http://chipseek.cgu.edu.tw/main_menu.py?job_id=1467890358.407

GSM1278641_Xu_MUT_rep1_BAF155_MUT (a total of 6733 peaks) (Download all annotation results)

GSM1278643_Xu_MUT_rep2_BAF155_MUT (a total of 3625 peaks) (Download all annotation results)

GSM1278645_Xu_WT_rep1_BAF155 (a total of 10987 peaks) (Download all annotation results)

GSM1278647_Xu_WT_rep2_BAF155 (a total of 5225 peaks) (Download all annotation results)

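Every peak in every file is annotated, and the results can be downloaded via links as tab-delimited plain-text files, which may be more comfortable to read in Excel.

There are also four plots that are likely to be of interest: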

Peak location (pie chart)

Peak location (bar chart)

Distance to TSS

Peak length distribution

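It can also convert the uploaded BED peak regions into FASTA sequences (Peak sequences).

In essence this just extracts sequences from the reference genome by coordinate. I downloaded all of the sequences; they can be used directly for motif discovery.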

$ ls -lh *fasta

-rw-r--r-- 1 Jimmy 197121 18M Jul 7 19:40 GSM1278641_Xu_MUT_rep1_BAF155_MUT_sequence.fasta

-rw-r--r-- 1 Jimmy 197121 9.9M Jul 7 19:38 GSM1278643_Xu_MUT_rep2_BAF155_MUT_sequence.fasta

-rw-r--r-- 1 Jimmy 197121 26M Jul 7 19:41 GSM1278645_Xu_WT_rep1_BAF155_sequence.fasta

-rw-r--r-- 1 Jimmy 197121 14M Jul 7 19:41 GSM1278647_Xu_WT_rep2_BAF155_sequence.fasta
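The coordinate-based extraction described above can be illustrated in a few lines. This is a minimal sketch, assuming BED conventions (0-based start, end not included) and a tiny in-memory genome; `extract_sequences`, `to_fasta` and the toy data are hypothetical and say nothing about how ChIPseek implements the step.

```python
def extract_sequences(genome, peaks):
    """Return {peak_name: sequence} for BED-style intervals (0-based start, end exclusive)."""
    sequences = {}
    for chrom, start, end in peaks:
        chrom_seq = genome[chrom]
        if not 0 <= start < end <= len(chrom_seq):
            raise ValueError(f"interval out of range: {chrom}:{start}-{end}")
        sequences[f"{chrom}:{start}-{end}"] = chrom_seq[start:end]
    return sequences


def to_fasta(sequences):
    """Format the extracted sequences as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())


# Toy genome standing in for a real reference such as hg19.
toy_genome = {"chr1": "ACGTACGTAC", "chr2": "TTTTGGGGCC"}
toy_peaks = [("chr1", 2, 6), ("chr2", 4, 8)]
fasta_text = to_fasta(extract_sequences(toy_genome, toy_peaks))
print(fasta_text)
```

On real data the whole genome would not be held in a dictionary like this; a dedicated tool such as `bedtools getfasta` is the usual choice for the same job.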

Self-taught ChIP-seq analysis, part 7: peak annotation

ulwvfje — Wed, 06 Jul 2016 00:17:17 +0000

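After the routine processing of ChIP-seq data described earlier, we have turned the raw sequencer output into BED-format peak files. The paper I chose performed four ChIP-seq experiments: BAF155 immunoprecipitates from two replicates of wild-type MCF7 cells and from two replicates of mutant MCF7 cells. Comparing the BAF155 immunoprecipitates of wild-type and mutant cells shows how the BAF155 mutation affects its genome-wide binding.

# Here I downloaded the peak results directly from GEO; their line counts are as follows: wc -l *bed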
6768 GSM1278641_Xu_MUT_rep1_BAF155_MUT.peaks.bed
3660 GSM1278643_Xu_MUT_rep2_BAF155_MUT.peaks.bed
11022 GSM1278645_Xu_WT_rep1_BAF155.peaks.bed
5260 GSM1278647_Xu_WT_rep2_BAF155.peaks.bed
49458 GSM601398_Ini1HeLa-peaks.bed
24477 GSM601398_Ini1HeLa-peaks-stringent.bed
12725 GSM601399_Brg1HeLa-peaks.bed
12316 GSM601399_Brg1HeLa-peaks-stringent.bed
46412 GSM601400_BAF155HeLa-peaks.bed
37920 GSM601400_BAF155HeLa-peaks-stringent.bed
30136 GSM601401_BAF170HeLa-peaks.bed
25432 GSM601401_BAF170HeLa-peaks-stringent.bed
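One caveat about these numbers: `wc -l` counts every line, not just peak rows. For the four GSE50177 files each total is exactly 35 larger than the peak count ChIPseek reports for the same sample (for example 6768 lines versus 6733 peaks), which suggests, but does not prove, that each file carries 35 non-peak lines such as header comments. A minimal sketch of a comment-aware count follows; `count_peaks` and the sample lines are illustrative.

```python
def count_peaks(lines):
    """Count data rows, ignoring blank lines and '#' comment lines."""
    return sum(
        1 for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    )


example_lines = [
    "# header line (placeholder)\n",
    "#PeakID\tchr\tstart\tend\n",
    "chr20\t52221388\t52856380\tchr20-8088\t41141\t+\n",
    "chr20\t45796362\t46384917\tchr20-5152\t31612\t+\n",
    "\n",
]
n_peaks = count_peaks(example_lines)
print(n_peaks)
```

For a file on disk, `with open(path) as handle: count_peaks(handle)` gives the same kind of count.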

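For each peak record in a BED file, only three columns really matter: the chromosome and the start and end coordinates on that chromosome. In the rows below the columns are chromosome, start, end, peak ID, a numeric score and strand; note that the commented header line lists its fields in a different order from the rows: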

#PeakID chr start end strand Normalized Tag Count region size findPeaks Score Clonal Fold Change
chr20 52221388 52856380 chr20-8088 41141 +
chr20 45796362 46384917 chr20-5152 31612 +
chr17 59287502 59741943 chr17-2332 29994 +
chr17 59755459 59989069 chr17-667 19943 +
chr20 52993293 53369574 chr20-7059 12642 +
chr1 121482722 121485861 chr1-995 9070 +
chr20 55675229 55855175 chr20-6524 7592 +
chr3 64531319 64762040 chr3-4022 7213 +
chr20 49286444 49384563 chr20-4482 6165 +

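Peak annotation means finding out which part of the genome each peak lies in, and how the peaks are distributed across genomic regions (upstream and downstream of genes, 5' and 3' UTRs, promoters, introns, exons, intergenic regions, microRNA regions). There are usually close to ten thousand peaks, so annotation has to be done in batch. If you are comfortable scripting, you can download the GFF annotation for the reference genome and write your own annotator. Here I introduce ChIPpeakAnno, an R Bioconductor package for annotating ChIP-seq peaks. The package's bundled example is below: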

library(ChIPpeakAnno)
bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE)
## one can also try import from rtracklayer
library(rtracklayer)
gr1.import <- import(bed, format="BED")
identical(start(gr1), start(gr1.import))
gr1[1:2]
gr1.import[1:2] #note the name slot is different from gr1
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
ol <- findOverlapsOfPeaks(gr1, gr2)
makeVennDiagram(ol)

## binOverFeature can also plot a distribution relative to a given GRanges object (usually TSSs)
## Distribution of aggregated peak scores or peak numbers around transcript start sites.

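As you can see, the package is very simple to use: read your own peak files (GSM1278641_Xu_MUT_rep1_BAF155_MUT.peaks.bed and so on) with toGRanges or import to get a GRanges object. The code above compares the overlap between two peak files. You can then annotate genomic features using datasets that ship with many R packages: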

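data(TSS.human.GRCh37) ## annotation relies mainly on this GRanges object; getAnnotation can fetch other GRanges objects for annotation
## featureType: TSS, miRNA, Exon, 5'UTR, 3'UTR, transcript or Exon plus UTR
peaks=MUT_rep1_peaks ## MUT_rep1_peaks: a GRanges object of peaks read in beforehand with toGRanges or import
macs.anno <- annotatePeakInBatch(peaks, AnnotationData=TSS.human.GRCh37,
output="overlapping", maxgap=5000L)

## the resulting macs.anno object is fully annotated: whether each peak lies on a gene, or how far it is from one, is spelled out clearly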
if(require(TxDb.Hsapiens.UCSC.hg19.knownGene)){
aCR<-assignChromosomeRegion(peaks, nucleotideLevel=FALSE,
precedence=c("Promoters", "immediateDownstream",
"fiveUTRs", "threeUTRs",
"Exons", "Introns"),
TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
barplot(aCR$percentage)
}

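The resulting bar chart is shown below. It is not pretty, but it captures the essence of peak annotation: working out the positional features of each peak in the genome.

The same analysis can be applied to every peak file.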
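To make concrete what such an annotation computes, here is a minimal sketch that finds the distance from a peak midpoint to the nearest TSS and places it in a distance bin. The TSS positions, the bin edges and the function names are assumptions for illustration; strand is ignored, and this is not how ChIPpeakAnno or the web tools are implemented.

```python
import bisect

# Hypothetical bin edges in base pairs and their labels.
BIN_EDGES = (5_000, 50_000, 500_000)
BIN_LABELS = ("<5 kb", "5-50 kb", "50-500 kb", ">=500 kb")


def nearest_tss_distance(peak_start, peak_end, sorted_tss):
    """Absolute distance from the peak midpoint to the closest TSS.

    sorted_tss holds TSS positions on the peak's chromosome, sorted ascending.
    """
    if not sorted_tss:
        raise ValueError("no TSS positions given for this chromosome")
    midpoint = (peak_start + peak_end) // 2
    i = bisect.bisect_left(sorted_tss, midpoint)
    # Only the TSS just before and just after the midpoint can be the closest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(midpoint - tss) for tss in candidates)


def distance_bin(distance):
    """Map a distance in bp to one of the labelled bins."""
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, distance)]


tss_chr1 = [1_000, 20_000, 900_000]           # made-up TSS positions
peaks_chr1 = [(18_000, 19_000), (400_000, 402_000)]
for start, end in peaks_chr1:
    d = nearest_tss_distance(start, end, tss_chr1)
    print(start, end, d, distance_bin(d))
```

Counting how many peaks fall in each bin gives the kind of distance-to-TSS histogram that the annotation tools draw.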

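For multiple peak files, however, such as comparing the BAF155 immunoprecipitates of wild-type and mutant MCF7 cells in this paper, you need a differential analysis between the peak sets, followed by annotation of the differential genes.

That said, for peak annotation I actually prefer web tools: peak files are very small, so you can upload them straight to an existing web tool and immediately get a whole set of plots and tables. Give these a try: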

http://chipseek.cgu.edu.tw/

http://bejerano.stanford.edu/great/public/html/

http://liulab.dfci.harvard.edu/CEAS/

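Although most of this post describes how to use ChIPpeakAnno, the real point is to understand what peaks record and what needs annotating, and to fully understand the plots and tables produced by these three web tools. The web tools are the main point!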

Calling peaks from ChIP-seq data with the R package BayesPeak

ulwvfje — Tue, 05 Jul 2016 15:25:46 +0000

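BayesPeak is another member of the peak-caller family and has a fair number of users, so I tried it as well. As an R Bioconductor package it installs directly from within R, but a few points need attention. The genome I aligned to contains not only chr1-22, X, Y and M but also some contigs and scaffolds, and these have to be removed from the BAM files. BayesPeak works well with BED input, which can be converted directly into a GRanges object. Although it claims multi-core support, it is still very slow.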

### step6.7 peak calling by BayesPeak(R bioconductor package)
# Bayesian Analysis of ChIP-seq data
## BayesPeak fits a Markov model to the data (the aligned reads) via Markov Chain Monte Carlo (MCMC) techniques.

# One blog post put it this way: I've used BayesPeak running in R. It is much easier to install than MACS (failed for me), which require some (strange to me) files.
# First convert the BAM alignments produced by bowtie2 to BED format: http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html
## In particular, the chromosome, start position, end position and DNA strand appear in the 1st, 2nd, 3rd and 6th columns respectively.
#### software : http://bioconductor.org/packages/release/bioc/html/BayesPeak.html
#### readme: http://bioconductor.org/packages/release/bioc/vignettes/BayesPeak/inst/doc/BayesPeak.pdf
Learning an R Bioconductor package is easy; start with the examples:

library(BayesPeak) ## examples usually read the test files bundled with the package
tFile=file.path(system.file(package='BayesPeak'),'extdata','H3K4me3reduced.bed')
cFile=file.path(system.file(package='BayesPeak'),'extdata','Inputreduced.bed')

raw.output <- bayespeak(tFile, cFile, chr = "chr16", start = 9.2E7, end = 9.5E7, job.size = 6E6)
output <- summarize.peaks(raw.output, method = "lowerbound")
## the function summarize.peaks will do : Filtering of unenriched jobs/Filtering of unenriched bins/Assembly of enriched bins/Conversion of bins to peaks

write.table(as.data.frame(output), file = "H3K4me3output.txt", quote = FALSE)
## write.csv(as.data.frame(output), file = "H3K4me3output.csv", quote = FALSE)
## multiple cores can be used to speed things up:
library(parallel)
## the choice of PP threshold also needs checking. A "potentially enriched" bin is defined as any bin with PP > 0.01.
## The output of the algorithm is the Posterior Probability (often abbreviated to PP) of each bin being enriched.
## The PP value is useful not only for calling the peaks, but could also be used in downstream analyses - for
## example, to weight observations when searching for a novel transcription factor motif. The PP value is not
## to be confused with the p value from hypothesis testing

min.job <- min(raw.output$peaks$job)
max.job <- max(raw.output$peaks$job)
par(mfrow = c(2,2), ask = TRUE)
for(i in min.job:max.job) {plot.PP(raw.output, job = i, ylim = c(0,50))}
## When the coverage is sparse and therefore less information is available, the PP values tend to be more
## uniformly spread over the interval [0,1], as above. This means that the distinction between peaks and
## background is harder to make, which is usually a result of poor enrichment,

raw.output <- bayespeak(tFile, cFile,use.multicore = TRUE, mc.cores = 4)
i <- 324
plot.PP(raw.output, job = i, ylim = c(0,50))

With the examples done, you can start on your own data:

############ first change bam files to bed files :
ls *sorted.bam |while read id ;do ~/biosoft/bedtools/bedtools2/bin/bedtools bamtobed -i $id > ${id%%.*}.bed ;done
# But special chromosomes (chr6_cox_hap2, chrUn_gl000214) must be filtered out, keeping only chr1-22, X, Y and M
ls *bed |while read id ;do grep -v "_" $id >${id%%.*}.clean_bed;done
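One thing to watch with `grep -v "_"`: it drops any line containing an underscore anywhere, including in the read-name column of the bamtobed output, not only in the chromosome name. A stricter alternative is to test the first column only. The sketch below is a minimal Python version under that assumption; the pattern, the function name `keep_canonical` and the sample rows are illustrative.

```python
import re

# chr1-chr22, chrX, chrY and chrM only.
CANONICAL = re.compile(r"chr(?:[1-9]|1[0-9]|2[0-2]|X|Y|M)")


def keep_canonical(lines):
    """Keep BED rows whose first column is a canonical chromosome name."""
    kept = []
    for line in lines:
        if not line.strip():
            continue
        chrom = line.split(None, 1)[0]
        if CANONICAL.fullmatch(chrom):
            kept.append(line)
    return kept


bed_rows = [
    "chr1\t10\t60\tread_1\t42\t+",            # underscore in read name, still kept
    "chr6_cox_hap2\t5\t55\tread2\t42\t-",     # alternate haplotype, dropped
    "chrUn_gl000214\t7\t57\tread3\t42\t+",    # unplaced contig, dropped
    "chrM\t1\t51\tread_4\t42\t-",
]
clean_rows = keep_canonical(bed_rows)
print(len(clean_rows))
```

Whether this matters for a given dataset depends on how the reads are named; for read names without underscores the two approaches keep the same rows.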

Below is the complete code I used on my own data; it is very simple:

############ Then do peak calling in R by BayesPeak
library(BayesPeak)
library(parallel)
workdir=getwd()
tFile=file.path(workdir,'SRR1042593.clean_bed')
cFile=file.path(workdir,'SRR1042594.clean_bed')
raw.output <- bayespeak(tFile, cFile,use.multicore = TRUE, mc.cores = 8)
output <- summarize.peaks(raw.output, method = "lowerbound")
write.table(as.data.frame(output), file = "Xu_MUT_rep1.txt", quote = FALSE)

Calling peaks from ChIP-seq data with PeakRanger

ulwvfje — Tue, 05 Jul 2016 15:19:19 +0000

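This post is about how to use this software, but it differs from the software manuals I have written before. MACS2 did not give me the peak-calling results I expected, so I tried several other tools. PeakRanger deserves special mention: installation is very simple, it processes data very fast, and the results are easy to understand. More importantly, it produces an HTML report containing plots of every peak that passes the criteria.

A Linux binary release is available, so you can simply download, unzip and run it. The commands are: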

## Download and install PeakRanger
cd ~/biosoft
mkdir PeakRanger && cd PeakRanger
wget https://sourceforge.net/projects/ranger/files/PeakRanger-1.18-Linux-x86_64.zip/
## Length: 1517587 (1.4M) [application/octet-stream]
unzip PeakRanger-1.18-Linux-x86_64.zip
~/biosoft/PeakRanger/bin/peakranger -h

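The notes below come from my self-taught ChIP-seq analysis series, so they are a rough mix of my own comments and quoted English text; please bear with them. They contain many links that you can follow to learn more.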

### step6.8 peak calling by PeakRanger
# PeakRanger is a multi-purpose software suite for analyzing next-generation sequencing (NGS) data. The suite contains the following tools:
# Used by modENCODE, iPlant and many others
# Not just for calling narrow and broad peaks
# Runs fast, together with sleek program options
# To measure the significance of the enriched regions, PeakRanger uses binomial distribution to model the relative enrichment of sample over control.
# A p value is generated as a result. Users can thus select highly significant peaks by using a smaller -p.
# In addition, users can filter peaks by the '-q' option, which controls the FDR of peaks.
# For each p-value, the Benjamini-Hochberg procedure is applied to calculate the FDR.
# http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz ## gunzip refGene.txt.gz ; mv refGene.txt hg19refGene.txt
#### software : http://ranger.sourceforge.net/ go to the root path of the unzipped package and type:make
#### readme: http://ranger.sourceforge.net/manual1.18.html
# http://www.broadinstitute.org/~anshul/projects/encode/preprocessing/peakcalling/peakranger/bin/MANUAL
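The notes above mention that the Benjamini-Hochberg procedure turns p-values into FDR values. The sketch below shows the textbook version of that adjustment; it is an illustration only, and whether PeakRanger's internal implementation matches it in every detail is not something these notes confirm. The function name and the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase
    # as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


example_p = [0.01, 0.04, 0.03, 0.20]
example_fdr = benjamini_hochberg(example_p)
print(example_fdr)
```

A cutoff such as `-q 0.05` then corresponds to keeping the regions whose adjusted value is at most 0.05.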

### ~/biosoft/PeakRanger/bin/peakranger -h ## my installation is complete
nr estimate data quality
lc calculate library complexity
wig generate wiggle files
wigpe generate wiggle files for paired reads
ranger peak calling for sharp peaks
ccat peak calling for broad peaks
bcp peak calling for complex broad peaks

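## These are the software's subcommands. It accepts alignment files in several formats directly. I used BED here, having converted SAM to BAM and then to BED; there is no need for you to do that, just use BAM directly.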

~/biosoft/PeakRanger/bin/peakranger nr --format bed SRR1042593.clean_bed SRR1042594.clean_bed
~/biosoft/PeakRanger/bin/peakranger ccat --format bed SRR1042593.clean_bed SRR1042594.clean_bed \
Xu_MUT_rep1_ccat_report --report --gene_annot_file hg19refGene.txt -q 0.05 -t 4

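Results come back quickly. Many peaks are found, but they need filtering.
844K Jun 30 09:32 Xu_MUT_rep1_ccat_report_details
637K Jun 30 09:32 Xu_MUT_rep1_ccat_report_region.bed
798K Jun 30 09:32 Xu_MUT_rep1_ccat_report_summit.bed
The file to focus on is the details file. Its format is as follows and is easy to understand: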
#region_chr region_start region_end nearby_genes(6kbp) region_ID region_summits region_fdr region_strand region_treads region_creads
chr1 121482750 121486000 ccat_fdrPassed_0_fdr_0.001 121485025 0.001 + 551 642
chr1 115296600 115302500 CSDE1 ccat_fdrFailed_0_fdr_0.646 115301075 0.646 + 58 217
chr1 114351100 114356850 PTPN22,RSBN1 ccat_fdrFailed_3_fdr_0.646 114355425 0.646 + 48 112

It is easy to use, but for the specific parameters you will need to read the manual.
Guide: Peak Calling for ChIP-Seq: http://epigenie.com/guide-peak-calling-for-chip-seq/

Self-taught ChIP-seq analysis, part 2: collecting learning materials

ulwvfje — Tue, 05 Jul 2016 00:20:00 +0000

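All I can say is that ChIP-seq is by now a very mature NGS workflow, with learning materials appearing all the time. Start with the complete-workflow slide decks below to get a general picture of the ChIP-seq pipeline. The key data-processing points from the paper I mentioned earlier look much like the figure below: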

QuEST is a statistical software for analysis of ChIP-Seq data with data and analysis results visualization through UCSC Genome Browser. http://www-hsc.usc.edu/~valouev/QuEST/QuEST.html

Choosing peak-calling thresholds: http://www.nature.com/nprot/journal/v7/n1/fig_tab/nprot.2011.420_F2.html

MeDIP-seq and histone modification ChIP-seq analysis http://crazyhottommy.blogspot.com/2014/01/medip-seq-and-histone-modification-chip.html

2011 review on high-quality ChIP-seq data: http://www.nature.com/ni/journal/v12/n10/full/ni.2117.html?message-global=remove

Differential peak analysis of ChIP-seq under different conditions: http://www.slideshare.net/thefacultyl/diffreps-automated-chipseq-differential-analysis-package

A worked ChIP-seq data analysis example: http://www.biologie.ens.fr/~mthomas/other/chip-seq-training/

http://biow.sb-roscoff.fr/ecole_bioinfo/training_material/chip-seq/documents/presentation_chipseq.pdf

http://ecole-bioinfo-aviesan.sb-roscoff.fr/sites/ecole-bioinfo-aviesan.sb-roscoff.fr/files/files/chipseq_CarlHerrmann_Roscoff2015.pdf

http://ecole-bioinfo-aviesan.sb-roscoff.fr/sites/ecole-bioinfo-aviesan.sb-roscoff.fr/files/files/defrance-ChIP-seq_annotation.pdf

The materials below cover individual steps of the ChIP-seq workflow, and some cover background in epigenetics.

## ppt : http://159.149.160.51/epigen_milano/epigen_barozzi.pdf

## best practise: http://bioinformatics-core-shared-training.github.io/cruk-bioinf-sschool/

## pipeline : https://github.com/shenlab-sinai/chip-seq_preprocess

## https://sites.google.com/site/anshul...e/projects/idr ## samtools view -b -F 1548 -q 30 chipSampleRep1.bam

## pipeline : http://daudin.icmb.utexas.edu/wiki/index.php/ChIPseq_prep_and_map

## pipeline : https://github.com/BradyLab/ChipSeq/blob/master/chipseq.sh

## https://github.com/crukci-bioinformatics/chipseq-pipeline

## https://github.com/ENCODE-DCC/chip-seq-pipeline

## Hands-on introduction to ChIP-seq analysis - VIB Training http://www.biologie.ens.fr/~mthomas/other/chip-seq-training/

## video(A Step-by-Step Guide to ChIP-Seq Data Analysis Webinar) : http://www.abcam.com/webinars/a-step-by-step-guide-to-chip-seq-data-analysis-webinar

## Using ChIP-Seq to identify and/or quantify bound regions (peaks) http://barcwiki.wi.mit.edu/wiki/SOPs/chip_seq_peaks

## http://jura.wi.mit.edu/bio/education/hot_topics/ChIPseq/ChIPSeq_HotTopics.pdf

## http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/practicals/chip-seq/mapping_tutorial.html

## Open course: https://www.coursera.org/learn/galaxy-project/lecture/FUzcg/chip-sequence-analysis-with-macs

## EBI tutorial: https://www.ebi.ac.uk/training/online/course/ebi-next-generation-sequencing-practical-course/chip-seq-analysis/chip-seq-practical

## Japanese tutorial: http://genomejack.net/download/GenomeJackBrowserAppendix/browser_appendix_j/tutorials/chipSeq.html

## Tutorial from Taiwan: http://lsl.sinica.edu.tw/Services/Class/files/20151118475_2.pdf (徐唯哲 Paul Wei-Che HSU, Assistant Research Specialist, Institute of Molecular Biology, Academia Sinica)

## Comprehensive list of peak finders: http://wodaklab.org/nextgen/data/peakfinders.html

## https://www.encodeproject.org/documents/049704a4-5c58-4631-acf1-4ef152bdb3ef/@@download/attachment/Learning_Chromatin_States_from_ChIP-seq_data.pdf

## https://bioshare.bioinformatics.ucdavis.edu/bioshare/download/47aq5pp5mzza5vb/PDFs/Tuesday_MB_ChIP-Seq_Intro.pdf

## paper: Large-Scale Quality Analysis of Published ChIP-seq Data http://www.g3journal.org/content/4/2/209.full

## paper: Chip-seq data analysis: from quality check to motif discovery and more http://ccg.vital-it.ch/var/sib_april15/cases/landt12/strand_correlation.html

## Workshop hands on session(RNA-Seq / ChIP-Seq ) : https://hpc.oit.uci.edu/biolinux/handson.docx

## http://www.gqinnovationcenter.com/documents/bioinformatics/ChIPseq.pptx

## paper supplement : http://genome.cshlp.org/content/suppl/2015/10/02/gr.192005.115.DC1/Supplemental_Information.docx

http://www.illumina.com/documents/products/datasheets/datasheet_chip_sequence.pdf

http://www.ncbi.nlm.nih.gov/pubmed/22130887 "Analyzing ChIP-seq data: preprocessing, normalization, differential identification, and binding pattern characterization."

http://www.ncbi.nlm.nih.gov/pubmed/22499706 "Normalization, bias correction, and peak calling for ChIP-seq." (stat heavy)

http://www.ncbi.nlm.nih.gov/pubmed/24244136 "Practical guidelines for the comprehensive analysis of ChIP-seq data."

http://www.ncbi.nlm.nih.gov/pubmed/25223782 "Identifying and mitigating bias in next-generation sequencing methods for chromatin biology."

A quick search also turned up this recent paper (which I haven't read) that might be of interest to you

http://www.ncbi.nlm.nih.gov/pubmed/24598259 "Impact of sequencing depth in ChIP-seq experiments."

## figures: https://github.com/shenlab-sinai/ngsplot

https://github.com/daler/metaseq

http://liulab.dfci.harvard.edu/CEAS/usermanual.html

There are also two web tools for visualization.

Bioconductor tools and tutorials:

http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rchipseq/Rchipseq.pdf

http://bioinformatics-core-shared-training.github.io/cruk-bioinf-sschool/Day4/chipqc_sweave.pdf

http://bioconductor.org/packages/release/bioc/html/chipseq.html

http://bioconductor.org/help/workflows/chipseqDB/

http://bioconductor.org/help/workflows/generegulation/

http://bioconductor.org/help/course-materials/2009/EMBLJune09/Practicals/chipseq/BasicChipSeq.pdf

## Vendor tutorial: http://www.partek.com/Tutorials/microarray/Tiling/ChipSeqTutorial.pdf

Self-taught ChIP-seq analysis, part 1: choosing and reading a paper

ulwvfje — Tue, 05 Jul 2016 00:14:58 +0000

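Paper: CARM1 Methylates Chromatin Remodeling Factor BAF155 to Enhance Tumor Progression and Metastasis

I noticed this paper long ago when I first wanted to teach myself ChIP-seq. Back then I did not know much, and I downloaded the data and analyzed it without even reading the paper carefully, which amounted to just running some software. Reading it carefully this time, I found there is a lot to it, especially the experimental basis of ChIP-seq and the biological background in epigenetics. I must translate this paper when I have time. Before studying it, be sure to review some biology; see my previous blog post.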

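The authors first showed experimentally that knocking down CARM1 with small hairpin RNA removes only about 90% of it. Interestingly, this has very little effect on CARM1 function, which suggests that a very small amount of CARM1 is enough for it to work well. The authors therefore built material with CARM1 knocked out completely, using zinc finger nuclease (ZFN) genome editing.

This made it possible to compare the modification state of various proteins in cells with and without CARM1. Among them BAF155, a subunit of the SWI/SNF (BAF) chromatin remodeling complex, stood out: it is properly methylated only in cell lines where the CARM1 gene is intact. The authors showed that BAF155 is a bona fide (Latin for genuine) substrate of CARM1 and, through a clever experimental design, that the arginine (R) at position 1064 of BAF155 is the site CARM1 acts on.

Many papers had already shown that the SWI/SNF (BAF) chromatin remodeling complex plays an important role in cancer, so the authors naturally wanted to examine the function of BAF155 in cancer in detail. They chose ChIP-seq because BAF155, like a transcription factor, binds specific regions of the genome. (Transcription factors are protein molecules that bind specifically to particular sequences upstream of the 5' end of a gene, ensuring that the target gene is expressed at a particular level at a particular time and place.) ChIP-seq is well suited to studying a factor like BAF155, so the authors constructed an MCF7 cell line in which residue 1064 (R) of BAF155 is mutated and therefore can no longer be methylated by CARM1. Comparing the BAF155 ChIP-seq results of the mutant and wild-type cell lines then shows whether BAF155 must be methylated by CARM1 in order to carry out its biological function.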

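Using an me-BAF155-specific antibody and western blotting, the authors showed that about 74% of BAF155 is methylated in normal wild-type MCF7 cells.

The SKOV3 cell line expresses the other 14 components of the SWI/SNF (BAF) chromatin remodeling complex normally but not BAF155. Introducing either mutant or wild-type BAF155 into these cells promotes assembly of the complex, so methylation does not affect assembly of the remodeling complex. What we should focus on is whether methylation affects where BAF155 binds in the genome.

The result is that the genomic binding sites (peaks) of BAF155 in mutant and wild-type cells still overlap substantially. The focus is therefore on how their peaks differ in distribution across genomic regions (upstream and downstream of genes, 5' and 3' UTRs, promoters, introns, exons, intergenic regions, microRNA regions), how the distributions of their distances to transcription start sites differ, which genes they annotate to, which pathways those genes are enriched in, and so on.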
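The overlap between two peak sets can be quantified with a simple interval comparison. The sketch below counts how many peaks in one set overlap at least one peak in the other, assuming BED-style half-open intervals so that merely adjacent peaks do not count as overlapping; the function name and the toy coordinates are invented, and the paper itself used HOMER and Galaxy for this step.

```python
def count_overlapping(peaks_a, peaks_b):
    """Number of peaks in peaks_a that overlap at least one peak in peaks_b.

    Each peak is (chrom, start, end) with 0-based start and exclusive end.
    """
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    hits = 0
    for chrom, start, end in peaks_a:
        for b_start, b_end in by_chrom.get(chrom, []):
            if b_start >= end:
                break  # sorted by start: no later interval can overlap
            if b_end > start:
                hits += 1
                break
    return hits


wt_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
mut_peaks = [("chr1", 150, 250), ("chr2", 80, 120)]
shared = count_overlapping(wt_peaks, mut_peaks)
print(f"{shared} of {len(wt_peaks)} wild-type peaks overlap a mutant peak")
```

The linear scan per peak is fine for a few thousand peaks; for much larger sets an interval tree or a sorted sweep would be the usual choice.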

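Although the ChIP-seq was done in the human cell line MCF7, the mRNA microarray analysis was done in a different cell line, MDA-MB-231 (also a human breast cancer line), comparing BAF155 R1064 mutant cells with wild-type cells on the commonly used Affymetrix HG U133 Plus 2.0 platform.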

which was hybridized to Affymetrix HG U133 Plus 2.0 microarrays containing 54,675 probesets for >47,000 transcripts and variants, including 38,500 human genes.

To identify genes differentially expressed between MDA-MB-231-BAF155WT and MDA-MB-231-BAF155R1064K

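The expression matrix can be downloaded: ## http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4004525/bin/NIHMS556863-supplement-03.xlsx

Below I briefly excerpt the authors' description of the bioinformatic analysis of the ChIP-seq data: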

## All samples were mapped from fastq files using BOWTIE [-m 1 -- best] to mm9 [UCSCmouse genome build 9]

## Sequences were mapped to the human genome (hg19) using BOWTIE (--best -m 1) to yield unique alignments

## Peaks were called by using HOMER [http://biowhat.ucsd.edu/homer/] and QuEST [http://mendel.stanford.edu/sidowlab/downloads/quest/].

QuEST 2.4 (Valouev et al., 2008) was run using the recommend settings for transcription factor (TF) like binding with the following exceptions:

kde_bandwith=30, region_size=600, ChIP threshold=35, enrichment fold=3, rescue fold=3.

HOMER (Heinz et al., 2010) analysis was run using the default settings for peak finding.

False Discovery Rate (FDR) cut off was 0.001 (0.1%) for all peaks.

The tag density for each factor was normalized to 1x10^7 tags and displayed using the UCSC genome browser.

Motif analysis (de novo and known), was performed using the HOMER software and Genomatix.

Peak overlaps were processed with HOMER and Galaxy (Giardine et al., 2005).

Peak comparisons between replicates were processed with EdgeR statistical package in R
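The normalization to 10 million tags mentioned in the excerpt is a simple rescaling so that samples sequenced to different depths can be compared. A minimal sketch under that reading follows; the function name, the target default and the numbers are illustrative, and the details of how HOMER applies it internally are not covered here.

```python
def normalize_tag_counts(counts, total_tags, target=10_000_000):
    """Scale raw tag counts as if the sample had `target` total tags."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    scale = target / total_tags
    return [count * scale for count in counts]


# A sample with 20 million tags: every count is halved.
normalized = normalize_tag_counts([50, 200], total_tags=20_000_000)
print(normalized)
```

A sample sequenced to 5 million tags would have its counts doubled instead, which puts both samples on the same scale in the genome browser.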

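These are the steps of the standard workflow we will learn next. Below is a screenshot of the main workflow, but the most important thing is still how the experiment is designed. There is also a published paper on the ChIP-seq workflow: http://biow.sb-roscoff.fr/ecole_bioinfo/protected/jacques.van-helden/ThomasChollier_NatProtoc_2012_peak-motifs.pdf

I also recommend reading a few related papers: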

Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. http://www.nature.com/nature/journal/v448/n7153/pdf/nature06008.pdf

Mapping and analysis of chromatin state dynamics in nine human cell types(GSE26386): http://www.nature.com/nature/journal/v473/n7345/full/nature09906.html

Promiscuous RNA binding by Polycomb Repressive Complex 2 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823624/pdf/nihms517229.pdf