生信菜鸟团 » GSEA

制作自己的gene set文件给gsea软件

ulwvfje — Thu, 15 Dec 2016 11:43:56 +0000

熟悉GSEA软件的都知道，它只需要GCT,CLS和GMT文件，其中GMT文件，GSEA的作者已经给出了一大堆！就是记录broad的Molecular Signatures Database (MSigDB) 已经收到了18026个geneset，但是我奇怪的是里面竟然没有包括cancer testis的gene set，MSigDB的确是多，但未必全，其实里面还有很多重复。而且有不少几乎没有意义的gene set。那我想做自己的gene set来用gsea软件做分析，就需要自己制造gmt格式的数据。因为即使下载了MSigDB的gene set，本质上就是gmt格式的数据而已：http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29

我们首先要拿到自己感兴趣的gene set里面的gene list，最好是以hugo规定的标准symbol。

比如我感兴趣的是：http://www.cta.lncc.br/modelo.php

我这里提供一个2列的文件，直接转换成gmt的R代码！

文件来自于：下载最新版的KEGG信息，并且解析好，如下：

首先在R里面赋值一个变量path2gene_file就是图中的kegg2gene.txt文件，读到R里面去

tmp=read.table(path2gene_file,sep="\t",colClasses=c('character'))

#tmp=toTable(org.Hs.egPATH)

# first column is kegg ID, second column is entrez ID

GeneID2kegg_list<<- tapply(tmp[,1],as.factor(tmp[,2]),function(x) x)

kegg2GeneID_list<<- tapply(tmp[,2],as.factor(tmp[,1]),function(x) x)

这个变量kegg2GeneID_list是一个list，因为是entrez gene ID，需要转换成symbol，我就不多说了，转换后的数据，就是kegg2symbol_list 。

最后对 kegg2symbol_list 输出成gmt文件：

write.gmt <- function(geneSet=kegg2symbol_list,gmt_file='kegg2symbol.gmt'){

sink( gmt_file )

for (i in 1:length(geneSet)){

cat(names(geneSet)[i])

cat('\tNA\t')

cat(paste(geneSet[[i]],collapse = '\t'))

cat('\n')

}

sink()

}

java版本GSEA软件的ES score图片的修改

ulwvfje — Thu, 01 Dec 2016 16:53:10 +0000

首先要明白这个ES score图片里面的数据是什么，这样才能修改它，因为java是一个封闭打包好的软件，所以我们没办法在里面修改它没有提供的参数，运行完GSEA，默认输出的图就是下面这样：

ES score

这个图片在发表的时候，就会发现其实蛮模糊的，所以有可能需要自己重新制作这个图，那么就需要明白这个图后面的数据。

其中最下面的数据是量方法测到了2万个基因，那么这两万个基因在case和control组的差异度量(六种差异度量，默认是signal 2 noise，GSEA官网有提供公式，也可以选择大家熟悉的foldchange)肯定不一样,那么根据它们的差异度量，就可以对它们进行排序，并且Z-score标准化的结果。

而中间的就是该gene set在测到了的已经根据signal2noise排好序的2万个基因的位置。

最上面的图，就是所有的基因的ES score都要一个个加起来，叫做running ES score，在加的过程中，什么时候ES score达到了最大值，就是这个gene set最终的ES score！

我这里全面解析了GSEA官网提供的R代码的绘图函数，如下：

这个函数本身也被我抽离出来了：

这个知识点有点复杂，我解释的很清楚数据是什么，但是数据如何来的（就是下面代码读取的txt文件），我没办法用博客写清楚，需要修改一个2500行的源代码才能获取数据！

setwd('data')
Obs.RES=read.table('Obs.RES.txt')
Obs.RES=t(Obs.RES) ## 每个基因在每个gene set里面的running ES score，一个矩阵
Obs.indicator=read.table('Obs.indicator.txt')
Obs.indicator=t(Obs.indicator) ## 每个基因是否属于每个gene set，一个0/1矩阵
obs.s2n=read.table('obs.s2n.txt')[,1] ## 每个基因的signal 2 noise值，已经Z-score化，而且排好序了。
size.G=read.table('size.G.txt')[,1] ## 每个gene set的基因数量，在图中需要显示
gs.names=read.table('gs.names.txt')[,1] ## 每个gene set的名字，在图中需要显示
Obs.arg.ES=read.table('Obs.arg.ES.txt')[,1]## 每个gene set的最大ES score出现在排序基因的位置
Obs.ES.index=read.table('Obs.ES.index.txt')[,1]## 这个用不着的，我也忘记是什么了
Obs.ES=read.table('Obs.ES.txt')[,1] ##每个gene set的最大ES score是多少，如果是正值，用红色表示富集在case组，如果是负值，用蓝色，表示富集在control组。

plot_ES_score <- function(Ng=12,N=34688,phen1='control',phen2='case',Obs.RES,Obs.indicator,obs.s2n,size.G,gs.names,Obs.arg.ES,Obs.ES.index){
for (i in 1:Ng) {
png(paste0('number_',gs.names[i],'.png'))
ind <- 1:N
min.RES <- min(Obs.RES[i,])
max.RES <- max(Obs.RES[i,])
if (max.RES < 0.3) max.RES <- 0.3
if (min.RES > -0.3) min.RES <- -0.3
delta <- (max.RES - min.RES)*0.50
min.plot <- min.RES - 2*delta
max.plot <- max.RES
max.corr <- max(obs.s2n)
min.corr <- min(obs.s2n)
Obs.correl.vector.norm <- (obs.s2n - min.corr)/(max.corr - min.corr)*1.25*delta + min.plot
zero.corr.line <- (- min.corr/(max.corr - min.corr))*1.25*delta + min.plot
col <- ifelse(Obs.ES[i] > 0, 2, 4)

# Running enrichment plot

sub.string <- paste("Number of genes: ", N, " (in list), ", size.G[i], " (in gene set)", sep = "", collapse="")

main.string <- paste("Gene Set ", i, ":", gs.names[i])

plot(ind, Obs.RES[i,], main = main.string, sub = sub.string, xlab = "Gene List Index", ylab = "Running Enrichment Score (RES)", xlim=c(1, N), ylim=c(min.plot, max.plot), type = "l", lwd = 2, cex = 1, col = col)
for (j in seq(1, N, 20)) {
lines(c(j, j), c(zero.corr.line, Obs.correl.vector.norm[j]), lwd = 1, cex = 1, col = colors()[12]) # shading of correlation plot
}
lines(c(1, N), c(0, 0), lwd = 1, lty = 2, cex = 1, col = 1) # zero RES line
lines(c(Obs.arg.ES[i], Obs.arg.ES[i]), c(min.plot, max.plot), lwd = 1, lty = 3, cex = 1, col = col) # max enrichment vertical line
for (j in 1:N) {
if (Obs.indicator[i, j] == 1) {
lines(c(j, j), c(min.plot + 1.25*delta, min.plot + 1.75*delta), lwd = 1, lty = 1, cex = 1, col = 1) # enrichment tags
}
}
lines(ind, Obs.correl.vector.norm, type = "l", lwd = 1, cex = 1, col = 1)
lines(c(1, N), c(zero.corr.line, zero.corr.line), lwd = 1, lty = 1, cex = 1, col = 1) # zero correlation horizontal line
temp <- order(abs(obs.s2n), decreasing=T)
arg.correl <- temp[N]
lines(c(arg.correl, arg.correl), c(min.plot, max.plot), lwd = 1, lty = 3, cex = 1, col = 3) # zero crossing correlation vertical line

leg.txt <- paste("\"", phen1, "\" ", sep="", collapse="")
text(x=1, y=min.plot, adj = c(0, 0), labels=leg.txt, cex = 1.0)

leg.txt <- paste("\"", phen2, "\" ", sep="", collapse="")
text(x=N, y=min.plot, adj = c(1, 0), labels=leg.txt, cex = 1.0)

adjx <- ifelse(Obs.ES[i] > 0, 0, 1)

leg.txt <- paste("Peak at ", Obs.arg.ES[i], sep="", collapse="")
text(x=Obs.arg.ES[i], y=min.plot + 1.8*delta, adj = c(adjx, 0), labels=leg.txt, cex = 1.0)

leg.txt <- paste("Zero crossing at ", arg.correl, sep="", collapse="")
text(x=arg.correl, y=min.plot + 1.95*delta, adj = c(adjx, 0), labels=leg.txt, cex = 1.0)
dev.off()
}

}

通过这个代码，就可以把当前所有gese set的 ES score图给重新画一下，如果需要调整字体大小，就去代码里面慢慢调整。

GSEA的统计学原理试讲

ulwvfje — Thu, 01 Dec 2016 16:39:21 +0000

GSEA这个java软件使用非常方便，只需要根据要求做好GCT/CLS格式的input文件就好了。我以前也写个用法教程：

用GSEA来做基因集富集分析

批量运行GSEA，命令行版本

但说到统计学原理，就有点麻烦了，我试着用自己的思路阐释一下：

假设芯片或者其它测量方法测到了2万个基因，那么这两万个基因在case和control组的差异度量(六种差异度量，默认是signal 2 noise，GSEA官网有提供公式，也可以选择大家熟悉的foldchange)肯定不一样,那么根据它们的差异度量，就可以对它们进行排序，并且Z-score标准化，在下图的最底端展示的就是

那么图中间，就是我们每个gene set里面的基因在所有的2万个排序好基因的位置，如果gene set里面的基因集中在2万个基因的前面部分，就是在case里面富集，如果集中在后面部分，就是在control里面富集着。

而最上面的那个ES score的算法，大概如下：

仔细看，其实还是能看明白的，每个基因在每个gene set里面的ES score取决于这个基因是否属于该gene set，还有就是它的差异度量，上图的差异度量就是FC（foldchange）,对每个gene set来说，所有的基因的ES score都要一个个加起来，叫做running ES score，在加的过程中，什么时候ES score达到了最大值，就是这个gene set最终的ES score！

算法解读我参考的PPT，反正我是看懂了，但不一定能讲清楚：

http://bioinformatics.mdanderson.org/MicroarrayCourse/Lectures09/gsea1_bw.pdf

https://bioinformatics.cancer.gov/sites/default/files/course_material/GSEA_Theory.pptx

http://compbio.ucdenver.edu/Hunter_lab/Phang/downloads/files/GSEA.ppt

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1239896/

http://www.baderlab.org/CancerStemCellProject/VeroniqueVoisin/AdditionalResources/GSEA

软件还有大把的参数可以调整：http://www.baderlab.org/CancerStemCellProject/VeroniqueVoisin/AdditionalResources/GSEA/parameters

批量运行GSEA，命令行版本

ulwvfje — Mon, 11 Jan 2016 13:12:48 +0000

之前用过有界面的那种，那样非常方便，只需要做好数据即可，但是如果有非常多的数据，每次都要点击文件，点击下一步，也很烦，不过，，它既然是java软件，就可以以命令行的形式来玩转它！

能够命令行运行了，就很容易批量啦

一、程序安装

直接在官网下载java版本软件即可：http://software.broadinstitute.org/gsea/downloads.jsp

二、输入数据

需要下载gmt文件，自己制作gct和cls文件，或者直接下载测试文件p53

见：http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats

三、运行命令

hgu95av2的芯片数据，只有一万多探针，所以很快就可以出结果

java -cp gsea2-2.2.1.jar -Xmx1024m xtools.gsea.Gsea -gmx c2.cp.kegg.v5.0.symbols.gmt \

-res P53_hgu95av2.gct -cls P53.cls -chip chip/HG_U95Av2.chip -out result -rpt_label p53_example

这个参数其实非常多，见文件http://software.broadinstitute.org/gsea/doc/linkedFiles/GSEAParameters.txt

但是一般我们都用默认的即可

里面报错说有些探针找不到，不要管它

四、输出数据

自己看官网去理解这些结果咯！

需要下载的数据如下：

首先需要下载 Molecular Signatures Database (MSigDB)，一般选择C2的kegg，BioCarta 还有Reactome

http://software.broadinstitute.org/gsea/downloads.jsp

都是gmt格式的文件！

CP: Canonical pathways (browse 1330 gene sets)	Gene sets from the pathway databases. Usually, these gene sets are canonical representations of a biological process compiled by domain experts. details	Download GMT Files original identifiers gene symbols entrez genes ids
CP:BIOCARTA: BioCarta gene sets (browse 217 gene sets)	Gene sets derived from the BioCarta pathway database (http://www.biocarta.com/genes/index.asp).	Download GMT Files original identifiers gene symbols entrez genes ids
CP:KEGG: KEGG gene sets (browse 186 gene sets)	Gene sets derived from the KEGG pathway database (http://www.genome.jp/kegg/pathway.html).	Download GMT Files original identifiers gene symbols entrez genes ids
CP:REACTOME: Reactome gene sets (browse 674 gene sets)	Gene sets derived from the Reactome pathway database (http://www.reactome.org/).	Download GMT Files original identifiers gene symbols entrez genes ids

然后做出表达数据gct文件和cls表型文件~

见：http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats

然后就可以直接运行了

如果是芯片数据，第一列是芯片探针，那么还需要下载chip数据：ftp://ftp.broadinstitute.org/pub/gsea/annotations

用GSEA来做基因集富集分析

ulwvfje — Wed, 30 Dec 2015 01:17:43 +0000

how to use GSEA?

这个有点类似于pathway（GO,KEGG等）的富集分析，区别在于gene set（矫正好的基于文献的数据库）的概念更广泛一点，包括了

how to download GSEA ?

软件下载地址：http://software.broadinstitute.org/gsea/downloads.jsp

教程：http://software.broadinstitute.org/gsea/doc/desktop_tutorial.jsp

需要自己安装好java环境！

what's the input for the GSEA?

说明书上写的输入数据是：GSEA supported data files are simply tab delimited ASCII text files, which have special file extensions that identify them. For example, expression data usually has the extension *.gct, phenotypes *.cls, gene sets *.gmt, and chip annotations *.chip. Click the More on file formats help button to view detailed descriptions of all the data file formats.

并且提供了测试数据：http://software.broadinstitute.org/gsea/datasets.jsp

实际上没那么复杂，一个表达矩阵即可！然后做一个分组说明的cls文件即可。

主要是自己看说明书，做出要求的数据格式：http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats

表达矩阵我这里下载GSE1009数据集做测试吧！

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse1009

ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1009/matrix/GSE1009_series_matrix.txt.gz

cls的样本说明文件，就随便搞一搞吧，下面这个是例子：

6 2 1

# good bad

good good good bad bad bad

文件如下，六个样本，根据探针来的表达数据，分组前后各三个一组。

现在开始运行GSEA！

start to run the GSEA !

首先载入数据

确定无误，就开始运行，运行需要设置一定的参数！

what's the output ?

输出的数据非常多，对你选择的gene set数据集里面的每个set都会分析看看是否符合富集的标准，富集就出来一个报告。

点击success就能进入报告主页，里面的链接可以进入任意一个分报告。

最大的特色是提供了大量的数据集：You can browse the MSigDB from the Molecular Signatures Database page of the GSEA web site or the Browse MSigDB page of the GSEA application. To browse the MSigDB from the GSEA application:

还自己建立了wiki说明主页：http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page

有些文献是基于GSEA的：

www.ncbi.nlm.nih.gov/pubmed/16199517

http://stke.sciencemag.org/highwire/filestream/4681053/field_highwire_adjunct_files/1/2001966_Slides.zip

http://www.ingentaconnect.com/content/ben/cbio/2007/00000002/00000002/art00003

http://www.nature.com/articles/ng0704-663a

http://bioinformatics.oxfordjournals.org/content/23/23/3251.short

http://link.springer.com/article/10.1007/s00335-011-9359-x

生信菜鸟团 » GSEA

制作自己的gene set文件给gsea软件

java版本GSEA软件的ES score图片的修改

GSEA的统计学原理试讲

批量运行GSEA，命令行版本

用GSEA来做基因集富集分析

Identification of high-copper-responsive target pathways in Atp7b knockout mouse liver byGSEA on microarray data sets

Comparison of invariant NKT cells with conventional T cells by using gene set enrichment analysis (GSEA)