使用PGSEA包进行基因集分析

GSEA 相信看过我生信菜鸟团博客的朋友都已经耳熟能详了的,其需要样本的描述以及分组信息,虽然有ssGSEA这样的单样本的分析,但仍然不够,也有GSVA这样的算法来弥补,这里要介绍的是另外一个包,PGSEA。

it tests for each sample whether the average expression of genes in a gene sets deviates from the overall average expression (expression of all genes in all samples).

安装PGSEA这个R包

安装并且查看 PDF教程:

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
biocLite("PGSEA")
library(PGSEA)
browseVignettes("PGSEA")

最新版教程: https://bioconductor.org/packages/release/bioc/html/PGSEA.html

重点就是PGSEA函数及smcPlot函数,前者根据表达矩阵及基因集来进行GSEA分析,后者用来可视化分析后的结果。

因为PGSEA分析后的结果是每个基因集在每个样本的一个score,所以也是一个表达矩阵,也可以进行limma的差异分析流程。

执行PGSEA分析及可视化

library(PGSEA)
library(gskb)
data(mm_miRNA)
gse<-read.csv("http://ge-lab.org/gskb/GSE40261.csv",header=TRUE, row.name=1)
# Gene are centered by mean expression
gse <- gse - apply(gse,1,mean)  
​
## 保证表达矩阵和基因集的基因名字是对应的即可。d1 <- scan("http://ge-lab.org/gskb/2-MousePath/MousePath_Co-expression_gmt.gmt", what="", sep="\n", skip=1)
mm_Co_expression <- strsplit(d1, "\t")
names(mm_Co_expression) <- sapply(mm_Co_expression, '[[', 1)
​
pg <- PGSEA(gse, cl=mm_Co_expression, range=c(15,2000), p.value=NA)
​
# Remove pathways that has all NAs. This could be due to that pathway has
# too few matching genes. 
pg2 <- pg[rowSums(is.na(pg))!= dim(gse)[2], ]
​
# Difference in Average Z score in two groups of samples is calculated and 
# the pathways are ranked by absolute value.
diff <- abs( apply(pg2[,1:4],1,mean) - apply(pg2[,5:8], 1, mean) )
pg2 <- pg2[order(-diff), ]  
​
sub <- factor( c( rep("Control",4),rep("Anti-miR-29",4) ) ) 
​
smcPlot(pg2[1:15,],sub,scale=c(-12,12),show.grid=TRUE,margins=c(1,1,7,19),col=.rwb)
​

分析结果也可以走limma流程

library(PGSEA)
library(GEOquery)
library(GSEABase)
gse <- getGEO("GSE7023",GSEMatrix=TRUE)
#load("gse.rda")
subtype <- gsub("\\.", "_",gsub("subtype: ", "", phenoData(gse[[1]])$"characteristics_ch1"))
pheno <- new("AnnotatedDataFrame", data = data.frame(subtype), varMetadata = data.frame(labelDescription="subtype"))
rownames(pheno@data) <- colnames(exprs(gse[[1]]))
eset <- new("ExpressionSet", exprs = exprs(gse[[1]]), phenoData = pheno)
​
data(VAIgsc)
details(VAIgsc[[1]])
​
pgNF <- PGSEA(eset, VAIgsc, ref=which(subtype=="NO"), p.value=NA)
​
library(limma)
​
design <- model.matrix(~ -1+factor(subtype))
colnames(design) <- names(table(subtype))
fit <- lmFit(pgNF, design)
contrast.matrix <- makeContrasts(P2B-NO , levels=design)
fit <- contrasts.fit(fit, contrast.matrix)
fit <- eBayes(fit)
​
topTable(fit, n=10)[,c("logFC","t","adj.P.Val")]
​

Comments are closed.