PPI analysis with the STRINGdb package from Bioconductor in R

ulwvfje — Wed, 23 Nov 2016 11:37:37 +0000

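At its core, PPI analysis takes a set of proteins or genes of interest (a few hundred, or even more than a thousand) and looks up the interactions among them in a PPI database!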

The star of this post is STRINGdb which, as the name suggests, uses the well-known STRING database.

Paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383874/

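I had assumed I would need to upload my own gene list to the database website to run the analysis, but it turns out they have also developed an R package. Home page: http://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html I prefer solving problems with code, so I spent some time learning this package, and it is very easy to use!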

All it needs is a three-column data.frame containing logFC, p-value and gene ID, which is just the standard output of a differential expression analysis.
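As a rough sketch of what that input looks like, the snippet below reshapes a small differential expression table into the three columns used by the package's example data (pvalue, logFC, gene). The `de_result` table, its limma-style column names and all of its values are made up for illustration; they are assumptions, not something taken from a real analysis.

```r
# Hypothetical differential-expression result. The column names mimic a
# limma-style table with gene symbols as row names (an assumption), and
# the numbers are invented for illustration only.
de_result <- data.frame(
  logFC   = c(3.33, 3.82, -2.10),
  P.Value = c(1.0e-4, 1.4e-4, 2.0e-3),
  row.names = c("VSTM2L", "TBC1D2", "TP53")
)

# Reshape into the three columns used by the example data:
# pvalue, logFC and gene.
de_input <- data.frame(
  pvalue = de_result$P.Value,
  logFC  = de_result$logFC,
  gene   = rownames(de_result),
  stringsAsFactors = FALSE
)

print(de_input)
```

A table shaped like `de_input` is what would then be handed to string_db$map with "gene" as the identifier column.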

Then the string_db$map function adds a column holding the STRING protein ID, and the string_db$add_diff_exp_color function adds a color column.

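The string_db$plot_network function draws the network and needs only the STRING protein IDs. To give the proteins different colors, first use string_db$post_payload to associate a color with each protein, and then draw the network.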

You can also call the get_interactions function directly to fetch all the PPI data, write it to a local file, and then import it into Cytoscape for plotting.
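The export step could look roughly like the sketch below. The `ppi` table is a small stand-in for whatever string_db$get_interactions returns; its column names (from, to, combined_score), the protein IDs and the scores are all assumptions for illustration, and the cutoff of 400 is an arbitrary choice.

```r
# Stand-in for the data.frame returned by string_db$get_interactions(hits).
# Column names, IDs and scores are assumptions made for this example.
ppi <- data.frame(
  from = c("9606.protA", "9606.protA", "9606.protB"),
  to   = c("9606.protB", "9606.protC", "9606.protC"),
  combined_score = c(950, 700, 150),
  stringsAsFactors = FALSE
)

# Keep only edges at or above an (arbitrary) confidence cutoff.
ppi_filtered <- ppi[ppi$combined_score >= 400, ]

# Write a tab-separated edge list; a temporary folder is used here,
# so replace it with a real path in practice.
out_file <- file.path(tempdir(), "ppi_edges.tsv")
write.table(ppi_filtered, out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)
```

In Cytoscape the resulting file can be loaded as a network table, with `from` as the source column, `to` as the target column and `combined_score` as an edge attribute.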

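There are also a few smaller features that may not be much use to me but suit beginners well: with nothing but STRING protein IDs you can run GO/KEGG enrichment analysis, look up the interaction between two proteins, find the papers reporting a direct interaction between two proteins, and find a protein's homologs in other species!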

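While running, the package needs to download a set of files, and the annoying part is that it downloads them again every time, because by default it stores them in the computer's temporary folder!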
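One possible workaround, sketched below, is to pass a persistent folder through the `input_directory` argument that the constructor call in this post leaves empty. This assumes that `input_directory` controls where the downloaded files are kept; the folder location and the two helper names (`ensure_cache_dir`, `new_cached_string_db`) are hypothetical.

```r
# Create the folder if needed and return its path.
ensure_cache_dir <- function(path) {
  dir.create(path, showWarnings = FALSE, recursive = TRUE)
  path
}

# Hypothetical persistent location; any writable folder would do.
cache_dir <- ensure_cache_dir(file.path(path.expand("~"), "stringdb_cache"))

# Hypothetical helper: the same constructor call used in this post, but
# with input_directory pointing at the persistent folder instead of "".
new_cached_string_db <- function(cache_dir) {
  STRINGdb::STRINGdb$new(version = "10", species = 9606,
                         score_threshold = 0, input_directory = cache_dir)
}
```

With the package installed, `string_db <- new_cached_string_db(cache_dir)` would then take the place of the constructor call used further down.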

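Under the hood the package relies heavily on igraph, including for the clustering methods used later on (the network images themselves are PNGs fetched from the STRING server). To find hub genes, you may also need to combine this with the MCODE plugin for Cytoscape.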
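For a quick first look at hub candidates without leaving R, one simple proxy is node degree, that is, how many interaction partners each protein has. The sketch below is not MCODE and not part of STRINGdb; it uses base R only, and the toy edge list and the cutoff of 3 are assumptions for illustration. It also assumes an undirected network in which each pair is listed once.

```r
# Toy undirected edge list; node names are placeholders.
edges <- data.frame(
  from = c("A", "A", "A", "B", "C"),
  to   = c("B", "C", "D", "C", "E"),
  stringsAsFactors = FALSE
)

# Degree = number of edges touching each node.
degree <- sort(table(c(edges$from, edges$to)), decreasing = TRUE)
print(degree)

# Call a node a hub candidate if it has at least 3 partners
# (an arbitrary cutoff for this toy example).
hub_candidates <- names(degree)[degree >= 3]
print(hub_candidates)
```

On a real network the cutoff would need to be chosen from the degree distribution, and a module-finding tool such as MCODE looks at local density rather than plain degree.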

Basically, running the code below once is enough to understand it all: http://www.bioconductor.org/packages/release/bioc/vignettes/STRINGdb/inst/doc/STRINGdb.R

library(STRINGdb)

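## The package's help pages are not written with roxygen2. Instead, all the functions live inside the string_db object and are called with the $ operator, which is also how you view each function's documentation!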

## First choose the species and the database version!

string_db <- STRINGdb$new( version="10", species=9606,

score_threshold=0, input_directory="" )

###################################################

### code chunk number 3: help

###################################################

STRINGdb$methods() # To list all the methods available.

STRINGdb$help("get_graph") # To visualize their documentation.

## These list every method the package provides and show the help page for a specific one.

###################################################

### code chunk number 4: load_data

###################################################

data(diff_exp_example1)

head(diff_exp_example1)

## A test data set with three columns, as follows:

# pvalue logFC gene

# 0.0001018 3.333461 VSTM2L

# 0.0001392 3.822383 TBC1D2

# This is typically the result of a differential expression analysis

###################################################

### code chunk number 5: map

###################################################

example1_mapped <- string_db$map( diff_exp_example1, "gene", removeUnmappedRows = TRUE )

## Our differential expression results are keyed by gene, so they need to be mapped to STRING protein IDs

STRINGdb$help("map")

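# Read the help page to see how the map function is used and what it returns!

# In essence, it looks up the STRING protein ID for each entry in the gene column of the input data.frame and returns the data.frame with one extra column!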

###################################################

### code chunk number 6: STRINGdb.Rnw:118-121

###################################################

options(SweaveHooks=list(fig=function()

par(mar=c(2.1, 0.1, 4.1, 2.1))))

#par(mar=c(1.1, 0.1, 4.1, 2.1))))

## Sets the plotting parameters; nothing much to explain

###################################################

### code chunk number 7: get_hits

###################################################

hits <- example1_mapped$STRING_id[1:200]

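# Here we simply take the first 200 proteins for the next steps!

## Keep in mind that this selection is arbitrary and only for demonstration; in a real analysis you should select your own differentially expressed genes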

###################################################

### code chunk number 8: plot_network

###################################################

string_db$plot_network( hits )

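## Protein IDs are all you need to draw a network; the more IDs, the longer it takes!

## The function finds all the PPI data for the input IDs in the STRING database and then draws the network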

## STRINGdb$help("plot_network")

###################################################

### code chunk number 9: add_diff_exp_color

###################################################

# filter by p-value and add a color column

# (i.e. green for down-regulated genes and red for up-regulated genes)

example1_mapped_pval05 <- string_db$add_diff_exp_color( subset(example1_mapped, pvalue<0.05),

logFcColStr="logFC" )

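## The plain network above usually is not enough. For example, we may want to show whether each gene is up- or down-regulated and how strongly, which can be encoded as shades of red and green.

## The object returned by add_diff_exp_color is still a data.frame, with an added color column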

STRINGdb$help("add_diff_exp_color")

###################################################

### code chunk number 10: post_payload

###################################################

# post payload information to the STRING server

payload_id <- string_db$post_payload( example1_mapped_pval05$STRING_id,

colors=example1_mapped_pval05$color )

## add_diff_exp_color added a color column to our data.frame above; post_payload is still needed to pair each STRING protein ID with its color, and it returns a payload_id to pass to the plotting function.

STRINGdb$help("post_payload")

###################################################

### code chunk number 11: plot_halo_network

###################################################

# display a STRING network png with the "halo"

string_db$plot_network( hits, payload_id=payload_id )

## The same network plot, now with a color attribute.

## As you can see, with this many genes the plot gets very crowded

###################################################

### code chunk number 13: plot_ppi_enrichment

###################################################

# plot the enrichment for the best 1000 genes

string_db$plot_ppi_enrichment( example1_mapped$STRING_id[1:1000], quiet=TRUE )

STRINGdb$help("plot_ppi_enrichment")

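## I did not really understand what this code is doing; going by the function name and its help page, it seems to plot how enriched the input list is in protein-protein interactions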

###################################################

### code chunk number 14: enrichment

###################################################

enrichmentGO <- string_db$get_enrichment( hits, category = "Process", methodMT = "fdr", iea = TRUE )

enrichmentKEGG <- string_db$get_enrichment( hits, category = "KEGG", methodMT = "fdr", iea = TRUE )

head(enrichmentGO, n=7)

head(enrichmentKEGG, n=7)

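### Runs the enrichment analysis directly from STRING protein IDs; this function downloads some data automatically. By default the whole human proteome is used as the background, but in most cases this needs to be changed, otherwise the p-values will not be accurate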

#################################################

# code chunk number 15: background (eval = FALSE)

#################################################

# Here the background is changed: human has more than twenty thousand genes, but the background now contains only 2000

backgroundV <- example1_mapped$STRING_id[1:2000] # as an example, we use the first 2000 genes

string_db$set_background(backgroundV)

## string_db is a global variable. It was originally set to human, version 10, and has now been modified. This is only a test, so be sure to change it back!!!

###################################################

### code chunk number 16: new_background_inst (eval = FALSE)

###################################################

string_db <- STRINGdb$new( score_threshold=0, backgroundV = backgroundV )

###################################################

### code chunk number 17: enrichmentHeatmap (eval = FALSE)

###################################################

eh <- string_db$enrichment_heatmap( list( hits[1:100], hits[101:200]),

list("list1","list2"), title="My Lists" )

## Let's set string_db back to what it was!

string_db <- STRINGdb$new( version="10", species=9606,

score_threshold=0, input_directory="" )

###################################################

### code chunk number 18: clustering1

###################################################

# get clusters

clustersList <- string_db$get_clusters(example1_mapped$STRING_id[1:600])

###################################################

### code chunk number 19: STRINGdb.Rnw:254-256

###################################################

options(SweaveHooks=list(fig=function()

par(mar=c(2.1, 0.1, 4.1, 2.1))))

###################################################

### code chunk number 20: clustering2

###################################################

# plot first 4 clusters

par(mfrow=c(2,2))

for(i in 1:4){

string_db$plot_network(clustersList[[i]])

}

## Draws the 4 clusters on a single canvas!

###################################################

### code chunk number 21: proteins

###################################################

string_proteins <- string_db$get_proteins()

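## Below are some other small utilities, such as finding the interaction between two proteins, the papers reporting a direct interaction between two proteins, and a protein's homologs in other species!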

###################################################

### code chunk number 22: atmtp

###################################################

tp53 = string_db$mp( "tp53" )

atm = string_db$mp( "atm" )

###################################################

### code chunk number 23: neighbors (eval = FALSE)

###################################################

## string_db$get_neighbors( c(tp53, atm) )

###################################################

### code chunk number 24: interactions

###################################################

string_db$get_interactions( c(tp53, atm) )

###################################################

### code chunk number 25: pubmedInteractions (eval = FALSE)

###################################################

## string_db$get_pubmed_interaction( tp53, atm )

###################################################

### code chunk number 26: homologs (eval = FALSE)

###################################################

## # get the reciprocal best hits of the following protein in all the STRING species

## string_db$get_homologs_besthits(tp53, symbets = TRUE)

###################################################

### code chunk number 27: homologs2 (eval = FALSE)

###################################################

## # get the homologs of the following two proteins in the mouse (i.e. species_id=10090)

## string_db$get_homologs(c(tp53, atm), target_species_id=10090, bitscore_threshold=60 )

###################################################

### code chunk number 28: benchmark1

###################################################

data(interactions_example)

interactions_benchmark = string_db$benchmark_ppi(interactions_example, pathwayType = "KEGG",

max_homology_bitscore = 60, precision_window = 400, exclude_pathways = "blacklist")

###################################################

### code chunk number 29: STRINGdb.Rnw:391-393

###################################################

options(SweaveHooks=list(fig=function()

par(mar=c(4.1, 4.1, 4.1, 2.1))))

###################################################

### code chunk number 30: benchmark2

###################################################

plot(interactions_benchmark$precision, ylim=c(0,1), type="l", xlim=c(0,700),

xlab="interactions", ylab="precision")

###################################################

### code chunk number 31: benchmark3

###################################################

interactions_pathway_view = string_db$benchmark_ppi_pathway_view(interactions_benchmark, precision_threshold=0.2, pathwayType = "KEGG")

head(interactions_pathway_view)
