生信菜鸟团 (Bioinformatics Rookies) » PPI

PPI analysis with the Bioconductor STRINGdb package in R

ulwvfje — Wed, 23 Nov 2016 11:37:37 +0000

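At its core, PPI analysis takes a set of proteins or genes of interest (a few hundred, or even more than a thousand) and looks up the interactions involving them in a PPI database.

The star of this post is STRINGdb which, as the name suggests, is built on the well-known STRING database.

Paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383874/

I had assumed I would need to upload my own gene list to the STRING website for analysis, but it turns out they also provide an R package; its home page is http://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html. I prefer to solve problems with code, so I spent some time learning the package, and it is very easy to use.

All it needs is a three-column data.frame with logFC, p-value, and gene ID, which is simply the standard output of a differential expression analysis.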
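As a minimal sketch of that input, the data.frame below is built by hand. The column names pvalue, logFC, and gene follow the package's example data printed later in this post; the third row, the placeholder symbol GENE_X, and the variable names are made up for illustration.

```r
# A hand-made differential expression table with the three expected columns.
# The first two rows copy the example data shown later; the third is invented.
diff_exp <- data.frame(
  pvalue = c(0.0001018, 0.0001392, 0.2),
  logFC  = c(3.333461, 3.822383, -0.5),
  gene   = c("VSTM2L", "TBC1D2", "GENE_X"),  # GENE_X is a placeholder symbol
  stringsAsFactors = FALSE
)

# Keep only the significant rows, as one would before mapping to STRING IDs.
significant <- subset(diff_exp, pvalue < 0.05)
print(significant)
```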

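You then call string_db$map to add a column of STRING protein IDs, and string_db$add_diff_exp_color to add a color column.

string_db$plot_network draws the network and needs only the STRING protein IDs. To give the proteins different colors, first call string_db$post_payload to associate each color with its protein, then draw the network.

Alternatively, call get_interactions to retrieve all of the PPI data, write it to a local file, and import it into Cytoscape for plotting.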
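A minimal sketch of the export step, assuming the interaction table has the columns from, to, and combined_score. The toy table, its protein IDs, the cutoff of 400, and the output file name are all illustrative assumptions; with the real package the table would come from get_interactions.

```r
# Toy stand-in for an interaction table (hypothetical IDs and scores).
interactions <- data.frame(
  from = c("9606.ENSP00000000001", "9606.ENSP00000000001", "9606.ENSP00000000002"),
  to   = c("9606.ENSP00000000002", "9606.ENSP00000000003", "9606.ENSP00000000003"),
  combined_score = c(999, 150, 700),
  stringsAsFactors = FALSE
)

# Keep only edges at or above an assumed confidence cutoff.
min_score <- 400
edges <- interactions[interactions$combined_score >= min_score, ]

# Write a tab-delimited edge list with a header row.
out_file <- file.path(tempdir(), "ppi_edges.txt")
write.table(edges, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

A tab-delimited file like this can be loaded through Cytoscape's network import from file, choosing from and to as the source and target columns.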

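There are also a few smaller features. They may not be much use to me, but they suit beginners: with nothing more than STRING protein IDs you can run GO/KEGG enrichment analysis, look up the interaction between two proteins, find the papers that report a direct interaction between two proteins, and find the homologs of a protein in other species.

The package needs to download a number of data files while it runs. Annoyingly, it downloads them again every time, because by default it stores them in the system's temporary folder.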
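One possible way around the repeated downloads is to point input_directory (the argument that is left empty in the code below) at a persistent folder. The sketch below assumes that the package reuses files it finds there; the helper names and the folder location are hypothetical, and newer package releases may not accept version "10".

```r
# Create the cache folder if it does not exist and return its full path.
ensure_cache_dir <- function(path) {
  if (!dir.exists(path)) dir.create(path, recursive = TRUE)
  normalizePath(path)
}

# Hypothetical wrapper: build a STRINGdb object that stores its downloads
# in cache_dir instead of the session's temporary folder.
new_cached_string_db <- function(cache_dir, version = "10", species = 9606) {
  STRINGdb::STRINGdb$new(version = version, species = species,
                         score_threshold = 0,
                         input_directory = ensure_cache_dir(cache_dir))
}
```

With the package installed, something like string_db <- new_cached_string_db("~/stringdb_cache") would replace the STRINGdb$new call used below.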

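All of the network plots are essentially heavily customized igraph plots, and the same goes for the clustering methods used later. To find hub genes you may also need to combine this with Cytoscape's MCODE plugin.

Running the code below once is basically enough to understand the package: http://www.bioconductor.org/packages/release/bioc/vignettes/STRINGdb/inst/doc/STRINGdb.R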

library(STRINGdb)

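## The package's help pages are not written with roxygen2. All of its functions live inside the string_db object and are called with the $ operator; the help for each function can be viewed the same way.

## First choose the species and the database version.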

string_db <- STRINGdb$new( version="10", species=9606,

score_threshold=0, input_directory="" )

###################################################

### code chunk number 3: help

###################################################

STRINGdb$methods() # To list all the methods available.

STRINGdb$help("get_graph") # To visualize their documentation.

## The two calls above list every function in the package and show the help page for a specific function.

###################################################

### code chunk number 4: load_data

###################################################

data(diff_exp_example1)

head(diff_exp_example1)

## A test data set with three columns, as shown below:

# pvalue logFC gene

# 0.0001018 3.333461 VSTM2L

# 0.0001392 3.822383 TBC1D2

# This is typically the result of a differential expression analysis

###################################################

### code chunk number 5: map

###################################################

example1_mapped <- string_db$map( diff_exp_example1, "gene", removeUnmappedRows = TRUE )

## Our differential expression results are keyed by gene, so they have to be mapped to STRING protein IDs

STRINGdb$help("map")

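# Read the help page to see how map is used and what it returns.

# In essence, it looks up the STRING protein ID for each entry in the gene column of the input data.frame and returns the data.frame with one extra column.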

###################################################

### code chunk number 6: STRINGdb.Rnw:118-121

###################################################

options(SweaveHooks=list(fig=function()

par(mar=c(2.1, 0.1, 4.1, 2.1))))

#par(mar=c(1.1, 0.1, 4.1, 2.1))))

## Sets the plotting parameters; nothing much to explain here

###################################################

### code chunk number 7: get_hits

###################################################

hits <- example1_mapped$STRING_id[1:200]

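# Here we simply take the first 200 proteins for the next steps.

## Keep in mind that this selection is arbitrary; in practice you should pick your own set of differentially expressed genes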

###################################################

### code chunk number 8: plot_network

###################################################

string_db$plot_network( hits )

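## Protein IDs are all you need to draw a network; the more IDs, the longer it takes.

## The function looks up all PPI data for the input IDs in the STRING database and then draws the network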

## STRINGdb$help("plot_network")

###################################################

### code chunk number 9: add_diff_exp_color

###################################################

# filter by p-value and add a color column

# (i.e. green for down-regulated genes and red for up-regulated genes)

example1_mapped_pval05 <- string_db$add_diff_exp_color( subset(example1_mapped, pvalue<0.05),

logFcColStr="logFC" )

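## The simple network above usually is not enough. For example, we may want to see which genes are up- or down-regulated and by how much, which can be shown with shades of red and green.

## The object returned by add_diff_exp_color is still a data.frame, with an added color column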

STRINGdb$help("add_diff_exp_color")

###################################################

### code chunk number 10: post_payload

###################################################

# post payload information to the STRING server

payload_id <- string_db$post_payload( example1_mapped_pval05$STRING_id,

colors=example1_mapped_pval05$color )

## add_diff_exp_color added a color column to our data.frame, but post_payload is still needed to pair each STRING protein ID with its color. It returns a payload_id that is passed to the plotting function.

STRINGdb$help("post_payload")

###################################################

### code chunk number 11: plot_halo_network

###################################################

# display a STRING network png with the "halo"

string_db$plot_network( hits, payload_id=payload_id )

## This also draws the network, but with the color attribute added.

## As you can see, with this many genes the plot gets very crowded

###################################################

### code chunk number 13: plot_ppi_enrichment

###################################################

# plot the enrichment for the best 1000 genes

string_db$plot_ppi_enrichment( example1_mapped$STRING_id[1:1000], quiet=TRUE )

STRINGdb$help("plot_ppi_enrichment")

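## I did not understand what this code is doing. Judging by the function name and its help page, it appears to plot how enriched the ranked protein list is in known interactions.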

###################################################

### code chunk number 14: enrichment

###################################################

enrichmentGO <- string_db$get_enrichment( hits, category = "Process", methodMT = "fdr", iea = TRUE )

enrichmentKEGG <- string_db$get_enrichment( hits, category = "KEGG", methodMT = "fdr", iea = TRUE )

head(enrichmentGO, n=7)

head(enrichmentKEGG, n=7)

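### Enrichment analysis is run directly on the STRING protein IDs, and the function downloads some data automatically. By default the whole human proteome is used as the background, but in most cases this needs to be changed, otherwise the p-values will not be accurate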
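To see why the background matters, here is a small sketch using a plain hypergeometric test from base R. This is a generic over-representation test, not necessarily the exact statistic STRINGdb computes, and all counts are toy numbers chosen for illustration.

```r
# P(X >= k): probability of seeing at least k annotated genes among n hits
# when K of the N background genes carry the annotation.
enrichment_p <- function(k, K, n, N) {
  phyper(k - 1, K, N - K, n, lower.tail = FALSE)
}

# Same hit list and gene set, two different assumed backgrounds.
p_genome  <- enrichment_p(k = 10, K = 100, n = 200, N = 20000)
p_reduced <- enrichment_p(k = 10, K = 100, n = 200, N = 2000)
print(c(genome = p_genome, reduced = p_reduced))
```

With the genome-wide background only about one overlapping gene is expected by chance, so ten looks highly significant; with the 2000-gene background about ten are expected, so the same overlap is unremarkable.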

#################################################

# code chunk number 15: background (eval = FALSE)

#################################################

# Change the background here: human has more than twenty thousand genes, but the background is now reduced to 2000

backgroundV <- example1_mapped$STRING_id[1:2000] # as an example, we use the first 2000 genes

string_db$set_background(backgroundV)

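## string_db is a global variable. It was created for human with version 10 of the database and has now been modified. This is only a test, so remember to change it back!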

###################################################

### code chunk number 16: new_background_inst (eval = FALSE)

###################################################

string_db <- STRINGdb$new( score_threshold=0, backgroundV = backgroundV )

###################################################

### code chunk number 17: enrichmentHeatmap (eval = FALSE)

###################################################

eh <- string_db$enrichment_heatmap( list( hits[1:100], hits[101:200]),

list("list1","list2"), title="My Lists" )

## Let's restore string_db to its original settings.

string_db <- STRINGdb$new( version="10", species=9606,

score_threshold=0, input_directory="" )

###################################################

### code chunk number 18: clustering1

###################################################

# get clusters

clustersList <- string_db$get_clusters(example1_mapped$STRING_id[1:600])

###################################################

### code chunk number 19: STRINGdb.Rnw:254-256

###################################################

options(SweaveHooks=list(fig=function()

par(mar=c(2.1, 0.1, 4.1, 2.1))))

###################################################

### code chunk number 20: clustering2

###################################################

# plot first 4 clusters

par(mfrow=c(2,2))

for(i in seq_len(4)){

string_db$plot_network(clustersList[[i]])

}

## This draws the four clusters on the same canvas.

###################################################

### code chunk number 21: proteins

###################################################

string_proteins <- string_db$get_proteins()

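## Below are some other small utilities: finding the interaction between two proteins, finding the papers that report a direct interaction between two proteins, and finding the homologs of a protein in other species.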

###################################################

### code chunk number 22: atmtp

###################################################

tp53 = string_db$mp( "tp53" )

atm = string_db$mp( "atm" )

###################################################

### code chunk number 23: neighbors (eval = FALSE)

###################################################

## string_db$get_neighbors( c(tp53, atm) )

###################################################

### code chunk number 24: interactions

###################################################

string_db$get_interactions( c(tp53, atm) )

###################################################

### code chunk number 25: pubmedInteractions (eval = FALSE)

###################################################

## string_db$get_pubmed_interaction( tp53, atm )

###################################################

### code chunk number 26: homologs (eval = FALSE)

###################################################

## # get the reciprocal best hits of the following protein in all the STRING species

## string_db$get_homologs_besthits(tp53, symbets = TRUE)

###################################################

### code chunk number 27: homologs2 (eval = FALSE)

###################################################

## # get the homologs of the following two proteins in the mouse (i.e. species_id=10090)

## string_db$get_homologs(c(tp53, atm), target_species_id=10090, bitscore_threshold=60 )

###################################################

### code chunk number 28: benchmark1

###################################################

data(interactions_example)

interactions_benchmark = string_db$benchmark_ppi(interactions_example, pathwayType = "KEGG",

max_homology_bitscore = 60, precision_window = 400, exclude_pathways = "blacklist")

###################################################

### code chunk number 29: STRINGdb.Rnw:391-393

###################################################

options(SweaveHooks=list(fig=function()

par(mar=c(4.1, 4.1, 4.1, 2.1))))

###################################################

### code chunk number 30: benchmark2

###################################################

plot(interactions_benchmark$precision, ylim=c(0,1), type="l", xlim=c(0,700),

xlab="interactions", ylab="precision")

###################################################

### code chunk number 31: benchmark3

###################################################

interactions_pathway_view = string_db$benchmark_ppi_pathway_view(interactions_benchmark, precision_threshold=0.2, pathwayType = "KEGG")

head(interactions_pathway_view)

Downloading the latest protein interaction database: STRING

ulwvfje — Thu, 28 Apr 2016 12:02:47 +0000

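STRING is the most complete and most popular database in the PPI field. Search Google for PPI and the STRING website is one of the first things you see. Their home page now uses HTML5 and looks quite polished: http://string-db.org/

It boldly advertises nearly 200 million records, but most people only care about a single species such as human, which actually accounts for fewer than ten million.

Go straight to the download page and find the human data; the species ID for human is 9606.

Downloading the full version requires a license, so here I test with the public file at the top of the list.

The data are simple: protein + protein + score, a little over eight million rows recording all the possible and credible protein interactions collected by STRING. However, the protein IDs are Ensembl IDs, so they need to be converted to gene IDs before most people can use them. Most researchers work at the gene level, so protein interactions are treated as roughly equivalent to gene interactions.

For the ID conversion I recommend the R package org.Hs.eg.db, which makes it easy.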

> library(org.Hs.eg.db)
> tmp=toTable(org.Hs.egENSEMBLPROT)
> dim(tmp)
[1] 110916      2
> head(tmp)
  gene_id         prot_id
1       1 ENSP00000263100
2       1 ENSP00000470909
3       2 ENSP00000443302
4       2 ENSP00000323929
5       2 ENSP00000438599
6       2 ENSP00000445717

About 500 protein IDs cannot be converted to a gene. This is normal: these IDs are not stable to begin with, and many become obsolete over time.
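A minimal sketch of the conversion with toy data. It assumes the links table has columns protein1, protein2, and combined_score, and that the protein IDs carry a "9606." prefix that must be removed before matching. The mapping table mimics the toTable output above; the last protein ID in the links table is invented so that one row fails to map.

```r
# Toy STRING-style links table (the final protein ID is made up).
links <- data.frame(
  protein1 = c("9606.ENSP00000263100", "9606.ENSP00000443302"),
  protein2 = c("9606.ENSP00000323929", "9606.ENSP99999999999"),
  combined_score = c(900, 450),
  stringsAsFactors = FALSE
)

# Mapping table with the same columns as toTable(org.Hs.egENSEMBLPROT).
prot2gene <- data.frame(
  gene_id = c("1", "2", "2"),
  prot_id = c("ENSP00000263100", "ENSP00000443302", "ENSP00000323929"),
  stringsAsFactors = FALSE
)

# Strip the assumed species prefix, then look up the gene ID.
# match() returns the first hit, and NA when the protein is not in the table.
to_gene <- function(string_id, map) {
  prot <- sub("^9606\\.", "", string_id)
  map$gene_id[match(prot, map$prot_id)]
}

links$gene1 <- to_gene(links$protein1, prot2gene)
links$gene2 <- to_gene(links$protein2, prot2gene)

# Rows where either protein could not be converted are dropped.
unmapped <- is.na(links$gene1) | is.na(links$gene2)
gene_links <- links[!unmapped, c("gene1", "gene2", "combined_score")]
print(gene_links)
```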

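Once converted, the data can be loaded into a database and used by other visualization or analysis programs.

A roundup of protein-protein interaction (PPI) databases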

ulwvfje — Thu, 14 Jan 2016 12:09:23 +0000

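I recently had a project that required exploring the relationships among the genes in a gene list, which made me think of databases of interactions between the proteins those genes encode. It turns out there are a great many of them!

A fairly comprehensive index is Pathguide: a compendium of PPI databases can be found at http://www.pathguide.org/.

It lists a very large number of databases. For human alone, a search returns: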

Your search returned 207 results in 9 categories with the following search parameters:

Organisms: Homo sapiens (Human)
Availability: Free to all users
Standards: all

The six major human PPI databases: Analysis of human interactome PPI data showing the coverage of six major primary databases (BIND, BioGRID, DIP, HPRD, IntAct, and MINT), according to the integration provided by the meta-database APID.

BIND	the biomolecular interaction network database	dead link
DIP	the database of interacting proteins	http://dip.doe-mbi.ucla.edu/
MINT	the molecular interaction database	http://mint.bio.uniroma2.it/mint/
STRING	Search Tool for the Retrieval of Interacting Genes/Proteins	http://string-db.org/
HPRD	Human protein reference database	http://www.hprd.org/
BioGRID	The Biological General Repository for Interaction Datasets	http://thebiogrid.org/

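Most of these databases still have maintainers and are updated continuously. Each update can yield a paper, and the papers describing these databases are typically cited more than a thousand times. If you build a database and only a dozen or so people cite it, you are essentially playing by yourself.

See: http://openwetware.org/wiki/Protein-protein_interaction_databases

One of the easier ones to use is from the University of Pittsburgh in Pennsylvania: http://severus.dbmi.pitt.edu/wiki-pi/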

http://severus.dbmi.pitt.edu/wiki-pi/index.php/pair/view/3838/7157

(a) PPI definition; a definition of a protein-to-protein interaction compared to other biomolecular relationships or associations.

(b) PPI determination by two alternative approaches: binary and co-complex; a description of the PPIs determined by the two main types of experimental technologies.

(c) The main databases and repositories that include PPIs; a description and comparison of the main databases and repositories that include PPIs, indicating the type of data that they collect with a special distinction between experimental and predicted data.

(d) Analysis of coverage and ways to improve PPI reliability; a comparative study of the current coverage on PPIs and presentation of some strategies to improve the reliability of PPI data.

(e) Networks derived from PPIs compared to canonical pathways; a practical example that compares the characteristics and information provided by a canonical pathway and the PPI network built for the same proteins. Last, a short summary and guidance for learning more is provided.

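Current protein interaction databases are still limited in coverage, but they keep growing. Data generally enters a database in one of the following four ways: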

There are four common approaches for PPI data expansions:

1) manual curation from the biomedical literature by experts;

2) automated PPI data extraction from biomedical literature with text mining methods;

3) computational inference based on interacting protein domains or co-regulation relationships, often derived from data in model organisms; and

4) data integration from various experimental or computational sources.

Partly due to the difficulty of evaluating qualities for PPI data, a majority of widely-used PPI databases, including DIP, BIND, MINT, HPRD, and IntAct, take a "conservative approach" to PPI data expansion by adding only manually curated interactions. Therefore, the coverage of the protein interactome developed using this approach is poor.

In the second literature mining approach, computer software replaces database curators to extract protein interaction (or, association) data from large volumes of biomedical literature. Due to the complexity of natural language processing techniques involved, however, this approach often generates large amount of false positive protein "associations" that are not truly biologically significant "interactions".

The challenge for the integrative approach is how to balance quality with coverage.

In particular, different databases may contain much redundant PPI information derived from the same sources, while the overlaps between independently derived PPI data sets are quite low.

References:

The HAPPI database, published in 2009: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-10-S1-S16#CR6_2544 (an integration of the HPRD [11], BIND [20], MINT [21], STRING [26], and OPHID databases)

A 2010 review: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000807

http://bib.oxfordjournals.org/content/early/2010/09/16/bib.bbq064.full