十一 23

用R的bioconductor里面的stringDB包来做PPI分析

PPI本质上是根据一系列感兴趣的蛋白质或者基因(可以是几百个甚至上千个)来去PPI数据库里面找到跟这系列蛋白质或者基因的相互作用关系!

本次的主角是stringDB,顾名思义用得是大名鼎鼎的string数据库,
本来还以为需要自己上传自己的基因给这个数据库去做分析,没想到他们也开发了R包,主页见: http://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html 而我比较喜欢用编程来解决问题,所以就学了一下这个包,非常好用!
它只需要一个3列的data.frame,分别是logFC,p.value,gene ID,就是标准的差异分析的结果。
然后用string_db$map函数给它加上一列是 string 数据库的蛋白ID,然后用string_db$add_diff_exp_color函数给它加上一列是color。
用string_db$plot_network函数画网络图,只需要 string 数据库的蛋白ID,如果需要给蛋白标记不同的颜色,需要用string_db$post_payload来把color对应到每个蛋白,然后再画网络图。
也可以直接用get_interactions函数得到所有的PPI数据,然后写入到本地,再导入到cytoscape进行画图

Continue reading

28

下载最新的蛋白相互作用数据库-STRING

string数据库是PPI领域里面最完备已经最受欢迎的数据库了。如果直接在谷歌里面搜索PPI,映入眼帘就是string的官网,它们的主页现在是html5啦,比较精美: http://string-db.org/

1

写的很霸气,近两亿的记录,不过一般大家只会关心一个物种,比如人,其实还不到一千万!

我们直接进入下载界面,找到人类的数据,人类的物种ID是9606.

2

需要一定许可才能下载完整版本,我这里测试最上面那个公开版本数据!

数据很简单,就是protein+protein+score,共八百多万行记录,记录着string数据库搜集的所有可能以及可信的蛋白相互作用!但是它的蛋白ID是ENSEMBL的ID,所以需要转换成基因的ID,才能被大多数人使用,因为大家的研究单位一般是基因,所以蛋白相互作用略等于基因相互作用。

基因ID转换,我推荐用org.Hs.eg.db这个R的包,很容易就可以实现的!

> tmp=toTable(org.Hs.egENSEMBLPROT)
> dim(tmp)
[1] 110916      2
> head(tmp)
  gene_id         prot_id
1       1 ENSP00000263100
2       1 ENSP00000470909
3       2 ENSP00000443302
4       2 ENSP00000323929
5       2 ENSP00000438599
6       2 ENSP00000445717
>

有约500多个蛋白ID是无法转换成对应的基因的,这个很正常,毕竟这种ID本来就不稳定,很多用着用着就失效了!

转换好之后就可以上传到数据库啦,然后可以供其它可视化或者分析程序使用!

 

14

蛋白质相互作用(PPI)数据库大全

最近遇到一个项目需要探究一个gene list里面的基因直接的联系,所以就想到了基因的产物蛋白的相互作用关系数据库,发现这些数据库好多好多!
一个比较综合的链接是:A compendium of PPI databases can be found in http://www.pathguide.org/.

里面的数据库非常多,仅仅是对于人类就有

Your search returned 207 results in 9 categories with the following search parameters:

人类的六个主要PPI是:Analysis of human interactome PPI data showing the coverage of six major primary databases (BIND, BioGRID, DIP, HPRD, IntAct, and MINT), according to the integration provided by the meta-database APID.
BIND the biomolecular interaction network database died link
DIP the database of interacting proteins http://dip.doe-mbi.ucla.edu/ 
MINT the molecular interaction database http://mint.bio.uniroma2.it/mint/ 
STRING Search Tool for the Retrieval of Interacting Genes/Proteins http://string-db.org/  
HPRO Human protein reference database http://www.hprd.org/ 
BioGRID The Biological General Repository for Interaction Datasets http://thebiogrid.org/ 
这些数据库大部分都还有维护者,还在持续更新,每次更新都可以发一篇paper,而数据库收集的paper引用一般都上千,如果你做了一个数据库,才十几个人引用,那就说明你是自己在跟自己玩。
其中比较好用的是宾夕法尼亚州匹兹堡的大学的一个:http://severus.dbmi.pitt.edu/wiki-pi/
(a) PPI definition; a definition of a protein-to-protein interaction compared to other biomolecular relationships or associations.
(b)PPI determination by two alternative approaches: binary and co-complex; a description of the PPIs determined by the two main types of experimental technologies.
(c) The main databases and repositories that include PPIs; a description and comparison of the main databases and repositories that include PPIs, indicating the type of data that they collect with a special distinction between experimental and predicted data.
(d) Analysis of coverage and ways to improve PPI reliability; a comparative study of the current coverage on PPIs and presentation of some strategies to improve the reliability of PPI data.
(e) Networks derived from PPIs compared to canonical pathways; a practical example that compares the characteristics and information provided by a canonical pathway and the PPI network built for the same proteins. Last, a short summary and guidance for learning more is provided.
现在的蛋白质相互作用数据库的数据都很有限,但是在持续增长,一般有下面四种原因导致数据被收录到数据库
There are four common approaches for PPI data expansions:
1) manual curation from the biomedical literature by experts;
2) automated PPI data extraction from biomedical literature with text mining methods;
3) computational inference based on interacting protein domains or co-regulation relationships, often derived from data in model organisms; and
4) data integration from various experimental or computational sources.
Partly due to the difficulty of evaluating qualities for PPI data, a majority of widely-used PPI databases, including DIP, BIND, MINT, HPRD, and IntAct, take a "conservative approach" to PPI data expansion by adding only manually curated interactions. Therefore, the coverage of the protein interactome developed using this approach is poor.
In the second literature mining approach, computer software replaces database curators to extract protein interaction (or, association) data from large volumes of biomedical literature . Due to the complexity of natural language processing techniques involved, however, this approach often generates large amount of false positive protein "associations" that are not truly biologically significant "interactions".
The challenge for the integrative approach is how to balance quality with coverage.
In particular, different databases may contain many redundant PPI information derived from the same sources, while the overlaps between independently derived PPI data sets are quite low .
参考:
2009年发表的HIPPI数据库:http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-10-S1-S16#CR6_2544 (是对HPRD [11], BIND [20], MINT [21], STRING [26], and OPHID数据库的整合)