What to do when an Ubuntu source you added is rejected by the server

ulwvfje — Fri, 24 Mar 2017 03:13:36 +0000

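A long, long time ago I wrote a series of server tutorials: http://www.bio-info-trainee.com/555.html
In it I left one question open. I never got another chance to tinker with a server, so it stayed unresolved. The problem was: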

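If the R source you add is rejected by your server, you are in trouble:
The following signatures couldn't be verified because the public key is not
In other words, apt cannot verify the repository's signature because the server does not have the matching public key.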

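For a bioinformatics beginner, Ubuntu is the most convenient Linux server, bar none, because it has apt-get.
To install R, for example, I only need to add the R mirror of Xiamen University or Peking University to the apt-get sources file, and apt-get will download and install it automatically.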

If the source you add is not trusted by your server, you are stuck, but there is a fix:

http://askubuntu.com/questions/13065/how-do-i-fix-the-gpg-error-no-pubkey

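For example, at the bottom of /etc/apt/sources.list I added Xiamen University's R source for Ubuntu 16.04 LTS:
deb http://mirrors.xmu.edu.cn/CRAN/bin/linux/ubuntu/ xenial/
Running sudo apt-get update (to refresh the sources) then hit this problem. Importing the missing key and updating again fixes it: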

ubuntu@ip-172-31-2-206:~$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9
Executing: /tmp/tmp.QcuTMmu82U/gpg.1.sh --keyserver
keyserver.ubuntu.com
--recv-keys
51716619E084DAB9
gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
gpg: key E084DAB9: public key "Michael Rutter " imported
gpg: Total number processed: 1
gpg: imported: 1 (RSA: 1)
ubuntu@ip-172-31-2-206:~$ sudo apt-get update
Hit:1 http://us-west-2.ec2.archive.ubuntu.com/ubuntu xenial InRelease
Hit:2 http://us-west-2.ec2.archive.ubuntu.com/ubuntu xenial-updates InRelease
Hit:3 http://us-west-2.ec2.archive.ubuntu.com/ubuntu xenial-backports InRelease
Get:4 http://security.ubuntu.com/ubuntu xenial-security InRelease [102 kB]
Hit:5 http://ppa.launchpad.net/webupd8team/y-ppa-manager/ubuntu xenial InRelease
Get:6 http://mirrors.xmu.edu.cn/CRAN/bin/linux/ubuntu xenial/ InRelease [3,590 B]
Fetched 106 kB in 0s (209 kB/s)
Reading package lists... Done
ubuntu@ip-172-31-2-206:~$

Problem solved!
~ sudo apt-get install r-base-core # install the R package again
~ R --version # check the R version
The installation is very slow and may take several hours.

Tutorials I have written with rmarkdown

ulwvfje — Wed, 15 Mar 2017 09:16:05 +0000

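Writing tutorials with rmarkdown is really convenient, especially for R-related topics such as using an R package, making a visualization, or running some statistics. Below is a short list of ones I have written. They mix text and figures, and best of all they take little effort: no typesetting, no uploading or organizing images.

In most cases the file name at the end of the link tells you what the article covers.

First, walkthroughs of a few R packages: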
http://www.bio-info-trainee.com/ ... software/limma.html
http://www.bio-info-trainee.com/ ... oftware/DESeq2.html
http://www.bio-info-trainee.com/ ... tware/GEOquery.html
http://www.bio-info-trainee.com/ ... are/limma_voom.html
Occasionally I also write tutorials for packages that are not on Bioconductor:
http://www.bio-info-trainee.com/ ... oftware/GOplot.html
http://www.bio-info-trainee.com/ ... ftware/Rcircos.html

Below is a walkthrough of logical analysis in statistics:

http://www.bio-info-trainee.com/tmp/tutorial_for_logical_analysis.html

Below is how to make 15 common visualizations of an expression matrix:

http://bio-info-trainee.com/tmp/basic_visualization_for_expression_matrix.html

Drawing COSMIC mutation signature plots with deconstructSigs:

http://biotrainee.com/jmzeng/markdown/deconstuctSigs.html

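This "most complete ever" ANOVA tutorial was not written by me, but it is very well done, so I will not duplicate the effort:

http://biotrainee.com/jmzeng/markdown/ANOVA.html (recommended reading)

Table of contents of a standard genetic testing report: http://www.biotrainee.com/jmzeng/blogMyGenome/name_introduction.html

Below is a batch of final project reports for high-throughput sequencing analyses:

Simple RNA-seq project report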

http://www.biotrainee.com/jmzeng/html_report/d/e/e/p/i/n/Ref_RNAseq_result/index.html

16S rDNA hypervariable region sequencing project report

http://www.biotrainee.com/jmzeng/html_report/d/e/e/p/i/n/16sRNA/index.html

Example metagenome analysis report

http://www.biotrainee.com/jmzeng/html_report/d/e/e/p/i/n/MetaGenome_result/index.html

Example bacterial genome analysis report

http://www.biotrainee.com/jmzeng/html_report/d/e/e/p/i/n/Pacbio_Genome_result/index.html

Example small RNA project report

http://www.biotrainee.com/jmzeng/html_report/d/e/e/p/i/n/SmallRNA_result/index.html

Example lncRNA project report

http://www.biotrainee.com/jmzeng/html_report/d/e/e/p/i/n/lncRNA_result/index.html

Example ChIP-Seq report

http://www.biotrainee.com/jmzeng/html_report/d/e/e/p/i/n/chip-report/index.html

Example transcriptome sequencing (de novo) project report

http://www.biotrainee.com/jmzeng/html_report/d/e/e/p/i/n/Denovo_transcriptome/index.html

Example WGCNA analysis report

http://www.biotrainee.com/jmzeng/html_report/d/e/e/p/i/n/WGCNA_Traits_result/index.html

Protein iTRAQ quantification project report

http://www.biotrainee.com/jmzeng/html_report/d/e/e/p/i/n/iTRAQ_Result/index.html

Gene symbols with odd-looking prefixes

ulwvfje — Sun, 11 Dec 2016 00:48:20 +0000

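I originally wrote this basic note for the fundamentals board of the forum, but it got very few views. I did not want it to gather dust, so I am reposting it on the blog. Original post: http://www.biotrainee.com/thread-511-1-1.html

Gene symbols are official names. They are maintained by HUGO through its Gene Nomenclature Committee, which has a dedicated database (HGNC database of human gene names | HUGO).
When analyzing data in the past, I ran into gene symbols that looked odd and puzzled me, for example:
the C orf family,
the HS. family,
the KRTAP family,
the LOC family,
the MIR family,
the LINC family.
Each family often contains several hundred genes.
C12orf44: Chromosome 12 Open Reading Frame 44. This is what the C orf names mean.
MIR genes are miRNA-related genes.
LINC genes are long intergenic non-protein coding RNAs.
LOC genes are unofficial, putative names that may later be replaced by more suitable ones.
I have compiled the full mapping: 47938 genes with their symbol, Entrez gene ID, name, and aliases. You can download it from the 生信菜鸟团 QQ group.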

Some RNA genes have no symbol at all, for example the CTA/B/C/D families:
Aliases for ENSG00000271971 Gene
CTD-2006H14.2
Quality Score for this RNA gene is 1
External Ids for ENSG00000271971 Gene
Ensembl: ENSG00000271971
Also, if you see a name starting with HS., it is a UniGene ID and no longer a gene symbol.
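
The prefix families listed above can be told apart with a few regular expressions. The following is a minimal sketch in base R. The function name classify_symbol and the exact patterns are my own assumptions about the naming conventions, not an official HGNC rule set, and the symbols in the usage line are only there as test strings.

```r
# Classify gene identifiers by prefix family (illustrative patterns, not an official rule set).
classify_symbol <- function(symbols) {
  patterns <- c(
    Corf    = "^C([0-9]{1,2}|X|Y)orf[0-9]+$",  # e.g. C12orf44
    LOC     = "^LOC[0-9]+$",                   # putative, unofficial names
    MIR     = "^MIR",                          # miRNA-related genes
    LINC    = "^LINC[0-9]+$",                  # long intergenic non-protein coding RNA
    KRTAP   = "^KRTAP",                        # KRTAP family
    UniGene = "^H[Ss]\\.[0-9]+$"               # UniGene IDs, not real symbols
  )
  family <- rep("other", length(symbols))
  for (name in names(patterns)) {
    # Only label entries that no earlier pattern has claimed.
    hit <- family == "other" & grepl(patterns[[name]], symbols)
    family[hit] <- name
  }
  family
}

families <- classify_symbol(c("C12orf44", "LOC100129534", "MIR21", "LINC00115",
                              "KRTAP1-1", "Hs.12345", "TP53"))
print(table(families))
```

Running table() on the result for a whole annotation file gives a quick count of how many identifiers fall into each family.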

Getting microarray probe-to-gene mappings with R, a trilogy: downloading the mapping from NCBI

ulwvfje — Sun, 11 Dec 2016 00:34:42 +0000

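This is part of a series; please read this first:

Getting microarray probe-to-gene mappings with R, a trilogy: Bioconductor

NCBI now hosts more than ten thousand GPL platforms, but Bioconductor has fewer than a thousand chip annotation packages. Bioconductor covers most needs, such as the Affymetrix 95 and 133 series, the 1.0 ST series, and the HTA 2.0 series. For a rarer chip, though, nobody is going to build a Bioconductor package just for it, and you have to download the GPL information from NCBI yourself. This can also be done in R: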

## In essence: download a file, read it into R, and pick the right rows and columns to get the probe-to-gene mapping. The code below should make this clear.

## A-AGIL-28 - Agilent Whole Human Genome Microarray 4x44K 014850 G4112F (85 cols x 532 rows)
library(Biobase)
library(GEOquery)
# Download the GPL file into the current directory and load it:
gpl <- getGEO('GPL6480', destdir=".")
dim(Table(gpl)) ## [1] 41108 17
colnames(Table(gpl)) ## check which columns you need
head(Table(gpl)[,c(1,6,7)])
write.csv(Table(gpl)[,c(1,6,7)],"GPL6480.csv")
#platformDB='hgu133plus2.db'
#library(platformDB, character.only=TRUE)

## GPL6102
# probeset <- featureNames(GSE32575[[1]]) # only works if GSE32575 was loaded beforehand
gpl <- getGEO('GPL6102', destdir=".")
colnames(Table(gpl)) ## check which columns you need
head(Table(gpl)[,c(1,10,13)])
probe2symbol=Table(gpl)[,c(1,13)]

## GPL15207 [PrimeView] Affymetrix Human Gene Expression Array
# probeset <- featureNames(GSE58979[[1]]) # only works if GSE58979 was loaded beforehand
gpl <- getGEO('GPL15207', destdir=".")
dim(Table(gpl)) ## [1] 49395 24
head(Table(gpl)[,c(1,15,19)]) ## check which columns you need
probe2symbol=Table(gpl)[,c(1,15)]

## GPL10558 Illumina HumanHT-12 V4.0 expression beadchip
gpl <- getGEO('GPL10558', destdir=".")
colnames(Table(gpl)) ## check which columns you need
head(Table(gpl)[,c(1,10,13)])
probe2symbol=Table(gpl)[,c(1,13)]
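
Once the right columns are chosen, the two-column table is easiest to use as a named lookup vector. Here is a minimal base-R sketch of that step. It uses a small made-up data frame in place of Table(gpl), so the probe IDs, the column names ID and SYMBOL, and the helper make_probe2symbol are all hypothetical.

```r
# Toy stand-in for Table(gpl); probe IDs and column names are made up for illustration.
gpl_table <- data.frame(
  ID     = c("probe_1", "probe_2", "probe_3", "probe_4"),
  SYMBOL = c("TP53", "", "BRCA1", NA),
  stringsAsFactors = FALSE
)

# Build a named character vector: names are probe IDs, values are gene symbols.
# Probes without a symbol (empty string or NA) are dropped.
make_probe2symbol <- function(tab, id_col, symbol_col) {
  symbol <- tab[[symbol_col]]
  keep <- !is.na(symbol) & symbol != ""
  setNames(symbol[keep], tab[[id_col]][keep])
}

probe2symbol <- make_probe2symbol(gpl_table, "ID", "SYMBOL")

# Look up the symbols for a set of probes; unannotated probes come back as NA.
wanted <- c("probe_3", "probe_2", "probe_1")
print(unname(probe2symbol[wanted]))
```

With a real platform you would pass Table(gpl) and the column names you found with colnames(), since those differ from one GPL to the next.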

Modifying the ES score plot from the Java version of GSEA

ulwvfje — Thu, 01 Dec 2016 16:53:10 +0000

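Before you can modify the ES score plot, you need to understand the data behind it. The Java version is a closed, prepackaged program, so we cannot change anything it does not expose as a parameter. After running GSEA, the default output plot looks like this:

[Figure: ES score]

When it comes to publication, you find that this image is actually rather blurry, so you may need to redraw it yourself, and for that you need to understand the data behind it.

The bottom panel: suppose the assay measured 20,000 genes. Each gene has a difference metric between the case and control groups (there are six metrics; the default is signal2noise, whose formula is given on the GSEA website, and you can also choose the familiar fold change). The genes are sorted by this metric, and the panel shows the sorted values after Z-score standardization.

The middle panel marks where the genes of this gene set fall among the 20,000 measured genes sorted by signal2noise.

The top panel is the running ES score: walking down the sorted list, the per-gene contributions are accumulated one by one. The value at which the running score deviates most from zero is the final ES score of the gene set.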
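
To make the running score concrete, here is a minimal base-R sketch on a toy ranked list. It follows the usual weighted running-sum idea (step up at genes in the set, in proportion to the absolute metric; step down by a constant elsewhere). The function name running_es and the toy numbers are my own, and this is an illustration under those assumptions, not the code shipped with GSEA.

```r
# ranked_scores: difference metric for all genes, already sorted in decreasing order
# in_set: logical vector, TRUE where the gene belongs to the gene set
# p: weight exponent applied to the metric
running_es <- function(ranked_scores, in_set, p = 1) {
  stopifnot(length(ranked_scores) == length(in_set), any(in_set), !all(in_set))
  weight <- abs(ranked_scores)^p
  step_up <- ifelse(in_set, weight, 0)
  step_up <- step_up / sum(step_up)                 # hits sum to 1
  step_down <- ifelse(in_set, 0, 1 / sum(!in_set))  # misses sum to 1
  res <- cumsum(step_up - step_down)                # running ES score
  peak <- which.max(abs(res))                       # largest deviation from zero
  list(RES = res, ES = res[peak], arg.ES = peak)
}

toy_scores <- c(3, 2, 1, -1, -2, -3)
toy_in_set <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
toy <- running_es(toy_scores, toy_in_set)
print(round(toy$RES, 3))
print(toy$arg.ES)
```

In this toy case the curve peaks at the second gene, so the gene set would be drawn as enriched at the top of the list. The three returned elements play the same roles as one row of Obs.RES, one entry of Obs.ES, and one entry of Obs.arg.ES in the plotting code further down.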

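I went through the plotting function in the R code provided on the GSEA website in full, as follows:

I also extracted the function itself:

This topic is a bit involved. I can explain clearly what the data are, but I cannot explain in a blog post how they are produced (that is, the txt files read by the code below): you have to modify a 2500-line source file to export them.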

setwd('data')
Obs.RES=read.table('Obs.RES.txt')
Obs.RES=t(Obs.RES) ## running ES score of every gene in every gene set, as a matrix
Obs.indicator=read.table('Obs.indicator.txt')
Obs.indicator=t(Obs.indicator) ## whether each gene belongs to each gene set, a 0/1 matrix
obs.s2n=read.table('obs.s2n.txt')[,1] ## signal2noise value of every gene, already Z-scored and sorted
size.G=read.table('size.G.txt')[,1] ## number of genes in each gene set, shown in the plot
gs.names=read.table('gs.names.txt')[,1] ## name of each gene set, shown in the plot
Obs.arg.ES=read.table('Obs.arg.ES.txt')[,1] ## position in the sorted gene list where each gene set reaches its peak ES score
Obs.ES.index=read.table('Obs.ES.index.txt')[,1] ## not actually needed; I have forgotten what it is
Obs.ES=read.table('Obs.ES.txt')[,1] ## peak ES score of each gene set: positive values are drawn in red (enriched in the case group), negative values in blue (enriched in the control group)

## Obs.ES is now passed in explicitly (last argument) instead of being read from the global environment
plot_ES_score <- function(Ng=12,N=34688,phen1='control',phen2='case',Obs.RES,Obs.indicator,obs.s2n,size.G,gs.names,Obs.arg.ES,Obs.ES.index,Obs.ES){
  for (i in 1:Ng) {
    png(paste0('number_',gs.names[i],'.png'))
    ind <- 1:N
    min.RES <- min(Obs.RES[i,])
    max.RES <- max(Obs.RES[i,])
    if (max.RES < 0.3) max.RES <- 0.3
    if (min.RES > -0.3) min.RES <- -0.3
    delta <- (max.RES - min.RES)*0.50
    min.plot <- min.RES - 2*delta
    max.plot <- max.RES
    max.corr <- max(obs.s2n)
    min.corr <- min(obs.s2n)
    Obs.correl.vector.norm <- (obs.s2n - min.corr)/(max.corr - min.corr)*1.25*delta + min.plot
    zero.corr.line <- (- min.corr/(max.corr - min.corr))*1.25*delta + min.plot
    col <- ifelse(Obs.ES[i] > 0, 2, 4)

    # Running enrichment plot

    sub.string <- paste("Number of genes: ", N, " (in list), ", size.G[i], " (in gene set)", sep = "", collapse="")

    main.string <- paste("Gene Set ", i, ":", gs.names[i])

    plot(ind, Obs.RES[i,], main = main.string, sub = sub.string, xlab = "Gene List Index", ylab = "Running Enrichment Score (RES)", xlim=c(1, N), ylim=c(min.plot, max.plot), type = "l", lwd = 2, cex = 1, col = col)
    for (j in seq(1, N, 20)) {
      lines(c(j, j), c(zero.corr.line, Obs.correl.vector.norm[j]), lwd = 1, cex = 1, col = colors()[12]) # shading of correlation plot
    }
    lines(c(1, N), c(0, 0), lwd = 1, lty = 2, cex = 1, col = 1) # zero RES line
    lines(c(Obs.arg.ES[i], Obs.arg.ES[i]), c(min.plot, max.plot), lwd = 1, lty = 3, cex = 1, col = col) # max enrichment vertical line
    for (j in 1:N) {
      if (Obs.indicator[i, j] == 1) {
        lines(c(j, j), c(min.plot + 1.25*delta, min.plot + 1.75*delta), lwd = 1, lty = 1, cex = 1, col = 1) # enrichment tags
      }
    }
    lines(ind, Obs.correl.vector.norm, type = "l", lwd = 1, cex = 1, col = 1)
    lines(c(1, N), c(zero.corr.line, zero.corr.line), lwd = 1, lty = 1, cex = 1, col = 1) # zero correlation horizontal line
    temp <- order(abs(obs.s2n), decreasing=TRUE)
    arg.correl <- temp[N]
    lines(c(arg.correl, arg.correl), c(min.plot, max.plot), lwd = 1, lty = 3, cex = 1, col = 3) # zero crossing correlation vertical line

    leg.txt <- paste("\"", phen1, "\" ", sep="", collapse="")
    text(x=1, y=min.plot, adj = c(0, 0), labels=leg.txt, cex = 1.0)

    leg.txt <- paste("\"", phen2, "\" ", sep="", collapse="")
    text(x=N, y=min.plot, adj = c(1, 0), labels=leg.txt, cex = 1.0)

    adjx <- ifelse(Obs.ES[i] > 0, 0, 1)

    leg.txt <- paste("Peak at ", Obs.arg.ES[i], sep="", collapse="")
    text(x=Obs.arg.ES[i], y=min.plot + 1.8*delta, adj = c(adjx, 0), labels=leg.txt, cex = 1.0)

    leg.txt <- paste("Zero crossing at ", arg.correl, sep="", collapse="")
    text(x=arg.correl, y=min.plot + 1.95*delta, adj = c(adjx, 0), labels=leg.txt, cex = 1.0)
    dev.off()
  }

}

With this code you can redraw the ES score plot for every gene set. If you need to change the font sizes, adjust them step by step in the code.
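
Since the original complaint was a blurry image, one option is to write to a vector device and scale the text through the cex arguments. The sketch below is only an illustration of that idea on a toy curve. The helper draw_running_curve and its arguments are hypothetical, and swapping png() for pdf() is my suggestion rather than something the GSEA code does.

```r
# Draw a running-score curve into a PDF (vector output, so it stays sharp when scaled).
# text_scale is passed to the cex.* arguments to enlarge titles, labels and axis text.
draw_running_curve <- function(file, running_score, main = "Toy gene set", text_scale = 1.2) {
  pdf(file, width = 6, height = 4.5)
  on.exit(dev.off())                      # close the device even if plotting fails
  plot(seq_along(running_score), running_score, type = "l", lwd = 2, col = 2,
       main = main, xlab = "Gene List Index",
       ylab = "Running Enrichment Score (RES)",
       cex.main = text_scale, cex.lab = text_scale, cex.axis = text_scale)
  abline(h = 0, lty = 2)                  # zero RES line
  invisible(file)
}

toy_curve <- cumsum(c(0.4, 0.3, -0.2, -0.2, 0.1, -0.4))
out_file <- draw_running_curve(tempfile(fileext = ".pdf"), toy_curve)
print(file.exists(out_file))
```

The same two changes (a different device call and explicit cex values) could be applied inside plot_ES_score in place of the png() call and the cex = 1 arguments.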

How to install an unpublished package developed by someone else

ulwvfje — Tue, 29 Nov 2016 23:49:52 +0000

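I thought that once I had finished "The ultimate solution for R packages!" there would be no more questions about installing R packages. Looking back, though, everything I covered was about installing officially released packages from CRAN or Bioconductor. Many people develop private packages of their own, so what do you do when you really need one of those? You download the package and install it yourself. Take my own package on GitHub as an example: https://github.com/jmzeng1314/humanid

First, make sure the package is clean and safe, and then make sure you really need it, because most of the time you only need a single function from it. If you do need to install it, install git and run git clone https://github.com/jmzeng1314/humanid.git, which downloads the package into the directory you choose. If you would rather not install git, you can download it from the GitHub web page as a zip archive.

The downloaded package contains a file with the .Rproj extension. Double-click it to open the project in RStudio, then click Build to install the package, as shown:

Once installed, the package is loaded automatically, and you can see the functions and data it contains. You have successfully taken delivery of someone else's code!

Does that mean the package is ready to use? Not quite. A properly published package usually declares all its dependencies, so installing it prompts you to install whatever else it needs. Mine does not: only when you run one of my functions does it fail and tell you which package is missing. This is admittedly a bit silly. I was too lazy to write the dependency declarations, or rather, I had not learned how yet.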

> keggAnno()
Error in loadNamespace(name) : there is no package called ‘DT’
>

There is an upside: you can always get my package installed, even though it may not work once installed.

> install.packages('DT')
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.3/DT_0.2.zip'
Content type 'application/zip' length 950203 bytes (927 KB)
downloaded 927 KB

package ‘DT’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\jimmy\AppData\Local\Temp\RtmpoT4VWl\downloaded_packages
>

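DT installs easily, and then my function works. keggAnno() prints nothing when it succeeds, but if you check the current directory you will find a new KEGG annotation file. The function annotates your genes of interest against the KEGG database in batch.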
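
Instead of waiting for the "there is no package called" error, you can check up front which packages are missing. A minimal sketch in base R follows. The helper name missing_packages is my own, and the package list is only an example: DT is the one dependency the error message above reveals, and the package may need others.

```r
# Return the subset of pkgs that cannot be loaded on this machine.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

# "DT" comes from the error shown above; extend the vector as further errors appear.
needed <- c("stats", "DT")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install whatever is missing from CRAN:
# if (length(to_install) > 0) install.packages(to_install)
```

requireNamespace() only checks that a package can be loaded and does not attach it, so the check has no side effects on your search path.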

How to develop your own R package

ulwvfje — Tue, 29 Nov 2016 23:40:30 +0000

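As R grows more popular, developing an R package is no longer a skill reserved for professional programmers. I am not talking about writing a package full of complex statistical formulas or one that gets you an SCI paper, just about using RStudio's built-in package creation feature to bundle a few of your own functions and data sets. My package is on GitHub: https://github.com/jmzeng1314/humanid

I started by searching for material, such as:

https://support.rstudio.com/hc/en-us/articles/200486488-Developing-Packages-with-RStudio

With RStudio, a few mouse clicks are enough to create your own R package.

First, read through the following 4 tutorials yourself

The main point is not how to create the package, but how to write the functions, the readme, and the data inside it. If the documents above feel dry, there are videos on YouTube that explain how to create an R package in ten-odd minutes:

http://stat545.com/packages05_foofactors-package-02.html

It is best to keep the package in sync with your GitHub account. For that, first see https://www.r-bloggers.com/rstudio-and-github/

If you do not want to read the tutorials I went through, read my version instead.

First install devtools and roxygen2, two packages that help with package development. Then, in RStudio's File menu, choose New Project, then New Directory, then R Package, and that is it. At this point you have created a package of your own, and it even ships with one function. Such a package is of course meaningless; it is only there to show you what developing an R package involves. From here you can add your own functions and data. To make them easy to understand, write detailed help documentation for each function. If your package calls other public packages, declare those dependencies clearly, as shown in the figure: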
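
If you want to see what such a skeleton contains without clicking through RStudio, base R can generate a comparable one with utils::package.skeleton(). This is a different route from the RStudio wizard described above, shown only as a sketch. The package name toyid and the function hello are made up for the example.

```r
# Put one toy function into its own environment so only it ends up in the package.
pkg_env <- new.env()
pkg_env$hello <- function(name = "world") paste0("Hello, ", name, "!")

# Create the skeleton inside a fresh temporary directory.
pkg_root <- tempfile("pkgdemo_")
dir.create(pkg_root)
utils::package.skeleton(name = "toyid", list = "hello",
                        environment = pkg_env, path = pkg_root)

# Inspect what was generated: DESCRIPTION, NAMESPACE, R/ and man/.
print(list.files(file.path(pkg_root, "toyid"), recursive = TRUE))
```

The DESCRIPTION file is where dependencies on other packages are declared (the Imports field), which is the step the previous post admits to skipping.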

Drawing network graphs in R, a trilogy: igraph

ulwvfje — Mon, 28 Nov 2016 10:08:59 +0000

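Thanks to a reminder from a helpful reader, I realized that my earlier trilogy on drawing network graphs in R had left out the most basic package of all, igraph. Without it, the other two have nothing to stand on.

Drawing network graphs in R, a trilogy: networkD3

Drawing network graphs in R, a trilogy: sna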

Package manual: https://cran.r-project.org/web/packages/igraph/igraph.pdf

Package examples: https://www.r-project.org/conferences/useR-2008/slides/Csardi.pdf

Function reference: http://igraph.org/r/doc/

PPI example: http://a-little-book-of-r-for-bioinformatics.readthedocs.io/en/latest/src/chapter11.html

Three packages are actually involved: igraph/RBGL/Rgraphviz

The test data is a prebuilt PPI network object: We will first analyse a curated data set of protein-protein interactions in the yeast Saccharomyces cerevisiae extracted from published papers. This data set comes with an R package called "yeastExpData", which calls the data set "litG". This data was first described in a paper by Ge et al (2001) in Nature Genetics (http://www.nature.com/ng/journal/v29/n4/full/ng776.html).

The key points are how to construct a graphNEL graph object and how to process it with functions.

On construction: building the network object is the main step, and graph.data.frame + as_graphnel is all it takes. A whole family of packages built on network objects needs this step; once you know it, the rest follows.

Read the PPI data into a data.frame, for example my_edges

tmp <- graph.data.frame(my_edges)

tmp;summary(tmp)

plot(tmp, layout=layout.kamada.kawai)

subnet <- as_graphnel(tmp)

The resulting subnet is now a network object:

> subnet

A graphNEL graph with directed edges

Number of Nodes = 818

Number of Edges = 12249

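With this network object you can use BioNet to search for the maximal-scoring subgraph.

Other functions for working with a network object:

mynodes <- nodes(litG) # all nodes in the network

adj(litG, "YBR009C") # the nodes adjacent to YBR009C, that is, its edges

mydegrees <- graph::degree(litG) # degree of every node in the network

table(mydegrees);mean(mydegrees);hist(mydegrees, col="red") # look at the degree distribution

In a large network, not all nodes are connected to one another. The RBGL package shows which nodes are cut off from the rest.

library("RBGL")
myconnectedcomponents <- connectedComp(litG)

Each element of the returned list myconnectedcomponents is one connected component, separated from the others. You can look for the largest component, or find the component that contains a particular node.

component3 <- myconnectedcomponents[[3]]

mysubgraph <- subGraph(component3, litG) # take the chosen component as a graphNEL object; in effect, extract a subnetwork by its nodes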
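
To see what degree and connected components mean without installing the graph packages, here is a small base-R sketch on a toy edge list. It is an illustrative reimplementation, not how graph::degree or RBGL::connectedComp work internally. The node names and the helper connected_components are made up.

```r
# Toy undirected edge list: A-B-C form one component, D-E another, F only has a self-loop.
toy_edges <- data.frame(from = c("A", "B", "D", "F"),
                        to   = c("B", "C", "E", "F"),
                        stringsAsFactors = FALSE)

# Degree: count how often each node appears as an endpoint, ignoring self-loops.
no_loops <- toy_edges[toy_edges$from != toy_edges$to, ]
toy_degree <- table(c(no_loops$from, no_loops$to))

# Connected components by label propagation: every node starts with its own label,
# and both ends of each edge repeatedly take the smaller label until nothing changes.
connected_components <- function(edges) {
  nodes <- sort(unique(c(edges$from, edges$to)))
  label <- setNames(seq_along(nodes), nodes)
  repeat {
    changed <- FALSE
    for (k in seq_len(nrow(edges))) {
      a <- edges$from[k]
      b <- edges$to[k]
      smallest <- min(label[[a]], label[[b]])
      if (label[[a]] != smallest || label[[b]] != smallest) {
        label[[a]] <- smallest
        label[[b]] <- smallest
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  unname(split(names(label), label))
}

toy_components <- connected_components(toy_edges)
print(toy_degree)
print(toy_components)
```

The result is a list with one character vector of node names per component, which is the same shape as the list returned by connectedComp above.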

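The following code draws the network:

library("Rgraphviz")
mysubgraph <- subGraph(component3, litG)
mygraphplot <- layoutGraph(mysubgraph, layoutType="neato")
renderGraph(mygraphplot)

You can also look for communities in a network, which is another term from network research: http://en.wikipedia.org/wiki/Community_structure

Clustering is possible too, along with much more that I will not cover one by one.

Connected component, used above, is also a term from network research: http://en.wikipedia.org/wiki/Connected_component_(graph_theory)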

Finding the maximal-scoring subgraph with the Bioconductor package BioNet

ulwvfje — Fri, 25 Nov 2016 14:54:20 +0000

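## This package addresses a hard problem, the maximal-scoring subgraph (MSS) problem: finding significantly differentially expressed subnetworks inside a huge, complex network. In practice, you have a few hundred differentially expressed genes, you build a network from a PPI database, and it is still enormous, so you use this package to trim the network down.

"Heuristically" here means using a rule-of-thumb search that gives a good approximate answer rather than a guaranteed optimum.

## The package can also combine several kinds of results to score a network.

Package home page: https://www.bioconductor.org/packages/release/bioc/html/BioNet.html

Paper: BioNet: an R-Package for the Functional Analysis of ... - Bioinformatics

It combines PPI network analysis with the search for functional modules.

Script: https://www.bioconductor.org/packages/release/bioc/vignettes/BioNet/inst/doc/Tutorial.R

Tutorial: https://www.bioconductor.org/packages/release/bioc/vignettes/BioNet/inst/doc/Tutorial.pdf

The core idea is to find the MSS from an "igraph" or "graphNEL" object plus a set of scores: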

subnet <- subNetwork(dataLym$label, interactome)

module <- runFastHeinz(subnet, scores)

plotModule(module, scores=scores, diff.expr=logFC) # this is our trimmed-down network

Another function offers similar functionality: dNetFind https://rdrr.io/cran/dnet/man/dNetFind.html

## The networks used here are all graph objects: A graph object, either in graphNEL or igraph format.

## First load the packages and the built-in data

library(BioNet)

library(DLBCL)

data(dataLym)

data(interactome)

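## dataLym holds per-gene p-values from three sources, t, s and o (further down, s.pval is used as the survival p-value)

## interactome is a built-in PPI network object; you can extract part of it for a given gene list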

pvals <- cbind(t=dataLym$t.pval, s=dataLym$s.pval)

rownames(pvals) <- dataLym$label

pval <- aggrPvals(pvals, order=2, plot=FALSE)

## Take the t and s p-values and combine them into a single p-value with aggrPvals

subnet <- subNetwork(dataLym$label, interactome)

subnet <- rmSelfLoops(subnet)

subnet

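## The network is extracted for the genes in dataLym$label. The labels look a bit odd, e.g. TP53(7157), apparently a gene symbol combined with an Entrez ID.

## rmSelfLoops is standard practice: any network should be processed this way to remove self-loops

## Because the genes in dataLym$label are limited in number, the extracted network usually has about a thousand nodes and some ten thousand edges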

fb <- fitBumModel(pval, plot=FALSE)

## Fit a Beta-Uniform-Mixture (BUM) model to the aggregated per-gene p-values.

scores <- scoreNodes(subnet, fb, fdr=0.001)

module <- runFastHeinz(subnet, scores)

## Here we use a fast heuristic approach to calculate an approximation to the optimal scoring subnetwork.

logFC <- dataLym$diff

names(logFC) <- dataLym$label

plotModule(module, scores=scores, diff.expr=logFC)

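## diff.expr sets the node colors

## scores sets the node shapes

## plotModule is a customized plot for graphNEL or igraph objects; you can also plot directly with the igraph package.

## The network can also be exported in a Cytoscape format and drawn with Cytoscape

## In general, red is up-regulated, green is down-regulated, circles have positive scores, and diamonds have negative scores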
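
The aggrPvals step in this example combines several p-values per gene through an order statistic. The base-R sketch below shows that idea under a stated assumption: if the n p-values of a gene are independent and uniform under the null, the k-th smallest follows a Beta(k, n - k + 1) distribution, so its Beta CDF value serves as the combined p-value. The function aggregate_pvalues and the toy numbers are mine, and I have not checked this line by line against BioNet's source.

```r
# pvals: matrix with one row per gene and one column per p-value source
# order: which order statistic to use (order = 2 of 2 columns means the larger p-value)
aggregate_pvalues <- function(pvals, order = ncol(pvals)) {
  n <- ncol(pvals)
  stopifnot(order >= 1, order <= n)
  kth_smallest <- apply(pvals, 1, function(p) sort(p)[order])
  # CDF of the k-th order statistic of n independent uniform p-values
  pbeta(kth_smallest, shape1 = order, shape2 = n - order + 1)
}

toy_pvals <- rbind(geneA = c(t = 0.01, s = 0.04),
                   geneB = c(t = 0.50, s = 0.20))
toy_agg <- aggregate_pvalues(toy_pvals, order = 2)
print(toy_agg)
```

With two sources and order = 2, the combined value is simply the larger p-value squared, so a gene only scores well when both sources agree.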

## Below is a practical example of network analysis with the BioNet package

library(BioNet)

library(DLBCL)

data(exprLym)

data(interactome)

exprLym ## a built-in object, so its gene labels already match what interactome expects

interactome

network <- subNetwork(featureNames(exprLym), interactome)

network

network <- largestComp(network)

## The function extracts the largest component of a network

network

library(genefilter)

library(impute)

expressions <- impute.knn(exprs(exprLym))$data

## The matrix returned by exprs() contains missing values, so impute.knn is used to impute the missing expression data

## Use rowttests from the genefilter package for the differential analysis

t.test <- rowttests(expressions, fac=exprLym$Subgroup)

t.test[1:10, ]

data(dataLym)

ttest.pval <- t.test[, "p.value"]

surv.pval <- dataLym$s.pval

names(surv.pval) <- dataLym$label

pvals <- cbind(ttest.pval, surv.pval)

pval <- aggrPvals(pvals, order=2, plot=FALSE)

fb <- fitBumModel(pval, plot=FALSE)

## Plot what fitBumModel actually fitted

dev.new(width=13, height=7)

par(mfrow=c(1,2))

hist(fb)

plot(fb)

dev.off()

## The plot below shows how the two parameters of the Beta-Uniform-Mixture (BUM) model behave

plotLLSurface(pval, fb)

scores <- scoreNodes(network=network, fb=fb, fdr=0.001)

## Score each node from its p-value

network <- rmSelfLoops(network)

## Write the network to txt files, which can then be imported into Cytoscape

writeHeinzEdges(network=network, file="lymphoma_edges_001", use.score=FALSE)

writeHeinzNodes(network=network, file="lymphoma_nodes_001", node.scores = scores)

datadir <- file.path(path.package("BioNet"), "extdata")

dir(datadir)

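## A different algorithm this time: the heinz algorithm is used to calculate the maximum-scoring subnetwork

## The file read below has to be produced by the heinz.py script; this example uses the data shipped with the package

## The script call is: heinz.py -e lymphoma_edges_001.txt -n lymphoma_nodes_001.txt -N True -E False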

module <- readHeinzGraph(node.file=file.path(datadir, "lymphoma_nodes_001.txt.0.hnz"), network=network)

diff <- t.test[, "dm"]

names(diff) <- rownames(t.test)

plotModule(module, diff.expr=diff, scores=scores)

sum(scores[nodes(module)])

sum(scores[nodes(module)]>0)

sum(scores[nodes(module)]<0)
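
The fitBumModel and scoreNodes steps above can be sketched in a few lines. As I understand the BUM approach, p-values are modeled as a mixture of a uniform part (noise) and a Beta(a, 1) part (signal), an FDR level is turned into a p-value threshold, and each node gets a positive score if its p-value is below that threshold. The formulas below are the ones I recall from the BUM literature, and the parameter values are made up, so treat this as an assumption-laden illustration rather than BioNet's exact implementation.

```r
# BUM density: lambda is the uniform (noise) fraction, a in (0, 1) is the Beta shape.
bum_density <- function(x, lambda, a) lambda + (1 - lambda) * a * x^(a - 1)

# P-value threshold tau that corresponds to a chosen FDR under the fitted model.
fdr_threshold <- function(fdr, lambda, a) {
  pi_upper <- lambda + (1 - lambda) * a   # upper bound on the noise fraction
  ((pi_upper - fdr * lambda) / (fdr * (1 - lambda)))^(1 / (a - 1))
}

# Node score: positive below the threshold, negative above it, zero exactly at it.
node_score <- function(p, fdr, lambda, a) {
  (a - 1) * (log(p) - log(fdr_threshold(fdr, lambda, a)))
}

# Hypothetical fitted parameters, for illustration only.
lambda_hat <- 0.5
a_hat <- 0.3
tau <- fdr_threshold(0.01, lambda_hat, a_hat)
print(tau)
print(node_score(c(tau / 10, tau, 0.5), fdr = 0.01, lambda = lambda_hat, a = a_hat))
```

A stricter FDR lowers the threshold, so fewer nodes get positive scores and the resulting module shrinks. That is the knob being turned when fdr=0.001 is used here and fdr=1e-14 in the ALL example.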

library(BioNet)

library(DLBCL)

library(ALL)

data(ALL)

data(interactome)

## ALL is data from another package. It has probe IDs rather than gene IDs, so they must be mapped to IDs that BioNet recognizes.

mapped.eset <- mapByVar(ALL, network=interactome, attr="geneID")

mapped.eset[1:5,1:5]

length(intersect(rownames(mapped.eset), nodes(interactome)))

network <- subNetwork(rownames(mapped.eset), interactome)

network

network <- largestComp(network)

network <- rmSelfLoops(network)

network

## Use limma for the differential analysis here

library(limma)

design <- model.matrix(~ -1+ factor(c(substr(unlist(ALL$BT), 0, 1))))

colnames(design)<- c("B", "T")

contrast.matrix <- makeContrasts(B-T, levels=design)

contrast.matrix

fit <- lmFit(mapped.eset, design)

fit2 <- contrasts.fit(fit, contrast.matrix)

fit2 <- eBayes(fit2)

pval <- fit2$p.value[,1]

fb <- fitBumModel(pval, plot=FALSE)

dev.new(width=13, height=7)

par(mfrow=c(1,2))

hist(fb)

plot(fb)

scores <- scoreNodes(network=network, fb=fb, fdr=1e-14)

## Again write the network to local files for import into Cytoscape

writeHeinzEdges(network=network, file="ALL_edges_001", use.score=FALSE)

writeHeinzNodes(network=network, file="ALL_nodes_001", node.scores = scores)

## Again the heinz algorithm is used to calculate the maximum-scoring subnetwork

## A new implementation Heinz v2.0 is also available at https://software.cwi.nl/software/heinz ,

datadir <- file.path(path.package("BioNet"), "extdata")

module <- readHeinzGraph(node.file=file.path(datadir, "ALL_nodes_001.txt.0.hnz"), network=network)

nodeDataDefaults(module, attr="diff") <- ""

nodeData(module, n=nodes(module), attr="diff") <- fit2$coefficients[nodes(module),1]

nodeDataDefaults(module, attr="score") <- ""

nodeData(module, n=nodes(module), attr="score") <- scores[nodes(module)]

nodeData(module)[1]

## Save as an XGMML file for use in Cytoscape

saveNetwork(module, file="ALL_module", type="XGMML")

## In general, red is up-regulated, green is down-regulated, circles have positive scores, and diamonds have negative scores

MySQL tables turn out to have a maximum column count

ulwvfje — Mon, 07 Nov 2016 12:19:33 +0000

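I wanted to load the TCGA RPKM matrix into MySQL and build a query web page for biologists. I downloaded GSE62944, a data set of all the mRNA expression data collected by TCGA, with 9264 tumor samples and 741 normal tissue samples. When I tried to write the tumor expression matrix, I got an error: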

Error in .local(conn, statement, ...) :

could not run statement: Too many columns

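A quick search showed that MySQL limits the number of columns per table. I am not much of a computer person, so I did not really work out how to adjust the settings to raise the limit: http://dev.mysql.com/doc/refman/5.7/en/column-count-limit.html So I split the tumor expression matrix by cancer type instead. The sample counts per cancer type are: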

table(tumorCancerType2amples$CancerType)

ACC BLCA BRCA CESC COAD DLBC GBM HNSC KICH KIRC KIRP LAML LGG LIHC LUAD LUSC OV PRAD READ SKCM STAD THCA UCEC UCS

79 414 1119 306 483 48 170 504 66 542 291 178 532 374 541 502 430 502 167 472 420 513 554 57

I wrote each cancer type to MySQL separately. The solution and code are below:

tumorRPKM=read.table('GSM1536837_06_01_15_TCGA_24.tumor_Rsubread_FPKM.txt.gz',sep = '\t',stringsAsFactors = F,header = T)

colnames(tumorRPKM)[1]='geneSymbol'

rownames(tumorRPKM)=tumorRPKM$geneSymbol

tumorRPKM=tumorRPKM[,-1]

tumorRPKM=round( as.matrix(tumorRPKM),3)

tumorRPKM=as.data.frame(tumorRPKM)

tumorRPKM$geneSymbol = rownames(tumorRPKM)

#load(file = 'tumorRPKM.rData')

tumorCancerType2amples=read.table('GSE62944_06_01_15_TCGA_24_CancerType_Samples.txt',sep = '\t',stringsAsFactors = F)

colnames(tumorCancerType2amples)=c('sampleID','CancerType')

lapply(unique((tumorCancerType2amples$CancerType)), function(x){

#x='PRAD';

sampleList=tumorCancerType2amples[tumorCancerType2amples$CancerType==x,1]

sampleList=gsub("-",".", sampleList)

tmpMatrix=tumorRPKM[,c('geneSymbol',sampleList)]

dbWriteTable(con, paste('tumor',x,'RPKM',sep='_'), tmpMatrix, append=F,row.names=F)

})

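dbWriteTable comes from RMySQL, so you need to load that package, and con must be an open connection to your MySQL database. Without those two pieces the code above will not make sense.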
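
The splitting step itself does not need a database, so it can be tried on a toy table first. Below is a minimal base-R sketch. The helper split_columns_by_group, the toy sample names, and the max_cols default of 4096 are assumptions for illustration. The limit that actually applies depends on your MySQL version and storage engine, so check the page linked above.

```r
# Split a wide table into one piece per group, keeping the id column in every piece.
# sample_groups: named character vector, names are column names, values are group labels.
split_columns_by_group <- function(df, id_col, sample_groups, max_cols = 4096) {
  samples_by_group <- split(names(sample_groups), unname(sample_groups))
  pieces <- lapply(samples_by_group, function(samples) {
    df[, c(id_col, samples), drop = FALSE]
  })
  widths <- vapply(pieces, ncol, integer(1))
  if (any(widths > max_cols)) {
    stop("still too wide: ", paste(names(widths)[widths > max_cols], collapse = ", "))
  }
  pieces
}

# Toy expression table: two genes, five samples from two cancer types.
toy_expr <- data.frame(geneSymbol = c("g1", "g2"),
                       s1 = c(1, 2), s2 = c(3, 4), s3 = c(5, 6),
                       s4 = c(7, 8), s5 = c(9, 10),
                       stringsAsFactors = FALSE)
toy_groups <- c(s1 = "ACC", s2 = "ACC", s3 = "BLCA", s4 = "BLCA", s5 = "BLCA")

toy_pieces <- split_columns_by_group(toy_expr, "geneSymbol", toy_groups, max_cols = 4)
print(lapply(toy_pieces, colnames))
```

Each element of the returned list could then be passed to dbWriteTable under its own table name, as the lapply above does with paste('tumor', x, 'RPKM', sep='_').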

The tables written to the database look like this: