
Cancer types in the TCGA database and lists of cancer-related genes

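The TCGA projects cover a large number of cancer types, but analyses usually describe a data set as pan-cancer 12, pan-cancer 17, or pan-cancer 21, according to how many cancer types it includes. Papers generally list the cancers by abbreviation or by full name: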

BLCA, BRCA, COADREAD, GBM, HNSC, KIRC, LAML, LGG, LUAD, LUSC, OV, PRAD, SKCM, STAD, THCA, UCEC.
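
When a paper only gives the abbreviations, a small lookup table makes them easier to read. The sketch below maps the codes listed above to the standard TCGA study names; the helper name expand_tcga is my own and purely illustrative.

```r
# TCGA study abbreviations from the list above, mapped to full study names.
tcga_names <- c(
  BLCA     = "Bladder urothelial carcinoma",
  BRCA     = "Breast invasive carcinoma",
  COADREAD = "Colorectal adenocarcinoma (COAD and READ combined)",
  GBM      = "Glioblastoma multiforme",
  HNSC     = "Head and neck squamous cell carcinoma",
  KIRC     = "Kidney renal clear cell carcinoma",
  LAML     = "Acute myeloid leukemia",
  LGG      = "Brain lower grade glioma",
  LUAD     = "Lung adenocarcinoma",
  LUSC     = "Lung squamous cell carcinoma",
  OV       = "Ovarian serous cystadenocarcinoma",
  PRAD     = "Prostate adenocarcinoma",
  SKCM     = "Skin cutaneous melanoma",
  STAD     = "Stomach adenocarcinoma",
  THCA     = "Thyroid carcinoma",
  UCEC     = "Uterine corpus endometrial carcinoma"
)

# Hypothetical helper: return the full name for each code,
# leaving unknown codes unchanged.
expand_tcga <- function(codes) {
  full <- tcga_names[toupper(codes)]
  unname(ifelse(is.na(full), codes, full))
}

print(expand_tcga(c("GBM", "ov", "XYZ")))
```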

Acute myeloid leukaemia
Bladder
Breast
Carcinoid
Chronic lymphocytic leukaemia
Colorectal
Diffuse large B-cell lymphoma
Endometrial
Oesophageal adenocarcinoma
Glioblastoma multiforme
Head and neck
Kidney clear cell
Lung adenocarcinoma
Lung squamous cell carcinoma
Medulloblastoma
Melanoma
Multiple myeloma
Neuroblastoma
Ovarian
Prostate
Rhabdoid tumour

HCD features: download

This is a list of high-confidence cancer driver genes: a little over 280 genes in total.
Cancer5000 features: download

This is a list of cancer-related genes from a study of nearly 5,000 cancer samples: a little over 230 genes in total.

References: http://bg.upf.edu/oncodrive-role/

http://bioinformatics.oxfordjournals.org/content/30/17/i549.full

http://www.nature.com/nature/journal/v505/n7484/full/nature12912.html?WT.ec_id=NATURE-20140123


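Sharing materials from the TCGA annual symposia

Anyone working in bioinformatics, especially in cancer research, has probably heard of TCGA. There have been four annual symposia, and the shared material is mostly slide decks in PDF format. You can click through the meeting pages and download the talks one by one to study at your own pace, or download them directly from the links I list below.

Videos of the talks were originally shared on YouTube, but YouTube is blocked in mainland China, so the slides will have to do.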

http://www.genome.gov/17516564

The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.

TCGA is a joint effort of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), which are both part of the National Institutes of Health, U.S. Department of Health and Human Services.

Meetings

The PDF links are as follows:

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Laird.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Durbin.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Ley.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Sartor.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Ciriello.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Imielinski.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Gao.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Carter.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Ng.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Parvin.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Raphael.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Lawrence.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Kreisberg.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Marra.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Helman.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Stuart.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Cooper.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Levine.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Natsoulis.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Haussler.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Erkkila.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Gehlenborg.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Qiao.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Sivachenko.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Sumazin.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Gutman.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Mardis.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/01_Shaw.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/02_Chanock.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/03_Staudt.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/05_Creighton.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/06_Stojanov.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/07_Karchin.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/08_Mungall.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/09_Hakimi.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/10_Gao.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/11_Hayes.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/12_Troester.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/13_Knobluach.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/14_Raphael.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/15_Akbani.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/16_Giordano.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/17_Weinstein.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/18_Zheng.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/19_Getz.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/20_VanDneBroek.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/21_Liao.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/22_Khazanov.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/23_Levine.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/24_Miller.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/25_Ewing.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/26_Cirello.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/27_Verhaak.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/28_Hofree.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/29_Meyerson.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/30_Yang.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/31_Wheeler.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/32_Parfenov.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/33_Bernard-Rovira.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/34_Hast.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/36_Sellars.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/04_Brat.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/05_Mungall.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/06_Boutros.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/07_Zmuda.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/08_Benz.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/09_Zheng.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/11_Creighton.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/12_Aksoy.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/13_Dinh.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/14_Stuart.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/15_Amin.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/16_Gross.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/15_Akbani.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/18_Giordano.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/19_Amin-Mansour.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/20_Oesper.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/21_Gatza.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/22_Bernard.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/23_Sinha.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/24_Akbani.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/25_Watson.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/26_Martignetti.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/27_Bandlamudi.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/28_Fu.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/29_Akdemir.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/30_Bass.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/31_Hakimi.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/32_Wheeler.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/33_Lehmann.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/34_Gordenin.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/35_Wyczalkowski.pdf

 

http://www.genome.gov/Multimedia/Slides/TCGA4/02_Zenklusen.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/03_Hutter.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/04_Brat.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/05_Mungall.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/06_Linehan.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/07_Brooks.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/08_Wu.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/09_Giger.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/10_Wilkerson.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/11_Orsulic.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/12_Zhong.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/13_Knijnenburg.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/14_Akbani.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/15_Wang.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/16_Poisson.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/17_Alaeimahabadi.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/18_Noushmehr.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/19_Pantazi.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/20_Shih.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/21_Stransky.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/22_Giordano.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/23_Davidsen.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/24_Gross.pdf
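
Clicking through this many links is tedious, so here is a minimal sketch of a batch download in R. It assumes the URLs above still resolve, which I have not verified. The function names local_path and download_slides and the default folder tcga_slides are my own choices, and only two of the URLs are included for brevity.

```r
# Two of the slide URLs listed above; extend the vector as needed.
slide_urls <- c(
  "http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Laird.pdf",
  "http://www.genome.gov/Multimedia/Slides/TCGA4/24_Gross.pdf"
)

# Local path for a URL: <dest_dir>/<meeting>_<file name>, so that files with
# the same name from different meetings (e.g. 04_Brat.pdf) do not collide.
local_path <- function(url, dest_dir) {
  meeting <- basename(dirname(url))
  file.path(dest_dir, paste0(meeting, "_", basename(url)))
}

# Download every URL, skipping files that already exist.
# Returns a logical vector: TRUE where the file is available locally.
download_slides <- function(urls, dest_dir = "tcga_slides") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  vapply(urls, function(u) {
    dest <- local_path(u, dest_dir)
    if (file.exists(dest)) return(TRUE)
    status <- tryCatch(
      download.file(u, dest, mode = "wb", quiet = TRUE),
      error   = function(e) 1L,
      warning = function(w) 1L
    )
    if (status != 0) unlink(dest)  # remove any partial file
    status == 0
  }, logical(1))
}

# download_slides(slide_urls)  # uncomment to run; needs network access
```

Prefixing the meeting folder to the file name matters here because several talks share a file name across meetings.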

 


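Parallel computing in R

In a previous post I mentioned a large computation that took a long time to finish, which is why I used a progress bar to monitor it. A progress bar does nothing for the speed, though, so I needed parallel computing and went searching online again.

Once more it comes down to a single package, and the workflow is much like MATLAB's.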

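library(parallel)

cl.cores <- detectCores() # number of cores detected on this machine

cl <- makeCluster(cl.cores) # start one worker process per detected core

# run clusterEvalQ() or one of the par*apply functions here to compute in parallel

stopCluster(cl) # shut the workers down when finished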

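The R documentation describes makeCluster as follows: Creates a set of copies of R running in parallel and communicating over sockets. In other words, it starts several R processes at once to act as workers. The workers are running as soon as the function returns, so the machine may become a little sluggish, especially while a par* function is executing.

Commonly used functions in the parallel environment are:

1. clusterEvalQ(cl, expr) evaluates expr on every worker of the cluster cl created above. expr is the statement to execute; if the commands are long, it is usually better to put them in a file. For example, with the commands saved in Rcode.r: clusterEvalQ(cl,source(file="Rcode.r"))

2. The par* apply family. These functions are used almost exactly like the apply family, with one extra argument, cl. If the cluster was created as cl <- makeCluster(cl.cores) above, it can be passed directly, as in parApply(cl=cl,…). The Apply part can equally be Sapply, Lapply, and so on. Note that the first letter after par is capitalized (parSapply, parLapply), whereas the ordinary apply functions start with a lowercase letter.

Also note that even after the cluster has been created, calling apply() instead of parApply() still runs serially. In other words, makeCluster only creates workers that stand ready; it does not by itself make any computation parallel.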
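
The points above can be tried out with a minimal sketch. I use two workers here instead of detectCores() to keep it light, and the squaring function is only a placeholder for real work.

```r
library(parallel)

# Start two worker R processes (detectCores() could be used instead).
cl <- makeCluster(2)

# clusterEvalQ evaluates the same expression once on every worker.
# Each worker reports its own process id, showing they are separate R copies.
worker_pids <- clusterEvalQ(cl, Sys.getpid())

# The ordinary sapply runs serially in the current session ...
serial_res <- sapply(1:8, function(x) x^2)

# ... while parSapply takes the cluster as an extra first argument
# and spreads the elements across the workers.
parallel_res <- parSapply(cl, 1:8, function(x) x^2)

# Always release the workers when done.
stopCluster(cl)

print(length(worker_pids))
print(all.equal(serial_res, parallel_res))
```

For a task this small the parallel version is not faster, since starting workers and shipping data has its own cost; the gain shows up when each element takes real time.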

Reference: http://www.r-bloggers.com/lang/chinese/1131

I then followed this pattern to parallelize my own task:

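# it did work very fast
library(parallel)
cl.cores <- detectCores()
cl <- makeCluster(cl.cores)
# Key step: the user-defined function run on the workers relies on these two
# objects, so they must be exported to every worker first.
clusterExport(cl, "all_dat_t")
clusterExport(cl, "all_prob_id")
prob_202723_s_at <- parSapply( # parSapply does the parallel work
  cl = cl,         # the cluster object created by makeCluster above
  deviation_prob,  # the vector to process in parallel
  test_pro         # a user-defined function (not shown here) that checks each probe in deviation_prob
)
stopCluster(cl) # release the workers when done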
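
Since test_pro is not shown, here is a self-contained sketch of the same pattern. The data, the target probe, and the body of test_pro are all made-up stand-ins; only the clusterExport and parSapply usage mirrors the code above.

```r
library(parallel)

# Hypothetical stand-in data: 20 samples (rows) by 5 probes (columns).
set.seed(1)
all_dat_t <- as.data.frame(
  matrix(rnorm(20 * 5), nrow = 20,
         dimnames = list(NULL, paste0("probe", 1:5)))
)
target <- "probe1"

# Hypothetical stand-in for test_pro: correlation of one probe with the
# target probe. It reads all_dat_t and target from the global environment.
test_pro <- function(p) cor(all_dat_t[[target]], all_dat_t[[p]])

cl <- makeCluster(2)

# The workers start with empty global environments, so the objects that
# test_pro relies on have to be copied over first.
clusterExport(cl, c("all_dat_t", "target"))

res <- parSapply(cl, names(all_dat_t), test_pro)
stopCluster(cl)

print(res)
```

Without the clusterExport call, the workers would stop with an error saying that all_dat_t cannot be found.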


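Progress bars in R

I found a tutorial online on short notice and had it working after a quick read. It only takes one package, tcltk; look up the help for the function tkProgressBar to see how to use it.

Here is a small example from the web (it does nothing meaningful and is only for illustration):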

library(tcltk) # tkProgressBar lives in tcltk

u <- 1:2000

plot.new()

pb <- tkProgressBar("Progress", "Done %", 0, 100)

for(i in u) {

x<-rnorm(u)

points(x,x^2,col=i)

info <- sprintf("Done %d%%", round(i*100/length(u)))

setTkProgressBar(pb, i*100/length(u), sprintf("Progress (%s)", info), info)

}

close(pb) # close the progress bar

The code below is my own version, modeled on that tutorial.

[R]

# progress bar for the probe loop
library(tcltk) # tkProgressBar lives in tcltk
plot.new()
pb <- tkProgressBar("Progress", "Done %", 0, 100)
prob_202723_s_at_value=rep(0,length(deviation_prob))
start_time=Sys.time() # time the run, since anything that needs a progress bar usually takes a long time
for (i in seq_along(deviation_prob)) {
tmp=test_pro(deviation_prob[i]) # test_pro is my own function; it checks whether the probe meets the criteria
if (length(tmp)!=0){prob_202723_s_at_value[i]=tmp}
info <- sprintf("Done %d%%", round(i*100/length(deviation_prob)))  # progress is derived from the loop index i
setTkProgressBar(pb, i*100/length(deviation_prob), sprintf("Progress (%s)", info), info)
}
close(pb) # close the progress bar
end_time=Sys.time()
print(end_time-start_time) # print() keeps the time units that cat() would drop

[/R]
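
On a server without a graphical display the Tk window cannot open. A different tool for the same job is txtProgressBar from the utils package, which draws the bar in the console. The sketch below is that alternative, not the tcltk approach above, and slow_check is a made-up stand-in for the real per-probe function.

```r
# Hypothetical stand-in for a slow per-item computation.
slow_check <- function(x) {
  Sys.sleep(0.01)
  x^2
}

items <- 1:50
results <- numeric(length(items))

# style = 3 prints a bar with a percentage in the console.
pb <- txtProgressBar(min = 0, max = length(items), style = 3)
start_time <- Sys.time()

for (i in seq_along(items)) {
  results[i] <- slow_check(items[i])
  setTxtProgressBar(pb, i)  # progress follows the loop index
}

close(pb)

# difftime keeps explicit units for the elapsed time.
elapsed <- difftime(Sys.time(), start_time, units = "secs")
print(elapsed)
```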


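R: comparing the speed of extracting a column from a data frame

Conclusion: the ways of extracting a column from a data frame differ greatly in time cost. Using the numeric index directly is fastest, the $ operator comes next, and indexing by column name is slowest.

I built an expression matrix in R in which the columns are samples, the rows are probes, and each value is the expression level measured for that probe in that sample. The timings below use all_dat_t, in which each column is one probe (54675 probes with 189 observations each).

To get the correlation between the expression vector of the probe named "202723_s_at" and that of every other probe, I can extract that column in any of the following ways: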

> system.time(apply(all_dat_t,2,function(x)  cor(all_dat_t$"202723_s_at",x)))

user  system elapsed

22.93    0.03   23.03

> system.time(apply(all_dat_t,2,function(x)  cor(all_dat_t[,"202723_s_at"],x)))

Timing stopped at: 92.02 5.32 97.66

This took too long, so I interrupted it.

> system.time(apply(all_dat_t,2,function(x)  cor(all_dat_t[,grep(prob,names(all_dat_t))],x)))

Timing stopped at: 13.55 0.04 13.66

> prob_num=grep(prob,names(all_dat_t))

> system.time(apply(all_dat_t,2,function(x)  cor(all_dat_t[,prob_num],x)))

user  system elapsed

8.14    0.01    8.17

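So the fastest approach is to grep the probe name once to get its column number in the expression matrix, and then extract the expression values by that number. The improvement in time is substantial!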
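
Two further steps in the same direction are to pull the target column out once before the loop, and to let cor() handle all columns in a single call. The sketch below checks on small made-up data that all three give the same numbers; the data frame dat and the probe names are hypothetical, and I have not timed these variants on the real data.

```r
set.seed(42)

# Hypothetical small stand-in: 30 samples (rows) by 200 probes (columns).
dat <- as.data.frame(matrix(rnorm(30 * 200), nrow = 30))
names(dat) <- paste0("probe_", seq_len(200))
prob <- "probe_17"

# match() finds the exact name. grep("probe_17", ...) would also hit
# probe_170 to probe_179, because grep does pattern matching.
prob_num <- match(prob, names(dat))

# 1. Look the column up by name on every iteration (the slow pattern).
by_name <- sapply(dat, function(x) cor(dat[, prob], x))

# 2. Extract the target vector once, outside the loop.
target <- dat[[prob_num]]
hoisted <- sapply(dat, function(x) cor(target, x))

# 3. One call: cor() of a vector against all columns of a numeric matrix.
vectorized <- cor(target, as.matrix(dat))[1, ]

print(all.equal(by_name, hoisted))
print(all.equal(hoisted, vectorized))
```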

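Next, a look at the efficiency of the cor function itself.

The matrix has 54675 variables with 189 observations each. Computing correlation coefficients on subsets of the variables gives the following timings: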

> system.time(cor(all_dat_t[,1:10]))

user  system elapsed

0.001   0.000   0.001

> system.time(cor(all_dat_t[,1:100]))

user  system elapsed

0.003   0.000   0.003

> system.time(cor(all_dat_t[,1:1000]))

user  system elapsed

0.107   0.002   0.108

> system.time(cor(all_dat_t[,1:10000]))

user  system elapsed

11.102   0.849  11.983

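> system.time(cor(all_dat_t)) # this also finishes, in about six minutes

However, the correlation matrix that cor(all_dat_t) returns after those six minutes is about 32G by my estimate, which is alarming!

Yet R clearly cannot be holding that 32G matrix in physical memory, because my machine only has 16G of RAM. To this day I do not understand how R manages it internally.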
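
As a sanity check on the size, here is the arithmetic, assuming a dense matrix of 8-byte doubles with no other overhead. Under that assumption the result comes to roughly 24 GB rather than 32, which is still well above 16 GB of RAM.

```r
n_probes <- 54675

# A full correlation matrix has n_probes x n_probes entries,
# each stored as an 8-byte double.
bytes <- n_probes^2 * 8

gb  <- bytes / 1e9    # decimal gigabytes
gib <- bytes / 2^30   # binary gibibytes

print(round(c(GB = gb, GiB = gib), 1))
```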

 


Recommended bioinformatics tutorial: an NGS course from MSU

http://angus.readthedocs.org/en/2014/index.html

Next-Gen Sequence Analysis Workshop (2014)

This is the schedule for the 2014 MSU NGS course.

This workshop has a Workshop Code of Conduct.

Download all of these materials or visit the GitHub repository.

Day Schedule
Monday 8/4
Tuesday 8/5
Wed 8/6
Thursday 8/7
Friday 8/8
Saturday 8/9
Monday 8/11
Tuesday 8/12
Wed 8/13
Thursday 8/14
Friday 8/15