11

批量运行GSEA,命令行版本

之前用过有界面的那种,那样非常方便,只需要做好数据即可,但是如果有非常多的数据,每次都要点击文件,点击下一步,也很烦,不过,,它既然是java软件,就可以以命令行的形式来玩转它!

能够命令行运行了,就很容易批量啦

一、程序安装

直接在官网下载java版本软件即可:http://software.broadinstitute.org/gsea/downloads.jsp

二、输入数据

需要下载gmt文件,自己制作gct和cls文件,或者直接下载测试文件p53

见:http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats

1
三、运行命令

hgu95av2的芯片数据,只有一万多探针,所以很快就可以出结果

 java -cp gsea2-2.2.1.jar  -Xmx1024m xtools.gsea.Gsea   -gmx c2.cp.kegg.v5.0.symbols.gmt  \
 -res P53_hgu95av2.gct  -cls P53.cls   -chip  chip/HG_U95Av2.chip  -out result -rpt_label p53_example
但是一般我们都用默认的即可

2
里面报错说有些探针找不到,不要管它

四、输出数据
3
自己看官网去理解这些结果咯!
需要下载的数据如下:

首先需要下载 Molecular Signatures Database (MSigDB),一般选择C2的kegg,BioCarta 还有Reactome

都是gmt格式的文件!
CP: Canonical pathways
(browse 1330 gene sets)
Gene sets from the pathway databases. Usually, these gene sets are canonical representations of a biological process compiled by domain experts. details Download GMT Files
original identifiers
gene symbols
entrez genes ids
CP:BIOCARTA: BioCarta gene sets
(browse 217 gene sets)
Gene sets derived from the BioCarta pathway database (http://www.biocarta.com/genes/index.asp). Download GMT Files
original identifiers
gene symbols
entrez genes ids
CP:KEGG: KEGG gene sets
(browse 186 gene sets)
Gene sets derived from the KEGG pathway database (http://www.genome.jp/kegg/pathway.html). Download GMT Files
original identifiers
gene symbols
entrez genes ids
CP:REACTOME: Reactome gene sets
(browse 674 gene sets)
Gene sets derived from the Reactome pathway database (http://www.reactome.org/). Download GMT Files
original identifiers
gene symbols
entrez genes ids
然后做出表达数据gct文件和cls表型文件~
然后就可以直接运行了
如果是芯片数据,第一列是芯片探针,那么还需要下载chip数据:ftp://ftp.broadinstitute.org/pub/gsea/annotations
11

关于芯片平台GPL15308和GPL570

它们虽然被GEO数据标记了不同的ID号,但是其实都是一种芯片,都是昂飞公司的U133++2芯片,分析过芯片数据的人肯定不会陌生了

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL15308

事实上,这个平台应该是GPL570,但是被CCLE数据库给稍微变通了一下,就给了一个GPL15308的标签,平台主页也写的很清楚,它的探针ID是伪ID,其实就是entrez gene ID

1

本来这个芯片设计的是五万多个探针,最后只剩下了18926个基因
This array is identical to GPL570 but the data were analyzed with a custom CDF Brainarray Version 15, hgu133plus2hsentrezg.
11

对CCLE数据库可以做的分析

收集了那么多的癌症细胞系的表达数据,拷贝数变异数据,突变数据,总不能放着让它发霉吧!
这些数据可以利用的地方非常多,但是在谷歌里面搜索引用了它的文章却不多,我挑了其中几个,解读了一下别人是如何利用这个数据的,当然,主要是用那个mRNA的表达数据咯!
这篇文献对CCLE的数据进行了八个步骤的处理,一个合格的生物信息学分析着完全可以重写这个过程
step1:Affymetrix U133 Plus2 DNA microarray gene expressions of 27 gastric cancer cell lines (Kato-III, IM95, SNU-620, SNU-16, OCUM-1, NUGC-4, 2313287, HUG1N, MKN45, NCIN87, KE39, AGS, SNU-5, SNU-216, NUGC-3, NUGC-2, MKN74, MKN7, RERFGC1B, GCIY, KE97, Fu97, SH10TC, MKN1, SNU-1, Hs746 T, HGC27) were downloaded from Cancer Cell Line Encyclopedia (CCLE) [16] in March 2013.
step2: Robust Multi-array Average (RMA) normalization was performed. Principal component analysis plot show no obvious batch effect.
step3: The normalized data is then collapsed by taking the probe sets with highest gene expression.
前三步是为了得到27个胃癌相关细胞系的mRNA表达矩阵,方法是下载cel文件,用RMA归一化,对多探针基因去最大表达量探针!

step4:Unsupervised hierarchical clustering (1-Spearman distance, average linkage) was performed on the cell lines using the aCGH data.

Putative driver genes of which copy number aberrations correlated to mRNA gene expression were identified to determine subtypes or clusters that are driven by different mechanisms. This was done using Mann Whitney U-test with p<0.05, and Spearman Correlation Coefficient test with Rho >0.6.

step5:We then performed consensus clustering[17] on the gene expression data of the 27 gastric cancer cell lines from CCLE using these putative driver genes. We selected k = 2 as it gives sufficiently stable similarity matrix.

step6: In order to assign new samples to this integrative cluster, significance analysis of microarray (SAM) [18]with threshold q<2.0 was used to generate subtype signature based on the mRNA expression data of the 1762 genes from the 27 gastric cancer cell lines in CCLE.

先用甲基化数据来聚类,得到putative driver genes,然后再用这些基因的表达数据来再次聚类,分成两类,然后对这两类进行SAM找差异基因

step7:ssGSEA (single sample GSEA)was used to estimate pathway activities of the gastric cancer cell line in the Molecular Signature Database v3.1 (Msigdb v3.1) [19][20]. The pathway activities are represented in enrichment scores which were rank normalized to [0.0, 1.0]. 
step8:SAM analysis was performed with threshold q<0.2, and fold change >2.0 (for up-regulated pathways), or <0.5 (for down-regulated pathways) to obtain subtype-specific pathways from the 27 gastric cell lines in CCLE.
这里既用来gene set的富集分析,又用来超几何分布的富集分析,结果去看看这篇文章就知道了!
 
这篇文章只用了CCLE的一个地方,就是看看不同cancer type里面的某个基因表达boxplot
这个图的数据用GEOquery可以得到,样本的分类信息也用GEOquery可以得到,这样就可以做下面这个图了,非常简单
Further, the Cancer Cell Line Encyclopedia (CCLE) database demonstrated that of 1062 cell lines representing 37 distinct cancer types, glioma cell lines express the highest levels of STK17A
1

结论就是:STK17A is highly expressed in glioma cell lines compared to other cancer types. Data was obtained through the Cancer Cell Line Encyclopedia (CCLE).

第三篇文献:http://www.nature.com/ncomms/2013/130709/ncomms3126/fig_tab/ncomms3126_F4.html

这篇文献更简单了,直接对这个表达矩阵进行聚类:
 
The 5,000 most variable genes were used for unsupervised clustering of cell lines by mRNA expression data. Cell lines are colour-coded (vertical bars) according to the reported tissue of origin (a PDF version that can be enlarged at high resolution is in Supplementary InformationSupplementary Fig. S4); horizontal labels at bottom indicate the dominating tissue types within the respective branches of the dendrogram. Most ovarian cancer cell lines (magenta) cluster together, interspersed with endometrial cell lines. However, some ovarian cancer cell lines cluster with other tissue types (*). Top right panels: neighbourhoods (1) of the top cell lines in our analysis, (2) of cell line IGROV1, and (3) of cell line A2780. For the ovarian cancer cell lines in these enlarged areas, the histological subtype as assigned in the original publication is indicated by coloured letters.
就直接拿整个表达矩阵即可,然后挑选变异最大的5000个基因来进行聚类,就可以得到类似的图

 

11

CCLE数据库几个知识点

Here we describe the Cancer Cell Line Encyclopedia (CCLE): a compilation of gene expression, chromosomal copy number, and massively parallel sequencing data from 947 human cancer cell lines. 
收集了三种数据:
The mutational status of >1,600 genes was determined by targeted massively parallel sequencing, followed by removal of variants likely to be germline events . 
Moreover, 392 recurrent mutations affecting 33 known cancer genes were assessed by mass spectrometric genotyping13 . 
DNA copy number was measured using high-density single nucleotide polymorphism arrays (Affymetrix SNP 6.0; Supplementary Methods). 
Finally, mRNA expression levels were obtained for each of the lines using Affymetrix U133 plus 2.0 arrays. 
These data were also used to confirm cell line identities .
一般用得最多的就是表达数据,因为表达数据最简单,大多数生物信息学分析着只会用这个数据!
而它的突变数据又不是通常意义的高通量测序得到的,snp6芯片数据很多人听都没听过
文章的附件有对cell lines的具体描述。
different_kinds_of_cancer_in_CCLE
CCLE的数据在broad institute里面可以下载,也放在GEO数据库里面,我比较喜欢GEO里面的数据
This SuperSeries is composed of the following SubSeries:
GSE36133 Expression data from the Cancer Cell Line Encyclopedia (CCLE)
GSE36138 SNP array data from the Cancer Cell Line Encyclopedia (CCLE)
GSE36133这个study的metadata里面有对每个cellline来源的cancer进行描述!
有人喜欢把这个metadata叫做是clinical data。
library(GEOquery)
ccleFromGEO <- getGEO("GSE36133")
annotBlock1 <- pData(phenoData(ccleFromGEO[[1]]))
>dim(annotBlock1)
[1] 917  38
exprSet=exprs(ccleFromGEO[[1]])
> dim(exprSet)
[1] 18926   917
##它的表达数据矩阵,包含了18926个基因,列名是917个细胞系的名字,行是基因的entrez ID
keyColumns <- c("title", "source_name_ch1", "characteristics_ch1", "characteristics_ch1.1", 
    "characteristics_ch1.2")
options(stringsAsFactors = F)
allAnnot=annotBlock1[,keyColumns]
##这几列信息是比较重要的metadata,里面详细记录了细胞系的收集公司单位,tissue,癌症分类等信息
Cell line (1035个细胞系简介)Gene Sets
1035 sets of genes with high or low expression in each cell line relative to other cell lines from the CCLE Cell Line Gene Expression Profiles dataset.
一些关于CCLE数据库的文章:
http://onlinelibrary.wiley.com/doi/10.1002/cncy.21471/pdf 介绍了几个类似的数据库资源
Anticancer drug sensitivity analysis: An integrated approach applied to Erlotinib sensitivity prediction in the CCLE database
08

TCGA数据里面的生存分析例子

我们知道了生存分析,就是随着时间的流逝,死亡率是如何增加的,一般是用KM法来估计生存函数,然后画个图即可!而根据某些因子把样本分组,可以看到他们死亡率的变化趋势显著的不同,这就说明了我们的这个因子是非常有效的分类方式,这个因子可以是一个biomarker,也可以某些其它指标!
甚至,我们还可以用cox模型来分析这个因子是如何影响生存函数的,那个稍后再讲
这里,我们就简单讲一个例子,是TCGA里面卵巢癌的数据,根据甲基化数据分成了4个组,那么我们就下载这四个组样本的临床数据,
看看这样分组后,他们的死亡率变化趋势是不是有显著区别!

Continue reading

08

生存分析简介

一般我们谈生存分析,就是说的KM方法估计生存函数,并且画出生存曲线,然后还可以根据分组检验一下它们的生存曲线是否有显著的差异!
在R里面,非常的方便,有个包survival很容易就可以做生存分析了!
只需要记住三个函数即可:
Surv:用于创建生存数据对象
survfit:创建KM生存曲线或是Cox调整生存曲线
survdiff:用于不同组的统计检验

Continue reading

08

别人写的代码运行真快!!!

最近需要做十几万个差异基因分析,每个分析都是对大约5万个探针,200个样本的数据量进行批量T检验计算P值
然后发现自己无论怎么用R来写,每个分析都要耗时半分钟左右,因为我必须循环所有的探针,即使不用for,而用R推荐的apply系列函数,也快不到哪里去,但是我搜索时候发现有一个package里面自带了矩阵T检验,直接对5万个探针进行T检验,而不需要循环处理它们
看下面代码!
dat=matrix(rnorm(10000000),nrow = 50000)
dim(dat) #50000   200
system.time(
  apply(dat,1,function(x){
    t.test(x[1:100],x[101:200])$p.value
  })
)
#用户  系统  流逝
#29.29  0.04 30.64
library(pi0)
system.time(matrix.t.test(dat,1,100,100))
#用户 系统 流逝
#0.48 0.03 0.53
差距真的是非常的明显呀!!!
然后,我解析了它的代码,发现里面调用了C写的代码,我想这就是问题所在咯,可是他们到底怎么写,才能把速度搞这么快???
  tmp = .C("tstatistic", dat = x, n1 = n1, n2 = n2, ntests = ntests, 
        MARGIN = MARGIN, pool = pool, tstat = rep(0, ntests), 
        df = rep(0, ntests), PACKAGE = "pi0")

源码在这个package的github里面可以找到,有兴趣的童鞋可以研究一下

 
 

 

06

R语言软件的各种旧版本下载

这种编程语言,我之前以为只有包,或者模块什么的才烦人,没想到最近还碰到了版本的问题。
一般来说, 去R的官网,可以下载到基于各种操作系统的最新版R软件,但是在某些特别的情况下,我们可能需要下载以前的旧版本。这时候,想用百度这个破东西搜索出来下载地址简直是痴心妄想。
比如,我想搜索2.15.2这个版本,在Google里面很容易找到了
如果进入bin目录,可以看到各种操作系统的软件https://cran.r-project.org/bin/ ,里面都是可执行的文件,有下面几种操作系统
[DIR] linux/ 23-Jan-2008 19:47 -
[DIR] macos/ 19-Apr-2005 09:45 -
[DIR] macosx/ 12-Dec-2015 09:04 -
[DIR] windows/ 24-Feb-2012 18:41 -
大多数情况下面我们会用window这个平台的,毕竟,可视化嘛!
很多时候,我们也会需要linux平台的https://cran.r-project.org/bin/linux/ 但是linux平台本身就比较复杂
[DIR] debian/ 15-Dec-2015 02:06 -
[DIR] redhat/ 27-Jul-2014 21:12 -
[DIR] suse/ 16-Feb-2012 15:09 -
[DIR] ubuntu/ 06-Jan-2016 04:05 -
我用的Ubuntu比较多,即使选择了ubuntu,里面还有好几个选项,因为Ubuntu也有各种版本!
[DIR] precise/ 06-Jan-2016 04:03 -
[DIR] trusty/ 06-Jan-2016 04:04 -
[DIR] vivid/ 06-Jan-2016 04:04 -
[DIR] wily/ 06-Jan-2016 04:05 -

所以如果,大家是想在linux系统里面安装旧版本的R,建议大家直接下载c源码,直接三部曲就可以安装啦!

[DIR] R-0/ 04-Oct-2004 10:20 -
[DIR] R-1/ 04-Oct-2004 19:02 -
[DIR] R-2/ 01-Mar-2013 09:11 -
[DIR] R-3/ 10-Dec-2015 09:13 -
windows系统里面一般是是exe的可执行文件,直接安装,不需要源码变异,只需要不停的下一步即可
R 3.2.2 (August, 2015)
R 3.2.1 (June, 2015)
R 3.2.0 (April, 2015)
R 3.1.3 (March, 2015)
R 3.1.2 (October, 2014)
R 3.1.1 (July, 2014)
R 3.1.0 (April, 2014)
R 3.0.3 (March, 2014)
R 3.0.2 (September, 2013)
R 3.0.1 (May, 2013)
R 3.0.0 (April, 2013)
R 2.15.3 (March, 2013)
R 2.15.2 (October, 2012)
R 2.15.1 (June, 2012)
R 2.15.0 (March, 2012)
R 2.14.2 (February, 2012)
R 2.14.1 (December, 2011)
R 2.14.0 (November, 2011)
R 2.13.2 (September, 2011)
R 2.13.1 (July, 2011)
R 2.13.0 (April, 2011)
R 2.12.2 (February, 2011)
R 2.12.1 (December, 2010)
R 2.12.0 (October, 2010)
R 2.11.1 (May, 2010)
R 2.11.0 (April, 2010)
R 2.10.1 (December, 2009)
R 2.10.0 (October, 2009)
R 2.9.2 (August, 2009)
R 2.9.1 (June, 2009)
R 2.9.0 (April, 2009)
R 2.8.1 (December, 2008)
R 2.8.0 (October, 2008)
R 2.7.2 (August, 2008)
R 2.7.1 (June, 2008)
R 2.7.0 (April, 2008)
R 2.6.2 (February, 2008)
R 2.6.1 (November, 2007)
R 2.6.0 (October, 2007)
R 2.5.1 (July, 2007)
R 2.5.0 (April, 2007)
R 2.4.1 (December, 2006)
R 2.4.0 (October, 2006)
R 2.3.1 (June, 2006)
R 2.3.0 (April, 2006)
R 2.2.1 (December, 2005)
R 2.2.0 (October, 2005)
R 2.1.1 (June, 2005)
R 2.1.0 (April, 2005)
R 2.0.1 (November, 2004)
R 2.0.0 (October, 2004)
R 1.9.1 (June, 2004)
R 1.8.1 (November, 2003)
R 1.7.1 (June, 2003)
R 1.6.2 (January, 2003)
Installer for R 1.5.1 (June, 2002)
Installer for R 1.4.1 (January, 2002)
Installer for R 1.3.1 (September, 2001)
Binary files for R 1.2.2 (March, 2001)
Binary files for R 1.0.0 (February, 2000)

 

06

picnic对拷贝数变异检测芯片数据进行分析

这里的拷贝数变异检测芯片指的是Affymetrix Genome-Wide Human SNP Array 6.0

cel数据,处理成segment及genotype数据

一、程序安装
这本来是一个matlab程序,但是有linux版本,需要安装matlab编译环境
   clipboard
下载解压之后首先安装matlab环境:
./MCRInstaller.bin -console
因为我的服务器没有界面,所以我用了-console执行安装程序,很简单就安装好了
clipboard2
其中,cdf目录下面是我们从http://www.affymetrix.com 里面下载的Library FilesGenome-Wide Human SNP Array 6.0 (zip, 246 MB)
celConverter目录下面是一个java程序,可以把芯片出来的cel文件转为flat file
其中Matlab_running 是我执行安装的matlab环境
其余两个脚本run_HMM.sh  run_preprocessing.sh是picnic程序的第二步和第三步
二、输入数据准备
我随便找了两个snp6.0芯片的raw data
-rw-rw-r-- 1 jmzeng jmzeng 66M Dec 30 06:30 GSM1949207.CEL
-rw-rw-r-- 1 jmzeng jmzeng 66M Sep  9 11:08 GSM887898.CEL
三、程序使用

程序主要分为三个步骤来实现
Step 1: Convert the binary *.cel file to a flat file
Step 2: Normalise and estimate the ploidy of the genome and the level of normal contamination
Step 3: Segment the data and produce the genotype
第一步:
暂时还不需要matlab工作环境
示例的code是:java -Xmx2G -jar CelFileConverter.jar -m Snp6FeatureMappings.csv -c 'cdf_file_including_path' -s 'directory name
of cel files' -t rootDir/outdir/raw
我的code是:

jmzeng@ubuntu:/home/jmzeng/bio-soft/picnic/c_code/celConverter$ java -jar CelFileConverter.jar -m Snp6FeatureMappings.csv -c ../cdf/GenomeWideSNP_6.Full.cdf -s ../celFiles/ -t ../result/ 
Loading CDF file: ../cdf/GenomeWideSNP_6.Full.cdf
Loading feature mapping file: Snp6FeatureMappings.csv
Processing CEL file: /home/jmzeng/bio-soft/picnic/c_code/celConverter/../celFiles/GSM1949207.CEL
Created feature intensity file: /home/jmzeng/bio-soft/picnic/c_code/celConverter/../result/GSM1949207.feature_intensity
Processing CEL file: /home/jmzeng/bio-soft/picnic/c_code/celConverter/../celFiles/GSM887898.CEL
Created feature intensity file: /home/jmzeng/bio-soft/picnic/c_code/celConverter/../result/GSM887898.feature_intensity

然后在输出目录就有了一个feature_intensity后缀的文本文件,约110M
-rw-rw-r-- 1 jmzeng jmzeng 105M Dec 30 06:35 GSM1949207.feature_intensity
-rw-rw-r-- 1 jmzeng jmzeng 109M Dec 30 06:35 GSM887898.feature_intensity
第二步:
这里需要用matlab啦,但是这里经常出现库的问题!!!
示例的code是
sh run_preprocessing.sh mcr_dir cell_name info_dir feature_int_dir normalised_outdir outdir sample_type in_pi
我的code是:
jmzeng@ubuntu:/home/jmzeng/bio-soft/picnic/c_code$ sh run_preprocessing.sh Matlab_running/v710/ GSM1949207 info/ result/ result/output result/
------------------------------------------
Setting up environment variables
---
LD_LIBRARY_PATH is .:Matlab_running/v710//runtime/glnxa64:Matlab_running/v710//bin/glnxa64:Matlab_running/v710//sys/os/glnxa64:Matlab_running/v710//sys/java/jre/glnxa64/jre/lib/amd64/native_threads:Matlab_running/v710//sys/java/jre/glnxa64/jre/lib/amd64/server:Matlab_running/v710//sys/java/jre/glnxa64/jre/lib/amd64/client:Matlab_running/v710//sys/java/jre/glnxa64/jre/lib/amd64
My Own Exception: Fatal error loading library /home/jmzeng/bio-soft/picnic/c_code/Matlab_running/v710/bin/glnxa64/libmwmclmcr.so Error: libXp.so.6: cannot open shared object file: No such file or directory
第三步:
也需要用matlab,库的问题必须解决!!!
我另外一台服务器的结局办法是用v714的matlab
示例的code是
sh run_HMM.sh /nfs/team78pc2/kwl_temp/segments/PICNIC/C/release/Matlab_Compiler_Runtime/v710 A01_CGP_PD3945a.feature_intensity '/nfs/team78pc3/KWL/segments/PICNIC/matlab/C/release/info/' '/nfs/team78pc2/kwl_temp/segments/PICNIC/data/normalized/' '/nfs/team78pc2/kwl_temp/segments/PICNIC/data/' '10' '0.33598' '1.9915' '0.40997'
我的code是:无所谓了,这个服务器不知道怎么回事,总是出现库文件的问题,而这个问题需要root权限,我懒得弄了,我在其它的服务器上面都木有问题的!!!
所以我就换了MATLAB,毕竟,这个软件本来就是matlab版本的,第一步照旧,第二步在matlab里面运行!
我这里只用GSM1949207做测试吧!!!
Genomic DNA was extracted from saliva, peripheral blood, or fibroblast cell lines using the QIAamp DNA Blood Mini Kit or QIAamp DNA Mini Kit. DNA quality and quantity was assessed using a Nanodrop Spectrophotometer and agarose gel electrophoresis.
打开matlab,进入picnic目录
重复第二步,输入:
preprocessing('GSM1949207.feature_intensity','info\','result\raw\','result\output\','result\')
需要7个参数,我这个是cell lines数据,所以后面两个参数省略不写!info文件夹自己下载放在picnic目录,其中result\raw 存放你第一步的结果文件!
这一步运行,会比较久!
3
同时可以看到程序在我的result目录里面新增加了两个目录用来存放结果,不过这一步的结果还是中间文件,就不解释了!
然后再重复第三步,输入:

 HMM('GSM1949207.feature_intensity','info\','result\output\','result\',10,0,2.0221,0.40997)

这个软件的参数很没规律,最好不要像说明书那些用output做文件夹名字,不然,很容易出错。
这一步好像也很耗时间!
4

重要的结果有两个:
56
06

拷贝数变异检测芯片介绍

这里的拷贝数变异检测芯片指的是Affymetrix Genome-Wide Human SNP Array 6.0

cel数据,需要处理成segment及genotype数据
这个芯片在TCGA计划里面用的非常多,是标配了。大家只要记住,这是一个跟拷贝数变异检测相关的芯片,而且还可以测一些genotype  
Affymetrix Genome-Wide Human SNP Array 6.0是唯一可以真正将CNP(拷贝数多态性)转化成高分辨率的参考图谱的平台。主要应用领域包括全基因组SNP分型、全基因组CNV分型、全基因组关联 分析、全基因组连锁分析。除了进行基因分型外,还为拷贝数研究和LOH研究提供帮助,从而能够进行:UPD检测、亲子鉴定、异常的亲代起源分析(针对 UPD和缺失)、纯合性分析、血缘关系鉴定。
SNP Array 6.0是昂飞公司继Mapping10k、100k、500k和SNP5.0芯片后推出的新一代SNP芯片。在一张芯片上可以分析一个样本906,600 个SNP的基因型, 大约有482,000个SNP来自于前代产品500K和SNP5.0芯片。剩下424,000个SNP包括了来源于国际HapMap计划中的标签 SNP,X,Y染色体和线粒体上更具代表性的SNP,以及来自于重组热点区域和500K芯片设计完成后新加入dbSNP数据库的SNP。该芯片同时含 946,000个非多态性CNV探针,用于检测拷贝数变异,其中202,000个用于检测5677个已知拷贝数变异区域的探针,这些区域来源于多伦多基因 组变异体数据库。该数据库中每隔3,182个非重叠片段区域分别用61个探针来检测。除了检测这些已知的拷贝数多态区域,还有超过744,000个探针平 均分配到整个基因组上,用来发现未知的拷贝数变异区域。SNP和CNV两种探针高密度且均匀地分布在整个基因组,作为拷贝数变异和杂合性缺失(LOH)检 测的工具来发现微小的染色体增加和缺失。为广大生命科学研究者提高发现复杂疾病相关基因的可能提供了强有力的工具。
通过与哈佛大学合办的Broad研究所合作,SNP6.0芯片在数据准确性和一致性方面达到了新的高度。相应推出的Genotyping Console用来处理SNP6.0芯片数据和全基因组遗传分析及质量控制。

产品特点:

1.涵盖超过1,800,000个遗传变异标志物:包括超过906,600个SNP和超过946,000个用于检测拷贝数变化(CNV,Copy Number Variation)的探针;

2.SNP和CNV两种探针高密度且均匀地分布在整个基因组,不仅可以用于SNP基因精确分型,还可用于拷贝数变异CNV的研究;

3.744,000个探针平均分配到整个基因组上,用来发现未知的拷贝数变异区域;

4.可用于Copy-neutral LOH/UPD检测,亲子鉴定,纯合性分析、血缘关系鉴定、遗传病或其它疾病的研究。

参考:http://www.biomart.cn/specials/cnv2014/article/84169

在NCBI的GEO数据库里面可以查到这个芯片,已经有一万多个样本数据啦!
图中第一个是CCLE计划的近千个样本,可能是定制化了的snp6.0芯片吧
clipboard
使用这个芯片数据来发文章的非常多,见列表:http://media.affymetrix.com/support/technical/other/snp6_array_publications.pdf
还有一篇2010-nature文章讲了如何用picnic来研究cnv,http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3145113/
也有一篇2010年的文章提出了新的软件来分析这个芯片cnv数据http://bioinformatics.oxfordjournals.org/content/26/11/1395.long
实现同样功能的软件,非常之多,还有一个R的bioconductor系列的包
clipboard2
随便进去都可以找到很多raw data,可以自己进行分析的!

 

05

寻找somatic突变的软件大合集

         其实somatic突变很容易理解,你测同一个人的正常组织和癌症组织,然后比较对这两个样本数据call出来的snp位点
         只存在癌症组织数据里面的snp位点就是somatic突变,在两个样本都存在的snp位点就是germline的突变,不过一般大家研究的都是somatic突变。
          当然,理论上是很简单,但是那么多统计学家要吃饭呀,肯定得把这件事搞复杂才行,所以就有了非常多的somatic突变 calling的软件,开个玩笑哈,主要是因为我们的测序并不是对单个细胞测序,我们通常意义取到的正常组织和癌症组织都不是纯的,所以会有很多关于这一点的讨论。
        正好我看到了一篇帖子,收集了大部分比较出名的做somatic mutation calling的软件,当然,我只用过mutect和varscan。

来自于:https://www.biostars.org/p/19104/

Here are a few more, a summary of the other answers, and updated links:

For a much more general discussion of variant calling (not necessarily somatic or limited to SNVs/InDels) check out this thread: What Methods Do You Use For In/Del/Snp Calling?

Some papers describing comparisons of these callers:

The ICGC-TCGA DREAM Mutation Calling challenge has a component on somatic SNV calling.

This paper used validation data to compare popular somatic SNV callers:

Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers

You'll need to update the link to MuTect. Broad Institute has begun to put portable versions of their tools on Github, like thelatest release of MuTectThe Genome Institute at WashU has been using Github for a while, but portable versions of their tools can be found here and here.

其实somatic的calling远比我们想象的要复杂:

To rehash/expand on what Dan said, if you're sequencing normal tissue, you generally expect to see single-nucleotide variant sites fall into one of three bins: 0%, 50%, or 100%, depending on whether they're heterozygous or homozygous.

With tumors, you have to deal with a whole host of other factors:

  1. Normal admixture in the tumor sample: lowers variant allele fraction (VAF)
  2. Tumor admixture in the normal - this occurs when adjacent normals are used, or in hematological cancers, when there is some blood in the skin normal sample
  3. Subclonal variants, which may occur in any fraction of the cells, meaning that your het-site VAF might be anywhere from 50% down to sub-1%, depending on the tumor's clonal architecture and the sensitivity of your method
  4. Copy number variants, cn-neutral loss of heterozygosity, or ploidy changes, all of which again shift the expected distribution of variant fractions

These, and other factors, make calling somatic variants difficult and still an area that is being heavily researched. If someone tells you that somatic variant calling is a solved problem, they probably have never tried to call somatic variants.

Sounds like somatic / tumor variant calling is something that will be solved by improvements at the wet lab side ( single cell selection / amplification / sequencing ) . Rather than at the computational side.

Well, single cell has a role to play (and would have more of one if WGA wasn't so lossy), but realistically, you can't sequence billions of cells from a tumor individually. Bulk sequencing still is going to have a role for quite a while.

Hell germ line calling isn't even a solved problem. Still get lots of false positives (and false negatives). It just tends to work so well that it is hard to improve it much except by making it faster, less memory intensive, etc

Solved was the wrong word. I just meant improved. There is only so much you can do at the computational side. Wet lab also has its part to play.

A germline variant caller generally has a ploidy-based genotyping algorithm built in to part of the algorithm/pipeline. I believe, IIRC, the GATK UnifiedGenotyper for instance does both variant calling and then genotype calling. So to call a genotype for a variant it is expecting a certain number of reads to support the alternative allele. When working with somatic variants all of the assumptions about how many reads you expect with a variant at a position to distinguish between true and false positives are no longer valid. Except for fixed mutations throughout the tumor population only some proportion of cells will hold a somatic variation. You also typically have some contamination from normal non-cancerous cells. Add in complications from significant genomic instability with lots of copy number variations and such and you have a need for a major change in your model for calling variation while minimizing artifactual calls. So you have a host of other programs that have been developed specifically for looking at somatic variation in tumor samples.

一篇文献:

Comparison of somatic mutation calling methods in amplicon and whole exome sequence data

是qiagen公司发的

High-throughput sequencing is rapidly becoming common practice in clinical diagnosis and cancer research. Many algorithms have been developed for somatic single nucleotide variant (SNV) detection in matched tumor-normal DNA sequencing. Although numerous studies have compared the performance of various algorithms on exome data, there has not yet been a systematic evaluation using PCR-enriched amplicon data with a range of variant allele fractions. The recently developed gold standard variant set for the reference individual NA12878 by the NIST-led “Genome in a Bottle” Consortium (NIST-GIAB) provides a good resource to evaluate admixtures with various SNV fractions.

Using the NIST-GIAB gold standard, we compared the performance of five popular somatic SNV calling algorithms (GATK UnifiedGenotyper followed by simple subtraction, MuTect, Strelka, SomaticSniper and VarScan2) for matched tumor-normal amplicon and exome sequencing data.

Nevertheless, detecting somatic mutations is still challenging, especially for low-allelic-fraction variants caused by tumor heterogeneity, copy number alteration, and sample degradation

We used QIAGEN’s GeneRead DNAseq Comprehensive Cancer Gene Panel (CCP, Version 1) for enrichment and library construction in triplicate。

QIAGEN’s GeneRead DNAseq Comprehensive Cancer Gene Panel (Version 1) was used to amplify the target region of interest (124 genes, 800 Kb).

When analyzing different types of data, use of different algorithms may be appropriate.

DNA samples of NA12878 and NA19129 were purchased from Coriell Institute. Sample mixtures were created based on the actual amplifiable DNA in each sample, resulting in 0%, 8%, 16%, 36%, and 100% of NA12878 sample mixed in the NA19129 sample, respectively.We treated the mixed samples at 8%, 16%, 36%, and 100% as the virtual tumor samples and the 0% as the virtual normal sample.

五个软件的算法是:

1. NaiveSubtract — SNVs were called separately from virtual tumor and normal samples using GATK UnifiedGenotyper [22]. For exome sequencing data, reads were already mapped, locally realigned and recalibrated by the 1,000 Genomes Project. So SNVs were directly called on the BAM files using GATK Unified Genotyper. Then, SNVs detected in the virtual normal sample were removed from the list of SNVs detected in the virtual tumor sample, leaving the “somatic” SNVs.

2. MuTect — MuTect is a method developed for detecting the most likely somatic point mutations in NGS data using a Bayesian classifier approach. The method includes pre-processing aligned reads separately in tumor and normal samples and post-processing resulting variants by applying an additional set of filters. We ran MuTect under the High-Confidence mode with its default parameter settings. We disabled the “Clustered position” filter and the “dbSNP filter” for the amplicon sequencing reads, and we disabled the “dbSNP filter” for the exome sequencing.

3. SomaticSniper — SomaticSniper calculates the Bayesian posterior probability of each possible joint genotype across the normal and cancer samples. We tuned the software’s parameters to increase sensitivity and then filtered raw results using a Somatic Score cut-off of 20 to improve specificity.

4. Strelka — Strelka reports the most likely genotype for tumor and normal samples based on a Bayesian probability model. Post-calling filters built into the software are based on factors such as read depth, mismatches, and overlap with indels. We skipped depth filtration for exome and amplicon sequencing data as recommended by the Strelka authors. For the amplicon sequencing reads, we set the minimum MAPQ score at 17 for consistency with the defaults in GATK UnifiedGenotyper. We used variants passing Strelka post-calling filters for analysis.

5. VarScan2 — VarScan2 performs analyses independently on pileup files from the tumor and normal samples to heuristically call a genotype at positions achieving certain thresholds of coverage and quality. Then, sites of the genotypes not matched in tumor and normal samples are classified into somatic, germline, or ambiguous groups using Fisher’s exact test. We generated the pileup files using SAMtools mpileup command.

The compatibility of the output VCF files between different methods as well as the NIST-GIAB gold standard was examined using bcbio.variation tools and manual inspection. The reported SNP call representations between files are comparable to each other.

来自于文献:http://www.biomedcentral.com/1471-2164/15/244

05

使用oncotator做突变注释

功能:vcf格式突变数据进一步注释成maf格式

做过癌症数据分析的童鞋都知道,TCGA里面用maf格式来记录突变!那么maf格式的数据是如何得来的呢,我们都知道,做完snp-calling一般是得到vcf格式的突变记录数据文件,然后再用annovar或者其它蛋白结构功能影响预测软件注释一下,还远达不到maf的近100条记录。

而大名鼎鼎的broad institute就规定了maf格式的突变注释文件,他就是利用了十几个常见的已知数据库来注释我们得到的vcf突变记录,通常是对somatic的突变才注释成maf格式的数据!
大名鼎鼎的broadinstitute出品的突变注释工具:http://www.ncbi.nlm.nih.gov/pubmed/25703262
本身也是一个在线工具:
集成了下面所有的分析资源
而且还提供了API

Genomic Annotations

  • Gene, transcript, and functional consequence annotations using GENCODE for hg19.
  • Reference sequence around a variant.
  • GC content around a variant.
  • Human DNA Repair Gene annotations from Wood et al.

Protein Annotations

  • Site-specific protein annotations from UniProt.
  • Functional impact predictions from dbNSFP.

Cancer Variant Annotations

Non-Cancer Variant Annotations

因为要下载的数据有点多,我这里就不用自己的电脑测试了,安装过程也很简单的!

 

十二 30

用GSEA来做基因集富集分析

how to use GSEA?
这个有点类似于pathway(GO,KEGG等)的富集分析,区别在于gene set(矫正好的基于文献的数据库)的概念更广泛一点,包括了

how to download GSEA ?

what's the input for the GSEA?

说明书上写的输入数据是:GSEA supported data files are simply tab delimited ASCII text files, which have special file extensions that identify them. For example, expression data usually has the extension *.gct, phenotypes *.cls, gene sets *.gmt, and chip annotations *.chip. Click the More on file formats help button to view detailed descriptions of all the data file formats.
实际上没那么复杂,一个表达矩阵即可!然后做一个分组说明的cls文件即可。
主要是自己看说明书,做出要求的数据格式:http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
表达矩阵我这里下载GSE1009数据集做测试吧!
cls的样本说明文件,就随便搞一搞吧,下面这个是例子:
6 2 1
# good bad
good good good bad bad bad
文件如下,六个样本,根据探针来的表达数据,分组前后各三个一组。
clipboard
现在开始运行GSEA!

start to run the GSEA !

首先载入数据
clipboard
确定无误,就开始运行,运行需要设置一定的参数!
clipboard

what's the output ?

输出的数据非常多,对你选择的gene set数据集里面的每个set都会分析看看是否符合富集的标准,富集就出来一个报告。
点击success就能进入报告主页,里面的链接可以进入任意一个分报告。
最大的特色是提供了大量的数据集:You can browse the MSigDB from the Molecular Signatures Database page of the GSEA web site or the Browse MSigDB page of the GSEA application. To browse the MSigDB from the GSEA application:
 
有些文献是基于GSEA的:

 

十二 29

用firehose_get 来下载所有TCGA寄存在broad的数据

该软件是broad institute写的一个数据接口,主要是供他人下载TCGA的所有寄存在broad institute的免费数据,主要是level3,level4的数据。(说错了,好像只有level4的数据,就是可以发文章的分析结果及图片)
软件下载地址:https://confluence.broadinstitute.org/display/GDAC/Download

懂它的使用规则,编码规则即可:
就是一个很简单的shell脚本而已,根据几个用户自定义参数来选择性的下载数据。
clipboard
我们可以用-t这个参数来指定下载的数据类型,可以是mut/rna/mutsig/gistic等各种数据,至于这些单词代表什么意义,需要自己去看说明书啦
还可以指定时间,截止到什么时间的数据!
它支持的癌症种类:

ACC  BLCA  BRCA  CESC  COAD  COADREAD  DLBC  ESCA  
	GBM  HNSC  KICH  KIRC  KIRP  LAML  LGG  LIHC  
	LUAD  LUSC  OV  PAAD  PANCANCER  PANCAN8  PANCAN12  PRAD  
	READ  SARC  SKCM  STAD  THCA  UCEC  UCS
这些癌症种类的简称,也是可以去官网里面看到的!官网:http://gdac.broadinstitute.org

 

十二 29

所有TCGA收集的mRNA表达数据集数据集-GSE62944

无意中看一篇文献,有提到这个数据集,我简单看了一下,居然还真的这么全面!!!
它处理了目前(大概是2015年6月)TCGA收集的所有癌症样本的mRNA表达数据,并且统一处理成了count和RPKM两种表达量形式。
GEO地址:http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944

Title Alternatively processed and compiled RNA-Sequencing and clinical data for thousands of samples from The Cancer Genome Atlas
Organism Homo sapiens
Experiment type Expression profiling by high throughput sequencing
Summary We reprocessed RNA-Seq data for 9264 tumor samples and 741 normal samples across 24 cancer types from The Cancer Genome Atlas with "Rsubread". Rsubread is an open source R package that has shown high concordance with other existing methods of alignment and summarization, but is simple to use and takes significantly less time to process data. Additionally, we provide clinical variables publicly available as of May 20, 2015 for the tumor samples where the TCGA ids are matched.
这样就非常方便大家使用了。
可以直接拿来进行聚类,看看不同癌症种类如何聚集在一起,也可以直接拿来跟某个临床指标进行关联分析.
比如,如果我们想用DEseq来做差异分析,那么需要的数据就是基因的count,这里就有近一万个癌症样本的基因表达的count值,随便都可以分类看看差异情况,再富集分析分析。非常适合初学者练手。
癌症种类见官网:http://gdac.broadinstitute.org/
clipboard

 

十二 29

自动化出网页报告的R语言包-Nozzle

根据测试代码输出的报告如下:
clipboard
这些报告里面的元素,都是测试代码添加的,很容易就能理解
十二 29

用Mutation-Assessor软件来看突变位点对基因或者蛋白功能的影响

这是一个在线工具,非常好用,对snp位点进行注释,看看该突变是否影响蛋白功能,一定要收藏!!!

官网:http://mutationassessor.org/

也应该是有standalone版本,我没有去找,不过网页版就很好用了,只需要复制粘贴进去自己想分析的数据,按照一定的格式即可,比如:
clipboard
该软件就能分析出该突变位点发生在BRCA2这个基因上面,对氨基酸的改变也能写出来,对蛋白功能的改变等选项都是可以自由定制化的。
输入数据非常广泛:
The server accepts list of variants, one variant per line, plus optional text describing your variants,
in genomic coordinates, "+" strand assumed :
<genome build>,<chromosome>,<position>,<reference allele>,<substituted allele>
Genome build is optional (build 18 assumed), accepted values: 'hg18' and 'hg19'
Examples:

hg19,13,32912555,G,T   BRCA2
hg18,7,55178574,G,A   GBM
7,55178574,G,A   GBM

or in protein space: <protein ID> <variant> <text>, where <protein ID> can be :

1. Uniprot protein accession (i.e. EGFR_HUMAN)
2. NCBI Refseq protein ID (i.e. NP_005219)

examples:

EGFR_HUMAN R521K
EGFR_HUMAN R98Q Polymorphism
EGFR_HUMAN G719D disease
NP_000537 G356A
NP_000537 G360A dbSNP:rs35993958
NP_000537 S46A Abolishes phosphorylation

ID types can be mixed in one list in any way.

 

十二 25

使用进度条和并行计算来模拟计算赌博输赢概率

http://www.zhihu.com/question/19605599

以前看到过一个赌徒「必胜方法」
有一个广泛流传于赌徒中的“必胜方法”,是关于赌徒悖论极好的说明:一个赌场设有“猜大小”的赌局,玩家下注后猜“大”或者猜“小”,如果输了,则失去赌注,如果赢的话,则获得本金以及0.9倍的利润。 “必胜法”是这么玩的:
(1)押100猜“大”;
(2)如果赢的话,返回(1)继续;
(3)如果输的话则将赌注翻倍后继续猜“大”,因为不可能连续出现“大”,总会有获胜的时候,而且由于赌注一直是翻倍,只要赢一次,就会把所有输掉的钱赢回;
(4)只要赢了,就继续返回(1)。

一个朋友说写了程序来验证正确与否,我感觉蛮有趣的,想想自己也学了不少,也应该实践一下啦!
写出程序不难,就一个递归而已,不过是应该用并行计算,不然算的太慢了
另外一个重点是,如何随机???
[R]
judge=function(l=left,c=count,n=number,w=win){
c=c+1;
if(rnorm(1)>0){
l=l+2^n
n=1 # win
w=w+1
}else{
l=l-2^n
n=n+1 #lose
}
return(c(l,c,n,w))
}
my_fun=function(x){
tmp=judge(0,1,1,0)
for(i in 1:x){
tmp=judge(tmp[1],tmp[2],tmp[3],tmp[4])
print(tmp)
}
return(tmp[1])
}
my_fun(10)
[/R]
第一个函数是根据现在的状态,接受你的本金,第几次赌了,以及第几次连续下注,最后,你总共赢了多少次,这四个参数,然后随机给一个大小概率,这样你的本金会变化,赌的次数会加1
第二个函数是传入我们需要赌多少次,看结果:
> my_fun(10)
[1] -2  2  2  0
[1] 2 3 1 1
[1] 0 4 2 1
[1] -4  5  3  1
[1] 4 6 1 2
[1] 6 7 1 3
[1] 8 8 1 4
[1] 6 9 2 4
[1]  2 10  3  4
[1] 10 11  1  5
[1]  8 12  2  5
[1] 8
如果我们赌11次,可以看到,我最后会剩余8块钱,每次输赢的情况都反应在里面了,可以自己模拟多次看看!
因为我只赌了11次,所以很快,如果我赌1000次,而且还想检验一下10000次模拟结果,就会比较慢了!
我首先使用进度条模拟一下结果,代码如下:
2
还是比较慢的##Time difference of 1.861802 mins
我用了apply,好像时间是节省了一些,不过聊胜于无!
> library(pbapply)
> start=Sys.time()
> results=pbsapply(1:10000,function(i){my_fun(1000)})
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
> Sys.time()-start  ##9.909524 secs
Time difference of 1.301539 mins
那接下来我们分析一下模拟的结果吧
> summary(results)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -64580     974     998     985    1020    1110
可以看到,我们平均下来是会赢钱的,而且赢面很大,赢的次数非常多!!!
但是,我们看下面的图就知道,一个很诡异的结果!
而且,我这里用的模拟胜负,并不是很好
3
这其实还只是小批量模拟,如果要模拟百亿次,首先我的笔记本肯定不行,cpu太破了,如果用服务器,就需要用并行计算啦!
##下面是用多核并行计算的代码,大家有兴趣可以自己去玩一下
library(parallel)
cl.cores <- detectCores() 
cl <- makeCluster(cl.cores)
clusterExport(cl, "judge") 
start=Sys.time()
results=parSapply( 
  cl=cl, 
  1:10000,
  my_fun(1000)
)
Sys.time()-start  ##4.260994 secs
stopCluster(cl)


十二 25

做癌症研究一定要把这几十篇TCGA的大文章看完

都是发表在nature,cell还有新英格兰医学杂志上面的超级文章!每个文章附件都有一百多页,比博士论文还长,但是它们的分析套路其实都一样,都是那几种数据,包括WGS,WES,RNA-Seq,芯片表达量,miRNA表达量,甲基化数据,蛋白数据。分析过程也差不多,无法就是对癌症进行进一步的分类,癌症亚型,或者看看driver mutation,进一步解释癌症病变,转移,扩散机理,或者找标记物signature,辅助治疗等等,具体的要等我把这些文献看完了才能再进一步讲解,请做癌症研究方向的一定要把它们看完。

1

我已经下载完了,大家如果没有权限下载,就需要自己想办法啦!

image

非常值得大家阅读!!!

 

十二 24

使用R包cgdsr来下载TCGA的数据

前面我讲到TCGA的数据可以在5个组织机构可以获取,他们都提供了类似的接口来供用户下载数据

每个接口都有使用教程,比如http://firebrowse.org/tutorial/FireBrowse-Tutorial.pdf

非常详细!!!

有的还专门写了软件接口:https://confluence.broadinstitute.org/display/GDAC/Download

或者写了R的接口:http://www.cbioportal.org/cgds_r.jsp

接下来我们要讲的就是cbioportal网站提供的一个R接口,非常好用,只需记住4个函数即可!!! Continue reading