11

对CCLE数据库可以做的分析

收集了那么多的癌症细胞系的表达数据,拷贝数变异数据,突变数据,总不能放着让它发霉吧!
这些数据可以利用的地方非常多,但是在谷歌里面搜索引用了它的文章却不多,我挑了其中几个,解读了一下别人是如何利用这个数据的,当然,主要是用那个mRNA的表达数据咯!
这篇文献对CCLE的数据进行了八个步骤的处理,一个合格的生物信息学分析着完全可以重写这个过程
step1:Affymetrix U133 Plus2 DNA microarray gene expressions of 27 gastric cancer cell lines (Kato-III, IM95, SNU-620, SNU-16, OCUM-1, NUGC-4, 2313287, HUG1N, MKN45, NCIN87, KE39, AGS, SNU-5, SNU-216, NUGC-3, NUGC-2, MKN74, MKN7, RERFGC1B, GCIY, KE97, Fu97, SH10TC, MKN1, SNU-1, Hs746 T, HGC27) were downloaded from Cancer Cell Line Encyclopedia (CCLE) [16] in March 2013.
step2: Robust Multi-array Average (RMA) normalization was performed. Principal component analysis plot show no obvious batch effect.
step3: The normalized data is then collapsed by taking the probe sets with highest gene expression.
前三步是为了得到27个胃癌相关细胞系的mRNA表达矩阵,方法是下载cel文件,用RMA归一化,对多探针基因去最大表达量探针!

step4:Unsupervised hierarchical clustering (1-Spearman distance, average linkage) was performed on the cell lines using the aCGH data.

Putative driver genes of which copy number aberrations correlated to mRNA gene expression were identified to determine subtypes or clusters that are driven by different mechanisms. This was done using Mann Whitney U-test with p<0.05, and Spearman Correlation Coefficient test with Rho >0.6.

step5:We then performed consensus clustering[17] on the gene expression data of the 27 gastric cancer cell lines from CCLE using these putative driver genes. We selected k = 2 as it gives sufficiently stable similarity matrix.

step6: In order to assign new samples to this integrative cluster, significance analysis of microarray (SAM) [18]with threshold q<2.0 was used to generate subtype signature based on the mRNA expression data of the 1762 genes from the 27 gastric cancer cell lines in CCLE.

先用甲基化数据来聚类,得到putative driver genes,然后再用这些基因的表达数据来再次聚类,分成两类,然后对这两类进行SAM找差异基因

step7:ssGSEA (single sample GSEA)was used to estimate pathway activities of the gastric cancer cell line in the Molecular Signature Database v3.1 (Msigdb v3.1) [19][20]. The pathway activities are represented in enrichment scores which were rank normalized to [0.0, 1.0]. 
step8:SAM analysis was performed with threshold q<0.2, and fold change >2.0 (for up-regulated pathways), or <0.5 (for down-regulated pathways) to obtain subtype-specific pathways from the 27 gastric cell lines in CCLE.
这里既用来gene set的富集分析,又用来超几何分布的富集分析,结果去看看这篇文章就知道了!
 
这篇文章只用了CCLE的一个地方,就是看看不同cancer type里面的某个基因表达boxplot
这个图的数据用GEOquery可以得到,样本的分类信息也用GEOquery可以得到,这样就可以做下面这个图了,非常简单
Further, the Cancer Cell Line Encyclopedia (CCLE) database demonstrated that of 1062 cell lines representing 37 distinct cancer types, glioma cell lines express the highest levels of STK17A
1

结论就是:STK17A is highly expressed in glioma cell lines compared to other cancer types. Data was obtained through the Cancer Cell Line Encyclopedia (CCLE).

第三篇文献:http://www.nature.com/ncomms/2013/130709/ncomms3126/fig_tab/ncomms3126_F4.html

这篇文献更简单了,直接对这个表达矩阵进行聚类:
 
The 5,000 most variable genes were used for unsupervised clustering of cell lines by mRNA expression data. Cell lines are colour-coded (vertical bars) according to the reported tissue of origin (a PDF version that can be enlarged at high resolution is in Supplementary InformationSupplementary Fig. S4); horizontal labels at bottom indicate the dominating tissue types within the respective branches of the dendrogram. Most ovarian cancer cell lines (magenta) cluster together, interspersed with endometrial cell lines. However, some ovarian cancer cell lines cluster with other tissue types (*). Top right panels: neighbourhoods (1) of the top cell lines in our analysis, (2) of cell line IGROV1, and (3) of cell line A2780. For the ovarian cancer cell lines in these enlarged areas, the histological subtype as assigned in the original publication is indicated by coloured letters.
就直接拿整个表达矩阵即可,然后挑选变异最大的5000个基因来进行聚类,就可以得到类似的图

 

11

CCLE数据库几个知识点

Here we describe the Cancer Cell Line Encyclopedia (CCLE): a compilation of gene expression, chromosomal copy number, and massively parallel sequencing data from 947 human cancer cell lines. 
收集了三种数据:
The mutational status of >1,600 genes was determined by targeted massively parallel sequencing, followed by removal of variants likely to be germline events . 
Moreover, 392 recurrent mutations affecting 33 known cancer genes were assessed by mass spectrometric genotyping13 . 
DNA copy number was measured using high-density single nucleotide polymorphism arrays (Affymetrix SNP 6.0; Supplementary Methods). 
Finally, mRNA expression levels were obtained for each of the lines using Affymetrix U133 plus 2.0 arrays. 
These data were also used to confirm cell line identities .
一般用得最多的就是表达数据,因为表达数据最简单,大多数生物信息学分析着只会用这个数据!
而它的突变数据又不是通常意义的高通量测序得到的,snp6芯片数据很多人听都没听过
文章的附件有对cell lines的具体描述。
different_kinds_of_cancer_in_CCLE
CCLE的数据在broad institute里面可以下载,也放在GEO数据库里面,我比较喜欢GEO里面的数据
This SuperSeries is composed of the following SubSeries:
GSE36133 Expression data from the Cancer Cell Line Encyclopedia (CCLE)
GSE36138 SNP array data from the Cancer Cell Line Encyclopedia (CCLE)
GSE36133这个study的metadata里面有对每个cellline来源的cancer进行描述!
有人喜欢把这个metadata叫做是clinical data。
library(GEOquery)
ccleFromGEO <- getGEO("GSE36133")
annotBlock1 <- pData(phenoData(ccleFromGEO[[1]]))
>dim(annotBlock1)
[1] 917  38
exprSet=exprs(ccleFromGEO[[1]])
> dim(exprSet)
[1] 18926   917
##它的表达数据矩阵,包含了18926个基因,列名是917个细胞系的名字,行是基因的entrez ID
keyColumns <- c("title", "source_name_ch1", "characteristics_ch1", "characteristics_ch1.1", 
    "characteristics_ch1.2")
options(stringsAsFactors = F)
allAnnot=annotBlock1[,keyColumns]
##这几列信息是比较重要的metadata,里面详细记录了细胞系的收集公司单位,tissue,癌症分类等信息
Cell line (1035个细胞系简介)Gene Sets
1035 sets of genes with high or low expression in each cell line relative to other cell lines from the CCLE Cell Line Gene Expression Profiles dataset.
一些关于CCLE数据库的文章:
http://onlinelibrary.wiley.com/doi/10.1002/cncy.21471/pdf 介绍了几个类似的数据库资源
Anticancer drug sensitivity analysis: An integrated approach applied to Erlotinib sensitivity prediction in the CCLE database