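Joint analysis of RNA-seq and copy number variation data for 1,000+ cell lines in the CCLE database

Posted on February 14, 2018 by ulwvfje

I noticed that the last figure in the supplementary material of this Science paper is: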

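Analyses you can do with the CCLE database

Posted on January 11, 2016 by ulwvfje

With so much expression, copy number variation, and mutation data collected for cancer cell lines, it would be a shame to let it all gather dust!

There are many ways to use these data, yet a Google search turns up surprisingly few papers that cite the resource. I picked a few of them and walked through how the authors used the data. Unsurprisingly, they mostly use the mRNA expression data!

Paper 1: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0111146

This paper processes the CCLE data in eight steps; a competent bioinformatics analyst should be able to reproduce the whole workflow.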

step1: Affymetrix U133 Plus2 DNA microarray gene expressions of 27 gastric cancer cell lines (Kato-III, IM95, SNU-620, SNU-16, OCUM-1, NUGC-4, 2313287, HUG1N, MKN45, NCIN87, KE39, AGS, SNU-5, SNU-216, NUGC-3, NUGC-2, MKN74, MKN7, RERFGC1B, GCIY, KE97, Fu97, SH10TC, MKN1, SNU-1, Hs746 T, HGC27) were downloaded from Cancer Cell Line Encyclopedia (CCLE) [16] in March 2013.

step2: Robust Multi-array Average (RMA) normalization was performed. Principal component analysis plot show no obvious batch effect.

step3: The normalized data is then collapsed by taking the probe sets with highest gene expression.

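The first three steps produce the mRNA expression matrix for the 27 gastric cancer cell lines: download the CEL files, normalize them with RMA, and, for genes covered by several probe sets, keep the probe set with the highest expression!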
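The probe-collapsing step can be sketched in base R as follows. This is only an illustration on made-up numbers: the probe-to-gene mapping `probe2gene` and the helper `collapse_to_max_probe` are hypothetical, and ranking probe sets by their mean across samples is my assumption about what "highest gene expression" means in the paper.

```r
# Toy probe-level matrix: 5 probe sets x 3 samples (values are made up)
probe_expr <- matrix(
  c(5.1, 5.3, 5.0,
    7.2, 7.0, 7.4,
    6.0, 6.1, 5.9,
    8.5, 8.7, 8.6,
    4.0, 4.2, 4.1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("probe", 1:5), paste0("sample", 1:3))
)

# Hypothetical probe-to-gene mapping (in practice taken from the array annotation)
probe2gene <- c(probe1 = "GENE_A", probe2 = "GENE_A",
                probe3 = "GENE_B",
                probe4 = "GENE_C", probe5 = "GENE_C")

collapse_to_max_probe <- function(expr, mapping) {
  genes <- mapping[rownames(expr)]
  probe_mean <- rowMeans(expr)
  # For each gene, keep the probe set with the highest mean expression (assumed criterion)
  best_probe <- sapply(split(names(probe_mean), genes),
                       function(p) p[which.max(probe_mean[p])])
  collapsed <- expr[best_probe, , drop = FALSE]
  rownames(collapsed) <- names(best_probe)
  collapsed
}

gene_expr <- collapse_to_max_probe(probe_expr, probe2gene)
```

The result has one row per gene, so it can be used directly as the gene-level expression matrix in the later steps.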

step4: Unsupervised hierarchical clustering (1-Spearman distance, average linkage) was performed on the cell lines using the aCGH data.
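For readers who have not used this distance before, here is a minimal base R sketch of hierarchical clustering with a 1-Spearman distance and average linkage. The matrix `cn` is random toy data standing in for the aCGH profiles, and cutting the tree at k = 2 is only for illustration.

```r
set.seed(1)
# Toy copy-number-like matrix: 50 features x 8 cell lines (random values, illustration only)
cn <- matrix(rnorm(50 * 8), nrow = 50,
             dimnames = list(paste0("feature", 1:50), paste0("line", 1:8)))

# Spearman correlation between cell lines (columns), turned into a distance
spearman_cor <- cor(cn, method = "spearman")
d <- as.dist(1 - spearman_cor)

# Average-linkage hierarchical clustering
hc <- hclust(d, method = "average")

# Cut the tree into two groups; plot(hc) would draw the dendrogram
groups <- cutree(hc, k = 2)
```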

Putative driver genes of which copy number aberrations correlated to mRNA gene expression were identified to determine subtypes or clusters that are driven by different mechanisms. This was done using Mann Whitney U-test with p<0.05, and Spearman Correlation Coefficient test with Rho >0.6.
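A rough sketch of this filter on simulated data is below. The paper does not say (at least in the passage quoted here) which two groups the Mann-Whitney test compares, so splitting the cell lines at the median copy number is purely my assumption; `test_gene` is a hypothetical helper.

```r
set.seed(42)
n_lines <- 27
genes <- paste0("gene", 1:4)

# Toy copy-number and expression values; gene1 and gene2 are simulated so that
# expression follows copy number, gene3 and gene4 are pure noise
cn   <- matrix(rnorm(4 * n_lines), nrow = 4, dimnames = list(genes, NULL))
expr <- matrix(rnorm(4 * n_lines), nrow = 4, dimnames = list(genes, NULL))
expr[1:2, ] <- 2 * cn[1:2, ] + rnorm(2 * n_lines, sd = 0.3)

test_gene <- function(g) {
  rho <- cor(cn[g, ], expr[g, ], method = "spearman")
  # Assumed grouping: cell lines with above-median copy number vs. the rest
  altered <- cn[g, ] > median(cn[g, ])
  p <- wilcox.test(expr[g, altered], expr[g, !altered])$p.value
  c(rho = rho, p = p)
}

stats <- t(sapply(genes, test_gene))

# Thresholds quoted from the paper: p < 0.05 and rho > 0.6
drivers <- rownames(stats)[stats[, "rho"] > 0.6 & stats[, "p"] < 0.05]
```

On real data `cn` and `expr` would be gene-by-cell-line matrices with matching row and column order.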

step5: We then performed consensus clustering [17] on the gene expression data of the 27 gastric cancer cell lines from CCLE using these putative driver genes. We selected k = 2 as it gives sufficiently stable similarity matrix.

step6: In order to assign new samples to this integrative cluster, significance analysis of microarray (SAM) [18] with threshold q<2.0 was used to generate subtype signature based on the mRNA expression data of the 1762 genes from the 27 gastric cancer cell lines in CCLE.

In short: first cluster on the aCGH (copy number) data and identify the putative driver genes, then cluster again on the expression of those genes to split the cell lines into two groups, and finally run SAM between the two groups to find differentially expressed genes.

step7: ssGSEA (single sample GSEA) was used to estimate pathway activities of the gastric cancer cell line in the Molecular Signature Database v3.1 (Msigdb v3.1) [19], [20]. The pathway activities are represented in enrichment scores which were rank normalized to [0.0, 1.0].
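The rank normalization to [0.0, 1.0] can be sketched like this. The enrichment scores are invented, and normalizing each pathway across cell lines with `(rank - 1) / (n - 1)` is one plausible reading of the sentence, not necessarily the exact scheme the authors used.

```r
# Toy enrichment scores: 4 pathways x 5 cell lines (made-up numbers)
es <- matrix(c( 0.31, -0.12,  0.05,  0.44, -0.30,
               -0.20,  0.10,  0.25, -0.05,  0.40,
                0.02,  0.03, -0.01,  0.00,  0.01,
                0.50,  0.45,  0.55,  0.60,  0.35),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("pathway", 1:4), paste0("line", 1:5)))

# Rank-normalize each pathway across cell lines: lowest score -> 0, highest -> 1
rank_normalize <- function(x) (rank(x) - 1) / (length(x) - 1)
es_norm <- t(apply(es, 1, rank_normalize))
```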

step8: SAM analysis was performed with threshold q<0.2, and fold change >2.0 (for up-regulated pathways), or <0.5 (for down-regulated pathways) to obtain subtype-specific pathways from the 27 gastric cell lines in CCLE.

This part uses both gene-set enrichment analysis and hypergeometric enrichment analysis; read the paper to see the results!
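To make the step 8 thresholds concrete, here is a small sketch that applies them to a results table. The table `res` and its column names are hypothetical (something you might export from a SAM run); only the cutoffs come from the paper.

```r
# Hypothetical per-pathway results table, e.g. exported from a SAM run
res <- data.frame(
  pathway     = c("P1", "P2", "P3", "P4", "P5"),
  fold_change = c(2.5, 0.4, 1.2, 3.0, 0.3),
  q_value     = c(0.05, 0.10, 0.01, 0.50, 0.15),
  stringsAsFactors = FALSE
)

# q < 0.2, then fold change > 2.0 (up-regulated) or < 0.5 (down-regulated)
significant <- res$q_value < 0.2
up   <- res$pathway[significant & res$fold_change > 2.0]
down <- res$pathway[significant & res$fold_change < 0.5]
```

Note that P4 has a large fold change but fails the q-value cutoff, so it is not reported as up-regulated.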

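Paper 2: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0081803#pone.0081803.s001

This paper uses CCLE for just one thing: a boxplot of a single gene's expression across cancer types.

Both the expression data and the sample classification can be retrieved with GEOquery, so reproducing the paper's figure is very simple.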

Further, the Cancer Cell Line Encyclopedia (CCLE) database demonstrated that of 1062 cell lines representing 37 distinct cancer types, glioma cell lines express the highest levels of STK17A

The conclusion: STK17A is highly expressed in glioma cell lines compared to other cancer types. Data was obtained through the Cancer Cell Line Encyclopedia (CCLE).
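A plot of this kind takes only a few lines of base R. The sketch below uses simulated values for one gene in three made-up groups; with the real data, `expression` would be one row of the CCLE expression matrix and `cancer_type` would come from the sample metadata.

```r
set.seed(7)
# Simulated expression of one gene in 3 hypothetical cancer types, 20 cell lines each
cancer_type <- rep(c("glioma", "lung", "breast"), each = 20)
expression  <- c(rnorm(20, mean = 9), rnorm(20, mean = 7), rnorm(20, mean = 6.5))

# Order the types by median expression so the boxes are sorted, highest first
med <- tapply(expression, cancer_type, median)
cancer_type <- factor(cancer_type, levels = names(sort(med, decreasing = TRUE)))

boxplot(expression ~ cancer_type, las = 2,
        ylab = "log2 expression", xlab = "",
        main = "One gene across cancer types")
```

Sorting the factor levels by median makes it easy to see at a glance which cancer type expresses the gene most strongly.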

Paper 3: http://www.nature.com/ncomms/2013/130709/ncomms3126/fig_tab/ncomms3126_F4.html

This paper is even simpler; it clusters the expression matrix directly:

Evaluating cell lines as tumour models by comparison of genomic profiles

The 5,000 most variable genes were used for unsupervised clustering of cell lines by mRNA expression data. Cell lines are colour-coded (vertical bars) according to the reported tissue of origin (a PDF version that can be enlarged at high resolution is in Supplementary Information, Supplementary Fig. S4); horizontal labels at bottom indicate the dominating tissue types within the respective branches of the dendrogram. Most ovarian cancer cell lines (magenta) cluster together, interspersed with endometrial cell lines. However, some ovarian cancer cell lines cluster with other tissue types (*). Top right panels: neighbourhoods (1) of the top cell lines in our analysis, (2) of cell line IGROV1, and (3) of cell line A2780. For the ovarian cancer cell lines in these enlarged areas, the histological subtype as assigned in the original publication is indicated by coloured letters.

Just take the whole expression matrix, pick the 5,000 most variable genes, and cluster on them; that gives a similar figure.
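A minimal sketch of that recipe on a toy matrix is shown below. The data are random, `n_top` is set to 50 instead of 5,000 because the toy matrix is small, and the Euclidean distance with average linkage is my assumption, since the figure legend quoted above does not state the clustering settings.

```r
set.seed(123)
# Toy expression matrix: 200 genes x 12 cell lines (random numbers, illustration only)
expr <- matrix(rnorm(200 * 12), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("line", 1:12)))

# Keep the n most variable genes (the paper used 5,000)
n_top <- 50
gene_var  <- apply(expr, 1, var)
top_genes <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_top)]
expr_top  <- expr[top_genes, ]

# Cluster the cell lines; distance and linkage are assumptions
hc <- hclust(dist(t(expr_top)), method = "average")
# plot(hc) draws the dendrogram
```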

A few key facts about the CCLE database

Posted on January 11, 2016 by ulwvfje

The paper that introduced CCLE: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3320027/

Here we describe the Cancer Cell Line Encyclopedia (CCLE): a compilation of gene expression, chromosomal copy number, and massively parallel sequencing data from 947 human cancer cell lines.

Three types of data were collected:

The mutational status of >1,600 genes was determined by targeted massively parallel sequencing, followed by removal of variants likely to be germline events.

Moreover, 392 recurrent mutations affecting 33 known cancer genes were assessed by mass spectrometric genotyping [13].

DNA copy number was measured using high-density single nucleotide polymorphism arrays (Affymetrix SNP 6.0; Supplementary Methods).

Finally, mRNA expression levels were obtained for each of the lines using Affymetrix U133 plus 2.0 arrays.

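These data were also used to confirm cell line identities.

The expression data are by far the most widely used, because they are the easiest to work with; most bioinformatics analysts only know how to handle this data type!

The mutation data, on the other hand, do not come from high-throughput sequencing in the usual sense (they come from targeted sequencing of more than 1,600 genes plus mass-spectrometric genotyping), and many people have never even heard of SNP 6.0 array data.

The paper's supplementary files describe each cell line in detail.

CCLE data can be downloaded from the Broad Institute and are also deposited in GEO; I prefer the GEO copy.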

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36139

This SuperSeries is composed of the following SubSeries:

GSE36133 Expression data from the Cancer Cell Line Encyclopedia (CCLE)

GSE36138 SNP array data from the Cancer Cell Line Encyclopedia (CCLE)

The metadata of the GSE36133 study describe which cancer each cell line was derived from!

Some people like to call this metadata "clinical data".

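library(GEOquery)

library(Biobase) ## provides pData(), phenoData() and exprs()

## getGEO() returns a list of ExpressionSet objects; this series has a single element
ccleFromGEO <- getGEO("GSE36133")

## Sample annotation (the metadata): one row per cell line
annotBlock1 <- pData(phenoData(ccleFromGEO[[1]]))

dim(annotBlock1)

## [1] 917 38

exprSet <- exprs(ccleFromGEO[[1]])

dim(exprSet)

## [1] 18926 917

## The expression matrix has 18926 rows (genes, identified by Entrez ID) and 917 columns (cell lines)

keyColumns <- c("title", "source_name_ch1", "characteristics_ch1", "characteristics_ch1.1", "characteristics_ch1.2")

options(stringsAsFactors = FALSE)

allAnnot <- annotBlock1[, keyColumns]

## These columns hold the most useful metadata: the organization that supplied each cell line, the tissue, the cancer classification, and so on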
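GEO characteristics columns are usually stored as "key: value" strings, so a little cleanup makes them easier to group by. The sketch below works on a hypothetical data frame `toyAnnot`; the cell line names and values are invented, and the real strings in GSE36133 may be laid out differently, so check a few rows of `allAnnot` before applying it.

```r
# Hypothetical annotation rows in the usual GEO "key: value" layout;
# the real values in GSE36133 may differ
toyAnnot <- data.frame(
  title                 = c("CELL_A", "CELL_B", "CELL_C"),
  characteristics_ch1   = c("primary site: stomach", "primary site: lung",
                            "primary site: stomach"),
  characteristics_ch1.1 = c("histology: carcinoma", "histology: carcinoma",
                            "histology: carcinoma"),
  stringsAsFactors = FALSE
)

# Strip the "key: " prefix to get a clean value
strip_key <- function(x) sub("^[^:]*:\\s*", "", x)
toyAnnot$primary_site <- strip_key(toyAnnot$characteristics_ch1)
toyAnnot$histology    <- strip_key(toyAnnot$characteristics_ch1.1)

# Number of cell lines per primary site
site_counts <- table(toyAnnot$primary_site)
```

With the cleaned column in place, selecting the cell lines for one tissue is a simple logical subset of the expression matrix columns.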

Cell Line Gene Sets (profiles of 1,035 cell lines)

1035 sets of genes with high or low expression in each cell line relative to other cell lines from the CCLE Cell Line Gene Expression Profiles dataset.

http://amp.pharm.mssm.edu/Harmonizome/dataset/CCLE+Cell+Line+Gene+Expression+Profiles

Some articles about the CCLE database:

http://cancerres.aacrjournals.org/content/73/8_Supplement/2409.short

http://cancerres.aacrjournals.org/content/74/22/6390.short

https://clincancerres.aacrjournals.org/content/19/19_Supplement/IA2.abstract

http://onlinelibrary.wiley.com/doi/10.1002/cncy.21471/pdf introduces several similar database resources

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0088557 explains the high/low concept

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7060697 drug-related:

Anticancer drug sensitivity analysis: An integrated approach applied to Erlotinib sensitivity prediction in the CCLE database

http://biorxiv.org/content/biorxiv/early/2015/10/02/028159.full.pdf compares CCLE and TCGA data