Group cell lines by CNV signal, then compare expression between the groups (that is all multi-omics is)

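Some followers have left rather odd messages in the backend, asking why we never share multi-omics tutorials and whether we even know how to do multi-omics integrative analysis! Probably because the free NGS data processing video courses I put together on Bilibili all deal with a single omics data type.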

Anyone who asks whether we can do multi-omics analysis is surely a beginner!

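Most of what currently passes for multi-omics really comes down to this:

One omics layer is used to group the samples, and then the data from another omics layer is compared across those groups. Think it through and you will see that a grouping is simply determined by a sample phenotype. Multi-omics just means several kinds of phenotype information for the same samples!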

When your own data is not enough, public databases fill the gap

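The group's own data in this paper comes only from aCGH arrays, which yield nothing but CNV signal, so it is a bit thin. The authors therefore brought in the public microarray expression matrix from the CCLE database and ran a multi-omics integrative analysis. The article was published back in 2014: Molecular Integrative Clustering of Asian Gastric Cell Lines Revealed Two Distinct Chemosensitivity Clusters. What the group generated themselves was: array comparative genomic hybridization (aCGH) on 18 Asian gastric cell lines.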

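The researchers processed the public CCLE data in eight steps, and any competent bioinformatics analyst should be able to reproduce the whole procedure: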

  • step1: Affymetrix U133 Plus2 DNA microarray gene expressions of 27 gastric cancer cell lines (Kato-III, IM95, SNU-620, SNU-16, OCUM-1, NUGC-4, 2313287, HUG1N, MKN45, NCIN87, KE39, AGS, SNU-5, SNU-216, NUGC-3, NUGC-2, MKN74, MKN7, RERFGC1B, GCIY, KE97, Fu97, SH10TC, MKN1, SNU-1, Hs746T, HGC27) were downloaded from Cancer Cell Line Encyclopedia (CCLE) [16] in March 2013.

  • step2: Robust Multi-array Average (RMA) normalization was performed. Principal component analysis plot show no obvious batch effect.

  • step3: The normalized data is then collapsed by taking the probe sets with highest gene expression.

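The first three steps produce the mRNA expression matrix for the 27 gastric cancer cell lines: download the CEL files, normalize with RMA, and for genes covered by several probe sets keep the probe set with the highest expression. This matrix feeds all the downstream analysis!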
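Here is a minimal Python sketch of the step-3 collapse, assuming the RMA-normalized matrix is already loaded as a pandas DataFrame with probe sets in rows and cell lines in columns. The probe IDs, gene names, and the `collapse_probes` helper are made up for illustration. "Highest expression" is read here as the highest mean across samples, which the paper does not spell out.

```python
import pandas as pd


def collapse_probes(expr: pd.DataFrame, probe_to_gene: dict) -> pd.DataFrame:
    """Keep, for every gene, the probe set with the highest mean expression.

    expr: probe sets x samples (e.g. RMA-normalized log2 values).
    probe_to_gene: probe set ID -> gene symbol; unmapped probe sets are dropped.
    """
    genes = expr.index.to_series().map(probe_to_gene)
    annotated = expr.loc[genes.notna()]
    genes = genes.dropna()
    # Mean expression of each probe set across all samples.
    mean_expr = annotated.mean(axis=1)
    # For each gene, the ID of its highest-expressed probe set.
    best_probe = mean_expr.groupby(genes).idxmax()
    collapsed = annotated.loc[best_probe.to_numpy()].copy()
    collapsed.index = best_probe.index
    return collapsed


# Toy data: hypothetical probe sets and cell lines, not real CCLE values.
expr = pd.DataFrame(
    {"cell_1": [5.0, 8.0, 6.0, 7.0], "cell_2": [5.5, 7.5, 6.5, 7.0]},
    index=["probe_1", "probe_2", "probe_3", "probe_4"],
)
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
collapsed = collapse_probes(expr, probe_to_gene)
print(collapsed)
```

With real data the mapping would come from the array annotation, and the result is a gene x cell line matrix with one row per gene.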

  • step4: Unsupervised hierarchical clustering (1-Spearman distance, average linkage) was performed on the cell lines using the aCGH data.
    • Hierarchical clustering of the 18 Asian gastric cell lines
    • Putative driver genes of which copy number aberrations correlated to mRNA gene expression were identified to determine subtypes or clusters that are driven by different mechanisms. This was done using Mann Whitney U-test with p<0.05, and Spearman Correlation Coefficient test with Rho >0.6.
    • Select the genes whose expression tracks their CNV
  • step5: We then performed consensus clustering[17] on the gene expression data of the 27 gastric cancer cell lines from CCLE using these putative driver genes. We selected k=2 as it gives sufficiently stable similarity matrix.
    • Differential analysis and functional enrichment
  • step6: In order to assign new samples to this integrative cluster, significance analysis of microarray (SAM) [18] with threshold q<2.0 was used to generate subtype signature based on the mRNA expression data of the 1762 genes from the 27 gastric cancer cell lines in CCLE.
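The clustering in step4 (1 - Spearman distance, average linkage) can be sketched in Python with SciPy. This is only an illustration on a made-up genes x cell lines matrix of log2 ratios. The `cluster_cell_lines` helper and the choice to cut the tree into two groups are my assumptions, not the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import rankdata


def cluster_cell_lines(cnv: np.ndarray, n_clusters: int = 2) -> np.ndarray:
    """Average-linkage clustering of cell lines on a 1 - Spearman distance.

    cnv: genes x cell_lines matrix (e.g. aCGH log2 ratios).
    Returns one cluster label per cell line (column).
    """
    # Spearman correlation = Pearson correlation of the per-column ranks.
    ranks = np.apply_along_axis(rankdata, 0, cnv)
    rho = np.corrcoef(ranks, rowvar=False)
    dist = 1.0 - rho
    dist = (dist + dist.T) / 2.0      # remove floating-point asymmetry
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy matrix: 6 genes x 4 cell lines; columns 0-1 and 2-3 have opposite profiles.
cnv = np.array([
    [1.0, 0.9, -1.0, -0.8],
    [0.8, 1.1, -0.7, -0.9],
    [0.1, 0.0, 0.2, 0.1],
    [-0.5, -0.6, 0.6, 0.5],
    [-1.0, -0.9, 1.0, 1.2],
    [0.3, 0.2, -0.2, -0.3],
])
labels = cluster_cell_lines(cnv)
print(labels)
```

The label numbers themselves are arbitrary; what matters is which cell lines end up sharing a label.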

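In other words, the CNV signal is used first, both for clustering and for picking the putative driver genes (genes whose CNV and expression change together). The expression data of those genes is then used for a second round of clustering into two classes, and SAM is run between the two classes to find the differentially expressed genes.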
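To make the "CNV and expression change together" filter concrete, here is a rough per-gene sketch in Python using the two thresholds quoted above (Mann Whitney U-test p<0.05, Spearman Rho >0.6). The excerpt does not say which groups the U-test compares, so I assume altered versus unaltered cell lines, with "altered" defined by a hypothetical log2-ratio cutoff of 0.3. The helper name and the toy numbers are invented as well.

```python
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr


def is_putative_driver(cnv, expr, cnv_cutoff=0.3, rho_min=0.6, alpha=0.05):
    """Check one gene: does its expression follow its copy number?

    cnv, expr: values for the same cell lines, in the same order.
    cnv_cutoff: ASSUMED log2-ratio cutoff that marks a cell line as altered.
    """
    cnv = np.asarray(cnv, dtype=float)
    expr = np.asarray(expr, dtype=float)
    altered = cnv > cnv_cutoff
    rho, _ = spearmanr(cnv, expr)
    # Expression in altered vs. unaltered cell lines (assumed grouping).
    _, p_value = mannwhitneyu(expr[altered], expr[~altered],
                              alternative="two-sided")
    return bool(rho > rho_min and p_value < alpha)


# Toy data for 10 cell lines: the first 5 carry a gain, the last 5 do not.
cnv = [1.2, 1.0, 0.9, 1.1, 0.8, 0.0, 0.1, -0.1, 0.05, -0.05]
expr_follows_cnv = [9.5, 9.1, 8.9, 9.3, 8.7, 6.0, 6.2, 5.8, 6.1, 5.9]
expr_unrelated = [7.0, 6.1, 6.9, 6.3, 6.6, 6.8, 6.2, 7.1, 6.4, 6.5]

driver_like = is_putative_driver(cnv, expr_follows_cnv)
not_driver = is_putative_driver(cnv, expr_unrelated)
print(driver_like, not_driver)
```

Running this check gene by gene gives the gene list that goes into the second, expression-based round of clustering.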

The last part is annotation against functional databases, which explains what the differential analysis results actually mean!

  • step7: ssGSEA (single sample GSEA) was used to estimate pathway activities of the gastric cancer cell line in the Molecular Signature Database v3.1 (Msigdb v3.1) [19], [20]. The pathway activities are represented in enrichment scores which were rank normalized to [0.0, 1.0].
  • step8: SAM analysis was performed with threshold q<0.2, and fold change >2.0 (for up-regulated pathways), or <0.5 (for down-regulated pathways) to obtain subtype-specific pathways from the 27 gastric cell lines in CCLE.
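The rank normalization to [0.0, 1.0] and the fold-change cutoffs in these two steps can be sketched as below. This does not compute ssGSEA or SAM. It assumes the enrichment scores and q-values already exist, and that ranking is done per pathway across cell lines, which is my reading rather than something the paper states. Both helper functions and all numbers are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata


def rank_normalize(scores):
    """Map each pathway's scores to [0, 1] by their rank across cell lines.

    scores: pathways x cell_lines matrix of enrichment scores.
    """
    scores = np.asarray(scores, dtype=float)
    ranks = np.apply_along_axis(rankdata, 1, scores)  # ties get average ranks
    n_samples = scores.shape[1]
    return (ranks - 1.0) / (n_samples - 1.0)


def call_direction(fold_change, q_value, q_max=0.2):
    """Apply the step8 cutoffs: q<0.2 and fold change >2.0 or <0.5."""
    if q_value >= q_max:
        return "not significant"
    if fold_change > 2.0:
        return "up"
    if fold_change < 0.5:
        return "down"
    return "not significant"


# Toy enrichment scores: 2 pathways x 4 cell lines.
scores = [[0.3, -0.1, 0.5, 0.1],
          [-0.2, -0.4, 0.0, 0.2]]
normalized = rank_normalize(scores)
print(normalized)
print(call_direction(2.5, 0.05), call_direction(0.4, 0.1), call_direction(3.0, 0.5))
```

After this scaling every pathway spans the same 0 to 1 range, so a fold change between the two clusters means the same thing from one pathway to the next.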

Biological conclusions:

  • Cells in IC1 have enrichment of genes associated with oxidative phosphorylation and mitochondria functions.
  • Gastric cells in IC2 are enriched for genes involved in cell signalling.

The closing conclusion is the same kind of empty filler: In conclusion, combination of aCGH and gene expression analysis to identify potential candidate oncogenes or tumor suppressor genes is a powerful and proven approach that has been reported in other cancer studies.

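Of course, the paper involves more than the joint analysis of two omics datasets. It also includes drug assay data and experimental validation, so interested readers can go through it on their own.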

Apprentice homework

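In the TCGA database, locate the BRCA dataset, then find the breast cancer patients with BRCA1 mutations and the breast cancer patients with hypermethylation of the BRCA1 promoter region, and check whether the two groups overlap. Once the patients are grouped, compare the transcriptome data of the BRCA1-mutated patients against the BRCA1 promoter-hypermethylated patients!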
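As a starting point for the overlap check, here is a small Python sketch. It assumes you have already pulled out the two patient ID lists yourself. The IDs below are placeholders, not real TCGA barcodes, and `compare_groups` is just an illustrative helper.

```python
def compare_groups(mutated, hypermethylated):
    """Split patients into both / mutation-only / methylation-only groups."""
    mutated = set(mutated)
    hypermethylated = set(hypermethylated)
    return {
        "both": sorted(mutated & hypermethylated),
        "mutation_only": sorted(mutated - hypermethylated),
        "methylation_only": sorted(hypermethylated - mutated),
    }


# Placeholder patient IDs (hypothetical, for illustration only).
brca1_mutated = ["patient_01", "patient_02", "patient_03"]
brca1_hypermethylated = ["patient_03", "patient_04"]

groups = compare_groups(brca1_mutated, brca1_hypermethylated)
for name, members in groups.items():
    print(name, len(members), members)
```

The mutation-only and methylation-only groups are the two sample sets to carry into the transcriptome comparison; patients who fall into both need a decision on your side about where they belong.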
