使用curatedTCGAData下载TCGA数据库信息好用吗

好久没有写TCGA数据库教程了，因为TCGA计划早在2017年就陆陆续续停止了，我那个时候写了几百个教程并且录制了视频。

我三年前的TCGA教学视频课程B站地址：https://www.bilibili.com/video/av49363776
售后文档记录 https://docs.qq.com/doc/DYkVzUmZLWlhRRXVz 请先通读文档后再发问
我这边备份的TCGA数据来源于xena，ucsc的，都在，https://share.weiyun.com/5zLnKmO

安装和加载R包相信已经无需我多说了：

 BiocManager::install("curatedTCGAData")
 BiocManager::install("TCGAutils")
library(curatedTCGAData)
library(MultiAssayExperiment)
library(TCGAutils)

首先查看TCGA数据库有哪些数据

代码如下：

> curatedTCGAData(diseaseCode = "*", assays = "*", dry.run = TRUE)
Please see the list below for available cohorts and assays
Available Cancer codes:
 ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH
 KIRC KIRP LAML LGG LIHC LUAD LUSC MESO OV PAAD PCPG
 PRAD READ SARC SKCM STAD TGCT THCA THYM UCEC UCS UVM
Available Data Types:
 CNACGH CNACGH_CGH_hg_244a
 CNACGH_CGH_hg_415k_g4124a CNASeq CNASNP
 CNVSNP GISTIC_AllByGene GISTIC_Peaks
 GISTIC_ThresholdedByGene Methylation
 Methylation_methyl27 Methylation_methyl450
 miRNAArray miRNASeqGene mRNAArray
 mRNAArray_huex mRNAArray_TX_g4502a
 mRNAArray_TX_g4502a_1
 mRNAArray_TX_ht_hg_u133a Mutation
 RNASeq2GeneNorm RNASeqGene RPPAArray

多种癌症，多种数据类型。

联网下载数据

可以使用 dry.run 控制是否真的下载，因为如果是下载甲基化信号值矩阵或者表达量矩阵，会耗时很长。

curatedTCGAData(diseaseCode = "COAD", assays = "RPPA*", dry.run = TRUE)
# dry.run logical(1) Whether to return the dataset names before actual download (default TRUE)
(accmae <- curatedTCGAData("ACC", c("CN*", "Mutation"), FALSE))

下载后得到的是一个MultiAssayExperiment对象：

A MultiAssayExperiment object of 3 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 3:
 [1] ACC_CNASNP-20160128: RaggedExperiment with 79861 rows and 180 columns
 [2] ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 180 columns
 [3] ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns
Features:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample availability DFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices

可以看到是一个简单正则表达式，匹配。这个MultiAssayExperiment对象非常重要：

The MultiAssayExperiment class can be used to manage results of diverse assays on a collection of specimen. Currently, the class can handle assays that are organized instances of SummarizedExperiment, ExpressionSet, matrix, RaggedExperiment (inherits from GRangesList), and RangedVcfStack. Create new MultiAssayExperimentinstances with the homonymous constructor, minimally with the argument ExperimentList, potentially also with the arguments colData (see section below) and sampleMap.

获取临床属性

病人多组学数据必须要有临床信息，才能活起来。

head(getSubtypeMap(accmae))
head(getClinicalNames("ACC"))
colData(accmae)[, getClinicalNames("ACC")][1:5, 1:5]
sampleTables(accmae)

可以看到样本可以区分成为 01 10 11 代表是否是肿瘤样品。

> sampleTables(accmae)
$`ACC_CNASNP-20160128`

01 10 11
90 85 5

$`ACC_CNVSNP-20160128`

01 10 11
90 85 5

$`ACC_Mutation-20160128`

01
90

> sampleTypes[sampleTypes[["Code"]] %in% c("01", "10"), ]
 Code Definition Short.Letter.Code
1 01 Primary Solid Tumor TP
10 10 Blood Derived Normal NB

提取肿瘤相关数据

前面的 01 10 11 代表是否是肿瘤样品，就可以提取。

splitAssays(accmae, c("01", "10"))
tums <- TCGAsampleSelect(colnames(accmae), "01")

accmae[, tums, ]

写出文件

只需要 exportClass 函数即可，把前面的 MultiAssayExperiment对象里面的数据写出来。

exportClass(accmae, dir = './', fmt = "csv", ext = ".csv")
Writing about 7 files to disk...
[1] ".//accmae_META_0.csv" 
[2] ".//accmae_META_1.csv" 
[3] ".//accmae_ACC_CNASNP-20160128.csv" 
[4] ".//accmae_ACC_CNVSNP-20160128.csv" 
[5] ".//accmae_ACC_Mutation-20160128.csv"
[6] ".//accmae_colData.csv" 
[7] ".//accmae_sampleMap.csv"

实战

比如提取TCGA数据库的BRCA数据集的TNBC亚型的表达量矩阵。

前面我们提到过，如果是下载甲基化信号值矩阵或者表达量矩阵，会耗时很长。如果是去ucsc的xena浏览器下载，是一个130M左右的压缩包文件。

 (a <- curatedTCGAData("BRCA", c("RNASeq2GeneNorm"), FALSE))
 head(getSubtypeMap(a))
 head(getClinicalNames("BRCA"))

信息如下：

> head(getSubtypeMap(a))
 BRCA_annotations BRCA_subtype
1 Patient_ID patientID
2 mrna_subtypes PAM50 mRNA
3 mrna_subtypes SigClust Unsupervised mRNA
4 mrna_subtypes SigClust Intrinsic mRNA
5 microrna_subtypes miRNA Clusters
6 methylation_subtypes methylation Clusters
> head(getClinicalNames("BRCA"))
[1] "years_to_birth" "vital_status" "days_to_death" "days_to_last_followup"
[5] "tumor_tissue_site" "pathologic_stage"

简单的搜索一下 ER.Status PR.Status HER2.Final.Status

 phe=as.data.frame(colData(a))
 # Gender Age.at.Initial.Pathologic.Diagnosis ER.Status PR.Status HER2.Final.Status 
> head( phe[,colnames(phe)[grep('Status', colnames(phe),ignore.case = T)][56:59] ])
 ER.Status PR.Status HER2.Final.Status Vital.Status
TCGA-A1-A0SB Positive Negative Negative LIVING
TCGA-A1-A0SD Positive Positive Negative LIVING
TCGA-A1-A0SE Positive Positive Negative LIVING
TCGA-A1-A0SF Positive Positive Negative LIVING
TCGA-A1-A0SG Positive Positive Negative LIVING
TCGA-A1-A0SH Negative Positive Negative LIVING

整体评价：用起来还行，但也不是非他不可！

生信菜鸟团

欢迎去论坛biotrainee.com留言参与讨论，或者关注同名微信公众号biotrainee