癌症居然有如此多种(是时候开启pan-cancer数据挖掘模式)

看到了交流群有人提问某个癌症,看起来并不是TCGA的33种,我就搜索了一下,找到了 https://www.cancer.gov/types 这个宝藏网页,确实没想到癌症居然这么多种。

感觉这样的完全依照发病器官或者组织进行癌症分类命名的思路并不是很靠谱,但是目前的多组学技术又确实没有能完全变革传统命名方式,仅仅是增加了一些癌症亚型。

常规的单个癌症的数据挖掘绝大部分都被做烂了,是时候开启pan-cancer模式啦。第一步就是下载TCGA的33种癌症的全部数据,尤其是表达量矩阵和临床表型信息啦,这里我们推荐在ucsc的xena里面下载: https://xenabrowser.net/datapages/

GDC TCGA Acute Myeloid Leukemia (LAML) (15 datasets)
GDC TCGA Adrenocortical Cancer (ACC) (14 datasets)
GDC TCGA Bile Duct Cancer (CHOL) (14 datasets)
GDC TCGA Bladder Cancer (BLCA) (14 datasets)
GDC TCGA Breast Cancer (BRCA) (20 datasets)
GDC TCGA Cervical Cancer (CESC) (14 datasets)
GDC TCGA Colon Cancer (COAD) (15 datasets)
GDC TCGA Endometrioid Cancer (UCEC) (15 datasets)
GDC TCGA Esophageal Cancer (ESCA) (14 datasets)
GDC TCGA Glioblastoma (GBM) (15 datasets)
GDC TCGA Head and Neck Cancer (HNSC) (14 datasets)
GDC TCGA Kidney Chromophobe (KICH) (14 datasets)
GDC TCGA Kidney Clear Cell Carcinoma (KIRC) (15 datasets)
GDC TCGA Kidney Papillary Cell Carcinoma (KIRP) (15 datasets)
GDC TCGA Large B-cell Lymphoma (DLBC) (14 datasets)
GDC TCGA Liver Cancer (LIHC) (14 datasets)
GDC TCGA Lower Grade Glioma (LGG) (14 datasets)
GDC TCGA Lung Adenocarcinoma (LUAD) (15 datasets)
GDC TCGA Lung Squamous Cell Carcinoma (LUSC) (15 datasets)
GDC TCGA Melanoma (SKCM) (14 datasets)
GDC TCGA Mesothelioma (MESO) (14 datasets)
GDC TCGA Ocular melanomas (UVM) (14 datasets)
GDC TCGA Ovarian Cancer (OV) (15 datasets)
GDC TCGA Pancreatic Cancer (PAAD) (14 datasets)
GDC TCGA Pheochromocytoma & Paraganglioma (PCPG) (14 datasets)
GDC TCGA Prostate Cancer (PRAD) (14 datasets)
GDC TCGA Rectal Cancer (READ) (15 datasets)
GDC TCGA Sarcoma (SARC) (14 datasets)
GDC TCGA Stomach Cancer (STAD) (15 datasets)
GDC TCGA Testicular Cancer (TGCT) (14 datasets)
GDC TCGA Thymoma (THYM) (14 datasets)
GDC TCGA Thyroid Cancer (THCA) (14 datasets)
GDC TCGA Uterine Carcinosarcoma (UCS) (14 datasets)

把这些癌症信息存储到文件里面 ,cancer.list.txt 是文件名 ,然后就可以使用命令来读取这个文件进行批量下载。

每个癌症都有固定的数据文件格式:

比如TCGA-LAML,如下所示:

https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LAML.htseq_counts.tsv.gz

https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LAML.GDC_phenotype.tsv.gz


https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LAML.survival.tsv


https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LAML.mutect2_snv.tsv.gz

很容易批量下载,首先需要把前面的癌症种类列表存储到 cancer.list.txt 文件里面

mkdir ~/tcga/ucsc_xena
cd ~/tcga/ucsc_xena
perl -alne '{/\((.*?)\)/;print $1}' cancer.list.txt |while read id;do
wget https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-$id.{mutect2_snv,GDC_phenotype,survival,htseq_counts}.tsv.gz
done

很快就可以下载全部的文件,其中占空间比较多的是表达量矩阵文件,就是htseq_counts类型的, 如下所示:

133M 1月 28 2021 TCGA-BRCA.htseq_counts.tsv.gz
 69M 1月 28 2021 TCGA-KIRC.htseq_counts.tsv.gz
 64M 1月 28 2021 TCGA-LUAD.htseq_counts.tsv.gz
 63M 1月 28 2021 TCGA-THCA.htseq_counts.tsv.gz
 62M 1月 28 2021 TCGA-LUSC.htseq_counts.tsv.gz
 60M 1月 28 2021 TCGA-PRAD.htseq_counts.tsv.gz
 60M 1月 28 2021 TCGA-LGG.htseq_counts.tsv.gz
 58M 1月 28 2021 TCGA-HNSC.htseq_counts.tsv.gz
 56M 1月 28 2021 TCGA-UCEC.htseq_counts.tsv.gz
 51M 1月 28 2021 TCGA-STAD.htseq_counts.tsv.gz
 51M 1月 28 2021 TCGA-SKCM.htseq_counts.tsv.gz
 51M 1月 28 2021 TCGA-COAD.htseq_counts.tsv.gz
 46M 1月 28 2021 TCGA-OV.htseq_counts.tsv.gz
 46M 1月 28 2021 TCGA-BLCA.htseq_counts.tsv.gz
 40M 1月 28 2021 TCGA-LIHC.htseq_counts.tsv.gz
 35M 1月 28 2021 TCGA-KIRP.htseq_counts.tsv.gz
 34M 1月 28 2021 TCGA-CESC.htseq_counts.tsv.gz
 29M 1月 28 2021 TCGA-SARC.htseq_counts.tsv.gz
 23M 1月 28 2021 TCGA-ESCA.htseq_counts.tsv.gz
 20M 1月 28 2021 TCGA-PAAD.htseq_counts.tsv.gz
 20M 1月 28 2021 TCGA-PCPG.htseq_counts.tsv.gz
 20M 1月 28 2021 TCGA-GBM.htseq_counts.tsv.gz
 18M 1月 28 2021 TCGA-READ.htseq_counts.tsv.gz
 18M 1月 28 2021 TCGA-TGCT.htseq_counts.tsv.gz
 18M 1月 28 2021 TCGA-LAML.htseq_counts.tsv.gz
 14M 1月 28 2021 TCGA-THYM.htseq_counts.tsv.gz
 11M 1月 28 2021 TCGA-KICH.htseq_counts.tsv.gz
9.5M 1月 28 2021 TCGA-MESO.htseq_counts.tsv.gz
8.2M 1月 28 2021 TCGA-ACC.htseq_counts.tsv.gz
7.9M 1月 28 2021 TCGA-UVM.htseq_counts.tsv.gz
6.7M 1月 28 2021 TCGA-UCS.htseq_counts.tsv.gz
5.3M 1月 28 2021 TCGA-DLBC.htseq_counts.tsv.gz
4.9M 1月 28 2021 TCGA-CHOL.htseq_counts.tsv.gz

当然了, 如果你需要读懂这些htseq_counts文件,还需要下载: https://gdc-hub.s3.us-east-1.amazonaws.com/download/gencode.v22.annotation.gene.probeMap 这个文件,对基因名字进行注释,主要是想可以区分编码和非编码

 id gene chrom chromStart chromEnd strand
1 ENSG00000223972.5 DDX11L1 chr1 11869 14409 +
2 ENSG00000227232.5 WASH7P chr1 14404 29570 -
3 ENSG00000278267.1 MIR6859-3 chr1 17369 17436 -
4 ENSG00000243485.3 RP11-34P13.3 chr1 29554 31109 +
5 ENSG00000274890.1 MIR1302-9 chr1 30366 30503 +
6 ENSG00000237613.2 FAM138A chr1 34554 36081 -

这里面的信息其实是缺乏编码和非编码分类信息:更详细可以去:https://www.gencodegenes.org/human/release_22.html,这里选择 gencode.v22.annotation.gtf.gz ,主要是因为TCGA的数据在这个UCSC的xena浏览器里面存储的时候,表达量矩阵 就是根据 gencode.v22.annotation.gtf.gz 进行的定量。

wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_22/gencode.v22.annotation.gtf.gz
zcat gencode.v22.annotation.gtf.gz|perl -alne '{print if $F[2] eq "gene"}'|cut -d"\"" -f 2,4|sed 's/"/\t/g' > ensembID2type.txt
ENSG00000231567.1 unprocessed_pseudogene
ENSG00000223921.1 processed_pseudogene
ENSG00000168702.15 protein_coding
ENSG00000277306.1 misc_RNA
ENSG00000236440.1 processed_pseudogene
ENSG00000238065.1 processed_pseudogene
ENSG00000201892.1 misc_RNA
ENSG00000252015.1 snRNA
ENSG00000233329.1 processed_pseudogene
ENSG00000281341.1 miRNA

有了这些,就可以批量读取下载好的表达量矩阵,并且根据基因注释信息,拆分成为蛋白编码基因和非编码基因这两个不同的表达量矩阵啦 :

rm(list=ls())
options(stringsAsFactors = F)
library(stringr) 
library(data.table) 
dir.create('Rdata')

# dataset: gene expression RNAseq - HTSeq - Counts 
# unitlog2(count+1)
fs=list.files('ucsc_xena',pattern = 'htseq_counts.tsv.gz')
fs
lapply(fs, function(x){
 # x=fs[1]
 pro=gsub('tsv.gz','',x)
 print(pro)
 a=fread(file.path('ucsc_xena',x),data.table = F)
 head(a[ ,1:4])
 tail(a[ ,1:4])
 mat=a[1:60483,] 
 rownames(mat)=mat$Ensembl_ID
 mat[1:4,1:4]
 mat=mat[,-1]

 exprSet=floor(2^mat - 1 )
 exprSet[1:4,1:4] 

 n=floor(ncol(exprSet)/20)
 n
 keep_feature <- rowSums (exprSet > 2) > n
 table(keep_feature)
 mat <- exprSet[keep_feature, ]
 mat=mat[, colSums(mat) > 1000000]
 mat[1:4,1:4]
 dim(mat) 
 colnames(mat)

 b=read.table('ucsc_xena/gencode.v22.annotation.gene.probeMap',header = T)
 head(b)
 d=read.table('ucsc_xena/ensembID2type.txt')
 head(d)
 b=merge(b,d,by.x='id',by.y='V1')
 head(b)

 length(unique(b[match(rownames(mat),b$id),2]))
 # 可以看到,多个ensembl的ID对应同一个基因的symbol,这个是需要处理掉的。
 # 下面的代码略微有一点点复杂
 dat=mat
 ids <- b[match(rownames(dat),
 b$id),1:2] #取需要的列 
 head(ids)
 colnames(ids)=c('probe_id','symbol') 
 ids=ids[ids$symbol != '',]
 ids=ids[ids$probe_id %in% rownames(dat),]
 dat[1:4,1:4] 
 dat=dat[ids$probe_id,] 

 ids$median=apply(dat,1,median) #ids新建median这一列,列名为median,同时对dat这个矩阵按行操作,取每一行的中位数,将结果给到median这一列的每一行
 ids=ids[order(ids$symbol,ids$median,decreasing = T),]#对ids$symbol按照ids$median中位数从大到小排列的顺序排序,将对应的行赋值为一个新的ids
 ids=ids[!duplicated(ids$symbol),]#将symbol这一列取取出重复项,'!'为否,即取出不重复的项,去除重复的gene ,保留每个基因最大表达量结果s
 dat=dat[ids$probe_id,] #新的ids取出probe_id这一列,将dat按照取出的这一列中的每一行组成一个新的dat
 rownames(dat)=ids$symbol#把ids的symbol这一列中的每一行给dat作为dat的行名
 dat[1:4,1:4] #保留每个基因ID第一次出现的信息

 mat=dat
 mat[1:4,1:4]
 print(fivenum(mat['GAPDH',]))
 print(fivenum(mat['ACTB',]))

 tp=b[match(rownames(mat),b$gene),7]
 print( tail(sort(table(tp))) )

 pd_mat=mat[tp=='protein_coding',]
 non_mat=mat[tp !='protein_coding',]

 save(pd_mat,non_mat,
 file = file.path('Rdata', paste0(pro,'Rdata')))

})

如下所示:

340M Aug 29 11:45 TCGA-BRCA.htseq_counts..Rdata
176M Aug 29 11:45 TCGA-KIRC.htseq_counts..Rdata
166M Aug 29 11:46 TCGA-LUAD.htseq_counts..Rdata
158M Aug 29 11:47 TCGA-UCEC.htseq_counts..Rdata
158M Aug 29 11:46 TCGA-LUSC.htseq_counts..Rdata
149M Aug 29 11:47 TCGA-THCA.htseq_counts..Rdata
147M Aug 29 11:47 TCGA-STAD.htseq_counts..Rdata
145M Aug 29 11:45 TCGA-HNSC.htseq_counts..Rdata
145M Aug 29 11:46 TCGA-PRAD.htseq_counts..Rdata
142M Aug 29 11:46 TCGA-LGG.htseq_counts..Rdata
133M Aug 29 11:45 TCGA-COAD.htseq_counts..Rdata
127M Aug 29 11:47 TCGA-SKCM.htseq_counts..Rdata
118M Aug 29 11:46 TCGA-OV.htseq_counts..Rdata
117M Aug 29 11:44 TCGA-BLCA.htseq_counts..Rdata
104M Aug 29 11:46 TCGA-LIHC.htseq_counts..Rdata
 86M Aug 29 11:45 TCGA-KIRP.htseq_counts..Rdata
 83M Aug 29 11:45 TCGA-CESC.htseq_counts..Rdata
 72M Aug 29 11:46 TCGA-SARC.htseq_counts..Rdata
 62M Aug 29 11:45 TCGA-ESCA.htseq_counts..Rdata
 49M Aug 29 11:46 TCGA-PAAD.htseq_counts..Rdata
 48M Aug 29 11:45 TCGA-GBM.htseq_counts..Rdata
 47M Aug 29 11:46 TCGA-PCPG.htseq_counts..Rdata
 47M Aug 29 11:46 TCGA-LAML.htseq_counts..Rdata
 45M Aug 29 11:46 TCGA-READ.htseq_counts..Rdata
 45M Aug 29 11:47 TCGA-TGCT.htseq_counts..Rdata
 33M Aug 29 11:47 TCGA-THYM.htseq_counts..Rdata
 24M Aug 29 11:45 TCGA-KICH.htseq_counts..Rdata
 23M Aug 29 11:46 TCGA-MESO.htseq_counts..Rdata
 21M Aug 29 11:44 TCGA-ACC.htseq_counts..Rdata
 19M Aug 29 11:47 TCGA-UVM.htseq_counts..Rdata
 17M Aug 29 11:47 TCGA-UCS.htseq_counts..Rdata
 14M Aug 29 11:45 TCGA-DLBC.htseq_counts..Rdata
 12M Aug 29 11:45 TCGA-CHOL.htseq_counts..Rdata

也可以简单的检查一下这些表达量矩阵的Rdata文件,里面的存储的蛋白编码基因和非编码基因这两个不同的表达量矩阵的基因数量和病人数量,如下所示:

fs=list.files('Rdata/',pattern = 'htseq_counts')
fs
do.call(rbind,lapply(fs, function(x){
 load(file = file.path('Rdata/',x)) 
 return(c(x,dim(pd_mat),dim(non_mat)))
}))

### 全部是批量操作:
 [,1] [,2] [,3] [,4] [,5] 
 [1,] "TCGA-ACC.htseq_counts..Rdata" "17930" "79" "15183" "79" 
 [2,] "TCGA-BLCA.htseq_counts..Rdata" "18236" "430" "17236" "430" 
 [3,] "TCGA-BRCA.htseq_counts..Rdata" "18267" "1217" "18237" "1217"
 [4,] "TCGA-CESC.htseq_counts..Rdata" "18180" "309" "16515" "309" 
 [5,] "TCGA-CHOL.htseq_counts..Rdata" "17896" "45" "14814" "45" 
 [6,] "TCGA-COAD.htseq_counts..Rdata" "17839" "512" "15864" "512" 
 [7,] "TCGA-DLBC.htseq_counts..Rdata" "18110" "48" "16876" "48" 
 [8,] "TCGA-ESCA.htseq_counts..Rdata" "18954" "173" "26819" "173" 
 [9,] "TCGA-GBM.htseq_counts..Rdata" "18047" "173" "17829" "173" 
[10,] "TCGA-HNSC.htseq_counts..Rdata" "18253" "546" "16304" "546" 
[11,] "TCGA-KICH.htseq_counts..Rdata" "17969" "89" "15989" "89" 
[12,] "TCGA-KIRC.htseq_counts..Rdata" "18299" "607" "19476" "607" 
[13,] "TCGA-KIRP.htseq_counts..Rdata" "18078" "321" "16811" "321" 
[14,] "TCGA-LAML.htseq_counts..Rdata" "17863" "151" "22300" "151" 
[15,] "TCGA-LGG.htseq_counts..Rdata" "18036" "529" "16957" "529" 
[16,] "TCGA-LIHC.htseq_counts..Rdata" "17777" "424" "13953" "424" 
[17,] "TCGA-LUAD.htseq_counts..Rdata" "18257" "585" "18764" "585" 
[18,] "TCGA-LUSC.htseq_counts..Rdata" "18411" "550" "18863" "550" 
[19,] "TCGA-MESO.htseq_counts..Rdata" "17986" "86" "15480" "86" 
[20,] "TCGA-OV.htseq_counts..Rdata" "18560" "379" "21929" "379" 
[21,] "TCGA-PAAD.htseq_counts..Rdata" "18094" "182" "16265" "182" 
[22,] "TCGA-PCPG.htseq_counts..Rdata" "17747" "186" "14982" "186" 
[23,] "TCGA-PRAD.htseq_counts..Rdata" "18050" "551" "16152" "551" 
[24,] "TCGA-READ.htseq_counts..Rdata" "17759" "177" "14892" "177" 
[25,] "TCGA-SARC.htseq_counts..Rdata" "18258" "265" "16968" "265" 
[26,] "TCGA-SKCM.htseq_counts..Rdata" "18177" "472" "16780" "472" 
[27,] "TCGA-STAD.htseq_counts..Rdata" "18995" "407" "27882" "407" 
[28,] "TCGA-TGCT.htseq_counts..Rdata" "18600" "156" "18323" "156" 
[29,] "TCGA-THCA.htseq_counts..Rdata" "17888" "568" "16228" "568" 
[30,] "TCGA-THYM.htseq_counts..Rdata" "18103" "121" "16638" "121" 
[31,] "TCGA-UCEC.htseq_counts..Rdata" "18295" "583" "17097" "583" 
[32,] "TCGA-UCS.htseq_counts..Rdata" "18576" "56" "19377" "56" 
[33,] "TCGA-UVM.htseq_counts..Rdata" "17204" "80" "12045" "80"

可以看到,在不同癌症里面,蛋白质编码相关基因数量一直在一万八附近,而非编码基因数量跨度比较大,从一万二到两万七不等。 因为非编码基因的组织病人特异性都很强,详见:LncRNA的表达调控功能研究综述

万事开头难

我们仅仅是给大家演示了,如果需要做pan-cancer模式数据挖掘,需要学会使用代码进行批量下载和数据处理,克服了第一个难关。但是俗话说得好,万事开头难,中间(统计可视化)更难,结尾(写作)更是难上加难!

首先你需要把各种数据挖掘思路烂熟于心,比如我安排我们生信菜鸟团数据挖掘团队成员帮忙收集整理了全部使用CIBERSORTbulk癌症转录组数据看LM22免疫细胞比例的文章,详细列表如下:

癌症类型 题目 URL
BLCA Tumor-infiltrating M2 macrophages driven by specific genomic alterations are associated with prognosis in bladder cancer https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6610042/
BRCA Relevance of Tumor-Infiltrating Immune Cell Composition and Functionality for Disease Outcome in Breast Cancer https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6284248/
COAD Immune cell infiltration as a biomarker for the diagnosis and prognosis of stage I–III colon cancer https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6426802/
ESCA Paired whole exome and transcriptome analyses for the Immunogenomic changes during concurrent chemoradiotherapy in esophageal squamous cell carcinoma https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6524245/#CR23
GBM Metabolic remodeling contributes towards an immune-suppressive phenotype in glioblastoma https://link.springer.com/article/10.1007%2Fs00262-019-02347-3
HNSC Sex Differences in Using Systemic Inflammatory Markers to Prognosticate Patients with Head and Neck Squamous Cell Carcinoma https://cebp.aacrjournals.org/content/27/10/1176.long
KIRC Immune infiltration in renal cell carcinoma https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6501001/
KIRP Immune infiltration in renal cell carcinoma https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6501001/
LIHC Identification and Validation of Tumor Stromal Immunotype in Patients With Hepatocellular Carcinoma https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6691778/
LUAD Prognostic Implications of Heterogeneity in Intra-tumoral Immune Composition for Recurrence in Early Stage Lung Cancer https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6196259/
LUSC The prognostic landscape of tumor-infiltrating immune cell and immunomodulators in lung cancer https://www.sciencedirect.com/science/article/pii/S0753332217320255?via%3Dihub
PRAD The establishment of immune infiltration based novel recurrence predicting nomogram in prostate cancer https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6718526/
SKCM Association of LRP1B Mutation With Tumor Mutation Burden and Outcomes in Melanoma and Non-small Cell Lung Cancer Patients Treated With Immune Check-Point Blockades https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6536574/
STAD Gene expression profiles for a prognostic immunoscore in gastric cancer https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6099214/
THCA Papillary thyroid carcinoma behavior: clues in the tumor microenvironment https://erc.bioscientifica.com/view/journals/erc/26/6/ERC-19-0074.xml 
肝细胞癌 Tumor-Infiltrating Leukocyte Composition and Prognostic Power in Hepatitis B- and Hepatitis C-Related Hepatocellular Carcinomas https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6722571/

你可以根据我们收集整理的列表,自己去一个个下载这些文献!

当然,我们同时收集了大家耳熟能详的数据挖掘策略,比如

  • 差异分析+PPI网络+hub基因
  • WGCNA+hub基因
  • 诊断模型构建
  • 预后模型构建
  • 肿瘤免疫,CIBERSOFT计算的LM22比例分组,以及ESTIMATE算法等等
  • m6A等生物学功能基因集
  • 药敏信息

(mRNA,lncRNA,miRNA,甲基化,蛋白)均可走上述流程,也就是说33种癌症乘以5种亚型,乘以5种分子,乘以15个策略就已经是过万篇数据挖掘课题了,而且你仔细搜索一下就发现,真的是已经有了过万篇数据挖掘文章了哦!详见: https://share.mubu.com/doc/36AU00FIxWf

是时候做一点不一样的

如果你也厌倦了那些 套路化数据分析凑图表模式,那么,是时候做出一定改变了,在这里,我们开启一个pan-cancer数据挖掘的讨论群,详见:https://mp.weixin.qq.com/s/XZzHSzBYXq8G1tvS_tWXLw

会在群里共享收集整理好的全部文献,包括TCGA团队的两次(2018和2020)高质量数据挖掘文章(CNS级别),欢迎有志于解决癌症这个目前仍然是不治之症的人类痛点的小伙伴们加入一起讨论提高。

Pan-Cancer Atlas

Comments are closed.