Getting to know five breast cancer expression datasets

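I recently needed to learn the genefu package and apply it to my own data. Its vignette mentions five breast cancer expression datasets, which can be installed as follows: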

# biocLite() has been superseded by BiocManager on current Bioconductor releases
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
BiocManager::install("genefu")

# the five experiment-data packages, installed without prompting or updating other packages
BiocManager::install(c("breastCancerMAINZ", "breastCancerTRANSBIG", "breastCancerUPP",
                       "breastCancerUNT", "breastCancerNKI"), ask=FALSE, update=FALSE)

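All five datasets were published by earlier studies. The Mainz, Transbig, UPP, and UNT datasets correspond to GSE11121, GSE7390, GSE3494, and GSE2990 respectively. The NKI dataset was never deposited in GEO; it was compiled from the authors' supplementary materials.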

Together they cover 1,123 patients, and the clinical information is fairly complete.
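As a quick sanity check on the 1,123 figure, the per-dataset sample counts reported by the ExpressionSet summaries printed later in this post can be tabulated next to their GEO accessions and summed. This is only a minimal sketch: the data frame `datasets` and its column names are my own choices for illustration and are not part of genefu.

```r
# Sample counts as reported by the ExpressionSet summaries in this post;
# the data frame and its column names are illustrative, not part of genefu.
datasets <- data.frame(
  dataset = c("transbig", "unt", "upp", "mainz", "nki"),
  geo     = c("GSE7390", "GSE2990", "GSE3494", "GSE11121", NA),  # NKI is not in GEO
  n       = c(198, 137, 251, 200, 337),
  stringsAsFactors = FALSE
)

# Total number of patients across the five datasets
total.patients <- sum(datasets$n)
print(total.patients)   # [1] 1123
```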

GSE11121

The paper that published this dataset is: The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res 2008 Jul 1;68(13):5405-13. PMID: 18593943

The platform is the GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. The study summary states: "we analyzed the gene expression patterns of 200 tumors of patients who were not treated by systemic therapy after surgery using a discovery approach."

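The study relates these expression patterns to the following biological motifs (they read as expression-derived features rather than collected clinical variables):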

  • the biological process of proliferation
  • steroid hormone receptor expression
  • B cell and T cell infiltration.

GSE7390

The paper that published this dataset is: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 2007 Jun 1;13(11):3207-14. PMID: 17545524

The platform is the GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. The study summary states: "Gene expression profiling of frozen samples from 198 N- systemically untreated patients was performed at the Bordet Institute, blinded to clinical data and independent of Veridex." (Here N- means node-negative.)

GSE3494

The paper that published this dataset is: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A 2005 Sep 20;102(38):13550-5. PMID: 16141321

The platform is the GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. The samples are described as "freshly frozen breast tumors from a population-based cohort of 315 women representing 65% of all breast cancers resected in Uppsala County, Sweden, from January 1, 1987 to December 31, 1989."

The patient information collected is fairly complete:

INDEX (ID) 
p53 seq mut status (p53+=mutant; p53-=wt) 
p53 DLDA classifier result (0=wt-like, 1=mt-like) 
DLDA error (1=yes, 0=no) 
Elston histologic grade 
ER status 
PgR status 
age at diagnosis 
tumor size (mm) 
Lymph node status 
DSS TIME (Disease-Specific Survival Time in years) 
DSS EVENT (Disease-Specific Survival EVENT; 1=death from breast cancer, 0=alive or censored )

GSE2990

The paper that published this dataset is: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 2006 Feb 15;98(4):262-72. PMID: 16478745

The platform is the GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. The study summary states: "We analyzed microarray data from 189 invasive breast carcinomas and from three published gene expression datasets from breast carcinomas."

This series reuses some of the GSE3494 data, as its description notes: "The patients coming from Uppsala Hospital have been also used in other studies as in GSE3494. You can find the common set of patients in removing the abbreviation "UPP_" from the sample names and compare the results with the "INDEX (ID)" from the GSE3494 series."
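The matching rule quoted above (drop the "UPP_" prefix, then compare against "INDEX (ID)") can be sketched in a few lines of base R. The three sample names are taken from the printed output in this post; the `gse3494.index` values are hypothetical placeholders, since the real INDEX (ID) column is not reproduced here and its exact formatting is an assumption.

```r
# Sample names in the "UPP_" style (taken from the output shown in this post)
upp.samples <- c("UPP_103B41", "UPP_104B91", "UPP_9B52")

# Hypothetical "INDEX (ID)" values from GSE3494, for illustration only
gse3494.index <- c("9B52", "103B41", "250C78")

# Strip the "UPP_" prefix so both sides use the same identifier
upp.ids <- sub("^UPP_", "", upp.samples)

# Patients present in both series (order follows upp.ids)
common <- intersect(upp.ids, gse3494.index)
print(common)   # [1] "103B41" "9B52"
```

Anchoring the pattern with `^` keeps the substitution from touching an identifier that happens to contain "UPP_" somewhere other than the start.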

Loading the data into R

Because the genefu package has already processed these five datasets, they can be loaded straight into R and inspected.

library(genefu)  # breastCancerData, loaded below, ships with genefu
library(breastCancerMAINZ)
library(breastCancerTRANSBIG)
library(breastCancerUPP)
library(breastCancerUNT)
library(breastCancerNKI)

data(breastCancerData)
data.all <- c("transbig7g"=transbig7g, "unt7g"=unt7g, "upp7g"=upp7g,
 "mainz7g"=mainz7g, "nki7g"=nki7g)

Printing the list shows each dataset clearly:

> data.all
$transbig7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 198 samples 
 element names: exprs 
protocolData: none
phenoData
 sampleNames: VDXGUYU_4002 VDXGUYU_4008 ... VDXRHU_5240 (198 total)
 varLabels: samplename dataset ... e.os (21 total)
 varMetadata: labelDescription
featureData
 featureNames: 205225_at 216836_s_at ... 202763_at (7 total)
 fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
 fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
 pubMedIds: 17545524 
Annotation: hgu133a

$unt7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 137 samples 
 element names: exprs 
protocolData: none
phenoData
 sampleNames: OXFU_104 OXFU_1065 ... KIU_89A64 (137 total)
 varLabels: samplename dataset ... e.os (21 total)
 varMetadata: labelDescription
featureData
 featureNames: 205225_at 216836_s_at ... 202763_at (7 total)
 fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
 fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
 pubMedIds: 16478745 
Annotation: hgu133ab

$upp7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 251 samples 
 element names: exprs 
protocolData: none
phenoData
 sampleNames: UPP_103B41 UPP_104B91 ... UPP_9B52 (251 total)
 varLabels: samplename dataset ... e.os (21 total)
 varMetadata: labelDescription
featureData
 featureNames: 205225_at 216836_s_at ... 202763_at (7 total)
 fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
 fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
 pubMedIds: 16141321 
Annotation: hgu133ab

$mainz7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 200 samples 
 element names: exprs 
protocolData: none
phenoData
 sampleNames: MAINZ_BC6001 MAINZ_BC6002 ... MAINZ_BC6232 (200 total)
 varLabels: samplename dataset ... e.os (21 total)
 varMetadata: labelDescription
featureData
 featureNames: 205225_at 216836_s_at ... 202763_at (7 total)
 fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
 fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
 pubMedIds: 18593943 
Annotation: hgu133a

$nki7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 337 samples 
 element names: exprs 
protocolData: none
phenoData
 sampleNames: NKI_4 NKI_6 ... NKI_404 (337 total)
 varLabels: samplename dataset ... e.os (21 total)
 varMetadata: labelDescription
featureData
 featureNames: NM_000125 NM_004448 ... NM_004346 (7 total)
 fvarLabels: probe EntrezGene.ID ... Description (10 total)
 fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: rosetta

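The last dataset was generated on an Agilent platform, while the others all come from Affymetrix arrays, so the collection is a good playground for practicing batch-effect correction algorithms.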

dn <- c("transbig", "unt", "upp", "mainz", "nki")
dn.platform <- c("affy", "affy", "affy", "affy", "agilent")

References: http://genomicsclass.github.io/book/pages/svacombat.html and https://www.biostars.org/p/196430/ both make it easy to understand what batch correction is.
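To make the idea of a platform batch effect concrete, here is a toy sketch in base R. It is deliberately NOT ComBat or SVA (the methods covered in the references); it only performs per-platform mean-centering of each gene, which is a crude stand-in. The matrix values, the simulated shift of 2, and the helper name `center.by.batch` are all made up for illustration.

```r
set.seed(1)

# Toy expression matrix: 4 genes x 6 samples (rows = genes, columns = samples).
# Values and platform labels are simulated, not taken from the real datasets.
expr <- matrix(rnorm(24), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("s", 1:6)))
platform <- c("affy", "affy", "affy", "agilent", "agilent", "agilent")

# Simulate a platform-specific shift on the agilent samples
expr[, platform == "agilent"] <- expr[, platform == "agilent"] + 2

# Per-platform mean-centering: subtract each gene's mean within its own batch.
# Hypothetical helper; a crude stand-in for a real batch-correction method.
center.by.batch <- function(x, batch) {
  for (b in unique(batch)) {
    cols <- batch == b
    x[, cols] <- x[, cols] - rowMeans(x[, cols, drop = FALSE])
  }
  x
}

expr.centered <- center.by.batch(expr, platform)

# After centering, each gene's mean within a platform is (numerically) zero
print(round(rowMeans(expr.centered[, platform == "agilent"]), 10))
```

Mean-centering removes only an additive shift per gene; it does not adjust variances or protect biological covariates, which is why dedicated methods are preferred on real data.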

More importantly, the clinical information for all five datasets has been harmonized into the same set of columns:

cinfo <- colnames(pData(mainz7g))
> cinfo
 [1] "samplename" "dataset" "series" "id" 
 [5] "filename" "size" "age" "er" 
 [9] "grade" "pgr" "her2" "brca.mutation"
[13] "e.dmfs" "t.dmfs" "node" "t.rfs" 
[17] "e.rfs" "treatment" "tissue" "t.os" 
[21] "e.os"
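One way to confirm that several clinical tables really share this layout is to compare their column names against `cinfo`. The sketch below uses empty stand-in data frames so that it runs without the data packages installed; `empty.pdata` and `pdata.list` are hypothetical names introduced only for this example.

```r
# The 21 harmonized clinical column names listed above
cinfo <- c("samplename", "dataset", "series", "id", "filename", "size", "age",
           "er", "grade", "pgr", "her2", "brca.mutation", "e.dmfs", "t.dmfs",
           "node", "t.rfs", "e.rfs", "treatment", "tissue", "t.os", "e.os")

# Stand-in clinical tables: zero-row data frames carrying only the column names
empty.pdata <- function(cols) {
  df <- data.frame(matrix(nrow = 0, ncol = length(cols)))
  colnames(df) <- cols
  df
}
pdata.list <- list(mainz = empty.pdata(cinfo), nki = empty.pdata(cinfo))

# TRUE for every table whose columns match cinfo exactly (same names, same order)
same.columns <- sapply(pdata.list, function(p) identical(colnames(p), cinfo))
print(same.columns)

# With the real data loaded, the equivalent check would be:
# sapply(data.all, function(x) identical(colnames(pData(x)), cinfo))
```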

These really are excellent datasets!

 
