十二 29

所有TCGA收集的mRNA表达数据集数据集-GSE62944

无意中看一篇文献,有提到这个数据集,我简单看了一下,居然还真的这么全面!!!
它处理了目前(大概是2015年6月)TCGA收集的所有癌症样本的mRNA表达数据,并且统一处理成了count和RPKM两种表达量形式。
GEO地址:http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944

Title Alternatively processed and compiled RNA-Sequencing and clinical data for thousands of samples from The Cancer Genome Atlas
Organism Homo sapiens
Experiment type Expression profiling by high throughput sequencing
Summary We reprocessed RNA-Seq data for 9264 tumor samples and 741 normal samples across 24 cancer types from The Cancer Genome Atlas with "Rsubread". Rsubread is an open source R package that has shown high concordance with other existing methods of alignment and summarization, but is simple to use and takes significantly less time to process data. Additionally, we provide clinical variables publicly available as of May 20, 2015 for the tumor samples where the TCGA ids are matched.
这样就非常方便大家使用了。
可以直接拿来进行聚类,看看不同癌症种类如何聚集在一起,也可以直接拿来跟某个临床指标进行关联分析.
比如,如果我们想用DEseq来做差异分析,那么需要的数据就是基因的count,这里就有近一万个癌症样本的基因表达的count值,随便都可以分类看看差异情况,再富集分析分析。非常适合初学者练手。
癌症种类见官网:http://gdac.broadinstitute.org/
clipboard