09

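Differential gene analysis of Affymetrix gene expression microarray data

I mainly followed one differential-analysis tutorial that is very detailed and thorough. I will outline the tutorial first and then post my own code.

GEO has just three levels of data: Sample, Platform, and Series.
It now holds 14,927 platforms, covering chips from mainstream vendors such as Affymetrix, Agilent, and Illumina, their applications in different areas (SNP, SNV, GWAS, and so on), and a range of organisms (human, mouse, rat).
This workflow applies only to Affymetrix gene expression arrays.
The table of contents is as follows:
Because that post was itself a repost, its link has gone dead; the current link is:
Searching for the section titles should turn the content up again anyway; dead links are all too common.
I reorganized and re-annotated the content in a Youdao Cloud note.
Anyone who reads the tutorial carefully can get started with differential analysis of microarray data, but it is only one approach to differential analysis, and a very dated one.
Differential analysis packages built for high-throughput sequencing, such as DESeq and edgeR, are more popular now. Even for microarray data from more than ten years ago there is no need to download the raw CEL files; you can download each project's expression matrix directly as Series Matrix File(s).
Then read it in R with read.table; once the parameters are set correctly it loads directly.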
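As a rough illustration of that last step, the sketch below parses a Series Matrix file without any special package. It assumes the usual layout of these files: metadata lines begin with `!`, and the remaining tab-delimited table has `ID_REF` in the first column followed by one column per sample. The probe and sample IDs in the demo text are made up. In R, the matching `read.table` setting is `comment.char = "!"`.

```python
import csv
import io

def read_series_matrix(handle):
    """Parse a GEO Series Matrix stream into (sample_ids, {probe_id: values})."""
    # Metadata lines and table markers all start with "!"; drop them and blank lines.
    rows = [line for line in handle if line.strip() and not line.startswith("!")]
    reader = csv.reader(rows, delimiter="\t", quotechar='"')
    header = next(reader)            # "ID_REF" followed by the sample IDs
    samples = header[1:]
    expr = {}
    for row in reader:
        # Empty cells or "null" are kept as None instead of failing in float().
        expr[row[0]] = [float(v) if v not in ("", "null") else None for v in row[1:]]
    return samples, expr

# Made-up miniature file, for demonstration only.
demo = io.StringIO(
    '!Series_title\t"toy series"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM0001"\t"GSM0002"\n'
    '"1007_s_at"\t7.5\t8.25\n'
    '"1053_at"\t5.0\t\n'
    '!series_matrix_table_end\n'
)
samples, expr = read_series_matrix(demo)
print(samples)
print(expr["1007_s_at"])
```

Missing values are kept as `None` on purpose, so that the decision to drop or impute them stays with the downstream analysis.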
07

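liftOver: converting coordinates between genome builds

Download: http://hgdownload.cse.ucsc.edu/admin/exe/
Usage: [converting from hg38 to hg19]
hg19 is still the mainstream genome build, but times move on, and a lot of information is now released on hg38.
For example, I downloaded pfam.df, a protein domain annotation file that annotates domains by coordinate across the human hg38 genome. The data look like this:
View the file with head pfam.hg38.df: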
PFAMID chr start end strand
Helicase_C_2 chr1 12190 12689 +
7tm_4 chr1 69157 69220 +
7TM_GPCR_Srsx chr1 69184 69817 +
7tm_1 chr1 69190 69931 +
7tm_4 chr1 69490 69910 +
7tm_1 chr1 450816 451557 -
7tm_4 chr1 450837 451263 -
EPV_E5 chr1 450924 450936 -
7TM_GPCR_Srsx chr1 450927 451572 -
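I want to convert the domain start and end coordinates to hg19, which requires UCSC's liftOver tool.
The tool needs a coordinate mapping (chain) file: http://hgdownload-test.cse.ucsc.edu/goldenPath/hg38/liftOver/
It can also only convert files in BED or another accepted format.
Example: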
chr7  127471196  127472363  Pos1  0  +  127471196  127472363  255,0,0
chr7  127472363  127473530  Pos2  0  +  127472363  127473530  255,0,0
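It is easy: pad your own file with a few filler columns to produce this 9-column format.
cat pfam.hg38.df |sed 's/\r//g' |awk '{print $2,$3,$4,$1,0,$5,$3,$4,"255,0,0"}'  >pfam.hg38.bed
That gives us everything needed for the conversion, and the command itself is very simple!
chmod +x liftOver
./liftOver pfam.hg38.bed hg38ToHg19.over.chain pfam.hg19.bed unmap
A successful run prints the messages below. An error usually means your file does not follow the standard BED format; remove comment lines and anything else that does not conform.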
Reading liftover chains
Mapping coordinates
After the conversion, a quick check shows that the coordinates really have changed. We only need to look at the first few columns.
grep -w p53 *bed
pfam.hg19.bed:chr11 44956439 44959858 p53-inducible11 0 - 44956439 44959858 255,0,0
pfam.hg19.bed:chr11 44956439 44959767 p53-inducible11 0 - 44956439 44959767 255,0,0
pfam.hg19.bed:chr2 669635 675557 p53-inducible11 0 - 669635 675557 255,0,0
pfam.hg19.bed:chr22 35660826 35660982 p53-inducible11 0 + 35660826 35660982 255,0,0
Look closely and you will see that the coordinates have changed (the chr2 record happens to stay the same):
pfam.hg38.bed:chr11 44934888 44938307 p53-inducible11 0 - 44934888 44938307 255,0,0
pfam.hg38.bed:chr11 44934888 44938216 p53-inducible11 0 - 44934888 44938216 255,0,0
pfam.hg38.bed:chr2 669635 675557 p53-inducible11 0 - 669635 675557 255,0,0
pfam.hg38.bed:chr22 35264833 35264989 p53-inducible11 0 + 35264833 35264989 255,0,0
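One quick way to quantify the change is to subtract the hg38 start from the hg19 start for each record. The sketch below hard-codes the four records from the grep output above and pairs them by line order. That pairing is an assumption: it holds here, but liftOver writes records it cannot map to the unmap file, so real files may need to be joined on a unique key instead.

```python
def parse_bed_line(line):
    """Return (chrom, start, end, name) from a whitespace-separated BED line."""
    chrom, start, end, name = line.split()[:4]
    return chrom, int(start), int(end), name

hg19 = [
    "chr11 44956439 44959858 p53-inducible11 0 - 44956439 44959858 255,0,0",
    "chr11 44956439 44959767 p53-inducible11 0 - 44956439 44959767 255,0,0",
    "chr2 669635 675557 p53-inducible11 0 - 669635 675557 255,0,0",
    "chr22 35660826 35660982 p53-inducible11 0 + 35660826 35660982 255,0,0",
]
hg38 = [
    "chr11 44934888 44938307 p53-inducible11 0 - 44934888 44938307 255,0,0",
    "chr11 44934888 44938216 p53-inducible11 0 - 44934888 44938216 255,0,0",
    "chr2 669635 675557 p53-inducible11 0 - 669635 675557 255,0,0",
    "chr22 35264833 35264989 p53-inducible11 0 + 35264833 35264989 255,0,0",
]

shifts = []
for new, old in zip(hg19, hg38):
    chrom, s19, e19, _ = parse_bed_line(new)
    _, s38, e38, _ = parse_bed_line(old)
    # Record the start offset and whether the interval length was preserved.
    shifts.append((chrom, s19 - s38, (e19 - s19) == (e38 - s38)))

for chrom, shift, same_length in shifts:
    print(chrom, shift, same_length)
```

In this small sample every interval keeps its length and only the offset differs per chromosome region, which is what a clean lift between builds should look like.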
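Bioconductor packages in R can also convert coordinates: http://www.bioconductor.org/help/workflows/liftOver/
This way you can carry on in R right after downloading the pfam.df database, which is a bit more convenient.
My data are shown below; you have to build a GRanges object from them yourself.

library(GenomicRanges)
# a: the pfam table as a data frame (columns PFAMID, chr, start, end, strand).
# Reading it this way is an assumption; adjust to however you loaded the file.
a <- read.table("pfam.hg38.df", header=TRUE, stringsAsFactors=FALSE)
pfam.hg38 <- GRanges(seqnames=Rle(a[,2]),
               ranges=IRanges(a[,3], a[,4]),
               strand=a[,5])
That is all there is to it. It is a bare-bones GRanges object, but it can now be converted with the liftOver method in R.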
library(rtracklayer)
ch = import.chain("hg38ToHg19.over.chain")

pfam.hg19 = liftOver(pfam.hg38, ch)

pfam.hg19 =unlist(pfam.hg19)
Then just write the converted pfam.hg19 out to a file.

 

06

jQuery study notes

From now on I will share posts like this straight from Youdao Cloud Notes, which saves space on this free cloud server.

jQuery study notes, part 1: basic syntax

http://note.youdao.com/share/?id=82021515144eb4820762e9fdbc686340&type=note

jQuery notes, part 2: slideshow (PPT-style) effects

http://note.youdao.com/share/?id=08eb606b2084b9b0d8c9eb5ef72e3433&type=note

jQuery notes, part 3: manipulating HTML elements

http://note.youdao.com/share/?id=fb8ff7deeb186adb82751838bf82cfbe&type=note

jQuery notes, part 4: loops, traversal, and conditional statements

http://note.youdao.com/share/?id=746ac6f1a801351f49d13cb3d7a335bf&type=note

jQuery notes, part 5: Ajax

http://note.youdao.com/share/?id=0b2c6fb8c89e307ec79602e6d67e7c66&type=note

jQuery reference manual: complete function list

http://note.youdao.com/share/?id=2e926f98c9bd51b1192d309706f8c1ca&type=note

 

 

01

2012: distinct biological significance of the three LAD subtypes

Paper: Differential Pathogenesis of Lung Adenocarcinoma Subtypes Involving Sequence Mutations, Copy Number, Chromosomal Instability, and Methylation
Lung adenocarcinoma (LAD) is genetically highly heterogeneous.
The cancer can be divided into three groups, the LAD molecular subtypes (Bronchioid, Magnoid, and Squamoid).
We then analyzed the following four features within the three subtypes and found highly significant differences between subtypes.
1. Gene mutation rates (EGFR, KRAS, STK11, TP53)
2. Chromosomal instability
3. Regional copy number
4. Genomewide DNA methylation
Three clinical features also differed significantly.
1. Patient overall survival
2. Cisplatin plus vinorelbine therapy response
3. Predicted gefitinib sensitivity
So our classification works well and is very useful clinically.
The data studied for LAD include:
1. DNA copy number
2. Gene sequence mutation
3. DNA methylation
4. Gene expression
Even a gene like TP53 is mutated in only 35% of LADs, so LAD should be subdivided further. Mutations such as EGFR and KRAS mutations are highly informative for treatment, and finer subdivision helps in choosing targeted clinical regimens.
We took data from 116 LAD samples and analyzed (1) genome-wide gene expression, (2) genome-wide DNA copy number, (3) genome-wide DNA methylation, and (4) selected gene sequence mutations.
The conclusion: LAD molecular subtypes correlate with grossly distinct genomic alterations and patient therapy response.
Data sources:
Gene expression --> Agilent 44 K microarrays.
DNA copy number --> Affymetrix 250 K Sty and SNP6 microarrays.
DNA methylation --> MSNP microarray assay.
DNA from EGFR, KRAS, STK11 and TP53 exons --> ABI sequencers
We used the R package ConsensusClusterPlus to assign our LADs to molecular subtypes based on gene expression.
Classification relied on 506 genes: clustering with ConsensusClusterPlus used the top 25% most variable genes (3,045 genes), and a nearest centroid subtype predictor used 506 genes.
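To make the two ideas in that sentence concrete, here is a minimal sketch of variance-based gene filtering followed by nearest-centroid assignment. It is not the paper's pipeline: the expression values and centroids are invented, and Euclidean distance is an assumption (the notes do not say which distance the predictor uses).

```python
import math
import statistics

def most_variable_genes(expr, fraction=0.25):
    """expr: {gene: [values across samples]} -> genes in the top `fraction` by variance."""
    ranked = sorted(expr, key=lambda g: statistics.pvariance(expr[g]), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

def nearest_centroid(profile, centroids):
    """profile: {gene: value}; centroids: {subtype: {gene: value}} -> closest subtype."""
    def distance(centroid):
        return math.sqrt(sum((profile[g] - centroid[g]) ** 2 for g in centroid))
    return min(centroids, key=lambda subtype: distance(centroids[subtype]))

# Toy expression matrix: four genes measured in four samples (invented numbers).
expr = {
    "SFTPB": [9.0, 8.5, 2.0, 1.5],
    "TDG": [1.0, 1.2, 6.0, 6.5],
    "ACTB": [5.0, 5.1, 5.0, 4.9],
    "GAPDH": [7.0, 7.0, 7.1, 7.0],
}
top = most_variable_genes(expr)          # 4 genes * 0.25 -> 1 gene kept

# Hypothetical subtype centroids over two marker genes.
centroids = {
    "Bronchioid": {"SFTPB": 8.8, "TDG": 1.1},
    "Magnoid": {"SFTPB": 1.8, "TDG": 6.2},
}
call = nearest_centroid({"SFTPB": 8.0, "TDG": 2.0}, centroids)
print(top, call)
```

Filtering on variance first keeps housekeeping-like genes with flat profiles out of the clustering, which is the usual motivation for a "most variable genes" cutoff.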

Genes overexpressed in the three LAD classes take part in different biological functions:
Bronchioid – excretion genes, asthma genes, and surfactants (SFTPB, SFTPC, SFTPD);
Magnoid – DNA repair genes, such as thymine-DNA glycosylase (TDG);
Squamoid – defense response genes, such as chemokine ligand 10 (CXCL10)
They also correspond to different clinical data:
Bronchioid had the most females, nonsmokers, early stage tumors, and low grade tumors, the greatest acinar content, the least necrosis, and the least invasion.
Squamoid had the most high grade tumors, the greatest solid content, and the lowest papillary content.
Magnoid had the most smokers and the heaviest smokers by pack years.
Their gene mutation patterns also differ greatly.
Bronchioid had the greatest EGFR mutation frequency.
Magnoid had the greatest mutation frequencies in TP53, KRAS and STK11.
To study how mutation patterns (genomewide mutation rates) differ between subtypes, we also studied a large set of rarely mutated genes (n = 623) from the Ding et al. cohort.

Conclusions:
The Bronchioid subtype is more likely to benefit from EGFR inhibitory therapy.
Magnoid tumors also have severe genomic alterations including the greatest CIN, the most regional CN alterations, DNA hypermethylation, and the greatest genomewide mutation rate.
The Squamoid subtype displayed the fewest distinctive alterations, which included only regional CN alterations.

31

2013-science-3205tumors-12types-4-ways-find-291HCD

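Paper: Comprehensive identification of mutational cancer driver genes across 12 tumor types
This paper compares four approaches to finding cancer driver genes and derives a combined, reliable list of 291 HCD genes.
The data come from 3,205 tumor samples covering 12 cancer types.
The Cancer Gene Census (CGC) database already holds close to 500 cancer genes.
Cancer genome studies can yield tens of thousands of somatic mutations, but only a small fraction of them drive tumor initiation and progression.
Most driver genes are also mutated at low frequency, and because tumors are heterogeneous, studies with large numbers of samples are essential.
The four mainstream approaches to finding cancer driver genes are:
1. Most common methods identify genes that are mutated more frequently than expected from the background mutation rate (recurrence)
2. Other methods exploit a bias towards the accumulation of functional mutations (FM bias)
3. Other methods exploit the tendency to sustain mutations in certain regions of the protein sequence (CLUST bias)
4. Other approaches exploit the overrepresentation of mutations in specific functional residues, such as phosphorylation sites (ACTIVE bias)
Their representative tools are MuSiC, OncodriveFM, OncodriveCLUST and ActiveDriver.
This paper compares the four approaches and combines their results.
In summary, we provide a very reliable list of 291 HCDs and a second one, of 144 CDs, more comprehensive but with an expectedly higher false-positives rate.
One hundred and sixty-five of these candidates are novel findings not included in the CGC.
The authors then ran a functional analysis of the 291 HCD genes, which concentrate mainly in the following five biological functions:
Chromatin remodeling,
mRNA processing,
Cell signaling/proliferation,
Cell adhesion,
DNA repair/Cell cycle
They then compared the 291 HCD genes obtained by combining the four methods against the nearly 500 cancer genes already in the Cancer Gene Census (CGC) database.
This paper shows for the first time that multiple driver-finding methods can be combined. The combination rests on two facts:
1. No method of finding cancer driver genes has a gold standard, so combining several methods is more comprehensive.
2. Combining several methods makes it easier to compare and assess the accuracy of the driver genes that are found.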
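The idea of combining several callers can be sketched as a simple vote count. This is only an illustration: the gene calls below are invented, and the "at least two methods" threshold is a hypothetical rule, not the criterion the paper used to define its HCD and CD lists.

```python
from collections import Counter

def combine_driver_calls(calls, min_methods=2):
    """calls: {method: set of genes}. Return (high_confidence, candidates)."""
    votes = Counter(gene for genes in calls.values() for gene in genes)
    high = {gene for gene, n in votes.items() if n >= min_methods}
    candidates = set(votes) - high       # supported by a single method only
    return high, candidates

# Invented example calls from the four tools named above.
calls = {
    "MuSiC": {"TP53", "KRAS", "PIK3CA"},
    "OncodriveFM": {"TP53", "ARID1A", "PIK3CA"},
    "OncodriveCLUST": {"KRAS", "PIK3CA"},
    "ActiveDriver": {"TP53", "CTNNB1"},
}
high, candidates = combine_driver_calls(calls)
print(sorted(high), sorted(candidates))
```

Because each method looks for a different signal, agreement between methods is informative in a way that repeating one method is not, which is the second "fact" listed above.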
31

2014-4742samples-21tumors-Cancer5000-set-254-genes

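Paper: Discovery and saturation analysis of cancer genes across 21 tumour types.
When many samples of one cancer are studied, few genes turn out to be mutated in as many as 20% of samples; most are mutated at intermediate frequencies (2–20%), and many low-frequency mutations go undetected because too few samples were studied.
We studied somatic point mutations in exome sequencing data from 4,742 tumor-normal pairs across 21 cancer types.
Cancer genes may concentrate in the following seven functions: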
proliferation,
apoptosis,
genome stability,
chromatin regulation,
immune evasion,
RNA processing
protein homeostasis
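Using a down-sampling approach (repeating the analysis on random subsets of the samples), we conclude that with roughly 600-5,000 samples for a given cancer type, many more genes mutated at clinically relevant low frequencies could be found.
The paper sets out to answer three questions:
1. Can large-scale cancer studies identify all cancer driver genes? (Coverage of known cancer genes)
2. Does increasing the sample size reveal many more cancer driver genes? (Analysis of novel candidate cancer genes)
3. How far are we from a complete catalogue of cancer driver genes? (Saturation analysis)
Mutation data were processed with Broad's stringent filtering and annotation pipeline.
Mutation counts: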
3,078,483 somatic single nucleotide variations(SSNVs),
77,270 small insertions and deletions (SINDELs)
29,837 somatic di-, tri- or oligonucleotide variations (DNVs, TNVs and ONVs, respectively)
an average of 672 per tumour–normal pair
These include:
540,831 missense,
207,144 synonymous,
46,264 nonsense,
33,637 splice-site
2,294,935 non-coding mutations
Our method for finding driver genes:
We used the most recent version of the MutSig suite of tools, which looks for three independent signals:
high mutational burden relative to background expectation, accounting for heterogeneity;
clustering of mutations within the gene;
enrichment of mutations in evolutionarily conserved sites.
We combined the p-values from these independent MutSig components to call driver genes, analyzing each cancer type separately and also all 21 cancer types together.
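These notes do not say how the component p-values were combined. As one standard possibility, the sketch below uses Fisher's method, which assumes the p-values are independent; treat it as an illustration of the idea rather than as what MutSig does internally.

```python
import math

def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method (chi-square, 2k degrees of freedom)."""
    k = len(pvalues)
    # X = -2 * sum(ln p); we work with X / 2. Clamp to avoid log(0).
    half_stat = -sum(math.log(max(p, 1e-300)) for p in pvalues)
    # Closed-form chi-square survival function for an even number of degrees of freedom:
    # exp(-X/2) * sum_{i<k} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)

# Three modest signals (burden, clustering, conservation) reinforce each other.
combined = fisher_combine([0.01, 0.01, 0.01])
print(combined)
```

A single p-value passes through unchanged, while several moderately small p-values combine into a much smaller one, which is why a gene can be called a driver without any one signal being overwhelming.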
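Driver gene results:
Analyzing each cancer type separately yields 334 gene/tumour-type pairs in total; the genes found in different cancer types naturally overlap.
These 334 pairs involve 224 distinct genes.
The number of genes detected per tumour type varied considerably (range of 1–58).
The number of driver genes found depends mainly on the cancer type, and then on the sample size for that cancer.
Only 22 genes were called as drivers in more than three cancer types.
Pooling all 21 cancer types finds 114 driver genes, of which 30 are not found by any per-type analysis and 84 are also found in the per-type analyses.
So of the 224 genes found by per-type analysis, 140 are not found by the pooled analysis. A Venn diagram makes this easy to see.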
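The Venn-diagram arithmetic can be checked in a few lines. The three input counts (224 per-type genes, 114 pooled genes, 30 pooled-only genes) come from the notes above; everything else is derived from them.

```python
per_type = 224        # distinct genes from the 21 per-type analyses
combined = 114        # genes from the pooled analysis of all 21 types
combined_only = 30    # pooled-analysis genes missed by every per-type analysis

shared = combined - combined_only      # found by both kinds of analysis
per_type_only = per_type - shared      # found only in per-type analyses
union = per_type + combined_only       # found by either analysis

print(shared, per_type_only, union)
```

The union of 254 matches the size of the Cancer5000 set quoted in these notes.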
The 21 per-type analyses plus the pooled analysis, 22 analyses in all, give a Cancer5000 set containing 254 genes.
Filtering the 254 genes of the Cancer5000 set more stringently leaves 219 distinct genes, called the Cancer5000-S (for ‘stringent’) gene set.
The Cancer Gene Census (CGC) (v65) contains 130 cancer genes driven by somatic point mutations, of which 82 were found by our analysis.
Four genes encode anti-proliferative proteins, in which loss-of-function mutations would be expected to contribute to oncogenesis.
Six additional genes encode proteins that are clearly involved in cell proliferation: RHEB, RHOA, SOS1, ELF3, SGK1 and MYOCD.
Five genes encode pro-apoptotic factors, in which loss-of-function mutations would be expected to promote oncogenesis.
Six genes encode proteins related to genome stability.
Five genes are associated with chromatin regulation.
Three genes encode proteins whose loss is expected to help tumours evade immune attack.
Three genes are associated with RNA processing and metabolism.
One gene, TRIM23, is involved in protein homeostasis.
Beyond these 33 genes, the set of 81 novel genes is likely to contain additional true cancer genes.
The down-sampling approach: An effective test is to perform ‘down-sampling’; that is, to study how the number of discoveries increases with sample size, by repeating the analysis on random subsets of samples of various smaller sizes.
Saturation analysis result: we are still far from saturation, and the number of genes discovered grows with sample size at different rates for different mutation frequencies.
Genes mutated in >20% of tumours are approaching saturation;
those mutated at frequencies of 10–20% are still rising rapidly, but at a decreasing rate;
those at 5–10% are increasing linearly;
and those at <5% are increasing at an accelerating rate.
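The down-sampling idea can be sketched on a toy cohort. The cohort is invented (three genes mutated in 20%, 5%, and 1% of 200 samples), and the discovery rule, "mutated in at least `min_hits` of the drawn samples", is a deliberately crude stand-in for a real significance test such as MutSig.

```python
import random

def discovered(samples, min_hits):
    """Genes mutated in at least `min_hits` of the given samples."""
    counts = {}
    for genes in samples:
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene for gene, c in counts.items() if c >= min_hits}

def down_sample_curve(samples, sizes, min_hits=3, repeats=20, seed=0):
    """Average number of discoveries over random subsets of each size."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        found = [len(discovered(rng.sample(samples, n), min_hits)) for _ in range(repeats)]
        curve[n] = sum(found) / repeats
    return curve

# Toy cohort: G20 in 40/200 samples, G05 in 10/200, G01 in 2/200.
cohort = []
for i in range(200):
    genes = set()
    if i % 5 == 0:
        genes.add("G20")
    if i % 20 == 1:
        genes.add("G05")
    if i % 100 == 2:
        genes.add("G01")
    cohort.append(genes)

curve = down_sample_curve(cohort, sizes=[10, 50, 100, 200])
print(curve)
```

Even in this toy setting the frequent gene is found at small subset sizes while the rarer ones need most of the cohort, and the 1% gene is never found at all, which mirrors the frequency-dependent saturation described above.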
Sample size requirements: cancers with a high background mutation rate (such as melanoma) need more samples, while for cancers with a low background mutation rate (such as neuroblastoma) about 650 samples are enough for a good driver gene analysis.
Creating a reasonably comprehensive catalogue of candidate cancer genes mutated in 2% of patients will require between approximately 650 samples (for tumours with about 0.5 mutations per Mb, such as neuroblastoma) to approximately 5,300 samples (for melanoma, with 12.9 mutations per Mb).
31

2015-MADGiC-identify-cancer-driver-gene

A recent tool for finding cancer driver genes:
Cancer is thought to result from the accumulation of causal somatic mutations throughout the lifetime of an individual.
These cancer-driving mutations mainly affect three classes of genes: (1) oncogenes, (2) tumor-suppressor genes, (3) stability genes.
The first mutations initiate tumorigenesis, and subsequent mutations drive tumor progression.
Identifying these mutations helps greatly in understanding gene function and in designing drug targets.
Distinguishing driver genes from passenger genes makes better use of the vast amount of mutation data in the various databases.
Frequency-based methods rely on an estimate of a background mutation rate which represents the rate of random passenger mutations.
This is the approach proposed by Ding et al. (2008), but it ignores the following four factors:
1. mutation type (transition versus transversion)
2. nucleotide context (which base is at the mutation site)
3. dinucleotide context (which bases are located at neighboring sites to the mutation)
4. expression level of the gene
Later papers proposed the following three improvements:
Sjoblom et al. (2006) account for nucleotide and dinucleotide context in searching for drivers of breast and colorectal cancer.
MuSiC (Dees et al., 2012) accounts for mutation type and allows for sample-specific mutation rates;
Lawrence et al. (2013) (MutSigCV) also allow for the inclusion of gene-specific factors such as expression level and replication timing.
But they all inherit one weakness: they assume that driver genes are mutated more often than the background mutation rate.
In practice, criteria other than mutation frequency also matter, which is where three tools come in: SIFT (first reported by Ng and Henikoff (2001), later updated by Kumar et al. (2009)), PolyPhen (Adzhubei et al., 2010), and MutationAssessor (Reva et al., 2011).
These tools integrate sequence context, position, and protein characteristics to assess a mutation’s functional impact.
To summarize the criteria for identifying cancer driver genes:
1. mutation frequency,
2. mutation type,
3. gene-specific features such as replication timing and expression level that are known to affect background rates of mutation,
4. mutation-specific scores that assess functional impact, and the spatial patterning of mutations that only becomes apparent when thousands of samples are considered.
Earlier methods each cover only part of the criteria above.
We propose a unified empirical Bayesian Model-based Approach for identifying Driver Genes in Cancer (MADGiC) that utilizes each of these features.
31

2014-REVIEW-identifying driver mutation in sequenced cancer genome

Somatic mutations is a broad term that covers SNVs, indels, CNAs, SVs, and more.
However, the resulting sequencing datasets are large and complex, obscuring the clinically important mutations in a background of errors, noise, and random mutations.
Cancer is driven largely by somatic mutations that accumulate in the genome over an individual’s lifetime, with additional contributions from epigenetic and transcriptomic alterations.
Success stories from the low-throughput era: imatinib has been used to target cells expressing the BCR-ABL fusion gene in chronic myeloid leukemia,
and gefitinib has been used to inhibit the epidermal growth factor receptor in lung cancer.
But that is far from enough.
Three major challenges for NGS: (1) identifying somatic mutations (errors / tumor heterogeneity); (2) identifying driver genes; (3) determining the pathways and other biological processes altered by somatic mutations.
Sources of error: optical PCR duplicates, GC-bias, strand bias (where reads indicating a possible mutation only align to one strand of DNA) and alignment artifacts resulting from low complexity or repetitive regions in the genome.
Most methods for somatic mutation detection address only a subset of the possible sources of error, and there are many SNP-calling tools.
Three key approaches to identifying driver mutations:
1. identifying recurrent mutations;
2. predicting the functional impact of individual mutations;
3. assessing combinations of mutations using pathways, interaction networks, or statistical correlations.
Each approach has spawned a large number of tools, with the following problems:
1. Tools that look directly at mutation frequency determine whether the observed number of mutations in the gene is significantly greater than the number expected according to a background mutation rate (BMR).
The BMR is very hard to pin down: set it too low and you get many false positives, set it too high and you miss many real driver mutations. Genes with a very high mutation frequency are safe either way; TP53, for example, is called a driver gene by every algorithm.
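The sensitivity to the BMR is easy to see with a plain binomial test. The sketch below is a simplification: it treats every base in every sample as an independent trial with the same mutation probability, and the gene length, cohort size, and mutation count are hypothetical numbers chosen for illustration.

```python
import math

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    def log_pmf(i):
        return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                + i * math.log(p) + (n - i) * math.log1p(-p))
    below = sum(math.exp(log_pmf(i)) for i in range(k))
    return max(0.0, 1.0 - below)

gene_length = 1500     # coding bases (hypothetical)
n_samples = 100        # tumours sequenced (hypothetical)
observed = 3           # mutations seen in this gene across the cohort
trials = gene_length * n_samples

p_low_bmr = binomial_tail(observed, trials, 1e-6)    # optimistic background rate
p_high_bmr = binomial_tail(observed, trials, 1e-5)   # ten-fold higher background rate
print(p_low_bmr, p_high_bmr)
```

The same three mutations look highly significant under the lower rate and unremarkable under the higher one, so a ten-fold error in the BMR estimate is enough to flip the call.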
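2. Tools that score the effect of a mutation on protein function bring in some prior assumptions:
evolutionary conservation,
known protein domains,
non-random clustering of mutations,
protein structure,
3. Tools based on pathways, interaction networks, and de novo approaches:
Pathway-based analysis (KEGG, GO, GSEA) has four limitations. First, most annotated gene sets contain too many genes, so our mutated genes make up too small a share of the gene set to reach statistical significance.
Second, pathways are not independent; the links between pathways matter more.
Third, splitting genes into small units such as pathways ignores links outside the unit.
Finally, it considers only known pathways or gene sets.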
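The first limitation can be illustrated with a hypergeometric enrichment test, a common choice for over-representation analysis (the notes do not name a specific test, so this is an assumption). All numbers are invented: 20,000 genes in total, 50 of them mutated, and two gene sets of very different size.

```python
from math import comb

def hypergeom_tail(k, total_genes, set_size, n_mutated):
    """P(overlap >= k) when n_mutated genes are drawn at random from total_genes,
    of which set_size belong to the gene set."""
    upper = min(set_size, n_mutated)
    favourable = sum(
        comb(set_size, i) * comb(total_genes - set_size, n_mutated - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(total_genes, n_mutated)

# 3 of 50 mutated genes fall in a small 20-gene set.
p_small_set = hypergeom_tail(3, 20000, 20, 50)
# 6 of 50 mutated genes fall in a large 2,000-gene set.
p_large_set = hypergeom_tail(6, 20000, 2000, 50)
print(p_small_set, p_large_set)
```

Six hits in the large set are roughly what chance alone predicts (five are expected), so the set is not significant even though it contains twice as many mutated genes as the small one.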
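The past five years have seen dramatic change in cancer genome sequencing research, but several challenges remain before it reaches real clinical use:
First, we have neglected non-coding somatic mutations.
Second, many cancer types as we define them are really a mixture of subtypes.
Third, which cancers can be studied together?
Finally, how should different kinds of NGS data be studied jointly, including WGS, WES, RNA sequencing, DNA methylation, and chromatin modifications?
For some patients, precision cancer medicine has already arrived; for most, there is still a long way to go.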
31

2014-review-Next-generation sequencing to guide cancer therapy


 This reductionist thinking led the initial theories on carcinogenesis to be centered on how many “hits” or genetic mutations were necessary for a tumor to develop.
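Reductionists held that the causes of cancer initiation and progression centre on a set of necessary factors: "hits" or genetic mutations.
Because of this assumption, early efforts to explore the genetic basis of many cancers were mainly low-throughput studies of specific genes or variants.
Choice of assay: microarray vs WGS vs WES
Choice of clinical specimen: fresh frozen tissue / FFPE specimens / CTCs / ctDNA
Clinical NGS data analysis: mapping --> SNVs, CNVs and SVs --> annotation
Challenges: (1) low-frequency mutations are hard to tell apart from sequencing errors;
(2) many clinically relevant DNA fusions occur in non-coding regions, so WES also misses a good deal of information
Annotating clinical NGS data: many databases, many analysis methods
Three ways NGS supports clinical care: (1) diagnosis: early detection and precise classification; (2) targeted treatment; (3) drug resistance: switching drugs promptly
CTC: Circulating tumor cell;
ctDNA: circulating tumor DNA;
FDA: Food and Drug Administration;
FFPE: Formalin-fixed, paraffin-embedded;
MATCH: Molecular Analysis for Therapy Choice;
MHC: Major histocompatibility complex;
NGS: Next-generation sequencing;
SNV: Single nucleotide variant;
TCGA: The Cancer Genome Atlas.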
31

Paper notes-2015-nature-molecular analysis of gastric cancer: a new classification and prognosis study

Paper: Molecular analysis of gastric cancer identifies subtypes associated with distinct clinical outcomes

A small pre-defined set of gene expression signatures

epithelial-to-mesenchymal transition (EMT)
microsatellite instability (MSI)
cytokine signaling
cell proliferation
DNA methylation
TP53 activity
gastric tissue

 

The classical classification: Gastric cancer may be subdivided into 3 distinct subtypes (proximal, diffuse, and distal gastric cancer) based on histopathologic and anatomic criteria. Each subtype is associated with unique epidemiology.

We used principal component analysis (PCA):

PC1

PC2

PC3

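These three principal components correlate with the seven features listed above.

Based on our principal component analysis, the 300 GC samples can be split into four groups, named as follows: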

Gene expression signatures define four molecular subtypes of GC:

MSI (n = 68),

MSS/EMT (n = 46),

MSS/TP53+ (n = 79)

MSS/TP53− (n = 107)

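The paper's classification was then tested on two other published datasets, which again split into four groups

(MSI, MSS/EMT, MSS/TP53+ and MSS/TP53−)

From the TCGA database: n = 46, n = 62, n = 50 and n = 47, respectively.

From the Singapore study: n = 12, n = 85, n = 39 and n = 63, respectively.

This grouping reveals some patterns: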

(i) The MSS/EMT subtype occurred at a significantly younger age (P = 3e-2) than did other subtypes. The majority (>80%) of the subjects in this subtype were diagnosed with diffuse-type (P < 1e-4) at stage III/IV(P = 1e-3).

(ii) The MSI subtype occurred predominantly in the antrum (75%), >60% of subjects had the intestinal subtype, and >50% of subjects were diagnosed at an early stage (I/II).

(iii) Epstein-Barr virus (EBV) infection occurred more frequently in the MSS/TP53+ group (n = 12/18, P = 2e-4) than in the other groups.

 

We then ran a survival analysis on our 300 samples:

Prognosis: MSI > MSS/TP53+ > MSS/TP53− > MSS/EMT

Next, we validated the survival trend of GC subtypes in three independent cohorts: Samsung Medical Center cohort 2 (SMC-2, n = 277, GSE26253),

Singapore cohort (n = 200, GSE15459) and

TCGA gastric cohort (n = 205).

We saw that the GC subtypes showed a significant association with overall survival.

Conclusion: our classification is the most reasonable one, and it correlates strongly with the prognosis of each class.

 

Next, the mutation patterns:

MSI ~ hypermutation ~ KRAS (23.3%), the PI3K-PTEN-mTOR pathway (42%), ALK (16.3%) and ARID1A (44.2%).

We observed enrichment of PIK3CA H1047R mutations in the MSI samples

we saw enrichment of E542K and E545K mutations in MSS tumors

The EMT subtype had a lower number of mutation events when compared to the other MSS groups(P = 1e−3).

The MSS/TP53− subtype showed the highest prevalence of TP53 mutations (60%), with a low frequency of other mutations

the MSS/TP53+ subtype showed a relatively higher prevalence (compared to MSS/TP53−) of mutations in APC, ARID1A, KRAS, PIK3CA and SMAD4.

Next, copy number variation:

Next, a comparison with the classifications from two other research groups:

The TCGA study reported expression clusters (subtypes named C1–C4) and genomic subtypes (subtypes named EBV+, MSI, Genome Stable (GS) and Chromosomal Instability (CIN)).

A follow-up study of the Singapore cohort described three expression subtypes (Proliferative, Metabolic and Reactive).

However, a consensus on clinically relevant subtypes that encompasses molecular heterogeneity and that can be used in preclinical and clinical research has not been reported.

Here we report the molecular classification of GC linked not only to distinct patterns of genomic alterations, but also to recurrence pattern and prognosis across multiple GC cohorts.

 

 

microsatellite instability

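English abbreviation: MI
Full name: microsatellite instability
Category: biological sciences
Summary: Microsatellite instability (MI) testing builds on the discovery of VNTRs. The cellular genome contains large numbers of repeated base sequences. Tandem repeats of 6-7 bp units are generally called minisatellite DNA, also known as VNTRs, while tandem repeats of 1-4 bp units are called microsatellite DNA, also known as simple repeat sequences (SRS). SRS are among the most common repeat sequences and offer rich polymorphism, high heterozygosity, and a low recombination rate. The most common are dinucleotide repeats, namely (AC)n and (TG)n. Studies show that when n≥104, 2-bp repeats are highly polymorphic in the population. SRS are widespread in prokaryotic and eukaryotic genomes, make up about 5% of eukaryotic genomes, and are one of the new DNA polymorphism markers that have developed rapidly in recent years. Microsatellite instability (MI) refers to the gain or loss of simple repeat sequences. MI was first observed in colon cancer; in 1993, gains or losses of (AC)n repeats were observed on multiple chromosomes in HNPCC, and microsatellite instability was later also found in gastric, pancreatic, lung, bladder, breast, prostate, and other tumors, suggesting that MI may be another important molecular feature of tumor cells. Results show that MI is related to tumor development; MI has been found only in tumor cells and has never been detected in normal tissue. In both primary and metastatic tumors, MI is evenly distributed throughout the tumor. The MI frequency in advanced gastric cancer is significantly higher than in early gastric cancer.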

 

31

Paper notes-2010-R-software-identify-cancer_driver_genes

We tested an R program for finding driver genes in cancer on data from 188 non-small cell lung tumors.
Software: http://linus.nci.nih.gov/Data/YounA/software.zip
It is an R program with a readme inside, and it is simple to use.
Prepare two files, silent_mutation_table.txt and nonsilent_mutation_table.txt. Both are plain text with the content shown below: format the SNPs you found and split them into silent and nonsilent according to the annotation.
#Ensembl_gene_id Chromosome Start_position Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode
#ENSG00000122477 1 100390656 SNP G G A TCGA-23-1022-01A-02W-0488-09
Then run the package's main program directly; in R, source("main_R_script.r")
We reanalyzed sequence data for 623 candidate genes in 188 non-small cell lung tumors using the new method.
The goal is to identify genes that are frequently mutated and thereby are expected to have primary roles in the development of tumor.
To find these driver genes, each gene is tested for whether its mutation rate is significantly higher than the background (or passenger) mutation rate.

Some investigators (Sjoblom et al., 2006) further divide mutations into several types according to the nucleotide and the neighboring nucleotides of the mutations.

Three shortcomings of the Ding et al. (2008) method:
1. Different types of mutations can have different impact on proteins. (The more a mutation affects protein function, the more likely it is a driver mutation.)
2. Different samples have different background mutation rates. (Mutations in a sample with a high mutation background may well be due to that background rather than to the cancer.)
3. A different number of non-silent mutations can occur at each base pair according to the genetic code. (For example, tryptophan has only one codon, whereas arginine has as many as six.)
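The third shortcoming can be checked directly against the standard genetic code. The sketch below builds the codon table from the usual 64-letter string (codons ordered TTT, TTC, TTA, TTG, TCT, ...), counts codons per amino acid, and counts how many of the nine possible single-base substitutions in a codon are non-silent. Counting a change to a stop codon as non-silent is a choice made here for illustration.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

codons_per_aa = Counter(CODON_TABLE.values())

def nonsilent_changes(codon):
    """Count single-base substitutions of `codon` that change the encoded residue."""
    count = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] != CODON_TABLE[codon]:
                count += 1
    return count

print(codons_per_aa["W"], codons_per_aa["R"])            # tryptophan vs arginine
print(nonsilent_changes("TGG"), nonsilent_changes("CGA"))
```

Every substitution in the tryptophan codon TGG is non-silent, while four of the nine substitutions in the arginine codon CGA are silent, so the expected number of non-silent mutations really does depend on sequence composition.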

Four advantages of the proposed method:
1. We score non-synonymous mutations according to their impact on protein function.
2. We allow different samples to have different BMRs.
3. Whether a mutation is non-silent or silent depends on the genetic code, and the method accounts for that.
4. We take into account uncertainties in the background mutation rate by using empirical Bayes methods.

Five areas still to be improved:
1. However, the functional impact is also dependent on the position in which a mutation occurs. (We considered only the amino acid change caused by a mutation.)
2. The current scoring system, which assigns mutation scores in the order missense mutation < inframe indel < mutation in splice sites < frameshift indel = nonsense mutation, may be biased toward identifying tumor suppressor genes over oncogenes.
3. We may refine our background mutation model in Table 1 so that all six types of mutations, A:T→G:C, A:T→C:G, A:T→T:A, G:C→A:T, G:C→T:A, G:C→C:G, have separate mutation rates.
4. We did not take into account correlations among mutations in identifying driver genes.
5. One might combine both copy number variation and sequencing data to identify driver genes.

R code for converting HGNC gene symbols to Ensembl database IDs:
library(biomaRt)
ensembl=useMart("ensembl",dataset = "hsapiens_gene_ensembl")
all.gene.table = read.table("all_gene.symbol", header=F)
convert=getBM(attributes = c("chromosome_name","ensembl_gene_id","hgnc_symbol"),filters =c("hgnc_symbol"),values=all.gene.table[,1],mart=ensembl)
chromosome=c(1:22,"X","Y","M")
convert=convert[!is.na(match(convert[,1],chromosome)),2:3] #remove names whose matching chromosome is not 1-22, X, or Y.
convert=convert[rowSums(convert=="")==0,]
write.table(convert,"ensembl2symbol.list",quote = F,row.names =F,col.names =F)
write.table(convert,"all_gene_name.txt",quote = F,row.names =F,col.names =F)

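One gene symbol may map to several Ensembl IDs, but on each chromosome the mapping is one-to-one.
Some gene symbols have no Ensembl ID, usually because the symbol is not a proper HGNC symbol or is an outdated ID.
TIGAR below, for example, may well be written as C12orf5.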
Aliases for TIGAR Gene
TP53 Induced Glycolysis Regulatory Phosphatase 2 3
TP53-Induced Glycolysis And Apoptosis Regulator 2 3 4
C12orf5 3 4 6
Probable Fructose-2,6-Bisphosphatase TIGAR 3
Fructose-2,6-Bisphosphate 2-Phosphatase 3
Chromosome 12 Open Reading Frame 5 2
Fructose-2,6-Bisphosphatase TIGAR 3
Transactivated By NS3TP2 Protein 3
EC 3.1.3.46 4
FR2BP 3
External Ids for TIGAR Gene
HGNC: 1185 Entrez Gene: 57103 Ensembl: ENSG00000078237 OMIM: 610775 UniProtKB: Q9NQ88
Previous HGNC Symbols for TIGAR Gene
C12orf5

29

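Must-read papers for cancer research

I recently needed to learn some cancer-related background, came across this reading list, and thought it was excellent, so I am recommending it to everyone.

Read them gradually as time allows; a month should be enough to get through them.

Full list of cancer types http://www.cancer.gov/types
Full list of cancer drugs http://www.cancer.gov/about-cancer/treatment/drugs
Almost everything about cancer can be found on this site http://www.cancer.gov/
including popular-science material on cancer, treatment, diagnosis, prognosis, classification, drugs, prediction, and so on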

different_kinds_of_cancer_in_CCLE

Cancer Precision Medicine: Improving Evidence in Practice - August 24, 2015

NCI-MATCH Trial Opens, AACR blog post, August 2015

NCI-MATCH launch highlights new trial design in precision-medicine era
McNeal C, JNCI, August 2015

The Cancer Genomics Resource List, 2014
Zutter MM et al. CAP Lab Improvement Program, Archives of Pathology, August 2015

Personalized medicine and economic evaluation in oncology: all theory and no practice?
Garattini L et al. Expert Rev Pharmacoecon Outcomes Res 2015 Aug 9. 1-6

Precision medicine trials bring targeted treatments to more patients, C. Helwick, ASCO Post, Jul 25

Next-generation sequencing to guide cancer therapy
Gagan J et al, Genome Medicine, July 29, 2015

Feasibility of large-scale genomic testing to facilitate enrollment onto genomically matched clinical trials.
Meric-Bernstam F et al. J. Clin. Oncol. 2015 May 26.

Brave-ish new world-what's needed to make precision oncology a practical reality.
MacConaill LE et al. JAMA Oncol 2015 Jul 16.

Genomic profiling: Building a continuum from knowledge to care
Helen C et al. JAMA Oncology, July 2015

Are we there yet?
When it comes to curing cancer, targeted therapies and genomic sequencing are helping, but we still have far to go. Genome Magazine, June 29, 2015

Artificial intelligence, big data, and cancer
Kantarjian H et al, JAMA Oncology, June 2015

Multigene panel testing in oncology practice - how should we respond?
Kurian AW et al. JAMA Oncology, June 2015

Use of whole genome sequencing for diagnosis and discovery in the cancer genetics clinic.
Foley SB et al. EBioMedicine 2015 Jan 2(1) 74-81

The future of molecular medicine: biomarkers, BATTLEs, and big data
ES Kim, ASCO University, June 2015

NCI-MATCH trial will link targeted cancer drugs to gene abnormalities

Targeted agent and profiling utilization registry study, from the American Society for Clinical Oncology

ASCO study aims to learn from patient access to targeted cancer drugs used off-label, American Society for Clinical Oncology

Improving evidence developed from population-level experience with targeted agents
McLellan M et al Issue Brief. Conference on Clinical Cancer Research November 2014

Implementing personalized cancer care.
Schilsky RL et al. Nat Rev Clin Oncol 2014 Jul (7) 432-8

Accelerating the delivery of patient-centered, high-quality cancer care.
Abrahams E et al. Clin. Cancer Res. 2015 May 15. (10) 2263-7

Next-generation clinical trials: Novel strategies to address the challenge of tumor molecular heterogeneity.
Catenacci DV et al. Mol Oncol 2015 May (5) 967-996

Cancer Precision Medicine: Improving Evidence in Practice - May 29, 2015

Diagnosis and treatment of cancer using genomics
Vockley JG et al. BMJ, May 28, 2015

Precision Medicine: Cancer and Genomics - May 12, 2015

Promise, peril seen in personalized cancer therapy, by Marie McCullough, Philadelphia Inquirer, May 10

A decision support framework for genomically informed investigational cancer therapy.
Meric-Bernstam F et al. J. Natl. Cancer Inst. 2015 Jul (7)

Divide and conquer: The molecular diagnosis of cancer, by Louis M. Staudt, National Cancer Institute, Apr 13

Health: Make precision medicine work for cancer care
To get targeted treatments to more cancer patients pair genomic data with clinical data, and make the information widely accessible, Mark A. Rubin. Nature News, Apr 15

Using somatic mutations to guide treatment decisions
Horlings H et al. JAMA Oncology, March 12, 2015

The landscape of precision cancer medicine clinical trials in the United States
Roper N et al. Cancer Treatment Reviews 2015

What is “precision medicine”? Information from the National Cancer Institute

Impact of cancer genomics on precision medicine for the treatment of cancer, from the Cancer Genome Atlas, NCI

US precision-medicine proposal sparks questions, by Sara Reardon, Nature News, Jan 22

Obama's 'precision medicine' means gene mapping, NBC News, Jan 21

What is President Obama's 'precision medicine' plan, and how might it help you? By Lenny Bernstein, Jan 21

Recent reviews

Companion diagnostics: the key to personalized medicine.
Jørgensen JT. Expert Rev Mol Diagn. 2015 Feb;15(2):153-6

Promoting precision cancer medicine through a community-driven knowledgebase.
Geifman N, et al. J Pers Med. 2014 Dec 15;4(4):475-88.

Toward a prostate cancer precision medicine.
Rubin MA. Urol Oncol. 2014 Nov 20.

Prioritizing targets for precision cancer medicine.
Andre F, et al. Ann Oncol. 2014 Dec;25(12):2295-303

Toward precision medicine with next-generation EGFR inhibitors in non-small-cell lung cancer.
Yap TA, Popat S. Pharmgenomics Pers Med. 2014 Sep 19;7:285-95.

Genomically driven precision medicine to improve outcomes in anaplastic thyroid cancer.
Pinto N, et al. J Oncol. 2014;936285

Translating genomics for precision cancer medicine.
Roychowdhury S, Chinnaiyan AM. Annu Rev Genomics Hum Genet. 2014;15:395-415

The Cancer Genome Atlas: Accomplishments and Future - April 3, 2015

The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge
Tomczak K, et al. Contemp Oncol (Pozn). 2015; 19(1A): A68-A77.

The Cancer Genome Atlas' 4th Annual Scientific Symposium
May 11-12 ~ Bethesda, MD

The Cancer Genome Atlas (TCGA) Data Portal
Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA

Cancer Genomics Hub: A resource of the National Cancer Institute, from the UCSC Genome Browser

Molecular classification of gastric adenocarcinoma: translating new insights from The Cancer Genome Atlas Research Network.
Sunakawa Y et al. Curr Treat Options Oncol 2015 Apr (4) 331

TCGA data and patient-derived orthotopic xenografts highlight pancreatic cancer-associated angiogenesis.
Gore J et al. Oncotarget 2015 Feb 25.

Radiogenomics of clear cell renal cell carcinoma: preliminary findings of The Cancer Genome Atlas-Renal Cell Carcinoma (TCGA-RCC) Imaging Research Group.
Shinagare AB et al. Abdom Imaging 2015 Mar 10.

Proteomics of colorectal cancer in a genomic context: First large-scale mass spectrometry-based analysis from the Cancer Genome Atlas.
Jimenez CR et al. Clin. Chem. 2015 Feb 26.

End of cancer-genome project prompts rethink
Geneticists debate whether focus should shift from sequencing genomes to analysing function. Heidi Ledford, Nature News and Comments, January 2015

Cancer Genomics: Insights into Driver Mutations - March 10, 2015

Seek and destroy: Relating cancer drivers to therapies
E. Martinez-Ledesma et al. Cell, March 9, 2015

In silico prescription of anticancer drugs to cohorts of 28 tumor types reveals targeting opportunities
C Rubio-Perez et al. Cancer Cell, March 9, 2015

MADGiC: a model-based approach for identifying driver genes in cancer.
Keegan D. Korthauer et al. Bioinformatics, January 2015

Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine.
Benjamin J Raphael et al. Genome Medicine 2014

Novel recurrently mutated genes in African American colon cancers.
Guda K et al. Proc Natl Acad Sci U S A. 2015 Jan 12

Sparse expression bases in cancer reveal tumor drivers.
Logsdon BA, et al. Nucleic Acids Res. 2015 Jan 12

Patient-specific driver gene prediction and risk assessment through integrated network analysis of cancer omics profiles.
Bertrand D, et al. Nucleic Acids Res. 2015 Jan 8

Identification of constrained cancer driver genes based on mutation timing.
Sakoparnig T, et al. PLoS Comput Biol. 2015 Jan 8;11(1):e1004027

CaMoDi: a new method for cancer module discovery.
Manolakos A, et al. BMC Genomics. 2014 Dec 12;15 Suppl 10:S8.

VHL, the story of a tumour suppressor gene.
Gossage L, et al. Nat Rev Cancer. 2014 Dec 23;15(1):55-64

Targeting the MET pathway for potential treatment of NSCLC.
Li A, et al. Expert Opin Ther Targets. 2014 Dec 23:1-12

Deciphering oncogenic drivers: from single genes to integrated pathways.
Chen J, et al. Brief Bioinform. 2014 Nov 5.

Driver and passenger mutations in cancer.
Pon JR, et al. Annu Rev Pathol. 2014 Oct 17

Hereditary Cancer Genetic Testing: Where are We? - December 18, 2014

NCI paper: Prevalence and correlates of receiving and sharing high-penetrance cancer genetic test results: Findings from the Health Information National Trends Survey
Taber J.M. et al. Public Health Genomics, January 2015

Clinical decisions: Screening an asymptomatic person for genetic risk--polling results
Schulte J, et al. N Engl J Med 2014 Nov;371(20):e30

Testing for hereditary breast cancer: Panel or targeted testing? Experience from a clinical cancer genetics practice.
Doherty J, J Genet Couns. 2014 Dec 5

Hereditary colorectal cancer syndromes: American Society of Clinical Oncology clinical practice guideline endorsement of the familial risk-colorectal cancer: European Society for Medical Oncology clinical practice guidelines.
Stoffel EM, et al. J Clin Oncol. 2014 Dec 1

Population testing for cancer predisposing BRCA1/BRCA2 mutations in the Ashkenazi-Jewish community: A randomized controlled trial.
Manchanda R, et al. J Natl Cancer Inst. 2014 Nov 30;107(1)

Cost-effectiveness of population screening for BRCA mutations in Ashkenazi Jewish women compared with family history-based testing.
Manchanda R et al. J Natl Cancer Inst. 2014 Nov 30;107(1). pii: dju380. doi: 10.1093/jnci/dju380. Print 2015 Jan.

Check out our Cancer Genetic Testing  Update Page for additional information and links

Cancer Genomic Tests (October 30, 2014)

Cancer Precision Medicine: Where Are We? - September 18, 2014

NIH announces the launch of 3 integrated precision medicine trials; ALCHEMIST is for patients with certain types of early-stage lung cancer, August 2014

National Cancer Institute's Precision Medicine Initiatives for the New National Clinical Trials Network. Jeffrey Abrams et al. ASCO Annual Meeting 2014

Personalized medicine: Special treatment.
Michael Eisenstein. Nature, September 11, 2014

Why the controversy? Start sequencing tumor genes at diagnosis. Tumor sequencing at the time of diagnosis can give significant insight for successful cancer treatment, by Shelly Gunn, Genetic Engineering & Biotechnology News, Sep 10

National Cancer Institute information: Precision medicine and targeted therapy

Genomics and precision oncology: What's a targeted therapy for cancer? An updated list of approved drugs from the National Cancer Institute (2014)

Therapy: This time it's personal
Gravitz L Nature 509, S52-S54 2014 May 29

Multi-marker solid tumor panels using next-generation sequencing to direct molecularly targeted therapies
Michael Marrone, et al. PLoS Currents Evidence on Genomic Tests 2014 May 27

Impact of cancer genomics on precision medicine for the treatment of cancer, from the National Cancer Institute

Cancer genomics and precision medicine in the 21st century, PowerPoint presentation from the National Human Genome Research Institute

 

28

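Cancer types in the TCGA database and lists of cancer-related genes

TCGA projects cover a large number of cancer types, but when analyzing the data we often use terms such as pan-cancer 12, pan-cancer 17, or pan-cancer 21 to indicate how many cancer types a dataset contains. Papers usually give the cancer abbreviations or the full names: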

BLCA, BRCA, COADREAD, GBM, HNSC, KIRC, LAML, LGG, LUAD, LUSC, OV, PRAD, SKCM, STAD, THCA, UCEC.

Acute myeloid leukaemia
Bladder
Breast
Carcinoid
Chronic lymphocytic leukaemia
Colorectal
Diffuse large B-cell lymphoma
Endometrial
Oesophageal adenocarcinoma
Glioblastoma multiforme
Head and neck
Kidney clear cell
Lung adenocarcinoma
Lung squamous cell carcinoma
Medulloblastoma
Melanoma
Multiple myeloma
Neuroblastoma
Ovarian
Prostate
Rhabdoid tumour

HCD features: download

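This is a list of high-confidence cancer driver genes: more than 280 genes in total.
Cancer5000 features: download

This is a list of cancer-related genes obtained from a study of nearly 5,000 cancer samples: more than 230 genes in total.

References: http://bg.upf.edu/oncodrive-role/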

http://bioinformatics.oxfordjournals.org/content/30/17/i549.full

http://www.nature.com/nature/journal/v505/n7484/full/nature12912.html?WT.ec_id=NATURE-20140123

28

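Sharing materials from the TCGA annual symposia

Anyone working in bioinformatics, especially in cancer research, has probably heard of TCGA. There have been four annual symposia so far, and the shared materials are mainly slide decks in PDF format. If you need them, you can click through to the page below and download them one by one to study at your own pace, or download them directly from the links I list further down.

Videos of the talks were originally shared on YouTube, but YouTube is blocked in mainland China, so the slides will have to do.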

http://www.genome.gov/17516564

The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.

TCGA is a joint effort of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), which are both part of the National Institutes of Health, U.S. Department of Health and Human Services.

Meetings

The PDF links are as follows:

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Laird.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Durbin.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Ley.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Sartor.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Ciriello.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Imielinski.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Gao.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Carter.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Ng.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Parvin.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Raphael.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Lawrence.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Kreisberg.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Marra.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Helman.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Stuart.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Cooper.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Levine.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Natsoulis.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Haussler.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Erkkila.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Gehlenborg.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Qiao.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Sivachenko.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Sumazin.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Gutman.pdf

http://www.genome.gov/Multimedia/Slides/TCGA1/TCGA1_Mardis.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/01_Shaw.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/02_Chanock.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/03_Staudt.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/05_Creighton.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/06_Stojanov.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/07_Karchin.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/08_Mungall.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/09_Hakimi.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/10_Gao.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/11_Hayes.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/12_Troester.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/13_Knobluach.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/14_Raphael.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/15_Akbani.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/16_Giordano.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/17_Weinstein.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/18_Zheng.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/19_Getz.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/20_VanDneBroek.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/21_Liao.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/22_Khazanov.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/23_Levine.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/24_Miller.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/25_Ewing.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/26_Cirello.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/27_Verhaak.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/28_Hofree.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/29_Meyerson.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/30_Yang.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/31_Wheeler.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/32_Parfenov.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/33_Bernard-Rovira.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/34_Hast.pdf

http://www.genome.gov/Multimedia/Slides/TCGA2/36_Sellars.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/04_Brat.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/05_Mungall.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/06_Boutros.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/07_Zmuda.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/08_Benz.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/09_Zheng.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/11_Creighton.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/12_Aksoy.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/13_Dinh.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/14_Stuart.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/15_Amin.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/16_Gross.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/15_Akbani.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/18_Giordano.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/19_Amin-Mansour.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/20_Oesper.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/21_Gatza.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/22_Bernard.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/23_Sinha.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/24_Akbani.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/25_Watson.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/26_Martignetti.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/27_Bandlamudi.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/28_Fu.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/29_Akdemir.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/30_Bass.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/31_Hakimi.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/32_Wheeler.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/33_Lehmann.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/34_Gordenin.pdf

http://www.genome.gov/Multimedia/Slides/TCGA3/35_Wyczalkowski.pdf

 

http://www.genome.gov/Multimedia/Slides/TCGA4/02_Zenklusen.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/03_Hutter.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/04_Brat.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/05_Mungall.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/06_Linehan.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/07_Brooks.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/08_Wu.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/09_Giger.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/10_Wilkerson.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/11_Orsulic.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/12_Zhong.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/13_Knijnenburg.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/14_Akbani.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/15_Wang.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/16_Poisson.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/17_Alaeimahabadi.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/18_Noushmehr.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/19_Pantazi.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/20_Shih.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/21_Stransky.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/22_Giordano.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/23_Davidsen.pdf

http://www.genome.gov/Multimedia/Slides/TCGA4/24_Gross.pdf
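
For anyone who would rather fetch the slides in a batch, here is a minimal sketch in R. The helper names slide_dest and fetch_slides and the folder name tcga_slides are my own assumptions, and many of these links may no longer resolve. Because the same file name occurs under more than one meeting (for example 04_Brat.pdf under both TCGA3 and TCGA4), the meeting folder is prefixed to the local file name so nothing gets overwritten.

```r
# Hypothetical helper: build a collision-free local path for each slide URL
slide_dest <- function(urls, dest_dir = "tcga_slides") {
  meeting <- basename(dirname(urls))          # e.g. "TCGA3"
  file.path(dest_dir, paste(meeting, basename(urls), sep = "_"))
}

# Hypothetical helper: download every URL; dry_run = TRUE only reports the paths
fetch_slides <- function(urls, dest_dir = "tcga_slides", dry_run = TRUE) {
  dest <- slide_dest(urls, dest_dir)
  if (!dry_run) {
    dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
    for (i in seq_along(urls)) {
      # try() keeps the loop going when a link is dead
      try(download.file(urls[i], dest[i], mode = "wb", quiet = TRUE), silent = TRUE)
    }
  }
  invisible(dest)
}

urls <- c(
  "http://www.genome.gov/Multimedia/Slides/TCGA3/04_Brat.pdf",
  "http://www.genome.gov/Multimedia/Slides/TCGA4/04_Brat.pdf"
)
dest <- fetch_slides(urls)   # dry run: nothing is downloaded
print(dest)
```

Calling fetch_slides(urls, dry_run = FALSE) with the full list of links performs the actual downloads.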

 

28

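Parallel computing in R

Earlier I mentioned a large computation that takes a long time to finish, so I used a progress bar to monitor it. A progress bar does nothing to speed up the computation, though, so parallel computing is needed, and I searched online again.

This is also provided by a package, and the workflow is a lot like MATLAB's.

library(parallel)

cl.cores <- detectCores() # detect the number of cores on this machine

cl <- makeCluster(cl.cores) # start one worker process per detected core

# now use clusterEvalQ or the par*apply family of functions to run work in parallel

stopCluster(cl)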

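The R documentation describes makeCluster as follows: "Creates a set of copies of R running in parallel and communicating over sockets." In other words, it launches several R worker processes at once. Once the function returns, the workers are already running and waiting for tasks, so the computer may feel a little sluggish, especially while a par* function is executing.

Some commonly used functions in a parallel setup are:

1. clusterEvalQ(cl,expr) evaluates expr on every worker of the cluster cl created above. expr is the expression to run; if the commands are long, it is usually better to put them in a file. For example, with the commands saved in Rcode.r: clusterEvalQ(cl,source(file="Rcode.r"))

2. The par*apply family. These functions are used almost exactly like the apply family, with one extra argument, cl. If cl was created as above with cl <- makeCluster(cl.cores), it can be passed directly, as in parApply(cl=cl,…). The same holds for parSapply, parLapply, and so on. Note that the first letter after par is capitalized (parApply, parSapply), whereas the ordinary apply functions start with a lowercase letter.

Also note that even after the cluster has been created, calling apply() instead of parApply() still runs serially. In other words, makeCluster only sets up workers that are ready to be used; nothing runs in parallel until a cluster-aware function sends work to them.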
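
To make the difference concrete, here is a minimal sketch that runs the same toy function once with sapply and once with parSapply. The function slow_square and the choice of two workers are assumptions for illustration only.

```r
library(parallel)

# Toy workload (hypothetical): pretend each element takes a little while
slow_square <- function(x) { Sys.sleep(0.01); x^2 }

cl <- makeCluster(2)                               # two workers; adjust to your machine
serial_res   <- sapply(1:20, slow_square)          # runs in this R session only
parallel_res <- parSapply(cl, 1:20, slow_square)   # work is sent to the workers
stopCluster(cl)                                    # always release the workers

identical(serial_res, parallel_res)                # both approaches give the same values
```

Both calls return the same values; only the parSapply call actually uses the workers that makeCluster started.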

Reference: http://www.r-bloggers.com/lang/chinese/1131

I then adapted this to parallelize my own task:

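#it did work very fast
library(parallel)
cl.cores <- detectCores()
cl <- makeCluster(cl.cores)
clusterExport(cl, "all_dat_t")  # this is the key step: the parallel call uses a user-defined function,
clusterExport(cl, "all_prob_id") # and that function needs these two objects, so they must be exported to the workers
prob_202723_s_at=parSapply( # parSapply does the parallel computation
cl=cl,  # cl is the cluster object created above (not the number of cores)
deviation_prob, # deviation_prob is the vector to process in parallel
test_pro # a user-defined function (not shown here) that checks each probe in deviation_prob
)
stopCluster(cl) # release the workers when finished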
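
Since test_pro and the real data are not shown, here is a self-contained sketch of the same pattern with toy stand-ins. The contents of all_dat_t and the body of test_pro below are assumptions, not my real objects. The point is only that a worker function reading a global object needs that object exported first.

```r
library(parallel)

set.seed(1)
# Toy stand-in (assumption): 50 samples x 6 probes
all_dat_t <- as.data.frame(matrix(rnorm(50 * 6), nrow = 50,
                                  dimnames = list(NULL, paste0("probe", 1:6))))
# Toy stand-in (assumption): a function that reads the global all_dat_t
test_pro <- function(p) sd(all_dat_t[[p]])

cl <- makeCluster(2)
# Copy all_dat_t to every worker; envir = environment() also works when
# this code is run inside a function rather than at top level
clusterExport(cl, "all_dat_t", envir = environment())
res <- parSapply(cl, names(all_dat_t), test_pro)
stopCluster(cl)

print(res)
```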

28

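Progress bars in R

I found a tutorial online on the spot and got this working after a quick read. All it takes is a package called tcltk; look up the help page for tkProgressBar to see how to use it.

Below is a small example from the web (it does nothing meaningful and is for illustration only):

library(tcltk) # tkProgressBar lives in tcltk
u <- 1:2000
plot.new()
pb <- tkProgressBar("Progress","Done %",  0, 100)
for(i in u) {
x<-rnorm(u)
points(x,x^2,col=i)
info <- sprintf("Done %d%%", round(i*100/length(u)))
setTkProgressBar(pb, i*100/length(u), sprintf("Progress (%s)", info), info)
}
close(pb) # close the progress bar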

The code below is my own implementation, modeled on the tutorial above.

[R]

# progress bar implementation
library(tcltk)
pb <- tkProgressBar("Progress","Done %", 0, 100)
prob_202723_s_at_value=rep(0,length(deviation_prob))
start_time=Sys.time() # time the run, since anything that needs a progress bar usually takes a long time
for (i in seq_along(deviation_prob)) {
tmp=test_pro(deviation_prob[i]) # test_pro is a user-defined function that checks whether the probe meets the criteria
if (length(tmp)!=0){prob_202723_s_at_value[i]=tmp}
info <- sprintf("Done %d%%", round(i*100/length(deviation_prob)))  # progress is derived from the loop index i
setTkProgressBar(pb, i*100/length(deviation_prob), sprintf("Progress (%s)", info), info)
}
close(pb) # close the progress bar
end_time=Sys.time()
print(end_time-start_time) # print() keeps the time units, which cat() would drop

[/R]
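
On a server without a graphical display, a Tk window may not be available. As an alternative to the approach above (a different technique, not the one I used), base R's utils package offers a text progress bar. A minimal sketch, with a placeholder computation standing in for the real per-probe work:

```r
values <- numeric(50)
pb <- txtProgressBar(min = 0, max = length(values), style = 3)  # style 3 shows a percentage
for (i in seq_along(values)) {
  values[i] <- mean(rnorm(1000))   # placeholder for the real per-probe work
  setTxtProgressBar(pb, i)         # advance the bar to step i
}
close(pb)
```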

28

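R: comparing the speed of extracting a column from a data frame

Conclusion: the three ways of extracting a column from a data frame differ greatly in time cost. Using the numeric index directly is fastest, the $ operator comes second, and indexing by column name is slowest.

In R I built an expression matrix whose columns are samples and whose rows are probes, with each value being the expression level measured for that probe in that sample. The object all_dat_t used below is apparently its transpose, so each probe is a column.

To compute the correlation between the expression vector of the probe "202723_s_at" and that of every other probe, I can extract the probe's column in one of the following ways: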

> system.time(apply(all_dat_t,2,function(x)  cor(all_dat_t$"202723_s_at",x)))

user  system elapsed

22.93    0.03   23.03

> system.time(apply(all_dat_t,2,function(x)  cor(all_dat_t[,"202723_s_at"],x)))

Timing stopped at: 92.02 5.32 97.66

This took too long, so I stopped it.

> system.time(apply(all_dat_t,2,function(x)  cor(all_dat_t[,grep(prob,names(all_dat_t))],x)))

Timing stopped at: 13.55 0.04 13.66

> prob_num=grep(prob,names(all_dat_t))

> system.time(apply(all_dat_t,2,function(x)  cor(all_dat_t[,prob_num],x)))

user  system elapsed

8.14    0.01    8.17

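As you can see, if I first grep the probe name to find its column number in the expression table, and then use that number to extract its expression values, the run is the fastest, and the improvement in time is substantial.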
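
Two further options follow from the same idea: extract the target column once, outside the loop, or skip the R-level loop entirely, since cor() accepts a matrix and a vector and correlates every column with that vector. A minimal sketch on toy data (the sizes and the object dat are assumptions, and no timings are claimed here):

```r
set.seed(1)
dat <- as.data.frame(matrix(rnorm(30 * 200), nrow = 30))   # 30 samples x 200 toy probes
target <- dat[[5]]                                          # extract the column once

loop_res <- apply(dat, 2, function(x) cor(target, x))       # lookup hoisted out of the loop
vec_res  <- cor(as.matrix(dat), target)[, 1]                # one call, no R-level loop

all.equal(loop_res, vec_res)                                # the two results agree
```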

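Next, let's look at the efficiency of the cor function itself.

The matrix has 54,675 variables, each with 189 observations. Computing correlation coefficients on subsets of the variables gives the following timings: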

> system.time(cor(all_dat_t[,1:10]))

user  system elapsed

0.001   0.000   0.001

> system.time(cor(all_dat_t[,1:100]))

user  system elapsed

0.003   0.000   0.003

> system.time(cor(all_dat_t[,1:1000]))

user  system elapsed

0.107   0.002   0.108

> system.time(cor(all_dat_t[,1:10000]))

user  system elapsed

11.102   0.849  11.983

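> system.time(cor(all_dat_t))  # this also finishes, in about six minutes

However, the correlation matrix that cor(all_dat_t) returns after those six minutes is about 32G, which is frightening!

Yet it clearly cannot have held that 32G correlation matrix entirely in RAM, since my machine has only 16G of memory. I still do not understand exactly how R handles this.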
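
As a rough cross-check on the size, assuming the result is stored as a dense matrix of doubles at 8 bytes per entry, the expected footprint can be worked out directly. This estimate comes to roughly 22 GiB, somewhat below the figure quoted above but still more than 16G of RAM:

```r
n_probes <- 54675
bytes <- n_probes^2 * 8      # dense double matrix: 8 bytes per entry
gib <- bytes / 2^30          # convert bytes to GiB
gib
```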

 

28

Recommended bioinformatics tutorial: a bioinformatics course from MSU

http://angus.readthedocs.org/en/2014/index.html

Next-Gen Sequence Analysis Workshop (2014)

This is the schedule for the 2014 MSU NGS course.

This workshop has a Workshop Code of Conduct.

Download all of these materials or visit the GitHub repository.

Day Schedule
Monday 8/4
Tuesday 8/5
Wed 8/6
Thursday 8/7
Friday 8/8
Saturday 8/9
Monday 8/11
Tuesday 8/12
Wed 8/13
Thursday 8/14
Friday 8/15

 

27

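Classifying samples by gene expression with ConsensusClusterPlus

All Bioconductor packages are installed the same way:
# biocLite() has been retired; current Bioconductor releases install through BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("ConsensusClusterPlus")

This is the simplest package I have come across. Once the input data is prepared, a single call runs it, and by default it writes out all the results.

Reading the package's README is enough to learn it.
All you need is an expression matrix of the samples to be classified. The expression matrix downloaded with the GEOquery package in the previous post can also be used.
The package uses the ALL data for its own tests, so you can load that dataset directly and obtain an expression matrix: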
library(ALL)
data(ALL)
d=exprs(ALL)
d[1:5,1:5]
The dataset looks like this:

> d[1:5,1:5]

             01005    01010    03002    04006    04007
1000_at   7.597323 7.479445 7.567593 7.384684 7.905312
1001_at   5.046194 4.932537 4.799294 4.922627 4.844565
1002_f_at 3.900466 4.208155 3.886169 4.206798 3.416923
1003_s_at 5.903856 6.169024 5.860459 6.116890 5.687997
1004_at   5.925260 5.912780 5.893209 6.170245 5.615210
> dim(d)
[1] 12625   128
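There are 128 samples in total, with data for 12,625 probes.
Some papers also use an RPKM matrix from RNA-seq as input.
For microarray expression data like this, we usually apply a simple normalization and then cluster using only the genes or probes that vary most across samples:
mads=apply(d,1,mad)  # median absolute deviation of each probe
d=d[rev(order(mads))[1:5000],]  # keep the 5000 most variable probes

d = sweep(d,1, apply(d,1,median,na.rm=TRUE))  # median-center each probe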
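
To see what these two steps do without loading the ALL data, here is a minimal sketch on a toy matrix. The matrix toy and the cutoff of 20 probes are assumptions for illustration. After sweep(), every remaining probe has a row median of zero.

```r
set.seed(42)
toy <- matrix(rnorm(100 * 8), nrow = 100,
              dimnames = list(paste0("probe", 1:100), paste0("s", 1:8)))

mads <- apply(toy, 1, mad)                                       # per-probe variability
top  <- toy[order(mads, decreasing = TRUE)[1:20], ]              # keep the 20 most variable probes
centered <- sweep(top, 1, apply(top, 1, median, na.rm = TRUE))   # subtract each row's median

range(apply(centered, 1, median))                                # row medians are now (numerically) zero
```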

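# d could also be normalized differently (e.g. DESeq's normalization for RNA-seq counts), depending on the data
library(ConsensusClusterPlus)
#title=tempdir() # usually changed to a directory of your own
title="./" # all images and data are written here
results = ConsensusClusterPlus(d,maxK=6,reps=50,pItem=0.8,pFeature=1,
 title=title,clusterAlg="hc",distance="pearson",seed=1262118388.71279,plot="png")
That's it. More than nine images will be written to the directory you specified.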

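The package manual explains what each output file is.
Many parameters need tuning. maxK=6 is normally chosen based on the experimental design; if your samples are expected to fall into more than six classes, raise maxK above 6.
results[[2]][["consensusClass"]] shows which class each sample was assigned to when the samples are split into 2 clusters, and likewise
results[[3]][["consensusClass"]]
results[[4]][["consensusClass"]] and so on.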