2016-TCGA data mining series: sex differences in cancer

ulwvfje — Wed, 18 May 2016 15:31:28 +0000

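This is one article in the TCGA data mining series. The work was led by Han Liang at MD Anderson Cancer Center and is a pure bioinformatics data analysis paper.

The paper is titled: Comprehensive characterization of molecular differences in cancer between male and female patients.

Why it matters: a patient's sex clearly matters for tumor initiation and spread. This is not only because many cancers are inherently sex-specific, such as ovarian cancer in women and prostate cancer in men. Even for cancer types that are not sex-specific, male and female patients differ considerably in tumor initiation, spread, and response to treatment. Yet earlier work on the underlying molecular mechanisms was limited and mostly focused on particular sex-related molecular patterns, such as EGFR mutations in female patients with non-small cell lung cancer. Those studies were restricted to a single gene, a single data type, or a single cancer type, so a comprehensive, systematic analysis of sex differences in cancer patients was badly lacking. The TCGA database made such a study possible, which is why this paper came about.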

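Data used for the mining:

As the table shows, 13 cancer types are involved and all six TCGA data types are used. Since this is 2016, the data are fairly complete.

The patients' clinical information is also integrated into the analysis; the sample counts and the distribution across cancer types are given in the table below.

The six data types are:

somatic mutation data from whole-exome sequencing,

copy number variation data from the Affymetrix SNP 6.0 array,

DNA methylation data from the Human Methylation 450K array,

mRNA expression data from RNA-seq,

miRNA expression data,

protein expression data.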

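The paper analyzes these data in six ways:

First: reweighting the samples to correct for confounders

This part leans toward statistics, and readers can look up the theory themselves. The goal is to remove the influence of factors other than sex. The variables listed are sex, age at diagnosis, smoking status, tumor stage, and histology subtype, with sex as the comparison of interest and the rest as confounders. The correction uses a statistical method called the propensity score, which was proposed in the 1980s and is widely used in clinical research, economics, and social sciences.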
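To make the idea concrete, here is a minimal sketch of propensity score weighting. It is not the paper's procedure: it assumes a single categorical confounder (a hypothetical `stratum` field such as smoking status), estimates the propensity score by simple frequency counting instead of a regression model, and uses plain inverse probability weights. All names and the toy data are made up for illustration.

```python
from collections import Counter

def propensity_by_stratum(samples):
    """Estimate P(female | stratum) by frequency counting (assumes both sexes occur in every stratum)."""
    total = Counter(s["stratum"] for s in samples)
    female = Counter(s["stratum"] for s in samples if s["sex"] == "F")
    return {k: female[k] / total[k] for k in total}

def ipw_weights(samples):
    """Inverse probability weights: 1/e for females, 1/(1-e) for males."""
    e = propensity_by_stratum(samples)
    weights = []
    for s in samples:
        p = e[s["stratum"]]
        weights.append(1.0 / p if s["sex"] == "F" else 1.0 / (1.0 - p))
    return weights

def weighted_share(samples, weights, sex, stratum):
    """Weighted fraction of one sex that falls in the given stratum."""
    num = sum(w for s, w in zip(samples, weights)
              if s["sex"] == sex and s["stratum"] == stratum)
    den = sum(w for s, w in zip(samples, weights) if s["sex"] == sex)
    return num / den

# Toy cohort (hypothetical): smoking is unevenly distributed between the sexes.
samples = (
    [{"sex": "F", "stratum": "smoker"}] * 2
    + [{"sex": "F", "stratum": "nonsmoker"}] * 6
    + [{"sex": "M", "stratum": "smoker"}] * 6
    + [{"sex": "M", "stratum": "nonsmoker"}] * 2
)
weights = ipw_weights(samples)
share_f = weighted_share(samples, weights, "F", "smoker")
share_m = weighted_share(samples, weights, "M", "smoker")
```

After weighting, the share of smokers is the same in both sexes, so a downstream male versus female comparison is no longer driven by smoking in this toy cohort.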

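Second: combining the six data types to split cancers into two groups by how strongly sex affects them

One group is weakly affected by sex: LGG, GBM, COAD, READ, and LAML

The other group is strongly affected by sex: THCA, HNSC, LUSC, LUAD, LIHC, BLCA, KIRP, and KIRC

The authors also propose a sex-bias index to describe the difference, defined on the basis of the ratio of new cases of female and male patients.
In the weakly sex-affected cancer types, male and female patients show few differential features (44–104, mean 67),

whereas in the strongly sex-affected cancer types they show many (240–3,521, mean 1,112).

The figure below shows that the two groups differ very significantly. The definition of a differential feature is an important concept: it differs for each of the six data types, as described below.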

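Third: analyzing the somatic mutation data alone

The authors downloaded the MAF mutation data for all samples listed above directly from Firehose (http://gdac.broadinstitute.org). The MAF files recorded by TCGA are generally already-processed somatic mutation calls. The authors analyzed only non-silent mutations, considered only loci with a mutation frequency above 5% (within this paper's cohort), removed individuals with more than 1,000 somatic mutations, and used Fisher's exact test to assess the significance of male versus female differences.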
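The male versus female comparison can be sketched as a 2x2 Fisher's exact test. The version below is a minimal pure-Python sketch, not the authors' code; the function names and the example counts are made up for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def sex_difference_pvalue(mutated_male, n_male, mutated_female, n_female):
    """Compare the fraction of mutated samples between male and female patients."""
    return fisher_exact_two_sided(
        mutated_male, n_male - mutated_male,
        mutated_female, n_female - mutated_female,
    )

# Hypothetical gene: mutated in 30 of 100 male and 10 of 100 female patients.
p_value = sex_difference_pvalue(30, 100, 10, 100)
```

In a real analysis the filters described above (non-silent only, frequency above 5%, hypermutated samples removed) would be applied before building the table, and the p-values would need multiple-testing correction.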

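The authors then interpret the biology behind the resulting figure: for example, which genes differ significantly between male and female patients in which cancer types, what those genes do, and what the likely causes are.

Fourth: analyzing the somatic CNV data alone

This analysis is also straightforward. The CNV data for all samples listed above were again downloaded directly from Firehose (http://gdac.broadinstitute.org), and GISTIC was run separately on male and female patients within each cancer type to obtain the somatic copy number alterations. GISTIC is MATLAB-based; my blog has a detailed introduction to using it.

The GISTIC results, including focal and arm-level amplifications/deletions, are then interpreted biologically: which genes and which pathways matter is described in detail. This calls for deep biological background knowledge on the authors' part rather than data analysis skill.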

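Fifth: a joint analysis of four expression-type datasets

After the mutation data comes the expression data. The authors analyzed four data types together: DNA methylation levels at methylation sites, mRNA expression, miRNA expression, and protein expression. The first two were downloaded from the TCGA data portal and the last two from Firehose.

For mRNA, expression values are based on RSEM, and GSEA was also run as part of the differential expression analysis.

miRNA regulation was studied as well, using miRTarBase to validate miRNA targets, or TargetScan, miRanda, and miRDB to predict them.

Expression data are usually visualized as heatmaps, followed by a focus on a few pathways: why are they so relevant to cancer, why do they differ so much between men and women, and so on.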

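Sixth: exploring clinical indicators and potential drug effects based on the authors' own grouping

This is one of the more novel parts of the paper. The authors took the genes targeted by FDA-approved cancer drugs and intersected them with the genes showing sex differences.

This matters a great deal, because drug treatment for cancer patients currently does not take sex into account, whereas this analysis shows that sex differences among cancer patients are substantial and should be considered for better treatment. For example, SRC is expressed significantly higher in female than in male HNSC patients.

The impressive figure below sums it all up, but truly understanding it takes more than a day or two.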

Analyses you can do with the CCLE database

ulwvfje — Mon, 11 Jan 2016 11:26:21 +0000

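Expression, copy number variation, and mutation data have been collected for so many cancer cell lines that it would be a shame to let them gather dust!

There are many ways to use these data, yet a Google search turns up relatively few papers citing the resource. I picked a few and walked through how others used the data, mainly the mRNA expression data, of course!

Paper 1: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0111146

This paper processes the CCLE data in eight steps; a competent bioinformatics analyst could fully reproduce the procedure.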

step1:Affymetrix U133 Plus2 DNA microarray gene expressions of 27 gastric cancer cell lines (Kato-III, IM95, SNU-620, SNU-16, OCUM-1, NUGC-4, 2313287, HUG1N, MKN45, NCIN87, KE39, AGS, SNU-5, SNU-216, NUGC-3, NUGC-2, MKN74, MKN7, RERFGC1B, GCIY, KE97, Fu97, SH10TC, MKN1, SNU-1, Hs746 T, HGC27) were downloaded from Cancer Cell Line Encyclopedia (CCLE) [16] in March 2013.

step2: Robust Multi-array Average (RMA) normalization was performed. Principal component analysis plot show no obvious batch effect.

step3: The normalized data is then collapsed by taking the probe sets with highest gene expression.

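The first three steps produce the mRNA expression matrix for the 27 gastric cancer cell lines: download the CEL files, normalize with RMA, and for genes with multiple probe sets keep the probe set with the highest expression!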
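The collapsing step can be sketched as follows. This is a minimal illustration that assumes "highest expression" means the highest mean across cell lines, which the paper does not spell out; the function name, the probe IDs, and the values are hypothetical.

```python
def collapse_probes(expr, probe_to_gene):
    """Keep, for each gene, the probe set with the highest mean expression.

    expr: dict mapping probe set ID -> list of normalized values (one per cell line)
    probe_to_gene: dict mapping probe set ID -> gene symbol
    """
    best = {}
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # skip probe sets without a gene annotation
        mean = sum(values) / len(values)
        if gene not in best or mean > best[gene][0]:
            best[gene] = (mean, values)
    return {gene: values for gene, (_, values) in best.items()}

# Toy RMA-normalized matrix with two probe sets for GENE_A (made-up values).
expr = {
    "probe_1": [5.1, 5.3, 5.0],
    "probe_2": [8.2, 8.0, 8.4],
    "probe_3": [6.5, 6.7, 6.6],
}
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
gene_matrix = collapse_probes(expr, probe_to_gene)
```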

step4:Unsupervised hierarchical clustering (1-Spearman distance, average linkage) was performed on the cell lines using the aCGH data.

Putative driver genes of which copy number aberrations correlated to mRNA gene expression were identified to determine subtypes or clusters that are driven by different mechanisms. This was done using Mann Whitney U-test with p<0.05, and Spearman Correlation Coefficient test with Rho >0.6.

step5:We then performed consensus clustering[17] on the gene expression data of the 27 gastric cancer cell lines from CCLE using these putative driver genes. We selected k = 2 as it gives sufficiently stable similarity matrix.

step6: In order to assign new samples to this integrative cluster, significance analysis of microarray (SAM) [18]with threshold q<2.0 was used to generate subtype signature based on the mRNA expression data of the 1762 genes from the 27 gastric cancer cell lines in CCLE.

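First the cell lines are clustered on the aCGH (copy number) data and putative driver genes are identified; the expression data for these genes are then used for a second round of clustering into two groups, and SAM is run on the two groups to derive a subtype signature of differential genes.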
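The correlation filter in step 4 (Spearman rho > 0.6 between copy number and mRNA expression) can be sketched in pure Python. This is an illustration, not the paper's code: the function names and the toy values are made up, and the Mann Whitney U-test part of the filter is left out.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_rho(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def is_putative_driver(copy_number, expression, rho_cutoff=0.6):
    """Flag a gene whose expression tracks its copy number across cell lines."""
    return spearman_rho(copy_number, expression) > rho_cutoff

# Hypothetical gene measured in five cell lines.
copy_number = [2.0, 2.1, 3.5, 4.0, 6.2]
expression = [7.1, 7.4, 8.9, 9.3, 11.0]
rho = spearman_rho(copy_number, expression)
```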

step7:ssGSEA (single sample GSEA)was used to estimate pathway activities of the gastric cancer cell line in the Molecular Signature Database v3.1 (Msigdb v3.1) [19], [20]. The pathway activities are represented in enrichment scores which were rank normalized to [0.0, 1.0].

step8:SAM analysis was performed with threshold q<0.2, and fold change >2.0 (for up-regulated pathways), or <0.5 (for down-regulated pathways) to obtain subtype-specific pathways from the 27 gastric cell lines in CCLE.

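Both gene set enrichment analysis and hypergeometric enrichment analysis are used here; see the paper for the results!

Paper 2: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0081803#pone.0081803.s001

This paper uses CCLE for just one thing: a boxplot of one gene's expression across different cancer types.

The data and the sample annotations for this figure can both be obtained with GEOquery, so the figure below is very easy to make.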

Further, the Cancer Cell Line Encyclopedia (CCLE) database demonstrated that of 1062 cell lines representing 37 distinct cancer types, glioma cell lines express the highest levels of STK17A

The conclusion: STK17A is highly expressed in glioma cell lines compared to other cancer types. Data was obtained through the Cancer Cell Line Encyclopedia (CCLE).

Paper 3: http://www.nature.com/ncomms/2013/130709/ncomms3126/fig_tab/ncomms3126_F4.html

This paper is even simpler: it clusters the expression matrix directly:

Evaluating cell lines as tumour models by comparison of genomic profiles

The 5,000 most variable genes were used for unsupervised clustering of cell lines by mRNA expression data. Cell lines are colour-coded (vertical bars) according to the reported tissue of origin (a PDF version that can be enlarged at high resolution is in Supplementary Information, Supplementary Fig. S4); horizontal labels at bottom indicate the dominating tissue types within the respective branches of the dendrogram. Most ovarian cancer cell lines (magenta) cluster together, interspersed with endometrial cell lines. However, some ovarian cancer cell lines cluster with other tissue types (*). Top right panels: neighbourhoods (1) of the top cell lines in our analysis, (2) of cell line IGROV1, and (3) of cell line A2780. For the ovarian cancer cell lines in these enlarged areas, the histological subtype as assigned in the original publication is indicated by coloured letters.

Just take the whole expression matrix, select the 5,000 most variable genes, and cluster on them to get a similar figure.
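Selecting the most variable genes before clustering can be sketched like this. The sketch assumes variance across cell lines as the measure of variability, which is only one possible choice; the function name and the toy matrix are made up.

```python
from statistics import pvariance

def top_variable_genes(expr, n):
    """Return the n gene names with the largest variance across samples.

    expr: dict mapping gene symbol -> list of expression values (one per cell line)
    """
    ranked = sorted(expr, key=lambda gene: pvariance(expr[gene]), reverse=True)
    return ranked[:n]

# Toy matrix (made-up values): GENE_B varies most, GENE_C is flat.
expr = {
    "GENE_A": [5.0, 6.0, 7.0, 6.0],
    "GENE_B": [2.0, 9.0, 3.0, 10.0],
    "GENE_C": [4.0, 4.0, 4.0, 4.0],
}
selected = top_variable_genes(expr, 2)
```

For the real data `n` would be 5,000, and the reduced matrix would then go into a hierarchical clustering routine.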

A roundup of software for finding somatic mutations

ulwvfje — Tue, 05 Jan 2016 12:04:31 +0000

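Somatic mutations are actually easy to understand: you sequence normal tissue and tumor tissue from the same person, then compare the SNP sites called from the two samples.

SNP sites found only in the tumor data are somatic mutations, and sites present in both samples are germline variants; most people study the somatic ones.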
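In its simplest form this is just a set difference. The sketch below shows only that naive idea; the function name and the variant tuples are made up, and real callers do much more than this.

```python
def split_somatic_germline(tumor_calls, normal_calls):
    """Naive classification: tumor-only sites are somatic, shared sites are germline.

    Each call is a (chrom, pos, ref, alt) tuple.
    """
    tumor, normal = set(tumor_calls), set(normal_calls)
    somatic = sorted(tumor - normal)
    germline = sorted(tumor & normal)
    return somatic, germline

# Hypothetical calls from a matched tumor-normal pair.
tumor_calls = [
    ("chr1", 100, "A", "T"),
    ("chr2", 200, "G", "C"),
    ("chr17", 300, "C", "T"),
]
normal_calls = [("chr2", 200, "G", "C")]
somatic, germline = split_somatic_germline(tumor_calls, normal_calls)
```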

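In theory it is that simple, but all those statisticians have to eat, so of course things had to get complicated, hence the large number of somatic mutation callers. Just kidding: the real reason is that we are not sequencing single cells, and the normal and tumor tissue we obtain are generally not pure, which is why this issue gets so much discussion.

I happened to see a post collecting most of the better-known somatic mutation callers; I have only used MuTect and VarScan myself.

Source: https://www.biostars.org/p/19104/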

Here are a few more, a summary of the other answers, and updated links:

deepSNV (abstract) (paper)
EBCall (abstract) (paper)
GATK SomaticIndelDetector (note: only available after an annoying sign-up and login)
Isaac variant caller (abstract) (paper)
joint-snv-mix (abstract) (paper)
LoFreq (abstract) (paper) (call on tumor & normal separately and then use a filter to derive somatic events)
MutationSeq (abstract) (paper)
MuTect (abstract) (paper) (note: only available after an annoying sign-up and login)
QuadGT (for calling single-nucleotide variants in four sequenced genomes comprising a normal-tumor pair and the two parents)
samtools mpileup - by piping BCF format output from this to bcftools view and using the '-T pair' option
Seurat (abstract) (paper)
Shimmer (abstract) (paper)
SolSNP (call on tumor & normal separately and then compare to identify somatic events)
SNVMix (abstract) (paper)
SOAPsnv
SomaticCall (manual)
SomaticSniper (abstract) (paper)
Strelka (abstract) (paper)
VarScan2 (abstract) (paper)
Virmid (abstract) (paper)

For a much more general discussion of variant calling (not necessarily somatic or limited to SNVs/InDels) check out this thread: What Methods Do You Use For In/Del/Snp Calling?

Some papers describing comparisons of these callers:

The ICGC-TCGA DREAM Mutation Calling challenge has a component on somatic SNV calling.

This paper used validation data to compare popular somatic SNV callers:

Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers

You'll need to update the link to MuTect. Broad Institute has begun to put portable versions of their tools on Github, like the latest release of MuTect. The Genome Institute at WashU has been using Github for a while, but portable versions of their tools can be found here and here.

Somatic calling is in fact far more complicated than we imagine:

To rehash/expand on what Dan said, if you're sequencing normal tissue, you generally expect to see single-nucleotide variant sites fall into one of three bins: 0%, 50%, or 100%, depending on whether they're heterozygous or homozygous.

With tumors, you have to deal with a whole host of other factors:

Normal admixture in the tumor sample: lowers variant allele fraction (VAF)
Tumor admixture in the normal - this occurs when adjacent normals are used, or in hematological cancers, when there is some blood in the skin normal sample
Subclonal variants, which may occur in any fraction of the cells, meaning that your het-site VAF might be anywhere from 50% down to sub-1%, depending on the tumor's clonal architecture and the sensitivity of your method
Copy number variants, cn-neutral loss of heterozygosity, or ploidy changes, all of which again shift the expected distribution of variant fractions

These, and other factors, make calling somatic variants difficult and still an area that is being heavily researched. If someone tells you that somatic variant calling is a solved problem, they probably have never tried to call somatic variants.
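The effect of these factors on the variant allele fraction can be illustrated with a standard back-of-the-envelope formula. This is a sketch under simplifying assumptions (diploid normal cells, one uniform tumor copy number at the locus, no sequencing noise); the function and parameter names are made up.

```python
def expected_vaf(purity, ccf=1.0, mutated_copies=1, tumor_copy_number=2):
    """Expected variant allele fraction of a somatic variant in a bulk sample.

    purity: fraction of cells in the sample that are tumor cells
    ccf: fraction of tumor cells carrying the variant (1.0 means clonal)
    mutated_copies: number of mutated copies per carrying cell
    tumor_copy_number: total copies of the locus in tumor cells
    """
    tumor_alleles = purity * tumor_copy_number
    normal_alleles = (1.0 - purity) * 2  # normal cells assumed diploid
    return purity * ccf * mutated_copies / (tumor_alleles + normal_alleles)

pure_clonal = expected_vaf(1.0)             # the textbook heterozygous case
admixed = expected_vaf(0.6)                 # normal admixture lowers the VAF
subclonal = expected_vaf(0.6, ccf=0.5)      # only half of tumor cells carry it
cn_neutral_loh = expected_vaf(1.0, mutated_copies=2)
```

A clonal heterozygous variant in a 60% pure tumor is expected near 30% rather than 50%, and a subclonal one sits lower still, which is why germline-style genotype thresholds fail on tumor data.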

Sounds like somatic / tumor variant calling is something that will be solved by improvements at the wet lab side ( single cell selection / amplification / sequencing ) . Rather than at the computational side.

Well, single cell has a role to play (and would have more of one if WGA wasn't so lossy), but realistically, you can't sequence billions of cells from a tumor individually. Bulk sequencing still is going to have a role for quite a while.

Hell germ line calling isn't even a solved problem. Still get lots of false positives (and false negatives). It just tends to work so well that it is hard to improve it much except by making it faster, less memory intensive, etc

Solved was the wrong word. I just meant improved. There is only so much you can do at the computational side. Wet lab also has its part to play.

A germline variant caller generally has a ploidy-based genotyping algorithm built in to part of the algorithm/pipeline. I believe, IIRC, the GATK UnifiedGenotyper for instance does both variant calling and then genotype calling. So to call a genotype for a variant it is expecting a certain number of reads to support the alternative allele. When working with somatic variants all of the assumptions about how many reads you expect with a variant at a position to distinguish between true and false positives are no longer valid. Except for fixed mutations throughout the tumor population only some proportion of cells will hold a somatic variation. You also typically have some contamination from normal non-cancerous cells. Add in complications from significant genomic instability with lots of copy number variations and such and you have a need for a major change in your model for calling variation while minimizing artifactual calls. So you have a host of other programs that have been developed specifically for looking at somatic variation in tumor samples.

A paper:

Comparison of somatic mutation calling methods in amplicon and whole exome sequence data

It was published by QIAGEN.

High-throughput sequencing is rapidly becoming common practice in clinical diagnosis and cancer research. Many algorithms have been developed for somatic single nucleotide variant (SNV) detection in matched tumor-normal DNA sequencing. Although numerous studies have compared the performance of various algorithms on exome data, there has not yet been a systematic evaluation using PCR-enriched amplicon data with a range of variant allele fractions. The recently developed gold standard variant set for the reference individual NA12878 by the NIST-led “Genome in a Bottle” Consortium (NIST-GIAB) provides a good resource to evaluate admixtures with various SNV fractions.

Using the NIST-GIAB gold standard, we compared the performance of five popular somatic SNV calling algorithms (GATK UnifiedGenotyper followed by simple subtraction, MuTect, Strelka, SomaticSniper and VarScan2) for matched tumor-normal amplicon and exome sequencing data.

Nevertheless, detecting somatic mutations is still challenging, especially for low-allelic-fraction variants caused by tumor heterogeneity, copy number alteration, and sample degradation

We used QIAGEN’s GeneRead DNAseq Comprehensive Cancer Gene Panel (CCP, Version 1) for enrichment and library construction in triplicate.

QIAGEN’s GeneRead DNAseq Comprehensive Cancer Gene Panel (Version 1) was used to amplify the target region of interest (124 genes, 800 Kb).

When analyzing different types of data, use of different algorithms may be appropriate.

DNA samples of NA12878 and NA19129 were purchased from Coriell Institute. Sample mixtures were created based on the actual amplifiable DNA in each sample, resulting in 0%, 8%, 16%, 36%, and 100% of NA12878 sample mixed in the NA19129 sample, respectively.We treated the mixed samples at 8%, 16%, 36%, and 100% as the virtual tumor samples and the 0% as the virtual normal sample.

The algorithms of the five callers are:

1. NaiveSubtract — SNVs were called separately from virtual tumor and normal samples using GATK UnifiedGenotyper [22]. For exome sequencing data, reads were already mapped, locally realigned and recalibrated by the 1,000 Genomes Project. So SNVs were directly called on the BAM files using GATK Unified Genotyper. Then, SNVs detected in the virtual normal sample were removed from the list of SNVs detected in the virtual tumor sample, leaving the “somatic” SNVs.

2. MuTect — MuTect is a method developed for detecting the most likely somatic point mutations in NGS data using a Bayesian classifier approach. The method includes pre-processing aligned reads separately in tumor and normal samples and post-processing resulting variants by applying an additional set of filters. We ran MuTect under the High-Confidence mode with its default parameter settings. We disabled the “Clustered position” filter and the “dbSNP filter” for the amplicon sequencing reads, and we disabled the “dbSNP filter” for the exome sequencing.

3. SomaticSniper — SomaticSniper calculates the Bayesian posterior probability of each possible joint genotype across the normal and cancer samples. We tuned the software’s parameters to increase sensitivity and then filtered raw results using a Somatic Score cut-off of 20 to improve specificity.

4. Strelka — Strelka reports the most likely genotype for tumor and normal samples based on a Bayesian probability model. Post-calling filters built into the software are based on factors such as read depth, mismatches, and overlap with indels. We skipped depth filtration for exome and amplicon sequencing data as recommended by the Strelka authors. For the amplicon sequencing reads, we set the minimum MAPQ score at 17 for consistency with the defaults in GATK UnifiedGenotyper. We used variants passing Strelka post-calling filters for analysis.

5. VarScan2 — VarScan2 performs analyses independently on pileup files from the tumor and normal samples to heuristically call a genotype at positions achieving certain thresholds of coverage and quality. Then, sites of the genotypes not matched in tumor and normal samples are classified into somatic, germline, or ambiguous groups using Fisher’s exact test. We generated the pileup files using SAMtools mpileup command.

The compatibility of the output VCF files between different methods as well as the NIST-GIAB gold standard was examined using bcbio.variation tools and manual inspection. The reported SNP call representations between files are comparable to each other.

From the paper: http://www.biomedcentral.com/1471-2164/15/244

Annotating mutations with Oncotator

ulwvfje — Tue, 05 Jan 2016 11:51:53 +0000

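Function: further annotate VCF-format mutation data into MAF format

Anyone who has analyzed cancer data knows that TCGA records mutations in MAF format! So where do MAF files come from? After SNP calling you normally get a VCF file of mutation records; annotating it with ANNOVAR or some other tool that predicts effects on protein structure and function still falls far short of the nearly 100 annotation fields in a MAF file.

The well-known Broad Institute produces MAF-format mutation annotation files by annotating the VCF mutation records against a dozen or so well-known databases. Usually only somatic mutations are annotated into MAF format!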

The Broad Institute's mutation annotation tool is described in: http://www.ncbi.nlm.nih.gov/pubmed/25703262

Source code on GitHub: https://github.com/broadinstitute/oncotator

Official site: https://www.broadinstitute.org/oncotator/

Manual: http://gatkforums.broadinstitute.org/gatk/discussion/4154/howto-install-and-run-oncotator-for-the-first-time

You need to download 14 GB of data in advance: http://www.broadinstitute.org/~lichtens/oncobeta/oncotator_v1_ds_Jan262015.tar.gz

The v1.8.0.0 release can be downloaded from GitHub: https://github.com/broadinstitute/oncotator/archive/v1.8.0.0.tar.gz

It is also available as an online tool.

Input data guide: https://www.broadinstitute.org/oncotator/help/#inputformat

It integrates all of the annotation resources listed below

and also provides an API.

Genomic Annotations

Gene, transcript, and functional consequence annotations using GENCODE for hg19.
Reference sequence around a variant.
GC content around a variant.
Human DNA Repair Gene annotations from Wood et al.

Protein Annotations

Site-specific protein annotations from UniProt.
Functional impact predictions from dbNSFP.

Cancer Variant Annotations

Observed cancer mutation frequency annotations from COSMIC.
Cancer gene and mutation annotations from the Cancer Gene Census.
Overlapping mutations from the Cancer Cell Line Encyclopedia.
Cancer gene annotations from the Familial Cancer Database.
Cancer variant annotations from ClinVar.

Non-Cancer Variant Annotations

Common SNP annotations from dbSNP.
Variant annotations from 1000 Genomes.
Variant annotations from NHLBI GO Exome Sequencing Project (ESP).

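Because there is a lot of data to download, I will not test it on my own computer here; the installation itself is straightforward!

2012-Distinct biological significance of the three LAD subtypes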

ulwvfje — Tue, 01 Sep 2015 03:49:34 +0000

Paper title: Differential Pathogenesis of Lung Adenocarcinoma Subtypes Involving Sequence Mutations, Copy Number, Chromosomal Instability, and Methylation
Lung adenocarcinoma (LAD) is genetically very heterogeneous.
It can be divided into three molecular subtypes: Bronchioid, Magnoid, and Squamoid.
The authors analyzed the following four features across the three subtypes and found highly significant differences between subtypes.
1. Gene mutation rates (EGFR, KRAS, STK11, TP53)
2. Chromosomal instability
3. Regional copy number
4. Genome-wide DNA methylation
Three clinical features also differ significantly.
1. Patient overall survival
2. Cisplatin plus vinorelbine therapy response
3. Predicted gefitinib sensitivity
So the classification works very well and is clinically very useful.
The data used to study LAD include
1. DNA copy number
2. Gene sequence mutation
3. DNA methylation
4. Gene expression
Even a gene like TP53 is mutated in only 35% of LAD, so LAD should be subdivided further. Mutations such as EGFR and KRAS mutations are highly informative for treatment, and finer subtyping helps in choosing targeted clinical treatment.
The authors selected 116 LAD samples and analyzed (1) genome-wide gene expression, (2) genome-wide DNA copy number, (3) genome-wide DNA methylation, and (4) selected gene sequence mutations.
The conclusion: LAD molecular subtypes correlate with grossly distinct genomic alterations and patient therapy response
The data sources are as follows:
Gene expression --> Agilent 44 K microarrays.
DNA copy number --> Affymetrix 250 K Sty and SNP6 microarrays.
DNA methylation --> MSNP microarray assay.
DNA from EGFR, KRAS, STK11 and TP53 exons --> ABI sequencers
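The authors used the R package ConsensusClusterPlus to classify LAD into molecular subtypes based on gene expression.
Clustering with ConsensusClusterPlus used the top 25% most variable genes (3,045 genes); a nearest centroid subtype predictor was then built from 506 genes.

The genes overexpressed in the three LAD subtypes take part in different biological functions: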
Bronchioid – excretion genes, asthma genes, and surfactants (SFTPB, SFTPC, SFTPD);
Magnoid – DNA repair genes, such as thymine-DNA glycosylase (TDG);
Squamoid – defense response genes, such as chemokine ligand 10 (CXCL10)
The subtypes also correspond to different clinical characteristics:
Bronchioid had the most females, nonsmokers, early stage tumors, and low grade tumors, the greatest acinar content, the least necrosis, and the least invasion.
Squamoid had the most high grade tumors, the greatest solid content, and the lowest papillary content.
Magnoid had the most smokers and the heaviest smokers by pack years.
Their gene mutation patterns also differ considerably:
Bronchioid had the greatest EGFR mutation frequency
Magnoid had the greatest mutation frequencies in TP53, KRAS and STK11.
To study differences in genome-wide mutation rates between the subtypes, the authors also examined a large set of rarely mutated genes (n = 623) from the Ding et al. cohort.

Conclusions:
The Bronchioid subtype is more likely to benefit from EGFR inhibitory therapy.
Magnoid tumors also have severe genomic alterations including the greatest CIN, the most regional CN alterations, DNA hypermethylation, and the greatest genomewide mutation rate.
the Squamoid subtype displayed the fewest distinctive alterations that included only regional CN alterations

2013-science-3205tumors-12types-4-ways-find-291HCD

ulwvfje — Mon, 31 Aug 2015 15:16:51 +0000

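Paper title: Comprehensive identification of mutational cancer driver genes across 12 tumor types

This paper compares four approaches to finding cancer driver genes and derives a combined, reliable list of 291 HCDs (high-confidence drivers).

The data come from 3,205 tumor samples covering 12 cancer types.

The Cancer Gene Census (CGC) database already contains close to 500 cancer genes.

Cancer genome studies can turn up tens of thousands of somatic mutations, but only a small fraction of them drive tumor initiation and progression.

Moreover, most driver genes are mutated at low frequency, and because of tumor heterogeneity, studies with large numbers of samples are necessary.

The four mainstream approaches to finding cancer driver genes are: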

1、Most common methods identify genes that are mutated more frequently than expected from the background mutation rate (recurrence)

2、Other methods - a bias towards the accumulation of functional mutations (FM bias)

3、other methods exploit the tendency to sustain mutations in certain regions of the protein sequence (CLUST bias)

4、other approaches exploit the overrepresentation of mutations in specific functional residues, such as phosphorylation sites (ACTIVE bias)

Their representative tools are MuSiC, OncodriveFM, OncodriveCLUST, and ActiveDriver, respectively.

This paper compares the four approaches and combines their results.

In summary, we provide a very reliable list of 291 HCDs and a second one, of 144 CDs, more comprehensive but with an expectedly higher false-positives rate

One hundred and sixty-five of these candidates are novel findings not included in the CGC.

The authors then ran a functional analysis of the 291 HCDs, which fall mainly into the following five biological functions:

Chromatin remodeling,

mRNA processing,

Cell signaling/proliferation,

Cell adhesion,

DNA repair/Cell cycle

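The 291 HCDs obtained by combining the four approaches are then compared against the close to 500 cancer genes already in the Cancer Gene Census (CGC) database.

This paper is the first to show that multiple driver-detection approaches can be combined. The combination rests on two facts:

1. There is no gold standard for any method of finding cancer driver genes, so combining several methods is more comprehensive.

2. Combining several methods allows a better assessment of the accuracy of the driver genes found.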

2014-4742samples-21tumors-Cancer5000-set-254-genes

ulwvfje — Mon, 31 Aug 2015 14:27:17 +0000

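Paper title: Discovery and saturation analysis of cancer genes across 21 tumour types.

When many samples of a cancer type are studied, few genes are mutated in as many as 20% of samples; most are mutated at intermediate frequencies (2–20%), and many low-frequency mutations go undetected because too few samples have been studied.

The authors studied somatic point mutations in exome sequencing data from 4,742 tumor-normal pairs across 21 cancer types.

Cancer genes may concentrate in the following seven functions: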

proliferation,

apoptosis,

genome stability,

chromatin regulation,

immune evasion,

RNA processing

protein homeostasis

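Using resampling (down-sampling) statistics, the authors conclude that studying as many as 500-6000 samples per cancer type would reveal many more clinically relevant low-frequency mutations.

The paper sets out to answer three questions:

1. Can large-scale cancer studies identify all cancer driver genes? (Coverage of known cancer genes)

2. Will increasing the sample size reveal many more cancer driver genes? (Analysis of novel candidate cancer genes)

3. How far are we from a complete catalogue of cancer driver genes? (Saturation analysis)

The mutation data were processed with Broad’s stringent filtering and annotation pipeline.

The mutation counts are as follows: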

3,078,483 somatic single nucleotide variations(SSNVs),

77,270 small insertions and deletions (SINDELs)

29,837 somatic di-, tri- or oligonucleotide variations (DNVs, TNVs and ONVs, respectively)

an average of 672 per tumour–normal pair

These include:

540,831 missense,

207,144 synonymous,

46,264 nonsense,

33,637 splice-site

2,294,935 non-coding mutations

The method used to find driver genes:

We used the most recent version of the MutSig suite of tools, which looks for three independent signals: high mutational burden relative to background expectation, accounting for heterogeneity; clustering of mutations within the gene; enrichment of mutations in evolutionarily conserved sites.

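The p-values from these independent MutSig components are combined to call driver genes. Each cancer type was analyzed separately, and all 21 cancer types were also analyzed together.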
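The post does not say how the p-values are combined. As one common way to combine independent p-values, here is a sketch of Fisher's method; treating it as a stand-in for what MutSig does is an assumption, and the function name and the example p-values are made up.

```python
from math import exp, log

def fisher_combine(pvalues):
    """Combine independent p-values (each in (0, 1]) with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null; for even degrees of freedom the
    survival function has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(pvalues)
    half = -sum(log(p) for p in pvalues)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return exp(-half) * total

# Hypothetical per-gene p-values for burden, clustering, and conservation.
combined = fisher_combine([0.04, 0.20, 0.01])
```

Three moderately small p-values combine into a stronger signal than any one of them alone, which is the point of looking for several independent signals per gene.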

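The driver gene results:

Analyzing each cancer type separately yields 334 significant gene/tumour-type pairs in total; the genes found in different cancer types overlap, of course.

These 334 pairs involve 224 distinct genes.

The number of genes detected per tumour type varied considerably (range of 1–58)

The number of driver genes found depends mainly on the cancer type, and secondarily on the number of samples for that cancer type.

Only 22 genes are called as drivers in more than three cancer types.

Pooling the 21 cancer types yields 114 driver genes, of which 30 are not found by any single-cancer analysis and 84 are.

So of the 224 genes found by the single-cancer analyses, 140 are not found by the pooled analysis. A Venn diagram makes this easy to see.

The 21 single-cancer analyses plus the pooled analysis make 22 analyses in total, which together give a Cancer5000 set containing 254 genes.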
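The Venn-diagram bookkeeping behind these numbers can be checked in a few lines. The variable names are made up; the inputs are the 224, 114, and 30 quoted from the paper.

```python
# Counts quoted from the paper.
individual_total = 224   # distinct genes from the 21 single-cancer analyses
combined_total = 114     # genes from the pooled analysis of all 21 cancer types
combined_only = 30       # pooled-analysis genes missed by every single-cancer analysis

# Derived regions of the two-set Venn diagram.
shared = combined_total - combined_only        # found by both kinds of analysis
individual_only = individual_total - shared    # found only in single-cancer analyses
cancer5000_size = individual_total + combined_only  # union of the two sets
```

The overlap comes out at 84 genes, which gives 140 genes unique to the single-cancer analyses and a union of 254, matching the size of the Cancer5000 set.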

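A more stringent analysis of the 254 genes in the Cancer5000 set leaves 219 distinct genes, called the Cancer5000-S (for ‘stringent’) gene set.

Version v65 of the Cancer Gene Census (CGC) contains 130 cancer genes driven by somatic point mutations, 82 of which were found by this statistical analysis.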

Four genes encode anti-proliferative proteins, in which loss-of-function mutations would be expected to contribute to oncogenesis.

Six additional genes encode proteins that are clearly involved in cell proliferation: RHEB, RHOA, SOS1, ELF3, SGK1 and MYOCD.

Five genes encode pro-apoptotic factors, in which loss-of-function mutations would be expected to promote oncogenesis

Six genes encode proteins related to genome stability.

Five genes are associated with chromatin regulation

Three genes encode proteins whose loss is expected to help tumours evade immune attack

Three genes are associated with RNA processing and metabolism.

One gene, TRIM23, is involved in protein homeostasis.

Beyond these 33 genes, the set of 81 novel genes is likely to contain additional true cancer genes.

The resampling approach: An effective test is to perform ‘down-sampling’; that is, to study how the number of discoveries increases with sample size, by repeating the analysis on random subsets of samples of various smaller sizes.
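Down-sampling can be sketched as below. This is a toy illustration: a gene counts as "discovered" once it is mutated in at least `min_count` samples of the subset, a crude stand-in for the significance test a real analysis would rerun on each subset. All names and the simulated cohort are made up.

```python
import random

def downsample_discoveries(sample_mutations, sizes, min_count=3, n_repeats=20, seed=0):
    """Average number of 'discovered' genes for random subsets of each size.

    sample_mutations: dict mapping sample ID -> set of mutated genes
    """
    rng = random.Random(seed)
    samples = list(sample_mutations)
    results = {}
    for size in sizes:
        counts = []
        for _ in range(n_repeats):
            subset = rng.sample(samples, size)  # random subset without replacement
            tally = {}
            for s in subset:
                for gene in sample_mutations[s]:
                    tally[gene] = tally.get(gene, 0) + 1
            counts.append(sum(1 for c in tally.values() if c >= min_count))
        results[size] = sum(counts) / len(counts)
    return results

# Toy cohort of 100 samples: GENE_A in 50%, GENE_B in 10%, GENE_C in 3% of samples.
cohort = {}
for i in range(100):
    genes = set()
    if i < 50:
        genes.add("GENE_A")
    if i < 10:
        genes.add("GENE_B")
    if i < 3:
        genes.add("GENE_C")
    cohort[f"sample_{i}"] = genes

curve = downsample_discoveries(cohort, sizes=[10, 50, 100])
```

Plotting the average number of discoveries against subset size gives the saturation curve: frequently mutated genes show up even in small subsets, while rare ones only appear as the subset approaches the full cohort.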

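Saturation analysis results: saturation is still far off, and the rate at which newly discovered genes accumulate with sample size depends on their mutation frequency.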

Genes mutated in 20% of tumours are approaching saturation;

those mutated at frequencies of 10–20% are still rising rapidly, but at a decreasing rate;

those at 5–10% increasing linearly;

and those at <5% are increasing at an accelerating rate.

Sample size requirements: cancers with a high background mutation rate (e.g. melanoma) need more samples, whereas for cancers with a low background mutation rate (e.g. neuroblastoma) about 650 samples are enough to analyze driver genes well.

Creating a reasonably comprehensive catalogue of candidate cancer genes mutated in 2% of patients will require between approximately 650 samples (for tumours with <0.5 mutations per Mb, such as neuroblastoma) to approximately 5,300 samples (for melanoma, with 12.9 mutations per Mb)

2015-MADGiC-identify-cancer-driver-gene

ulwvfje — Mon, 31 Aug 2015 11:19:58 +0000

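A recent tool for finding cancer driver genes:

Cancer is thought to result from the accumulation of causal somatic mutations throughout the lifetime of an individual.

These cancer-driving mutations mainly affect three classes of genes: (1) oncogenes, (2) tumor-suppressor genes, (3) stability genes.

The first mutations initiate tumorigenesis, and subsequent mutations drive tumor progression.

Identifying these mutations greatly helps in understanding gene function and in designing drug targets.

Distinguishing driver genes from passenger genes makes better use of the huge amount of mutation information in the various databases.

Frequency-based methods rely on an estimate of a background mutation rate which represents the rate of random passenger mutations.

This is the approach of Ding et al. (2008), but it ignores the following four factors: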

1、mutation type (transition versus transversion)

2、nucleotide context (which base is at the mutation site),

3、dinucleotide context (which bases are located at neighboring sites to the mutation),

4、expression level of the gene

Later papers proposed the following three improvements:

Sjoblom et al.(2006) account for nucleotide and dinucleotide context in searching for drivers of breast and colorectal cancer.

MuSiC (Dees et al.,2012) accounts for mutation type and allows for sample-specific mutation rates;

Lawrence et al.(2013) (MutSigCV) also allow for the inclusion of gene-specific factors such as expression level and replication timing.

But they all inherit a common weakness: they assume that driver genes are mutated more frequently than the background rate.

In fact, criteria other than mutation frequency also matter, which is why there are tools such as SIFT (first reported by Ng and Henikoff (2001), later updated by Kumar et al. (2009)), PolyPhen (Adzhubei et al., 2010), and MutationAssessor (Reva et al., 2011).

These tools integrate sequence context, position, and protein characteristics to assess a mutation’s functional impact.

To summarize, the criteria for identifying cancer driver genes are:

1、mutation frequency,

2、mutation type,

3、gene-specific features such as replication timing and expression level that are known to affect background rates of mutation,

4、mutation-specific scores that assess functional impact, and the spatial patterning of mutations that only becomes apparent when thousands of samples are considered.

Earlier methods each cover only part of the criteria above,

whereas the authors propose a unified empirical Bayesian Model-based Approach for identifying Driver Genes in Cancer (MADGiC) that utilizes each of these features.

2014-REVIEW-identifying driver mutation in sequenced cancer genome

ulwvfje — Mon, 31 Aug 2015 11:17:59 +0000

Somatic mutations is a broad term covering SNVs, indels, CNAs, SVs, and more.

However, the resulting sequencing datasets are large and complex, obscuring the clinically important mutations in a background of errors, noise, and random mutations.

Cancer is driven largely by somatic mutations that accumulate in the genome over an individual’s lifetime, with additional contributions from epigenetic and transcriptomic alterations

A success story from the low-throughput era: imatinib has been used to target cells expressing the BCR-ABL fusion gene in chronic myeloid leukemia

gefitinib has been used to inhibit the epidermal growth factor receptor in lung cancer

But that is far from enough.

Three major challenges for NGS: (1) identifying somatic mutations (errors and tumor heterogeneity); (2) identifying driver genes; (3) determining the pathways and other biological processes altered by somatic mutations.

Sources of error: optical PCR duplicates, GC-bias, strand bias (where reads indicating a possible mutation only align to one strand of DNA) and alignment artifacts resulting from low complexity or repetitive regions in the genome.

Most methods for somatic mutation detection address only a subset of the possible sources of error, and there are many SNP callers.

Three key approaches to identifying driver mutations:

1，identifying recurrent mutations;

2，predicting the functional impact of individual mutations;

3，assessing combinations of mutations using pathways, interaction networks, or statistical correlations.

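Each of the three has spawned a large number of tools, and each has its problems:

1. Tools that look directly at mutation frequency determine whether the observed number of mutations in the gene is significantly greater than the number expected according to a background mutation rate (BMR).

The BMR is very hard to pin down: set it too low and you get many false positives; set it too high and you miss many real driver mutations. Genes with very high mutation frequency are not in doubt, though: TP53, for example, is called a driver gene by any algorithm.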
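How sensitive the call is to the BMR can be seen in a simple binomial recurrence test. This is a sketch under strong assumptions (one uniform per-base rate, independent samples, no covariates), not any particular tool's model; the function names and the numbers are made up.

```python
from math import comb

def prob_sample_mutated(bmr_per_base, gene_length):
    """Probability that a sample carries at least one background mutation in the gene."""
    return 1.0 - (1.0 - bmr_per_base) ** gene_length

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))

def recurrence_pvalue(mutated_samples, n_samples, bmr_per_base, gene_length):
    """P-value for seeing this many mutated samples under the background rate alone."""
    q = prob_sample_mutated(bmr_per_base, gene_length)
    return binomial_tail(mutated_samples, n_samples, q)

# Hypothetical gene: 1,500 coding bases, mutated in 8 of 500 tumors.
p_low_bmr = recurrence_pvalue(8, 500, 1e-6, 1500)   # BMR of 1 mutation per Mb
p_high_bmr = recurrence_pvalue(8, 500, 5e-6, 1500)  # BMR of 5 mutations per Mb
```

The same observation looks far more significant under the lower BMR, which is exactly how an underestimated background rate produces false positive drivers.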

2. Tools that score a mutation's impact on protein function bring in some prior assumptions:

evolutionary conservation,

known protein domains,

non-random clustering of mutations,

protein structure,

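3. Tools based on pathways, interaction networks, and de novo approaches:

Pathway-based analysis (KEGG, GO, GSEA) has four limitations. First, most annotated gene sets contain too many genes, so the mutated genes make up too small a share of the gene set to reach statistical significance.

Second, pathways are not independent; the connections between pathways matter more.

Third, splitting genes into small units such as pathways ignores connections outside those units.

Finally, only known pathways or gene sets are considered.

The past five years have seen dramatic changes in cancer genome sequencing research, but several challenges remain before real clinical application:

First, non-coding somatic mutations have been neglected.

Second, many of the cancer types we define are really a mixture of subtypes.

Third, which cancers can be studied together?

Finally, how should different kinds of NGS data be integrated, including WGS, WES, RNA sequencing, DNA methylation, and chromatin modifications?

For some patients, precision cancer medicine has already arrived, but for most there is still a long way to go.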

2014-review-Next-generation sequencing to guide cancer therapy

ulwvfje — Mon, 31 Aug 2015 11:15:51 +0000

This reductionist thinking led the initial theories on carcinogenesis to be centered on how many “hits” or genetic mutations were necessary for a tumor to develop.

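Reductionists held that the causes of cancer initiation and progression come down to a few necessary factors: "hits" or genetic mutations.

Because of this assumption, early approaches to the genetic basis of cancers were mostly low-throughput studies of specific genes or variants.

Choice of assay: microarray vs WGS vs WES

Choice of clinical specimen: fresh frozen tissue / FFPE specimens / CTCs / ctDNA

Clinical NGS analysis workflow: mapping --> SNVs, CNVs, and SVs --> annotation

Challenges: 1. low-frequency mutations are hard to distinguish from sequencing errors

2. many clinically relevant DNA fusions occur in non-coding regions, so WES also misses a good deal of information

Clinical NGS annotation: multiple databases and multiple analysis methods

Three ways NGS supports clinical care: (1) diagnosis: early detection and precise classification; (2) targeted therapy; (3) drug resistance: switching drugs in time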

CTC: Circulating tumor cell;

ctDNA: circulating tumor DNA;

FDA: Food and Drug Administration;

FFPE: Formalin-fixed, paraffin-embedded;

MATCH: Molecular Analysis for Therapy Choice;

MHC: Major histocompatibility complex;

NGS: Next-generation sequencing;

SNV: Single-nucleotide variant;

TCGA: The Cancer Genome Atlas.