Nov 05

Using MutSig to find driver genes

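Finding driver mutations among tens of thousands of mutations is a big topic. I have come across more than a dozen tools for it, but the few I tried never ran successfully, and I don't know why. Today one finally worked: MutSig! Incidentally, a university in Taiwan has built a database for finding driver mutations from mutation data, DriverDB (http://ngs.ym.edu.tw/driverdb/), and published a paper on it: http://nar.oxfordjournals.org/content/early/2013/11/07/nar.gkt1025.full.pdf. It is quite good!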

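The theory around driver mutations has also advanced a lot recently and is fairly mature by now, but I have never had the time to sit down and properly fill in my theoretical background. Many tools I have only run, and much data I have only processed without knowing why it needed doing. ╮(╯▽╰)╭ But I digress; on to the software!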

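MutSig comes from the Broad Institute, so it is very reliable. It is described and used in a Nature paper: http://www.nature.com/nature/journal/v505/n7484/full/nature12912.html, and the software itself is available at: http://www.broadinstitute.org/cancer/cga/mutsig_run. A simple registration is required before you can download it.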

The Nature paper describes the software's strengths as follows: We used the most recent version of the MutSig suite of tools, which looks for three independent signals: high mutational burden relative to background expectation, accounting for heterogeneity; clustering of mutations within the gene; and enrichment of mutations in evolutionarily conserved sites. We combined the significance levels (P values) from each test to obtain a single significance level per gene (Methods).

This software needs a MATLAB environment to run, so I wrote an earlier tutorial on how to install one: http://www.bio-info-trainee.com/?p=1166

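If the MATLAB environment is already installed, you can simply download the software and use it. Unpacking the archive is all the installation there is, and test files are provided too!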

[Screenshot: Capture4]

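After downloading and unpacking the software you will see a script inside. The manual is very brief, but then the software really is very simple to use: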

run_MutSigCV.sh <path_to_MCR> mutations.maf coverage.txt covariates.txt output.txt

That is all it takes, and all of the input data can be downloaded.

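Once the test data has run through, you know the installation works! If all you have is mutation data in MAF format (for the MAF format, see https://www.biostars.org/p/69222/), you can still use the software, as follows: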

run_MutSigCV.sh <path_to_MCR> my_mutations.maf exome_full192.coverage.txt gene.covariates.txt my_results mutation_type_dictionary_file.txt chr_files_hg19

[Screenshot: Capture5]

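The three zip files in the screenshot above can all be found via download links on the MutSig website, and they are required. Running the tool is easy, a single command, but turning your VCF mutation data into the MAF format the software needs is the hard part!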
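As a rough illustration of that conversion step, the sketch below writes already-annotated variants to a tab-delimited file using column names from the public MAF convention. It assumes each variant has already been annotated with a gene symbol and a variant classification (a raw VCF does not carry these), and the `variants_to_maf` helper and the sample record are hypothetical. Which columns MutSigCV actually requires should be checked against its own documentation.

```python
import csv
import io

# Column names follow the public MAF convention; whether this subset is
# sufficient for MutSigCV is an assumption to verify against its manual.
MAF_COLUMNS = [
    "Hugo_Symbol", "Chromosome", "Start_Position", "End_Position",
    "Variant_Classification", "Reference_Allele", "Tumor_Seq_Allele1",
    "Tumor_Seq_Allele2", "Tumor_Sample_Barcode",
]

def variants_to_maf(variants, handle):
    """Write annotated variant dicts to `handle` as tab-delimited MAF rows."""
    writer = csv.DictWriter(handle, fieldnames=MAF_COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for v in variants:
        ref, alt = v["ref"], v["alt"]
        writer.writerow({
            "Hugo_Symbol": v["gene"],
            "Chromosome": v["chrom"],
            "Start_Position": v["pos"],
            # The end coordinate covers the whole reference allele.
            "End_Position": v["pos"] + len(ref) - 1,
            "Variant_Classification": v["classification"],
            "Reference_Allele": ref,
            # Assumption: report the variant as ref/alt in the tumor.
            "Tumor_Seq_Allele1": ref,
            "Tumor_Seq_Allele2": alt,
            "Tumor_Sample_Barcode": v["sample"],
        })

# Hypothetical, made-up record purely for illustration.
example = [{"gene": "GENE1", "chrom": "1", "pos": 12345, "ref": "C",
            "alt": "T", "classification": "Missense_Mutation",
            "sample": "SAMPLE-01"}]
buffer = io.StringIO()
variants_to_maf(example, buffer)
maf_text = buffer.getvalue()
```

The annotation itself (gene symbol, classification) has to come from a separate annotation tool; this sketch only covers the final reshaping.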

31

2013-science-3205tumors-12types-4-ways-find-291HCD

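Paper: Comprehensive identification of mutational cancer driver genes across 12 tumor types
This paper compares four approaches to finding cancer driver genes and arrives at a combined, reliable list of 291 high-confidence drivers (HCDs).
The data come from 3,205 tumor samples covering 12 cancer types.
The Cancer Gene Census (CGC) database already contains close to 500 cancer genes.
Cancer genome studies can yield tens of thousands of somatic mutations, but only a small fraction of them actually drive tumor initiation and progression.
Moreover, most driver genes are mutated at low frequency, and because tumors are heterogeneous, studies with large numbers of samples are essential.
The four mainstream approaches to finding cancer driver genes are:
1. Most common methods identify genes that are mutated more frequently than expected from the background mutation rate (recurrence)
2. Other methods exploit a bias towards the accumulation of functional mutations (FM bias)
3. Other methods exploit the tendency to sustain mutations in certain regions of the protein sequence (CLUST bias)
4. Other approaches exploit the overrepresentation of mutations in specific functional residues, such as phosphorylation sites (ACTIVE bias)
Their representative tools are MuSiC, OncodriveFM, OncodriveCLUST and ActiveDriver.
The paper compares these four approaches and combines their results.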
 In summary, we provide a very reliable list of 291 HCDs and a second one, of 144 CDs, more comprehensive but with an expectedly higher false-positives rate
 One hundred and sixty-five of these candidates are novel findings not included in the CGC.
The authors then ran a functional analysis of these 291 HCDs, which fall mainly into the following five biological functions:
Chromatin remodeling,
mRNA processing,
Cell signaling/proliferation,
Cell adhesion,
DNA repair/Cell cycle
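The 291 HCDs obtained by combining the four approaches were then compared against the nearly 500 cancer genes already in the Cancer Gene Census (CGC) database.
This paper is the first to show that several driver-gene detection approaches can be combined. The combination rests on two facts:
1. There is no gold standard for finding cancer driver genes, so combining several approaches is more comprehensive.
2. Combining several approaches makes it easier to compare and assess the accuracy of the driver genes that are found.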
31

2014-4742samples-21tumors-Cancer5000-set-254-genes

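Paper: Discovery and saturation analysis of cancer genes across 21 tumour types.
We know that when many samples of one cancer are studied, few genes are mutated in as many as 20% of samples; most are mutated at intermediate frequencies (2–20%), and many low-frequency mutations go undetected because too few samples have been studied.
We studied somatic point mutations in exome sequencing data from 4,742 tumor-normal pairs across 21 cancer types.
Cancer genes may cluster in the following seven functions: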
proliferation,
apoptosis,
genome stability,
chromatin regulation,
immune evasion,
RNA processing
protein homeostasis
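Using a sampling-with-replacement approach to analyse the data, we conclude that studying as many as 500-6000 samples for a given cancer would uncover more clinically relevant low-frequency mutations.
The paper sets out to answer three questions:
1. Can large-scale cancer studies reach the point of identifying all cancer driver genes? (Coverage of known cancer genes)
2. Will increasing the sample size reveal many more cancer driver genes? (Analysis of novel candidate cancer genes)
3. How far are we from a complete catalogue of all cancer driver genes? (Saturation analysis)
The mutation data were processed with Broad's stringent filtering and annotation pipeline.
The mutations break down as follows:
3,078,483 somatic single nucleotide variations (SSNVs),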
77,270 small insertions and deletions (SINDELs)
29,837 somatic di-, tri- or oligonucleotide variations (DNVs, TNVs and ONVs, respectively)
an average of 672 per tumour–normal pair
These include:
540,831 missense,
207,144 synonymous,
46,264 nonsense,
33,637 splice-site
2,294,935 non-coding mutations
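The quoted average of 672 mutations per tumour-normal pair can be checked from the three totals above (SSNVs, SINDELs, and DNVs/TNVs/ONVs) and the 4,742 pairs in the dataset. A quick sanity check, with variable names of my own choosing:

```python
ssnvs = 3_078_483            # somatic single nucleotide variations
sindels = 77_270             # small insertions and deletions
multi_nucleotide = 29_837    # di-, tri- and oligonucleotide variations
pairs = 4_742                # tumour-normal pairs

total_mutations = ssnvs + sindels + multi_nucleotide
average_per_pair = total_mutations / pairs
print(total_mutations, round(average_per_pair))  # prints: 3185590 672
```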
Our method for finding driver genes:
We used the most recent version of the MutSig suite of tools, which looks for three independent signals:
high mutational burden relative to background expectation, accounting for heterogeneity;
clustering of mutations within the gene;
enrichment of mutations in evolutionarily conserved sites.
We combined the p-values from these independent MutSig components to call driver genes, analysing each cancer type separately and also all 21 cancer types together.
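The text does not say which rule MutSig uses to combine the per-test p-values. One common way to combine independent p-values is Fisher's method, sketched below purely as an illustration; the function name and the example p-values are hypothetical, and this should not be read as MutSig's actual procedure.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null, -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom, where k = len(p_values).
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # statistic / 2
    # Chi-square survival function for even degrees of freedom 2k:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Hypothetical p-values for burden, clustering and conservation.
combined = fisher_combine([0.04, 0.20, 0.01])
```

Fisher's method assumes the tests are independent, which matches the "three independent signals" wording, but any correlation between the signals would make the combined value too optimistic.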
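The driver genes we found:
Analysing each cancer type separately yields 334 gene hits (gene-tumour type pairs) in total; naturally, the genes found in different cancer types overlap.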
These 334 pairs involve 224 distinct genes.
The number of genes detected per tumour type varied considerably (range of 1–58)
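The number of driver genes found depends mainly on the cancer type and, after that, on the sample size for that cancer.
Only 22 genes are called as driver genes in more than three cancer types.
If we pool the 21 cancer types and search for driver genes, we find 114, of which 30 are not found by any per-cancer analysis and 84 are.
So of the 224 genes found in the per-cancer analyses, 140 are not found by the pooled analysis. Drawing a Venn diagram makes this easy to see.
The 21 per-cancer analyses plus the pooled analysis, 22 analyses in all, together give a Cancer5000 set containing 254 genes.
Applying stricter criteria to the 254 genes in the Cancer5000 set leaves 219 distinct genes, called the Cancer5000-S (for 'stringent') gene set.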
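The Venn-diagram bookkeeping can be written out in a few lines. Starting from the three counts given in the text (224 per-cancer genes, 114 pooled genes, 30 pooled-only genes), the overlap and the union follow directly; the variable names are mine.

```python
per_type_total = 224   # distinct genes from the per-cancer analyses
pooled_total = 114     # genes from the pooled 21-cancer analysis
pooled_only = 30       # pooled genes not found in any per-cancer analysis

shared = pooled_total - pooled_only        # found by both kinds of analysis
per_type_only = per_type_total - shared    # found only per cancer type
union = per_type_only + shared + pooled_only
print(shared, per_type_only, union)  # prints: 84 140 254
```

The union of 254 matches the size of the Cancer5000 set, and the 140 per-cancer-only genes match the figure quoted above.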
Version v65 of the Cancer Gene Census (CGC) contains 130 cancer genes driven by somatic point mutations, 82 of which were detected in this analysis.
Four genes encode anti-proliferative proteins, in which loss-of-function mutations would be expected to contribute to oncogenesis.
Six additional genes encode proteins that are clearly involved in cell proliferation: RHEB, RHOA, SOS1, ELF3, SGK1 and MYOCD.
Five genes encode pro-apoptotic factors, in which loss-of-function mutations would be expected to promote oncogenesis.
Six genes encode proteins related to genome stability.
Five genes are associated with chromatin regulation.
Three genes encode proteins whose loss is expected to help tumours evade immune attack.
Three genes are associated with RNA processing and metabolism.
One gene, TRIM23, is involved in protein homeostasis.
Beyond these 33 genes, the set of 81 novel genes is likely to contain additional true cancer genes.
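The resampling approach is described as: An effective test is to perform 'down-sampling'; that is, to study how the number of discoveries increases with sample size, by repeating the analysis on random subsets of samples of various smaller sizes.
Saturation analysis result: we are still far from saturation, and the rate at which new genes are discovered as the sample size grows differs by mutation frequency.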
Genes mutated in 20% of tumours are approaching saturation;
those mutated at frequencies of 10–20% are still rising rapidly, but at a decreasing rate;
those at 5–10% increasing linearly;
and those at <5% are increasing at an accelerating rate.
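The down-sampling idea described above can be sketched on synthetic data. The example below is a toy: the cohort is simulated, and the "discovery" rule (a gene counts as found if it is mutated in at least `min_samples` samples of the subset) is a stand-in I chose for illustration, not MutSig's significance test. All names and numbers are hypothetical.

```python
import random

def count_discoveries(gene_to_samples, subset, min_samples):
    """Count genes mutated in at least `min_samples` samples of `subset`."""
    subset = set(subset)
    return sum(1 for samples in gene_to_samples.values()
               if len(samples & subset) >= min_samples)

def down_sample_curve(gene_to_samples, all_samples, sizes, min_samples=3,
                      repeats=20, seed=0):
    """Average discovery count over random subsets of each size."""
    rng = random.Random(seed)
    curve = {}
    for size in sizes:
        counts = [count_discoveries(gene_to_samples,
                                    rng.sample(all_samples, size),
                                    min_samples)
                  for _ in range(repeats)]
        curve[size] = sum(counts) / repeats
    return curve

# Synthetic cohort: 200 samples, genes mutated at different frequencies.
data_rng = random.Random(42)
all_samples = list(range(200))
frequencies = [0.25, 0.15, 0.08, 0.04, 0.02, 0.01]
gene_to_samples = {
    f"gene_{i}": {s for s in all_samples if data_rng.random() < freq}
    for i, freq in enumerate(frequencies)
}
curve = down_sample_curve(gene_to_samples, all_samples, [25, 50, 100, 200])
```

Plotting `curve` against subset size gives a toy discovery curve; with a real cohort, each subset would be fed through the actual driver-detection analysis instead of the threshold rule.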
Sample-size requirements: cancers with a high background mutation rate (e.g. melanoma) need more samples, whereas those with a low background mutation rate (e.g. neuroblastoma) need only about 650 samples for a good driver-gene analysis.
Creating a reasonably comprehensive catalogue of candidate cancer genes mutated in 2% of patients will require between approximately 650 samples (for tumours with <0.5 mutations per Mb, such as neuroblastoma) to approximately 5,300 samples (for melanoma, with 12.9 mutations per Mb)
31

2015-MADGiC-identify-cancer-driver-gene

A recent tool for finding cancer driver genes:
Cancer is thought to result from the accumulation of causal somatic mutations throughout the lifetime of an individual.
These cancer-driving mutations mainly affect three classes of genes: 1. oncogenes, 2. tumor-suppressor genes, 3. stability genes
The first mutations initiate tumorigenesis, and subsequent mutations drive tumor progression.
Identifying these mutations is very helpful for understanding gene function and for designing drug targets.
Distinguishing driver genes from passenger genes makes better use of the huge volume of mutation data available in the various databases.
Frequency-based approaches rely on an estimate of a background mutation rate which represents the rate of random passenger mutations.
This is the approach proposed in Ding et al. (2008), but it ignores the following four factors:
1. mutation type (transition versus transversion)
2. nucleotide context (which base is at the mutation site)
3. dinucleotide context (which bases are located at neighboring sites to the mutation)
4. expression level of the gene
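To make the frequency-based idea concrete, here is a deliberately simplified sketch: it asks how surprising the observed mutation count in a gene is under a single uniform background rate, using a Poisson approximation. It ignores exactly the four factors listed above, and the function name, parameters and example numbers are hypothetical rather than taken from any of the cited tools.

```python
import math

def recurrence_p_value(observed, gene_length, n_samples, background_rate):
    """P(X >= observed) for X ~ Poisson(background_rate * length * samples)."""
    expected = background_rate * gene_length * n_samples
    # Sum P(X = i) for i < observed, then take the complement.
    term = math.exp(-expected)   # P(X = 0)
    below = 0.0
    for i in range(observed):
        below += term
        term *= expected / (i + 1)
    return max(0.0, 1.0 - below)

# Hypothetical gene: 1,500 coding bases, 300 samples, 1e-6 mutations per base.
# About 0.45 mutations are expected by chance, so 6 observed is surprising.
p = recurrence_p_value(observed=6, gene_length=1500, n_samples=300,
                       background_rate=1e-6)
```

A gene-specific or context-specific background rate would replace the single `background_rate` here, which is where the refinements discussed next come in.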
Later papers then proposed the following three improvements:
Sjoblom et al. (2006) account for nucleotide and dinucleotide context in searching for drivers of breast and colorectal cancer.
MuSiC (Dees et al., 2012) accounts for mutation type and allows for sample-specific mutation rates;
Lawrence et al. (2013) (MutSigCV) also allow for the inclusion of gene-specific factors such as expression level and replication timing.
But they all share one inherited shortcoming: they assume that driver genes are mutated more frequently than the background mutation rate.
In practice, criteria other than mutation frequency matter too, which is why there are three tools: SIFT (first reported by Ng and Henikoff (2001), later updated by Kumar et al. (2009)), Polyphen (Adzhubei et al., 2010) and MutationAssessor (Reva et al., 2011).
These tools integrate sequence context, position, and protein characteristics to assess a mutation's functional impact.
To summarize, the criteria for identifying cancer driver genes are:
1. mutation frequency,
2. mutation type,
3. gene-specific features such as replication timing and expression level that are known to affect background rates of mutation,
4. mutation-specific scores that assess functional impact, and the spatial patterning of mutations that only becomes apparent when thousands of samples are considered.
Earlier methods each cover only part of these criteria.
We instead propose a unified empirical Bayesian Model-based Approach for identifying Driver Genes in Cancer (MADGiC) that utilizes each of these features.
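MADGiC's actual model is not spelled out here, so the following is only a toy illustration of the Bayesian idea behind a driver call: Bayes' rule applied to a two-class (driver versus passenger) model. The function name and all numbers are hypothetical. In an empirical Bayes setting, the prior and the likelihood parameters would be estimated from the data rather than fixed by hand.

```python
def posterior_driver_probability(prior_driver, lik_driver, lik_passenger):
    """Posterior P(driver | data) from Bayes' rule for two classes.

    prior_driver:  prior probability that a gene is a driver
    lik_driver:    likelihood of the observed mutations if it is a driver
    lik_passenger: likelihood of the observed mutations if it is a passenger
    """
    if not 0.0 <= prior_driver <= 1.0:
        raise ValueError("prior must lie in [0, 1]")
    numerator = prior_driver * lik_driver
    denominator = numerator + (1.0 - prior_driver) * lik_passenger
    if denominator == 0.0:
        raise ValueError("data has zero likelihood under both classes")
    return numerator / denominator

# Hypothetical numbers: a rare prior, but data far more likely for a driver.
posterior = posterior_driver_probability(prior_driver=0.01,
                                         lik_driver=0.2,
                                         lik_passenger=0.002)
```

Each criterion in the summary above (frequency, mutation type, gene-specific features, functional impact) would enter such a model through the two likelihood terms.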