Finding SCNAs across multiple segment files with GISTIC

ulwvfje — Thu, 19 May 2016 12:13:36 +0000

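GISTIC is used heavily in the TCGA project. Its purpose is simple: you have studied many cancer samples and obtained copy-number change information for each one from arrays. Array results usually come as segment files, which can be read as CNV regions. GISTIC combines the samples, looks for somatic CNVs across them, and annotates the results with gene information.

There are two difficulties: setting up the MATLAB runtime environment on Linux, and preparing the input files.

1. Installation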

Installation guide: ftp://ftp.broadinstitute.org/pub/GISTIC2.0/INSTALL.txt

Software homepage: http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=216&p=t

Paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218867/

Download: wget ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTIC_2_0_22.tar.gz

Its documentation is very detailed: ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm
After unpacking, you have to install the MATLAB runtime environment (the MCR) yourself, which can be quite a hassle!

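2. Preparing the input data

The input is the set of segment files obtained by processing the raw SNP 6.0 array data with software such as PICNIC or Birdseed.

The segments from multiple samples are merged into a single input file, and you also need a sample list and some information about the array. With the example files as a template, the input files are easy to put together!

arraylistfile lists all the samples included in this GISTIC run; usually one cancer type is run together.

cnvfile.txt is optional.

segmentationfile.txt holds the segment information produced from your SNP 6.0 (or similar) arrays, with the results of all samples merged together. A single sample typically has around 1,000 segments.

markersfile.txt depends mainly on your array platform. For the Affymetrix SNP 6.0 array it has more than 900,000 rows, one for each probe.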
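Merging the per-sample segments can be sketched in a few lines of Python. This is only a minimal sketch: it assumes the six-column, tab-delimited layout described in the GISTIC documentation (sample, chromosome, start, end, number of markers, segment copy-number value). The header names, the `merge_segments` helper, and the demo values are my own placeholders, not anything shipped with GISTIC.

```python
import csv
import io

# Assumed column order of the GISTIC segmentation file (tab-delimited).
# The header names themselves are placeholders chosen for this sketch.
SEG_COLUMNS = ["Sample", "Chromosome", "Start", "End", "Num_Markers", "Seg_CN"]


def merge_segments(per_sample, out_handle):
    """Write the segments of all samples into one tab-delimited table.

    per_sample maps a sample name to a list of
    (chromosome, start, end, num_markers, seg_cn) tuples.
    Returns the number of segment rows written (header excluded).
    """
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(SEG_COLUMNS)
    n_rows = 0
    for sample in sorted(per_sample):
        for chrom, start, end, num_markers, seg_cn in per_sample[sample]:
            writer.writerow([sample, chrom, start, end, num_markers, seg_cn])
            n_rows += 1
    return n_rows


# Made-up demo data: two samples, three segments in total.
demo = {
    "S1": [("1", 100, 5000, 12, 0.02), ("1", 5001, 9000, 8, -0.61)],
    "S2": [("1", 100, 9000, 20, 0.45)],
}
buf = io.StringIO()
n_written = merge_segments(demo, buf)
print(buf.getvalue())
```

In practice you would pass an open file instead of `io.StringIO()` and fill `per_sample` from the output of your segmentation tool.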

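The line counts of the test data bundled with the software are listed below: 106 samples and a little over 20,000 segments in total, which works out to only about 200 segments per sample on average, so it might be PICNIC output from SNP 6.0 array data. However, its markersfile.txt contains only a little over 110,000 markers (that is, probes), so it is probably not SNP 6.0 array data after all.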

106 arraylistfile.txt

12942 cnvfile.txt

115593 markersfile.txt

20521 segmentationfile.txt
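The back-of-the-envelope check on these counts can be written out as follows. It is just the arithmetic from the paragraph above; the 900,000 figure is the approximate SNP 6.0 probe count quoted earlier in this post, and the line counts may include a header line, so treat the results as rough.

```python
# Line counts of the bundled test data, as listed above.
counts = {
    "arraylistfile.txt": 106,
    "markersfile.txt": 115593,
    "segmentationfile.txt": 20521,
}

# Approximate number of probes on an Affymetrix SNP 6.0 array (figure from this post).
SNP6_PROBES_APPROX = 900_000

segments_per_sample = counts["segmentationfile.txt"] / counts["arraylistfile.txt"]
marker_fraction = counts["markersfile.txt"] / SNP6_PROBES_APPROX

print(round(segments_per_sample, 1))  # 193.6
print(round(marker_fraction, 2))
```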

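3. Running the program

The run script shipped with the software uses csh; I changed it to bash.

You also need to edit the MATLAB path and the genome build settings.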
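To keep the paths in one place, the command line can be assembled programmatically. The sketch below only builds the argument list and does not run anything. The flags (`-b`, `-seg`, `-refgene`, `-mk`, `-alf`, `-cnv`, `-conf`) are the ones used in the example run script of the distribution as far as I remember them, so check them against your copy; the `build_gistic_command` helper and all paths are hypothetical.

```python
import shlex


def build_gistic_command(base_dir, seg_file, refgene_file, markers_file=None,
                         array_list_file=None, cnv_file=None, conf=0.90,
                         executable="./gistic2"):
    """Return the GISTIC command as a list of arguments (nothing is executed)."""
    cmd = [executable, "-b", base_dir, "-seg", seg_file, "-refgene", refgene_file]
    # Optional inputs are only added when a path is given.
    if markers_file:
        cmd += ["-mk", markers_file]
    if array_list_file:
        cmd += ["-alf", array_list_file]
    if cnv_file:
        cmd += ["-cnv", cnv_file]
    cmd += ["-conf", str(conf)]
    return cmd


# Hypothetical paths; the reference gene file decides the genome build.
cmd = build_gistic_command(
    base_dir="gistic_out",
    seg_file="segmentationfile.txt",
    refgene_file="refgenefiles/hg19.mat",
    markers_file="markersfile.txt",
    array_list_file="arraylistfile.txt",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the MCR environment variables are set up.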

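4. Interpreting the output

A brief explanation of the files in the output directory:

all_data_by_genes.txt gives the actual copy-number value of each gene (including non-coding RNAs such as miRNAs and lncRNAs) in each sample.

all_lesions.conf_90.txt lists the identified amplification and deletion peak regions.

all_thresholded.by_genes.txt holds the discretized values. Roughly, -2 means loss of both copies (deep deletion), -1 means loss of one copy (shallow deletion), 0 means normal copy number, 1 means a low-level gain, and 2 means a high-level amplification.

broad_significance_results.txt lists the broad regions with significant copy-number alteration.

broad_values_by_arm.txt gives the copy-number value of each chromosome arm in each sample.

scores.gistic contains the scores computed by the method.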
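As an illustration of how the discretized calls can be used, the sketch below computes per-gene alteration frequencies from a table shaped like all_thresholded.by_genes.txt. It assumes a tab-delimited file whose first three columns are gene annotations (gene symbol, locus ID, cytoband) followed by one column per sample; check the header of your own output before relying on that. The `alteration_frequencies` helper and the demo rows are made up.

```python
import csv
import io


def alteration_frequencies(handle, n_annotation_cols=3):
    """Return {gene: {"amp": f, "del": f, "deep_del": f}} from thresholded calls.

    Each frequency is the fraction of samples with a call > 0 (amp),
    < 0 (del), or exactly -2 (deep_del).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n_samples = len(header) - n_annotation_cols
    freqs = {}
    for row in reader:
        calls = [int(v) for v in row[n_annotation_cols:]]
        freqs[row[0]] = {
            "amp": sum(c > 0 for c in calls) / n_samples,
            "del": sum(c < 0 for c in calls) / n_samples,
            "deep_del": sum(c == -2 for c in calls) / n_samples,
        }
    return freqs


# Made-up demo table with four samples.
demo_table = (
    "Gene Symbol\tLocus ID\tCytoband\tS1\tS2\tS3\tS4\n"
    "GENE1\t1\t1p36.33\t2\t1\t0\t-1\n"
    "GENE2\t2\t9p21.3\t-2\t-2\t-1\t0\n"
)
freqs = alteration_frequencies(io.StringIO(demo_table))
print(freqs["GENE2"])
```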

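I wrote this tutorial around the summer of 2016, and it is now autumn 2017. The software has been updated again: it now supports the hg38 reference genome, and the scripts have been switched from csh to bash. Great!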

2.0.23 (2017-03-27)
- The markers file input is now optional - if omitted, pseudo-markers will be
generated to satisfy GISTIC's input requirements while ensuring reasonably
uniform coverage of the genome.
- The "broad analysis" of arm-level events has been revised:
(1) arm-level events are now called from a single broad copy number profile
instead of separate amplification and deletion profiles, which had led to
arms counterintuitively called as amplified and deleted on the same sample;
(2) the frequency scores used to determine z-scores and q-values, which excludes
arms with the opposite call from the denominator, are now in a column called
"frequency score". A new column called "frequncy" gives the intuitive frequency
with the denominator inluding arms from all the samples. The analysis results
for the same data will be different from that of previous GISTIC versions.
- Error handling messages have been improved. In particular, many informative
error messages were masked by an "Index exceeds matrix dimensions" error
in the exception handler itself.
- An hg38 reference genome is included with this release.
- The gp_gistic2_from_seg binary executable is now compiled for MCR 8.3
(Matlab R2014a). The source code is compatible with versions of Matlab up to
R2016a, however, the appearance of output graphics may be altered for Matlab
versions R2015a and later.
- This release adds the convenient 'gistic2' wrapper function which sets up
the MCR and passes its command line argument to the executable. Scripts have
been converted from the C-shell to the Bourne shell.

All TCGA somatic mutation data in MAF format can be downloaded

ulwvfje — Fri, 06 May 2016 12:33:36 +0000

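If you work on cancer, you cannot afford to miss the wealth of public data from the TCGA project. Most people can only access level 3 data. Then again, most people have no practical way to use level 1 and level 2 data anyway, since the raw sequencing data for nearly ten thousand cancer samples is a daunting amount. Besides, taking the raw data and rerunning a pipeline ourselves would not necessarily give better results than TCGA's own analysis, so working directly from the analysis results is good enough!

Among the analysis results, the most useful are the somatic mutations. Many of my earlier posts covered somatic mutations, including the concept and the analysis workflow, but there is a more convenient route: download the somatic mutation files that have already been analyzed!

At least for now, all TCGA somatic mutation files can be downloaded: https://wiki.nci.nih.gov/display/TCGA/TCGA+MAF+Files

The page contains a very large number of somatic mutations, all recorded in MAF format. The files are first split by cancer type, so you can download just the cancer you want to study. Within each cancer type, every paper TCGA publishes has a corresponding MAF file, so you can follow the approach described in the paper and reproduce its analysis workflow.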
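Once a MAF file is downloaded, a first look at it can be as simple as counting how many samples carry a non-silent mutation in each gene. The sketch below assumes the standard MAF columns `Hugo_Symbol`, `Variant_Classification`, and `Tumor_Sample_Barcode`, and skips leading `#` comment lines. The `mutated_samples_per_gene` helper and the demo records (including the sample names) are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def mutated_samples_per_gene(handle, skip_classes=("Silent",)):
    """Count distinct samples with at least one kept mutation per gene."""
    # MAF files may start with comment lines such as "#version 2.4".
    lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    samples = defaultdict(set)
    for rec in reader:
        if rec["Variant_Classification"] in skip_classes:
            continue
        samples[rec["Hugo_Symbol"]].add(rec["Tumor_Sample_Barcode"])
    return {gene: len(barcodes) for gene, barcodes in samples.items()}


# Made-up demo records; real MAF files have many more columns.
demo_maf = (
    "#version 2.4\n"
    "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
    "TP53\tMissense_Mutation\tSAMPLE-A\n"
    "TP53\tNonsense_Mutation\tSAMPLE-B\n"
    "TP53\tSilent\tSAMPLE-C\n"
    "KRAS\tMissense_Mutation\tSAMPLE-A\n"
    "KRAS\tMissense_Mutation\tSAMPLE-A\n"
)
gene_counts = mutated_samples_per_gene(io.StringIO(demo_maf))
print(gene_counts)
```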

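A collection of software for finding somatic mutations

ulwvfje — Tue, 05 Jan 2016 12:04:31 +0000

Somatic mutations are actually easy to understand: you sequence the normal tissue and the cancer tissue of the same person, then compare the SNP sites called from the two samples.

SNP sites found only in the cancer tissue data are somatic mutations, while SNP sites present in both samples are germline mutations. In general, though, people study the somatic ones.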
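In code, this textbook definition is nothing more than set arithmetic on the two call sets. The sketch below is the naive version only, with no handling of purity, depth, or sequencing error; the `classify_variants` helper and the example variants are made up.

```python
def classify_variants(tumor_calls, normal_calls):
    """Split tumor calls into somatic (tumor only) and germline (in both).

    Each call is a (chromosome, position, ref, alt) tuple.
    """
    tumor, normal = set(tumor_calls), set(normal_calls)
    return {
        "somatic": tumor - normal,   # seen only in the tumor
        "germline": tumor & normal,  # seen in both tumor and normal
    }


# Made-up example calls.
tumor_calls = [("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T"), ("chr3", 3000, "G", "A")]
normal_calls = [("chr1", 1000, "A", "G"), ("chr4", 4000, "T", "C")]

result = classify_variants(tumor_calls, normal_calls)
print(sorted(result["somatic"]))
print(sorted(result["germline"]))
```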

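Of course, it is simple in theory, but all those statisticians need to make a living, so they had to complicate things, which is why there are so many somatic mutation calling tools. Just kidding. The real reason is that we do not sequence single cells: the normal and cancer tissues we usually collect are not pure, so there is a lot of discussion around this point.

I happened to come across a thread that collects most of the better-known somatic mutation calling tools. Of these, I have only used MuTect and VarScan.

Source: https://www.biostars.org/p/19104/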

Here are a few more, a summary of the other answers, and updated links:

deepSNV (abstract) (paper)
EBCall (abstract) (paper)
GATK SomaticIndelDetector (note: only available after an annoying sign-up and login)
Isaac variant caller (abstract) (paper)
joint-snv-mix (abstract) (paper)
LoFreq (abstract) (paper) (call on tumor & normal separately and then use a filter to derive somatic events)
MutationSeq (abstract) (paper)
MuTect (abstract) (paper) (note: only available after an annoying sign-up and login)
QuadGT (for calling single-nucleotide variants in four sequenced genomes comprising a normal-tumor pair and the two parents)
samtools mpileup - by piping BCF format output from this to bcftools view and using the '-T pair' option
Seurat (abstract) (paper)
Shimmer (abstract) (paper)
SolSNP (call on tumor & normal separately and then compare to identify somatic events)
SNVMix (abstract) (paper)
SOAPsnv
SomaticCall (manual)
SomaticSniper (abstract) (paper)
Strelka (abstract) (paper)
VarScan2 (abstract) (paper)
Virmid (abstract) (paper)

For a much more general discussion of variant calling (not necessarily somatic or limited to SNVs/InDels) check out this thread: What Methods Do You Use For In/Del/Snp Calling?

Some papers describing comparisons of these callers:

The ICGC-TCGA DREAM Mutation Calling challenge has a component on somatic SNV calling.

This paper used validation data to compare popular somatic SNV callers:

Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers

You'll need to update the link to MuTect. Broad Institute has begun to put portable versions of their tools on Github, like the latest release of MuTect. The Genome Institute at WashU has been using Github for a while, but portable versions of their tools can be found here and here.

In fact, somatic calling is far more complicated than we imagine:

To rehash/expand on what Dan said, if you're sequencing normal tissue, you generally expect to see single-nucleotide variant sites fall into one of three bins: 0%, 50%, or 100%, depending on whether they're heterozygous or homozygous.

With tumors, you have to deal with a whole host of other factors:

Normal admixture in the tumor sample: lowers variant allele fraction (VAF)
Tumor admixture in the normal - this occurs when adjacent normals are used, or in hematological cancers, when there is some blood in the skin normal sample
Subclonal variants, which may occur in any fraction of the cells, meaning that your het-site VAF might be anywhere from 50% down to sub-1%, depending on the tumor's clonal architecture and the sensitivity of your method
Copy number variants, cn-neutral loss of heterozygosity, or ploidy changes, all of which again shift the expected distribution of variant fractions
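How these factors pull the variant allele fraction away from 50% can be illustrated with a simplified mixture model: the expected VAF is the share of mutated copies among all copies at the locus. The formula below is a common textbook simplification, not something taken from the thread. It assumes the normal cells carry no mutated copy and that all tumor cells share the same copy number; the `expected_vaf` helper and its parameter names are my own.

```python
def expected_vaf(purity, tumor_cn=2, mutated_copies=1, ccf=1.0, normal_cn=2):
    """Expected VAF of a somatic variant in a bulk tumor sample.

    purity:          fraction of tumor cells in the sample (0..1)
    tumor_cn:        total copy number at the locus in tumor cells
    mutated_copies:  copies carrying the variant in a mutated tumor cell
    ccf:             fraction of tumor cells carrying the variant (0..1)
    normal_cn:       copy number at the locus in normal cells
    """
    if not 0 <= purity <= 1 or not 0 <= ccf <= 1:
        raise ValueError("purity and ccf must lie between 0 and 1")
    total_copies = purity * tumor_cn + (1 - purity) * normal_cn
    if total_copies == 0:
        raise ValueError("no copies of the locus in the sample")
    return purity * ccf * mutated_copies / total_copies


print(expected_vaf(1.0))                 # pure tumor, clonal het variant
print(expected_vaf(0.6))                 # 40% normal admixture
print(expected_vaf(0.6, ccf=0.2))        # subclonal variant in 20% of tumor cells
print(expected_vaf(0.6, tumor_cn=4))     # one mutated copy in a 4-copy region
```

Even this toy model shows why a fixed "around 50%" expectation breaks down: each factor alone can push a true variant well below the range a germline genotyper expects.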

These, and other factors, make calling somatic variants difficult and still an area that is being heavily researched. If someone tells you that somatic variant calling is a solved problem, they probably have never tried to call somatic variants.

Sounds like somatic / tumor variant calling is something that will be solved by improvements at the wet lab side ( single cell selection / amplification / sequencing ) . Rather than at the computational side.

Well, single cell has a role to play (and would have more of one if WGA wasn't so lossy), but realistically, you can't sequence billions of cells from a tumor individually. Bulk sequencing still is going to have a role for quite a while.

Hell germ line calling isn't even a solved problem. Still get lots of false positives (and false negatives). It just tends to work so well that it is hard to improve it much except by making it faster, less memory intensive, etc

Solved was the wrong word. I just meant improved. There is only so much you can do at the computational side. Wet lab also has its part to play.

A germline variant caller generally has a ploidy-based genotyping algorithm built in to part of the algorithm/pipeline. I believe, IIRC, the GATK UnifiedGenotyper for instance does both variant calling and then genotype calling. So to call a genotype for a variant it is expecting a certain number of reads to support the alternative allele. When working with somatic variants all of the assumptions about how many reads you expect with a variant at a position to distinguish between true and false positives are no longer valid. Except for fixed mutations throughout the tumor population only some proportion of cells will hold a somatic variation. You also typically have some contamination from normal non-cancerous cells. Add in complications from significant genomic instability with lots of copy number variations and such and you have a need for a major change in your model for calling variation while minimizing artifactual calls. So you have a host of other programs that have been developed specifically for looking at somatic variation in tumor samples.

A paper:

Comparison of somatic mutation calling methods in amplicon and whole exome sequence data

It was published by QIAGEN.

High-throughput sequencing is rapidly becoming common practice in clinical diagnosis and cancer research. Many algorithms have been developed for somatic single nucleotide variant (SNV) detection in matched tumor-normal DNA sequencing. Although numerous studies have compared the performance of various algorithms on exome data, there has not yet been a systematic evaluation using PCR-enriched amplicon data with a range of variant allele fractions. The recently developed gold standard variant set for the reference individual NA12878 by the NIST-led “Genome in a Bottle” Consortium (NIST-GIAB) provides a good resource to evaluate admixtures with various SNV fractions.

Using the NIST-GIAB gold standard, we compared the performance of five popular somatic SNV calling algorithms (GATK UnifiedGenotyper followed by simple subtraction, MuTect, Strelka, SomaticSniper and VarScan2) for matched tumor-normal amplicon and exome sequencing data.

Nevertheless, detecting somatic mutations is still challenging, especially for low-allelic-fraction variants caused by tumor heterogeneity, copy number alteration, and sample degradation.

We used QIAGEN’s GeneRead DNAseq Comprehensive Cancer Gene Panel (CCP, Version 1) for enrichment and library construction in triplicate.

QIAGEN’s GeneRead DNAseq Comprehensive Cancer Gene Panel (Version 1) was used to amplify the target region of interest (124 genes, 800 Kb).

When analyzing different types of data, use of different algorithms may be appropriate.

DNA samples of NA12878 and NA19129 were purchased from Coriell Institute. Sample mixtures were created based on the actual amplifiable DNA in each sample, resulting in 0%, 8%, 16%, 36%, and 100% of NA12878 sample mixed in the NA19129 sample, respectively. We treated the mixed samples at 8%, 16%, 36%, and 100% as the virtual tumor samples and the 0% as the virtual normal sample.

The algorithms of the five tools are:

1. NaiveSubtract — SNVs were called separately from virtual tumor and normal samples using GATK UnifiedGenotyper [22]. For exome sequencing data, reads were already mapped, locally realigned and recalibrated by the 1,000 Genomes Project. So SNVs were directly called on the BAM files using GATK Unified Genotyper. Then, SNVs detected in the virtual normal sample were removed from the list of SNVs detected in the virtual tumor sample, leaving the “somatic” SNVs.

2. MuTect — MuTect is a method developed for detecting the most likely somatic point mutations in NGS data using a Bayesian classifier approach. The method includes pre-processing aligned reads separately in tumor and normal samples and post-processing resulting variants by applying an additional set of filters. We ran MuTect under the High-Confidence mode with its default parameter settings. We disabled the “Clustered position” filter and the “dbSNP filter” for the amplicon sequencing reads, and we disabled the “dbSNP filter” for the exome sequencing.

3. SomaticSniper — SomaticSniper calculates the Bayesian posterior probability of each possible joint genotype across the normal and cancer samples. We tuned the software’s parameters to increase sensitivity and then filtered raw results using a Somatic Score cut-off of 20 to improve specificity.

4. Strelka — Strelka reports the most likely genotype for tumor and normal samples based on a Bayesian probability model. Post-calling filters built into the software are based on factors such as read depth, mismatches, and overlap with indels. We skipped depth filtration for exome and amplicon sequencing data as recommended by the Strelka authors. For the amplicon sequencing reads, we set the minimum MAPQ score at 17 for consistency with the defaults in GATK UnifiedGenotyper. We used variants passing Strelka post-calling filters for analysis.

5. VarScan2 — VarScan2 performs analyses independently on pileup files from the tumor and normal samples to heuristically call a genotype at positions achieving certain thresholds of coverage and quality. Then, sites of the genotypes not matched in tumor and normal samples are classified into somatic, germline, or ambiguous groups using Fisher’s exact test. We generated the pileup files using SAMtools mpileup command.
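To make the Fisher's exact test step in the last item concrete, here is a loose illustration of the idea, not VarScan2's actual implementation. It compares reference and variant read counts between tumor and normal with a two-sided Fisher's exact test written with the standard library only. The `classify_site` rules and the thresholds `alpha` and `min_vaf` are hypothetical simplifications.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are at most as likely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def classify_site(t_ref, t_alt, n_ref, n_alt, alpha=0.05, min_vaf=0.1):
    """Toy classification of one site from tumor/normal read counts."""
    if t_ref + t_alt == 0 or n_ref + n_alt == 0:
        raise ValueError("both samples need coverage at the site")
    tumor_has_variant = t_alt / (t_ref + t_alt) >= min_vaf
    normal_has_variant = n_alt / (n_ref + n_alt) >= min_vaf
    if not tumor_has_variant and not normal_has_variant:
        return "reference"
    if tumor_has_variant and normal_has_variant:
        return "germline"
    p_value = fisher_exact_two_sided(t_ref, t_alt, n_ref, n_alt)
    if tumor_has_variant and p_value < alpha:
        return "somatic"
    return "ambiguous"


print(classify_site(30, 20, 50, 0))   # strong tumor-only signal
print(classify_site(25, 25, 24, 26))  # variant in both samples
print(classify_site(17, 3, 20, 0))    # weak tumor-only signal, low depth
```

The third call shows why read depth matters: three variant reads out of twenty look like a tumor-only variant, but the test cannot separate them from chance at this coverage.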

The compatibility of the output VCF files between different methods as well as the NIST-GIAB gold standard was examined using bcbio.variation tools and manual inspection. The reported SNP call representations between files are comparable to each other.

Source paper: http://www.biomedcentral.com/1471-2164/15/244