06

picnic对拷贝数变异检测芯片数据进行分析

这里的拷贝数变异检测芯片指的是Affymetrix Genome-Wide Human SNP Array 6.0

cel数据,处理成segment及genotype数据

一、程序安装
这本来是一个matlab程序,但是有linux版本,需要安装matlab编译环境
   clipboard
下载解压之后首先安装matlab环境:
./MCRInstaller.bin -console
因为我的服务器没有界面,所以我用了-console执行安装程序,很简单就安装好了
clipboard2
其中,cdf目录下面是我们从http://www.affymetrix.com 里面下载的Library FilesGenome-Wide Human SNP Array 6.0 (zip, 246 MB)
celConverter目录下面是一个java程序,可以把芯片出来的cel文件转为flat file
其中Matlab_running 是我执行安装的matlab环境
其余两个脚本run_HMM.sh  run_preprocessing.sh是picnic程序的第二步和第三步
二、输入数据准备
我随便找了两个snp6.0芯片的raw data
-rw-rw-r-- 1 jmzeng jmzeng 66M Dec 30 06:30 GSM1949207.CEL
-rw-rw-r-- 1 jmzeng jmzeng 66M Sep  9 11:08 GSM887898.CEL
三、程序使用

程序主要分为三个步骤来实现
Step 1: Convert the binary *.cel file to a flat file
Step 2: Normalise and estimate the ploidy of the genome and the level of normal contamination
Step 3: Segment the data and produce the genotype
第一步:
暂时还不需要matlab工作环境
示例的code是:java -Xmx2G -jar CelFileConverter.jar -m Snp6FeatureMappings.csv -c 'cdf_file_including_path' -s 'directory name
of cel files' -t rootDir/outdir/raw
我的code是:

jmzeng@ubuntu:/home/jmzeng/bio-soft/picnic/c_code/celConverter$ java -jar CelFileConverter.jar -m Snp6FeatureMappings.csv -c ../cdf/GenomeWideSNP_6.Full.cdf -s ../celFiles/ -t ../result/ 
Loading CDF file: ../cdf/GenomeWideSNP_6.Full.cdf
Loading feature mapping file: Snp6FeatureMappings.csv
Processing CEL file: /home/jmzeng/bio-soft/picnic/c_code/celConverter/../celFiles/GSM1949207.CEL
Created feature intensity file: /home/jmzeng/bio-soft/picnic/c_code/celConverter/../result/GSM1949207.feature_intensity
Processing CEL file: /home/jmzeng/bio-soft/picnic/c_code/celConverter/../celFiles/GSM887898.CEL
Created feature intensity file: /home/jmzeng/bio-soft/picnic/c_code/celConverter/../result/GSM887898.feature_intensity

然后在输出目录就有了一个feature_intensity后缀的文本文件,约110M
-rw-rw-r-- 1 jmzeng jmzeng 105M Dec 30 06:35 GSM1949207.feature_intensity
-rw-rw-r-- 1 jmzeng jmzeng 109M Dec 30 06:35 GSM887898.feature_intensity
第二步:
这里需要用matlab啦,但是这里经常出现库的问题!!!
示例的code是
sh run_preprocessing.sh mcr_dir cell_name info_dir feature_int_dir normalised_outdir outdir sample_type in_pi
我的code是:
jmzeng@ubuntu:/home/jmzeng/bio-soft/picnic/c_code$ sh run_preprocessing.sh Matlab_running/v710/ GSM1949207 info/ result/ result/output result/
------------------------------------------
Setting up environment variables
---
LD_LIBRARY_PATH is .:Matlab_running/v710//runtime/glnxa64:Matlab_running/v710//bin/glnxa64:Matlab_running/v710//sys/os/glnxa64:Matlab_running/v710//sys/java/jre/glnxa64/jre/lib/amd64/native_threads:Matlab_running/v710//sys/java/jre/glnxa64/jre/lib/amd64/server:Matlab_running/v710//sys/java/jre/glnxa64/jre/lib/amd64/client:Matlab_running/v710//sys/java/jre/glnxa64/jre/lib/amd64
My Own Exception: Fatal error loading library /home/jmzeng/bio-soft/picnic/c_code/Matlab_running/v710/bin/glnxa64/libmwmclmcr.so Error: libXp.so.6: cannot open shared object file: No such file or directory
第三步:
也需要用matlab,库的问题必须解决!!!
我另外一台服务器的结局办法是用v714的matlab
示例的code是
sh run_HMM.sh /nfs/team78pc2/kwl_temp/segments/PICNIC/C/release/Matlab_Compiler_Runtime/v710 A01_CGP_PD3945a.feature_intensity '/nfs/team78pc3/KWL/segments/PICNIC/matlab/C/release/info/' '/nfs/team78pc2/kwl_temp/segments/PICNIC/data/normalized/' '/nfs/team78pc2/kwl_temp/segments/PICNIC/data/' '10' '0.33598' '1.9915' '0.40997'
我的code是:无所谓了,这个服务器不知道怎么回事,总是出现库文件的问题,而这个问题需要root权限,我懒得弄了,我在其它的服务器上面都木有问题的!!!
所以我就换了MATLAB,毕竟,这个软件本来就是matlab版本的,第一步照旧,第二步在matlab里面运行!
我这里只用GSM1949207做测试吧!!!
Genomic DNA was extracted from saliva, peripheral blood, or fibroblast cell lines using the QIAamp DNA Blood Mini Kit or QIAamp DNA Mini Kit. DNA quality and quantity was assessed using a Nanodrop Spectrophotometer and agarose gel electrophoresis.
打开matlab,进入picnic目录
重复第二步,输入:
preprocessing('GSM1949207.feature_intensity','info\','result\raw\','result\output\','result\')
需要7个参数,我这个是cell lines数据,所以后面两个参数省略不写!info文件夹自己下载放在picnic目录,其中result\raw 存放你第一步的结果文件!
这一步运行,会比较久!
3
同时可以看到程序在我的result目录里面新增加了两个目录用来存放结果,不过这一步的结果还是中间文件,就不解释了!
然后再重复第三步,输入:

 HMM('GSM1949207.feature_intensity','info\','result\output\','result\',10,0,2.0221,0.40997)

这个软件的参数很没规律,最好不要像说明书那些用output做文件夹名字,不然,很容易出错。
这一步好像也很耗时间!
4

重要的结果有两个:
56
06

拷贝数变异检测芯片介绍

这里的拷贝数变异检测芯片指的是Affymetrix Genome-Wide Human SNP Array 6.0

cel数据,需要处理成segment及genotype数据
这个芯片在TCGA计划里面用的非常多,是标配了。大家只要记住,这是一个跟拷贝数变异检测相关的芯片,而且还可以测一些genotype  
Affymetrix Genome-Wide Human SNP Array 6.0是唯一可以真正将CNP(拷贝数多态性)转化成高分辨率的参考图谱的平台。主要应用领域包括全基因组SNP分型、全基因组CNV分型、全基因组关联 分析、全基因组连锁分析。除了进行基因分型外,还为拷贝数研究和LOH研究提供帮助,从而能够进行:UPD检测、亲子鉴定、异常的亲代起源分析(针对 UPD和缺失)、纯合性分析、血缘关系鉴定。
SNP Array 6.0是昂飞公司继Mapping10k、100k、500k和SNP5.0芯片后推出的新一代SNP芯片。在一张芯片上可以分析一个样本906,600 个SNP的基因型, 大约有482,000个SNP来自于前代产品500K和SNP5.0芯片。剩下424,000个SNP包括了来源于国际HapMap计划中的标签 SNP,X,Y染色体和线粒体上更具代表性的SNP,以及来自于重组热点区域和500K芯片设计完成后新加入dbSNP数据库的SNP。该芯片同时含 946,000个非多态性CNV探针,用于检测拷贝数变异,其中202,000个用于检测5677个已知拷贝数变异区域的探针,这些区域来源于多伦多基因 组变异体数据库。该数据库中每隔3,182个非重叠片段区域分别用61个探针来检测。除了检测这些已知的拷贝数多态区域,还有超过744,000个探针平 均分配到整个基因组上,用来发现未知的拷贝数变异区域。SNP和CNV两种探针高密度且均匀地分布在整个基因组,作为拷贝数变异和杂合性缺失(LOH)检 测的工具来发现微小的染色体增加和缺失。为广大生命科学研究者提高发现复杂疾病相关基因的可能提供了强有力的工具。
通过与哈佛大学合办的Broad研究所合作,SNP6.0芯片在数据准确性和一致性方面达到了新的高度。相应推出的Genotyping Console用来处理SNP6.0芯片数据和全基因组遗传分析及质量控制。

产品特点:

1.涵盖超过1,800,000个遗传变异标志物:包括超过906,600个SNP和超过946,000个用于检测拷贝数变化(CNV,Copy Number Variation)的探针;

2.SNP和CNV两种探针高密度且均匀地分布在整个基因组,不仅可以用于SNP基因精确分型,还可用于拷贝数变异CNV的研究;

3.744,000个探针平均分配到整个基因组上,用来发现未知的拷贝数变异区域;

4.可用于Copy-neutral LOH/UPD检测,亲子鉴定,纯合性分析、血缘关系鉴定、遗传病或其它疾病的研究。

参考:http://www.biomart.cn/specials/cnv2014/article/84169

在NCBI的GEO数据库里面可以查到这个芯片,已经有一万多个样本数据啦!
图中第一个是CCLE计划的近千个样本,可能是定制化了的snp6.0芯片吧
clipboard
使用这个芯片数据来发文章的非常多,见列表:http://media.affymetrix.com/support/technical/other/snp6_array_publications.pdf
还有一篇2010-nature文章讲了如何用picnic来研究cnv,http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3145113/
也有一篇2010年的文章提出了新的软件来分析这个芯片cnv数据http://bioinformatics.oxfordjournals.org/content/26/11/1395.long
实现同样功能的软件,非常之多,还有一个R的bioconductor系列的包
clipboard2
随便进去都可以找到很多raw data,可以自己进行分析的!

 

05

寻找somatic突变的软件大合集

         其实somatic突变很容易理解,你测同一个人的正常组织和癌症组织,然后比较对这两个样本数据call出来的snp位点
         只存在癌症组织数据里面的snp位点就是somatic突变,在两个样本都存在的snp位点就是germline的突变,不过一般大家研究的都是somatic突变。
          当然,理论上是很简单,但是那么多统计学家要吃饭呀,肯定得把这件事搞复杂才行,所以就有了非常多的somatic突变 calling的软件,开个玩笑哈,主要是因为我们的测序并不是对单个细胞测序,我们通常意义取到的正常组织和癌症组织都不是纯的,所以会有很多关于这一点的讨论。
        正好我看到了一篇帖子,收集了大部分比较出名的做somatic mutation calling的软件,当然,我只用过mutect和varscan。

来自于:https://www.biostars.org/p/19104/

Here are a few more, a summary of the other answers, and updated links:

For a much more general discussion of variant calling (not necessarily somatic or limited to SNVs/InDels) check out this thread: What Methods Do You Use For In/Del/Snp Calling?

Some papers describing comparisons of these callers:

The ICGC-TCGA DREAM Mutation Calling challenge has a component on somatic SNV calling.

This paper used validation data to compare popular somatic SNV callers:

Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers

You'll need to update the link to MuTect. Broad Institute has begun to put portable versions of their tools on Github, like thelatest release of MuTectThe Genome Institute at WashU has been using Github for a while, but portable versions of their tools can be found here and here.

其实somatic的calling远比我们想象的要复杂:

To rehash/expand on what Dan said, if you're sequencing normal tissue, you generally expect to see single-nucleotide variant sites fall into one of three bins: 0%, 50%, or 100%, depending on whether they're heterozygous or homozygous.

With tumors, you have to deal with a whole host of other factors:

  1. Normal admixture in the tumor sample: lowers variant allele fraction (VAF)
  2. Tumor admixture in the normal - this occurs when adjacent normals are used, or in hematological cancers, when there is some blood in the skin normal sample
  3. Subclonal variants, which may occur in any fraction of the cells, meaning that your het-site VAF might be anywhere from 50% down to sub-1%, depending on the tumor's clonal architecture and the sensitivity of your method
  4. Copy number variants, cn-neutral loss of heterozygosity, or ploidy changes, all of which again shift the expected distribution of variant fractions

These, and other factors, make calling somatic variants difficult and still an area that is being heavily researched. If someone tells you that somatic variant calling is a solved problem, they probably have never tried to call somatic variants.

Sounds like somatic / tumor variant calling is something that will be solved by improvements at the wet lab side ( single cell selection / amplification / sequencing ) . Rather than at the computational side.

Well, single cell has a role to play (and would have more of one if WGA wasn't so lossy), but realistically, you can't sequence billions of cells from a tumor individually. Bulk sequencing still is going to have a role for quite a while.

Hell germ line calling isn't even a solved problem. Still get lots of false positives (and false negatives). It just tends to work so well that it is hard to improve it much except by making it faster, less memory intensive, etc

Solved was the wrong word. I just meant improved. There is only so much you can do at the computational side. Wet lab also has its part to play.

A germline variant caller generally has a ploidy-based genotyping algorithm built in to part of the algorithm/pipeline. I believe, IIRC, the GATK UnifiedGenotyper for instance does both variant calling and then genotype calling. So to call a genotype for a variant it is expecting a certain number of reads to support the alternative allele. When working with somatic variants all of the assumptions about how many reads you expect with a variant at a position to distinguish between true and false positives are no longer valid. Except for fixed mutations throughout the tumor population only some proportion of cells will hold a somatic variation. You also typically have some contamination from normal non-cancerous cells. Add in complications from significant genomic instability with lots of copy number variations and such and you have a need for a major change in your model for calling variation while minimizing artifactual calls. So you have a host of other programs that have been developed specifically for looking at somatic variation in tumor samples.

一篇文献:

Comparison of somatic mutation calling methods in amplicon and whole exome sequence data

是qiagen公司发的

High-throughput sequencing is rapidly becoming common practice in clinical diagnosis and cancer research. Many algorithms have been developed for somatic single nucleotide variant (SNV) detection in matched tumor-normal DNA sequencing. Although numerous studies have compared the performance of various algorithms on exome data, there has not yet been a systematic evaluation using PCR-enriched amplicon data with a range of variant allele fractions. The recently developed gold standard variant set for the reference individual NA12878 by the NIST-led “Genome in a Bottle” Consortium (NIST-GIAB) provides a good resource to evaluate admixtures with various SNV fractions.

Using the NIST-GIAB gold standard, we compared the performance of five popular somatic SNV calling algorithms (GATK UnifiedGenotyper followed by simple subtraction, MuTect, Strelka, SomaticSniper and VarScan2) for matched tumor-normal amplicon and exome sequencing data.

Nevertheless, detecting somatic mutations is still challenging, especially for low-allelic-fraction variants caused by tumor heterogeneity, copy number alteration, and sample degradation

We used QIAGEN’s GeneRead DNAseq Comprehensive Cancer Gene Panel (CCP, Version 1) for enrichment and library construction in triplicate。

QIAGEN’s GeneRead DNAseq Comprehensive Cancer Gene Panel (Version 1) was used to amplify the target region of interest (124 genes, 800 Kb).

When analyzing different types of data, use of different algorithms may be appropriate.

DNA samples of NA12878 and NA19129 were purchased from Coriell Institute. Sample mixtures were created based on the actual amplifiable DNA in each sample, resulting in 0%, 8%, 16%, 36%, and 100% of NA12878 sample mixed in the NA19129 sample, respectively.We treated the mixed samples at 8%, 16%, 36%, and 100% as the virtual tumor samples and the 0% as the virtual normal sample.

五个软件的算法是:

1. NaiveSubtract — SNVs were called separately from virtual tumor and normal samples using GATK UnifiedGenotyper [22]. For exome sequencing data, reads were already mapped, locally realigned and recalibrated by the 1,000 Genomes Project. So SNVs were directly called on the BAM files using GATK Unified Genotyper. Then, SNVs detected in the virtual normal sample were removed from the list of SNVs detected in the virtual tumor sample, leaving the “somatic” SNVs.

2. MuTect — MuTect is a method developed for detecting the most likely somatic point mutations in NGS data using a Bayesian classifier approach. The method includes pre-processing aligned reads separately in tumor and normal samples and post-processing resulting variants by applying an additional set of filters. We ran MuTect under the High-Confidence mode with its default parameter settings. We disabled the “Clustered position” filter and the “dbSNP filter” for the amplicon sequencing reads, and we disabled the “dbSNP filter” for the exome sequencing.

3. SomaticSniper — SomaticSniper calculates the Bayesian posterior probability of each possible joint genotype across the normal and cancer samples. We tuned the software’s parameters to increase sensitivity and then filtered raw results using a Somatic Score cut-off of 20 to improve specificity.

4. Strelka — Strelka reports the most likely genotype for tumor and normal samples based on a Bayesian probability model. Post-calling filters built into the software are based on factors such as read depth, mismatches, and overlap with indels. We skipped depth filtration for exome and amplicon sequencing data as recommended by the Strelka authors. For the amplicon sequencing reads, we set the minimum MAPQ score at 17 for consistency with the defaults in GATK UnifiedGenotyper. We used variants passing Strelka post-calling filters for analysis.

5. VarScan2 — VarScan2 performs analyses independently on pileup files from the tumor and normal samples to heuristically call a genotype at positions achieving certain thresholds of coverage and quality. Then, sites of the genotypes not matched in tumor and normal samples are classified into somatic, germline, or ambiguous groups using Fisher’s exact test. We generated the pileup files using SAMtools mpileup command.

The compatibility of the output VCF files between different methods as well as the NIST-GIAB gold standard was examined using bcbio.variation tools and manual inspection. The reported SNP call representations between files are comparable to each other.

来自于文献:http://www.biomedcentral.com/1471-2164/15/244

05

使用oncotator做突变注释

功能:vcf格式突变数据进一步注释成maf格式

做过癌症数据分析的童鞋都知道,TCGA里面用maf格式来记录突变!那么maf格式的数据是如何得来的呢,我们都知道,做完snp-calling一般是得到vcf格式的突变记录数据文件,然后再用annovar或者其它蛋白结构功能影响预测软件注释一下,还远达不到maf的近100条记录。

而大名鼎鼎的broad institute就规定了maf格式的突变注释文件,他就是利用了十几个常见的已知数据库来注释我们得到的vcf突变记录,通常是对somatic的突变才注释成maf格式的数据!
大名鼎鼎的broadinstitute出品的突变注释工具:http://www.ncbi.nlm.nih.gov/pubmed/25703262
本身也是一个在线工具:
集成了下面所有的分析资源
而且还提供了API

Genomic Annotations

  • Gene, transcript, and functional consequence annotations using GENCODE for hg19.
  • Reference sequence around a variant.
  • GC content around a variant.
  • Human DNA Repair Gene annotations from Wood et al.

Protein Annotations

  • Site-specific protein annotations from UniProt.
  • Functional impact predictions from dbNSFP.

Cancer Variant Annotations

Non-Cancer Variant Annotations

因为要下载的数据有点多,我这里就不用自己的电脑测试了,安装过程也很简单的!