生信菜鸟团 » cnv

Inferring CNV from single-cell transcriptome data

ulwvfje — Sat, 17 Feb 2018 10:17:02 +0000


The papers below all come from Aviv Regev's lab, and each of them uses single-cell transcriptome data to infer CNV.

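The 2014 Science paper on GBM

First is the 2014 Science paper on GBM (PMID: 24925914), which introduced this analysis and also used the CCLE database to validate its reliability.

The paper's own single-cell transcriptome libraries were prepared with the SMART-seq protocol, and the data are available as GSE57872: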

430(576) single glioblastoma cells isolated from 5 individual tumors
102(192) single cells from gliomasphere cells lines

This single-cell library preparation is somewhat dated by now:

SMART-seq protocol was implemented to generate single cell full length transcriptomes (modified from Shalek, et al Nature 2013) and sequenced using 25 bp paired end reads. Single cell cDNA libraries for MGH30 were resequenced using 100 bp paired end reads to allow for isoform and splice junction reconstruction (96 samples, annotated MGH30L).

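Because of this, the authors filtered the cells fairly strictly. You can download their processed expression matrix directly, or download the raw reads and run an RNA-seq pipeline yourself.

This paper is where the CNV formula was first proposed; it appeared as an image in the original post.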
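As a rough reconstruction, based only on the description quoted later in this post ("a moving average of 100 genes over the genomically-ordered list of genes"), the estimate for cell k at the i-th gene can be written as below. The exact window bounds (50 genes on each side, 101 in total) and the symbol names are my assumptions, so check them against the paper's supplementary material before relying on them.

```latex
\mathrm{CNV}_k(i) = \frac{1}{101} \sum_{j=i-50}^{i+50} E_k(O_j)
```

Here E_k(O_j) stands for the centered expression of gene O_j in cell k, and O_j is the j-th gene when genes are sorted by genomic position.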

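The 2016 Science paper on melanoma

Next is the 2016 Science paper on melanoma (PMID: 27124452), which also infers CNV from single-cell transcriptome data. Its data are available as GSE72056. This time the libraries were prepared with Smart-seq2, for 4645 cells in total; the expression matrix alone is about 71 Mb. The raw sequencing data are in dbGaP and can only be downloaded after an access request is approved.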

Supplementary file	Size	Download	File type/resource
GSE72056_melanoma_single_cell_revised_v2.txt.gz	71.6 Mb	(ftp)(http)	TXT

we applied single-cell RNA sequencing (RNA-seq) to 4645 single cells isolated from 19 patients, profiling malignant, immune, stromal, and endothelial cells.

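Notably, the authors also ran bulk RNA-seq on 6 cases before and after treatment with RAF or RAF+MEK inhibitors, giving 12 datasets in total, available as GSE77940.

By this paper the formula had changed slightly; it also appeared as an image in the original post.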

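The 2017 Cell paper on head and neck cancer

Then comes the Cell paper on head and neck cancer, published in 2017: Single-Cell Transcriptomic Analysis of Primary and Metastatic Tumor Ecosystems in Head and Neck Cancer. The sequencing is summarized as follows: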

We profiled transcriptomes of ∼6,000 single cells from 18 head and neck squamous cell carcinoma (HNSCC) patients, including five matched pairs of primary tumors and lymph node metastases.

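Whole-exome sequencing (WES) and targeted genotyping (SNaPshot) data were also generated for these patients, but they are deposited under phs001474.v1.p1 and are not convenient to download.

The single-cell libraries were prepared with Smart-seq2, and all the data are available as GSE103322. The compressed expression matrix alone is 86.0 Mb:

GSE103322_HNSCC_all_data.txt.gz | 86.0 Mb |

Download links: (ftp)(http)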

Validation with CCLE data

The 2014 Science paper on GBM (PMID: 24925914) states:

We downloaded the CCLE gene-centric RMA-normalized Affymetrix data (http://www.broadinstitute.org/ccle/), and centered the expression of each gene across all cell lines at zero.

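A quick registration is required before downloading: https://portals.broadinstitute.org/ccle/users/sign_in

In principle, you should be able to reproduce the following figure:

[Figure: CNV profiles from the SNP6 array and from RNA-seq are highly correlated] (http://www.bio-info-trainee.com/wp-content/uploads/2018/02/highly-correlated-CNV-by-SNP6array-and-RNA-seq.png)

It shows that CNV inferred from transcriptome data differs little from the SNP 6.0 array results.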

Validation with the GTEx database

To compare these patterns to an external reference of normal cells we downloaded RNA-Seq data from the GTEX portal (http://www.gtexportal.org/; gene read counts file from Jan. 2013), and estimated CNV values as above: we normalized the read counts into log2(TPM+1), averaged all brain samples, restricted the data to the ~6,000 analyzed genes, subtracted for each gene the average normalized expression from the GBM single-cell data (this step is comparable to the centering of the single cell data) and then used a moving average of 100 genes over the genomically-ordered list of genes to define CNV-cont.
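The steps in that quote can be sketched in a few lines of plain Python. This is only a minimal sketch under stated assumptions, not the authors' code: the function names are mine, gene lengths are assumed to be available for the TPM conversion, the inputs are assumed to be already restricted to the analyzed genes and sorted by genomic position, and the window is truncated at the ends of the gene list (the quote does not say how the edges were handled).

```python
import math


def log2_tpm_plus_one(counts, lengths_kb):
    """Convert one sample's read counts to log2(TPM + 1).

    lengths_kb holds gene lengths in kb (an assumption: TPM needs them).
    """
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [math.log2(r / total * 1e6 + 1) for r in rates]


def moving_average(values, window=100):
    """Centered moving average over window // 2 genes on each side.

    Windows are simply truncated near the two ends of the list.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def cnv_profile(reference_samples, single_cell_mean, window=100):
    """Average the reference samples, subtract the single-cell mean, smooth.

    reference_samples: list of per-sample log2(TPM + 1) vectors.
    single_cell_mean: per-gene average from the single-cell data.
    Both must list the same genes in the same genomic order.
    """
    n = len(reference_samples)
    averaged = [sum(col) / n for col in zip(*reference_samples)]
    centered = [a - m for a, m in zip(averaged, single_cell_mean)]
    return moving_average(centered, window)
```

The same `moving_average` function can be applied to each cell's centered expression vector to get a per-cell profile.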

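Summary

All of the papers above provide downloadable expression matrices, so the formulas given in their supplementary materials are enough to reproduce the whole workflow.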

Finding SCNAs with GISTIC from multiple segment files

ulwvfje — Thu, 19 May 2016 12:13:36 +0000

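This software is used heavily in the TCGA project. Its purpose is simple: you have studied many cancer samples and obtained copy number information for each of them from arrays. Array results usually come as segments, which can be read as CNV regions. GISTIC combines the samples in one analysis, looks for somatic CNVs, and annotates them with gene information.

There are two hard parts: installing the MATLAB runtime environment on Linux, and preparing the input files.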

1. Installation

Installation guide: ftp://ftp.broadinstitute.org/pub/GISTIC2.0/INSTALL.txt

Software homepage: http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=216&p=t

Paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218867/

Download: wget ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTIC_2_0_22.tar.gz

The documentation is very detailed: ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm

After unpacking, you have to install the MATLAB Compiler Runtime yourself, which is quite a hassle!

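2. Preparing the input data

The input is the segment files produced by processing SNP 6.0 array raw data with software such as PICNIC or Birdseed.

Merge the segments from all samples into one input file, then add the sample list and some information about the array. With the example files as a guide, the input files are easy to put together!

arraylistfile lists all the samples included in this GISTIC run; usually one cancer type is run together.

cnvfile is optional.

segmentationfile.txt holds the segment information obtained from the SNP 6.0 (or similar) array, with the results for all samples merged together. A single sample typically has around 1,000 segments.

markersfile.txt depends mainly on your array platform. For the Affymetrix SNP 6.0 array it has more than 900,000 lines, one per probe.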

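The test data bundled with the software (line counts below) has 106 samples and a little over 20,000 segments in total, which means an average of only about 200 per sample, so it may be the output of PICNIC on SNP 6.0 array data. However, its markersfile.txt clearly lists only a little over 100,000 markers (that is, probes), so it is probably not a SNP 6.0 array after all.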

106 arraylistfile.txt

12942 cnvfile.txt

115593 markersfile.txt

20521 segmentationfile.txt
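The merging step and the per-sample segment count discussed above can be sketched as follows. This assumes the six-column, tab-delimited layout I understand the GISTIC documentation to describe (sample, chromosome, start, end, number of markers, segment value); the header labels in `SEG_COLUMNS` and both function names are purely illustrative, so compare against the bundled example file before using anything like this.

```python
import csv
import io
from collections import Counter

# Illustrative header labels only; check them against the example segmentationfile.txt.
SEG_COLUMNS = ["Sample", "Chromosome", "Start", "End", "Num_Markers", "Seg_CN"]


def merge_segments(per_sample_rows):
    """Merge per-sample segments into one tab-delimited table.

    per_sample_rows maps a sample name to a list of
    (chromosome, start, end, n_markers, seg_value) tuples.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(SEG_COLUMNS)
    for sample in sorted(per_sample_rows):
        for chrom, start, end, n_markers, value in per_sample_rows[sample]:
            writer.writerow([sample, chrom, start, end, n_markers, value])
    return out.getvalue()


def segments_per_sample(seg_text):
    """Count how many segments each sample contributes to a merged table."""
    rows = csv.reader(io.StringIO(seg_text), delimiter="\t")
    next(rows)  # skip the header line
    return Counter(row[0] for row in rows)
```

For the bundled test data, 20521 lines divided by 106 samples gives about 194 segments per sample, which is the "about 200" mentioned above.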

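3. Running the program

The run script shipped with the software is written in csh; I changed it to bash.

You also need to edit the MATLAB path and the genome version.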
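To make the pieces that need editing explicit, here is a small sketch that only assembles the command line and does not run anything. It assumes the `gistic2` wrapper mentioned in the 2.0.23 release notes quoted further down. The flags and their values are as I remember them from the example run script shipped with GISTIC, so treat them as assumptions and check your own copy. The install directory and the reference genome file name are hypothetical placeholders.

```python
import os


def build_gistic2_command(gistic_dir, out_dir, seg_file, refgene_file,
                          markers_file=None, array_list_file=None, cnv_file=None):
    """Return the argument list for the gistic2 wrapper (nothing is executed)."""
    cmd = [
        os.path.join(gistic_dir, "gistic2"),
        "-b", out_dir,             # output directory
        "-seg", seg_file,          # merged segmentation file
        "-refgene", refgene_file,  # reference genome file, selects the genome version
    ]
    # Optional inputs are only added when a path is given.
    optional = (("-mk", markers_file), ("-alf", array_list_file), ("-cnv", cnv_file))
    for flag, path in optional:
        if path:
            cmd += [flag, path]
    # Analysis parameters as recalled from the bundled example script (assumption).
    cmd += ["-genegistic", "1", "-smallmem", "1", "-broad", "1",
            "-brlen", "0.5", "-conf", "0.90", "-armpeel", "1",
            "-savegene", "1", "-gcm", "extreme"]
    return cmd


command = build_gistic2_command(
    gistic_dir="/path/to/GISTIC2",           # hypothetical install location
    out_dir="example_results",
    seg_file="examplefiles/segmentationfile.txt",
    refgene_file="refgenefiles/hg19.mat",    # assumed file name
)
print(" ".join(command))
```

Building the list first makes it easy to inspect the paths before handing it to `subprocess.run`.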

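4. Interpreting the output

A brief explanation of the files in the output directory:

all_data_by_genes.txt gives the actual copy number value of each gene (including non-coding RNAs such as miRNA and lncRNA) in each sample.

all_lesions.conf_90.txt lists the identified amplification and deletion peak regions.

all_thresholded.by_genes.txt holds the discretized values: -2 is usually read as loss of two copies (a deep deletion), -1 as loss of one copy, 0 as normal copy number, 1 as gain of one copy (a low-level gain), and 2 as a high-level amplification. Strictly speaking, these are threshold-based calls rather than exact copy counts.

broad_significance_results.txt lists the broad regions with significant copy number alterations.

broad_values_by_arm.txt gives the copy number value of each chromosome arm in each sample.

scores.gistic holds the scores computed by the method.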
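As an example of using these outputs, the sketch below reads a table shaped like all_thresholded.by_genes.txt and reports, for each gene, the fraction of samples with a gain (value above 0) or a loss (value below 0). It assumes the first three columns are annotation columns (gene symbol, locus ID, cytoband) followed by one column per sample; the function name is mine, and the tiny table in the usage example is made up for illustration.

```python
import csv
import io


def alteration_frequency(table_text, n_annotation_cols=3):
    """Return {gene: {"amp": fraction, "del": fraction}} from thresholded calls."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    n_samples = len(header) - n_annotation_cols
    freq = {}
    for row in reader:
        calls = [int(v) for v in row[n_annotation_cols:]]
        freq[row[0]] = {
            "amp": sum(c > 0 for c in calls) / n_samples,
            "del": sum(c < 0 for c in calls) / n_samples,
        }
    return freq


# Made-up calls for four hypothetical samples.
demo = (
    "Gene Symbol\tLocus ID\tCytoband\tS1\tS2\tS3\tS4\n"
    "MYC\t4609\t8q24.21\t2\t1\t0\t0\n"
    "CDKN2A\t1029\t9p21.3\t-2\t-1\t0\t1\n"
)
result = alteration_frequency(demo)
print(result["MYC"])
print(result["CDKN2A"])
```

For a real file, pass the text of all_thresholded.by_genes.txt read from the output directory.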

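I wrote this tutorial around the summer of 2016. It is now the autumn of 2017 and the software has been updated again: it adds support for the hg38 reference genome, and the scripts have been converted from csh to the Bourne shell. Great!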

2.0.23 (2017-03-27) - The markers file input is now optional - if omitted, pseudo-markers will be
generated to satisfy GISTIC's input requirements while ensuring reasonably
uniform coverage of the genome.
- The "broad analysis" of arm-level events has been revised:
(1) arm-level events are now called from a single broad copy number profile
instead of separate amplification and deletion profiles, which had led to
arms counterintuitively called as amplified and deleted on the same sample;
(2) the frequency scores used to determine z-scores and q-values, which excludes
arms with the opposite call from the denominator, are now in a column called
"frequency score". A new column called "frequncy" gives the intuitive frequency
with the denominator inluding arms from all the samples. The analysis results
for the same data will be different from that of previous GISTIC versions.
- Error handling messages have been improved. In particular, many informative
error messages were masked by an "Index exceeds matrix dimensions" error
in the exception handler itself.
- An hg38 reference genome is included with this release.
- The gp_gistic2_from_seg binary executable is now compiled for MCR 8.3
(Matlab R2014a). The source code is compatible with versions of Matlab up to
R2016a, however, the appearance of output graphics may be altered for Matlab
versions R2015a and later.
- This release adds the convenient 'gistic2' wrapper function which sets up
the MCR and passes its command line argument to the executable. Scripts have
been converted from the C-shell to the Bourne shell.

Introduction to the copy number variation array

ulwvfje — Wed, 06 Jan 2016 01:00:08 +0000

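The copy number variation array discussed here is the Affymetrix Genome-Wide Human SNP Array 6.0.

Its raw output is CEL files, which need to be processed into segment and genotype data.

This array is used so heavily in the TCGA project that it is practically standard equipment. The main thing to remember is that it is an array for detecting copy number variation, and that it can also call some genotypes.

According to Affymetrix's product material, the Genome-Wide Human SNP Array 6.0 is the only platform that truly turns CNPs (copy number polymorphisms) into a high-resolution reference map. Its main applications are genome-wide SNP genotyping, genome-wide CNV typing, genome-wide association analysis, and genome-wide linkage analysis. Beyond genotyping, it also supports copy number and LOH studies, which enables UPD detection, paternity testing, parent-of-origin analysis of abnormalities (for UPD and deletions), homozygosity analysis, and kinship determination.

Reference: http://www.affymetrix.com/support/technical/byproduct.affx?product=genomewidesnp_6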

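SNP Array 6.0 is the next-generation SNP array that Affymetrix released after the Mapping 10K, 100K, 500K and SNP 5.0 arrays. A single array genotypes 906,600 SNPs per sample. About 482,000 of these SNPs come from the earlier 500K and SNP 5.0 arrays. The remaining 424,000 include tag SNPs from the International HapMap Project, more representative SNPs on chromosomes X and Y and the mitochondrial genome, SNPs in recombination hotspots, and SNPs added to dbSNP after the 500K array design was completed.

The array also carries 946,000 non-polymorphic CNV probes for detecting copy number variation. Of these, 202,000 target 5,677 known CNV regions taken from the Toronto Database of Genomic Variants; these regions resolve into 3,182 non-overlapping segments, each covered by 61 probes on average. Beyond these known copy number polymorphic regions, more than 744,000 probes are spread evenly across the genome to discover unknown CNV regions. Both probe types are densely and evenly distributed over the whole genome, so the array serves as a tool for detecting copy number variation and loss of heterozygosity (LOH), including small chromosomal gains and losses, and gives researchers a powerful tool for finding genes associated with complex diseases.

Through collaboration with the Broad Institute of MIT and Harvard, the SNP 6.0 array reached a new level of data accuracy and consistency. The accompanying Genotyping Console software handles SNP 6.0 data processing, genome-wide genetic analysis, and quality control.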

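Product features:

1. Covers more than 1,800,000 genetic variation markers: more than 906,600 SNPs and more than 946,000 probes for detecting copy number variation (CNV);

2. SNP and CNV probes are densely and evenly distributed across the whole genome, so the array can be used both for accurate SNP genotyping and for CNV studies;

3. 744,000 probes are spread evenly across the genome to discover unknown CNV regions;

4. Can be used for copy-neutral LOH/UPD detection, paternity testing, homozygosity analysis, kinship determination, and research on genetic or other diseases.

Reference: http://www.biomart.cn/specials/cnv2014/article/84169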

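The array can be looked up in NCBI's GEO database, which already holds data for more than ten thousand samples!

The first entry in the screenshot from the original post is the nearly one thousand samples of the CCLE project, possibly run on a customized SNP 6.0 array.

A great many papers have been published with data from this array; see the list: http://media.affymetrix.com/support/technical/other/snp6_array_publications.pdf

There is also a 2010 Nature paper that describes how to study CNV with PICNIC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3145113/

Another 2010 paper proposes new software for analyzing CNV data from this array: http://bioinformatics.oxfordjournals.org/content/26/11/1395.long

Many tools do the same job, including a whole family of R/Bioconductor packages:

http://www.bioconductor.org/help/search/index.html?q=cnv/

Open almost any entry on the platform page and you will find plenty of raw data to analyze yourself:

http://www.ncbi.nlm.nih.gov/geo/browse/?view=samples&platform=6801

For example: ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1949nnn/GSM1949207/suppl/GSM1949207%5FSB%5FCID0102B%5F071708%2ECEL%2Egz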
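The directory pattern in that example URL is regular enough to generate for other samples. The sketch below rebuilds the supplementary-file directory for a GSM accession and decodes the percent-encoded file name. The function name is mine, and the pattern is inferred from this single example URL, so treat it as an assumption for other accessions.

```python
from urllib.parse import unquote

GEO_SAMPLES_ROOT = "ftp://ftp.ncbi.nlm.nih.gov/geo/samples"


def geo_sample_suppl_dir(gsm):
    """Build the supplementary-file directory URL for a GSM accession.

    The last three digits are replaced by "nnn" to form the parent folder,
    following the pattern seen in the example URL above.
    """
    digits = gsm[len("GSM"):]
    stub = "GSM" + digits[:-3] + "nnn"
    return f"{GEO_SAMPLES_ROOT}/{stub}/{gsm}/suppl/"


encoded_name = "GSM1949207%5FSB%5FCID0102B%5F071708%2ECEL%2Egz"
print(geo_sample_suppl_dir("GSM1949207"))
print(unquote(encoded_name))  # readable file name of the CEL archive
```

The decoded name is what the file will be called once downloaded, which is handy when matching CEL files back to sample IDs.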