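A few days ago, while browsing the Biostars forum, I came across a question about miRNA analysis. The poster had downloaded the raw data for a paper from NCBI's SRA and got stuck on how to process it. The data he listed came from an Ion Torrent sequencer, and since I had never worked on miRNA-seq analysis before, I took some time to teach myself. With my RNA-seq background it was fairly easy to pick up. I am recording my learning process here in the hope that it helps those who come after me.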



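Paper: Comprehensive identification of mutational cancer driver genes across 12 tumor types
This paper compares four approaches to finding cancer driver genes and derives a combined, reliable list of 291 high-confidence driver (HCD) genes.
 The Cancer Gene Census (CGC) database already contains nearly 500 cancer genes.
 Cancer genome studies can yield tens of thousands of somatic mutations, but only a small fraction of them actually drive tumor initiation and progression.
 Moreover, most driver genes are mutated at low frequency, and because tumors are heterogeneous, studies with large numbers of samples are necessary.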
 1、Most common methods identify genes that are mutated more frequently than expected from the background mutation rate (recurrence)
 2、Other methods detect a bias towards the accumulation of functional mutations (FM bias)
 3、Other methods exploit the tendency to sustain mutations in certain regions of the protein sequence (CLUST bias)
 4、Other approaches exploit the overrepresentation of mutations in specific functional residues, such as phosphorylation sites (ACTIVE bias)
 The representative tools for these four approaches are MuSiC, OncodriveFM, OncodriveCLUST and ActiveDriver, respectively.
 In summary, we provide a very reliable list of 291 HCDs and a second one, of 144 CDs, more comprehensive but with an expectedly higher false-positives rate
 One hundred and sixty-five of these candidates are novel findings not included in the CGC.
Chromatin remodeling,
mRNA processing,
Cell signaling/proliferation,
Cell adhesion,
DNA repair/Cell cycle
The 291 HCD genes obtained by combining the four approaches were then compared with the nearly 500 cancer genes already in the Cancer Gene Census (CGC) database.


A recent tool for finding cancer driver genes:
Cancer is thought to result from the accumulation of causal  somatic mutations throughout the lifetime of an individual.
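These cancer-driving mutations mainly affect three classes of genes: 1、oncogenes 2、tumor-suppressor genes 3、stability genes
The first mutations initiate tumorigenesis, and subsequent mutations drive tumor progression.
Identifying these mutations is very helpful for understanding gene function and for designing drug targets.
Distinguishing driver genes from passenger genes makes better use of the huge amount of mutation data available in the various databases.
Frequency-based approaches rely on an estimate of a background mutation rate, which represents the rate of random passenger mutations.
This is the approach proposed by Ding et al. (2008), but it ignores the following four factors: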
1、mutation type (transition versus transversion)
2、nucleotide context (which base is at the mutation site)
3、dinucleotide context (which bases are located at neighboring sites to the mutation),
4、expression level of the gene
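The first three factors can be read straight off a mutation call and its flanking sequence. Below is a minimal sketch in Python; the function names `mutation_type` and `sequence_context` are my own illustrative choices and do not come from any of the tools mentioned here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def mutation_type(ref, alt):
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError("expected two different bases out of A, C, G, T")
    # Transition: purine <-> purine or pyrimidine <-> pyrimidine.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def sequence_context(sequence, position):
    """Return the base at a 0-based position plus its 5' and 3' dinucleotides.

    A dinucleotide is None when the position sits at the end of the sequence.
    """
    base = sequence[position]
    left = sequence[position - 1:position + 1] if position > 0 else None
    right = sequence[position:position + 2] if position + 1 < len(sequence) else None
    return base, left, right
```

For example, `mutation_type("C", "T")` gives `"transition"`, and `sequence_context("ACGT", 1)` gives `("C", "AC", "CG")`. A background model could then keep a separate rate for each combination of these categories.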
Sjoblom et al.(2006) account for nucleotide and dinucleotide context in searching  for drivers of breast and colorectal cancer.
MuSiC (Dees et al.,2012) accounts for mutation type and allows for sample-specific mutation rates;
Lawrence et al.(2013) (MutSigCV) also allow for the inclusion of gene-specific factors such as expression level and replication timing.
In fact, besides mutation frequency, several other criteria matter as well, which is why there are three tools: SIFT (first reported by Ng and Henikoff (2001), later updated by Kumar et al. (2009)), PolyPhen (Adzhubei et al., 2010) and MutationAssessor (Reva et al., 2011).
These tools integrate sequence context, position, and protein characteristics to assess a mutation's functional impact.
To summarize, the criteria for identifying cancer driver genes are:
1、mutation frequency,
2、mutation type,
3、gene-specific features such as replication timing and expression level that are known to affect background rates of mutation,
4、mutation-specific scores that assess functional impact, and the spatial patterning of mutations that only becomes apparent when thousands of samples are considered.
The paper therefore proposes a unified empirical Bayesian Model-based Approach for identifying Driver Genes in Cancer (MADGiC) that utilizes each of these features.

2014-REVIEW-identifying driver mutation in sequenced cancer genome

"Somatic mutations" is a broad term that covers SNVs, indels, CNAs, SVs and more.
However, the resulting sequencing datasets are large and complex, obscuring the clinically important mutations in a background of errors, noise, and random mutations.
Cancer is driven largely by somatic mutations that accumulate in the genome over an individual’s lifetime, with additional contributions from epigenetic and transcriptomic alterations
Success stories from research in the low-throughput era: imatinib has been used to target cells expressing the BCR-ABL fusion gene in chronic myeloid leukemia
gefitinib has been used to inhibit the epidermal growth factor receptor in lung cancer
The three major challenges of NGS: 1, identifying somatic mutations (sequencing errors / tumor heterogeneity) 2, identifying driver genes 3, determining the pathways and other biological processes altered by somatic mutations
Sources of error: optical and PCR duplicates, GC-bias, strand bias (where reads indicating a possible mutation only align to one strand of DNA) and alignment artifacts resulting from low complexity or repetitive regions in the genome.
Most methods for somatic mutation detection address only a subset of the possible sources of error, and there are many SNP-calling tools.
Three key points for identifying driver mutations:
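Strand bias is the easiest of these error sources to check by hand. One common way is a Fisher's exact test on the 2x2 table of reference/alternate reads by forward/reverse strand. The sketch below is my own illustration of that idea, not what any particular variant caller does; the function name and argument order are assumptions.

```python
from math import comb


def strand_bias_p_value(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test on a 2x2 table of read counts.

    Rows are reference/alternate allele, columns are forward/reverse strand.
    A small p-value means the alternate reads are split across the strands
    differently from the reference reads.
    """
    n_ref = ref_fwd + ref_rev
    n_alt = alt_fwd + alt_rev
    n_fwd = ref_fwd + alt_fwd
    total = comb(n_ref + n_alt, n_fwd)

    def weight(a):
        # Unnormalised hypergeometric probability of a table with
        # a reference reads on the forward strand (margins held fixed).
        return comb(n_ref, a) * comb(n_alt, n_fwd - a)

    observed = weight(ref_fwd)
    lowest = max(0, n_fwd - n_alt)
    highest = min(n_ref, n_fwd)
    # Sum every table that is no more likely than the observed one.
    tail = sum(weight(a) for a in range(lowest, highest + 1)
               if weight(a) <= observed)
    return tail / total
```

With balanced counts such as `strand_bias_p_value(10, 10, 5, 5)` the test finds nothing unusual, while a call supported only by forward reads, such as `strand_bias_p_value(20, 20, 10, 0)`, gets a small p-value and deserves a second look.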
1,identifying recurrent mutations;
2,predicting the functional impact of individual mutations;
3,assessing combinations of mutations using pathways, interaction networks, or statistical correlations.
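1, Tools that look directly at mutation frequency determine whether the observed number of mutations in the gene is significantly greater than the number expected according to a background mutation rate (BMR).
The BMR is really hard to pin down: set it too low and you get many false positives, set it too high and you miss many real driver mutations. Genes with a very high mutation frequency are certainly not a problem, though. TP53, for example, will be called a driver gene by any algorithm.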
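To see why the BMR matters so much, here is a minimal sketch of a recurrence test under the simplest possible assumption: every base in every sample mutates independently with the same probability. Real tools such as MuSiC and MutSigCV use much richer background models, so treat the function `recurrence_p_value` and its parameters as purely illustrative.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of exactly k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def recurrence_p_value(observed, gene_length, n_samples, bmr):
    """P(at least `observed` mutations) under a uniform background rate.

    Assumes 0 < bmr < 1 and treats each base in each sample as one trial.
    The 1 - sum form loses precision for extremely small p-values.
    """
    if observed <= 0:
        return 1.0
    trials = gene_length * n_samples
    below = sum(math.exp(log_binom_pmf(i, trials, bmr)) for i in range(observed))
    return max(0.0, 1.0 - below)
```

Take a 1,500 bp gene with 5 mutations across 100 samples. `recurrence_p_value(5, 1500, 100, 1e-6)` is far smaller than `recurrence_p_value(5, 1500, 100, 1e-5)`, so the same gene can look like a clear driver or like an unremarkable passenger depending only on the BMR you plug in.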
evolutionary conservation,
known protein domains,
non-random clustering of mutations,
protein structure,
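3, Tools based on pathways, interaction networks, and de novo approaches:
Pathway-based analysis (KEGG, GO, GSEA) has 4 limitations. First, most annotated gene sets contain too many genes, so the fraction of a gene set covered by our mutated genes falls far short of statistical significance.
Finally, it only looks at known pathways or gene sets.
First, we ignore non-coding somatic mutations.
Second, many of the cancer types we define are in fact a mixture of these subtypes.
Finally, how should different kinds of NGS data be studied together, including WGS, WES, RNA sequencing, DNA methylation, and chromatin modifications?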
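The first limitation is easy to demonstrate with the standard over-representation test, a one-sided hypergeometric test. This is a minimal sketch with made-up numbers; the function name and the background of 20,000 genes are assumptions for illustration only.

```python
from math import comb


def enrichment_p_value(n_background, n_in_set, n_mutated, n_overlap):
    """One-sided hypergeometric test for gene-set over-representation.

    Returns the probability of seeing at least `n_overlap` gene-set members
    when `n_mutated` genes are drawn at random from `n_background` genes,
    of which `n_in_set` belong to the gene set.
    """
    total = comb(n_background, n_mutated)
    upper = min(n_in_set, n_mutated)
    tail = sum(
        comb(n_in_set, i) * comb(n_background - n_in_set, n_mutated - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total
```

Suppose 50 mutated genes out of 20,000, with 5 of them falling in a gene set. If the set has 20 genes, `enrichment_p_value(20000, 20, 50, 5)` is tiny. If the set has 500 genes, `enrichment_p_value(20000, 500, 50, 5)` is several orders of magnitude larger, because a few hits are much closer to what chance alone would give for a set that size.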

2014-review-Next-generation sequencing to guide cancer therapy

 This reductionist thinking led the initial theories on carcinogenesis to be centered on how many “hits” or genetic mutations were necessary for a tumor to develop.
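Reductionists held that the causes of cancer initiation and progression come down to a few necessary factors: "hits", or genetic mutations.
Choice of analysis platform: microarray vs WGS vs WES
Choice of clinical samples: fresh frozen tissue / FFPE specimens / CTCs / ctDNA
Clinical NGS data analysis workflow: mapping --> SNVs, CNVs and SVs --> annotation
            2, Many clinically relevant DNA fusions occur in non-coding regions, so WES also misses quite a lot of information.
Annotation of clinical NGS data: many databases, many analysis methods
Three ways NGS supports clinical care: 1, diagnosis (early detection, precise classification) 2, targeted therapy 3, drug resistance (switching drugs in time)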
CTC: Circulating tumor cell;
ctDNA: circulating tumor DNA;
 FDA: Food and Drug Administration;
FFPE: Formalin-fixed, paraffin-embedded;
MATCH: Molecular Analysis for Therapy Choice;
MHC: Major histocompatibility complex;
NGS: Next-generation sequencing;
SNV: Single-nucleotide variant;
TCGA: The Cancer Genome Atlas.

Literature notes-2015-nature-molecular analysis of gastric cancer: a new classification and a survey of prognosis

Paper: Molecular analysis of gastric cancer identifies subtypes associated with distinct clinical outcomes

A small pre-defined set of gene expression signatures

epithelial-to-mesenchymal transition (EMT)
microsatellite instability (MSI)
cytokine signaling
cell proliferation
DNA methylation
TP53 activity
gastric tissue


The classical classification is: Gastric cancer may be subdivided into 3 distinct subtypes (proximal, diffuse, and distal gastric cancer) based on histopathologic and anatomic criteria. Each subtype is associated with unique epidemiology.

The authors used principal component analysis (PCA)






Gene expression signatures define four molecular subtypes of GC:

MSI (n = 68),

MSS/EMT (n = 46),

MSS/TP53+ (n = 79)

MSS/TP53− (n = 107)


(MSI, MSS/EMT, MSS/TP53+ and MSS/TP53-)

In the TCGA cohort, the four subtypes have n = 46, n = 62, n = 50 and n = 47, respectively.

In the Singapore cohort, n = 12, n = 85, n = 39 and n = 63, respectively.


(i) The MSS/EMT subtype occurred at a significantly younger age (P = 3e-2) than did other subtypes. The majority (>80%) of the subjects in this subtype were diagnosed with diffuse-type (P < 1e-4) at stage III/IV(P = 1e-3).

(ii) The MSI subtype occurred predominantly in the antrum (75%), >60% of subjects had the intestinal subtype, and >50% of subjects were diagnosed at an early stage (I/II).

(iii) Epstein-Barr virus (EBV) infection occurred more frequently in the MSS/TP53+ group (n = 12/18, P = 2e-4) than in the other groups.



Prognosis, from best to worst: MSI > MSS/TP53+ > MSS/TP53− > MSS/EMT

Next, we validated the survival trend of GC subtypes in three independent cohorts: Samsung Medical Center cohort 2 (SMC-2,n = 277, GSE26253)31,

Singapore  cohort(n = 200, GSE15459)21 and

TCGA gastric cohort (n = 205).

We saw that the GC subtypes showed a significant association with overall survival




The MSI subtype was associated with hypermutation, with mutations in KRAS (23.3%), the PI3K-PTEN-mTOR pathway (42%), ALK (16.3%) and ARID1A (44.2%)18.

We observed enrichment of PIK3CA H1047R mutations in the MSI samples

we saw enrichment of E542K and E545K mutations in MSS tumors

The EMT subtype had a lower number of mutation events when compared to the other MSS groups(P = 1e−3).

The MSS/TP53− subtype showed the highest prevalence of TP53 mutations (60%), with a low frequency of other mutations

the MSS/TP53+ subtype showed a relatively higher prevalence (compared to MSS/TP53−) of mutations in APC, ARID1A, KRAS, PIK3CA and SMAD4.



The TCGA study reported expression clusters (subtypes named C1–C4) and genomic subtypes (subtypes named EBV+, MSI, Genome Stable (GS) and Chromosomal Instability (CIN)).

A follow-up study of the Singapore cohort21 described three expression subtypes (Proliferative, Metabolic and Reactive)

However, a consensus on clinically relevant subtypes that encompasses molecular heterogeneity and that can be used in preclinical and clinical research has not been reported.

Here we report the molecular classification of GC linked not only to distinct patterns of genomic alterations, but also to recurrence pattern and prognosis across multiple GC cohorts.



microsatellite instability

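English abbreviation: MI
Full Chinese name: 微卫星不稳定性 (microsatellite instability)
Category: Biological sciences
Summary: Microsatellite instability (MI) testing builds on the discovery of VNTRs. The cellular genome contains large numbers of repeated base sequences. Tandem repeats of 6-7 bp are generally called minisatellite DNA, also known as VNTRs, while tandem repeats of 1-4 bp are called microsatellite DNA, also known as simple repeat sequences (SRS). SRS are among the most common repeat sequences and offer rich polymorphism, high heterozygosity and a low recombination rate. The most common are dinucleotide repeats, namely (AC)n and (TG)n. Studies show that when n≥104, 2 bp repeats are highly polymorphic in the population. SRS are widespread in both prokaryotic and eukaryotic genomes, make up about 5% of eukaryotic genomes, and are one of the new DNA polymorphism markers that have developed rapidly in recent years. Microsatellite instability (MI) refers to the gain or loss of simple repeat sequences. MI was first observed in colon cancer. In 1993, gains or losses of (AC)n repeats were observed on multiple chromosomes in HNPCC, and microsatellite instability was subsequently found in gastric, pancreatic, lung, bladder, breast, prostate and other cancers, suggesting that MI may be another important molecular feature of tumor cells. The results show that MI is associated with tumor development. MI has been found only in tumor cells and has never been detected in normal tissue. In both primary and metastatic tumors, MI is evenly distributed throughout the tumor. The frequency of MI in advanced gastric cancer is significantly higher than in early gastric cancer.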
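To make the definition concrete, here is a minimal sketch that locates microsatellite-like repeats in a DNA string with a regular expression. It uses the 1-4 bp repeat unit from the entry above, while the minimum repeat count is an arbitrary assumption of mine. Note that this only finds the repeat tracts. Calling instability would additionally require comparing tract lengths between tumor and normal samples, which is not shown here.

```python
import re


def find_microsatellites(sequence, min_repeats=5, max_unit=4):
    """Find tandem repeats whose unit is 1..max_unit bases long.

    Returns a list of (0-based start, repeat unit, number of copies).
    The lazy {1,n}? quantifier makes the regex try the shortest unit first,
    so AAAAAA is reported as unit "A" and not as unit "AA".
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits
```

For example, `find_microsatellites("GGT" + "AC" * 6 + "TTG")` reports a single (AC)n tract starting at position 3 with 6 copies.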





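I spent the first two years of my undergraduate studies in Danzhou, Hainan. The Rubber Research Institute was right next door, and many of my classmates did their graduation theses there. I had always assumed it was the rubber center of the world, where all the advanced technology came from. Then, chatting with a teacher from the institute a few days ago, I found out that the rubber tree (Hevea brasiliensis) genome had already been published, and that the institute had nothing whatsoever to do with it. Even funnier, a full genome paper ended up in a journal like BMC. I really can't tell whether the era of genome papers is over or whether their work was simply too poor. Either way I couldn't let it go, so I read their paper closely and downloaded the data to try it out.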





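Here you can see the description of all the sequencing data: 45 G of Illumina 200 bp paired-end sequencing, 27 G of Illumina 200 bp paired-end sequencing, roughly 10 G of long-insert (8 kb, 20 kb) Roche 454 data, and finally a little bit of SOLiD data. This sequencing strategy seems to be modeled on the strawberry genome released in 2011.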








De novo transcriptome analysis of abiotic stress responsive transcripts of Hevea brasiliensis.


So I had to look up the strawberry paper they referred to (strawberry, Fragaria vesca (2n = 2x = 14), a small genome (240 Mb)), which was published in Nature Genetics.