
Teaching myself miRNA-seq analysis, part 1: choosing and reading a paper

ulwvfje — Sat, 25 Jun 2016 08:29:11 +0000

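A few days ago, while browsing the Biostars forum, I came across a question about miRNA analysis. The poster had downloaded the raw data for a published paper from NCBI's SRA and was unsure how to process it. I noticed that the data came from an Ion Torrent sequencer, and since I had never worked with miRNA-seq data before, I took some time to teach myself. With a background in RNA-seq, I found it fairly easy to pick up. I am writing down what I learned in the hope that it helps others who come after me.

The paper I chose was published in 2014. The authors stimulated human iPSC-derived cardiomyocytes (hiPSC-CMs) with ET-1 and compared miRNA and mRNA expression before and after treatment. I did not look closely at the biological significance; I only read the paper from a data-analysis point of view. mRNA expression was measured on the Affymetrix Human Genome U133 Plus 2.0 Array, which is easy to analyze: build the expression matrix, then find differentially expressed genes with the limma package. The miRNA analysis is a bit more involved. The authors used an Ion Torrent sequencer, but the files available from the SRA are FASTQ reads with the adapters already removed, so sequencer-specific details do not really matter here.

Here is the material I collected about the paper: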

## paper : http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0108051

## Aggarwal P, Turner A, Matter A, Kattman SJ et al. RNA expression profiling of human iPSC-derived cardiomyocytes in a cardiac hypertrophy model. PLoS One 2014;9(9):e108051. PMID: 25255322

## The accession numbers are 1. SuperSeries (mRNA+miRNA) - GSE60293

## 2. mRNA expression array - GSE60291 (Affymetrix Human Genome U133 Plus 2.0 Array)

## 3. miRNA-Seq - GSE60292 (Ion Torrent)

## GEO : http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60292

## FTP : ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP045/SRP045420

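Look carefully at which analyses the paper performed; only then can you imitate them and arrive at the same results.

The paper's data-processing workflow was: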
Ion Torrent's Torrent Suite version 3.6 was used for basecalling
Raw sequencing reads were aligned using the SHRiMP2 aligner and were aligned against the human reference genome (hg19) for novel miRNA prediction and then against a custom reference sequence file containing miRBase v.20 known human miRNA hairpins, tRNA, rRNA, adapter sequences and predicted novel miRNA sequences.(Genome_build: hg19, miRBase v.20 human miRNA hairpins)

The miRDeep2 package (default parameters) was used to predict novel (as yet undescribed) miRNAs

Alignments with less than 17 bp matches and a custom 3′ end phred q-score threshold of 17 were filtered out.

miRNA quantification was done using HTSeq v0.5.3p3 using the default union parameter.
Differential miRNA expression was analyzed using the DESeq (v.1.12.1) R/Bioconductor package

In this study, differentially expressed genes that had a false discovery rate cutoff at 10% (FDR <= 0.1), a log₂ fold change greater than 1.5 and less than −1.5 were considered significant.
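The significance cutoff quoted above can be sketched in a few lines of Python. This is only an illustration: the column names `padj` and `log2FoldChange`, the miRNA identifiers and all numbers are made up, and the threshold is read as |log₂ fold change| > 1.5, which is my interpretation of "greater than 1.5 and less than −1.5".

```python
# Hypothetical DESeq-style result rows; identifiers and values are invented.
results = [
    {"id": "miR-A", "log2FoldChange": 2.1, "padj": 0.03},
    {"id": "miR-B", "log2FoldChange": -1.8, "padj": 0.09},
    {"id": "miR-C", "log2FoldChange": 1.2, "padj": 0.01},
    {"id": "miR-D", "log2FoldChange": -2.5, "padj": 0.40},
]


def is_significant(row, fdr_cutoff=0.1, lfc_cutoff=1.5):
    """Keep rows with FDR <= cutoff and |log2 fold change| > cutoff."""
    return row["padj"] <= fdr_cutoff and abs(row["log2FoldChange"]) > lfc_cutoff


significant = [row["id"] for row in results if is_significant(row)]
print(significant)
```

Only rows that pass both the FDR and the fold-change cutoff are kept; a large fold change with a poor adjusted p-value (like the invented `miR-D`) is dropped.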

Target gene prediction was performed using the TargetScan (version 6.2) database

We also used miRTarBase (version 4.3), to identify targets that have been experimentally validated

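## miRDeep2 and miReap ## predict the exact precursor sequence from the mature sequence.

The paper covers FASTQ quality-control criteria, the alignment tool, the alignment references (two alignment routes), miRNA quantification, novel miRNA prediction, and miRNA target prediction. That is also the standard routine for miRNA-seq data analysis. Since the authors provide all of their results, we can reproduce their analysis as a learning exercise.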

Supplementary_files_format_and_content: tab-delimited text files containing raw read counts for known mature human miRNAs (the expression matrix).

We detected 836 known human mature miRNAs in the control-CMs and 769 in the ET1-CMs

Based on our miRNA-Seq data, we predicted 506 sequences to be potentially novel, as yet undescribed miRNAs.

In order to validate the expression profiles of the miRNAs detected, we performed RT-qPCR on a subset of five known human mature and five of our predicted novel miRNAs.

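We obtained a total of 1,922 predicted miRNA-mRNA pairs represented by 309 genes and 174 known mature human miRNAs.

Of course, a routine analysis alone is not enough for a publication, so the authors combined the miRNA and mRNA data in a network analysis, validated a few results with wet-lab experiments, and finished with some discussion of biological significance. A purely computational analysis like this can hardly claim grand ambitions about curing disease.

In the next post I will go over the reference material I gathered while teaching myself miRNA-seq analysis.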

2013-science-3205tumors-12types-4-ways-find-291HCD

ulwvfje — Mon, 31 Aug 2015 15:16:51 +0000

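Paper: Comprehensive identification of mutational cancer driver genes across 12 tumor types

The paper compares four methods for finding cancer driver genes and derives a consolidated, reliable list of 291 high-confidence drivers (HCDs).

The data come from 3,205 tumor samples covering 12 cancer types.

The Cancer Gene Census (CGC) database already contains close to 500 cancer genes.

Cancer genome studies can uncover tens of thousands of somatic mutations, but only a small fraction of them drive tumor initiation and progression.

Moreover, most driver genes are mutated at low frequency, and tumors are heterogeneous, so studies with large numbers of samples are necessary.

The four mainstream approaches to finding cancer driver genes are: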

1. Most common methods identify genes that are mutated more frequently than expected from the background mutation rate (recurrence)

2. Other methods exploit a bias towards the accumulation of functional mutations (FM bias)

3. Other methods exploit the tendency to sustain mutations in certain regions of the protein sequence (CLUST bias)

4. Other approaches exploit the overrepresentation of mutations in specific functional residues, such as phosphorylation sites (ACTIVE bias)

Representative tools for these four approaches are MuSiC, OncodriveFM, OncodriveCLUST and ActiveDriver, respectively.

The paper compares the four methods and combines their results.
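One simple way to picture what combining several driver-detection methods can mean is to count how many methods flag each gene. The sketch below is not the paper's actual combination rule, which these notes do not describe; the gene names, the per-method calls and the `min_methods` threshold are all invented for illustration.

```python
from collections import Counter

# Hypothetical per-method candidate lists; not real output from these tools.
calls = {
    "MuSiC": {"GENE_A", "GENE_B", "GENE_C"},
    "OncodriveFM": {"GENE_A", "GENE_C", "GENE_D"},
    "OncodriveCLUST": {"GENE_B", "GENE_C"},
    "ActiveDriver": {"GENE_A"},
}


def genes_supported_by(method_calls, min_methods=2):
    """Return genes flagged by at least `min_methods` of the methods."""
    votes = Counter(gene for genes in method_calls.values() for gene in genes)
    return sorted(gene for gene, n in votes.items() if n >= min_methods)


consensus = genes_supported_by(calls, min_methods=2)
print(consensus)
```

Raising `min_methods` trades sensitivity for confidence, which is loosely the same trade-off as between a shorter high-confidence list and a longer, more comprehensive candidate list.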

In summary, we provide a very reliable list of 291 HCDs and a second one, of 144 CDs, more comprehensive but with an expectedly higher false-positives rate

One hundred and sixty-five of these candidates are novel findings not included in the CGC.

The authors then ran a functional analysis of the 291 HCDs, which fall mainly into the following five biological functions:

Chromatin remodeling,

mRNA processing,

Cell signaling/proliferation,

Cell adhesion,

DNA repair/Cell cycle

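The 291 HCDs obtained by combining the four methods were then compared with the nearly 500 cancer genes already in the Cancer Gene Census (CGC).

The paper shows for the first time that multiple driver-detection methods can be combined. The case for combining them rests on two facts:

1. There is no gold standard for finding cancer driver genes, so combining several methods gives a more comprehensive result.

2. Combining several methods makes it easier to compare and assess the accuracy of the driver genes found.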

2015-MADGiC-identify-cancer-driver-gene

ulwvfje — Mon, 31 Aug 2015 11:19:58 +0000

A recent tool for finding cancer driver genes:

Cancer is thought to result from the accumulation of causal somatic mutations throughout the lifetime of an individual.

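These cancer-driving mutations mainly affect three classes of genes: 1. oncogenes; 2. tumor-suppressor genes; 3. stability genes.

The first mutations initiate tumorigenesis, and subsequent mutations drive tumor progression.

Identifying these mutations helps greatly in understanding gene function and in designing drug targets.

Distinguishing driver genes from passenger genes makes better use of the huge volume of mutation data held in the various databases.

Frequency-based methods rely on an estimate of a background mutation rate which represents the rate of random passenger mutations.

This is the approach proposed by Ding et al. (2008), but it ignores the following four factors: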

1. mutation type (transition versus transversion),

2. nucleotide context (which base is at the mutation site),

3. dinucleotide context (which bases are located at neighboring sites to the mutation),

4. expression level of the gene
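The first factor in the list, mutation type, is easy to make concrete. The sketch below classifies a single-base substitution as a transition (purine to purine, or pyrimidine to pyrimidine) or a transversion (purine to pyrimidine, or the reverse). The function name `mutation_type` is my own and is not taken from any of the tools mentioned here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def mutation_type(ref, alt):
    """Classify a single-base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Same chemical class on both sides means a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


for ref, alt in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
    print(f"{ref}>{alt}: {mutation_type(ref, alt)}")
```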

Later papers proposed the following three improvements:

Sjoblom et al.(2006) account for nucleotide and dinucleotide context in searching for drivers of breast and colorectal cancer.

MuSiC (Dees et al.,2012) accounts for mutation type and allows for sample-specific mutation rates;

Lawrence et al.(2013) (MutSigCV) also allow for the inclusion of gene-specific factors such as expression level and replication timing.

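But they all share an inherited weakness: they assume that driver genes are mutated more often than the background mutation rate would predict.

In practice, criteria other than mutation frequency also matter. Three tools address this: SIFT (first reported by Ng and Henikoff (2001), later updated by Kumar et al. (2009)), PolyPhen (Adzhubei et al., 2010) and MutationAssessor (Reva et al., 2011).

These tools integrate sequence context, position, and protein characteristics to assess a mutation’s functional impact.

To summarize, the criteria for identifying cancer driver genes are: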

1. mutation frequency,

2. mutation type,

3. gene-specific features such as replication timing and expression level that are known to affect background rates of mutation,

4. mutation-specific scores that assess functional impact, and the spatial patterning of mutations that only becomes apparent when thousands of samples are considered.

Earlier methods each cover only some of the criteria above.

The authors therefore propose a unified empirical Bayesian Model-based Approach for identifying Driver Genes in Cancer (MADGiC) that utilizes each of these features.

2014-REVIEW-identifying driver mutation in sequenced cancer genome

ulwvfje — Mon, 31 Aug 2015 11:17:59 +0000

The term "somatic mutation" is broad; it covers SNVs, indels, CNAs, SVs and more.

However, the resulting sequencing datasets are large and complex, obscuring the clinically important mutations in a background of errors, noise, and random mutations.

Cancer is driven largely by somatic mutations that accumulate in the genome over an individual’s lifetime, with additional contributions from epigenetic and transcriptomic alterations

Success stories from the low-throughput era: imatinib has been used to target cells expressing the BCR-ABL fusion gene in chronic myeloid leukemia

gefitinib has been used to inhibit the epidermal growth factor receptor in lung cancer

But that is far from enough.

Three major challenges for NGS: 1. identifying somatic mutations, which is complicated by sequencing errors and tumor heterogeneity; 2. identifying driver genes; 3. determining the pathways and other biological processes altered by somatic mutations.

Sources of error: optical PCR duplicates, GC-bias, strand bias (where reads indicating a possible mutation only align to one strand of DNA) and alignment artifacts resulting from low complexity or repetitive regions in the genome.

Most methods for somatic mutation detection address only a subset of the possible sources of error, and there are many variant-calling tools to choose from.

Three key elements of identifying driver mutations:

1. identifying recurrent mutations;

2. predicting the functional impact of individual mutations;

3. assessing combinations of mutations using pathways, interaction networks, or statistical correlations.

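Each of the three has spawned a large number of tools. Their problems are as follows:

1. Tools that look directly at mutation frequency try to determine whether the observed number of mutations in the gene is significantly greater than the number expected according to a background mutation rate (BMR).

The BMR is very hard to pin down. An estimate that is too low produces many false positives, while one that is too high misses many real driver mutations. Genes with very high mutation frequencies are safe either way: TP53, for example, is called a driver gene by virtually any algorithm.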
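How sensitive a recurrence test is to the BMR can be shown with a toy calculation. The sketch below assumes a deliberately simplified model in which every base in every sample mutates independently with probability equal to the BMR, so the mutation count for a gene is binomial. Real tools model far more than this, and the gene length, sample count, observed count and both BMR values are invented.

```python
from math import exp, lgamma, log


def log_binom_pmf(i, n, p):
    """log P(X = i) for X ~ Binomial(n, p), computed in log space."""
    return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p))


def upper_tail(k, n, p):
    """P(X >= k), computed as 1 - P(X <= k - 1)."""
    lower = sum(exp(log_binom_pmf(i, n, p)) for i in range(k))
    return max(0.0, 1.0 - lower)


# Invented example: a 1,500-base gene sequenced in 200 samples, 5 mutations seen.
opportunities = 1500 * 200
observed = 5

p_low = upper_tail(observed, opportunities, 1e-6)   # BMR assumed too low
p_high = upper_tail(observed, opportunities, 1e-5)  # BMR assumed ten times higher
print(f"BMR 1e-6: p = {p_low:.2e}")
print(f"BMR 1e-5: p = {p_high:.2e}")
```

With the same five observed mutations, the lower BMR makes the gene look highly significant while the higher BMR does not, which is the false-positive versus missed-driver trade-off described above.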

2. Tools that score a mutation's impact on protein function bring in prior assumptions based on:

evolutionary conservation,

known protein domains,

non-random clustering of mutations,

protein structure,

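3. Tools based on pathways, interaction networks, and de novo approaches:

Pathway-based analysis (KEGG, GO, GSEA) has four limitations. First, most annotated gene sets contain so many genes that the mutated genes make up too small a fraction of the set to reach statistical significance.

Second, pathways are not independent; the links between pathways matter more.

Third, splitting genes into small units such as pathways ignores interactions that cross unit boundaries.

Finally, it only considers known pathways or gene sets.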
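The first limitation can be illustrated with an over-representation test. The sketch below assumes a hypergeometric (Fisher-type) test, which these notes do not name explicitly, and all counts are invented: 20,000 genes in total, 100 of them mutated, and two hypothetical gene sets of very different size.

```python
from math import comb


def hypergeom_upper_tail(k, total_genes, set_size, n_mutated):
    """P(overlap >= k) when n_mutated genes are drawn at random from total_genes,
    of which set_size belong to the gene set."""
    denom = comb(total_genes, n_mutated)
    max_overlap = min(set_size, n_mutated)
    numer = sum(
        comb(set_size, i) * comb(total_genes - set_size, n_mutated - i)
        for i in range(k, max_overlap + 1)
    )
    return numer / denom


# Small set: 20 genes, 3 of them mutated (about 0.1 expected by chance).
p_small = hypergeom_upper_tail(3, 20000, 20, 100)
# Large set: 2,000 genes, 12 of them mutated (about 10 expected by chance).
p_large = hypergeom_upper_tail(12, 20000, 2000, 100)
print(f"small set: p = {p_small:.2e}")
print(f"large set: p = {p_large:.2f}")
```

The large set contains four times as many mutated genes as the small one, yet it is nowhere near significant, because a set that size is expected to pick up about ten mutated genes by chance alone.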

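The past five years have seen dramatic changes in cancer genome sequencing, but several challenges remain before it reaches real clinical use:

First, non-coding somatic mutations have been neglected.

Second, many of the cancer types we define are in fact a mixture of subtypes.

Third, it is unclear which cancers can be pooled and studied together.

Finally, how should different kinds of NGS data be integrated, including WGS, WES, RNA sequencing, DNA methylation, and chromatin modifications?

For some patients, precision cancer medicine has already arrived, but for most patients there is still a long way to go.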

2014-review-Next-generation sequencing to guide cancer therapy

ulwvfje — Mon, 31 Aug 2015 11:15:51 +0000

This reductionist thinking led the initial theories on carcinogenesis to be centered on how many “hits” or genetic mutations were necessary for a tumor to develop.

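Reductionists held that the causes of cancer initiation and progression came down to a set of necessary factors: "hits", or genetic mutations.

Because of this assumption, early efforts to explore the genetic basis of various cancers were mostly low-throughput studies of particular genes or variants.

Choice of assay: microarray vs WGS vs WES

Choice of clinical sample: fresh frozen tissue / FFPE specimens / CTCs / ctDNA

Clinical NGS analysis workflow: mapping --> calling SNVs, CNVs and SVs --> annotation

Challenges: 1. Low-frequency mutations are hard to distinguish from sequencing errors.

2. Many clinically relevant DNA fusions occur in non-coding regions, so WES will also miss a fair amount of information.

Annotating clinical NGS data: many databases and many analysis methods are involved.

Three ways NGS can support clinical care: 1. diagnosis: early detection and precise classification; 2. targeted therapy; 3. drug resistance: switching drugs in time.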

CTC: Circulating tumor cell;

ctDNA: circulating tumor DNA;

FDA: Food and Drug Administration;

FFPE: Formalin-fixed, paraffin-embedded;

MATCH: Molecular Analysis for Therapy Choice;

MHC: Major histocompatibility complex;

NGS: Next-generation sequencing;

SNV: Single-nucleotide variant;

TCGA: The Cancer Genome Atlas.

Paper notes-2015-nature-molecular analysis of gastric cancer: a new classification and its prognostic value

ulwvfje — Mon, 31 Aug 2015 10:35:05 +0000

Paper: Molecular analysis of gastric cancer identifies subtypes associated with distinct clinical outcomes

A small pre-defined set of gene expression signatures

epithelial-to-mesenchymal transition (EMT)
microsatellite instability (MSI)
cytokine signaling
cell proliferation
DNA methylation
TP53 activity
gastric tissue

The classical classification is: Gastric cancer may be subdivided into 3 distinct subtypes—proximal, diffuse, and distal gastric cancer—based on histopathologic and anatomic criteria. Each subtype is associated with unique epidemiology.

The authors used principal component analysis (PCA):

PC1

PC2

PC3

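These three principal components are associated with the seven features above.

Based on the PCA, the 300 GC samples can be divided into four groups, named as follows: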

Gene expression signatures define four molecular subtypes of GC:

MSI (n = 68),

MSS/EMT (n = 46),

MSS/TP53+ (n = 79)

MSS/TP53− (n = 107)

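The authors then applied this classification to two other published datasets, which again split into four groups

(MSI, MSS/EMT, MSS/TP53+ and MSS/TP53−)

TCGA: n = 46, n = 62, n = 50 and n = 47, respectively.

Singapore study: n = 12, n = 85, n = 39 and n = 63, respectively.

This grouping reveals some patterns: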

(i) The MSS/EMT subtype occurred at a significantly younger age (P = 3e-2) than did other subtypes. The majority (>80%) of the subjects in this subtype were diagnosed with diffuse-type (P < 1e-4) at stage III/IV(P = 1e-3).

(ii) The MSI subtype occurred predominantly in the antrum (75%), >60% of subjects had the intestinal subtype, and >50% of subjects were diagnosed at an early stage (I/II).

(iii) Epstein-Barr virus (EBV) infection occurred more frequently in the MSS/TP53+ group (n = 12/18, P = 2e-4) than in the other groups.

The authors then ran a survival analysis on the 300 samples:

Prognosis, from best to worst: MSI > MSS/TP53+ > MSS/TP53− > MSS/EMT
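Survival comparisons of this kind are commonly based on Kaplan-Meier curves. The notes do not say which estimator the paper used, so the sketch below is only an assumption about the general technique. The follow-up times and event flags are invented and do not come from any of the cohorts.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate.

    times  : follow-up time for each subject
    events : True if the subject died at that time, False if censored
    Returns a list of (time, survival probability) at each time with a death.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, died in data if tt == t and died)
        leaving = sum(1 for tt, _ in data if tt == t)  # deaths plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
        i += leaving
    return curve


# Invented follow-up data (months); False marks a censored subject.
times = [5, 8, 8, 12, 20]
events = [True, True, False, True, False]
for t, s in kaplan_meier(times, events):
    print(f"t = {t}: S(t) = {s:.2f}")
```

Computing one such curve per subtype and comparing them (for example with a log-rank test) is the usual way to arrive at an ordering like the one above.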

Next, we validated the survival trend of GC subtypes in three independent cohorts: Samsung Medical Center cohort 2 (SMC-2, n = 277, GSE26253),

Singapore cohort (n = 200, GSE15459) and

TCGA gastric cohort (n = 205).

We saw that the GC subtypes showed a significant association with overall survival

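Conclusion: the authors argue that this classification is the most reasonable one, since it is strongly associated with the prognosis of each subtype.

Next, the mutation patterns:

The MSI subtype was associated with hypermutation, with mutations in KRAS (23.3%), the PI3K-PTEN-mTOR pathway (42%), ALK (16.3%) and ARID1A (44.2%).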

We observed enrichment of PIK3CA H1047R mutations in the MSI samples

we saw enrichment of E542K and E545K mutations in MSS tumors

The EMT subtype had a lower number of mutation events when compared to the other MSS groups(P = 1e−3).

The MSS/TP53− subtype showed the highest prevalence of TP53 mutations (60%), with a low frequency of other mutations

the MSS/TP53+ subtype showed a relatively higher prevalence (compared to MSS/TP53−) of mutations in APC, ARID1A, KRAS, PIK3CA and SMAD4.

Next, copy number variation:

Then a comparison with the classifications from two other research groups:

The TCGA study reported expression clusters (subtypes named C1–C4) and genomic subtypes (subtypes named EBV+, MSI, Genome Stable (GS) and Chromosomal Instability (CIN)).

A follow-up study of the Singapore cohort described three expression subtypes (Proliferative, Metabolic and Reactive)

However, a consensus on clinically relevant subtypes that encompasses molecular heterogeneity and that can be used in preclinical and clinical research has not been reported.

Here we report the molecular classification of GC linked not only to distinct patterns of genomic alterations, but also to recurrence pattern and prognosis across multiple GC cohorts.

microsatellite instability

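Abbreviation: MI
Chinese name: 微卫星不稳定性
Category: Biological sciences
Summary: Microsatellite instability (MI) testing builds on the discovery of VNTRs. The genome contains large numbers of repeated sequences. Tandem repeats of 6-7 bp units are generally called minisatellite DNA, also known as VNTRs, while tandem repeats of 1-4 bp units are called microsatellite DNA, also known as simple repeat sequences (SRS). SRS are among the most common repeats and have the advantages of rich polymorphism, high heterozygosity and a low recombination rate. The most common are dinucleotide repeats, namely (AC)n and (TG)n. Studies show that when n≥104, 2 bp repeats are highly polymorphic in the population. SRS are widespread in prokaryotic and eukaryotic genomes, make up about 5% of eukaryotic genomes, and are one of the new DNA polymorphism markers that have developed rapidly in recent years. Microsatellite instability (MI) refers to the gain or loss of simple repeat units. MI was first observed in colon cancer; in 1993, gains or losses of (AC)n repeats were observed on multiple chromosomes in HNPCC. MI has since been found in gastric, pancreatic, lung, bladder, breast and prostate cancers and other tumors, suggesting that MI may be another important molecular marker of tumor cells. Results show that MI is related to tumor initiation and progression; MI has been found only in tumor cells and never in normal tissue. In both primary and metastatic tumors, MI is distributed throughout the tumor. MI is significantly more frequent in advanced gastric cancer than in early gastric cancer.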

Reading the rubber tree genome paper: no raw sequencing data to be found

ulwvfje — Tue, 17 Mar 2015 15:02:56 +0000

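Reading the rubber tree genome paper

I spent the first two years of my undergraduate degree in Danzhou, Hainan, right next to the Rubber Research Institute, where many of my classmates did their graduation projects. I had always assumed it was the world center for rubber research, where all the cutting-edge work originated. Then, chatting with a teacher from the institute a few days ago, I learned that the genome of the rubber tree (Hevea brasiliensis) had already been published, and that the institute had nothing whatsoever to do with it. What amused me even more is that a genome paper ended up in a BMC journal. I do not know whether the genome era is over or whether the work was simply that poor. Either way, I could not let it go, so I decided to read the paper closely and download the data to test it.

The paper is here: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3575267/

All of its data descriptions are in Supplementary Material 1, so I downloaded the supplement.

It describes all of the sequencing data: 45 G of Illumina 200 bp paired-end reads, 27 G of Illumina 200 bp paired-end reads, about 10 G of long-insert (8 kb, 20 kb) Roche 454 data, and finally a small amount of SOLiD data. The sequencing strategy seems to be modeled on the strawberry genome published in 2011.

But the supplement does not list any download location, which puzzled me!

My way of reading the paper should be sound. Perhaps the journal's standards are low enough that it does not require raw sequencing data to be deposited in NCBI's SRA. Or perhaps the authors felt the paper was not high-profile enough and chose not to release the data, keeping it for a more detailed analysis and planning to publish the raw data alongside a bigger paper.

I then searched NCBI's SRA for raw rubber tree data, and sure enough there was none.

The only 10 datasets there are RNA-seq experiments from other groups.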

De novo transcriptome analysis of abiotic stress responsive transcripts of Hevea brasiliensis.

So I had to turn to the paper they modeled their work on, the strawberry genome (strawberry, Fragaria vesca (2n = 2x = 14), a small genome (240 Mb)), published in Nature Genetics:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3326587/