<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 (Bioinformatics Rookies) &#187; Omics Technologies</title>
	<atom:link href="http://www.bio-info-trainee.com/category/omics/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
	<description>Join the discussion on the forum at biotrainee.com, or follow the WeChat official account of the same name, biotrainee.</description>
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>Joint analysis of RNA-seq and copy number variation data for the 1,000+ cell lines in the CCLE database</title>
		<link>http://www.bio-info-trainee.com/3040.html</link>
		<comments>http://www.bio-info-trainee.com/3040.html#comments</comments>
		<pubDate>Wed, 14 Feb 2018 14:43:20 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[cancer]]></category>
		<category><![CDATA[Omics Technologies]]></category>
		<category><![CDATA[CCLE]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=3040</guid>
		<description><![CDATA[The last figure in the supplementary material of this Science paper is the following, so I would like to reproduce the analysis. &#8230; <a href="http://www.bio-info-trainee.com/3040.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>The last figure in the supplementary material of this Science paper is:<span id="more-3040"></span></p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2018/02/highly-correlated-CNV-by-SNP6array-and-RNA-seq.png"><img class="alignnone size-full wp-image-3041" src="http://www.bio-info-trainee.com/wp-content/uploads/2018/02/highly-correlated-CNV-by-SNP6array-and-RNA-seq.png" alt="highly-correlated-cnv-by-snp6array-and-rna-seq" width="1970" height="1428" /></a></p>
<p>I would like to reproduce this analysis.</p>
<p>I will update this post once the reproduction is done.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/3040.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visualizing chimeric RNA with the Bioconductor package chimeraviz</title>
		<link>http://www.bio-info-trainee.com/2955.html</link>
		<comments>http://www.bio-info-trainee.com/2955.html#comments</comments>
		<pubDate>Sat, 06 Jan 2018 09:41:26 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Transcriptome Software]]></category>
		<category><![CDATA[Visualization]]></category>
		<category><![CDATA[Fusion Genes]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2955</guid>
		<description><![CDATA[Visualizing chimeric RNA with the Bioconductor package chimeraviz. High-throughput RNA sequencing &#8230; <a href="http://www.bio-info-trainee.com/2955.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<h1 class="md-end-block md-heading">Visualizing chimeric RNA with the Bioconductor package chimeraviz</h1>
<p><span class="md-line md-end-block"><span class="">High-throughput RNA sequencing has made the detection of fusion transcripts more efficient, but fusion-detection methods and the associated software typically produce a high false discovery rate. A visualization framework that automatically integrates the RNA data with known genomic features helps when checking the results. chimeraviz, a Bioconductor package released in 2017, automates the creation of chimeric RNA visualizations. </span></span></p>
<p><span class="md-line md-end-block">It supports input from 9 different fusion-finding tools (<span class=""><a spellcheck="false" href="http://www.bioinformatics.com.cn/?/article/601">deFuse</a></span>, <span class=""><a spellcheck="false" href="http://www.bioinformatics.com.cn/?/article/497">EricScript</a></span>, InFusion, <span class=""><a spellcheck="false" href="http://www.bioinformatics.com.cn/?/article/367">JAFFA</a></span>, FusionCatcher, FusionMap, PRADA, SOAPfuse, and STAR-Fusion).</span><span id="more-2955"></span></p>
<h2 class="md-end-block md-heading">Official tutorial</h2>
<p><span class="md-line md-end-block">The Bioconductor page has the detailed documentation: <span spellcheck="false"><a href="https://bioconductor.org/packages/release/bioc/html/chimeraviz.html">https://bioconductor.org/packages/release/bioc/html/chimeraviz.html</a></span> | <span class=""><a spellcheck="false" href="https://bioconductor.org/packages/release/bioc/vignettes/chimeraviz/inst/doc/chimeraviz-vignette.html">HTML</a></span> | <span class=""><a spellcheck="false" href="https://bioconductor.org/packages/release/bioc/vignettes/chimeraviz/inst/doc/chimeraviz-vignette.R">R Script</a></span></span></p>
<p><span class="md-line md-end-block">Once installed, the R package ships with a set of test data for fusion-gene visualization. The files are:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">  1.1K Oct 16 22:36 5267readsAligned.bam
   96B Oct 16 22:36 5267readsAligned.bam.bai
   22K Oct 16 22:36 FusionMap_01_TestDataset_InputFastq.FusionReport.txt
   37K Oct 16 22:36 Homo_sapiens.GRCh37.74.sqlite
   68K Oct 16 22:36 Homo_sapiens.GRCh37.74_subset.gtf
  1.9K Oct 16 22:36 PRADA.acc.fusion.fq.TAF.tsv
   32K Oct 16 22:36 UCSC.HG19.Human.CytoBandIdeogram.txt
   32K Oct 16 22:36 UCSC.HG38.Human.CytoBandIdeogram.txt
   16K Oct 16 22:36 defuse_833ke_results.filtered.tsv
  4.6K Oct 16 22:36 ericscript_SRR1657556.results.total.tsv
  1.7M Oct 16 22:36 fusion5267and11759reads.bam
   57K Oct 16 22:36 fusion5267and11759reads.bam.bai
  4.1K Oct 16 22:36 fusioncatcher_833ke_final-list-candidate-fusion-genes.txt
  2.1K Oct 16 22:36 infusion_fusions.txt
  4.3K Oct 16 22:36 jaffa_results.csv
  2.6K Oct 16 22:36 reads.1.fq
  2.6K Oct 16 22:36 reads.2.fq
  1.0K Oct 16 22:36 reads_supporting_defuse_fusion_5267.1.fq
  1.0K Oct 16 22:36 reads_supporting_defuse_fusion_5267.2.fq
  3.3K Oct 16 22:36 soapfuse_833ke_final.Fusion.specific.for.genes
  2.0K Oct 16 22:36 star-fusion.fusion_candidates.final.abridged.txt</pre>
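<p><span class="md-line md-end-block">As you can see, example results from all 9 supported fusion-detection tools are included. For example, here is an excerpt of the results from my favorite, STAR-Fusion:</span></p>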
<pre class="md-fences md-end-block" lang="" contenteditable="false">#FusionName JunctionReadCount   SpanningFragCount   SpliceType  LeftGene    LeftBreakpoint  RightGene   RightBreakpoint
THRA--AC090627.1    27  93  ONLY_REF_SPLICE THRA^ENSG00000126351.8  chr17:38243106:+    AC090627.1^ENSG00000235300.3    chr17:46371709:+
THRA--AC090627.1    5   93  ONLY_REF_SPLICE THRA^ENSG00000126351.8  chr17:38243106:+    AC090627.1^ENSG00000235300.3    chr17:46384693:+
ACACA--STAC2    12  51  ONLY_REF_SPLICE ACACA^ENSG00000132142.15    chr17:35479453:-    STAC2^ENSG00000141750.6 chr17:37374426:-
RPS6KB1--SNF8   10  43  ONLY_REF_SPLICE RPS6KB1^ENSG00000108443.9   chr17:57970686:+    SNF8^ENSG00000159210.5  chr17:47021337:-
TOB1--SYNRG 8   30  ONLY_REF_SPLICE TOB1^ENSG00000141232.4  chr17:48943419:-    SYNRG^ENSG00000006114.11    chr17:35880751:-
VAPB--IKZF3 4   46  ONLY_REF_SPLICE VAPB^ENSG00000124164.11 chr20:56964573:+    IKZF3^ENSG00000161405.12    chr17:37934020:-
ZMYND8--CEP250  2   44  ONLY_REF_SPLICE ZMYND8^ENSG00000101040.15   chr20:45852970:-    CEP250^ENSG00000126001.11   chr20:34078463:+
AHCTF1--NAAA    3   38  ONLY_REF_SPLICE AHCTF1^ENSG00000153207.10   chr1:247094880:-    NAAA^ENSG00000138744.10 chr4:76846964:-
VAPB--IKZF3 1   46  ONLY_REF_SPLICE VAPB^ENSG00000124164.11 chr20:56964573:+    IKZF3^ENSG00000161405.12    chr17:37944627:-
VAPB--IKZF3 1   46  ONLY_REF_SPLICE VAPB^ENSG00000124164.11 chr20:56964573:+    IKZF3^ENSG00000161405.12    chr17:37922746:-
STX16--RAE1 4   33  ONLY_REF_SPLICE STX16^ENSG00000124222.17    chr20:57227143:+    RAE1^ENSG00000101146.8  chr20:55929088:+</pre>
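<p><span class="md-line md-end-block">In this excerpt one fusion name can appear on several rows, one per breakpoint pair, so it can help to total the junction reads per fusion before picking candidates. A minimal shell sketch, assuming the column order shown above (fusion name in column 1, JunctionReadCount in column 2); the function name <code>summarize_fusions</code> is made up for illustration:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Sum JunctionReadCount (column 2) per fusion name (column 1).
# Lines starting with "#" (the header) are skipped.
summarize_fusions() {
  awk '!/^#/ { total[$1] += $2 }
       END { for (name in total) print name "\t" total[name] }' |
    LC_ALL=C sort -k2,2nr -k1,1
}

# Hypothetical usage with the test file listed above:
# cat star-fusion.fusion_candidates.final.abridged.txt | summarize_fusions
</pre>
<p><span class="md-line md-end-block">The output is one line per fusion, sorted by total junction reads in descending order.</span></p>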
<p><span class="md-line md-end-block" contenteditable="true"><span class="">These result files are all loaded into R with the import family of functions, for example:</span></span></p>
<pre class="md-fences md-end-block" lang="R" contenteditable="false"><span class="cm-variable">library</span>(<span class="cm-variable">chimeraviz</span>)

<span class="cm-comment"># Get reference to results file from deFuse</span>
<span class="cm-variable">defuse833ke</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">system.file</span>(
  <span class="cm-string">"extdata"</span>,
  <span class="cm-string">"defuse_833ke_results.filtered.tsv"</span>,
  <span class="cm-variable">package</span><span class="cm-arg-is">=</span><span class="cm-string">"chimeraviz"</span>)

<span class="cm-comment"># Load the results file into a list of fusion objects</span>
<span class="cm-variable">fusions</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">importDefuse</span>(<span class="cm-variable">defuse833ke</span>, <span class="cm-string">"hg19"</span>)

<span class="cm-comment"># Count the fusion events that were loaded</span>
<span class="cm-variable">length</span>(<span class="cm-variable">fusions</span>)</pre>
<h2 class="md-end-block md-heading">Genome-wide visualization</h2>
<pre class="md-fences md-end-block" lang="R" contenteditable="false"><span class="cm-variable">soapfuse833ke</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">system.file</span>(
  <span class="cm-string">"extdata"</span>,
  <span class="cm-string">"soapfuse_833ke_final.Fusion.specific.for.genes"</span>,
  <span class="cm-variable">package</span> <span class="cm-arg-is">=</span> <span class="cm-string">"chimeraviz"</span>)
<span class="cm-variable">fusions</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">importSoapfuse</span>(<span class="cm-variable">soapfuse833ke</span>, <span class="cm-string">"hg38"</span>, <span class="cm-number">10</span>)
<span class="cm-comment"># Plot!</span>
<span class="cm-variable">plotCircle</span>(<span class="cm-variable">fusions</span>)</pre>
<p><span class="md-line md-end-block">The main output is a circle plot, shown below:</span></p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/chimeraviz-fusion-circle-plot.png"><img class="alignnone size-full wp-image-2957" src="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/chimeraviz-fusion-circle-plot.png" alt="chimeraviz-fusion-circle-plot" width="1094" height="998" /></a></p>
<p><span class="">Red links: </span><span class=""><strong>intrachromosomal fusions</strong></span>; blue links: <span class=""><strong>interchromosomal fusions.</strong></span></p>
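<p><span class="md-line md-end-block">The same intra/inter distinction can be read straight off a results table by comparing the chromosomes of the two breakpoints. A small shell sketch, assuming the STAR-Fusion column layout from the excerpt earlier (LeftBreakpoint in column 6, RightBreakpoint in column 8, both written as <code>chr:position:strand</code>); it does not reproduce chimeraviz's own logic, and <code>classify_fusions</code> is a made-up name:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Label each fusion by whether both breakpoints sit on the same chromosome.
classify_fusions() {
  awk '!/^#/ {
         split($6, left, ":")    # left[1] is the chromosome of the left breakpoint
         split($8, right, ":")   # right[1] is the chromosome of the right breakpoint
         kind = (left[1] == right[1]) ? "intrachromosomal" : "interchromosomal"
         print $1 "\t" kind
       }' | LC_ALL=C sort -u
}

# Hypothetical usage:
# cat star-fusion.fusion_candidates.final.abridged.txt | classify_fusions
</pre>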
<h3 class="md-end-block md-heading">Visualizing a single fusion event</h3>
<pre class="md-fences md-end-block" lang="R" contenteditable="false">
<span class="cm-keyword">if</span>(<span class="cm-operator">!</span><span class="cm-variable">exists</span>(<span class="cm-string">"defuse833ke"</span>))
  <span class="cm-variable">defuse833ke</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">system.file</span>(
    <span class="cm-string">"extdata"</span>,
    <span class="cm-string">"defuse_833ke_results.filtered.tsv"</span>,
    <span class="cm-variable">package</span> <span class="cm-arg-is">=</span> <span class="cm-string">"chimeraviz"</span>)
<span class="cm-variable">fusions</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">importDefuse</span>(<span class="cm-variable">defuse833ke</span>, <span class="cm-string">"hg19"</span>, <span class="cm-number">1</span>)
<span class="cm-comment"># Choose a fusion object</span>
<span class="cm-variable">fusion</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">getFusionById</span>(<span class="cm-variable">fusions</span>, <span class="cm-number">5267</span>)
<span class="cm-comment"># Load edb</span>
<span class="cm-keyword">if</span>(<span class="cm-operator">!</span><span class="cm-variable">exists</span>(<span class="cm-string">"edbSqliteFile"</span>))
  <span class="cm-variable">edbSqliteFile</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">system.file</span>(
    <span class="cm-string">"extdata"</span>,
    <span class="cm-string">"Homo_sapiens.GRCh37.74.sqlite"</span>,
    <span class="cm-variable">package</span><span class="cm-arg-is">=</span><span class="cm-string">"chimeraviz"</span>)
<span class="cm-variable">edb</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">ensembldb</span><span class="cm-operator">::</span><span class="cm-variable">EnsDb</span>(<span class="cm-variable">edbSqliteFile</span>)
<span class="cm-comment"># bamfile with reads in the regions of this fusion event</span>
<span class="cm-keyword">if</span>(<span class="cm-operator">!</span><span class="cm-variable">exists</span>(<span class="cm-string">"fusion5267and11759reads"</span>))
  <span class="cm-variable">fusion5267and11759reads</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">system.file</span>(
    <span class="cm-string">"extdata"</span>,
    <span class="cm-string">"fusion5267and11759reads.bam"</span>,
    <span class="cm-variable">package</span> <span class="cm-arg-is">=</span> <span class="cm-string">"chimeraviz"</span>)
<span class="cm-comment"># Plot!</span>
<span class="cm-variable">plotFusion</span>(
  <span class="cm-variable">fusion</span> <span class="cm-arg-is">=</span> <span class="cm-variable">fusion</span>,
  <span class="cm-variable">bamfile</span> <span class="cm-arg-is">=</span> <span class="cm-variable">fusion5267and11759reads</span>,
  <span class="cm-variable">edb</span> <span class="cm-arg-is">=</span> <span class="cm-variable">edb</span>,
  <span class="cm-variable">nonUCSC</span> <span class="cm-arg-is">=</span> <span class="cm-variable">TRUE</span>)

<span class="cm-comment"># Plot again, this time with the transcripts reduced to one gene-level track</span>
<span class="cm-variable">plotFusion</span>(
  <span class="cm-variable">fusion</span> <span class="cm-arg-is">=</span> <span class="cm-variable">fusion</span>,
  <span class="cm-variable">bamfile</span> <span class="cm-arg-is">=</span> <span class="cm-variable">fusion5267and11759reads</span>,
  <span class="cm-variable">edb</span> <span class="cm-arg-is">=</span> <span class="cm-variable">edb</span>,
  <span class="cm-variable">nonUCSC</span> <span class="cm-arg-is">=</span> <span class="cm-variable">TRUE</span>,
  <span class="cm-variable">reduceTranscripts</span> <span class="cm-arg-is">=</span> <span class="cm-variable">TRUE</span>)</pre>
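<p><span class="md-line md-end-block">This visualization is a bit more involved: it needs the details of the fusion event, a BAM file with the reads covering the two fused genes, and an annotation database for the reference genome.</span></p>
<p><span class="md-line md-end-block">There are two display modes: one shows the fusion at the transcript level, the other at the gene level.</span></p>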
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/chimeraviz-fusion-plot.png"><img class="alignnone size-full wp-image-2958" src="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/chimeraviz-fusion-plot.png" alt="chimeraviz-fusion-plot" width="1310" height="1406" /></a></p>
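<p><span class="md-line md-end-block">Example: the RCC1-HENMT1 fusion.</span></p>
<p><span class="md-line md-end-block md-focus">Top: the chromosomal location of the fusion. The breakpoint (red curve) is supported by 10 discordant reads, 6 of them split reads and 4 spanning reads. Below it are the annotated transcripts and the read coverage track.</span></p>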
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2955.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Exploring alternative splicing in transcriptome data with LeafCutter</title>
		<link>http://www.bio-info-trainee.com/2949.html</link>
		<comments>http://www.bio-info-trainee.com/2949.html#comments</comments>
		<pubDate>Fri, 05 Jan 2018 01:49:59 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Transcriptome Software]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2949</guid>
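		<description><![CDATA[Exploring alternative splicing in transcriptome data with LeafCutter. The software was released back in 2016 &#8230; <a href="http://www.bio-info-trainee.com/2949.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<h1 class="md-end-block md-heading"><span class="">Exploring alternative splicing in transcriptome data with LeafCutter</span></h1>
<p><span class="md-line md-end-block">The software was released back in 2016 as a bioRxiv preprint, but it was only published in Nature Genetics on November 11, 2017. The paper is: <span class=""><a spellcheck="false" href="https://www.nature.com/articles/s41588-017-0004-9">Annotation-free quantification of RNA splicing using LeafCutter</a></span>. Its main selling point is that it does not need gene annotation for the reference genome, so the GTF/GFF file can be omitted; alignment to the genome is of course still required. It has another very important feature, mapping splicing quantitative trait loci (sQTLs), but that is not relevant to my current work, so I will not cover it here.</span><span id="more-2949"></span></p>
<h3 class="md-end-block md-heading">Background</h3>
<p><span class="md-line md-end-block md-focus"><span class="md-expand">Mainstream methods for studying alternative splicing in transcriptome data either estimate isoform ratios or estimate exon inclusion levels. Both face real challenges: alternatively spliced transcripts overlap heavily with the canonical ones, technical noise is present, and the methods depend on existing gene annotations, which are neither accurate nor complete. That is why the authors developed LeafCutter.</span></span></p>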
<h3 class="md-end-block md-heading">LeafCutter workflow.</h3>
<ul class="ul-list" data-mark="-">
<li><span class="md-line md-end-block">First, short reads are <span class=""><strong>mapped</strong></span> to the genome. When SNP data are available, WASP should be used to filter allele-specific reads that map with a bias. </span></li>
<li><span class="md-line md-end-block">Next, LeafCutter <span class=""><strong>extracts junction reads</strong></span> from .bam files, identifies alternatively excised intron clusters, and summarizes <span class=""><strong>intron usage</strong></span> as counts or proportions. </span></li>
<li><span class="md-line md-end-block">Finally, LeafCutter <span class=""><strong>identifies intron clusters</strong></span> with differentially excised introns between two user-defined groups by using a <span class=""><strong>Dirichlet-multinomial model,</strong></span> or maps genetic variants associated with intron excision levels by using a linear model. </span></li>
</ul>
<p><span class="md-line md-end-block">The authors tested it on the Genotype-Tissue Expression (GTEx) Consortium dataset and compared the results against three mainstream gene annotation databases: GENCODE v19, Ensembl, and UCSC. They also validated it on other datasets. The data can be downloaded from dbGaP under accession <span class=""><a spellcheck="false" href="https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v6.p1">phs000424.v6.p1</a></span> (GTEx), GEO under accession <span class=""><a spellcheck="false" href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41637">GSE41637</a></span> (RNA-seq data from mammalian organs), and ENA under accession <span class=""><a spellcheck="false" href="https://www.ebi.ac.uk/ena/data/view/PRJEB3366">PRJEB3366</a></span> (Geuvadis).</span></p>
<h3 class="md-end-block md-heading">Software download links:</h3>
<ul class="ul-list" data-mark="-">
<li><span class="md-line md-end-block">LeafCutter software, <a href="https://github.com/davidaknowles/leafcutter">https://github.com/davidaknowles/leafcutter</a>; </span></li>
<li><span class="md-line md-end-block">LeafViz visualizations, <a href="https://leafcutter.shinyapps.io/leafviz/">https://leafcutter.shinyapps.io/leafviz/</a>; </span></li>
<li><span class="md-line md-end-block">rheumatoid arthritis summary statistics, <a href="http://plaza.umin.ac.jp/yokada/datasource/software.htm">http://plaza.umin.ac.jp/yokada/datasource/software.htm</a>.</span></li>
</ul>
<h3 class="md-end-block md-heading">Installation and usage</h3>
<p><span class="md-line md-end-block">The simplest route is to install with conda:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
conda install <span class="cm-attribute">-c</span> davidaknowles r-leafcutter</pre>
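<p><span class="md-line md-end-block">If the installation fails, you may need to create a dedicated environment for it.</span></p>
<p><span class="md-line md-end-block">That said, it is itself an R package, so you can simply install it from RStudio on your own computer.</span></p>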
<pre class="md-fences md-end-block" lang="r" contenteditable="false">
<span class="cm-keyword">if</span> (<span class="cm-operator">!</span><span class="cm-variable">require</span>(<span class="cm-string">"devtools"</span>)) <span class="cm-variable">install.packages</span>(<span class="cm-string">"devtools"</span>, <span class="cm-variable">repos</span><span class="cm-operator">=</span><span class="cm-string">'http://cran.us.r-project.org'</span>)
<span class="cm-variable">devtools</span><span class="cm-operator">::</span><span class="cm-variable">install_github</span>(<span class="cm-string">"davidaknowles/leafcutter/leafcutter"</span>)</pre>
<p><span class="md-line md-end-block">The source repository also contains helper scripts and test data, so it is still worth cloning:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-builtin">mkdir</span> <span class="cm-attribute">-p</span> ~/biosoft 
<span class="cm-builtin">cd</span> ~/biosoft
<span class="cm-builtin">git</span> clone https://github.com/davidaknowles/leafcutter
<span class="cm-builtin">cd</span> leafcutter
<span class="cm-comment">## You need to edit one script, scripts/bam2junc.sh, and add the software paths to it</span></pre>
<p><span class="md-line md-end-block">It mixes Perl and Python; the team's development environment does not seem to be unified.</span></p>
<h3 class="md-end-block md-heading">Step 1: bam2junc</h3>
<p><span class="md-line md-end-block">For alignment, prefer a splice-aware aligner such as STAR to produce the BAM files. The script below then converts them in batch:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
<span class="cm-comment"># bam_path.txt lists one BAM path per line</span>
<span class="cm-builtin">cat</span> bam_path.txt | while read -r id
<span class="cm-keyword">do</span>
    <span class="cm-def">file</span><span class="cm-operator">=</span><span class="cm-quote">$(basename "</span><span class="cm-def">$id</span><span class="cm-quote">")</span>
    <span class="cm-def">sample</span><span class="cm-operator">=</span><span class="cm-def">${file%%.*}</span>
    <span class="cm-builtin">echo</span> "Converting <span class="cm-def">$id</span> to <span class="cm-def">$sample</span>.junc"
    <span class="cm-builtin">sh</span> /public/biosoft/leafcutter/scripts/bam2junc.sh "<span class="cm-def">$id</span>" "<span class="cm-def">$sample</span>.junc"
<span class="cm-keyword">done</span></pre>
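<p><span class="md-line md-end-block">The sample name in the loop comes from <code>${file%%.*}</code>, which cuts the file name at its first dot. That is worth checking against your own file names: for names that contain several dots, different samples can end up with the same prefix and overwrite each other's <code>.junc</code> file. The sketch below prints both variants side by side; <code>sample_names</code> and the <code>/data/</code> paths are made up for illustration:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Print two candidate sample names for one BAM path, separated by a tab:
#   ${file%%.*}  drops everything from the FIRST dot onwards
#   ${file%.bam} drops only the trailing .bam extension
sample_names() {
  file=$(basename "$1")
  printf '%s\t%s\n' "${file%%.*}" "${file%.bam}"
}

sample_names /data/RNA.NA18486_YRI.chr1.bam
sample_names /data/sampleA.sorted.bam
</pre>
<p><span class="md-line md-end-block">For the first path the two columns are <code>RNA</code> and <code>RNA.NA18486_YRI.chr1</code>, so with names of that shape the <code>%.bam</code> form is the safer choice.</span></p>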
<p><span class="md-line md-end-block">The resulting junc files look like this:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">
chr7    134840725   134843893   .   1   -
chr2    234355442   234355737   .   1   +
chr4    37828435    37831585    .   13  +
chr19   39101772    39101882    .   5   +
chr11   109735445   109827551   .   19  +
chr18   48458730    48465939    .   8   -
chr12   82751048    82752457    .   12  -
chr15   51018323    51018517    .   14  -
chr1    247323115   247335149   .   2   +
chr10   92920631    92982445    .   1   +</pre>
<p><span class="md-line md-end-block">This step is somewhat time-consuming. Save the paths of all junc files for the next step.</span></p>
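<p><span class="md-line md-end-block">Before moving on, a quick look at the best-supported junctions is a cheap sanity check. A minimal sketch, assuming the six-column layout shown above (chromosome, start, end, a placeholder, read count, strand); the span column is simply end minus start, not an exact intron length, and <code>top_junctions</code> is a made-up name:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Print the N junctions with the most supporting reads (default N = 5).
# Output columns: chromosome, start, end, strand, read count, end minus start.
top_junctions() {
  awk -v OFS='\t' '{ print $1, $2, $3, $6, $5, $3 - $2 }' |
    LC_ALL=C sort -k5,5nr |
    head -n "${1:-5}"
}

# Hypothetical usage on one junc file:
# cat sampleA.junc | top_junctions 10
</pre>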
<h3 class="md-end-block md-heading">Step 2: Intron clustering</h3>
<p><span class="md-line md-end-block">This step requires Python 2.7. The Python 2 versus Python 3 split is a major pitfall that has still not gone away.</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">ls *.junc &gt;test_juncfiles.txt
python /public/biosoft/leafcutter/clustering/leafcutter_cluster.py -j test_juncfiles.txt -m 50 -o testYRIvsEU -l 500000</pre>
<p><span class="md-line md-end-block">It finishes within a few minutes.</span></p>
<p><span class="md-line md-end-block">The more important output files are:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">
1.3M Jan  4 17:45 testYRIvsEU_perind.counts.gz
680K Jan  4 17:45 testYRIvsEU_perind_numers.counts.gz
5.0M Jan  4 17:45 testYRIvsEU_pooled
540K Jan  4 17:45 testYRIvsEU_refined
 877 Jan  4 17:45 testYRIvsEU_sortedlibs
 854 Jan  4 17:43 test_juncfiles.txt</pre>
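<p><span class="md-line md-end-block">Note the file <span spellcheck="false"><code>testYRIvsEU_perind_numers.counts.gz</code></span>: each row is an intron, each column is a sample, and the values are the read counts supporting each intron. These counts are what the alternative splicing analysis runs on.</span></p>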
<pre class="md-fences md-end-block" lang="" contenteditable="false">
 #  zcat testYRIvsEU_perind_numers.counts.gz |tail
chr8:145651155:145651305:clu_6538 21 14 19 8 9 0 13 33 0 0 4 0 5 8 12 0 12 34 15 0 0 10 11
chr8:145651155:145651409:clu_6538 1021 611 186 190 294 284 681 89 222 57 257 363 694 807 523 44 469 812 926 71 80 260 214
chr8:145652362:145653872:clu_6539 1265 694 132 74 302 71 178 34 44 12 63 122 230 218 472 6 146 1421 1084 16 14 83 46
chr8:145652654:145653872:clu_6539 48 24 56 0 26 0 13 0 2 5 2 0 3 19 17 0 2 8 64 0 0 3 0
chr8:145652674:145653872:clu_6539 18 26 0 0 0 7 2 0 5 0 0 0 1 6 11 0 3 34 37 0 0 9 6
chr8:146017525:146017630:clu_6540 2 3 44 0 2 12 4 0 0 0 22 5 9 10 2 0 1 9 11 0 0 1 0
chr8:146017525:146017751:clu_6540 1067 671 620 41 295 347 224 89 62 33 262 136 229 223 356 17 288 480 1842 9 35 70 23
chr8:146076780:146078224:clu_6541 18 3 0 0 17 17 8 0 0 3 2 3 16 6 12 0 4 45 29 9 0 10 2
chr8:146076780:146078378:clu_6541 22 17 0 0 0 3 1 0 0 0 3 2 15 7 2 0 7 62 55 0 0 4 0
chr8:146076780:146078757:clu_6541 10 1 16 0 12 52 0 0 11 0 24 9 27 3 0 0 7 0 28 0 0 2 0</pre>
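<p><span class="md-line md-end-block">Each row name ends in a cluster ID (<code>clu_...</code>), and introns that share an ID belong to the same cluster. A short sketch that counts introns per cluster, assuming the <code>chr:start:end:cluster</code> row names shown above; <code>introns_per_cluster</code> is a made-up name:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Count how many introns each cluster contains.
# Only rows whose first field contains ":clu_" are used, so a header line is ignored.
introns_per_cluster() {
  awk '$1 ~ /:clu_/ {
         n = split($1, part, ":")   # part[n] is the cluster ID
         count[part[n]]++
       }
       END { for (c in count) print c "\t" count[c] }' |
    LC_ALL=C sort
}

# Hypothetical usage:
# zcat testYRIvsEU_perind_numers.counts.gz | introns_per_cluster | tail
</pre>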
<h3 class="md-end-block md-heading">Step 3: Build the group file and run the differential analysis</h3>
<p><span class="md-line md-end-block">To avoid exposing my real project, here is the authors' example file instead:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">
RNA.NA18486_YRI.chr1.bam YRI
RNA.NA18487_YRI.chr1.bam YRI
RNA.NA18488_YRI.chr1.bam YRI
RNA.NA18489_YRI.chr1.bam YRI
RNA.NA18498_YRI.chr1.bam YRI
RNA.NA06984_CEU.chr1.bam CEU
RNA.NA06985_CEU.chr1.bam CEU
RNA.NA06986_CEU.chr1.bam CEU
RNA.NA06989_CEU.chr1.bam CEU
RNA.NA06994_CEU.chr1.bam CEU</pre>
<p><span class="md-line md-end-block">It is a simple two-column file that states which group each sample belongs to.</span></p>
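<p><span class="md-line md-end-block">When the group is encoded in the file name, as in the example above, the file does not have to be typed by hand. A sketch, assuming the group code sits between the underscore and the next dot (<code>..._YRI.chr1.bam</code>); <code>make_group_file</code> is a made-up name, and names that do not fit the pattern pass through unchanged, so check the output:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Read BAM file names on stdin and append the group code as a second column.
make_group_file() {
  sed -E 's/^(.*_([A-Za-z]+)\..*)$/\1 \2/'
}

# Hypothetical usage, writing the result to group_info.txt:
# ls *.bam | make_group_file | tee group_info.txt
</pre>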
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"> /public/biosoft/leafcutter/scripts/leafcutter_ds.R <span class="cm-attribute">--num_threads</span> <span class="cm-number">4</span> \
 <span class="cm-attribute">--exon_file</span><span class="cm-operator">=</span>/public/biosoft/leafcutter/leafcutter/data/gencode19_exons.txt.gz \
testYRIvsEU_perind_numers.counts.gz group_info.txt</pre>
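<p><span class="md-line md-end-block" contenteditable="true">Here <span spellcheck="false"><code>group_info.txt</code></span> is the group file you prepared. Note that <span class=""><strong>the file must contain exactly 2 groups,</strong></span><span class=""> so that the software knows what to compare. If you have many groups, consider making several group files and running the script several times.</span></span></p>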
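<p><span class="md-line md-end-block">With more than two groups, one way to get the per-comparison files is to filter a master group file down to one pair at a time. A minimal sketch; <code>select_pair</code>, <code>all_groups.txt</code>, and the third group <code>GBR</code> are hypothetical names used only for illustration:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Keep only the samples whose group (column 2) is one of the two requested groups.
select_pair() {
  awk -v a="$1" -v b="$2" '$2 == a || $2 == b'
}

# Hypothetical usage: one group file per comparison
# cat all_groups.txt | select_pair YRI CEU | tee group_YRI_vs_CEU.txt
# cat all_groups.txt | select_pair YRI GBR | tee group_YRI_vs_GBR.txt
</pre>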
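<p><span class="md-line md-end-block">Of course, the <code>leafcutter_ds.R</code> command above no longer needs to run on a Linux server.</span></p>
<p><span class="md-line md-end-block">With the intron count matrix and the group information in hand, the differential analysis needs very few computing resources, so you can download everything and run it on your own computer.</span></p>
<p><span class="md-line md-end-block">Open the file <span class="" spellcheck="false"><code>/public/biosoft/leafcutter/scripts/leafcutter_ds.R</code></span> yourself to understand the whole workflow.</span></p>
<p><span class="md-line md-end-block" contenteditable="true"><span class="">This too produces all results within a few minutes.</span></span></p>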
<pre class="md-fences md-end-block" lang="" contenteditable="false">
Running differential splicing analysis...
Differential splicing summary:
                                             statuses Freq
1 &lt;2 introns used in &gt;=min_samples_per_intron samples  425
2                          &lt;=1 sample with coverage&gt;0   62
3               &lt;=1 sample with coverage&gt;min_coverage  939
4                            Not enough valid samples 3047
5                                             Success 2068
Saving results...
Loading exons from /Users/jmzeng/biosoft/leafcutter/leafcutter/data/gencode19_exons.txt.gz
All done, exiting</pre>
<p><span class="md-line md-end-block" contenteditable="true">Among the output files, the one to study in detail is <span class="" spellcheck="false"><code>leafcutter_ds_cluster_significance.txt</code></span><span class="">; for the meaning of its columns, mostly rely on the README.</span></span></p>
<h3 class="md-end-block md-heading">Step 4: Visualize the alternative splicing events</h3>
<p><span class="md-line md-end-block" contenteditable="true"><span class="">This is also a ready-made script.</span></span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"> /Users/jmzeng/biosoft/leafcutter/scripts/ds_plots.R <span class="cm-attribute">-e</span>  /Users/jmzeng/biosoft/leafcutter/leafcutter/data/gencode19_exons.txt.gz testYRIvsEU_perind_numers.counts.gz   group_info.txt leafcutter_ds_cluster_significance.txt <span class="cm-attribute">-f</span> <span class="cm-number">0</span>.05</pre>
<p><span class="md-line md-end-block">All alternative splicing events are plotted into a single PDF, as shown below:</span></p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/LeafCutter1.jpeg"><img class="alignnone size-full wp-image-2950" src="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/LeafCutter1.jpeg" alt="leafcutter1" width="2236" height="2124" /></a> <a href="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/LeafCutter2.jpeg"><img class="alignnone size-full wp-image-2951" src="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/LeafCutter2.jpeg" alt="leafcutter2" width="2232" height="2122" /></a> <a href="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/LeafCutter3.jpeg"><img class="alignnone size-full wp-image-2952" src="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/LeafCutter3.jpeg" alt="leafcutter3" width="2228" height="2154" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2949.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Exploring alternative splicing with SGSeq</title>
		<link>http://www.bio-info-trainee.com/2890.html</link>
		<comments>http://www.bio-info-trainee.com/2890.html#comments</comments>
		<pubDate>Thu, 14 Dec 2017 03:17:11 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Transcriptome Software]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2890</guid>
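		<description><![CDATA[Alternative splicing is the process by which a pre-mRNA joins its exons together in more than one way. Because alternative splicing lets one &#8230; <a href="http://www.bio-info-trainee.com/2890.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>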
				<content:encoded><![CDATA[<div class="markdown-here-wrapper" data-md-url="http://www.bio-info-trainee.com/wp-admin/post-new.php">
<blockquote style="margin: 1.2em 0px; border-left: 4px solid #dddddd; padding: 0px 1em; color: #777777; quotes: none;">
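<p style="margin: 0px 0px 1.2em !important;"><strong>Alternative splicing</strong> is the process by which a pre-mRNA joins its exons together in more than one way. Because <strong>alternative splicing</strong> lets one gene produce multiple mRNA <strong>transcripts</strong>, different mRNAs may be translated into different proteins.</p>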
</blockquote>
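<h2 id="-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">Background on alternative splicing</h2>
<p style="margin: 0px 0px 1.2em !important;">The transcriptome generally refers to the total set of RNA transcribed from the genome of a cell or tissue, including protein-coding mRNA and various non-coding RNAs (<strong>rRNA, tRNA, snRNA, snoRNA, lncRNA, microRNA</strong>, etc.). Eukaryotic genes have a discontinuous structure, as shown below:</p>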
<p style="margin: 0px 0px 1.2em !important;"><span id="more-2890"></span></p>
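<p style="margin: 0px 0px 1.2em !important;"><img src="http://www.bio-info-trainee.com/wp-content/uploads/2017/11/gene-structure.png" alt="Structure of a eukaryotic gene" /></p>
<p style="margin: 0px 0px 1.2em !important;">The initial transcription product is not a mature mRNA molecule but its precursor, the pre-mRNA. To become mature mRNA, the introns must be removed from the pre-mRNA and the remaining exons spliced together. In practice there are many different ways to cut and join during this process, and they produce different splice isoforms. This is the alternative splicing discussed here.</p>
<p style="margin: 0px 0px 1.2em !important;">Alternative splicing takes many complex forms, which fall roughly into 5 major classes.</p>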
<ul style="margin: 1.2em 0px; padding-left: 2em;">
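<li style="margin: 0.5em 0px;">Class 1 is exon skipping: the skipped exon is spliced out together with its flanking introns, and the upstream and downstream exons are joined directly in the spliced product.</li>
<li style="margin: 0.5em 0px;">Class 2 is intron retention: a stretch of sequence is part of an exon in one isoform, while in the isoform it is compared against it is an intron and is spliced out.</li>
<li style="margin: 0.5em 0px;">Class 3 is alternative 5' or 3' splice site usage (alternative 5'ss or alternative 3'ss, where the 5'ss is the donor site and the 3'ss is the acceptor site): compared with the other isoform, the splice site differs at the 5' or 3' end, and all other splicing choices are the same.</li>
<li style="margin: 0.5em 0px;">Class 4 is alternative transcription start site (alternative TSS): the difference lies in the transcription start region, so the two isoforms are identical except for their transcription start sites.</li>
<li style="margin: 0.5em 0px;">Class 5 is alternative transcription termination site (alternative TTS): the counterpart of class 4, where the isoforms differ only at the transcription termination site.</li>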
</ul>
<p style="margin: 0px 0px 1.2em !important;"><img src="http://www.bio-info-trainee.com/wp-content/uploads/2017/11/splicing.png" alt="The 5 forms of alternative splicing" /></p>
<h2 id="-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">Software and algorithms</h2>
<p style="margin: 0px 0px 1.2em !important;"><strong>Older</strong> tools for analyzing alternative splicing include SpliceR, SpliceGrapher, ASprofile, and Splicing Express. They build on Cufflinks output: after the reads are mapped back to the genome, they use position, length, and structural information to determine or predict the likely isoform types. The mainstream no longer uses the TopHat + Cufflinks pipeline.</p>
<h3 id="sgseq-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.3em;">SGSeq workflow</h3>
<p style="margin: 0px 0px 1.2em !important;">This section introduces <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">SGSeq</code>. Its input is BAM, but the BAM files must come from a splice-aware aligner, such as:</p>
<ul style="margin: 1.2em 0px; padding-left: 2em;">
<li style="margin: 0.5em 0px;">GSNAP (T. D. Wu and Nacu 2010)</li>
<li style="margin: 0.5em 0px;">HISAT (Kim, Langmead, and Salzberg 2015)</li>
<li style="margin: 0.5em 0px;">STAR (Dobin et al. 2013)</li>
</ul>
<p style="margin: 0px 0px 1.2em !important;">What it actually needs is for the alignments in the bam file to carry the <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">XS</code> tag, which records the direction of transcription for spliced reads!</p>
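<p style="margin: 0px 0px 1.2em !important;">To make the <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">XS</code> requirement concrete, here is a rough base-R sketch that checks SAM-format text lines for an <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">XS:A:+</code> or <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">XS:A:-</code> field. The helper name <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">has_xs_tag</code> and the two alignment records are made up for illustration; this is not part of SGSeq. As far as I know, STAR only writes this tag when it is run with <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">--outSAMstrandField intronMotif</code>.</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;"># Hypothetical helper (not an SGSeq function): does each SAM line carry an XS:A strand tag?
has_xs_tag &lt;- function(sam_lines) {
  vapply(strsplit(sam_lines, "\t", fixed = TRUE), function(fields) {
    # The first 11 SAM columns are mandatory; optional TAG:TYPE:VALUE fields follow
    optional &lt;- fields[-seq_len(11)]
    any(grepl("^XS:A:[+-]$", optional))
  }, logical(1))
}

# Two made-up records: a spliced read with XS and an unspliced read without it
sam &lt;- c(
  "r1\t0\t16\t87362950\t255\t30M500N20M\t*\t0\t0\t*\t*\tNH:i:1\tXS:A:-",
  "r2\t0\t16\t87363000\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1"
)
has_xs_tag(sam)  # expected: TRUE FALSE
</code></pre>
<p style="margin: 0px 0px 1.2em !important;">On a real bam file you would first dump a few records as text (for example with samtools view) and feed those lines to the helper.</p>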
<p style="margin: 0px 0px 1.2em !important;">Installation instructions and usage for the <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">SGSeq</code> package can all be found on the official site:</p>
<table style="margin: 1.2em 0px; padding: 0px; border-collapse: collapse; border-spacing: 0px; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; font-size: inherit; line-height: inherit; font-family: inherit; border: 0px;">
<thead>
<tr style="border-width: 1px 0px 0px; border-image: initial; background-color: white; margin: 0px; padding: 0px; border-color: #cccccc initial initial initial; border-style: solid initial initial initial;">
<th style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em; font-weight: bold; background-color: #f0f0f0;"><a href="https://bioconductor.org/packages/release/bioc/vignettes/SGSeq/inst/doc/SGSeq.html">HTML</a></th>
<th style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em; font-weight: bold; background-color: #f0f0f0;"><a href="https://bioconductor.org/packages/release/bioc/vignettes/SGSeq/inst/doc/SGSeq.R">R Script</a></th>
<th style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em; font-weight: bold; background-color: #f0f0f0;">SGSeq</th>
</tr>
</thead>
<tbody style="margin: 0px; padding: 0px; border: 0px;">
<tr style="border-width: 1px 0px 0px; border-image: initial; background-color: white; margin: 0px; padding: 0px; border-color: #cccccc initial initial initial; border-style: solid initial initial initial;">
<td style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em;"><a href="https://bioconductor.org/packages/release/bioc/manuals/SGSeq/man/SGSeq.pdf">PDF</a></td>
<td style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em;"></td>
<td style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em;">Reference Manual</td>
</tr>
<tr style="border-width: 1px 0px 0px; border-image: initial; background-color: #f8f8f8; margin: 0px; padding: 0px; border-color: #cccccc initial initial initial; border-style: solid initial initial initial;">
<td style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em;"><a href="https://bioconductor.org/packages/release/bioc/news/SGSeq/NEWS">Text</a></td>
<td style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em;"></td>
<td style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em;">NEWS</td>
</tr>
</tbody>
</table>
<h2 id="-bam-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">BAM files are required</h2>
<p style="margin: 0px 0px 1.2em !important;">Once the package is installed you can see the data bundled with it:</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;">jianmingzengs-iMac:IGV_2.3.98 jmzeng$ cd /Library/Frameworks/R.framework/Versions/3.4/Resources/library/SGSeq/extdata/bams/
jianmingzengs-iMac:bams jmzeng$ ls -lh
total 1952
-rw-r--r-- 1 jmzeng admin 54K Nov 1 01:26 N1.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 N1.bam.bai
-rw-r--r-- 1 jmzeng admin 86K Nov 1 01:26 N2.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 N2.bam.bai
-rw-r--r-- 1 jmzeng admin 75K Nov 1 01:26 N3.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 N3.bam.bai
-rw-r--r-- 1 jmzeng admin 92K Nov 1 01:26 N4.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 N4.bam.bai
-rw-r--r-- 1 jmzeng admin 75K Nov 1 01:26 T1.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 T1.bam.bai
-rw-r--r-- 1 jmzeng admin 90K Nov 1 01:26 T2.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 T2.bam.bai
-rw-r--r-- 1 jmzeng admin 65K Nov 1 01:26 T3.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 T3.bam.bai
-rw-r--r-- 1 jmzeng admin 75K Nov 1 01:26 T4.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 T4.bam.bai
</code></pre>
<p style="margin: 0px 0px 1.2em !important;">These <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">bam</code> files are this small because the package authors extracted only a small part of the data, aligned to <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">hg19</code>, at coordinates <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">16 [87362942, 87425708]</code>.</p>
<h2 id="-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">An annotation file is required</h2>
<p style="margin: 0px 0px 1.2em !important;">The reference annotation has to match the reference genome the bam files were aligned to, and it is built from a Bioconductor TxDb object. For hg19 it can be done as follows:</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;">library(SGSeq)  # provides convertToTxFeatures() and the example objects si and gr
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb &lt;- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb &lt;- keepSeqlevels(txdb, "chr16")      # keep only chromosome 16
seqlevelsStyle(txdb) &lt;- "NCBI"            # rename "chr16" to "16" to match the bam files
txf_ucsc &lt;- convertToTxFeatures(txdb)
txf_ucsc &lt;- txf_ucsc[txf_ucsc %over% gr]  # gr: the example region shipped with SGSeq
head(txf_ucsc)
type(txf_ucsc)
head(txName(txf_ucsc))
head(geneName(txf_ucsc))
</code></pre>
<p style="margin: 0px 0px 1.2em !important;">The key step is <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">convertToTxFeatures()</code>, which converts the <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">TxDb</code> object into a <em>TxFeatures</em> object that labels the following five feature types:</p>
<ul style="margin: 1.2em 0px; padding-left: 2em;">
<li style="margin: 0.5em 0px;"><em>J</em> (splice junction)</li>
<li style="margin: 0.5em 0px;"><em>I</em> (internal exon)</li>
<li style="margin: 0.5em 0px;"><em>F</em> (first/5′-terminal exon)</li>
<li style="margin: 0.5em 0px;"><em>L</em> (last/3′-terminal exon)</li>
<li style="margin: 0.5em 0px;"><em>U</em> (unspliced transcript).</li>
</ul>
<p style="margin: 0px 0px 1.2em !important;">Then <em>convertToSGFeatures()</em> converts the TxFeatures object into an SGFeatures object, which labels:</p>
<ul style="margin: 1.2em 0px; padding-left: 2em;">
<li style="margin: 0.5em 0px;"><em>J</em> (splice junction)</li>
<li style="margin: 0.5em 0px;"><em>E</em> (disjoint exon bin)</li>
<li style="margin: 0.5em 0px;"><em>D</em> (splice donor site)</li>
<li style="margin: 0.5em 0px;"><em>A</em> (splice acceptor site).</li>
</ul>
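<p style="margin: 0px 0px 1.2em !important;">The idea behind the <em>E</em> (disjoint exon bin) features can be sketched in base R: overlapping exons from different transcripts are cut at every exon boundary so that no two bins overlap. This is only my illustration of the concept, not SGSeq's implementation; the helper name <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">disjoint_bins</code> and the coordinates are made up.</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;"># Hypothetical helper: split overlapping exons (1-based, closed intervals) into disjoint bins
disjoint_bins &lt;- function(start, end) {
  # Every exon start, and every position just past an exon end, is a breakpoint
  breaks &lt;- sort(unique(c(start, end + 1)))
  bin_start &lt;- head(breaks, -1)
  bin_end &lt;- tail(breaks, -1) - 1
  # Keep only bins that lie inside at least one exon (this drops the intronic gaps)
  covered &lt;- mapply(function(s, e) any(start &lt;= s &amp; end &gt;= e), bin_start, bin_end)
  data.frame(start = bin_start[covered], end = bin_end[covered])
}

# Two made-up first exons share a start but use different 5' splice sites (200 vs 250),
# followed by a common downstream exon
disjoint_bins(start = c(100, 100, 300), end = c(200, 250, 400))
# expected bins: 100-200, 201-250, 300-400
</code></pre>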
<h2 id="-sgseq-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">Running SGSeq</h2>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;">path &lt;- system.file("extdata", package = "SGSeq")
si$file_bam &lt;- file.path(path, "bams", si$file_bam)  # point the example sample table at the bundled bam files
sgfc_ucsc &lt;- analyzeFeatures(si, features = txf_ucsc)
sgfc_ucsc
</code></pre>
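<p style="margin: 0px 0px 1.2em !important;">Because the data bundled with the package is tiny, this finishes very quickly; I do not know how long my real <strong>16G</strong> bam file would take.</p>
<h2 id="-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">Exploring the results</h2>
<p style="margin: 0px 0px 1.2em !important;">Everything here also runs inside R. The functions below are for exploring the analysis results; the resulting matrices give the expression level of every exon of every gene, together with that of the splice junctions linking the exons.</p>
<p style="margin: 0px 0px 1.2em !important;">In other words, the package counts reads for all of the genomic features from within R.</p>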
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code class="hljs language-R" style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block; overflow: auto; overflow-x: auto; color: #333333; background: #f8f8f8; text-size-adjust: none;">colData(sgfc_ucsc)
rowRanges(sgfc_ucsc)
head(counts(sgfc_ucsc))
head(FPKM(sgfc_ucsc))
</code></pre>
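<p style="margin: 0px 0px 1.2em !important;">For readers who have not met FPKM before, the textbook definition is sketched below in base R. I am assuming the generic formula here (count per kilobase of feature per million mapped fragments); SGSeq's own <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">FPKM()</code> may normalize feature lengths differently, and the function name <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">fpkm</code> and the numbers are made up.</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;"># Generic FPKM definition (assumption: not necessarily what SGSeq::FPKM() does internally)
fpkm &lt;- function(count, length_bp, total_fragments) {
  count / (length_bp / 1e3) / (total_fragments / 1e6)
}

# Made-up numbers: 500 fragments on a 2,000 bp exon bin, 20 million mapped fragments
fpkm(count = 500, length_bp = 2000, total_fragments = 20e6)
# 500 / 2 kb / 20 million = 12.5
</code></pre>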
<h2 id="-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">Visualizing alternative splicing</h2>
<p style="margin: 0px 0px 1.2em !important;">Pick one of the genes and visualize how its features differ in expression across samples:</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code class="hljs language-R" style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block; overflow: auto; overflow-x: auto; color: #333333; background: #f8f8f8; text-size-adjust: none;">df &lt;- plotFeatures(sgfc_ucsc, geneID = <span class="hljs-number" style="color: #008080;">1</span>)
<span class="hljs-comment" style="color: #999988; font-style: italic;"># A slightly more elaborate visualization: predict features from the bam files, then annotate them</span>
sgfc_pred &lt;- analyzeFeatures(si, which = gr)
head(rowRanges(sgfc_pred))
sgfc_pred &lt;- annotate(sgfc_pred, txf_ucsc)
head(rowRanges(sgfc_pred))
df &lt;- plotFeatures(sgfc_pred, geneID = <span class="hljs-number" style="color: #008080;">1</span>, color_novel = <span class="hljs-string" style="color: #dd1144;">"red"</span>)
</code></pre>
<p style="margin: 0px 0px 1.2em !important;"> <a href="http://www.bio-info-trainee.com/wp-content/uploads/2017/12/transcript-variant-bioconductor-SGSeq.png"><img class="alignnone size-full wp-image-2893" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/12/transcript-variant-bioconductor-SGSeq.png" alt="transcript-variant-bioconductor-sgseq" width="1504" height="1080" /></a>This is an example hand-picked by the package authors to show the software at its best; in practice you would first check globally which alternative splicing events exist and then output them.</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;">## Another way to display it: the splice graph on top, then per-sample coverage
par(mfrow = c(5, 1), mar = c(1, 3, 1, 1))
plotSpliceGraph(rowRanges(sgfc_pred), geneID = 1, toscale = "none", color_novel = "red")
for (j in 1:4) {
 plotCoverage(sgfc_pred[, j], geneID = 1, toscale = "none")
}
</code></pre>
<p style="margin: 0px 0px 1.2em !important;"><img class="alignnone size-full wp-image-2892" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/12/transcript-variant-bioconductor-SGSeq-2.png" alt="transcript-variant-bioconductor-sgseq-2" width="1788" height="1136" /></p>
<h2 id="-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">Identifying splice variants from the predicted splice graph</h2>
<p style="margin: 0px 0px 1.2em !important;">Instead of considering the full splice graph of a gene, the analysis can be focused on individual splice events. Function <em>analyzeVariants()</em> recursively identifies splice events from the graph, obtains representative counts for each splice variant, and computes estimates of relative splice variant usage, also referred to as <strong>‘percentage spliced in’ (PSI or Ψ)</strong> (Venables et al. 2008; Katz et al. 2010). (This touches on an algorithmic question.)</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;">sgvc_pred &lt;- analyzeVariants(sgfc_pred)
sgvc_pred
mcols(sgvc_pred)
variantFreq(sgvc_pred)
plotVariants(sgvc_pred, eventID = 1, color_novel = "red")
library(BSgenome.Hsapiens.UCSC.hg19)
seqlevelsStyle(Hsapiens) &lt;- "NCBI"
sgv_pred &lt;- rowRanges(sgvc_pred)  # extract the predicted splice variants from the counts object
vep &lt;- predictVariantEffects(sgv_pred, txdb, Hsapiens)
vep
</code></pre>
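<p style="margin: 0px 0px 1.2em !important;">To show what a PSI value means, here is a minimal base-R sketch for the simplest case, an event with just two variants. It assumes PSI is the count supporting one variant divided by the counts supporting both; SGSeq derives its representative counts from the splice graph, so treat this as an illustration rather than what <em>analyzeVariants()</em> actually does. The helper name <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">psi</code> and the counts are made up.</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;"># Hypothetical helper: PSI for a two-variant event from per-sample read counts
psi &lt;- function(inclusion, exclusion) {
  total &lt;- inclusion + exclusion
  # PSI is undefined when no read supports either variant
  ifelse(total == 0, NA_real_, inclusion / total)
}

# Made-up counts for three samples
inclusion &lt;- c(N1 = 30, N2 = 45, T1 = 5)
exclusion &lt;- c(N1 = 10, N2 = 15, T1 = 45)
psi(inclusion, exclusion)
# N1 = 0.75, N2 = 0.75, T1 = 0.10
</code></pre>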
</div>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2890.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Exome sequencing workflow, as described in a paper</title>
		<link>http://www.bio-info-trainee.com/2838.html</link>
		<comments>http://www.bio-info-trainee.com/2838.html#comments</comments>
		<pubDate>Tue, 14 Nov 2017 07:11:34 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[cancer]]></category>
		<category><![CDATA[全外显子组软件]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2838</guid>
		<description><![CDATA[This post is only here as image hosting: I needed a web URL for this figure, nothing more! I. Quality control (fas &#8230; <a href="http://www.bio-info-trainee.com/2838.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>This post is only here as image hosting: I needed a web URL for this figure, nothing more!<span id="more-2838"></span></p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2017/11/wes-data-analysis-workflow.jpeg"><img class="alignnone size-full wp-image-2839" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/11/wes-data-analysis-workflow.jpeg" alt="wes-data-analysis-workflow" width="1638" height="1574" /></a></p>
<h3>I. Quality control (fastqc + toolkit)</h3>
<p>1. Data quality:</p>
<p>1) Base quality distribution</p>
<p>2) Read quality distribution</p>
<p>3) Read length distribution</p>
<p>4) GC content</p>
<p>&nbsp;</p>
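<p>2. Data filtering:</p>
<p>1) Number of raw reads</p>
<p>2) Number and proportion of reads with mean quality &gt; Q20</p>
<p>3) Number and proportion of reads with mean quality &gt; Q30</p>
<p>4) Discard reads in which more than 5% of the bases have quality &lt; Q20, then report the number and proportion of reads in the clean data.</p>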
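<p>The filter in item 4 can be sketched in base R as follows. I am assuming Phred+33 quality encoding; the helper names <code>low_quality_fraction</code> and <code>keep_read</code> and the two reads are made up for illustration, and this is not the code of any particular QC toolkit.</p>
<pre><code># Hypothetical helper: fraction of bases below a Phred threshold (assumes Phred+33 encoding)
low_quality_fraction &lt;- function(quality, threshold = 20) {
  phred &lt;- utf8ToInt(quality) - 33L
  mean(phred &lt; threshold)
}

# Keep a read only if at most 5% of its bases fall below Q20
keep_read &lt;- function(quality, max_fraction = 0.05) {
  low_quality_fraction(quality) &lt;= max_fraction
}

# Made-up 40-base reads: "I" encodes Q40 and "#" encodes Q2
good &lt;- strrep("I", 40)
bad &lt;- paste0(strrep("I", 36), strrep("#", 4))  # 10% of bases below Q20
c(good = keep_read(good), bad = keep_read(bad))
# good = TRUE, bad = FALSE
</code></pre>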
<p>&nbsp;</p>
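<h3>II. Alignment (bwa)</h3>
<p>1) Number of reads mapped to the genome and their proportion of the total</p>
<p>2) Number of perfectly matched reads</p>
<p>3) Number of reads mapped to each chromosome</p>
<p>4) Coverage depth along each chromosome</p>
<p>5) Number of reads falling in the target regions (exons)</p>
<p>6) Number of reads falling in the target regions ±100 bp</p>
<p>7) Per-base coverage depth in the target regions</p>
<p>8) Proportion of target-region bases that are covered</p>
<p>9) Proportion of target-region bases covered at each depth (50X, 100X, 150X, 200X, ...)</p>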
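<p>Items 7 to 9 all derive from one vector of per-base depths over the target regions. A small base-R sketch, with a made-up depth vector and a hypothetical helper name <code>coverage_at</code>:</p>
<pre><code># Hypothetical helper: share of target bases covered at or above each depth threshold
coverage_at &lt;- function(depth, thresholds = c(1, 50, 100, 150, 200)) {
  frac &lt;- vapply(thresholds, function(min_depth) mean(depth &gt;= min_depth), numeric(1))
  names(frac) &lt;- paste0(thresholds, "X")
  frac
}

# Made-up per-base depths for ten target bases
depth &lt;- c(0, 12, 48, 55, 90, 110, 160, 180, 210, 250)
mean(depth)         # mean depth over the target: 111.5
coverage_at(depth)  # 1X = 0.9, 50X = 0.7, 100X = 0.5, 150X = 0.4, 200X = 0.2
</code></pre>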
<p>&nbsp;</p>
<h3>III. Finding SNVs (samtools + picard + gatk + varscan)</h3>
<p>1) picard: sam &gt; sort.bam</p>
<p>2) gatk: sort.bam &gt; sort.dedup.bam (duplicate removal)</p>
<p>3) gatk: sort.dedup.bam &gt; realign.bam (realignment, correcting around indels and SNPs)</p>
<p>4) gatk: base quality score recalibration (not performed)</p>
<p>5) varscan: call SNVs</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
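<h3>IV. Variant annotation</h3>
<p>1) Annotation with annovar.</p>
<p>2) Summary statistics of the annotation (synonymous and nonsynonymous variants, variants upstream or downstream of genes, in introns, in exons, and so on)</p>
<p>3) dbsnp annotation (whether each SNP found is already in the dbsnp database)</p>
<p>4) cosmic63: cancer-associated mutations</p>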
<p>&nbsp;</p>
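<h3>V. Variant analysis</h3>
<p>1) Distribution of SNVs across the chromosomes</p>
<p>2) Distribution of SNVs across genes</p>
<p>3) Functional analysis of genes carrying many SNV sites (KEGG pathway analysis and GO functional enrichment)</p>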
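<p>The counting in items 1 to 3 is simple once the annotated SNVs are in a table. A base-R sketch with a made-up table; the gene names are placeholders, and the cutoff of 2 SNVs per gene is my arbitrary choice for the example:</p>
<pre><code># Made-up annotated SNV calls; GENE_A, GENE_B and GENE_C are placeholders
snv &lt;- data.frame(
  chrom = c("chr1", "chr1", "chr2", "chr17", "chr17", "chr17"),
  gene = c("GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C", "GENE_C"),
  stringsAsFactors = FALSE
)

per_chrom &lt;- table(snv$chrom)                         # SNVs per chromosome
per_gene &lt;- sort(table(snv$gene), decreasing = TRUE)  # SNVs per gene, most mutated first

# Genes with at least 2 SNVs become the input list for KEGG / GO enrichment
candidates &lt;- names(per_gene)[per_gene &gt;= 2]
candidates
# "GENE_C" "GENE_A"
</code></pre>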
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2838.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analysis of 850K methylation array data</title>
		<link>http://www.bio-info-trainee.com/2823.html</link>
		<comments>http://www.bio-info-trainee.com/2823.html#comments</comments>
		<pubDate>Wed, 08 Nov 2017 02:53:14 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[生信组学技术]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2823</guid>
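		<description><![CDATA[WeChat official-account article; the author is 任云晓 (Ren Yunxiao) of the Beijing Institute of Genomics. This post came about because I saw that 生信技能树 has a 450K methylation array &#8230; <a href="http://www.bio-info-trainee.com/2823.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<h1 class="md-end-block md-heading">WeChat official-account article; the author is 任云晓 (Ren Yunxiao) of the Beijing Institute of Genomics</h1>
<h1 class="md-end-block md-heading">This post came about because I saw that 生信技能树 has a <span class=""><a spellcheck="false" href="http://www.biotrainee.com/thread-2042-1-1.html">portal for 450K methylation array data processing</a></span>, and I happened to have analyzed 850K methylation array data not long ago with ChAMP, a highly integrated package. So I decided to tidy up my notes and share them, to learn and exchange ideas with more people. Another reason is that this was the first task I took on when I decided to learn bioinformatics in April; it took several months of twists and turns to get the pipeline running, I hit quite a few pitfalls along the way, and I wanted to record them.</h1>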
<p><span id="more-2823"></span></p>
<p><span class="md-line md-end-block">For my analysis I followed the <span class=""><a spellcheck="false" href="https://bioconductor.org/packages/release/bioc/vignettes/ChAMP/inst/doc/ChAMP.html">original ChAMP documentation</a></span>, which walks through the whole workflow in great detail. When these notes were almost finished, though, I found that the package author had also written an introduction on their <span class=""><a spellcheck="false" href="http://blog.csdn.net/joshua_hit/article/details/54982018">blog</a></span>. The blog post is less formal than the official documentation and contains many plain-spoken explanations and additions, as well as some of the author's deeper reflections. Reading it made me realize that my understanding of many of the analyses was still shallow. So if you want to learn methylation array analysis, rely mainly on the official documentation and the author's blog, and treat these notes only as an extra reference.</span></p>
<p><span class="md-line md-end-block">ChAMP documentation: <span spellcheck="false"><a href="https://bioconductor.org/packages/release/bioc/vignettes/ChAMP/inst/doc/ChAMP.html">https://bioconductor.org/packages/release/bioc/vignettes/ChAMP/inst/doc/ChAMP.html</a></span></span></p>
<p><span class="md-line md-end-block">The author's blog: <span spellcheck="false"><a href="http://blog.csdn.net/joshua_hit/article/details/54982018">http://blog.csdn.net/joshua_hit/article/details/54982018</a></span></span></p>
<p><span class="md-line md-end-block">ChAMP on GitHub: <span spellcheck="false"><a href="https://github.com/Bioconductor-mirror/ChAMP/search?utf8=%E2%9C%93&amp;q=ChAMP&amp;type=">https://github.com/Bioconductor-mirror/ChAMP/search?utf8=%E2%9C%93&amp;q=ChAMP&amp;type=</a></span></span></p>
<div class="md-hr md-end-block" tabindex="-1" contenteditable="false">
<hr />
</div>
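<blockquote><p><span class="md-line md-end-block">Illumina methylation arrays are still the first choice for methylation projects in many labs; for large-sample studies in particular they are very cost-effective, and they are still widely used clinically. The arrays have gone through the 27K, 450K and 850K generations (27K, 450K and 850K refer to the number of CpG methylation sites that can be measured). Most of the data accumulated so far comes from the 450K array, and 850K may become the mainstream methylation array from here on. The original poster previously wrote a <span class=""><a spellcheck="false" href="http://www.biotrainee.com/thread-237-1-1.html">post on preprocessing 450K arrays</a></span> that covers the basics of this array in detail, with a flowchart and code; it is worth reading first. The usual processing workflow is: <span class=""><strong>data import → data filtering → data correction → downstream analysis</strong></span>.</span></p>
<p><span class="md-line md-end-block">One way to process the data is GenomeStudio (software developed by Illumina), but that only suits small sample sets; the other is to use R packages such as lumi, minfi, wateRmelon and ChAMP.</span></p>
<p><span class="md-line md-end-block">Compared with sequencing, array processing is not very demanding on compute resources. The main tool is R, but R is memory-hungry, especially when processing large batches of data.</span></p></blockquote>
<h2 class="md-end-block md-heading">Step 1: Background knowledge</h2>
<p><span class="md-line md-end-block">Before the actual analysis, and drawing on the assignment, I pulled together the basics of methylation and of the arrays.</span></p>
<h3 class="md-end-block md-heading">How Illumina methylation arrays work and how the probes are designed (type I and type II probes)</h3>
<blockquote><p><span class="md-line md-end-block">Principle: in short, the array detects hybridization signal from bisulfite-treated DNA. Bisulfite treatment is the “gold standard” for methylation detection: whether for arrays or for methylation sequencing, the DNA sample is first treated with bisulfite, which converts unmethylated C to U and leaves methylated C unchanged, so that the two can be told apart in the subsequent sequencing or hybridization.</span></p>
<p><span class="md-line md-end-block">The 450K and 850K arrays measure methylation with two probe designs, Infinium Ⅰ and Infinium Ⅱ. Infinium I uses two bead types (methylated, M, and unmethylated, U, as shown in the figure), whereas Infinium II uses a single bead type (methylated and unmethylated together). This is also why fluorescence detection differs between them downstream; the 450K array uses two fluorescence signals (red and green) (Figure 1).</span></p></blockquote>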
<p><span class="md-line md-end-block"><span class="md-image md-img-loaded" contenteditable="false" data-src="http://oxrpzhg00.bkt.clouddn.com/Illumina%20methylation%20workflow.png"><img style="box-sizing: border-box; border-width: 0px 4px 0px 2px; border-right-style: solid; border-left-style: solid; border-right-color: transparent; border-left-color: transparent; vertical-align: middle; max-width: 100%; cursor: default;" src="http://oxrpzhg00.bkt.clouddn.com/Illumina%20methylation%20workflow.png" alt="" /></span></span></p>
<p><span class="md-line md-end-block"><span class=""><strong>Figure 1 <span class=""><a spellcheck="false" href="https://en.wikipedia.org/wiki/Illumina_Methylation_Assay">Illumina Methylation Assay</a></span></strong></span> </span></p>
<h3 class="md-end-block md-heading">Methylation overview:</h3>
<p><span class="md-line md-end-block">DNA methylation is regarded as a mode of epigenetic regulation. Cytosine methylation (5-mC) is the most studied form and is considered the common form of methylation in mammals. Recent studies have also found other forms; for example, a 2016 Nature paper reported <span class=""><a spellcheck="false" href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4977844/">m6A (N6-methyladenine) methylation in mouse embryonic stem cells</a></span><span class="">. DNA methylation is thought to play an important role in gene expression, chromatin remodeling, cell differentiation and disease (Figure 2).</span></span></p>
<p><span class="md-line md-end-block"><span class="md-image md-img-loaded" contenteditable="false" data-src="http://oxrpzhg00.bkt.clouddn.com/Perturbation%20of%20methylation.png"><img style="box-sizing: border-box; border-width: 0px 4px 0px 2px; border-right-style: solid; border-left-style: solid; border-right-color: transparent; border-left-color: transparent; vertical-align: middle; max-width: 100%; cursor: default;" src="http://oxrpzhg00.bkt.clouddn.com/Perturbation%20of%20methylation.png" alt="" /></span></span></p>
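<p><span class="md-line md-end-block"><span class=""><strong>Figure 2 <span class=""><a spellcheck="false" href="https://www.illumina.com/content/dam/illumina-marketing/documents/products/other/field_guide_methylation.pdf">Methylation and disease, with the related terminology</a></span></strong></span></span></p>
<h3 class="md-end-block md-heading"><span class="">Methylation detection methods:</span></h3>
<p><span class="md-line md-end-block"><span class="">Current methylation detection methods fall into three groups: arrays, sequencing and immunoprecipitation. Which one to choose depends mainly on the goal of the experiment and what the lab has available. For now, taking coverage, sensitivity and price together, methylation arrays are still relatively cost-effective (Figure 3).</span></span></p>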
<p><span class="md-line md-end-block"><span class="md-image md-img-loaded" contenteditable="false" data-src="http://oxrpzhg00.bkt.clouddn.com/Methods%20of%20detect%20cytosine%20methylation.png"><img style="box-sizing: border-box; border-width: 0px 4px 0px 2px; border-right-style: solid; border-left-style: solid; border-right-color: transparent; border-left-color: transparent; vertical-align: middle; max-width: 100%; cursor: default;" src="http://oxrpzhg00.bkt.clouddn.com/Methods%20of%20detect%20cytosine%20methylation.png" alt="" /></span></span></p>
<p><span class="md-line md-end-block"><span class=""><strong>Figure 3 <span class=""><a spellcheck="false" href="https://www.illumina.com/content/dam/illumina-marketing/documents/products/other/field_guide_methylation.pdf">Comparison of methylation detection methods</a></span></strong></span></span></p>
<h3 class="md-end-block md-heading">Common glossary for methylation arrays:</h3>
<p><span class="md-line md-end-block"><span class=""><strong>CpG island:</strong></span> Defined as regions &gt; 500 bp with &gt; 55% GC and an observed/expected CpG ratio of &gt; 0.65. 40% of gene promoters contain islands.</span></p>
<p><span class="md-line md-end-block"><span class=""><strong>CpG shelves:</strong></span> ~4Kb from islands.</span></p>
<p><span class="md-line md-end-block"><span class=""><strong>CpG shores:</strong></span> ~2Kb from islands, &gt; 75% of tissue-specific differentially methylated regions found in shores. Methylation in shores shows higher correlation with gene expression than CpG islands.</span></p>
<p><span class="md-line md-end-block"><span class=""><strong>Differentially methylated regions (DMR):</strong></span> Cell-, tissue-, and condition-specific differences in methylation.</span></p>
<p><span class="md-line md-end-block"><span class=""><strong>Enhancer:</strong></span> A short region of DNA that can activate transcription and is often regulated by methylation.</span></p>
<p><span class="md-line md-end-block"><span class=""><strong>Hypermethylation:</strong></span> Most cytosines are methylated.</span><span class="md-line md-end-block"><span class=""><strong>Hypomethylation:</strong></span> Most cytosines do not have 5-mC. Euchromatin and active gene promoters are hypomethylated.</span></p>
<p><span class="md-line md-end-block"><span class=""><strong>Beta value:</strong></span> The usual measure of methylation is the "Beta" value. It equals the percent methylation and is defined as "Meth" divided by "Meth + Unmeth".</span></p>
<p><span class="md-line md-end-block"><span class=""><strong>CGI:</strong></span> CpG island.</span></p>
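<p><span class="md-line md-end-block">As a rough illustration of the Beta value defined in the glossary, the sketch below computes Meth / (Meth + Unmeth) for a single probe in shell. The function name beta_value and the example intensities are made up for illustration. Illumina's own formula commonly adds a small constant offset to the denominator; this sketch leaves it out to match the definition given here.</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">#!/bin/bash
# Hypothetical helper: Beta = Meth / (Meth + Unmeth), as defined in the glossary.
# Arguments: methylated intensity, unmethylated intensity.
beta_value() {
    awk -v m="$1" -v u="$2" 'BEGIN {
        if (m + u == 0) { print "NA"; exit }   # no signal: Beta is undefined
        printf "%.3f\n", m / (m + u)
    }'
}

beta_value 300 100    # 300 / (300 + 100), prints 0.750</pre>
<p><span class="md-line md-end-block">A Beta value near 0 means the site is mostly unmethylated and a value near 1 means it is mostly methylated.</span></p>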
<div class="md-hr md-end-block" tabindex="-1" contenteditable="false">
<hr />
</div>
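<p><span class="md-line md-end-block">Because the data I have on hand are 850K methylation data and ChAMP is the only package I have used before, I use ChAMP here to walk through 850K methylation analysis. ChAMP is a highly integrated package with two analysis pipelines, 450K and EPIC (commonly called 850K). It covers data loading, normalization, batch correction, differential methylation and enrichment analysis (Figure 4).</span></p>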
<p><span class="md-line md-end-block">​ <span class="md-image md-img-loaded" contenteditable="false" data-src="http://oxrpzhg00.bkt.clouddn.com/ChAMP%20Pipeline.png"><img style="box-sizing: border-box; border-width: 0px 4px 0px 2px; border-right-style: solid; border-left-style: solid; border-right-color: transparent; border-left-color: transparent; vertical-align: middle; max-width: 100%; cursor: default;" src="http://oxrpzhg00.bkt.clouddn.com/ChAMP%20Pipeline.png" alt="" /></span> </span></p>
<p><span class="md-line md-end-block"><span class=""><strong>Figure 4 <span class=""><a spellcheck="false" href="https://bioconductor.org/packages/release/bioc/vignettes/ChAMP/inst/doc/ChAMP.html">ChAMP Pipeline</a></span></strong></span></span></p>
<h2 class="md-end-block md-heading">Step 2: Preparing computing resources</h2>
<blockquote><p><span class="md-line md-end-block">Assignment 1</span><span class="md-line md-end-block">Install R and the required packages, download the package manuals, and collect links to their official sites.</span></p></blockquote>
<p><span class="md-line md-end-block">R really is memory-hungry. I have 28 samples (14 controls, 14 cases), and on my previous computer with 4 GB of RAM the local analysis always froze partway through. So use a machine with more memory, or install R and RStudio on a server. Installing RStudio is recommended because ChAMP has many GUI functions that work best in RStudio; a Linux system with X11 also works.</span></p>
<h3 class="md-end-block md-heading">Installing the software:</h3>
<p><span class="md-line md-end-block">Installing R and RStudio locally is simple: download them from their official sites. Just make sure the installation path contains no Chinese characters, and install R before RStudio.</span></p>
<p><span class="md-line md-end-block">Once RStudio Server is installed, open <a href="%E6%9C%8D%E5%8A%A1%E5%99%A8IP:8787">server-IP:8787</a> in a browser. The username and password are those of your Linux account.</span></p>
<p><span class="md-line md-end-block">For detailed installation steps, see this article by Chen of 生信宝典: <span spellcheck="false"><a href="http://www.biotrainee.com/thread-1808-1-1.html">http://www.biotrainee.com/thread-1808-1-1.html</a></span>.</span></p>
<h3 class="md-end-block md-heading">Installing the R package:</h3>
<p><span class="md-line md-end-block">Install the ChAMP package. The official vignette gives a detailed walkthrough (<span spellcheck="false"><a href="https://bioconductor.org/packages/release/bioc/vignettes/ChAMP/inst/doc/ChAMP.html">https://bioconductor.org/packages/release/bioc/vignettes/ChAMP/inst/doc/ChAMP.html</a></span>).</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false"># biocLite applies to older R releases; on R 3.5 or later use BiocManager::install("ChAMP")
source("https://bioconductor.org/biocLite.R")
biocLite("ChAMP")</pre>
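<p><span class="md-line md-end-block"><span class=""><strong>NOTE:</strong></span> ChAMP has many dependencies. If installation fails because a package is missing, install it with biocLite("YourErrorPackage") and retry; it may take 3-4 rounds before the installation succeeds.</span></p>
<h3 class="md-end-block md-heading">Loading ChAMP and testing it:</h3>
<p><span class="md-line md-end-block">After loading ChAMP, load the test dataset that matches your array (450K or 850K) and run through the pipeline to confirm that the package works. More importantly, read the package documentation and understand what each step is for. The documentation is detailed and worth reading in the original; the long-winded notes below are mostly taken from the official vignette (<span spellcheck="false"><a href="https://bioconductor.org/packages/release/bioc/vignettes/ChAMP/inst/doc/ChAMP.html">https://bioconductor.org/packages/release/bioc/vignettes/ChAMP/inst/doc/ChAMP.html</a></span>).</span></p>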
<pre class="md-fences md-end-block" lang="" contenteditable="false">library("ChAMP")
# Load the 450K test data:
testDir=system.file("extdata",package="ChAMPdata")
myLoad &lt;- champ.load(testDir,arraytype="450K")
# Load the 850K (EPIC) test data:
data(EPICSimData)</pre>
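<h2 class="md-end-block md-heading">Step 3: Loading the data</h2>
<p><span class="md-line md-end-block">Loading the data is where most of the pitfalls are. First, raw data from both the 450K and 850K arrays come in IDAT format. Because the array is read in two colors, each sample has two files, usually ending in Grn.idat and Red.idat. Loading also requires a <span class=""><strong>Sample_Sheet.csv</strong></span> file (Figure 5), also called the pd file. This file is important because it holds the sample information. Compare the test data's CSV with your own and fill in anything that is missing. In particular, check that the <span class=""><strong>Sample_Group</strong></span> column is present; it holds the phenotype you want to compare, for example tumor versus adjacent normal tissue. Another hidden pitfall I ran into is <span class=""><strong>Sentrix_ID</strong></span>. These values are long digit strings, which Excel may display in scientific notation, so IDs that originally differed only in the last two digits become identical and loading fails with a duplicate-value error. Always check the long digit strings; if they are wrong, re-enter them with the cells formatted as text, or prefix each value with an apostrophe ('). Once the CSV is ready, put it in the same folder as the IDAT files for all samples, and the data will load normally.</span></p>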
<p><span class="md-line md-end-block"><span class="md-image md-img-loaded" contenteditable="false" data-src="http://oxrpzhg00.bkt.clouddn.com/pd_csv.png"><img style="box-sizing: border-box; border-width: 0px 4px 0px 2px; border-right-style: solid; border-left-style: solid; border-right-color: transparent; border-left-color: transparent; vertical-align: middle; max-width: 100%; cursor: default;" src="http://oxrpzhg00.bkt.clouddn.com/pd_csv.png" alt="" /></span></span></p>
<p><span class="md-line md-end-block"><span class=""><strong>Figure 5 Sample_Sheet.csv file</strong></span></span></p>
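<p><span class="md-line md-end-block">The Sentrix_ID pitfall can be checked from the command line before the sheet is passed to champ.load(). The following is a minimal sketch, not part of ChAMP: the function name check_sentrix_ids is hypothetical, and it assumes a plain comma-separated file with unquoted fields whose column-header row contains Sentrix_ID.</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">#!/bin/bash
# Hypothetical pre-flight check for a sample sheet.
# Prints Sentrix_ID values that Excel has rewritten in scientific
# notation (for example 2.00E+11). Prints nothing if all IDs look intact.
check_sentrix_ids() {
    local sheet="$1" header col
    # Find the column-header row, then the position of Sentrix_ID within it.
    header=$(grep -m1 'Sentrix_ID' "$sheet")
    col=$(printf '%s\n' "$header" | tr -d '\r' | tr ',' '\n' | grep -n -x 'Sentrix_ID' | cut -d: -f1)
    if [ -z "$col" ]; then
        return 2    # no Sentrix_ID column in this file
    fi
    # -s skips lines without any comma (such as section markers).
    cut -s -d, -f"$col" "$sheet" | grep -E -i '^[0-9.]+E\+[0-9]+'
}

# Usage: check_sentrix_ids Sample_Sheet.csv</pre>
<p><span class="md-line md-end-block">Any value this prints has already lost digits, so it has to be restored from the original chip barcode rather than reformatted.</span></p>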
<pre class="md-fences md-end-block" lang="" contenteditable="false">library("ChAMP")
myLoad &lt;- champ.load("F:/850K Methylation Chip/biotree_850K/methy_rawData",arraytype = "EPIC")
save(myLoad,file="myLoad.rda")</pre>
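<p><span class="md-line md-end-block">champ.load() wraps champ.import() and champ.filter(). By default it filters out probes with a detection p-value &gt; 0.01; probes with beadcount &lt; 3 in at least 5% of samples; non-CpG (NoCG) probes; probes with SNPs; multi-hit probes; and probes located on chromosomes X and Y.</span></p>
<p><span class="md-line md-end-block">After loading the data, save the result so that later sessions can reload it quickly instead of re-reading the raw files.</span></p>
<h2 class="md-end-block md-heading">Step 4: Quality control and normalization</h2>
<h3 class="md-end-block md-heading">CpG overview:</h3>
<p><span class="md-line md-end-block">Before QC, it helps to look at how the CpGs are distributed: across chromosomes; relative to CpG islands (open sea, shelf, shore; see Figure 2 for what these mean); across UTR and TSS regions; and across type I and type II probes (Figure 6). This information helps with the later DMP analysis.</span></p>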
<pre class="md-fences md-end-block" lang="" contenteditable="false">CpG.GUI(arraytype="EPIC")</pre>
<p><span class="md-line md-end-block"><span class="md-image md-img-loaded" contenteditable="false" data-src="http://oxrpzhg00.bkt.clouddn.com/CpG%20overview.png"><img style="box-sizing: border-box; border-width: 0px 4px 0px 2px; border-right-style: solid; border-left-style: solid; border-right-color: transparent; border-left-color: transparent; vertical-align: middle; max-width: 100%; cursor: default;" src="http://oxrpzhg00.bkt.clouddn.com/CpG%20overview.png" alt="" /></span></span></p>
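<p><span class="md-line md-end-block"><span class=""><strong>Figure 6 CpG Overview</strong></span></span></p>
<h3 class="md-end-block md-heading">Quality control:</h3>
<p><span class="md-line md-end-block">QC comes next, and there are two ways to do it: champ.QC() and QC.GUI(). champ.QC() produces three kinds of plots (an MDS dot plot, a beta distribution plot and a clustering dendrogram) as PDF files. QC.GUI() produces five, adding a type I/type II probe plot and a heatmap (Figure 7). All GUI functions are memory-intensive and produce interactive web-based plots. Each plot has a save button in its top-right corner; remember to add the .png extension to the file name when saving.</span></p>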
<pre class="md-fences md-end-block" lang="" contenteditable="false">#champ.QC()
QC.GUI(arraytype="EPIC")</pre>
<p><span class="md-line md-end-block"><span class="md-image md-img-loaded" contenteditable="false" data-src="http://oxrpzhg00.bkt.clouddn.com/QC.png"><img style="box-sizing: border-box; border-width: 0px 4px 0px 2px; border-right-style: solid; border-left-style: solid; border-right-color: transparent; border-left-color: transparent; vertical-align: middle; max-width: 100%; cursor: default;" src="http://oxrpzhg00.bkt.clouddn.com/QC.png" alt="" /></span></span></p>
<p><span class="md-line md-end-block"><span class=""><strong>Figure 7 QC Overview</strong></span></span></p>
<h3 class="md-end-block md-heading">Normalization:</h3>
<p><span class="md-line md-end-block">champ.norm() offers four methods: BMIQ, SWAN, PBC and FunctionalNormalization. BMIQ is the default and is said to work somewhat better for 850K data, so I used BMIQ and did not try the others.</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">myNorm &lt;- champ.norm(arraytype="EPIC")
QC.GUI(myNorm,arraytype="EPIC")
save(myNorm,file="myNorm.rda")</pre>
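<h3 class="md-end-block md-heading">SVD plot and batch effects:</h3>
<p><span class="md-line md-end-block">SVD (singular value decomposition) is used here to assess the main components of variation in the dataset. These components may reflect the biological factors you care about, or technical sources of variation, known as batch effects (Figure 8). If batch effects are present, correct for them (ChAMP provides champ.runCombat() for this) and then look at the SVD plot again.</span></p>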
<pre class="md-fences md-end-block" lang="" contenteditable="false">champ.SVD()</pre>
<p><span class="md-line md-end-block"><span class="md-image md-img-loaded" contenteditable="false" data-src="http://oxrpzhg00.bkt.clouddn.com/SVD.png"><img style="box-sizing: border-box; border-width: 0px 4px 0px 2px; border-right-style: solid; border-left-style: solid; border-right-color: transparent; border-left-color: transparent; vertical-align: middle; max-width: 100%; cursor: default;" src="http://oxrpzhg00.bkt.clouddn.com/SVD.png" alt="" /></span></span></p>
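<p><span class="md-line md-end-block"><span class=""><strong>Figure 8 SVD Plot</strong></span></span></p>
<h2 class="md-end-block md-heading">Step 5: Differential methylation analysis (DMP &amp; DMR &amp; DMB)</h2>
<p><span class="md-line md-end-block">Differential analysis is part of most studies. There are three approaches here: DMP, DMR and DMB. DMP finds Differential Methylation Probes (differentially methylated CpG sites), DMR finds Differential Methylation Regions (differentially methylated CpG regions), and DMB finds Differential Methylation Blocks (differentially methylated regions on a larger scale).</span></p>
<blockquote><p><span class="md-line md-end-block">Put simply, DMP finds differentially methylated CpG sites one by one, while a DMR is a fairly long, contiguous stretch of differential methylation. The thinking is that such contiguous stretches have a clearer effect on genes. Looking only for them makes the computational analysis more targeted and also yields a shorter list of results, which is easier for experimentalists to screen. DMB is a similar idea. It arose because some researchers felt that DMRs were still not significant enough: methylation changes on DNA may stretch across thousands of sites, and only in regions outside genes, yet changes in those regions can still alter gene expression. Picture Hebei, the province surrounding Beijing, smelting steel at full tilt and Beijing ending up in smog as a result; that is roughly the idea.</span></p></blockquote>
<p><span class="md-line md-end-block">The DMP, DMR and DMB results are all shown in interactive Shiny pages. At the top of the left panel are the P-value and abs(logFC) settings: choose the thresholds you want and click Submit, and the right panel shows the differential methylation table, a heatmap and the feature &amp; cgi plots. Lower in the left panel are the gene and CpG fields: choose what you want to see, click Submit, and the right panel shows the corresponding gene or CpG results (Figure 9).</span></p>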
<pre class="md-fences md-end-block" lang="" contenteditable="false">myDMP &lt;- champ.DMP(arraytype="EPIC")
save(myDMP,file="myDMP.rda")
DMP.GUI()
myDMR &lt;- champ.DMR(arraytype = "EPIC",method="DMRcate",cores=1)
save(myDMR,file="myDMR.rda")
DMR.GUI(arraytype="EPIC")
#myBlock &lt;- champ.Block(arraytype = "EPIC")
#Block.GUI(arraytype="EPIC",compare.group=c("PrEC_cells","LNCaP_cells"))</pre>
<p><span class="md-line md-end-block"><span class="md-image md-img-loaded" contenteditable="false" data-src="http://oxrpzhg00.bkt.clouddn.com/DMP.png"><img style="box-sizing: border-box; border-width: 0px 4px 0px 2px; border-right-style: solid; border-left-style: solid; border-right-color: transparent; border-left-color: transparent; vertical-align: middle; max-width: 100%; cursor: default;" src="http://oxrpzhg00.bkt.clouddn.com/DMP.png" alt="" /></span></span></p>
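<p><span class="md-line md-end-block"><span class=""><strong>Figure 9 DMP Overview</strong></span></span></p>
<h2 class="md-end-block md-heading">Step 6: Gene set enrichment and network analysis (GSEA &amp; EpiMod)</h2>
<p><span class="md-line md-end-block">After differential methylation analysis, you may want to know whether the genes involved in the DMPs and DMRs are enriched for particular biological functions or pathways. GSEA (Gene Set Enrichment Analysis) addresses this question, and EpiMod (Differential Methylated Interaction Hotspots) looks for disease-associated sub-networks within an interaction network (Figure 10).</span></p>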
<pre class="md-fences md-end-block" lang="" contenteditable="false">myGSEA &lt;- champ.GSEA(arraytype = "EPIC")
save(myGSEA,file="myGSEA.rda")
​
myEpiMod &lt;- champ.EpiMod(arraytype="EPIC")
save(myEpiMod,file="myEpiMod.rda")</pre>
<p><span class="md-line md-end-block"><span class="md-image md-img-loaded" contenteditable="false" data-src="http://oxrpzhg00.bkt.clouddn.com/EPI.png"><img style="box-sizing: border-box; border-width: 0px 4px 0px 2px; border-right-style: solid; border-left-style: solid; border-right-color: transparent; border-left-color: transparent; vertical-align: middle; max-width: 100%; cursor: default;" src="http://oxrpzhg00.bkt.clouddn.com/EPI.png" alt="" /></span></span></p>
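<p><span class="md-line md-end-block"><span class=""><strong>Figure 10 EpiMod</strong></span></span></p>
<h2 class="md-end-block md-heading">Step 7: Copy number alteration analysis (CNA)</h2>
<p><span class="md-line md-end-block">Copy number alteration means that some genomic segments are present in too many or too few copies, which can contribute to disease. The authors of this function regard it as somewhat rough, and its precision is still limited. I tried running it and it took a very long time (Figure 11).</span></p>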
<pre class="md-fences md-end-block" lang="" contenteditable="false">myCNA &lt;- champ.CNA(control = F,arraytype = "EPIC")
save(myCNA,file="myCNA.rda")</pre>
<p><span class="md-line md-end-block"> <span class="md-image md-img-loaded" contenteditable="false" data-src="http://oxrpzhg00.bkt.clouddn.com/CNA.png"><img style="box-sizing: border-box; border-width: 0px 4px 0px 2px; border-right-style: solid; border-left-style: solid; border-right-color: transparent; border-left-color: transparent; vertical-align: middle; max-width: 100%; cursor: default;" src="http://oxrpzhg00.bkt.clouddn.com/CNA.png" alt="" /></span> </span></p>
<p><span class="md-line md-end-block"><span class=""><strong>Figure 11 Frequency Plot of Cancer Sample</strong></span></span></p>
<div class="md-hr md-end-block" tabindex="-1" contenteditable="false">
<hr />
</div>
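<p><span class="md-line md-end-block"><span class=""><strong>Summary:</strong></span> When analyzing 450K or 850K methylation data with ChAMP, first, use a machine with plenty of memory; second, when importing the raw data, check the format of the CSV file and keep it in the same folder as the IDAT files. The rest of the pipeline rarely runs into bugs. What matters most is understanding what each step does, so that you can extract what you need from the results.</span></p>
<p><span class="md-line md-end-block">PS: the public data provided for this assignment include IDAT files and a CSV file, but that CSV is very different from mine, and I do not really understand what it is or what it is for.</span></p>
<h4 class="md-end-block md-heading">Corrections and additions are welcome.</h4>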
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2823.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A hands-on plant transcriptome project</title>
		<link>http://www.bio-info-trainee.com/2809.html</link>
		<comments>http://www.bio-info-trainee.com/2809.html#comments</comments>
		<pubDate>Thu, 02 Nov 2017 02:29:11 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[未分类]]></category>
		<category><![CDATA[转录组软件]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2809</guid>
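		<description><![CDATA[This plant is actually Arabidopsis thaliana, so the data processing is much the same as for human studies. Transcriptome. Transcriptome sequencing studies &#8230; <a href="http://www.bio-info-trainee.com/2809.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>This plant is actually Arabidopsis thaliana, so the data processing is much the same as for human studies.</p>
<h2 class="md-end-block md-heading">Transcriptome</h2>
<p><span class="md-line md-end-block">Transcriptome sequencing studies <span class=""><strong>the sum of all RNA</strong></span> transcribed by a given cell in a given functional state, including mRNA and non-coding RNA. It gives a comprehensive view of the transcripts in a specific tissue or organ of a species, which supports studies of transcript structure, sequence variation and <span class=""><strong>gene expression levels</strong></span>, as well as the discovery of novel transcripts.</span><span id="more-2809"></span></p>
<p><span class="md-line md-end-block">Of these, gene expression profiling is the <span class=""><strong>most popular</strong></span> direction in the field; identifying transcripts and quantifying expression is the core use of transcriptome data. Because of this, it can stand alone as a product, RNA-seq, without relying on other omics data. This is why <span class=""><strong>transcriptome sequencing</strong></span> and <span class=""><strong>RNA-seq</strong></span> are so often used interchangeably.</span></p>
<p><span class="md-line md-end-block">RNA-seq data are now <span class=""><strong>widely used</strong></span>, but no single pipeline solves every problem. The key steps in RNA-seq analysis include: <span class=""><strong>experimental design, quality control, read alignment, expression quantification, visualization, differential expression, alternative splicing detection, functional annotation, fusion gene detection and eQTL mapping</strong></span>.</span></p>
<p>It is also worth mentioning this very well-written tutorial: https://github.com/twbattaglia/RNAseq-workflow</p>
<h2 class="md-end-block md-heading">Pipeline overview</h2>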
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2017/10/overview-of-RNA-seq-technology.png"><img class="alignnone size-full wp-image-2792" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/10/overview-of-RNA-seq-technology.png" alt="overview-of-rna-seq-technology" width="717" height="674" /></a></p>
<p><span class="md-line md-end-block">Source: <span class=""><a spellcheck="false" href="http://biocluster.ucr.edu/~rkaundal/workshops/R_mar2016/RNAseq.html">processing mRNA-seq data with R</a></span></span></p>
<p><span class="md-line md-end-block"><span class="md-image md-img-loaded" contenteditable="false" data-src="image/mRNAseq-workflow-2010.jpeg"><img style="box-sizing: border-box; border-width: 0px 4px 0px 2px; border-right-style: solid; border-left-style: solid; border-right-color: transparent; border-left-color: transparent; vertical-align: middle; max-width: 100%; cursor: default;" src="file:///Users/jimmy/Documents/github_jmzeng1314/bioinformatics123/ngs/image/mRNAseq-workflow-2010.jpeg?lastModify=1509589599" alt="" /></span></span></p>
<p><span class="md-line md-end-block">Source: a figure from the 2010 Genome Biology article <span class=""><a spellcheck="false" href="https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-12-220">From RNA-seq reads to differential expression results</a></span></span></p>
<h2 class="md-end-block md-heading">Source article for the data</h2>
<p><span class="md-line md-end-block">The data come from an article published in Nature Communications, "Temporal dynamics of gene expression and histone marks at the Arabidopsis shoot meristem during flowering". The study used RNA-seq to examine <span class=""><strong>changes in gene expression over time</strong></span> in the shoot meristem during flowering.</span></p>
<p><span class="md-line md-end-block">The original pipeline was: TopHat -&gt; SummarizeOverlaps -&gt; DESeq2 -&gt; AmiGO. </span><span class="md-line md-end-block">Reads were aligned to the TAIR10 ver.24 reference genome, with the ribosomal RNA regions masked (2:3471–9557; 3:14,197,350–14,203,988).</span></p>
<p><span class="md-line md-end-block">DESeq2 was run only on genes with FPKM &gt; 1 at one or more time points.</span></p>
<p><span class="md-line md-end-block">The data are deposited in ArrayExpress (<a href="http://www.ebi.ac.uk/arrayexpress/">http://www.ebi.ac.uk/arrayexpress/</a>) under accession E-MTAB-5130.</span></p>
<p><span class="md-line md-end-block">Experimental design: four time points (0, 1, 2, 3) with four biological replicates each, 16 samples in total.</span></p>
<h2 class="md-end-block md-heading">Data download</h2>
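<p><span class="md-line md-end-block">The two rRNA intervals quoted above are written as 1-based, inclusive coordinates. Tools that take a BED file (for example bedtools maskfasta) expect a 0-based start and an exclusive end instead. The sketch below shows that conversion; the function name region_to_bed is hypothetical, the input uses an ASCII hyphen, and how the original authors actually applied the mask is not stated in this post.</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">#!/bin/bash
# Hypothetical helper: convert a 1-based inclusive region such as
# "3:14,197,350-14,203,988" into a BED line (0-based start, exclusive end).
region_to_bed() {
    local region="${1//,/}"          # drop thousands separators
    local chrom="${region%%:*}"      # text before the colon
    local range="${region#*:}"       # text after the colon
    local start="${range%%-*}"
    local end="${range#*-}"
    printf '%s\t%d\t%d\n' "$chrom" "$((start - 1))" "$end"
}

region_to_bed "2:3471-9557"
region_to_bed "3:14,197,350-14,203,988"</pre>
<p><span class="md-line md-end-block">Only the start coordinate shifts by one; the end stays the same because a 1-based inclusive end and a 0-based exclusive end are the same number.</span></p>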
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">conda install <span class="cm-attribute">-c</span> bioconda salmon 
​
<span class="cm-builtin">wget</span> http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-5130/E-MTAB-5130.sdrf.txt
head <span class="cm-attribute">-n1</span> E-MTAB-5130.sdrf.txt | tr <span class="cm-string">'\t'</span> <span class="cm-string">'\n'</span> | nl | <span class="cm-builtin">grep</span> URI
tail <span class="cm-attribute">-n</span> <span class="cm-operator">+</span><span class="cm-number">2</span> E-MTAB-5130.sdrf.txt | <span class="cm-builtin">cut</span> <span class="cm-attribute">-f</span> <span class="cm-number">33</span> | xargs <span class="cm-attribute">-i</span> <span class="cm-builtin">wget</span> {}
​
​
nohup <span class="cm-builtin">wget</span> ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz &amp;
​
nohup <span class="cm-builtin">wget</span> ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.28.dna.genome.fa.gz &amp;
nohup <span class="cm-builtin">wget</span>  ftp://ftp.ensemblgenomes.org/pub/plants/release-28/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.28.gff3.gz &amp;
nohup <span class="cm-builtin">wget</span> ftp://ftp.ensemblgenomes.org/pub/plants/release-28/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.28.gtf.gz &amp;</pre>
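<p><span class="md-line md-end-block">The download commands above first look up the URI column by eye and then hard-code its position (33). A possible alternative is to select the column by its header name, so the command keeps working if the column order changes. This is a sketch only: tsv_column is a hypothetical helper, and the header name Comment[FASTQ_URI] is an assumption that should be checked against the actual SDRF file.</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">#!/bin/bash
# Hypothetical helper: print one column of a tab-separated file, selected by
# its exact header name instead of a hard-coded position.
tsv_column() {
    local file="$1" name="$2" col
    # -F treats the name literally (it may contain brackets), -x matches the whole field.
    col=$(head -n1 "$file" | tr '\t' '\n' | grep -n -x -F -- "$name" | head -n1 | cut -d: -f1)
    if [ -z "$col" ]; then
        return 1    # header name not present
    fi
    tail -n +2 "$file" | cut -f "$col"
}

# Usage (the column name is an assumption):
#   tsv_column E-MTAB-5130.sdrf.txt 'Comment[FASTQ_URI]' | xargs -r -n1 wget</pre>
<p><span class="md-line md-end-block">If the same header name appears more than once, only the first matching column is used.</span></p>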
<h2 class="md-end-block md-heading">Salmon workflow</h2>
<p><span class="md-line md-end-block">About the software: Some of the upstream quantification methods <span class=""><strong>(<span class=""><em>Salmon</em></span>, <span class=""><em>Sailfish</em></span>, <span class=""><em>kallisto</em></span>)</strong></span> are substantially faster and require less memory and disk usage compared to alignment-based methods that require creation and storage of BAM files.</span></p>
<p><span class="md-line md-end-block">Official site: <span spellcheck="false"><a href="https://combine-lab.github.io/salmon/">https://combine-lab.github.io/salmon/</a></span></span></p>
<p><span class="md-line md-end-block">First, build an index with Salmon:</span></p>
<ul class="ul-list" data-mark="-">
<li><span class="md-line md-end-block">salmon index -t Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz -i athal_index</span></li>
</ul>
<p><span class="md-line md-end-block">Building the index took 53 seconds. The resulting index folder looks like this:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">[jianmingzeng@jade salmon]$ ls -lh
total 19M
-rw-rw-r-- 1 jianmingzeng jianmingzeng  19M Oct 17 11:18 Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz
drwxrwxr-x 2 jianmingzeng jianmingzeng 4.0K Oct 17 11:54 athal_index
-rw-rw-r-- 1 jianmingzeng jianmingzeng  142 Oct 17 11:20 wget_cdna.sh
[jianmingzeng@jade salmon]$ ls -lh  athal_index/
total 1.1G
-rw-rw-r-- 1 jianmingzeng jianmingzeng 751M Oct 17 11:54 hash.bin
-rw-rw-r-- 1 jianmingzeng jianmingzeng  357 Oct 17 11:54 header.json
-rw-rw-r-- 1 jianmingzeng jianmingzeng  115 Oct 17 11:54 indexing.log
-rw-rw-r-- 1 jianmingzeng jianmingzeng  156 Oct 17 11:54 quasi_index.log
-rw-rw-r-- 1 jianmingzeng jianmingzeng   89 Oct 17 11:54 refInfo.json
-rw-rw-r-- 1 jianmingzeng jianmingzeng 7.8M Oct 17 11:53 rsd.bin
-rw-rw-r-- 1 jianmingzeng jianmingzeng 248M Oct 17 11:54 sa.bin
-rw-rw-r-- 1 jianmingzeng jianmingzeng  63M Oct 17 11:53 txpInfo.bin
-rw-rw-r-- 1 jianmingzeng jianmingzeng   96 Oct 17 11:54 versionInfo.json
[jianmingzeng@jade salmon]$</pre>
<p><span class="md-line md-end-block">Next, quantify all samples.</span></p>
<p><span class="md-line md-end-block">With 16 samples it is impractical to type the command for each one, so we write a script:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-meta">#! /bin/bash</span>
<span class="cm-def">index</span><span class="cm-operator">=</span>salmon/athal_index <span class="cm-comment">## path to the index folder</span>
<span class="cm-keyword">for</span> fn <span class="cm-keyword">in</span> ERR1698{194..209};
<span class="cm-keyword">do</span>
    <span class="cm-def">sample</span><span class="cm-operator">=</span><span class="cm-quote">`basename </span><span class="cm-def">${fn}</span><span class="cm-quote">`</span>
    <span class="cm-builtin">echo</span> <span class="cm-string">"Processing sample </span><span class="cm-def">${sample}</span><span class="cm-string">"</span>
    salmon quant <span class="cm-attribute">-i</span> <span class="cm-def">$index</span> <span class="cm-attribute">-l</span> A \
        <span class="cm-attribute">-1</span> <span class="cm-def">${sample}</span>_1.fastq.gz \
        <span class="cm-attribute">-2</span> <span class="cm-def">${sample}</span>_2.fastq.gz \
        <span class="cm-attribute">-p</span> <span class="cm-number">5</span> <span class="cm-attribute">-o</span> quants/<span class="cm-def">${sample}</span>_quant
<span class="cm-keyword">done</span></pre>
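<p><span class="md-line md-end-block">Since the loop above assumes that both mates of every run are present, a quick check before starting can save a failed batch. The following sketch lists any missing FASTQ files; missing_fastq is a hypothetical helper, and the file naming (run_1.fastq.gz, run_2.fastq.gz) simply follows the script above.</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">#!/bin/bash
# Hypothetical pre-check: list paired-end FASTQ files that are missing
# for runs ERR1698194..ERR1698209 in a directory (default: current directory).
missing_fastq() {
    local dir="${1:-.}" run mate
    for run in ERR1698{194..209}; do
        for mate in 1 2; do
            if [ ! -f "$dir/${run}_${mate}.fastq.gz" ]; then
                echo "${run}_${mate}.fastq.gz"
            fi
        done
    done
}

missing_fastq .    # prints nothing when all 32 files are present</pre>
<p><span class="md-line md-end-block">The brace expansion covers 16 runs, matching the 16 samples in the experimental design.</span></p>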
<h2 class="md-end-block md-heading">subread workflow</h2>
<p><span class="md-line md-end-block">Again the first step is to build an index, but here the FASTA file must be decompressed first.</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">gunzip Arabidopsis_thaliana.TAIR10.28.dna.genome.fa.gz
~/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/subread-buildindex -o athal_index   Arabidopsis_thaliana.TAIR10.28.dna.genome.fa</pre>
<p><span class="md-line md-end-block">This also takes less than a minute. The resulting index files are:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">117M Oct 17 11:21 Arabidopsis_thaliana.TAIR10.28.dna.genome.fa
 15M Oct 17 11:41 Arabidopsis_thaliana.TAIR10.28.gff3.gz
 29M Oct 17 12:19 athal_index.00.b.array
231M Oct 17 12:19 athal_index.00.b.tab
 314 Oct 17 12:19 athal_index.files
345K Oct 17 12:18 athal_index.log</pre>
<p><span class="md-line md-end-block">Alignment is again run in batch with a script:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-meta">#! /bin/bash</span>
<span class="cm-def">subjunc</span><span class="cm-operator">=</span><span class="cm-string">"/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/subjunc"</span>; 
<span class="cm-def">index</span><span class="cm-operator">=</span><span class="cm-string">'subread/athal_index'</span>;
<span class="cm-keyword">for</span> fn <span class="cm-keyword">in</span> ERR1698{194..209};
<span class="cm-keyword">do</span>
    <span class="cm-def">sample</span><span class="cm-operator">=</span><span class="cm-quote">`basename </span><span class="cm-def">${fn}</span><span class="cm-quote">`</span>
    <span class="cm-builtin">echo</span> <span class="cm-string">"Processing sample </span><span class="cm-def">${sample}</span><span class="cm-string">"</span> 
    <span class="cm-def">$subjunc</span> <span class="cm-attribute">-i</span> <span class="cm-def">$index</span> \
        <span class="cm-attribute">-r</span> <span class="cm-def">${sample}</span>_1.fastq.gz \
        <span class="cm-attribute">-R</span> <span class="cm-def">${sample}</span>_2.fastq.gz \
        <span class="cm-attribute">-T</span> <span class="cm-number">5</span> <span class="cm-attribute">-o</span> <span class="cm-def">${sample}</span>_subjunc.bam
<span class="cm-keyword">done</span></pre>
<p><span class="md-line md-end-block">BAM files alone are not enough; featureCounts is then used to quantify the alignments.</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-def">gff3</span><span class="cm-operator">=</span><span class="cm-string">'/home/jianmingzeng/data/public/tair/subread/Arabidopsis_thaliana.TAIR10.28.gff3.gz'</span>;
<span class="cm-def">gtf</span><span class="cm-operator">=</span><span class="cm-string">'/home/jianmingzeng/data/public/tair/subread/Arabidopsis_thaliana.TAIR10.28.gtf'</span>;
​
​
<span class="cm-def">featureCounts</span><span class="cm-operator">=</span><span class="cm-string">'/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/featureCounts'</span>;
<span class="cm-def">$featureCounts</span> <span class="cm-attribute">-T</span> <span class="cm-number">5</span> <span class="cm-attribute">-p</span> <span class="cm-attribute">-t</span> exon <span class="cm-attribute">-g</span> gene_name <span class="cm-attribute">-a</span> <span class="cm-def">$gtf</span> <span class="cm-attribute">-o</span>  counts.txt   *.bam
nohup <span class="cm-def">$featureCounts</span> <span class="cm-attribute">-T</span> <span class="cm-number">5</span> <span class="cm-attribute">-p</span> <span class="cm-attribute">-t</span> exon <span class="cm-attribute">-g</span> gene_id <span class="cm-attribute">-a</span> <span class="cm-def">$gtf</span> <span class="cm-attribute">-o</span>  counts_id.txt   *.bam &amp;</pre>
<p><span class="md-line md-end-block">This step is very fast.</span></p>
<h2 class="md-end-block md-heading">More aligner options</h2>
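<p><span class="md-line md-end-block">The featureCounts output carries annotation columns that most downstream tools do not need. As a minimal sketch, the helper below keeps only the gene ID and the per-BAM counts. It assumes the default featureCounts layout (a leading comment line starting with #, then Geneid, Chr, Start, End, Strand, Length, followed by one count column per BAM); counts_matrix is a hypothetical name.</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">#!/bin/bash
# Hypothetical helper: reduce featureCounts output to a plain count matrix.
# Drops the "#" comment line and the Chr/Start/End/Strand/Length columns,
# keeping column 1 (Geneid) and columns 7 onward (one per BAM file).
counts_matrix() {
    grep -v '^#' "$1" | cut -f 1,7-
}

# Usage: counts_matrix counts_id.txt | head</pre>
<p><span class="md-line md-end-block">The header row is kept, so the column names still show which BAM file each count came from.</span></p>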
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-def">$hisat</span> <span class="cm-attribute">-p</span> <span class="cm-number">5</span> <span class="cm-attribute">-x</span> <span class="cm-def">$hisat2_mm10_index</span> <span class="cm-attribute">-1</span> <span class="cm-def">$fq1</span> <span class="cm-attribute">-2</span> <span class="cm-def">$fq2</span> <span class="cm-attribute">-S</span> <span class="cm-def">$sample</span>.sam <span class="cm-number">2</span>&gt;<span class="cm-def">$sample</span>.hisat.log
samtools <span class="cm-builtin">sort</span> <span class="cm-attribute">-O</span> bam <span class="cm-attribute">-</span>@ <span class="cm-number">5</span>  <span class="cm-attribute">-o</span> <span class="cm-def">${sample}</span>_hisat.bam <span class="cm-def">$sample</span>.sam
​
<span class="cm-def">$subjunc</span> <span class="cm-attribute">-T</span> <span class="cm-number">5</span>  <span class="cm-attribute">-i</span> <span class="cm-def">$subjunc_mm10_index</span> <span class="cm-attribute">-r</span> <span class="cm-def">$fq1</span>  <span class="cm-attribute">-R</span> <span class="cm-def">$fq2</span> <span class="cm-attribute">-o</span> <span class="cm-def">${sample}</span>_subjunc.bam
<span class="cm-comment">## subjunc writes BAM directly, but the output is not sorted by reference coordinate</span>
​
bwa mem <span class="cm-attribute">-t</span> <span class="cm-number">5</span> <span class="cm-attribute">-M</span>  <span class="cm-def">$bwa_mm10_index</span> <span class="cm-def">$fq1</span> <span class="cm-def">$fq2</span> <span class="cm-number">1</span>&gt;<span class="cm-def">$sample</span>.sam <span class="cm-number">2</span>&gt;/dev/null 
samtools <span class="cm-builtin">sort</span> <span class="cm-attribute">-O</span> bam <span class="cm-attribute">-</span>@ <span class="cm-number">5</span>  <span class="cm-attribute">-o</span> <span class="cm-def">${sample}</span>_bwa.bam <span class="cm-def">$sample</span>.sam
<span class="cm-def">$bowtie</span> <span class="cm-attribute">-p</span> <span class="cm-number">5</span> <span class="cm-attribute">-x</span> <span class="cm-def">$bowtie2_mm10_index</span> <span class="cm-attribute">-1</span> <span class="cm-def">$fq1</span>  <span class="cm-attribute">-2</span> <span class="cm-def">$fq2</span> | samtools <span class="cm-builtin">sort</span>  <span class="cm-attribute">-O</span> bam  <span class="cm-attribute">-</span>@ <span class="cm-number">5</span> <span class="cm-attribute">-o</span> <span class="cm-attribute">-</span> &gt;<span class="cm-def">${sample}</span>_bowtie.bam
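<span class="cm-comment">## STAR takes a long time (about 10 minutes) to load the reference genome and uses a lot of memory, but the alignment itself is very fast: 5M reads take only about two minutes</span>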
<span class="cm-def">$star</span> <span class="cm-attribute">--runThreadN</span>  <span class="cm-number">5</span> <span class="cm-attribute">--genomeDir</span> <span class="cm-def">$star_mm10_index</span> <span class="cm-attribute">--readFilesCommand</span> zcat <span class="cm-attribute">--readFilesIn</span>  <span class="cm-def">$fq1</span> <span class="cm-def">$fq2</span> <span class="cm-attribute">--outFileNamePrefix</span>  <span class="cm-def">${sample}</span>_star 
<span class="cm-comment">## adding --outSAMtype BAM SortedByCoordinate makes STAR write a coordinate-sorted BAM directly, so the samtools sort below is not needed</span>
samtools <span class="cm-builtin">sort</span> <span class="cm-attribute">-O</span> bam <span class="cm-attribute">-</span>@ <span class="cm-number">5</span>  <span class="cm-attribute">-o</span> <span class="cm-def">${sample}</span>_star.bam <span class="cm-def">${sample}</span>_starAligned.out.sam</pre>
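<p><span class="md-line md-end-block">Because the subjunc output is not coordinate-sorted while the other commands go through <span spellcheck="false"><code>samtools sort</code></span>, it can be worth confirming the sort order before running a tool that expects sorted input. A minimal sketch, assuming the BAM header carries an <span spellcheck="false"><code>@HD</code></span> line with an <span spellcheck="false"><code>SO:</code></span> tag as the SAM specification describes; <span spellcheck="false"><code>is_coordinate_sorted</code></span> is a hypothetical helper name, not part of samtools:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Hypothetical helper: reads a SAM/BAM header on stdin and succeeds only if
# the @HD line declares SO:coordinate (a missing @HD line counts as unsorted).
is_coordinate_sorted() {
    grep -q '^@HD.*SO:coordinate'
}

# Usage, assuming samtools is on PATH:
#   samtools view -H ${sample}_subjunc.bam | is_coordinate_sorted || echo "needs sorting"
</pre>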
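<h2 class="md-end-block md-heading">Normalization methods for the expression matrix</h2>
<p><span class="md-line md-end-block">The statistical theory takes considerable effort to understand; the main goal here is to learn how these normalization methods are implemented in R and how they compare with one another.</span></p>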
<ul class="ul-list" data-mark="-">
<li><span class="md-line md-end-block"><span class=""><strong>Total count (TC)</strong></span>: Gene counts are divided by the total number of mapped reads (or library size) associated with their lane and multiplied by the mean total count across all the samples of the dataset.</span></li>
<li><span class="md-line md-end-block"><span class=""><strong>Upper Quartile (UQ)</strong></span>: Very similar in principle to TC, the total counts are replaced by the upper quartile of counts different from 0 in the computation of the normalization factors.</span></li>
<li><span class="md-line md-end-block"><span class=""><strong>Median (Med)</strong></span>: Also similar to TC, the total counts are replaced by the median counts different from 0 in the computation of the normalization factors. That is, the median is calculated as the median of gene counts of all runs.</span></li>
<li><span class="md-line md-end-block"><span class=""><strong>DESeq</strong></span>: This normalization method is included in the DESeq Bioconductor package and is based on the hypothesis that most genes are not DE. The method is based on a negative binomial distribution model, with variance and mean linked by local regression, and presents an implementation that gives scale factors. Within the DESeq package, and with the <span spellcheck="false"><code>estimateSizeFactorsForMatrix</code></span> function, scaling factors can be calculated for each run. After dividing gene counts by each scaling factor, DESeq values are calculated as the total of rescaled gene counts of all runs.</span></li>
<li><span class="md-line md-end-block"><span class=""><strong>Trimmed Mean of M-values (TMM)</strong></span>: This normalization method is implemented in the edgeR Bioconductor package (Robinson et al., 2010). It is also based on the hypothesis that most genes are not DE. Scaling factors are calculated using the <span spellcheck="false"><code>calcNormFactors</code></span> function in the package, and then rescaled gene counts are obtained by dividing gene counts by each scaling factor for each run. TMM is the sum of rescaled gene counts of all runs.</span></li>
<li><span class="md-line md-end-block"><span class=""><strong>Quantile (Q)</strong></span>: First proposed in the context of microarray data, this normalization method consists in matching distributions of gene counts across lanes.</span></li>
<li><span class="md-line md-end-block"><span class=""><strong>Reads Per Kilobase per Million mapped reads (RPKM)</strong></span>: This approach was initially introduced to facilitate comparisons between genes within a sample and combines between- and within-sample normalization. This approach quantifies gene expression from RNA-Seq data by normalizing for the total transcript length and the number of sequencing reads.</span></li>
</ul>
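<p><span class="md-line md-end-block">To make the simplest of these concrete, the Total count (TC) rule can be sketched directly on a tab-delimited count matrix. This is only an illustration under stated assumptions: the first row holds sample names, the first column holds gene IDs, and each library size is approximated by the column sum of the matrix itself. The function name <span spellcheck="false"><code>tc_normalize</code></span> is hypothetical, and a sample whose counts sum to zero would need to be removed first.</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Hypothetical helper: Total count (TC) normalization of a tab-delimited matrix.
# Each count is divided by its sample's library size (column sum) and
# multiplied by the mean library size across all samples.
tc_normalize() {
    awk -F'\t' -v OFS='\t' '
        NF == 0 { next }    # skip blank lines
        # Pass 1 (first read of the file): sum every sample column
        NR == FNR {
            if (FNR != 1) for (i = 2; i != NF + 1; i++) lib[i] += $i
            n = NF
            next
        }
        # Pass 2, header row: derive the mean library size, print the header as is
        FNR == 1 {
            for (i = 2; i != n + 1; i++) total += lib[i]
            mean = total / (n - 1)
            print
            next
        }
        # Pass 2, data rows: count / library size * mean library size
        {
            for (i = 2; i != NF + 1; i++) $i = sprintf("%.2f", $i / lib[i] * mean)
            print
        }
    ' "$1" "$1"
}
</pre>
<p><span class="md-line md-end-block">The file is passed to awk twice so that the library sizes are known before any row is rescaled. Calling <span spellcheck="false"><code>tc_normalize exprSet.txt</code></span> prints the normalized matrix to standard output.</span></p>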
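<h2 class="md-end-block md-heading">Differential expression analysis</h2>
<p><span class="md-line md-end-block">There are many choices here too, and they largely follow from the normalization methods above: once a normalization method is picked, the differential analysis method is generally determined as well. You do not have to master the statistics, because each method is wrapped in its own R package; the work is mostly learning those packages.</span></p>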
<ul class="ul-list" data-mark="-">
<li><span class="md-line md-end-block">edgeR (Robinson et al., 2010)</span></li>
<li><span class="md-line md-end-block">DESeq / DESeq2 (Anders and Huber, 2010, 2014)</span></li>
<li><span class="md-line md-end-block">DEXSeq (Anders et al., 2012)</span></li>
<li><span class="md-line md-end-block">limmaVoom</span></li>
<li><span class="md-line md-end-block">Cuffdiff / Cuffdiff2 (Trapnell et al., 2013)</span></li>
<li><span class="md-line md-end-block">PoissonSeq</span></li>
<li><span class="md-line md-end-block">baySeq</span></li>
</ul>
<p><span class="md-line md-end-block">First, extract the sample grouping information:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">tail <span class="cm-attribute">-n</span> <span class="cm-operator">+</span><span class="cm-number">2</span> E-MTAB-5130.sdrf.txt | <span class="cm-builtin">cut</span> <span class="cm-attribute">-f</span> <span class="cm-number">32</span>,36 |sort <span class="cm-attribute">-u</span></pre>
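<p><span class="md-line md-end-block">The column numbers 32 and 36 are specific to this SDRF file. For another file, the position of a column can be looked up by its header name instead of being counted by hand. A minimal sketch, assuming a tab-delimited file whose first line is the header; <span spellcheck="false"><code>col_index</code></span> is a hypothetical helper name:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Hypothetical helper: print the 1-based position of the column whose header
# exactly equals the second argument. If several columns share that name,
# one position is printed per line.
col_index() {
    head -n 1 "$1" | tr '\t' '\n' | grep -n -x -F -- "$2" | cut -d: -f1
}

# Usage (the column name here is only an illustration):
#   col_index E-MTAB-5130.sdrf.txt 'Source Name'
</pre>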
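<h2 class="md-end-block md-heading">Building the expression matrix</h2>
<p><span class="md-line md-end-block" contenteditable="true"><span class="">The expression matrix is the output of the upstream alignment and quantification steps, but it has to be saved as a tab-delimited (\t) text file laid out as follows:</span></span></p>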
<table class="md-table" contenteditable="false">
<thead>
<tr class="md-end-block">
<th></th>
<th><span class="td-span" contenteditable="true">SRR1039508</span></th>
<th><span class="td-span" contenteditable="true">SRR1039509</span></th>
<th><span class="td-span" contenteditable="true">SRR1039512</span></th>
<th><span class="td-span" contenteditable="true">SRR1039513</span></th>
<th><span class="td-span" contenteditable="true">SRR1039516</span></th>
<th><span class="td-span" contenteditable="true">SRR1039517</span></th>
<th><span class="td-span" contenteditable="true">SRR1039520</span></th>
<th><span class="td-span" contenteditable="true">SRR1039521</span></th>
</tr>
</thead>
<tbody>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000003</span></td>
<td><span class="td-span" contenteditable="true">679</span></td>
<td><span class="td-span" contenteditable="true">448</span></td>
<td><span class="td-span" contenteditable="true">873</span></td>
<td><span class="td-span" contenteditable="true">408</span></td>
<td><span class="td-span" contenteditable="true">1138</span></td>
<td><span class="td-span" contenteditable="true">1047</span></td>
<td><span class="td-span" contenteditable="true">770</span></td>
<td><span class="td-span" contenteditable="true">572</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000005</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000419</span></td>
<td><span class="td-span" contenteditable="true">467</span></td>
<td><span class="td-span" contenteditable="true">515</span></td>
<td><span class="td-span" contenteditable="true">621</span></td>
<td><span class="td-span" contenteditable="true">365</span></td>
<td><span class="td-span" contenteditable="true">587</span></td>
<td><span class="td-span" contenteditable="true">799</span></td>
<td><span class="td-span" contenteditable="true">417</span></td>
<td><span class="td-span" contenteditable="true">508</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000457</span></td>
<td><span class="td-span" contenteditable="true">260</span></td>
<td><span class="td-span" contenteditable="true">211</span></td>
<td><span class="td-span" contenteditable="true">263</span></td>
<td><span class="td-span" contenteditable="true">164</span></td>
<td><span class="td-span" contenteditable="true">245</span></td>
<td><span class="td-span" contenteditable="true">331</span></td>
<td><span class="td-span" contenteditable="true">233</span></td>
<td><span class="td-span" contenteditable="true">229</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000460</span></td>
<td><span class="td-span" contenteditable="true">60</span></td>
<td><span class="td-span" contenteditable="true">55</span></td>
<td><span class="td-span" contenteditable="true">40</span></td>
<td><span class="td-span" contenteditable="true">35</span></td>
<td><span class="td-span" contenteditable="true">78</span></td>
<td><span class="td-span" contenteditable="true">63</span></td>
<td><span class="td-span" contenteditable="true">76</span></td>
<td><span class="td-span" contenteditable="true">60</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000938</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">2</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">1</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000971</span></td>
<td><span class="td-span" contenteditable="true">3251</span></td>
<td><span class="td-span" contenteditable="true">3679</span></td>
<td><span class="td-span" contenteditable="true">6177</span></td>
<td><span class="td-span" contenteditable="true">4252</span></td>
<td><span class="td-span" contenteditable="true">6721</span></td>
<td><span class="td-span" contenteditable="true">11027</span></td>
<td><span class="td-span" contenteditable="true">5176</span></td>
<td><span class="td-span" contenteditable="true">7995</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001036</span></td>
<td><span class="td-span" contenteditable="true">1433</span></td>
<td><span class="td-span" contenteditable="true">1062</span></td>
<td><span class="td-span" contenteditable="true">1733</span></td>
<td><span class="td-span" contenteditable="true">881</span></td>
<td><span class="td-span" contenteditable="true">1424</span></td>
<td><span class="td-span" contenteditable="true">1439</span></td>
<td><span class="td-span" contenteditable="true">1359</span></td>
<td><span class="td-span" contenteditable="true">1109</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001084</span></td>
<td><span class="td-span" contenteditable="true">519</span></td>
<td><span class="td-span" contenteditable="true">380</span></td>
<td><span class="td-span" contenteditable="true">595</span></td>
<td><span class="td-span" contenteditable="true">493</span></td>
<td><span class="td-span" contenteditable="true">820</span></td>
<td><span class="td-span" contenteditable="true">714</span></td>
<td><span class="td-span" contenteditable="true">696</span></td>
<td><span class="td-span" contenteditable="true">704</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001167</span></td>
<td><span class="td-span" contenteditable="true">394</span></td>
<td><span class="td-span" contenteditable="true">236</span></td>
<td><span class="td-span" contenteditable="true">464</span></td>
<td><span class="td-span" contenteditable="true">175</span></td>
<td><span class="td-span" contenteditable="true">658</span></td>
<td><span class="td-span" contenteditable="true">584</span></td>
<td><span class="td-span" contenteditable="true">360</span></td>
<td><span class="td-span" contenteditable="true">269</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001460</span></td>
<td><span class="td-span" contenteditable="true">172</span></td>
<td><span class="td-span" contenteditable="true">168</span></td>
<td><span class="td-span" contenteditable="true">264</span></td>
<td><span class="td-span" contenteditable="true">118</span></td>
<td><span class="td-span" contenteditable="true">241</span></td>
<td><span class="td-span" contenteditable="true">210</span></td>
<td><span class="td-span" contenteditable="true">155</span></td>
<td><span class="td-span" contenteditable="true">177</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001461</span></td>
<td><span class="td-span" contenteditable="true">2112</span></td>
<td><span class="td-span" contenteditable="true">1867</span></td>
<td><span class="td-span" contenteditable="true">5137</span></td>
<td><span class="td-span" contenteditable="true">2657</span></td>
<td><span class="td-span" contenteditable="true">2735</span></td>
<td><span class="td-span" contenteditable="true">2751</span></td>
<td><span class="td-span" contenteditable="true">2467</span></td>
<td><span class="td-span" contenteditable="true">2905</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001497</span></td>
<td><span class="td-span" contenteditable="true">524</span></td>
<td><span class="td-span" contenteditable="true">488</span></td>
<td><span class="td-span" contenteditable="true">638</span></td>
<td><span class="td-span" contenteditable="true">357</span></td>
<td><span class="td-span" contenteditable="true">676</span></td>
<td><span class="td-span" contenteditable="true">806</span></td>
<td><span class="td-span" contenteditable="true">493</span></td>
<td><span class="td-span" contenteditable="true">475</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001561</span></td>
<td><span class="td-span" contenteditable="true">71</span></td>
<td><span class="td-span" contenteditable="true">51</span></td>
<td><span class="td-span" contenteditable="true">211</span></td>
<td><span class="td-span" contenteditable="true">156</span></td>
<td><span class="td-span" contenteditable="true">23</span></td>
<td><span class="td-span" contenteditable="true">38</span></td>
<td><span class="td-span" contenteditable="true">134</span></td>
<td><span class="td-span" contenteditable="true">172</span></td>
</tr>
</tbody>
</table>
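<p><span class="md-line md-end-block">The first column holds the gene IDs and the remaining columns are the samples. Pay particular attention to the first row: it begins with an empty cell (see how R's read.table function handles headers).</span></p>
<h2 class="md-end-block md-heading">Building the group matrix</h2>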
<table class="md-table" contenteditable="false">
<thead>
<tr class="md-end-block">
<th></th>
<th><span class="td-span" contenteditable="true">dex</span></th>
<th><span class="td-span" contenteditable="true">SampleName</span></th>
<th><span class="td-span" contenteditable="true">cell</span></th>
</tr>
</thead>
<tbody>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039508</span></td>
<td><span class="td-span" contenteditable="true">untrt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275862</span></td>
<td><span class="td-span" contenteditable="true">N61311</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039509</span></td>
<td><span class="td-span" contenteditable="true">trt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275863</span></td>
<td><span class="td-span" contenteditable="true">N61311</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039512</span></td>
<td><span class="td-span" contenteditable="true">untrt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275866</span></td>
<td><span class="td-span" contenteditable="true">N052611</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039513</span></td>
<td><span class="td-span" contenteditable="true">trt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275867</span></td>
<td><span class="td-span" contenteditable="true">N052611</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039516</span></td>
<td><span class="td-span" contenteditable="true">untrt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275870</span></td>
<td><span class="td-span" contenteditable="true">N080611</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039517</span></td>
<td><span class="td-span" contenteditable="true">trt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275871</span></td>
<td><span class="td-span" contenteditable="true">N080611</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039520</span></td>
<td><span class="td-span" contenteditable="true">untrt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275874</span></td>
<td><span class="td-span" contenteditable="true">N061011</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039521</span></td>
<td><span class="td-span" contenteditable="true">trt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275875</span></td>
<td><span class="td-span" contenteditable="true">N061011</span></td>
</tr>
</tbody>
</table>
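<p><span class="md-line md-end-block">Remember that the sample names must correspond to those in the expression matrix above!</span></p>
<p><span class="md-line md-end-block">Only the first column after the sample names (<span spellcheck="false"><code>dex</code></span> here) is used; the others do not matter.</span></p>
<p><span class="md-line md-end-block">Based on the grouping, you have to specify the comparison (contrast) yourself; for the group matrix above that is <span spellcheck="false"><code>-c 'trt-untrt'</code></span></span></p>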
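<p><span class="md-line md-end-block">Both requirements can be checked from the shell before running any R code. The sketch below assumes the two files are tab-delimited and laid out like the tables above, with the group label in the second column. It takes the conservative view that the sample names should appear in the same order in both files. The function names are hypothetical.</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Hypothetical helper: succeed only if the header of the expression matrix ($1)
# lists the same sample names, in the same order, as the first column of the
# group matrix ($2).
check_samples() {
    expr_samples=$(head -n 1 "$1" | tr '\t' '\n' | sed '/^$/d')
    group_samples=$(tail -n +2 "$2" | cut -f 1)
    [ "$expr_samples" = "$group_samples" ]
}

# Hypothetical helper: count the samples per group label (second column),
# which shows the labels that can be used in the -c contrast.
list_groups() {
    tail -n +2 "$1" | cut -f 2 | sort | uniq -c
}
</pre>
<p><span class="md-line md-end-block">For example, <span spellcheck="false"><code>check_samples exprSet.txt group_info.txt || echo "sample names differ"</code></span> reports a mismatch, and <span spellcheck="false"><code>list_groups group_info.txt</code></span> lists each label with its sample count.</span></p>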
<h2 class="md-end-block md-heading">Downloading the differential analysis script</h2>
<pre class="md-fences md-end-block" lang="" contenteditable="false">wget  https://raw.githubusercontent.com/jmzeng1314/my-R/master/DEG_scripts/run_DEG.R
wget  https://raw.githubusercontent.com/jmzeng1314/my-R/master/DEG_scripts/tair/exprSet.txt
wget  https://raw.githubusercontent.com/jmzeng1314/my-R/master/DEG_scripts/tair/group_info.txt
Rscript run_DEG.R -e exprSet.txt -g group_info.txt -c 'Day1-Day0' -s counts -m DESeq2</pre>
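<p><span class="md-line md-end-block">For RNA-seq raw counts, pass <span spellcheck="false"><code>-s counts</code></span>; for an already-normalized expression matrix, such as microarray data, the default parameters are enough. Some examples:</span></p>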
<pre class="md-fences md-end-block" lang="" contenteditable="false"># Rscript run_DEG.R -e airway.expression.txt -g airway.group.txt -c 'trt-untrt' -s counts -m DESeq2
# Rscript run_DEG.R -e airway.expression.txt -g airway.group.txt -c 'trt-untrt' -s counts -m edgeR
# Rscript run_DEG.R -e sCLLex.expression.txt -g sCLLex.group.txt -c 'progres.-stable'
# Rscript run_DEG.R -e sCLLex.expression.txt -g sCLLex.group.txt -c 'progres.-stable' -m t.test</pre>
<p><span class="md-line md-end-block">For RNA-seq raw counts, the DESeq2 and edgeR packages are available. For already-normalized expression matrices such as microarray data, limma and t.test are available.</span></p>
<p><span class="md-line md-end-block" contenteditable="true">Choosing which group of samples to compare against which can in fact get quite complicated; see for example: <span class="" spellcheck="false"><a href="http://genomicsclass.github.io/book/pages/expressing_design_formula.html">http://genomicsclass.github.io/book/pages/expressing_design_formula.html</a></span></span></p>
<h2 class="md-end-block md-heading"><span class="">Important scripts</span></h2>
<p><span class="md-line md-end-block">For example, <span spellcheck="false"><code>create_testData.R</code></span><span class=""> shows how to obtain the expression matrix and the group matrix.</span></span></p>
<h2 class="md-end-block md-heading">Enrichment analysis</h2>
<p><span class="md-line md-end-block md-focus" contenteditable="true"><span class="md-expand">I will not go into this here; it differs somewhat from enrichment analysis for human genes.</span></span></p>
<h2 class="md-end-block md-heading">Other data</h2>
<p><span class="md-line md-end-block">For example, <span spellcheck="false"><a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89843">https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89843</a></span> profiled the platelet transcriptomes of 402 NSCLC patients and 377 healthy individuals. The data were analyzed as follows:</span></p>
<blockquote><p><span class="md-line md-end-block">For further downstream analyses, reads were quality-controlled using Trimmomatic, mapped to the humane reference genome using STAR, and intron-spanning reads were summarized using HTseq.</span></p></blockquote>
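<p><span class="md-line md-end-block">Re-analyzing a dataset of this size from the raw reads demands substantial computing resources, but the authors' processed expression matrix can be downloaded directly: ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE89nnn/GSE89843/suppl/GSE89843_TEP_Count_Matrix.txt.gz </span></p>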
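<p><span class="md-line md-end-block">After downloading, a quick look at the matrix dimensions confirms that the file is complete before it is loaded into R. A minimal sketch, assuming the matrix is gzip-compressed and tab-delimited (the delimiter of this particular file has not been checked here); <span spellcheck="false"><code>matrix_dim</code></span> is a hypothetical helper name:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Hypothetical helper: print "rows columns" for a gzip-compressed,
# tab-delimited matrix. The header line is included in the row count and
# the column count is taken from the first line.
matrix_dim() {
    gzip -dc "$1" | awk -F'\t' 'NR == 1 { cols = NF } END { print NR, cols }'
}

# Usage:
#   matrix_dim GSE89843_TEP_Count_Matrix.txt.gz
</pre>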
<p><span class="md-line md-end-block">With this many samples, the downstream analysis of the expression matrix also goes well beyond simple differential expression.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2809.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The fastest RNA-seq pipeline ever: subread</title>
		<link>http://www.bio-info-trainee.com/2775.html</link>
		<comments>http://www.bio-info-trainee.com/2775.html#comments</comments>
		<pubDate>Thu, 19 Oct 2017 14:10:29 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[转录组软件]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2775</guid>
		<description><![CDATA[The fastest RNA-seq pipeline ever: subread. Installing the software. This is a binary release; just find the official site and download &#8230; <a href="http://www.bio-info-trainee.com/2775.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<h2 class="md-end-block md-heading md-focus"><span class="md-expand">The fastest RNA-seq pipeline ever: subread</span></h2>
<h2 class="md-end-block md-heading">Installing the software</h2>
<p><span class="md-line md-end-block">This is a binary release: download it from the official site, unpack it, and it is ready to use.</span><span id="more-2775"></span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-builtin">cd</span> ~/biosoft
<span class="cm-comment"># http://bioinf.wehi.edu.au/featureCounts/</span>
<span class="cm-builtin">mkdir</span> featureCounts &amp;&amp;  <span class="cm-builtin">cd</span> featureCounts
<span class="cm-comment">## I originally thought this package was only for quantifying expression, hence the folder name featureCounts</span>
<span class="cm-builtin">wget</span> https://sourceforge.net/projects/subread/files/subread-1.5.3/subread-1.5.3-Linux-x86_64.tar.gz
tar zxvf subread-1.5.3-Linux-x86_64.tar.gz</pre>
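<h2 class="md-end-block md-heading">Building the index</h2>
<p><span class="md-line md-end-block">Every aligner uses a different algorithm, so each one has to build its own index of the <span class=""><strong>reference genome</strong></span>. The reference genome already takes up a fair amount of disk space, and the index is even larger!</span></p>
<p><span class="md-line md-end-block">You need to download the reference genomes from UCSC yourself; I keep mine under <span spellcheck="false"><code>~/reference/genome/</code></span>.</span></p>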
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
<span class="cm-def">buildindex</span><span class="cm-operator">=</span>~/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/subread-buildindex
<span class="cm-builtin">cd</span> /home/jianmingzeng/reference/index/subread/
<span class="cm-def">$buildindex</span> <span class="cm-attribute">-o</span> mm10  ~/reference/genome/mm10/mm10.fa
<span class="cm-def">$buildindex</span> <span class="cm-attribute">-o</span> hg19  ~/reference/genome/hg19/hg19.fa
<span class="cm-def">$buildindex</span> <span class="cm-attribute">-o</span> hg38  ~/reference/genome/hg38/hg38.fa</pre>
<p><span class="md-line md-end-block">The resulting index files:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">
749M Sep 15 17:37 hg19.00.b.array
4.9G Sep 15 17:37 hg19.00.b.tab
5.5K Sep 15 17:33 hg19.files
   0 Sep 15 17:17 hg19.log
2.3K Sep 15 17:38 hg19.reads
766M Sep 15 18:01 hg38.00.b.array
5.0G Sep 15 18:01 hg38.00.b.tab
 29K Sep 15 17:57 hg38.files
   0 Sep 15 17:38 hg38.log
 14K Sep 15 18:01 hg38.reads
652M Sep 15 17:17 mm10.00.b.array
4.4G Sep 15 17:17 mm10.00.b.tab
3.9K Sep 15 17:13 mm10.files
   0 Sep 15 16:52 mm10.log
1.6K Sep 15 17:17 mm10.reads</pre>
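<h2 class="md-end-block md-heading">Batch alignment</h2>
<p><span class="md-line md-end-block">Once a <span class=""><strong>config file</strong></span> is prepared and its path is stored in <span spellcheck="false"><code>$config</code></span>, the script below can be run.</span></p>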
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-def">subjunc</span><span class="cm-operator">=</span><span class="cm-string">"/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/subjunc"</span>; 
<span class="cm-def">subjunc_mm10_index</span><span class="cm-operator">=</span><span class="cm-string">'/home/jianmingzeng/reference/index/subread/mm10'</span>;
<span class="cm-builtin">cat</span> <span class="cm-def">$config</span> |while read id
<span class="cm-keyword">do</span>
    <span class="cm-def">arr</span><span class="cm-operator">=</span>(<span class="cm-def">$id</span>)
    <span class="cm-def">fq1</span><span class="cm-operator">=</span><span class="cm-def">${arr[1]}</span>
    <span class="cm-def">fq2</span><span class="cm-operator">=</span><span class="cm-def">${arr[2]}</span>
    <span class="cm-def">sample</span><span class="cm-operator">=</span><span class="cm-def">${arr[0]}</span>
    <span class="cm-builtin">echo</span> <span class="cm-string">"  start alignment for </span><span class="cm-def">$sample</span><span class="cm-string">"</span> <span class="cm-quote">`date`</span>
    <span class="cm-comment">#$hisat -p 5 -x $mm10_index -1 $fq1 -2 $fq2 -S $sample.sam 2&gt;$sample.hisat.log</span>
    <span class="cm-comment">#samtools sort -O bam -@ 5  -o $sample.bam $sample.sam</span>
    <span class="cm-def">$subjunc</span> <span class="cm-attribute">-T</span> <span class="cm-number">5</span>  <span class="cm-attribute">-i</span> <span class="cm-def">$subjunc_mm10_index</span> <span class="cm-attribute">-r</span> <span class="cm-def">$fq1</span>  <span class="cm-attribute">-R</span> <span class="cm-def">$fq2</span> <span class="cm-attribute">-o</span> <span class="cm-def">${sample}</span>_subjunc.bam
    <span class="cm-builtin">echo</span> <span class="cm-string">"  end alignment for </span><span class="cm-def">$sample</span><span class="cm-string">"</span> <span class="cm-quote">`date`</span>
<span class="cm-keyword">done</span></pre>
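<p><span class="md-line md-end-block">The config file has just three columns: the first is the sample name, the second is that sample's fastq1, and the third is its fastq2. Sample names must not be repeated across samples.</span></p>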
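<p><span class="md-line md-end-block">Such a config file can be generated rather than typed by hand. The sketch below assumes paired files named <span spellcheck="false"><code>SAMPLE_1.fastq.gz</code></span> and <span spellcheck="false"><code>SAMPLE_2.fastq.gz</code></span> in one directory; that naming scheme and the function name <span spellcheck="false"><code>make_config</code></span> are assumptions for illustration, so adjust the suffixes to your own data.</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Hypothetical helper: print one "sample, fastq1, fastq2" line (tab-separated)
# for every *_1.fastq.gz file in directory $1 that has a matching *_2.fastq.gz.
make_config() {
    for fq1 in "$1"/*_1.fastq.gz; do
        [ -e "$fq1" ] || continue        # no match: the glob stays literal
        fq2="${fq1%_1.fastq.gz}_2.fastq.gz"
        [ -e "$fq2" ] || continue        # skip samples without a mate file
        sample=$(basename "$fq1" _1.fastq.gz)
        printf '%s\t%s\t%s\n' "$sample" "$fq1" "$fq2"
    done
}
</pre>
<p><span class="md-line md-end-block">Save the output of <span spellcheck="false"><code>make_config</code></span> to a file and point <span spellcheck="false"><code>$config</code></span> at it. Because the sample name is derived from the file name, names are unique as long as all files sit in the same directory.</span></p>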
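<p><span class="md-line md-end-block">I used to think hisat was already fast; after switching to subjunc I learned that there is always something faster.</span></p>
<h2 class="md-end-block md-heading">Batch quantification of expression</h2>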
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
<span class="cm-def">mm10_gtf</span><span class="cm-operator">=</span><span class="cm-string">'/home/jianmingzeng/reference/gtf/gencode/gencode.vM12.annotation.gtf'</span>;
<span class="cm-def">featureCounts</span><span class="cm-operator">=</span><span class="cm-string">'/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/featureCounts'</span>;
<span class="cm-def">$featureCounts</span> <span class="cm-attribute">-T</span> <span class="cm-number">5</span> <span class="cm-attribute">-p</span> <span class="cm-attribute">-t</span> exon <span class="cm-attribute">-g</span> gene_id <span class="cm-attribute">-a</span> <span class="cm-def">$mm10_gtf</span> <span class="cm-attribute">-o</span>  counts.txt   *.bam</pre>
<p><span class="md-line md-end-block">I really did not expect this tool to be so fast: 1M reads take only three to five seconds, leaving htseq-count, which I used before, far behind.</span></p>
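<p><span class="md-line md-end-block">The featureCounts output is not yet a plain expression matrix: it starts with a comment line beginning with <span spellcheck="false"><code>#</code></span>, and the columns Chr, Start, End, Strand and Length sit between the gene ID and the per-BAM counts. A minimal sketch of trimming it down to gene IDs plus counts, with an empty first header cell; <span spellcheck="false"><code>counts_to_matrix</code></span> is a hypothetical helper name:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
# Hypothetical helper: turn a featureCounts table into a tab-delimited matrix.
# Drops the leading "#" comment line, keeps column 1 (Geneid) and the count
# columns (7 onwards), then blanks the "Geneid" header cell.
counts_to_matrix() {
    grep -v '^#' "$1" | cut -f 1,7- | sed '1s/^Geneid//'
}

# Usage:
#   counts_to_matrix counts.txt
</pre>
<p><span class="md-line md-end-block">The sample columns keep the BAM file names that were passed to featureCounts, so they may still need renaming to match the sample names used elsewhere.</span></p>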
<p><span class="md-line md-end-block">Many more counting models and parameters are available; see <span class="" spellcheck="false"><a href="http://bioinf.wehi.edu.au/featureCounts/">http://bioinf.wehi.edu.au/featureCounts/</a></span></span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2775.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Exploring the assembly of unmapped reads from whole-genome resequencing</title>
		<link>http://www.bio-info-trainee.com/2523.html</link>
		<comments>http://www.bio-info-trainee.com/2523.html#comments</comments>
		<pubDate>Sat, 02 Sep 2017 12:16:55 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[基因组学]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2523</guid>
		<description><![CDATA[Exploring the assembly of unmapped reads from whole-genome resequencing. Mainly based on this &#8230; <a href="http://www.bio-info-trainee.com/2523.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<h2 class="md-end-block md-heading md-focus" contenteditable="true"><span class="md-expand">Exploring the assembly of unmapped reads from whole-genome resequencing</span></h2>
<p><span class="md-line md-end-block">This is mainly based on Figure 4 of this paper: <span spellcheck="false"><a href="http://www.nature.com/ng/journal/v42/n11/fig_tab/ng.691_F4.html">http://www.nature.com/ng/journal/v42/n11/fig_tab/ng.691_F4.html</a></span> </span><span id="more-2523"></span></p>
<p><span class="md-line md-end-block" contenteditable="true"><span class="md-image md-img-loaded" contenteditable="false" data-src="http://www.nature.com/ng/journal/v42/n11/images/ng.691-F4.jpg"><img src="http://www.nature.com/ng/journal/v42/n11/images/ng.691-F4.jpg" alt="" /></span></span></p>
<p><span class="md-line md-end-block" contenteditable="true">The paper was published in Nature Genetics in 2010: <span class=""><a spellcheck="false" href="http://www.nature.com/ng/journal/v42/n11/full/ng.691.html">Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing</a></span><span class="">. It used three assemblers (SOAPdenovo, ABySS and Velvet), but it dates from 2010 and better options exist now, such as Minia.</span></span></p>
<h2 class="md-end-block md-heading">Assembling with Minia</h2>
<p><span class="md-line md-end-block" contenteditable="true"><span class="">Minia is likewise a short-read assembler built on de Bruijn graphs. I consider it an improvement over the older ABySS and SOAPdenovo, so it is the tool chosen here.</span></span></p>
<h3 class="md-end-block md-heading">Downloading and installing Minia</h3>
<p><span class="md-line md-end-block">Follow the instructions on the official site and download the binary release; the commands are:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-comment">## Download and install Minia</span>
<span class="cm-comment"># http://minia.genouest.org/</span>
<span class="cm-builtin">cd</span> ~/biosoft
<span class="cm-builtin">mkdir</span> Minia &amp;&amp;  <span class="cm-builtin">cd</span> Minia
<span class="cm-builtin">wget</span> https://github.com/GATB/minia/releases/download/v2.0.7/minia-v2.0.7-bin-Linux.tar.gz 
tar <span class="cm-attribute">-zxvf</span> minia-v2.0.7-bin-Linux.tar.gz 
~/biosoft/Minia/minia-v2.0.7-bin-Linux/bin/minia <span class="cm-attribute">--help</span> 
<span class="cm-comment">## eg: ./minia -in reads.fa -kmer-size 31 -abundance-min 3 -out output_prefix </span></pre>
<p><span class="md-line md-end-block">The tool is very simple to use, a single command. The best value for <span spellcheck="false"><code>-kmer-size</code></span> should be determined with <span class=""><a spellcheck="false" href="http://kmergenie.bx.psu.edu/">KmerGenie</a></span>.</span></p>
<h3 class="md-end-block md-heading">Usage</h3>
<h3 class="md-end-block md-heading">step1: Extract the reads that failed to align</h3>
<pre class="md-fences md-end-block" lang="Shell" contenteditable="false">
<span class="cm-comment">## -f4 keeps only unmapped reads; SAM columns 1, 10 and 11 become one FASTQ record</span>
samtools view <span class="cm-attribute">-f4</span> jmzeng_recal.bam |perl <span class="cm-attribute">-alne</span> <span class="cm-string">'{print "\@$F[0]\n$F[9]\n+\n$F[10]" }'</span> &gt;unmapped.fq
<span class="cm-comment">## PRINSEQ quality report for the unmapped reads (PNG and HTML graphs)</span>
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-lite.pl <span class="cm-attribute">-verbose</span> <span class="cm-attribute">-fastq</span> unmapped.fq <span class="cm-attribute">-graph_data</span> unmapped.gd <span class="cm-attribute">-out_good</span> null <span class="cm-attribute">-out_bad</span> null
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-graphs.pl <span class="cm-attribute">-i</span> unmapped.gd <span class="cm-attribute">-png_all</span> <span class="cm-attribute">-o</span> unmapped
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-graphs.pl <span class="cm-attribute">-i</span> unmapped.gd <span class="cm-attribute">-html_all</span> <span class="cm-attribute">-o</span> unmapped
<span class="cm-comment">## the following steps are run from this directory</span>
<span class="cm-builtin">cd</span> ~/data/project/myGenome/gatk/jmzeng/unmapped</pre>
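<p><span class="md-line md-end-block">The Perl one-liner above builds each FASTQ record from SAM column 1 (read name), column 10 (sequence) and column 11 (base qualities). As a minimal sketch of the same idea in awk, the helper <code>sam_to_fastq</code> below (a hypothetical name of my own, not part of samtools) performs that conversion on a stream of SAM records. Recent samtools releases also provide a <code>samtools fastq</code> subcommand that can replace a hand-written conversion.</span></p>

```shell
# Hypothetical helper: convert SAM records on stdin to FASTQ on stdout.
# Column 1 is the read name, column 10 the sequence, column 11 the qualities.
# Header lines (starting with @) are skipped in case `samtools view -h` was used.
sam_to_fastq() {
  awk -F'\t' '$1 !~ /^@/ { printf "@%s\n%s\n+\n%s\n", $1, $10, $11 }'
}

# Usage, assuming samtools is on the PATH and the BAM file named in the post:
#   samtools view -f4 jmzeng_recal.bam | sam_to_fastq > unmapped.fq
```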
<p><span class="md-line md-end-block">unmapped.fq contains 31,481,084 lines, i.e. 31481084/4 = 7,870,271 reads, so only about 7.87M reads.</span></p>
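<p><span class="md-line md-end-block">Each FASTQ record spans four lines, so the read count is the line count divided by four. A minimal sketch, assuming an uncompressed FASTQ file whose sequence lines are not wrapped (<code>count_fastq_reads</code> is a hypothetical helper name):</span></p>

```shell
# Hypothetical helper: number of reads in an uncompressed FASTQ file
# that uses exactly four lines per record.
count_fastq_reads() {
  awk 'END { printf "%d\n", NR / 4 }' "$1"
}

# Usage: count_fastq_reads unmapped.fq
```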
<h3 class="md-end-block md-heading">Step 2: Choose the k-mer size with KmerGenie</h3>
<p><span class="md-line md-end-block">KmerGenie estimates the best k-mer length for genome de novo assembly.</span></p>
<p><span class="md-line md-end-block"><span class="">KmerGenie predictions can be applied to single-k genome assemblers (e.g. Velvet, SOAPdenovo 2, ABySS, Minia).</span></span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
<span class="cm-comment">## http://kmergenie.bx.psu.edu/</span>
<span class="cm-builtin">cd</span> ~/biosoft
<span class="cm-builtin">mkdir</span> KmerGenie &amp;&amp;  <span class="cm-builtin">cd</span> KmerGenie
<span class="cm-builtin">wget</span> http://kmergenie.bx.psu.edu/kmergenie-1.7044.tar.gz
tar zxvf kmergenie-1.7044.tar.gz
<span class="cm-builtin">cd</span> kmergenie-1.7044
<span class="cm-builtin">make</span> 
python setup.py install <span class="cm-attribute">--user</span>
~/.local/bin/kmergenie <span class="cm-attribute">--help</span> 
<span class="cm-builtin">cd</span> ~/data/project/myGenome/gatk/jmzeng/unmapped
~/.local/bin/kmergenie unmapped.fq</pre>
<h3 class="md-end-block md-heading"><span class="">Step 3: Run Minia</span></h3>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-builtin">cd</span> ~/data/project/myGenome/gatk/jmzeng/unmapped
~/biosoft/Minia/minia-v2.0.7-bin-Linux/bin/minia  <span class="cm-attribute">-in</span> unmapped.fq <span class="cm-attribute">-kmer-size</span> <span class="cm-number">31</span> <span class="cm-attribute">-abundance-min</span> <span class="cm-number">3</span> <span class="cm-attribute">-out</span> output_prefix</pre>
<p><span class="md-line md-end-block" contenteditable="true"><span class="">Assembling the roughly 7.87M reads produced 272,007 contigs.</span></span></p>
<h2 class="md-end-block md-heading">After assembly</h2>
<p><span class="md-line md-end-block">PRINSEQ v0.20.4 was used to calculate assembly statistics, including N50 contig size and GC content.</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-builtin">cd</span> ~/data/project/myGenome/gatk/jmzeng/unmapped
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-lite.pl <span class="cm-attribute">-verbose</span> <span class="cm-attribute">-fasta</span> output_prefix.contigs.fa  <span class="cm-attribute">-graph_data</span> contigs.gd <span class="cm-attribute">-out_good</span> null <span class="cm-attribute">-out_bad</span> null
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-graphs.pl <span class="cm-attribute">-i</span> contigs.gd <span class="cm-attribute">-png_all</span> <span class="cm-attribute">-o</span> contigs
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-graphs.pl <span class="cm-attribute">-i</span> contigs.gd <span class="cm-attribute">-html_all</span> <span class="cm-attribute">-o</span> contigs
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-lite.pl <span class="cm-attribute">-verbose</span> <span class="cm-attribute">-fasta</span> output_prefix.contigs.fa  <span class="cm-attribute">-stats_assembly</span></pre>
<p><span class="md-line md-end-block"><span class="">The last command reports a few assembly metrics, shown below:</span></span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">
stats_assembly  N50 176
stats_assembly  N75 113
stats_assembly  N90 78
stats_assembly  N95 70</pre>
<h3 class="md-end-block md-heading">Input Information</h3>
<table class="md-table" contenteditable="false">
<thead>
<tr class="md-end-block">
<th><span class="td-span" contenteditable="true">Input file(s):</span></th>
<th><span class="td-span" contenteditable="true">output_prefix.contigs.fa</span></th>
</tr>
</thead>
<tbody>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">Input format(s):</span></td>
<td><span class="td-span" contenteditable="true">FASTA</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true"><span class=""># Sequences:</span></span></td>
<td><span class="td-span" contenteditable="true">272,007</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">Total bases:</span></td>
<td><span class="td-span" contenteditable="true"><span class="">44,868,011</span></span></td>
</tr>
</tbody>
</table>
<h3 class="md-end-block md-heading">Length Distribution</h3>
<table class="md-table" contenteditable="false">
<thead>
<tr class="md-end-block">
<th><span class="td-span" contenteditable="true">Mean sequence length:</span></th>
<th><span class="td-span" contenteditable="true">164.95 ± 204.44 bp</span></th>
</tr>
</thead>
<tbody>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">Minimum length:</span></td>
<td><span class="td-span" contenteditable="true">63 bp</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">Maximum length:</span></td>
<td><span class="td-span" contenteditable="true">10,187 bp</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">Length range:</span></td>
<td><span class="td-span" contenteditable="true">10,125 bp</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">Mode length:</span></td>
<td><span class="td-span" contenteditable="true"><span class="">150 bp with 16,461 sequences</span></span></td>
</tr>
</tbody>
</table>
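<p><span class="md-line md-end-block">For readers unfamiliar with the N50 figure reported above: it is the contig length L such that contigs of length L or longer together hold at least half of all assembled bases. The sketch below computes it with sort and awk. The helper names <code>fasta_lengths</code> and <code>nx</code> are hypothetical, and I have not checked that PRINSEQ uses exactly this definition, so treat the result as a cross-check rather than a reimplementation.</span></p>

```shell
# Hypothetical helper: print one sequence length per FASTA record.
fasta_lengths() {
  awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
       { len += length($0) }
       END { if (seen) print len }' "$1"
}

# Hypothetical helper: read lengths on stdin and print the Nx value (default N50).
# Lengths are sorted from longest to shortest and summed until x percent
# of the total is reached; the length that crosses the threshold is Nx.
nx() {
  sort -rn | awk -v x="${1:-50}" '
    { len[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        sum += len[i]
        if (sum * 100 >= total * x) { print len[i]; exit }
      }
    }'
}

# Usage: fasta_lengths output_prefix.contigs.fa | nx 50
```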
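<p><span class="md-line md-end-block" contenteditable="true"><span class="">The next step is to validate the contigs by aligning RNA-seq data against them. I will cover that in a later post.</span></span></p>
<p><span class="md-line md-end-block" contenteditable="true"><span class="">The assembled contigs can also be submitted to NCBI BLAST to look at their species distribution (distribution of top nucleotide BLAST hits by species from the NCBI nr database for 1000 random contigs in the assembly). PRINSEQ, used above, also produces a simple table of contaminant species, but it works on a different principle. More on this in a later post.</span></span></p>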
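<p><span class="md-line md-end-block">Drawing a random subset of contigs for BLAST, as described above, can be sketched with standard tools. This assumes GNU <code>shuf</code> is installed and that the FASTA headers contain no tab characters. <code>sample_fasta</code> is a hypothetical helper, and the output file name in the usage line is only illustrative.</span></p>

```shell
# Hypothetical helper: print N randomly chosen records from a FASTA file.
# Each record is first joined onto one line (header, tab, sequence),
# then shuf picks N lines, and tr restores the header/sequence split.
sample_fasta() {
  awk '/^>/ { if (NR > 1) printf "\n"; printf "%s\t", $0; next }
       { printf "%s", $0 }
       END { if (NR > 0) printf "\n" }' "$2" |
    shuf -n "$1" |
    tr '\t' '\n'
}

# Usage: sample_fasta 1000 output_prefix.contigs.fa > contigs.sample1000.fa
```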
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2523.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A hands-on MeDIP-seq walkthrough: super simple, done in 2 hours!</title>
		<link>http://www.bio-info-trainee.com/2352.html</link>
		<comments>http://www.bio-info-trainee.com/2352.html#comments</comments>
		<pubDate>Wed, 15 Feb 2017 06:34:38 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[CHIP-seq]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2352</guid>
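		<description><![CDATA[Please do not copy my code directly. Understand it yourself, type it out, and think about why I wrote it this way. Software &#8230; <a href="http://www.bio-info-trainee.com/2352.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div><span style="color: #ff0000;"><strong>Please do not copy my code directly. Understand it yourself, type it out, and think about why I wrote it this way.</strong></span></div>
<div><span style="color: #ff0000;"><strong>Please use the latest software versions, especially for tools such as samtools that I call from my system PATH. Because this blog has many readers, I usually include the version number in the path of the other tools.</strong></span></div>
<div>It took me two hours, but that does not mean you will learn it in two hours. Some readers told me it took them two weeks, which is perfectly normal. Do not expect to reach my level in two hours.</div>
<p>MeDIP-seq is analyzed in exactly the same way as ChIP-seq, and the same holds for hMeDIP-seq, caMeDIP-seq and so on. There is no essential difference between them, only the antibody differs. Please look up the background yourself; I only cover the data analysis.</p>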
<p><a title="Permalink to A hands-on ChIP-seq walkthrough: super simple, done in 2 hours!" href="http://www.bio-info-trainee.com/2257.html" rel="bookmark">A hands-on ChIP-seq walkthrough: super simple, done in 2 hours!</a></p>
<h1 class="entry-title"><a title="Permalink to A hands-on RNA-seq walkthrough: super simple, done in 2 hours!" href="http://www.bio-info-trainee.com/2218.html" rel="bookmark">A hands-on RNA-seq walkthrough: super simple, done in 2 hours!</a></h1>
<p><span id="more-2352"></span></p>
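<p>Please read the earlier posts in this series first. For me this is simple, because I have already installed the software, downloaded the data and can read all the code. For you it may not be simple, and some readers told me it took them two weeks, but it can certainly be understood.</p>
<div>The paper is "Dnmt3L antagonizes DNA methylation at bivalent promoters and favors DNA methylation at gene bodies in ESCs": <a href="https://www.ncbi.nlm.nih.gov/pubmed/24074865">https://www.ncbi.nlm.nih.gov/pubmed/24074865</a>. It was published in Cell in 2013 and is well worth reproducing.</div>
<div>The MeDIP-seq data are at: <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44642">https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44642</a></div>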
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2017/02/11.png"><img class="alignnone size-full wp-image-2353" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/02/11.png" alt="1" width="501" height="153" /></a></div>
<div>First, download the raw data:</div>
<div>wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP018/SRP018845/SRR764931/SRR764931.sra</div>
<div>wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP018/SRP018845/SRR764932/SRR764932.sra</div>
<div>ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 $id;done</div>
<div><img class="alignnone size-full wp-image-2354" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/02/2.png" alt="2" width="359" height="98" /></div>
<div>I checked the data quality with FastQC. It was very good, so I did not need to filter the reads. The command was:</div>
<div>ls *fastq |xargs ~/biosoft/fastqc/FastQC/fastqc -t 10</div>
<div>If filtering is needed, use the following:</div>
<div>ls *.fastq | while read id</div>
<div>do</div>
<div>~/biosoft/sickle/sickle-master/sickle se -t sanger -g -f $id -o ${id%%.*}.trimmed.fq.gz</div>
<div>done</div>
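<div>In the loop above, <code>${id%%.*}</code> is shell parameter expansion: it deletes the longest suffix matching <code>.*</code>, that is, everything from the first dot onward, so the output name is built from the bare run accession. A small sketch of that behaviour (<code>trimmed_name</code> is a hypothetical helper, not part of sickle):</div>

```shell
# Hypothetical helper: derive the output file name used in the sickle loop.
trimmed_name() {
  id=$1
  # %%.* removes the longest suffix that starts with a dot,
  # i.e. everything from the first dot in the file name onward.
  printf '%s\n' "${id%%.*}.trimmed.fq.gz"
}

trimmed_name SRR764931.fastq       # prints SRR764931.trimmed.fq.gz
trimmed_name SRR764931_1.fastq.gz  # prints SRR764931_1.trimmed.fq.gz
```

<div>Because the match starts at the first dot, this only works on bare file names such as those produced by <code>ls *.fastq</code>; a path like <code>./SRR764931.fastq</code> would be reduced to an empty string.</div>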
<div></div>
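<div>First, use bowtie2 to align the sequenced FASTQ files to the mm10 reference genome. There are only two samples, so I will not write a loop.</div>
<div>For data like this with no control sample, we can run <span style="color: #ff0000;"><strong>the four steps of peak calling</strong></span> in one go!</div>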
<div></div>
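<div>Once the reads are aligned and sorted into a BAM file, MACS2 can call peaks on it directly.</div>
<div>The BAM file is then converted to a bigWig (bw) file, which is used for plotting. The full set of commands for one sample is:</div>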
<div>~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -x ~/reference/index/bowtie/mm10 -U SRR764931.fastq | samtools sort -O bam -o shDnmt3L.bam</div>
<div>## The alignment rates are very high: 96.67% (shDnmt3L) and 96.59% (shGFP).</div>
<div>samtools index shDnmt3L.bam</div>
<div>~/.local/bin/macs2 callpeak -t shDnmt3L.bam -m 10 30 -p 1e-5 -f BAM -g mm -n shDnmt3L 2&gt;shDnmt3L.macs2.log</div>
<div>bamCoverage -b shDnmt3L.bam -o shDnmt3L.bw ## optional parameters can be added here: -p 10 --normalizeUsingRPKM</div>
<div>computeMatrix reference-point --referencePoint TSS -b 10000 -a 10000 -R ~/annotation/CHIPseq/mm10/ucsc.refseq.bed -S shDnmt3L.bw --skipZeros -o matrix1_shDnmt3L_TSS.gz</div>
<div>plotHeatmap -m matrix1_shDnmt3L_TSS.gz -out shDnmt3L.png</div>
<div>With only two samples I did not write a loop. By now you should be able to follow the commands!</div>
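<div>For readers who do want a loop, here is a minimal sketch that prints the same commands for each sample instead of running them, so they can be reviewed first. <code>medip_commands</code> is a hypothetical helper. SRR764931 is shDnmt3L as in the bowtie2 command above; pairing SRR764932 with shGFP is my assumption, so check it against the GEO record before use.</div>

```shell
# Hypothetical helper: print the per-sample commands for one run.
# Tool paths and options mirror the single-sample commands in the post.
medip_commands() {
  srr=$1
  name=$2
  cat <<EOF
~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -x ~/reference/index/bowtie/mm10 -U $srr.fastq | samtools sort -O bam -o $name.bam
samtools index $name.bam
~/.local/bin/macs2 callpeak -t $name.bam -m 10 30 -p 1e-5 -f BAM -g mm -n $name 2>$name.macs2.log
bamCoverage -b $name.bam -o $name.bw
computeMatrix reference-point --referencePoint TSS -b 10000 -a 10000 -R ~/annotation/CHIPseq/mm10/ucsc.refseq.bed -S $name.bw --skipZeros -o matrix1_${name}_TSS.gz
plotHeatmap -m matrix1_${name}_TSS.gz -out $name.png
EOF
}

# Assumed run-to-sample table: the second row is a guess, not taken from the post.
while read -r srr name; do
  medip_commands "$srr" "$name"
done <<EOF
SRR764931 shDnmt3L
SRR764932 shGFP
EOF
```

<div>Piping the printed lines to <code>sh</code> would execute them, provided the tools and reference files exist at the paths shown.</div>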
<div>That is all for the analysis!</div>
<div></div>
<div>Reference: <a href="http://crazyhottommy.blogspot.com/search/label/MeDIP-seq">http://crazyhottommy.blogspot.com/search/label/MeDIP-seq</a></div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2352.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
