<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 &#187; Transcriptome Software</title>
	<atom:link href="http://www.bio-info-trainee.com/category/omics/transcriptomics/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
	<description>Join the discussion on the forum at biotrainee.com, or follow the WeChat official account of the same name, biotrainee.</description>
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>Visualizing chimeric RNA with the Bioconductor package chimeraviz</title>
		<link>http://www.bio-info-trainee.com/2955.html</link>
		<comments>http://www.bio-info-trainee.com/2955.html#comments</comments>
		<pubDate>Sat, 06 Jan 2018 09:41:26 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Transcriptome Software]]></category>
		<category><![CDATA[Visualization]]></category>
		<category><![CDATA[Fusion Genes]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2955</guid>
		<description><![CDATA[Visualizing chimeric RNA with the Bioconductor package chimeraviz. High-throughput RNA sequencing &#8230; <a href="http://www.bio-info-trainee.com/2955.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<h1 class="md-end-block md-heading">Visualizing chimeric RNA with the Bioconductor package chimeraviz</h1>
<p><span class="md-line md-end-block"><span class="">High-throughput RNA sequencing has made the detection of fusion transcripts more efficient, but fusion-detection methods and the associated software usually produce a high false discovery rate. A visualization framework that automatically integrates the RNA data with known genomic features helps in checking the results. chimeraviz, a Bioconductor package released in 2017, creates such chimeric RNA visualizations automatically. </span></span></p>
<p><span class="md-line md-end-block">It supports input from nine different fusion-detection tools (<span class=""><a spellcheck="false" href="http://www.bioinformatics.com.cn/?/article/601">deFuse</a></span>, <span class=""><a spellcheck="false" href="http://www.bioinformatics.com.cn/?/article/497">EricScript</a></span>, InFusion, <span class=""><a spellcheck="false" href="http://www.bioinformatics.com.cn/?/article/367">JAFFA</a></span>, FusionCatcher, FusionMap, PRADA, SOAPfuse and STAR-Fusion).</span><span id="more-2955"></span></p>
<h2 class="md-end-block md-heading">Official tutorial</h2>
<p><span class="md-line md-end-block">Detailed documentation is available directly on Bioconductor: <span spellcheck="false"><a href="https://bioconductor.org/packages/release/bioc/html/chimeraviz.html">https://bioconductor.org/packages/release/bioc/html/chimeraviz.html</a></span> | <span class=""><a spellcheck="false" href="https://bioconductor.org/packages/release/bioc/vignettes/chimeraviz/inst/doc/chimeraviz-vignette.html">HTML</a></span> | <span class=""><a spellcheck="false" href="https://bioconductor.org/packages/release/bioc/vignettes/chimeraviz/inst/doc/chimeraviz-vignette.R">R Script</a></span></span></p>
<p><span class="md-line md-end-block">Once installed, the R package ships with a set of test data for fusion-gene visualization. The files are:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">  1.1K Oct 16 22:36 5267readsAligned.bam
   96B Oct 16 22:36 5267readsAligned.bam.bai
   22K Oct 16 22:36 FusionMap_01_TestDataset_InputFastq.FusionReport.txt
   37K Oct 16 22:36 Homo_sapiens.GRCh37.74.sqlite
   68K Oct 16 22:36 Homo_sapiens.GRCh37.74_subset.gtf
  1.9K Oct 16 22:36 PRADA.acc.fusion.fq.TAF.tsv
   32K Oct 16 22:36 UCSC.HG19.Human.CytoBandIdeogram.txt
   32K Oct 16 22:36 UCSC.HG38.Human.CytoBandIdeogram.txt
   16K Oct 16 22:36 defuse_833ke_results.filtered.tsv
  4.6K Oct 16 22:36 ericscript_SRR1657556.results.total.tsv
  1.7M Oct 16 22:36 fusion5267and11759reads.bam
   57K Oct 16 22:36 fusion5267and11759reads.bam.bai
  4.1K Oct 16 22:36 fusioncatcher_833ke_final-list-candidate-fusion-genes.txt
  2.1K Oct 16 22:36 infusion_fusions.txt
  4.3K Oct 16 22:36 jaffa_results.csv
  2.6K Oct 16 22:36 reads.1.fq
  2.6K Oct 16 22:36 reads.2.fq
  1.0K Oct 16 22:36 reads_supporting_defuse_fusion_5267.1.fq
  1.0K Oct 16 22:36 reads_supporting_defuse_fusion_5267.2.fq
  3.3K Oct 16 22:36 soapfuse_833ke_final.Fusion.specific.for.genes
  2.0K Oct 16 22:36 star-fusion.fusion_candidates.final.abridged.txt</pre>
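<p><span class="md-line md-end-block">As you can see, example results for all nine supported fusion-detection tools are included. For instance, here is an excerpt of the results from my favourite, STAR-Fusion:</span></p>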
<pre class="md-fences md-end-block" lang="" contenteditable="false">#FusionName JunctionReadCount   SpanningFragCount   SpliceType  LeftGene    LeftBreakpoint  RightGene   RightBreakpoint
THRA--AC090627.1    27  93  ONLY_REF_SPLICE THRA^ENSG00000126351.8  chr17:38243106:+    AC090627.1^ENSG00000235300.3    chr17:46371709:+
THRA--AC090627.1    5   93  ONLY_REF_SPLICE THRA^ENSG00000126351.8  chr17:38243106:+    AC090627.1^ENSG00000235300.3    chr17:46384693:+
ACACA--STAC2    12  51  ONLY_REF_SPLICE ACACA^ENSG00000132142.15    chr17:35479453:-    STAC2^ENSG00000141750.6 chr17:37374426:-
RPS6KB1--SNF8   10  43  ONLY_REF_SPLICE RPS6KB1^ENSG00000108443.9   chr17:57970686:+    SNF8^ENSG00000159210.5  chr17:47021337:-
TOB1--SYNRG 8   30  ONLY_REF_SPLICE TOB1^ENSG00000141232.4  chr17:48943419:-    SYNRG^ENSG00000006114.11    chr17:35880751:-
VAPB--IKZF3 4   46  ONLY_REF_SPLICE VAPB^ENSG00000124164.11 chr20:56964573:+    IKZF3^ENSG00000161405.12    chr17:37934020:-
ZMYND8--CEP250  2   44  ONLY_REF_SPLICE ZMYND8^ENSG00000101040.15   chr20:45852970:-    CEP250^ENSG00000126001.11   chr20:34078463:+
AHCTF1--NAAA    3   38  ONLY_REF_SPLICE AHCTF1^ENSG00000153207.10   chr1:247094880:-    NAAA^ENSG00000138744.10 chr4:76846964:-
VAPB--IKZF3 1   46  ONLY_REF_SPLICE VAPB^ENSG00000124164.11 chr20:56964573:+    IKZF3^ENSG00000161405.12    chr17:37944627:-
VAPB--IKZF3 1   46  ONLY_REF_SPLICE VAPB^ENSG00000124164.11 chr20:56964573:+    IKZF3^ENSG00000161405.12    chr17:37922746:-
STX16--RAE1 4   33  ONLY_REF_SPLICE STX16^ENSG00000124222.17    chr20:57227143:+    RAE1^ENSG00000101146.8  chr20:55929088:+</pre>
<p><span class="md-line md-end-block" contenteditable="true"><span class="">These result files are all loaded into R with the import family of functions, for example:</span></span></p>
<pre class="md-fences md-end-block" lang="R" contenteditable="false"><span class="cm-variable">library</span>(<span class="cm-variable">chimeraviz</span>)
​
<span class="cm-comment"># Get reference to results file from deFuse</span>
<span class="cm-variable">defuse833ke</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">system.file</span>(
  <span class="cm-string">"extdata"</span>,
  <span class="cm-string">"defuse_833ke_results.filtered.tsv"</span>,
  <span class="cm-variable">package</span><span class="cm-arg-is">=</span><span class="cm-string">"chimeraviz"</span>)
​
<span class="cm-comment"># Load the results file into a list of fusion objects</span>
<span class="cm-variable">fusions</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">importDefuse</span>(<span class="cm-variable">defuse833ke</span>, <span class="cm-string">"hg19"</span>)

<span class="cm-comment"># Number of fusion events that were loaded</span>
<span class="cm-variable">length</span>(<span class="cm-variable">fusions</span>)</pre>
<h2 class="md-end-block md-heading">Genome-wide overview</h2>
<pre class="md-fences md-end-block" lang="R" contenteditable="false"><span class="cm-variable">soapfuse833ke</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">system.file</span>(
  <span class="cm-string">"extdata"</span>,
  <span class="cm-string">"soapfuse_833ke_final.Fusion.specific.for.genes"</span>,
  <span class="cm-variable">package</span> <span class="cm-arg-is">=</span> <span class="cm-string">"chimeraviz"</span>)
<span class="cm-variable">fusions</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">importSoapfuse</span>(<span class="cm-variable">soapfuse833ke</span>, <span class="cm-string">"hg38"</span>, <span class="cm-number">10</span>)
<span class="cm-comment"># Plot!</span>
<span class="cm-variable">plotCircle</span>(<span class="cm-variable">fusions</span>)</pre>
<p><span class="md-line md-end-block">The result is a circle plot:</span></p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/chimeraviz-fusion-circle-plot.png"><img class="alignnone size-full wp-image-2957" src="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/chimeraviz-fusion-circle-plot.png" alt="chimeraviz-fusion-circle-plot" width="1094" height="998" /></a></p>
<p><span class="">Red links: </span><span class=""><strong>intrachromosomal fusions</strong></span>; blue links: <span class=""><strong>interchromosomal fusions.</strong></span></p>
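<p><span class="md-line md-end-block">The same intra/inter distinction can be checked directly on a fusion caller's table. The following is a minimal shell sketch, assuming the tab-separated STAR-Fusion layout excerpted earlier (breakpoints in columns 6 and 8, written as chr:position:strand). The function name classify_fusions is made up for illustration and is not part of chimeraviz or STAR-Fusion:</span></p>

```shell
# classify_fusions: hypothetical helper, not part of chimeraviz or STAR-Fusion.
# Reads a tab-separated STAR-Fusion table on stdin and labels each fusion by
# comparing the chromosomes of its two breakpoints.
classify_fusions() {
  awk -F '\t' '
    /^#/ { next }                      # skip the header line
    {
      split($6, left, ":")             # LeftBreakpoint, e.g. chr17:35479453:-
      split($8, right, ":")            # RightBreakpoint
      kind = (left[1] == right[1]) ? "intrachromosomal" : "interchromosomal"
      print $1 "\t" kind
    }'
}

# Header plus two rows copied from the STAR-Fusion excerpt.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  '#FusionName' JunctionReadCount SpanningFragCount SpliceType LeftGene LeftBreakpoint RightGene RightBreakpoint \
  'ACACA--STAC2' 12 51 ONLY_REF_SPLICE 'ACACA^ENSG00000132142.15' 'chr17:35479453:-' 'STAC2^ENSG00000141750.6' 'chr17:37374426:-' \
  'VAPB--IKZF3' 4 46 ONLY_REF_SPLICE 'VAPB^ENSG00000124164.11' 'chr20:56964573:+' 'IKZF3^ENSG00000161405.12' 'chr17:37934020:-' \
  | classify_fusions
```

<p><span class="md-line md-end-block">Here ACACA--STAC2 is reported as intrachromosomal and VAPB--IKZF3 as interchromosomal. On a real file, pipe it in with cat star-fusion.fusion_candidates.final.abridged.txt | classify_fusions.</span></p>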
<h3 class="md-end-block md-heading">Visualizing a single fusion event</h3>
<pre class="md-fences md-end-block" lang="R" contenteditable="false">​
<span class="cm-keyword">if</span>(<span class="cm-operator">!</span><span class="cm-variable">exists</span>(<span class="cm-string">"defuse833ke"</span>))
  <span class="cm-variable">defuse833ke</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">system.file</span>(
    <span class="cm-string">"extdata"</span>,
    <span class="cm-string">"defuse_833ke_results.filtered.tsv"</span>,
    <span class="cm-variable">package</span> <span class="cm-arg-is">=</span> <span class="cm-string">"chimeraviz"</span>)
<span class="cm-variable">fusions</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">importDefuse</span>(<span class="cm-variable">defuse833ke</span>, <span class="cm-string">"hg19"</span>, <span class="cm-number">1</span>)
<span class="cm-comment"># Choose a fusion object</span>
<span class="cm-variable">fusion</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">getFusionById</span>(<span class="cm-variable">fusions</span>, <span class="cm-number">5267</span>)
<span class="cm-comment"># Load edb</span>
<span class="cm-keyword">if</span>(<span class="cm-operator">!</span><span class="cm-variable">exists</span>(<span class="cm-string">"edbSqliteFile"</span>))
  <span class="cm-variable">edbSqliteFile</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">system.file</span>(
    <span class="cm-string">"extdata"</span>,
    <span class="cm-string">"Homo_sapiens.GRCh37.74.sqlite"</span>,
    <span class="cm-variable">package</span><span class="cm-arg-is">=</span><span class="cm-string">"chimeraviz"</span>)
<span class="cm-variable">edb</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">ensembldb</span><span class="cm-operator">::</span><span class="cm-variable">EnsDb</span>(<span class="cm-variable">edbSqliteFile</span>)
<span class="cm-comment"># bamfile with reads in the regions of this fusion event</span>
<span class="cm-keyword">if</span>(<span class="cm-operator">!</span><span class="cm-variable">exists</span>(<span class="cm-string">"fusion5267and11759reads"</span>))
  <span class="cm-variable">fusion5267and11759reads</span> <span class="cm-operator cm-arrow">&lt;-</span> <span class="cm-variable">system.file</span>(
    <span class="cm-string">"extdata"</span>,
    <span class="cm-string">"fusion5267and11759reads.bam"</span>,
    <span class="cm-variable">package</span> <span class="cm-arg-is">=</span> <span class="cm-string">"chimeraviz"</span>)
<span class="cm-comment"># Plot!</span>
<span class="cm-variable">plotFusion</span>(
  <span class="cm-variable">fusion</span> <span class="cm-arg-is">=</span> <span class="cm-variable">fusion</span>,
  <span class="cm-variable">bamfile</span> <span class="cm-arg-is">=</span> <span class="cm-variable">fusion5267and11759reads</span>,
  <span class="cm-variable">edb</span> <span class="cm-arg-is">=</span> <span class="cm-variable">edb</span>,
  <span class="cm-variable">nonUCSC</span> <span class="cm-arg-is">=</span> <span class="cm-variable">TRUE</span>)

<span class="cm-comment"># Plot again with the transcripts of each gene collapsed into one gene-level track</span>
<span class="cm-variable">plotFusion</span>(
  <span class="cm-variable">fusion</span> <span class="cm-arg-is">=</span> <span class="cm-variable">fusion</span>,
  <span class="cm-variable">bamfile</span> <span class="cm-arg-is">=</span> <span class="cm-variable">fusion5267and11759reads</span>,
  <span class="cm-variable">edb</span> <span class="cm-arg-is">=</span> <span class="cm-variable">edb</span>,
  <span class="cm-variable">nonUCSC</span> <span class="cm-arg-is">=</span> <span class="cm-variable">TRUE</span>,
  <span class="cm-variable">reduceTranscripts</span> <span class="cm-arg-is">=</span> <span class="cm-variable">TRUE</span>)</pre>
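<p><span class="md-line md-end-block">This visualization is somewhat more involved. It needs the details of the fusion event, a BAM file with the reads covering the two fused genes, and an annotation database for the reference genome.</span></p>
<p><span class="md-line md-end-block">There are two display modes: one shows the fusion at the transcript level, the other at the gene level.</span></p>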
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/chimeraviz-fusion-plot.png"><img class="alignnone size-full wp-image-2958" src="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/chimeraviz-fusion-plot.png" alt="chimeraviz-fusion-plot" width="1310" height="1406" /></a></p>
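<p><span class="md-line md-end-block">The RCC1-HENMT1 fusion as an example.</span></p>
<p><span class="md-line md-end-block md-focus">Top: the chromosomal locations of the fusion. Ten discordant reads support the breakpoint (red arc), of which 6 are split reads and 4 are spanning reads. These are followed by the annotated transcripts and the read-count track.</span></p>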
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2955.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Exploring alternative splicing in transcriptome data with LeafCutter</title>
		<link>http://www.bio-info-trainee.com/2949.html</link>
		<comments>http://www.bio-info-trainee.com/2949.html#comments</comments>
		<pubDate>Fri, 05 Jan 2018 01:49:59 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Transcriptome Software]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2949</guid>
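		<description><![CDATA[Exploring alternative splicing in transcriptome data with LeafCutter. The software was first released back in 2016 &#8230; <a href="http://www.bio-info-trainee.com/2949.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<h1 class="md-end-block md-heading"><span class="">Exploring alternative splicing in transcriptome data with LeafCutter</span></h1>
<p><span class="md-line md-end-block">LeafCutter was first released back in 2016 as a bioRxiv preprint, but the paper did not appear in Nature Genetics (NG) until late 2017: <span class=""><a spellcheck="false" href="https://www.nature.com/articles/s41588-017-0004-9">Annotation-free quantification of RNA splicing using LeafCutter</a></span>. Its most distinctive feature is probably that it does not need gene annotation for the reference genome, so the gtf/gff file can be omitted, although alignment is of course still required. It has another very important capability, mapping splicing quantitative trait loci (sQTLs), but that is not relevant to my current work, so I will not cover it here.</span><span id="more-2949"></span></p>
<h3 class="md-end-block md-heading">Background</h3>
<p><span class="md-line md-end-block md-focus"><span class="md-expand">Mainstream algorithms for studying alternative splicing in transcriptome data either estimate isoform ratios or estimate exon inclusion levels. Both face considerable challenges: alternatively spliced transcripts overlap heavily with the canonical transcripts, technical error is present, and the methods depend on existing gene annotation, which is neither accurate nor complete. This is why the authors developed LeafCutter.</span></span></p>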
<h3 class="md-end-block md-heading">LeafCutter workflow.</h3>
<ul class="ul-list" data-mark="-">
<li><span class="md-line md-end-block">First, short reads are <span class=""><strong>mapped</strong></span> to the genome. When SNP data are available, WASP should be used to filter allele-specific reads that map with a bias. </span></li>
<li><span class="md-line md-end-block">Next, LeafCutter <span class=""><strong>extracts junction reads</strong></span> from .bam files, identifies alternatively excised intron clusters, and summarizes <span class=""><strong>intron usage</strong></span> as counts or proportions. </span></li>
<li><span class="md-line md-end-block">Finally, LeafCutter <span class=""><strong>identifies intron clusters</strong></span> with differentially excised introns between two user-defined groups by using a <span class=""><strong>Dirichlet-multinomial model,</strong></span> or maps genetic variants associated with intron excision levels by using a linear model. </span></li>
</ul>
<p><span class="md-line md-end-block">The authors tested LeafCutter on the Genotype-Tissue Expression (GTEx) Consortium dataset and compared the results against three mainstream gene annotation databases: GENCODE v19, Ensembl, and UCSC. They also validated it on other datasets. The data are available from dbGaP under accession <span class=""><a spellcheck="false" href="https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v6.p1">phs000424.v6.p1</a></span> (GTEx), GEO under accession <span class=""><a spellcheck="false" href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41637">GSE41637</a></span> (RNA-seq data from mammalian organs), and ENA under accession <span class=""><a spellcheck="false" href="https://www.ebi.ac.uk/ena/data/view/PRJEB3366">PRJEB3366</a></span> (Geuvadis).</span></p>
<h3 class="md-end-block md-heading">Software downloads:</h3>
<ul class="ul-list" data-mark="-">
<li><span class="md-line md-end-block">LeafCutter software, <a href="https://github.com/davidaknowles/leafcutter">https://github.com/davidaknowles/leafcutter</a>; </span></li>
<li><span class="md-line md-end-block">LeafViz visualizations, <a href="https://leafcutter.shinyapps.io/leafviz/">https://leafcutter.shinyapps.io/leafviz/</a>; </span></li>
<li><span class="md-line md-end-block">rheumatoid arthritis summary statistics, <a href="http://plaza.umin.ac.jp/yokada/datasource/software.htm">http://plaza.umin.ac.jp/yokada/datasource/software.htm</a>.</span></li>
</ul>
<h3 class="md-end-block md-heading">Installation and usage</h3>
<p><span class="md-line md-end-block">The simplest way to install it is with conda:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
conda install <span class="cm-attribute">-c</span> davidaknowles r-leafcutter</pre>
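<p><span class="md-line md-end-block">If the installation fails, you may need to create a dedicated environment for it.</span></p>
<p><span class="md-line md-end-block">That said, LeafCutter is itself an R package, so you can simply install it from RStudio on your own computer.</span></p>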
<pre class="md-fences md-end-block" lang="r" contenteditable="false">
<span class="cm-keyword">if</span> (<span class="cm-operator">!</span><span class="cm-variable">require</span>(<span class="cm-string">"devtools"</span>)) <span class="cm-variable">install.packages</span>(<span class="cm-string">"devtools"</span>, <span class="cm-variable">repos</span><span class="cm-operator">=</span><span class="cm-string">'http://cran.us.r-project.org'</span>)
<span class="cm-variable">devtools</span><span class="cm-operator">::</span><span class="cm-variable">install_github</span>(<span class="cm-string">"davidaknowles/leafcutter/leafcutter"</span>)</pre>
<p><span class="md-line md-end-block">The source repository also contains some scripts and test data, though, so it is still worth cloning:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-builtin">mkdir</span> <span class="cm-attribute">-p</span> ~/biosoft 
<span class="cm-builtin">cd</span> ~/biosoft
<span class="cm-builtin">git</span> clone https://github.com/davidaknowles/leafcutter
<span class="cm-builtin">cd</span> leafcutter
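<span class="cm-comment">## Edit scripts/bam2junc.sh and add the paths of the software it calls</span></pre>
<p><span class="md-line md-end-block">The repository mixes Perl and Python, so the team's development environment does not seem to be unified.</span></p>
<h2 class="md-end-block md-heading">Step 1: bam2junc</h2>
<p><span class="md-line md-end-block">For the alignment, prefer a splice-aware RNA-seq aligner such as STAR to produce the BAM files. The script below then converts them in batch:</span></p>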
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
<span class="cm-builtin">cat</span> bam_path.txt |while read id
<span class="cm-keyword">do</span>
<span class="cm-def">file</span><span class="cm-operator">=</span><span class="cm-quote">$(basename </span><span class="cm-def">$id</span><span class="cm-quote"> )</span>
<span class="cm-def">sample</span><span class="cm-operator">=</span><span class="cm-def">${file%%.*}</span>
    <span class="cm-builtin">echo</span> Converting <span class="cm-def">$id</span> to <span class="cm-def">$sample</span>.junc
    <span class="cm-builtin">sh</span> /public/biosoft/leafcutter/scripts/bam2junc.sh  <span class="cm-def">$id</span> <span class="cm-def">$sample</span>.junc
<span class="cm-keyword">done</span></pre>
<p><span class="md-line md-end-block">The resulting junc files look like this:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">
chr7    134840725   134843893   .   1   -
chr2    234355442   234355737   .   1   +
chr4    37828435    37831585    .   13  +
chr19   39101772    39101882    .   5   +
chr11   109735445   109827551   .   19  +
chr18   48458730    48465939    .   8   -
chr12   82751048    82752457    .   12  -
chr15   51018323    51018517    .   14  -
chr1    247323115   247335149   .   2   +
chr10   92920631    92982445    .   1   +</pre>
<p><span class="md-line md-end-block">This step is somewhat time-consuming. Save the paths of all the junc files, because the next step needs them.</span></p>
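<p><span class="md-line md-end-block">One thing worth checking before the batch conversion: the loop derives each sample name with ${file%%.*}, which cuts the file name at its first dot. The sketch below, with the made-up helper names junc_names and duplicate_junc_names, reports output names that more than one BAM would map to. It assumes one BAM path per line, as in bam_path.txt:</span></p>

```shell
# junc_names: hypothetical helper that mirrors the naming logic of the
# conversion loop (basename, then cut at the first dot).
junc_names() {
  while read -r id; do
    file=$(basename "$id")
    echo "${file%%.*}.junc"
  done
}

# duplicate_junc_names: prints every output name shared by two or more BAMs.
duplicate_junc_names() {
  junc_names | sort | uniq -d
}

# BAM names that contain several dots all collapse to the same output name.
printf '%s\n' /data/RNA.NA18486_YRI.chr1.bam /data/RNA.NA18487_YRI.chr1.bam \
  | duplicate_junc_names
# prints RNA.junc
```

<p><span class="md-line md-end-block">If cat bam_path.txt | duplicate_junc_names prints anything, the junc files would overwrite each other. Stripping only the extension, for example with ${file%.bam}, keeps the names distinct.</span></p>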
<h3 class="md-end-block md-heading">Step 2: Intron clustering</h3>
<p><span class="md-line md-end-block">This step requires Python 2.7. Version fragmentation is a major pitfall of Python, and it has still not been resolved.</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">ls *.junc &gt;test_juncfiles.txt
python /public/biosoft/leafcutter/clustering/leafcutter_cluster.py -j test_juncfiles.txt -m 50 -o testYRIvsEU -l 500000</pre>
<p><span class="md-line md-end-block">It finishes within a few minutes.</span></p>
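<p><span class="md-line md-end-block">As I read the LeafCutter documentation, -m sets the minimum number of reads supporting a cluster and -l the maximum intron length. To get a feel for the length cut-off, the sketch below lists junctions in a junc file that are longer than a given limit. The helper name long_introns is made up, and taking the length as end minus start is a simplification for illustration, not necessarily LeafCutter's exact rule:</span></p>

```shell
# long_introns: hypothetical helper; reads junc records on stdin
# (chrom, start, end, name, read count, strand) and prints the junctions
# whose span (end - start) exceeds max_len, which defaults to 500000.
long_introns() {
  awk -v max_len="${1:-500000}" '
    ($3 - $2) > max_len { print $1 ":" $2 "-" $3 "\t" ($3 - $2) }'
}

# Two rows copied from the junc excerpt; a limit of 50000 is used here
# only so that one of them is reported.
printf '%s\n' \
  'chr7 134840725 134843893 . 1 -' \
  'chr11 109735445 109827551 . 19 +' \
  | long_introns 50000
```

<p><span class="md-line md-end-block">With the limit of 500000 used in the clustering command, neither of these two junctions would be reported.</span></p>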
<p><span class="md-line md-end-block">The more important output files are:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">
1.3M Jan  4 17:45 testYRIvsEU_perind.counts.gz
680K Jan  4 17:45 testYRIvsEU_perind_numers.counts.gz
5.0M Jan  4 17:45 testYRIvsEU_pooled
540K Jan  4 17:45 testYRIvsEU_refined
 877 Jan  4 17:45 testYRIvsEU_sortedlibs
 854 Jan  4 17:43 test_juncfiles.txt</pre>
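<p><span class="md-line md-end-block">Note the file <span spellcheck="false"><code>testYRIvsEU_perind_numers.counts.gz</code></span>: each row is an intron and each column is a sample, and each value is the read count of that intron in that sample. These counts are what the alternative splicing analysis works on.</span></p>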
<pre class="md-fences md-end-block" lang="" contenteditable="false">
 #  zcat testYRIvsEU_perind_numers.counts.gz |tail
chr8:145651155:145651305:clu_6538 21 14 19 8 9 0 13 33 0 0 4 0 5 8 12 0 12 34 15 0 0 10 11
chr8:145651155:145651409:clu_6538 1021 611 186 190 294 284 681 89 222 57 257 363 694 807 523 44 469 812 926 71 80 260 214
chr8:145652362:145653872:clu_6539 1265 694 132 74 302 71 178 34 44 12 63 122 230 218 472 6 146 1421 1084 16 14 83 46
chr8:145652654:145653872:clu_6539 48 24 56 0 26 0 13 0 2 5 2 0 3 19 17 0 2 8 64 0 0 3 0
chr8:145652674:145653872:clu_6539 18 26 0 0 0 7 2 0 5 0 0 0 1 6 11 0 3 34 37 0 0 9 6
chr8:146017525:146017630:clu_6540 2 3 44 0 2 12 4 0 0 0 22 5 9 10 2 0 1 9 11 0 0 1 0
chr8:146017525:146017751:clu_6540 1067 671 620 41 295 347 224 89 62 33 262 136 229 223 356 17 288 480 1842 9 35 70 23
chr8:146076780:146078224:clu_6541 18 3 0 0 17 17 8 0 0 3 2 3 16 6 12 0 4 45 29 9 0 10 2
chr8:146076780:146078378:clu_6541 22 17 0 0 0 3 1 0 0 0 3 2 15 7 2 0 7 62 55 0 0 4 0
chr8:146076780:146078757:clu_6541 10 1 16 0 12 52 0 0 11 0 24 9 27 3 0 0 7 0 28 0 0 2 0</pre>
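<p><span class="md-line md-end-block">Each row name ends with a cluster ID (clu_N), so introns can be grouped by cluster. As a rough sketch, the made-up helper cluster_totals below sums all counts per cluster. It assumes rows shaped like the excerpt above (an intron ID followed by whitespace-separated counts) and skips any line, such as a header of sample names, whose first field carries no cluster ID:</span></p>

```shell
# cluster_totals: hypothetical helper; sums the counts of all introns and
# samples per cluster, keeping clusters in order of first appearance.
cluster_totals() {
  awk '
    $1 !~ /:clu_/ { next }             # skip lines without an intron ID
    {
      n = split($1, id, ":")           # the last field is the cluster ID
      total = 0
      for (i = 2; i <= NF; i++) total += $i
      if (!(id[n] in sum)) order[++k] = id[n]
      sum[id[n]] += total
    }
    END { for (j = 1; j <= k; j++) print order[j] "\t" sum[order[j]] }'
}

# Three rows shortened to their first three samples for readability.
printf '%s\n' \
  'chr8:145652654:145653872:clu_6539 48 24 56' \
  'chr8:145652674:145653872:clu_6539 18 26 0' \
  'chr8:146017525:146017630:clu_6540 2 3 44' \
  | cluster_totals
```

<p><span class="md-line md-end-block">On the real file, run zcat testYRIvsEU_perind_numers.counts.gz | cluster_totals. Clusters with very low totals are the ones most likely to end up without a test result in the next step.</span></p>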
<h3 class="md-end-block md-heading">Step 3: Build the group file and run the differential analysis</h3>
<p><span class="md-line md-end-block">To avoid exposing my real project, here is the authors' example file instead:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">
RNA.NA18486_YRI.chr1.bam YRI
RNA.NA18487_YRI.chr1.bam YRI
RNA.NA18488_YRI.chr1.bam YRI
RNA.NA18489_YRI.chr1.bam YRI
RNA.NA18498_YRI.chr1.bam YRI
RNA.NA06984_CEU.chr1.bam CEU
RNA.NA06985_CEU.chr1.bam CEU
RNA.NA06986_CEU.chr1.bam CEU
RNA.NA06989_CEU.chr1.bam CEU
RNA.NA06994_CEU.chr1.bam CEU</pre>
<p><span class="md-line md-end-block">It is a simple two-column file that states which group each sample belongs to.</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"> /public/biosoft/leafcutter/scripts/leafcutter_ds.R <span class="cm-attribute">--num_threads</span> <span class="cm-number">4</span> \
 <span class="cm-attribute">--exon_file</span><span class="cm-operator">=</span>/public/biosoft/leafcutter/leafcutter/data/gencode19_exons.txt.gz \
testYRIvsEU_perind_numers.counts.gz group_info.txt</pre>
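<p><span class="md-line md-end-block" contenteditable="true">Here <span spellcheck="false"><code>group_info.txt</code></span> is the group file you prepared yourself. Note that <span class=""><strong>this file must contain exactly two groups,</strong></span><span class=""> so that the software knows which comparison to make. If you have more groups, consider preparing several group files and running the analysis once for each.</span></span></p>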
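<p><span class="md-line md-end-block">The two-group requirement is easy to check before launching the analysis. This is a small sketch, assuming a whitespace-separated file with the group label in the second column. The helper name count_groups is made up for illustration:</span></p>

```shell
# count_groups: hypothetical helper; prints how many distinct group labels
# appear in the second column of the given file.
count_groups() {
  awk 'NF >= 2 { print $2 }' "$1" | sort -u | wc -l | tr -d ' '
}

# A temporary file holding three rows from the authors' example.
groups_file=$(mktemp)
printf '%s\n' \
  'RNA.NA18486_YRI.chr1.bam YRI' \
  'RNA.NA18487_YRI.chr1.bam YRI' \
  'RNA.NA06984_CEU.chr1.bam CEU' > "$groups_file"

if [ "$(count_groups "$groups_file")" -eq 2 ]; then
  echo "OK: exactly two groups"
else
  echo "group file must contain exactly two groups"
fi
rm -f "$groups_file"
```

<p><span class="md-line md-end-block">For the example rows this prints the OK message. For your own data, call count_groups group_info.txt.</span></p>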
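<p><span class="md-line md-end-block">Of course, the script above no longer needs to run on a Linux server.</span></p>
<p><span class="md-line md-end-block">With the intron count matrix and the group information in hand, the differential analysis uses very little compute, so you can download everything and run it on your own computer.</span></p>
<p><span class="md-line md-end-block">Open the file <span class="" spellcheck="false"><code>/public/biosoft/leafcutter/scripts/leafcutter_ds.R</code></span> yourself to understand the whole procedure.</span></p>
<p><span class="md-line md-end-block" contenteditable="true"><span class="">This step also produces all its results within a few minutes.</span></span></p>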
<pre class="md-fences md-end-block" lang="" contenteditable="false">
Running differential splicing analysis...
Differential splicing summary:
                                             statuses Freq
1 &lt;2 introns used in &gt;=min_samples_per_intron samples  425
2                          &lt;=1 sample with coverage&gt;0   62
3               &lt;=1 sample with coverage&gt;min_coverage  939
4                            Not enough valid samples 3047
5                                             Success 2068
Saving results...
Loading exons from /Users/jmzeng/biosoft/leafcutter/leafcutter/data/gencode19_exons.txt.gz
All done, exiting</pre>
<p><span class="md-line md-end-block" contenteditable="true">Among the output files, the one to study in detail is <span class="" spellcheck="false"><code>leafcutter_ds_cluster_significance.txt</code></span><span class="">. For its contents, refer mainly to the README.</span></span></p>
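<p><span class="md-line md-end-block">A common first look at this file is to keep only the clusters below an adjusted p-value cut-off. The sketch below assumes a tab-separated file with a header row containing a column named p.adjust. That column name is my assumption and should be checked against your own output. The helper name significant_clusters and the toy rows are made up for illustration:</span></p>

```shell
# significant_clusters: hypothetical helper; prints the header plus the rows
# whose p.adjust value is below the threshold (default 0.05). The column is
# located by its header name, which is an assumption about the file layout.
significant_clusters() {
  awk -F '\t' -v thr="${1:-0.05}" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "p.adjust") col = i
      if (!col) { print "no p.adjust column found"; exit 1 }
      print
      next
    }
    $col != "NA" && ($col + 0) < thr'
}

# Made-up toy rows, not real LeafCutter output.
printf '%s\t%s\t%s\n' \
  cluster status p.adjust \
  'chr1:clu_1' Success 0.001 \
  'chr1:clu_2' Success 0.8 \
  'chr1:clu_3' 'Not enough valid samples' NA \
  | significant_clusters 0.05
```

<p><span class="md-line md-end-block">Only the header and the clu_1 row pass the filter here. Rows without a test result (NA) are dropped.</span></p>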
<h3 class="md-end-block md-heading">Step 4: Visualize the alternative splicing events</h3>
<p><span class="md-line md-end-block" contenteditable="true"><span class="">This is also a ready-made script.</span></span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"> /Users/jmzeng/biosoft/leafcutter/scripts/ds_plots.R <span class="cm-attribute">-e</span>  /Users/jmzeng/biosoft/leafcutter/leafcutter/data/gencode19_exons.txt.gz testYRIvsEU_perind_numers.counts.gz   group_info.txt leafcutter_ds_cluster_significance.txt <span class="cm-attribute">-f</span> <span class="cm-number">0</span>.05</pre>
<p><span class="md-line md-end-block">All the alternative splicing events are drawn into a single PDF, as shown below:</span></p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/LeafCutter1.jpeg"><img class="alignnone size-full wp-image-2950" src="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/LeafCutter1.jpeg" alt="leafcutter1" width="2236" height="2124" /></a> <a href="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/LeafCutter2.jpeg"><img class="alignnone size-full wp-image-2951" src="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/LeafCutter2.jpeg" alt="leafcutter2" width="2232" height="2122" /></a> <a href="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/LeafCutter3.jpeg"><img class="alignnone size-full wp-image-2952" src="http://www.bio-info-trainee.com/wp-content/uploads/2018/01/LeafCutter3.jpeg" alt="leafcutter3" width="2228" height="2154" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2949.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Exploring alternative splicing with SGSeq</title>
		<link>http://www.bio-info-trainee.com/2890.html</link>
		<comments>http://www.bio-info-trainee.com/2890.html#comments</comments>
		<pubDate>Thu, 14 Dec 2017 03:17:11 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Transcriptome Software]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2890</guid>
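		<description><![CDATA[Alternative splicing is the process by which the exons of a pre-mRNA are joined together in different ways. Because alternative splicing lets one &#8230; <a href="http://www.bio-info-trainee.com/2890.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>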
				<content:encoded><![CDATA[<div class="markdown-here-wrapper" data-md-url="http://www.bio-info-trainee.com/wp-admin/post-new.php">
<blockquote style="margin: 1.2em 0px; border-left: 4px solid #dddddd; padding: 0px 1em; color: #777777; quotes: none;">
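<p style="margin: 0px 0px 1.2em !important;"><strong>Alternative splicing</strong> is the process by which the exons of a pre-mRNA are joined together in different ways. Because <strong>alternative splicing</strong> lets one gene produce multiple mRNA <strong>transcripts</strong>, different mRNAs may be translated into different proteins.</p>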
</blockquote>
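<h2 id="-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">Background on alternative splicing</h2>
<p style="margin: 0px 0px 1.2em !important;">The transcriptome generally refers to the totality of RNA transcribed from the genome of a cell or tissue, including protein-coding mRNA and the various non-coding RNAs (<strong>rRNA, tRNA, snRNA, snoRNA, lncRNA, microRNA</strong>, etc.). Eukaryotic genes have a discontinuous structure, as shown in the figure below:</p>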
<p style="margin: 0px 0px 1.2em !important;"><span id="more-2890"></span></p>
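<p style="margin: 0px 0px 1.2em !important;"><img src="http://www.bio-info-trainee.com/wp-content/uploads/2017/11/gene-structure.png" alt="Structure of a eukaryotic gene" /></p>
<p style="margin: 0px 0px 1.2em !important;">The initial transcription product is not a mature mRNA molecule but its precursor, the pre-mRNA. To become mature mRNA, the introns, which do not code for protein, are excised from the pre-mRNA and the remaining protein-coding exons are joined together. In practice there are many different ways of cutting and joining during this process, which produce different splice isoforms. This is the alternative splicing discussed here.</p>
<p style="margin: 0px 0px 1.2em !important;">Alternative splicing takes many complex forms, which fall roughly into five classes.</p>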
<ul style="margin: 1.2em 0px; padding-left: 2em;">
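<li style="margin: 0.5em 0px;">The first class is exon skipping: the skipped exon is removed together with its flanking introns, and the upstream and downstream exons are joined directly in the spliced product.</li>
<li style="margin: 0.5em 0px;">The second class is intron retention: a stretch of sequence is part of an exon in one isoform, whereas in the isoform it is compared with it is an intron and is spliced out.</li>
<li style="margin: 0.5em 0px;">The third class is alternative 5' or 3' splice site usage (alternative 5'ss or alternative 3'ss, where the 5'ss is the donor site and the 3'ss is the acceptor site): compared with the other isoform, the splice site at the 5' or 3' end differs, while all other splicing choices are the same.</li>
<li style="margin: 0.5em 0px;">The fourth class is the alternative transcription start site (alternative TSS): the difference lies in the transcription start region, that is, the two isoforms are identical apart from their transcription start sites.</li>
<li style="margin: 0.5em 0px;">The fifth class is the alternative transcription termination site (alternative TTS), the counterpart of the fourth class: the isoforms differ only in their transcription termination sites.</li>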
</ul>
<p style="margin: 0px 0px 1.2em !important;"><img src="http://www.bio-info-trainee.com/wp-content/uploads/2017/11/splicing.png" alt="The five forms of alternative splicing" /></p>
<h2 id="-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">Software and algorithms</h2>
<p style="margin: 0px 0px 1.2em !important;"><strong>Older</strong> tools for analysing alternative splicing include SpliceR, SpliceGrapher, ASprofile and Splicing Express. They build on the output of Cufflinks: after the reads have been mapped back to the genome, they use position, length and structure information to determine or predict the likely isoform types. The TopHat + Cufflinks pipeline is no longer in mainstream use.</p>
<h3 id="sgseq-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.3em;">SGSeq workflow</h3>
<p style="margin: 0px 0px 1.2em !important;">This post introduces <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">SGSeq</code>. Its input is BAM files, but they must come from a splice-aware aligner suitable for RNA-seq data, such as:</p>
<ul style="margin: 1.2em 0px; padding-left: 2em;">
<li style="margin: 0.5em 0px;">GSNAP (T. D. Wu and Nacu 2010)</li>
<li style="margin: 0.5em 0px;">HISAT (Kim, Langmead, and Salzberg 2015)</li>
<li style="margin: 0.5em 0px;">STAR (Dobin et al. 2013)</li>
</ul>
<p style="margin: 0px 0px 1.2em !important;">What it really needs is for the BAM file to carry the <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">XS</code> tag, which records the direction of transcription for spliced reads!</p>
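<p style="margin: 0px 0px 1.2em !important;">Whether a BAM file meets this requirement can be checked from its SAM text. The sketch below counts spliced alignments (CIGAR containing N) that lack an XS:A tag. The helper name spliced_without_xs and the toy records are made up for illustration, and the check is a rough heuristic rather than anything SGSeq itself runs:</p>

```shell
# spliced_without_xs: hypothetical helper; reads SAM records on stdin and
# counts spliced alignments (CIGAR contains N) that carry no XS:A:+ or
# XS:A:- tag among their optional fields (column 12 onwards).
spliced_without_xs() {
  awk -F '\t' '
    /^@/ { next }                      # skip SAM header lines
    $6 ~ /N/ {
      has_xs = 0
      for (i = 12; i <= NF; i++) if ($i ~ /^XS:A:[+-]$/) has_xs = 1
      if (!has_xs) missing++
    }
    END { print missing + 0 }'
}

# Made-up toy records: one spliced read with XS, one spliced read without,
# and one unspliced read (which is ignored).
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  r1 0 chr1 100 255 10M500N10M '*' 0 0 ACGT IIII 'XS:A:+' \
  r2 0 chr1 300 255 10M800N10M '*' 0 0 ACGT IIII 'NH:i:1' \
  r3 0 chr1 900 255 20M '*' 0 0 ACGT IIII 'NH:i:1' \
  | spliced_without_xs
# prints 1
```

<p style="margin: 0px 0px 1.2em !important;">On a real file, run samtools view sample.bam | spliced_without_xs, where sample.bam stands for your own file. A result of 0 means every spliced read has the tag. Note that STAR only writes XS when it is run with --outSAMstrandField intronMotif.</p>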
<p style="margin: 0px 0px 1.2em !important;">Installation instructions and usage for the <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">SGSeq</code> package can be found on the official site:</p>
<table style="margin: 1.2em 0px; padding: 0px; border-collapse: collapse; border-spacing: 0px; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; font-size: inherit; line-height: inherit; font-family: inherit; border: 0px;">
<thead>
<tr style="border-width: 1px 0px 0px; border-image: initial; background-color: white; margin: 0px; padding: 0px; border-color: #cccccc initial initial initial; border-style: solid initial initial initial;">
<th style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em; font-weight: bold; background-color: #f0f0f0;"><a href="https://bioconductor.org/packages/release/bioc/vignettes/SGSeq/inst/doc/SGSeq.html">HTML</a></th>
<th style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em; font-weight: bold; background-color: #f0f0f0;"><a href="https://bioconductor.org/packages/release/bioc/vignettes/SGSeq/inst/doc/SGSeq.R">R Script</a></th>
<th style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em; font-weight: bold; background-color: #f0f0f0;">SGSeq</th>
</tr>
</thead>
<tbody style="margin: 0px; padding: 0px; border: 0px;">
<tr style="border-width: 1px 0px 0px; border-image: initial; background-color: white; margin: 0px; padding: 0px; border-color: #cccccc initial initial initial; border-style: solid initial initial initial;">
<td style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em;"><a href="https://bioconductor.org/packages/release/bioc/manuals/SGSeq/man/SGSeq.pdf">PDF</a></td>
<td style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em;"></td>
<td style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em;">Reference Manual</td>
</tr>
<tr style="border-width: 1px 0px 0px; border-image: initial; background-color: #f8f8f8; margin: 0px; padding: 0px; border-color: #cccccc initial initial initial; border-style: solid initial initial initial;">
<td style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em;"><a href="https://bioconductor.org/packages/release/bioc/news/SGSeq/NEWS">Text</a></td>
<td style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em;"></td>
<td style="font-size: 1em; border: 1px solid #cccccc; margin: 0px; padding: 0.5em 1em;">NEWS</td>
</tr>
</tbody>
</table>
<h2 id="-bam-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">BAM files are required</h2>
<p style="margin: 0px 0px 1.2em !important;">After installing the package, you can see the example data that ships with it:</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;">jianmingzengs-iMac:IGV_2.3.98 jmzeng$ cd /Library/Frameworks/R.framework/Versions/3.4/Resources/library/SGSeq/extdata/bams/
jianmingzengs-iMac:bams jmzeng$ ls -lh
total 1952
-rw-r--r-- 1 jmzeng admin 54K Nov 1 01:26 N1.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 N1.bam.bai
-rw-r--r-- 1 jmzeng admin 86K Nov 1 01:26 N2.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 N2.bam.bai
-rw-r--r-- 1 jmzeng admin 75K Nov 1 01:26 N3.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 N3.bam.bai
-rw-r--r-- 1 jmzeng admin 92K Nov 1 01:26 N4.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 N4.bam.bai
-rw-r--r-- 1 jmzeng admin 75K Nov 1 01:26 T1.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 T1.bam.bai
-rw-r--r-- 1 jmzeng admin 90K Nov 1 01:26 T2.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 T2.bam.bai
-rw-r--r-- 1 jmzeng admin 65K Nov 1 01:26 T3.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 T3.bam.bai
-rw-r--r-- 1 jmzeng admin 75K Nov 1 01:26 T4.bam
-rw-r--r-- 1 jmzeng admin 43K Nov 1 01:26 T4.bam.bai
</code></pre>
<p style="margin: 0px 0px 1.2em !important;">These <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">bam</code> files are so small because the authors kept only the reads from a small region of <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">hg19</code>, at coordinates <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">16 [87362942, 87425708]</code>.</p>
<h2 id="-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">Annotation is required</h2>
<p style="margin: 0px 0px 1.2em !important;">The transcript annotation has to match the reference genome used for alignment, and it is built from a Bioconductor TxDb object. For hg19 it can be done as follows:</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;">library(SGSeq)  # provides convertToTxFeatures() and the example objects si and gr
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb &lt;- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb &lt;- keepSeqlevels(txdb, "chr16")      # restrict the annotation to chromosome 16
seqlevelsStyle(txdb) &lt;- "NCBI"            # rename "chr16" to "16" to match the BAM files
txf_ucsc &lt;- convertToTxFeatures(txdb)
txf_ucsc &lt;- txf_ucsc[txf_ucsc %over% gr]  # gr: example region of interest shipped with SGSeq
head(txf_ucsc)
type(txf_ucsc)
head(txName(txf_ucsc))
head(geneName(txf_ucsc))
</code></pre>
<p style="margin: 0px 0px 1.2em !important;">The key step is <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">convertToTxFeatures()</code>, which converts the annotation (here the TxDb object) into a <em>TxFeatures</em> object, a <code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0px 0.3em; white-space: pre-wrap; border: 1px solid #eaeaea; background-color: #f8f8f8; border-radius: 3px; display: inline;">GRanges</code>-like container in which every feature is labelled with one of the following five types:</p>
<ul style="margin: 1.2em 0px; padding-left: 2em;">
<li style="margin: 0.5em 0px;"><em>J</em> (splice junction)</li>
<li style="margin: 0.5em 0px;"><em>I</em> (internal exon)</li>
<li style="margin: 0.5em 0px;"><em>F</em> (first/5′-terminal exon)</li>
<li style="margin: 0.5em 0px;"><em>L</em> (last/3′-terminal exon)</li>
<li style="margin: 0.5em 0px;"><em>U</em> (unspliced transcript).</li>
</ul>
<p style="margin: 0px 0px 1.2em !important;">Next, <em>convertToSGFeatures()</em> converts the TxFeatures object into an SGFeatures object, whose features are labelled as:</p>
<ul style="margin: 1.2em 0px; padding-left: 2em;">
<li style="margin: 0.5em 0px;"><em>J</em> (splice junction)</li>
<li style="margin: 0.5em 0px;"><em>E</em> (disjoint exon bin)</li>
<li style="margin: 0.5em 0px;"><em>D</em> (splice donor site)</li>
<li style="margin: 0.5em 0px;"><em>A</em> (splice acceptor site).</li>
</ul>
<h2 id="-sgseq-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">Running SGSeq</h2>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;">path &lt;- system.file("extdata", package = "SGSeq")
si$file_bam &lt;- file.path(path, "bams", si$file_bam)  # si: sample table shipped with SGSeq
sgfc_ucsc &lt;- analyzeFeatures(si, features = txf_ucsc)
sgfc_ucsc
</code></pre>
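<p style="margin: 0px 0px 1.2em !important;">Because the bundled data set is tiny, the run finishes almost immediately. I do not know how long my own <strong>16G</strong> BAM files would take in a real analysis.</p>
<h2 id="-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">Exploring the results</h2>
<p style="margin: 0px 0px 1.2em !important;">Everything still runs inside R. The functions below are used to explore the results: the count and FPKM matrices report the expression of every exon of every gene, together with the splice junctions that connect those exons.</p>
<p style="margin: 0px 0px 1.2em !important;">In other words, the package counts reads for all genomic features directly within R.</p>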
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code class="hljs language-R" style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block; overflow: auto; overflow-x: auto; color: #333333; background: #f8f8f8; text-size-adjust: none;">colData(sgfc_ucsc)
rowRanges(sgfc_ucsc)
head(counts(sgfc_ucsc))
head(FPKM(sgfc_ucsc))
</code></pre>
<h2 id="-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">Visualizing alternative splicing</h2>
<p style="margin: 0px 0px 1.2em !important;">Pick one gene and visualize how its expression differs between samples:</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code class="hljs language-R" style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block; overflow: auto; overflow-x: auto; color: #333333; background: #f8f8f8; text-size-adjust: none;">df &lt;- plotFeatures(sgfc_ucsc, geneID = <span class="hljs-number" style="color: #008080;">1</span>)
<span class="hljs-comment" style="color: #999988; font-style: italic;"># A slightly more involved visualization: predict features from the reads, then annotate them</span>
sgfc_pred &lt;- analyzeFeatures(si, which = gr)
head(rowRanges(sgfc_pred))
sgfc_pred &lt;- annotate(sgfc_pred, txf_ucsc)
head(rowRanges(sgfc_pred))
df &lt;- plotFeatures(sgfc_pred, geneID = <span class="hljs-number" style="color: #008080;">1</span>, color_novel = <span class="hljs-string" style="color: #dd1144;">"red"</span>)
</code></pre>
<p style="margin: 0px 0px 1.2em !important;"> <a href="http://www.bio-info-trainee.com/wp-content/uploads/2017/12/transcript-variant-bioconductor-SGSeq.png"><img class="alignnone size-full wp-image-2893" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/12/transcript-variant-bioconductor-SGSeq.png" alt="transcript-variant-bioconductor-sgseq" width="1504" height="1080" /></a>This is an example hand-picked by the package authors to show the software at its best. In a real analysis, one should first survey which alternative splicing events exist across the whole data set and then output them.</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;">## Another way to display the results: splice graph on top, per-sample coverage below
par(mfrow = c(5, 1), mar = c(1, 3, 1, 1))
plotSpliceGraph(rowRanges(sgfc_pred), geneID = 1, toscale = "none", color_novel = "red")
for (j in 1:4) {
 plotCoverage(sgfc_pred[, j], geneID = 1, toscale = "none")
}
</code></pre>
<h2 style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;"><img class="alignnone size-full wp-image-2892" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/12/transcript-variant-bioconductor-SGSeq-2.png" alt="transcript-variant-bioconductor-sgseq-2" width="1788" height="1136" /></h2>
<h2 id="-" style="margin: 1.3em 0px 1em; padding: 0px; font-weight: bold; font-size: 1.4em; border-bottom: 1px solid #eeeeee;">Identifying splice variants from the alternative splicing predictions</h2>
<p style="margin: 0px 0px 1.2em !important;">Instead of considering the full splice graph of a gene, the analysis can be focused on individual splice events. Function <em>analyzeVariants()</em> recursively identifies splice events from the graph, obtains representative counts for each splice variant, and computes estimates of relative splice variant usage, also referred to as <strong>‘percentage spliced in’ (PSI or Ψ)</strong> (Venables et al. 2008; Katz et al. 2010). (This step involves a specific estimation algorithm.)</p>
<pre style="font-size: 1em; font-family: Consolas, Inconsolata, Courier, monospace; line-height: 1.2em; margin: 1.2em 0px;"><code style="font-size: 0.85em; font-family: Consolas, Inconsolata, Courier, monospace; margin: 0px 0.15em; padding: 0.5em 0.7em; white-space: pre; border: 1px solid #cccccc; background-color: #f8f8f8; border-radius: 3px; display: block !important; overflow: auto;">sgvc_pred &lt;- analyzeVariants(sgfc_pred)
sgvc_pred
mcols(sgvc_pred)
variantFreq(sgvc_pred)
plotVariants(sgvc_pred, eventID = 1, color_novel = "red")
library(BSgenome.Hsapiens.UCSC.hg19)
seqlevelsStyle(Hsapiens) &lt;- "NCBI"
sgv_pred &lt;- rowRanges(sgvc_pred)  # extract the splice variants from the variant counts
vep &lt;- predictVariantEffects(sgv_pred, txdb, Hsapiens)
vep
</code></pre>
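<p style="margin: 0px 0px 1.2em !important;">As a rough illustration of what a PSI value means, the sketch below computes the fraction of reads supporting one variant out of all reads supporting the event. This is a simplification and not SGSeq's actual estimator, which works from the representative counts described above; the function name <code>psi</code> and the example counts are made up for this post.</p>

```shell
#!/bin/sh
# psi VARIANT_READS OTHER_READS
# Prints VARIANT_READS / (VARIANT_READS + OTHER_READS),
# or NA when no reads cover the event at all.
psi() {
  awk -v v="$1" -v o="$2" 'BEGIN {
    total = v + o
    if (total == 0) print "NA"
    else printf "%.3f\n", v / total
  }'
}

psi 30 10   # 30 reads support the variant, 10 support the alternative: prints 0.750
```

<p style="margin: 0px 0px 1.2em !important;">Returning NA for events without any reads keeps uncovered events from being read as "variant never used".</p>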
</div>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2890.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A hands-on plant transcriptome project</title>
		<link>http://www.bio-info-trainee.com/2809.html</link>
		<comments>http://www.bio-info-trainee.com/2809.html#comments</comments>
		<pubDate>Thu, 02 Nov 2017 02:29:11 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[未分类]]></category>
		<category><![CDATA[转录组软件]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2809</guid>
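		<description><![CDATA[The plant in question is actually Arabidopsis, so the data processing is largely the same as for human studies. Transcriptome Transcriptome sequencing &#8230; <a href="http://www.bio-info-trainee.com/2809.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>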
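				<content:encoded><![CDATA[<p>The plant in question is actually Arabidopsis, so the data processing is largely the same as for human studies.</p>
<h2 class="md-end-block md-heading">Transcriptome</h2>
<p><span class="md-line md-end-block">Transcriptome sequencing studies <span class=""><strong>the total set of RNAs</strong></span> that a particular cell can transcribe in a given functional state, including mRNA and non-coding RNA. It gives a comprehensive view of the transcripts in a specific tissue or organ of a species, which supports studies of transcript structure, sequence variation, <span class=""><strong>gene expression levels</strong></span>, and the discovery of novel transcripts.</span><span id="more-2809"></span></p>
<p><span class="md-line md-end-block">Of these, measuring gene expression levels is the <span class=""><strong>most popular</strong></span> direction in transcriptomics: identifying transcripts and quantifying their expression is the core use of transcriptome data. Because of this, it does not depend on other omics data and can stand alone as a product, namely RNA-seq. That is why <span class=""><strong>transcriptome sequencing</strong></span> and <span class=""><strong>RNA-seq</strong></span> are so often treated as the same thing.</span></p>
<p><span class="md-line md-end-block">RNA-seq data are now <span class=""><strong>widely used</strong></span>, but no single pipeline solves every problem. The key steps of an RNA-seq analysis include: <span class=""><strong>experimental design, quality control, read alignment, expression quantification, visualization, differential expression, alternative splicing detection, functional annotation, fusion gene detection, and eQTL mapping</strong></span>.</span></p>
<p>It is worth mentioning that this tutorial is also very well written: https://github.com/twbattaglia/RNAseq-workflow</p>
<h2 class="md-end-block md-heading">Workflow overview</h2>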
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2017/10/overview-of-RNA-seq-technology.png"><img class="alignnone size-full wp-image-2792" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/10/overview-of-RNA-seq-technology.png" alt="overview-of-rna-seq-technology" width="717" height="674" /></a></p>
<p><span class="md-line md-end-block">Source: processing <span class=""><a spellcheck="false" href="http://biocluster.ucr.edu/~rkaundal/workshops/R_mar2016/RNAseq.html">mRNA-seq data</a></span> with R</span></p>
<p><span class="md-line md-end-block">[Figure: mRNA-seq workflow (image/mRNAseq-workflow-2010.jpeg)]</span></p>
<p><span class="md-line md-end-block">Figure from the 2010 Genome Biology article <span class=""><a spellcheck="false" href="https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-12-220">From RNA-seq reads to differential expression results</a></span></span></p>
<h2 class="md-end-block md-heading">Source publication for the data</h2>
<p><span class="md-line md-end-block">The data come from a paper published in Nature Communications, “Temporal dynamics of gene expression and histone marks at the Arabidopsis shoot meristem during flowering”. The authors used RNA-seq to study <span class=""><strong>how gene expression in the shoot meristem changes across time points</strong></span> during flowering.</span></p>
<p><span class="md-line md-end-block">The original pipeline was: TopHat -&gt; summarizeOverlaps -&gt; DESeq2 -&gt; AmiGO. </span><span class="md-line md-end-block">Reads were aligned to the TAIR10 ver.24 reference genome, with the ribosomal RNA regions (2:3471–9557; 3:14,197,350–14,203,988) masked.</span></p>
<p><span class="md-line md-end-block">DESeq2 was applied only to genes with FPKM &gt; 1 in at least one time point.</span></p>
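<p><span class="md-line md-end-block">The "FPKM above 1 in at least one time point" rule can be sketched as a small filter. This is only an illustration under assumptions: the helper name <code>filter_expressed</code> and the table layout (tab-separated, header row, gene ID in column 1, one FPKM column per time point) are made up here and are not taken from the paper.</span></p>

```shell
#!/bin/sh
# Hypothetical helper: keep genes whose FPKM exceeds 1 in at least one time point.
# Input (stdin): tab-separated table, header row first, gene ID in column 1,
# one FPKM column per time point (this layout is an assumption for the example).
filter_expressed() {
  awk -F '\t' '
    NR == 1 { print; next }            # pass the header through unchanged
    {
      keep = 0
      for (i = NF; i >= 2; i--)        # scan every FPKM column
        if ($i + 0 > 1) keep = 1
    }
    keep { print }
  '
}

# Synthetic table: geneA never exceeds 1, geneB does at T1
printf 'gene\tT0\tT1\tT2\tT3\ngeneA\t0.2\t0.5\t0.1\t0.9\ngeneB\t0.0\t3.4\t0.7\t0.2\n' | filter_expressed
```

<p><span class="md-line md-end-block">In this toy table only the header and geneB survive, since geneA stays below 1 at all four time points.</span></p>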
<p><span class="md-line md-end-block">The data are deposited at <a href="http://www.ebi.ac.uk/arrayexpress/">http://www.ebi.ac.uk/arrayexpress/</a> under accession E-MTAB-5130.</span></p>
<p><span class="md-line md-end-block">Experimental design: 4 time points (0, 1, 2, 3) with 4 biological replicates each, 16 samples in total.</span></p>
<h2 class="md-end-block md-heading">Downloading the data</h2>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">conda install <span class="cm-attribute">-c</span> bioconda salmon

# Fetch the sample sheet, find the column holding the FASTQ URIs, then download every file
<span class="cm-builtin">wget</span> http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-5130/E-MTAB-5130.sdrf.txt
head <span class="cm-attribute">-n1</span> E-MTAB-5130.sdrf.txt | tr <span class="cm-string">'\t'</span> <span class="cm-string">'\n'</span> | nl | <span class="cm-builtin">grep</span> URI
tail <span class="cm-attribute">-n</span> <span class="cm-operator">+</span><span class="cm-number">2</span> E-MTAB-5130.sdrf.txt | <span class="cm-builtin">cut</span> <span class="cm-attribute">-f</span> <span class="cm-number">33</span> | xargs <span class="cm-attribute">-I</span> {} <span class="cm-builtin">wget</span> {}

# Reference transcriptome, genome and annotation (release 28 on ftp.ensemblgenomes.org)
nohup <span class="cm-builtin">wget</span> ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz &amp;
nohup <span class="cm-builtin">wget</span> ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.28.dna.genome.fa.gz &amp;
nohup <span class="cm-builtin">wget</span>  ftp://ftp.ensemblgenomes.org/pub/plants/release-28/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.28.gff3.gz &amp;
nohup <span class="cm-builtin">wget</span> ftp://ftp.ensemblgenomes.org/pub/plants/release-28/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.28.gtf.gz &amp;</pre>
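<p><span class="md-line md-end-block">The column number 33 above was read off by eye from the <code>head | tr | nl | grep</code> output. A sketch of how the lookup could be done by header name instead is shown below. The helper name <code>sdrf_column</code> is made up, and the header text <code>FASTQ_URI</code> is an assumption about how the sheet labels that column, so check it against the real header first.</span></p>

```shell
#!/bin/sh
# Hypothetical helper: print the values of the leftmost column whose header contains
# the given text (default FASTQ_URI), reading a tab-separated SDRF table from stdin.
sdrf_column() {
  awk -F '\t' -v pat="${1:-FASTQ_URI}" '
    NR == 1 {
      for (i = NF; i >= 1; i--)        # walk backwards so the leftmost match wins
        if (index($i, pat)) col = i
      next
    }
    col { print $col }
  '
}

# Synthetic three-column sheet standing in for E-MTAB-5130.sdrf.txt
printf 'Source Name\tComment[FASTQ_URI]\tFactor Value[time]\nS1\tftp://example.org/S1_1.fastq.gz\t0\nS2\tftp://example.org/S2_1.fastq.gz\t1\n' | sdrf_column
```

<p><span class="md-line md-end-block">On the real sheet this could be chained as <code>cat E-MTAB-5130.sdrf.txt | sdrf_column | xargs -I {} wget {}</code>, which keeps working if the column order of the sheet changes.</span></p>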
<h2 class="md-end-block md-heading">salmon 流程</h2>
<p><span class="md-line md-end-block">About the software: some of the upstream quantification methods <span class=""><strong>(<span class=""><em>Salmon</em></span>, <span class=""><em>Sailfish</em></span>, <span class=""><em>kallisto</em></span>)</strong></span> are substantially faster and require less memory and disk usage compared to alignment-based methods that require creation and storage of BAM files.</span></p>
<p><span class="md-line md-end-block">Official site: <span spellcheck="false"><a href="https://combine-lab.github.io/salmon/">https://combine-lab.github.io/salmon/</a></span></span></p>
<p><span class="md-line md-end-block">First, build the index with Salmon:</span></p>
<ul class="ul-list" data-mark="-">
<li><span class="md-line md-end-block">salmon index -t Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz -i athal_index</span></li>
</ul>
<p><span class="md-line md-end-block">Building the index took 53 seconds. The resulting index directory looks like this:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">[jianmingzeng@jade salmon]$ ls -lh
total 19M
-rw-rw-r-- 1 jianmingzeng jianmingzeng  19M Oct 17 11:18 Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz
drwxrwxr-x 2 jianmingzeng jianmingzeng 4.0K Oct 17 11:54 athal_index
-rw-rw-r-- 1 jianmingzeng jianmingzeng  142 Oct 17 11:20 wget_cdna.sh
[jianmingzeng@jade salmon]$ ls -lh  athal_index/
total 1.1G
-rw-rw-r-- 1 jianmingzeng jianmingzeng 751M Oct 17 11:54 hash.bin
-rw-rw-r-- 1 jianmingzeng jianmingzeng  357 Oct 17 11:54 header.json
-rw-rw-r-- 1 jianmingzeng jianmingzeng  115 Oct 17 11:54 indexing.log
-rw-rw-r-- 1 jianmingzeng jianmingzeng  156 Oct 17 11:54 quasi_index.log
-rw-rw-r-- 1 jianmingzeng jianmingzeng   89 Oct 17 11:54 refInfo.json
-rw-rw-r-- 1 jianmingzeng jianmingzeng 7.8M Oct 17 11:53 rsd.bin
-rw-rw-r-- 1 jianmingzeng jianmingzeng 248M Oct 17 11:54 sa.bin
-rw-rw-r-- 1 jianmingzeng jianmingzeng  63M Oct 17 11:53 txpInfo.bin
-rw-rw-r-- 1 jianmingzeng jianmingzeng   96 Oct 17 11:54 versionInfo.json
[jianmingzeng@jade salmon]$</pre>
<p><span class="md-line md-end-block">Next, quantify all of the samples.</span></p>
<p><span class="md-line md-end-block">With 16 samples it is impractical to type the commands one by one, so we write a script:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-meta">#! /bin/bash</span>
<span class="cm-def">index</span><span class="cm-operator">=</span>salmon/athal_index <span class="cm-comment">## path to the index directory</span>
<span class="cm-keyword">for</span> fn <span class="cm-keyword">in</span> ERR1698{194..209};
<span class="cm-keyword">do</span>
    <span class="cm-def">sample</span><span class="cm-operator">=</span><span class="cm-quote">`basename </span><span class="cm-def">${fn}</span><span class="cm-quote">`</span>
    <span class="cm-builtin">echo</span> <span class="cm-string">"Processing sample </span><span class="cm-def">${sample}</span><span class="cm-string">"</span>
    salmon quant <span class="cm-attribute">-i</span> <span class="cm-def">$index</span> <span class="cm-attribute">-l</span> A \
        <span class="cm-attribute">-1</span> <span class="cm-def">${sample}</span>_1.fastq.gz \
        <span class="cm-attribute">-2</span> <span class="cm-def">${sample}</span>_2.fastq.gz \
        <span class="cm-attribute">-p</span> <span class="cm-number">5</span> <span class="cm-attribute">-o</span> quants/<span class="cm-def">${sample}</span>_quant
<span class="cm-keyword">done</span></pre>
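<p><span class="md-line md-end-block">Each <span spellcheck="false"><code>salmon quant</code></span> run writes a <span spellcheck="false"><code>quant.sf</code></span> table (columns Name, Length, EffectiveLength, TPM, NumReads) into its output directory. The following is a minimal sketch of gathering the NumReads column of every sample into one tab-separated matrix. The function name <span spellcheck="false"><code>merge_salmon_counts</code></span> is hypothetical, and the <span spellcheck="false"><code>quants/*_quant</code></span> layout is assumed from the script above.</span></p>

```shell
#!/bin/bash
# Hypothetical helper (not part of Salmon): merge the NumReads column (5th)
# of every <dir>/<sample>_quant/quant.sf into one tab-separated matrix.
merge_salmon_counts() {
    local quant_dir=$1
    awk -F'\t' -v OFS='\t' '
        FNR == 1 {                      # first line of each file is its header
            n = split(FILENAME, parts, "/")
            sample = parts[n - 1]       # directory name, e.g. ERR1698194_quant
            sub(/_quant$/, "", sample)
            samples[++ns] = sample
            next
        }
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++ng] = $1 }
            count[$1, ns] = $5
        }
        END {
            header = ""
            for (s = 1; s <= ns; s++) header = header OFS samples[s]
            print header                # header starts with an empty field
            for (g = 1; g <= ng; g++) {
                line = order[g]
                for (s = 1; s <= ns; s++)
                    line = line OFS ((order[g], s) in count ? count[order[g], s] : 0)
                print line
            }
        }
    ' "$quant_dir"/*_quant/quant.sf
}
```

<p><span class="md-line md-end-block">For example, <span spellcheck="false"><code>merge_salmon_counts quants &gt; salmon_counts.txt</code></span>. Note that these are transcript-level estimates and may be fractional; summarizing them to gene level is a separate step.</span></p>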
<h2 class="md-end-block md-heading">The subread workflow</h2>
<p><span class="md-line md-end-block">Here too the first step is to build an index, but the FASTA file has to be decompressed beforehand.</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">gunzip Arabidopsis_thaliana.TAIR10.28.dna.genome.fa.gz
~/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/subread-buildindex -o athal_index   Arabidopsis_thaliana.TAIR10.28.dna.genome.fa</pre>
<p><span class="md-line md-end-block">This also took less than a minute. The resulting index files are:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">117M Oct 17 11:21 Arabidopsis_thaliana.TAIR10.28.dna.genome.fa
 15M Oct 17 11:41 Arabidopsis_thaliana.TAIR10.28.gff3.gz
 29M Oct 17 12:19 athal_index.00.b.array
231M Oct 17 12:19 athal_index.00.b.tab
 314 Oct 17 12:19 athal_index.files
345K Oct 17 12:18 athal_index.log</pre>
<p><span class="md-line md-end-block">The alignment is again run in batch with a script:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-meta">#! /bin/bash</span>
<span class="cm-def">subjunc</span><span class="cm-operator">=</span><span class="cm-string">"/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/subjunc"</span>; 
<span class="cm-def">index</span><span class="cm-operator">=</span><span class="cm-string">'subread/athal_index'</span>;
<span class="cm-keyword">for</span> fn <span class="cm-keyword">in</span> ERR1698{194..209};
<span class="cm-keyword">do</span>
    <span class="cm-def">sample</span><span class="cm-operator">=</span><span class="cm-quote">`basename </span><span class="cm-def">${fn}</span><span class="cm-quote">`</span>
    <span class="cm-builtin">echo</span> <span class="cm-string">"Processing sample </span><span class="cm-def">${sample}</span><span class="cm-string">"</span> 
    <span class="cm-def">$subjunc</span> <span class="cm-attribute">-i</span> <span class="cm-def">$index</span> \
        <span class="cm-attribute">-r</span> <span class="cm-def">${sample}</span>_1.fastq.gz \
        <span class="cm-attribute">-R</span> <span class="cm-def">${sample}</span>_2.fastq.gz \
        <span class="cm-attribute">-T</span> <span class="cm-number">5</span> <span class="cm-attribute">-o</span> <span class="cm-def">${sample}</span>_subjunc.bam
<span class="cm-keyword">done</span></pre>
<p><span class="md-line md-end-block">BAM files alone are not enough; featureCounts is still needed to quantify them.</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-def">gff3</span><span class="cm-operator">=</span><span class="cm-string">'/home/jianmingzeng/data/public/tair/subread/Arabidopsis_thaliana.TAIR10.28.gff3.gz'</span>;
<span class="cm-def">gtf</span><span class="cm-operator">=</span><span class="cm-string">'/home/jianmingzeng/data/public/tair/subread/Arabidopsis_thaliana.TAIR10.28.gtf'</span>;

<span class="cm-def">featureCounts</span><span class="cm-operator">=</span><span class="cm-string">'/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/featureCounts'</span>;
<span class="cm-def">$featureCounts</span> <span class="cm-attribute">-T</span> <span class="cm-number">5</span> <span class="cm-attribute">-p</span> <span class="cm-attribute">-t</span> exon <span class="cm-attribute">-g</span> gene_name <span class="cm-attribute">-a</span> <span class="cm-def">$gtf</span> <span class="cm-attribute">-o</span>  counts.txt   *.bam
nohup <span class="cm-def">$featureCounts</span> <span class="cm-attribute">-T</span> <span class="cm-number">5</span> <span class="cm-attribute">-p</span> <span class="cm-attribute">-t</span> exon <span class="cm-attribute">-g</span> gene_id <span class="cm-attribute">-a</span> <span class="cm-def">$gtf</span> <span class="cm-attribute">-o</span>  counts_id.txt   *.bam &amp;</pre>
<p><span class="md-line md-end-block">This step is very fast.</span></p>
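<p><span class="md-line md-end-block">The table written by featureCounts starts with a comment line recording the command, followed by a header and six annotation columns (Geneid, Chr, Start, End, Strand, Length) ahead of the per-BAM counts. A minimal sketch of trimming it down to gene IDs plus counts is shown here; the function name <span spellcheck="false"><code>clean_featurecounts</code></span> is hypothetical.</span></p>

```shell
#!/bin/bash
# Hypothetical helper: reduce a featureCounts table to gene ID + per-sample counts.
clean_featurecounts() {
    # Drop the leading "# Program:..." comment line, then keep column 1 (Geneid)
    # and columns 7 onward (one column per BAM file).
    grep -v '^#' "$1" | cut -f 1,7-
}
```

<p><span class="md-line md-end-block">For example, <span spellcheck="false"><code>clean_featurecounts counts_id.txt &gt; counts_matrix.txt</code></span>. The column names remain the BAM file names, so they may still need renaming to match the sample names in the group file.</span></p>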
<h2 class="md-end-block md-heading">More choices for alignment</h2>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-def">$hisat</span> <span class="cm-attribute">-p</span> <span class="cm-number">5</span> <span class="cm-attribute">-x</span> <span class="cm-def">$hisat2_mm10_index</span> <span class="cm-attribute">-1</span> <span class="cm-def">$fq1</span> <span class="cm-attribute">-2</span> <span class="cm-def">$fq2</span> <span class="cm-attribute">-S</span> <span class="cm-def">$sample</span>.sam <span class="cm-number">2</span>&gt;<span class="cm-def">$sample</span>.hisat.log
samtools <span class="cm-builtin">sort</span> <span class="cm-attribute">-O</span> bam <span class="cm-attribute">-</span>@ <span class="cm-number">5</span>  <span class="cm-attribute">-o</span> <span class="cm-def">${sample}</span>_hisat.bam <span class="cm-def">$sample</span>.sam

<span class="cm-def">$subjunc</span> <span class="cm-attribute">-T</span> <span class="cm-number">5</span>  <span class="cm-attribute">-i</span> <span class="cm-def">$subjunc_mm10_index</span> <span class="cm-attribute">-r</span> <span class="cm-def">$fq1</span>  <span class="cm-attribute">-R</span> <span class="cm-def">$fq2</span> <span class="cm-attribute">-o</span> <span class="cm-def">${sample}</span>_subjunc.bam
<span class="cm-comment">## subjunc converts its SAM output to BAM automatically, but the BAM is not sorted by reference coordinate</span>

bwa mem <span class="cm-attribute">-t</span> <span class="cm-number">5</span> <span class="cm-attribute">-M</span>  <span class="cm-def">$bwa_mm10_index</span> <span class="cm-def">$fq1</span> <span class="cm-def">$fq2</span> <span class="cm-number">1</span>&gt;<span class="cm-def">$sample</span>.sam <span class="cm-number">2</span>&gt;/dev/null 
samtools <span class="cm-builtin">sort</span> <span class="cm-attribute">-O</span> bam <span class="cm-attribute">-</span>@ <span class="cm-number">5</span>  <span class="cm-attribute">-o</span> <span class="cm-def">${sample}</span>_bwa.bam <span class="cm-def">$sample</span>.sam

<span class="cm-def">$bowtie</span> <span class="cm-attribute">-p</span> <span class="cm-number">5</span> <span class="cm-attribute">-x</span> <span class="cm-def">$bowtie2_mm10_index</span> <span class="cm-attribute">-1</span> <span class="cm-def">$fq1</span>  <span class="cm-attribute">-2</span> <span class="cm-def">$fq2</span> | samtools <span class="cm-builtin">sort</span>  <span class="cm-attribute">-O</span> bam  <span class="cm-attribute">-</span>@ <span class="cm-number">5</span> <span class="cm-attribute">-o</span> <span class="cm-attribute">-</span> &gt;<span class="cm-def">${sample}</span>_bowtie.bam

<span class="cm-comment">## STAR is slow to load the reference genome (about 10 minutes) and uses a lot of memory, but the alignment itself is very fast: 5M reads take only about two minutes</span>
<span class="cm-def">$star</span> <span class="cm-attribute">--runThreadN</span>  <span class="cm-number">5</span> <span class="cm-attribute">--genomeDir</span> <span class="cm-def">$star_mm10_index</span> <span class="cm-attribute">--readFilesCommand</span> zcat <span class="cm-attribute">--readFilesIn</span>  <span class="cm-def">$fq1</span> <span class="cm-def">$fq2</span> <span class="cm-attribute">--outFileNamePrefix</span>  <span class="cm-def">${sample}</span>_star 
<span class="cm-comment">## Adding --outSAMtype BAM SortedByCoordinate makes STAR write a coordinate-sorted BAM file directly</span>
samtools <span class="cm-builtin">sort</span> <span class="cm-attribute">-O</span> bam <span class="cm-attribute">-</span>@ <span class="cm-number">5</span>  <span class="cm-attribute">-o</span> <span class="cm-def">${sample}</span>_star.bam <span class="cm-def">${sample}</span>_starAligned.out.sam</pre>
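<h2 class="md-end-block md-heading">Normalization methods for the expression matrix</h2>
<p><span class="md-line md-end-block">Fully understanding the statistics takes considerable effort. The main goal here is to learn how to run these normalization methods in R and how they compare at a basic level.</span></p>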
<ul class="ul-list" data-mark="-">
<li><span class="md-line md-end-block"><span class=""><strong>Total count (TC)</strong></span>: Gene counts are divided by the total number of mapped reads (or library size) associated with their lane and multiplied by the mean total count across all the samples of the dataset.</span></li>
<li><span class="md-line md-end-block"><span class=""><strong>Upper Quartile (UQ)</strong></span>: Very similar in principle to TC, the total counts are replaced by the upper quartile of counts different from 0 in the computation of the normalization factors.</span></li>
<li><span class="md-line md-end-block"><span class=""><strong>Median (Med)</strong></span>: Also similar to TC, the total counts are replaced by the median counts different from 0 in the computation of the normalization factors. That is, the median is calculated as the median of gene counts of all runs.</span></li>
<li><span class="md-line md-end-block"><span class=""><strong>DESeq</strong></span>: This normalization method is included in the DESeq Bioconductor package and is based on the hypothesis that most genes are not DE. The method is based on a negative binomial distribution model, with variance and mean linked by local regression, and presents an implementation that gives scale factors. Within the DESeq package, and with the <span spellcheck="false"><code>estimateSizeFactorsForMatrix</code></span> function, scaling factors can be calculated for each run. After dividing gene counts by each scaling factor, DESeq values are calculated as the total of rescaled gene counts of all runs.</span></li>
<li><span class="md-line md-end-block"><span class=""><strong>Trimmed Mean of M-values (TMM)</strong></span>: This normalization method is implemented in the edgeR Bioconductor package (Robinson et al., 2010). It is also based on the hypothesis that most genes are not DE. Scaling factors are calculated using the <span spellcheck="false"><code>calcNormFactors</code></span> function in the package, and then rescaled gene counts are obtained by dividing gene counts by each scaling factor for each run. TMM is the sum of rescaled gene counts of all runs.</span></li>
<li><span class="md-line md-end-block"><span class=""><strong>Quantile (Q)</strong></span>: First proposed in the context of microarray data, this normalization method consists in matching distributions of gene counts across lanes.</span></li>
<li><span class="md-line md-end-block"><span class=""><strong>Reads Per Kilobase per Million mapped reads (RPKM)</strong></span>: This approach was initially introduced to facilitate comparisons between genes within a sample and combines between- and within-sample normalization. This approach quantifies gene expression from RNA-Seq data by normalizing for the total transcript length and the number of sequencing reads.</span></li>
</ul>
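<p><span class="md-line md-end-block">To make the simplest of these concrete, here is a minimal sketch of Total Count (TC) normalization: each count is divided by its sample's library size and multiplied by the mean library size across samples. It assumes a tab-separated matrix with one header line and the gene ID in the first column. The function name <span spellcheck="false"><code>tc_normalize</code></span> is hypothetical, and the packaged R implementations remain the better choice for real analyses.</span></p>

```shell
#!/bin/bash
# Sketch of Total Count (TC) normalization on a tab-separated count matrix
# (one header line, then gene ID followed by one count per sample).
tc_normalize() {
    # The file is read twice: pass 1 sums each sample column (library size),
    # pass 2 rescales every count by mean(library sizes) / own library size.
    awk -F'\t' -v OFS='\t' '
        NR == FNR {
            if (FNR > 1) for (i = 2; i <= NF; i++) total[i] += $i
            ncol = NF
            next
        }
        FNR == 1 {
            for (i = 2; i <= ncol; i++) mean += total[i] / (ncol - 1)
            print                       # header is passed through unchanged
            next
        }
        {
            # Samples with a library size of zero are reported as 0.00.
            for (i = 2; i <= NF; i++)
                $i = sprintf("%.2f", (total[i] > 0 ? $i / total[i] * mean : 0))
            print
        }
    ' "$1" "$1"
}
```

<p><span class="md-line md-end-block">After this rescaling every sample column sums to the same total, which is exactly what TC is meant to achieve. UQ and Med follow the same pattern with the upper quartile or the median of non-zero counts in place of the column sum.</span></p>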
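<h2 class="md-end-block md-heading">Differential expression analysis</h2>
<p><span class="md-line md-end-block">There are again many options, and they largely follow from the normalization methods above: in general, choosing a normalization method determines which differential analysis method you use. You do not have to master the statistics here either, because each method is wrapped in its own R package; the main task is learning those packages.</span></p>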
<ul class="ul-list" data-mark="-">
<li><span class="md-line md-end-block">edgeR (Robinson et al., 2010)</span></li>
<li><span class="md-line md-end-block">DESeq / DESeq2 (Anders and Huber, 2010, 2014)</span></li>
<li><span class="md-line md-end-block">DEXSeq (Anders et al., 2012)</span></li>
<li><span class="md-line md-end-block">limma-voom</span></li>
<li><span class="md-line md-end-block">Cuffdiff / Cuffdiff2 (Trapnell et al., 2013)</span></li>
<li><span class="md-line md-end-block">PoissonSeq</span></li>
<li><span class="md-line md-end-block">baySeq</span></li>
</ul>
<p><span class="md-line md-end-block">First, extract the group information for the samples:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">tail <span class="cm-attribute">-n</span> <span class="cm-operator">+</span><span class="cm-number">2</span> E-MTAB-5130.sdrf.txt | <span class="cm-builtin">cut</span> <span class="cm-attribute">-f</span> <span class="cm-number">32</span>,36 |sort <span class="cm-attribute">-u</span></pre>
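<p><span class="md-line md-end-block">Hard-coded column numbers such as 32 and 36 break as soon as the SDRF layout changes. A possible alternative is to pick the columns by header name, as in this sketch. The function name <span spellcheck="false"><code>sdrf_columns</code></span> is hypothetical, and the header names in the usage line are placeholders; check the first line of your own SDRF file for the real ones.</span></p>

```shell
#!/bin/bash
# Hypothetical helper: print two SDRF columns selected by header name
# (instead of by hard-coded position), de-duplicated with sort -u.
sdrf_columns() {
    local sdrf=$1 col_a=$2 col_b=$3
    awk -F'\t' -v OFS='\t' -v a="$col_a" -v b="$col_b" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == a) ia = i
                if ($i == b) ib = i
            }
            # If either name is missing, complain on stderr and emit no rows.
            if (!ia || !ib) { print "column not found" > "/dev/stderr"; exit 1 }
            next
        }
        { print $ia, $ib }
    ' "$sdrf" | sort -u
}
```

<p><span class="md-line md-end-block">Usage, with placeholder header names: <span spellcheck="false"><code>sdrf_columns E-MTAB-5130.sdrf.txt 'Comment[ENA_RUN]' 'Factor Value[time]'</code></span>.</span></p>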
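<h2 class="md-end-block md-heading">Building the expression matrix</h2>
<p><span class="md-line md-end-block" contenteditable="true"><span class="">The expression matrix comes from the upstream alignment and quantification steps, but it has to be saved as a tab-delimited text file laid out as follows:</span></span></p>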
<table class="md-table" contenteditable="false">
<thead>
<tr class="md-end-block">
<th></th>
<th><span class="td-span" contenteditable="true">SRR1039508</span></th>
<th><span class="td-span" contenteditable="true">SRR1039509</span></th>
<th><span class="td-span" contenteditable="true">SRR1039512</span></th>
<th><span class="td-span" contenteditable="true">SRR1039513</span></th>
<th><span class="td-span" contenteditable="true">SRR1039516</span></th>
<th><span class="td-span" contenteditable="true">SRR1039517</span></th>
<th><span class="td-span" contenteditable="true">SRR1039520</span></th>
<th><span class="td-span" contenteditable="true">SRR1039521</span></th>
</tr>
</thead>
<tbody>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000003</span></td>
<td><span class="td-span" contenteditable="true">679</span></td>
<td><span class="td-span" contenteditable="true">448</span></td>
<td><span class="td-span" contenteditable="true">873</span></td>
<td><span class="td-span" contenteditable="true">408</span></td>
<td><span class="td-span" contenteditable="true">1138</span></td>
<td><span class="td-span" contenteditable="true">1047</span></td>
<td><span class="td-span" contenteditable="true">770</span></td>
<td><span class="td-span" contenteditable="true">572</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000005</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000419</span></td>
<td><span class="td-span" contenteditable="true">467</span></td>
<td><span class="td-span" contenteditable="true">515</span></td>
<td><span class="td-span" contenteditable="true">621</span></td>
<td><span class="td-span" contenteditable="true">365</span></td>
<td><span class="td-span" contenteditable="true">587</span></td>
<td><span class="td-span" contenteditable="true">799</span></td>
<td><span class="td-span" contenteditable="true">417</span></td>
<td><span class="td-span" contenteditable="true">508</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000457</span></td>
<td><span class="td-span" contenteditable="true">260</span></td>
<td><span class="td-span" contenteditable="true">211</span></td>
<td><span class="td-span" contenteditable="true">263</span></td>
<td><span class="td-span" contenteditable="true">164</span></td>
<td><span class="td-span" contenteditable="true">245</span></td>
<td><span class="td-span" contenteditable="true">331</span></td>
<td><span class="td-span" contenteditable="true">233</span></td>
<td><span class="td-span" contenteditable="true">229</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000460</span></td>
<td><span class="td-span" contenteditable="true">60</span></td>
<td><span class="td-span" contenteditable="true">55</span></td>
<td><span class="td-span" contenteditable="true">40</span></td>
<td><span class="td-span" contenteditable="true">35</span></td>
<td><span class="td-span" contenteditable="true">78</span></td>
<td><span class="td-span" contenteditable="true">63</span></td>
<td><span class="td-span" contenteditable="true">76</span></td>
<td><span class="td-span" contenteditable="true">60</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000938</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">2</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">1</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
<td><span class="td-span" contenteditable="true">0</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000000971</span></td>
<td><span class="td-span" contenteditable="true">3251</span></td>
<td><span class="td-span" contenteditable="true">3679</span></td>
<td><span class="td-span" contenteditable="true">6177</span></td>
<td><span class="td-span" contenteditable="true">4252</span></td>
<td><span class="td-span" contenteditable="true">6721</span></td>
<td><span class="td-span" contenteditable="true">11027</span></td>
<td><span class="td-span" contenteditable="true">5176</span></td>
<td><span class="td-span" contenteditable="true">7995</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001036</span></td>
<td><span class="td-span" contenteditable="true">1433</span></td>
<td><span class="td-span" contenteditable="true">1062</span></td>
<td><span class="td-span" contenteditable="true">1733</span></td>
<td><span class="td-span" contenteditable="true">881</span></td>
<td><span class="td-span" contenteditable="true">1424</span></td>
<td><span class="td-span" contenteditable="true">1439</span></td>
<td><span class="td-span" contenteditable="true">1359</span></td>
<td><span class="td-span" contenteditable="true">1109</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001084</span></td>
<td><span class="td-span" contenteditable="true">519</span></td>
<td><span class="td-span" contenteditable="true">380</span></td>
<td><span class="td-span" contenteditable="true">595</span></td>
<td><span class="td-span" contenteditable="true">493</span></td>
<td><span class="td-span" contenteditable="true">820</span></td>
<td><span class="td-span" contenteditable="true">714</span></td>
<td><span class="td-span" contenteditable="true">696</span></td>
<td><span class="td-span" contenteditable="true">704</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001167</span></td>
<td><span class="td-span" contenteditable="true">394</span></td>
<td><span class="td-span" contenteditable="true">236</span></td>
<td><span class="td-span" contenteditable="true">464</span></td>
<td><span class="td-span" contenteditable="true">175</span></td>
<td><span class="td-span" contenteditable="true">658</span></td>
<td><span class="td-span" contenteditable="true">584</span></td>
<td><span class="td-span" contenteditable="true">360</span></td>
<td><span class="td-span" contenteditable="true">269</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001460</span></td>
<td><span class="td-span" contenteditable="true">172</span></td>
<td><span class="td-span" contenteditable="true">168</span></td>
<td><span class="td-span" contenteditable="true">264</span></td>
<td><span class="td-span" contenteditable="true">118</span></td>
<td><span class="td-span" contenteditable="true">241</span></td>
<td><span class="td-span" contenteditable="true">210</span></td>
<td><span class="td-span" contenteditable="true">155</span></td>
<td><span class="td-span" contenteditable="true">177</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001461</span></td>
<td><span class="td-span" contenteditable="true">2112</span></td>
<td><span class="td-span" contenteditable="true">1867</span></td>
<td><span class="td-span" contenteditable="true">5137</span></td>
<td><span class="td-span" contenteditable="true">2657</span></td>
<td><span class="td-span" contenteditable="true">2735</span></td>
<td><span class="td-span" contenteditable="true">2751</span></td>
<td><span class="td-span" contenteditable="true">2467</span></td>
<td><span class="td-span" contenteditable="true">2905</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001497</span></td>
<td><span class="td-span" contenteditable="true">524</span></td>
<td><span class="td-span" contenteditable="true">488</span></td>
<td><span class="td-span" contenteditable="true">638</span></td>
<td><span class="td-span" contenteditable="true">357</span></td>
<td><span class="td-span" contenteditable="true">676</span></td>
<td><span class="td-span" contenteditable="true">806</span></td>
<td><span class="td-span" contenteditable="true">493</span></td>
<td><span class="td-span" contenteditable="true">475</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">ENSG00000001561</span></td>
<td><span class="td-span" contenteditable="true">71</span></td>
<td><span class="td-span" contenteditable="true">51</span></td>
<td><span class="td-span" contenteditable="true">211</span></td>
<td><span class="td-span" contenteditable="true">156</span></td>
<td><span class="td-span" contenteditable="true">23</span></td>
<td><span class="td-span" contenteditable="true">38</span></td>
<td><span class="td-span" contenteditable="true">134</span></td>
<td><span class="td-span" contenteditable="true">172</span></td>
</tr>
</tbody>
</table>
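<p><span class="md-line md-end-block">The first column holds the gene IDs and the remaining columns are the samples. Pay special attention to the first row: its first cell is left blank (this follows from how R's read.table function handles headers).</span></p>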
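<p><span class="md-line md-end-block">A malformed header is easy to miss by eye, so a quick check before running the analysis can help. The sketch below is an assumption about what counts as well-formed: every data row has the same number of tab-separated fields, and the header either starts with an empty field or has exactly one field fewer than the data rows (the layout R's read.table treats as having row names). The function name <span spellcheck="false"><code>check_matrix</code></span> is hypothetical.</span></p>

```shell
#!/bin/bash
# Hypothetical sanity check for a tab-separated expression matrix.
# Accepts a header with an empty first field (same field count as the data
# rows) or a header with one field fewer than the data rows.
check_matrix() {
    awk -F'\t' '
        NR == 1 { hdr = NF; lead_empty = ($1 == ""); next }
        NR == 2 { cols = NF }
        NF != cols { printf "line %d: %d fields, expected %d\n", NR, NF, cols; bad = 1 }
        END {
            if (NR < 2) { print "no data rows"; exit 1 }
            if (!((hdr == cols && lead_empty) || hdr == cols - 1)) {
                print "header does not line up with the data rows"; bad = 1
            }
            exit bad                    # 0 when nothing was flagged
        }
    ' "$1"
}
```

<p><span class="md-line md-end-block">The function prints one message per problem and returns a non-zero status if anything was flagged, so it can be chained as <span spellcheck="false"><code>check_matrix exprSet.txt &amp;&amp; echo ok</code></span>.</span></p>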
<h2 class="md-end-block md-heading">Building the group matrix</h2>
<table class="md-table" contenteditable="false">
<thead>
<tr class="md-end-block">
<th></th>
<th><span class="td-span" contenteditable="true">dex</span></th>
<th><span class="td-span" contenteditable="true">SampleName</span></th>
<th><span class="td-span" contenteditable="true">cell</span></th>
</tr>
</thead>
<tbody>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039508</span></td>
<td><span class="td-span" contenteditable="true">untrt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275862</span></td>
<td><span class="td-span" contenteditable="true">N61311</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039509</span></td>
<td><span class="td-span" contenteditable="true">trt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275863</span></td>
<td><span class="td-span" contenteditable="true">N61311</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039512</span></td>
<td><span class="td-span" contenteditable="true">untrt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275866</span></td>
<td><span class="td-span" contenteditable="true">N052611</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039513</span></td>
<td><span class="td-span" contenteditable="true">trt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275867</span></td>
<td><span class="td-span" contenteditable="true">N052611</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039516</span></td>
<td><span class="td-span" contenteditable="true">untrt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275870</span></td>
<td><span class="td-span" contenteditable="true">N080611</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039517</span></td>
<td><span class="td-span" contenteditable="true">trt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275871</span></td>
<td><span class="td-span" contenteditable="true">N080611</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039520</span></td>
<td><span class="td-span" contenteditable="true">untrt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275874</span></td>
<td><span class="td-span" contenteditable="true">N061011</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">SRR1039521</span></td>
<td><span class="td-span" contenteditable="true">trt</span></td>
<td><span class="td-span" contenteditable="true">GSM1275875</span></td>
<td><span class="td-span" contenteditable="true">N061011</span></td>
</tr>
</tbody>
</table>
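<p><span class="md-line md-end-block">Remember that the sample names must match those in the expression matrix above!</span></p>
<p><span class="md-line md-end-block">Only the first column is looked at; the others do not matter.</span></p>
<p><span class="md-line md-end-block">Based on the group information, you have to specify the comparison yourself. For the group matrix above, for example, you would pass <span spellcheck="false"><code>-c 'trt-untrt'</code></span>.</span></p>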
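<p><span class="md-line md-end-block">Since mismatched sample names are the most common way for this step to go wrong, here is a minimal sketch that compares the header of the expression matrix with the sample names in the group file. It assumes both files are tab-separated, that the group file has one header line, and that its sample names sit in the first field of each row. The function name <span spellcheck="false"><code>check_samples</code></span> is hypothetical.</span></p>

```shell
#!/bin/bash
# Hypothetical check: do the sample names in the expression matrix header
# match the first field of each group file row? Prints mismatches, if any.
check_samples() {
    awk -F'\t' '
        NR == FNR {                     # first file: the expression matrix
            if (FNR == 1)
                for (i = 1; i <= NF; i++) if ($i != "") in_expr[$i] = 1
            next
        }
        FNR == 1 { next }               # skip the group file header
        {
            if ($1 in in_expr) delete in_expr[$1]
            else { print "only in group file: " $1; bad = 1 }
        }
        END {
            for (s in in_expr) { print "only in expression matrix: " s; bad = 1 }
            exit bad                    # 0 when the two sets of names agree
        }
    ' "$1" "$2"
}
```

<p><span class="md-line md-end-block">For example, <span spellcheck="false"><code>check_samples airway.expression.txt airway.group.txt</code></span> prints nothing and returns 0 when the names agree. A matrix header that labels the gene column (say, with "gene") would be reported as an extra sample, so this sketch expects that cell to be blank.</span></p>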
<h2 class="md-end-block md-heading">Downloading the differential analysis script</h2>
<pre class="md-fences md-end-block" lang="" contenteditable="false">wget  https://raw.githubusercontent.com/jmzeng1314/my-R/master/DEG_scripts/run_DEG.R
wget  https://raw.githubusercontent.com/jmzeng1314/my-R/master/DEG_scripts/tair/exprSet.txt
wget  https://raw.githubusercontent.com/jmzeng1314/my-R/master/DEG_scripts/tair/group_info.txt
Rscript run_DEG.R -e exprSet.txt -g group_info.txt -c 'Day1-Day0' -s counts  -m DESeq2</pre>
<p><span class="md-line md-end-block">For raw counts from RNA-seq, choose -s counts. For an expression matrix that is already normalized, such as microarray data, the default parameters are fine. Some examples:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false"># Rscript run_DEG.R -e airway.expression.txt -g airway.group.txt -c 'trt-untrt' -s counts -m DESeq2
# Rscript run_DEG.R -e airway.expression.txt -g airway.group.txt -c 'trt-untrt' -s counts -m edgeR
# Rscript run_DEG.R -e sCLLex.expression.txt -g sCLLex.group.txt -c 'progres.-stable'
# Rscript run_DEG.R -e sCLLex.expression.txt -g sCLLex.group.txt -c 'progres.-stable' -m t.test</pre>
<p><span class="md-line md-end-block">For raw counts from RNA-seq, the DESeq2 and edgeR packages are available. For normalized expression matrices such as microarray data, limma and t.test are available.</span></p>
<p><span class="md-line md-end-block" contenteditable="true">Choosing which group of samples to compare against which can actually get quite complicated. See, for example: <span class="" spellcheck="false"><a href="http://genomicsclass.github.io/book/pages/expressing_design_formula.html">http://genomicsclass.github.io/book/pages/expressing_design_formula.html</a></span></span></p>
<h2 class="md-end-block md-heading"><span class="">Important scripts</span></h2>
<p><span class="md-line md-end-block">For example, <span spellcheck="false"><code>create_testData.R</code></span><span class=""> shows how to produce the expression matrix and the group matrix.</span></span></p>
<h2 class="md-end-block md-heading">Enrichment analysis</h2>
<p><span class="md-line md-end-block md-focus" contenteditable="true"><span class="md-expand">I will not cover this here; it differs somewhat from enrichment analysis of human genes.</span></span></p>
<h2 class="md-end-block md-heading">Other datasets</h2>
<p><span class="md-line md-end-block">For example, <span spellcheck="false"><a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89843">https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89843</a></span> profiled the platelet transcriptomes of 402 NSCLC patients and 377 healthy individuals. The data were analyzed as follows:</span></p>
<blockquote><p><span class="md-line md-end-block">For further downstream analyses, reads were quality-controlled using Trimmomatic, mapped to the humane reference genome using STAR, and intron-spanning reads were summarized using HTseq.</span></p></blockquote>
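<p><span class="md-line md-end-block">Reanalyzing a dataset of this size demands substantial computing resources, but the expression matrix already processed by the authors can be downloaded directly: ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE89nnn/GSE89843/suppl/GSE89843_TEP_Count_Matrix.txt.gz </span></p>
<p><span class="md-line md-end-block">Moreover, with this many samples sequenced, the downstream analysis of the expression matrix goes well beyond simple differential expression.</span></p>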
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2809.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The fastest transcriptome pipeline ever: subread</title>
		<link>http://www.bio-info-trainee.com/2775.html</link>
		<comments>http://www.bio-info-trainee.com/2775.html#comments</comments>
		<pubDate>Thu, 19 Oct 2017 14:10:29 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[转录组软件]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2775</guid>
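		<description><![CDATA[The fastest transcriptome pipeline ever: subread. Installing the software: it ships as precompiled binaries, so just download it from the official website &#8230; <a href="http://www.bio-info-trainee.com/2775.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<h2 class="md-end-block md-heading md-focus"><span class="md-expand">The fastest transcriptome pipeline ever: subread</span></h2>
<h2 class="md-end-block md-heading">Installing the software</h2>
<p><span class="md-line md-end-block">The software is distributed as precompiled binaries: download it from the official website, unpack it, and it is ready to use.</span><span id="more-2775"></span></p>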
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-builtin">cd</span> ~/biosoft
<span class="cm-comment"># http://bioinf.wehi.edu.au/featureCounts/</span>
<span class="cm-builtin">mkdir</span> featureCounts &amp;&amp;  <span class="cm-builtin">cd</span> featureCounts
<span class="cm-comment">## I originally thought this software was only for computing expression levels, which is why the folder is named featureCounts</span>
<span class="cm-builtin">wget</span> https://sourceforge.net/projects/subread/files/subread-1.5.3/subread-1.5.3-Linux-x86_64.tar.gz
tar zxvf subread-1.5.3-Linux-x86_64.tar.gz</pre>
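<h2 class="md-end-block md-heading">Building the index</h2>
<p><span class="md-line md-end-block">Every aligner uses a different algorithm, so each tool has to build its own index of the <span class=""><strong>reference genome</strong></span>. The reference genome already takes up a fair amount of disk space, and the index is even larger!</span></p>
<p><span class="md-line md-end-block">You need to download the reference genomes from UCSC yourself. I keep mine in the <span spellcheck="false"><code>~/reference/genome/</code></span> directory.</span></p>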
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
<span class="cm-def">buildindex</span><span class="cm-operator">=</span>~/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/subread-buildindex
<span class="cm-builtin">cd</span> /home/jianmingzeng/reference/index/subread/
<span class="cm-def">$buildindex</span> <span class="cm-attribute">-o</span> mm10  ~/reference/genome/mm10/mm10.fa
<span class="cm-def">$buildindex</span> <span class="cm-attribute">-o</span> hg19  ~/reference/genome/hg19/hg19.fa
<span class="cm-def">$buildindex</span> <span class="cm-attribute">-o</span> hg38  ~/reference/genome/hg38/hg38.fa</pre>
<p><span class="md-line md-end-block">The resulting index files are:</span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">
749M Sep 15 17:37 hg19.00.b.array
4.9G Sep 15 17:37 hg19.00.b.tab
5.5K Sep 15 17:33 hg19.files
   0 Sep 15 17:17 hg19.log
2.3K Sep 15 17:38 hg19.reads
766M Sep 15 18:01 hg38.00.b.array
5.0G Sep 15 18:01 hg38.00.b.tab
 29K Sep 15 17:57 hg38.files
   0 Sep 15 17:38 hg38.log
 14K Sep 15 18:01 hg38.reads
652M Sep 15 17:17 mm10.00.b.array
4.4G Sep 15 17:17 mm10.00.b.tab
3.9K Sep 15 17:13 mm10.files
   0 Sep 15 16:52 mm10.log
1.6K Sep 15 17:17 mm10.reads</pre>
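<h2 class="md-end-block md-heading">Batch alignment</h2>
<p><span class="md-line md-end-block">Once you have prepared a <span class=""><strong>configuration file</strong></span>, with the sample name and the two FASTQ paths on each line, you can run the script below.</span></p>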
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-def">subjunc</span><span class="cm-operator">=</span><span class="cm-string">"/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/subjunc"</span>; 
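<span class="cm-def">subjunc_mm10_index</span><span class="cm-operator">=</span><span class="cm-string">'/home/jianmingzeng/reference/index/subread/mm10'</span>;
<span class="cm-comment"># $config is not set in this snippet; point it at the 3-column config file described below</span>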
<span class="cm-builtin">cat</span> <span class="cm-def">$config</span> |while read id
<span class="cm-keyword">do</span>
    <span class="cm-def">arr</span><span class="cm-operator">=</span>(<span class="cm-def">$id</span>)
    <span class="cm-def">fq1</span><span class="cm-operator">=</span><span class="cm-def">${arr[1]}</span>
    <span class="cm-def">fq2</span><span class="cm-operator">=</span><span class="cm-def">${arr[2]}</span>
    <span class="cm-def">sample</span><span class="cm-operator">=</span><span class="cm-def">${arr[0]}</span>
    <span class="cm-builtin">echo</span> <span class="cm-string">"  start alignment for </span><span class="cm-def">$sample</span><span class="cm-string">"</span> <span class="cm-quote">`date`</span>
    <span class="cm-comment">#$hisat -p 5 -x $mm10_index -1 $fq1 -2 $fq2 -S $sample.sam 2&gt;$sample.hisat.log</span>
    <span class="cm-comment">#samtools sort -O bam -@ 5  -o $sample.bam $sample.sam</span>
    <span class="cm-def">$subjunc</span> <span class="cm-attribute">-T</span> <span class="cm-number">5</span>  <span class="cm-attribute">-i</span> <span class="cm-def">$subjunc_mm10_index</span> <span class="cm-attribute">-r</span> <span class="cm-def">$fq1</span>  <span class="cm-attribute">-R</span> <span class="cm-def">$fq2</span> <span class="cm-attribute">-o</span> <span class="cm-def">${sample}</span>_subjunc.bam
    <span class="cm-builtin">echo</span> <span class="cm-string">"  end alignment for </span><span class="cm-def">$sample</span><span class="cm-string">"</span> <span class="cm-quote">`date`</span>
<span class="cm-keyword">done</span></pre>
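<p><span class="md-line md-end-block">The config file has just three columns: the first is the sample name, the second is that sample's fastq1 file, and the third is its fastq2 file. Sample names must not be repeated across samples.</span></p>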
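<p>Since the loop above trusts the config file blindly, a quick sanity check before launching it may save a rerun. The sketch below is only an illustration: <code>check_config</code> is a hypothetical helper name, and it assumes the three whitespace-separated columns described above.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of subread): validate a 3-column config
# file (sample, fastq1, fastq2) before starting a batch alignment.
# Prints one message per problem and returns non-zero if any was found.
check_config() {
    awk '
        NF == 0    { next }   # ignore blank lines
        NF != 3    { printf "line %d: expected 3 columns, found %d\n", NR, NF; bad = 1 }
        seen[$1]++ { printf "line %d: duplicate sample name %s\n", NR, $1; bad = 1 }
        END        { exit bad }
    ' "$1"
}
```

<p>Something like <code>check_config config.txt</code> followed by the alignment loop would then refuse to start on a malformed file, assuming you test its exit status first.</p>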
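<p><span class="md-line md-end-block">I used to think hisat was already fast; after switching to subjunc I learned that there is no "fastest", only "faster".</span></p>
<h2 class="md-end-block md-heading">Batch quantification of expression</h2>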
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
<span class="cm-def">mm10_gtf</span><span class="cm-operator">=</span><span class="cm-string">'/home/jianmingzeng/reference/gtf/gencode/gencode.vM12.annotation.gtf'</span>;
<span class="cm-def">featureCounts</span><span class="cm-operator">=</span><span class="cm-string">'/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/featureCounts'</span>;
<span class="cm-def">$featureCounts</span> <span class="cm-attribute">-T</span> <span class="cm-number">5</span> <span class="cm-attribute">-p</span> <span class="cm-attribute">-t</span> exon <span class="cm-attribute">-g</span> gene_id <span class="cm-attribute">-a</span> <span class="cm-def">$mm10_gtf</span> <span class="cm-attribute">-o</span>  counts.txt   *.bam</pre>
<p><span class="md-line md-end-block">I really did not expect this software to be so fast: one million reads take only three to five seconds, which leaves the htseq-count I used before far behind.</span></p>
<p><span class="md-line md-end-block">Many more counting modes and parameters are available: <span class="" spellcheck="false"><a href="http://bioinf.wehi.edu.au/featureCounts/">http://bioinf.wehi.edu.au/featureCounts/</a></span></span></p>
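<p>The <code>counts.txt</code> written by the command above carries annotation columns as well as counts. As a rough sketch, assuming the usual featureCounts layout (one leading <code>#</code> comment line, then a header with Geneid, Chr, Start, End, Strand, Length, then one column per bam file), it can be cut down to a plain count matrix. <code>counts_matrix</code> is a made-up helper name.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the gene ID and the per-sample count
# columns of a featureCounts table.
# Assumes: line 1 is a "#" comment, columns 2-6 are Chr/Start/End/Strand/Length.
counts_matrix() {
    grep -v '^#' "$1" | cut -f 1,7-
}
```

<p>For example, <code>counts_matrix counts.txt &gt; counts.matrix.txt</code> would give a file that is easier to load into R.</p>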
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2775.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reflections on an RNA-seq analysis gone wrong</title>
		<link>http://www.bio-info-trainee.com/2275.html</link>
		<comments>http://www.bio-info-trainee.com/2275.html#comments</comments>
		<pubDate>Thu, 12 Jan 2017 10:51:22 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[转录组软件]]></category>
		<category><![CDATA[全半角]]></category>
		<category><![CDATA[单端双端]]></category>
		<category><![CDATA[默认参数]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2275</guid>
		<description><![CDATA[Anyone who knows me knows that RNA-seq is my specialty! But today I processed a public dataset, &#8230; <a href="http://www.bio-info-trainee.com/2275.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
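				<content:encoded><![CDATA[<div>Anyone who knows me knows that RNA-seq is my specialty!</div>
<div>But today I processed a public dataset, and the alignment rate was shockingly low!</div>
<div>Was the sequencing quality bad?</div>
<div>Is there a difference between GRCm38 and mm10?</div>
<div>Or were the aligner's default parameters to blame?</div>
<div>Read on to see how a veteran managed to crash!</div>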
<div></div>
<p><span id="more-2275"></span></p>
<div>The dataset is fairly recent, so I took it for granted that the sequencing data were fine: <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81916">https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81916</a></div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2017/01/12.png"><img class="alignnone size-full wp-image-2276" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/01/12.png" alt="1" width="303" height="97" /></a></div>
<div>I won't go over downloading the sra data and converting it to fastq!</div>
<div>Written 30468155 spots for SRR3589959.sra</div>
<div>Written 52972617 spots for SRR3589960.sra</div>
<div>Written 36763726 spots for SRR3589961.sra</div>
<div>Written 43802631 spots for SRR3589962.sra</div>
<div>I aligned with hisat2, and as usual I just used the default parameters!</div>
<div>reference=/home/jianmingzeng/reference/index/hisat/grcm38/genome</div>
<div>~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR3589959.fastq -S control_1.sam 2&gt;control_1.log</div>
<div>~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR3589960.fastq -S control_2.sam 2&gt;control_2.log</div>
<div>~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR3589961.fastq -S Akap95_1.sam 2&gt;Akap95_1.log</div>
<div>~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR3589962.fastq -S Akap95_2.sam 2&gt;Akap95_2.log</div>
<div>ls *sam |while read id;do (nohup samtools sort -n -@ 5 -o ${id%%.*}.Nsort.bam $id &amp;);done</div>
<div>To my surprise, the alignment rate was extraordinarily low:</div>
<div>0.48% overall alignment rate</div>
<div>0.62% overall alignment rate</div>
<div>0.48% overall alignment rate</div>
<div>0.49% overall alignment rate</div>
<div></div>
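<div>At first I suspected I had used the wrong reference genome, but I checked the description in GEO and the samples really are mouse ESCs, so GRCm38 should be fine!</div>
<div>Then I suspected the sequencing quality, but even poor quality would not produce an alignment rate this low.</div>
<div>So I checked with FastQC anyway:</div>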
<div><img class="alignnone size-full wp-image-2277" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/01/22.png" alt="2" width="882" height="596" /></div>
<div></div>
<div>Sure enough, the quality scores are superb!</div>
<div></div>
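<div>I also pulled out a few reads and ran them through blat: they do align, and the alignments clearly span introns. Textbook RNA-seq data!</p>
<div><strong><span style="color: #ff0000;">(Actually I did not look at this blat result carefully either: a normal read should not be split into pieces that align to both the plus and minus strands of the genome. That was already a hint that I had concatenated the paired-end reads.)</span></strong></div>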
</div>
<div><img class="alignnone size-full wp-image-2278" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/01/31.png" alt="3" width="745" height="347" /></div>
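<div>So the problem had to be in my hisat2 step. My first suspicion was that I had downloaded the wrong hisat index: I had named it grcm38, but I might have fetched the wrong file! I opened the sam file and looked at the header:</div>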
<div><img class="alignnone size-full wp-image-2279" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/01/41.png" alt="4" width="254" height="350" /></div>
<div>Those do look like mouse chromosome lengths! Very strange, and I clearly remember downloading the index for the mouse genome!</div>
<div><img class="alignnone size-full wp-image-2280" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/01/51.png" alt="5" width="629" height="133" /></div>
<div><a href="https://ccb.jhu.edu/software/hisat2/index.shtml">https://ccb.jhu.edu/software/hisat2/index.shtml</a></div>
<div><img class="alignnone size-full wp-image-2281" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/01/61.png" alt="6" width="304" height="151" /></div>
<div>Could GRCm38 and mm10 really differ?</div>
<div>I decided to test mm10 with bowtie2 first, since I did not have a hisat2 index for mm10 yet!</div>
<div>head -1000 SRR3589959.fastq &gt;tmp.fq</div>
<div>~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -x ~/reference/index/bowtie/mm10 -U tmp.fq -S tmp.sam</div>
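<div>Every read I had pulled out (head -1000 gives 1000 lines, i.e. 250 reads) failed to align: 0.00% overall alignment rate. I was dumbfounded!</div>
<div>With no other option left, I tried the hg19 reference genome:</div>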
<div>~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 6 -x ~/reference/index/bowtie/hg19 -U tmp.fq -S tmp.hg19.sam</div>
<div>Still a total wipeout, 0.00% overall alignment rate. Dumbfounded again!</div>
<div></div>
<div>
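<div>Looking back at the FastQC report, the first 10 bases really do look problematic! <strong>If you are only quantifying RNA-seq, they may need to be trimmed off, but I had never trimmed before and alignment was never affected.</strong></div>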
<div><img class="alignnone size-full wp-image-2282" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/01/71.png" alt="7" width="872" height="621" /></div>
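<div>Still, since this was the problem in front of me, I decided to try fixing it and start from there.</p>
<div>The aligner already has an option for this, so there is no need to trim the reads yourself! See the manual for details: <a href="https://ccb.jhu.edu/software/hisat2/manual.shtml">https://ccb.jhu.edu/software/hisat2/manual.shtml</a></div>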
<div></div>
<div>-5/--trim5 &lt;int&gt; trim &lt;int&gt; bases from 5'/left end of reads (0)</div>
<div>-3/--trim3 &lt;int&gt; trim &lt;int&gt; bases from 3'/right end of reads (0)</div>
<div></div>
</div>
</div>
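<div>So I added -p 6 -5 10 -3 10 --local to the bowtie2 command. Against human I got 35.60% overall alignment rate, and against mouse 98.80% overall alignment rate. Wow, <span style="color: #ff0000;"><strong>the problem had surfaced: it looked as though trimming really was needed. The all-purpose default parameters no longer worked!</strong></span></div>
<div>But there was a catch. Local mode aligned nearly everything, yet I had first cut the 100 bp reads down to 80 bp, and the alignments were all 40M40S, which means only half of each read actually aligned to the reference!</div>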
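<div>A pattern like 40M40S is easy to miss when scrolling through a sam file. The following is a minimal sketch, not something from my original run, that tallies the CIGAR strings (column 6 of the SAM format) so that a dominant soft-clipped pattern stands out. <code>cigar_tally</code> is a hypothetical helper name.</div>

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how often each CIGAR string occurs.
# Reads SAM text on stdin, skips "@" header lines, and prints
# "count<TAB>CIGAR" with the most frequent pattern first.
cigar_tally() {
    awk '!/^@/ { n[$6]++ } END { for (c in n) print n[c] "\t" c }' | sort -k1,1nr
}
```

<div>Usage would be along the lines of <code>cat tmp.sam | cigar_tally | head</code>; if nearly every read shows the same half-aligned pattern, the reads themselves deserve a second look.</div>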
<div></div>
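<div>I then tried the same parameters with hisat2, but hisat2 has no local option at all, and <span style="color: #ff0000;"><strong>trimming alone did nothing to improve the alignment. So the key was the --local option, but that was only a symptom; the real problem lay in the sequencing data itself!</strong></span></div>
<div>Why would the data be wrong?</div>
<div>I went back to the FastQC report once more and, good grief, I had overlooked such an important plot. Thinking back to the 40M40S alignments, it clicked immediately: this must be paired-end data that I had processed as single-end!</div>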
<div><img class="alignnone size-full wp-image-2283" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/01/8.png" alt="8" width="918" height="653" /></div>
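<div>When I checked the GEO record again, there it was in plain sight: PAIRED! I could not for the life of me understand it. I had clearly passed --split-3, so how could the sra-to-fastq conversion go so obviously wrong?</div>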
<div><img class="alignnone size-full wp-image-2284" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/01/9.png" alt="9" width="303" height="136" /></div>
<div>Then I checked my script. Damn it: I had copied the code from my own blog,</div>
<div><img class="alignnone size-full wp-image-2285" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/01/10.png" alt="10" width="367" height="126" /></div>
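<div><span style="color: #ff0000;"><strong>The only thing really worth looking at is this figure</strong></span></div>
<div><span style="color: #ff0000;"><strong>It is -- and not — ! Full-width versus half-width characters are a killer, and the tool does not even stop on the unrecognized option; it simply ignores my option!</strong></span></div>
<div><span style="color: #ff0000;"><strong>It is -- and not — ! Full-width versus half-width characters are a killer, and the tool does not even stop on the unrecognized option; it simply ignores my option!</strong></span></div>
<div><span style="color: #ff0000;"><strong>It is -- and not — ! Full-width versus half-width characters are a killer, and the tool does not even stop on the unrecognized option; it simply ignores my option!</strong></span></div>
<div>Worse still, I ran wget and fastq-dump together. wget prints a huge amount of log output that I could not be bothered to read, so it buried the messages from fastq-dump.</div>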
<div><img class="alignnone size-full wp-image-2286" src="http://www.bio-info-trainee.com/wp-content/uploads/2017/01/111.png" alt="11" width="667" height="182" /></div>
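<div>That is the whole story of how a veteran crashed. I hope you take it as a warning!</div>
<div>Because I had only been processing single-end data until now, this mistake had gone unnoticed.</div>
<div>I now hate the scripts on my blog, and I hate option syntax like --!</div>
<div>Here is my corrected code:</div>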
<div>cut -f 3 config.txt |while read id ; do wget $id 2&gt;/dev/null ;done</div>
<div>ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --gzip --split-3 $id;done</div>
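<div>A pasted full-width dash can be caught before it ever reaches fastq-dump. This is only a sketch of one way to do it: it lists every line of a script that contains a non-ASCII byte. <code>find_non_ascii</code> is a hypothetical helper name, and the character class relies on the C locale, where only ASCII characters count as printable.</div>

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "line-number:line" for every line of a file
# that contains a byte outside printable ASCII and whitespace, such as a
# full-width dash copied from a web page.
# -a forces text mode so that grep never reports "binary file matches".
find_non_ascii() {
    LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$1"
}
```

<div>Running <code>find_non_ascii download.sh</code> (the script name is just an example) before executing a script copied from a blog prints the offending lines, or nothing at all if the script is clean.</div>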
<div>The veteran is heartbroken: a whole day's work wasted.</div>
<div></div>
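<div>Because I had already deleted the sra files, I did not even get the chance to rerun the conversion.</div>
<div>I have to download everything again. What a pain!</div>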
<div></div>
<div></div>
<div>To sum up:</div>
<div>QC is a very important step; do not be careless about it!</div>
<div>Do not delete raw data casually; give yourself the chance to start over.</div>
<div></div>
<div></div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2275.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A hands-on RNA-seq example: super simple, done in 2 hours!</title>
		<link>http://www.bio-info-trainee.com/2218.html</link>
		<comments>http://www.bio-info-trainee.com/2218.html#comments</comments>
		<pubDate>Fri, 30 Dec 2016 08:38:33 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[转录组软件]]></category>
		<category><![CDATA[RNA-seq]]></category>
		<category><![CDATA[表达量]]></category>
		<category><![CDATA[转录组]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2218</guid>
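		<description><![CDATA[Please do not just copy my code: understand it, type it out yourself, and think about why I wrote it this way. Software &#8230; <a href="http://www.bio-info-trainee.com/2218.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div><span style="color: #ff0000;"><strong>Please do not just copy my code: understand it, type it out yourself, and think about why I wrote it this way.</strong></span></div>
<div><span style="color: #ff0000;"><strong>Please use the latest software versions, especially for tools such as samtools that I call from my PATH. Since many people read this, I usually include the version in the path of the other tools!</strong></span></div>
<div>It took me two hours, but that does not mean you will learn it in two hours. Some readers told me it took them two weeks. That is perfectly normal, so do not expect to reach my level in two hours.</div>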
<div></div>
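<div>If you only care about expression levels, transcriptome analysis really is super simple. The authors only sequenced 50 bp single-end reads (SE50) anyway, and data like that is only good for measuring expression!</div>
<div>First, here is the authors' result:</div>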
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/12/17.png"><img class="alignnone size-full wp-image-2224" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/12/17.png" alt="1" width="619" height="325" /></a></div>
<p><span id="more-2218"></span></p>
<div>The dataset's GEO record is: <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50177">https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50177</a></div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/12/25.png"><img class="alignnone size-full wp-image-2225" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/12/25.png" alt="2" width="622" height="388" /></a></div>
<div>The RNA-seq data we need to download:</div>
<div><a href="https://www.ncbi.nlm.nih.gov//sra/?term=SRP029245">https://www.ncbi.nlm.nih.gov//sra/?term=SRP029245</a></div>
<div><a href="https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP029245">https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP029245</a></div>
<div><a href="ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP029/SRP029245">ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP029/SRP029245</a></div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/12/33.png"><img class="alignnone size-full wp-image-2219" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/12/33.png" alt="3" width="690" height="79" /></a></div>
<div>The download URLs are easy to work out!</div>
<div>for ((i=677;i&lt;=680;i++)) ;do wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP029/SRP029245/SRR957$i/SRR957$i.sra;done</div>
<div>ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 $id;done</div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/12/42.png"><img class="alignnone size-full wp-image-2220" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/12/42.png" alt="4" width="339" height="160" /></a></div>
<div></div>
<div>I checked the data quality with FastQC and found nothing wrong. The code:</div>
<div>ls *fastq |xargs ~/biosoft/fastqc/FastQC/fastqc -t 10</div>
<div>So I aligned the fastq files directly to the hg19 reference genome with hisat2:</div>
<div>reference=/home/jianmingzeng/reference/index/hisat/hg19/genome</div>
<div>~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR957677.fastq -S control_1.sam 2&gt;control_1.log</div>
<div>~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR957678.fastq -S control_2.sam 2&gt;control_2.log</div>
<div>~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR957679.fastq -S siSUZ12_1.sam 2&gt;siSUZ12_1.log</div>
<div>~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR957680.fastq -S siSUZ12_2.sam 2&gt;siSUZ12_2.log</div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/12/51.png"><img class="alignnone size-full wp-image-2221" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/12/51.png" alt="5" width="229" height="64" /></a></div>
<div></div>
<div>The logs show that the alignment worked very well:</div>
<div>93.10% overall alignment rate<br />
92.44% overall alignment rate<br />
92.36% overall alignment rate<br />
93.22% overall alignment rate</div>
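<div>Rather than opening each log by hand, the rates can be pulled out in one go. A small sketch, assuming each log contains a line ending in "overall alignment rate" like those shown above; <code>alignment_rates</code> is a hypothetical helper name.</div>

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "log-file<TAB>rate" for each hisat2 log
# passed as an argument, using the "overall alignment rate" summary line.
alignment_rates() {
    local log rate
    for log in "$@"; do
        rate=$(grep 'overall alignment rate' "$log" | awk '{print $1}')
        printf '%s\t%s\n' "$log" "$rate"
    done
}
```

<div>For instance, <code>alignment_rates *.log</code> gives one line per sample, which makes an outlier easy to spot.</div>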
<div></div>
<div>Next, sort the sam files by read name and convert them to bam to save space:</div>
<div>ls *sam |while read id;do (nohup samtools sort -n -@ 5 -o ${id%%.*}.Nsort.bam $id &amp;);done</div>
<div><img class="alignnone size-full wp-image-2222" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/12/6.png" alt="6" width="271" height="75" /></div>
<div>Finally, use htseq-count to quantify gene-level expression for each sample!</div>
<div>ls *.Nsort.bam |while read id;do (nohup samtools view $id | ~/.local/bin/htseq-count -f sam -s no -i gene_name - ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf 1&gt;${id%%.*}.geneCounts 2&gt;${id%%.*}.HTseq.log&amp;);done</div>
<div>The resulting files are:</div>
<div></div>
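<div>The gene counts for these 4 samples can then go into any of the usual R packages for differential expression analysis, including limma's voom, DESeq2 and edgeR. Tutorials for these packages are everywhere, so I will not repeat them.</div>
<div>Once the differential analysis is done, you can compare it with the authors' result to see whether yours is right.</div>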
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/12/7.png"><img class="alignnone size-full wp-image-2223" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/12/7.png" alt="7" width="930" height="615" /></a></div>
<div></div>
<div></div>
<div></div>
<div></div>
<div></div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2218.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>hisat2+stringtie+ballgown</title>
		<link>http://www.bio-info-trainee.com/2073.html</link>
		<comments>http://www.bio-info-trainee.com/2073.html#comments</comments>
		<pubDate>Fri, 25 Nov 2016 15:06:23 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[转录组软件]]></category>
		<category><![CDATA[ballgown]]></category>
		<category><![CDATA[hisat2]]></category>
		<category><![CDATA[StringTie]]></category>
		<category><![CDATA[转录组]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2073</guid>
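		<description><![CDATA[Back in September last year I wrote a post saying that the RNA-seq workflow needed to evolve! http://ww &#8230; <a href="http://www.bio-info-trainee.com/2073.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Back in September last year I wrote a post saying that the RNA-seq workflow needed to evolve (<a href="http://www.bio-info-trainee.com/1022.html" target="_blank"> http://www.bio-info-trainee.com/1022.html </a>), mainly into the hisat2+stringtie+ballgown workflow. I never covered that workflow systematically, because I honestly do not find it useful; I only use hisat from it, for alignment! But people in the chat group keep asking, so I will reluctantly write a tutorial. You can install the software by copying my code directly, then find a test dataset yourself; my scripts are easy to use!<span id="more-2073"></span></p>
<div>I actually love papers like this one: <a href="http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html">http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html</a> The authors even provide all of the code, so I do not see why people still have questions: <a href="http://www.nature.com/nprot/journal/v11/n9/extref/nprot.2016.095-S1.zip" target="_blank">http://www.nature.com/nprot/journal/v11/n9/extref/nprot.2016.095-S1.zip</a></div>
<div>They have already laid out the workflow very clearly, so I will just share my own experience:</div>
<div>Software installation:</div>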
<blockquote>
<div>## Download and install HISAT</div>
<div># https://ccb.jhu.edu/software/hisat2/index.shtml</div>
<div>cd ~/biosoft</div>
<div>mkdir HISAT &amp;&amp; cd HISAT</div>
<div>#### readme: https://ccb.jhu.edu/software/hisat2/manual.shtml</div>
<div>wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.0.4-Linux_x86_64.zip</div>
<div>unzip hisat2-2.0.4-Linux_x86_64.zip</div>
<div>ln -s hisat2-2.0.4 current</div>
<div>## ~/biosoft/HISAT/current/hisat2-build</div>
<div>## ~/biosoft/HISAT/current/hisat2</div>
<div></div>
<div>## Download and install StringTie</div>
<div>## https://ccb.jhu.edu/software/stringtie/ ## https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual</div>
<div>cd ~/biosoft</div>
<div>mkdir StringTie &amp;&amp; cd StringTie</div>
<div>wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-1.2.3.Linux_x86_64.tar.gz</div>
<div>tar zxvf stringtie-1.2.3.Linux_x86_64.tar.gz</div>
<div>ln -s stringtie-1.2.3.Linux_x86_64 current</div>
<div># ~/biosoft/StringTie/current/stringtie</div>
</blockquote>
<div></div>
<div>For running the software I prefer shell scripts, and simple ones at that:</div>
<div>
<blockquote>
<div>while read id</div>
<div>do</div>
<div>sample=$(echo $id |cut -d" " -f 1 )</div>
<div>file1=$(echo $id |cut -d" " -f 2 )</div>
<div>file2=$(echo $id |cut -d" " -f 3 )</div>
<div>echo  $sample</div>
<div>echo $file1</div>
<div>echo $file2</div>
<div>~/biosoft/HISAT/current/hisat2  -p 4 --dta  -x  ~/reference/index/hisat/hg19/genome  -1 $file1 -2 $file2 -S $sample.hisat2.hg19.sam 2&gt;$sample.hisat2.hg19.log &amp;</div>
<div>done &lt;$1</div>
</blockquote>
<div>The script above takes a 3-column input file (sample name, read1 file, read2 file) and produces the following sam files.</div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/11/16.png"><img class="alignnone size-full wp-image-2074" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/11/16.png" alt="1" width="298" height="63" /></a></div>
<blockquote>
<div>while read id</div>
<div>do</div>
<div>file=$(basename $id )</div>
<div>sample=${file%%.*}</div>
<div>echo $id $sample</div>
<div>nohup samtools sort -@ 4 -o ${sample}.sorted.bam $id &amp;</div>
<div>done &lt;$1</div>
</blockquote>
<div><span style="color: #ff0000;">Recent versions of samtools can turn a sam file directly into a sorted bam file!</span></div>
<div><img class="alignnone size-full wp-image-2075" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/11/23.png" alt="2" width="266" height="65" /></div>
<blockquote>
<div>while read id</div>
<div>do</div>
<div>file=$(basename $id )</div>
<div>sample=${file%%.*}</div>
<div>echo $id $sample</div>
<div>nohup ~/biosoft/StringTie/current/stringtie  -p 4  -G ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf  -o $sample.hg19.stringtie.gtf -l $sample  $id  &amp;</div>
<div>done &lt;$1</div>
</blockquote>
<div>That is all there is to using StringTie; there is not much to explain.</div>
<div><img class="alignnone size-full wp-image-2076" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/11/31.png" alt="3" width="318" height="82" /></div>
<div></div>
<div> ~/biosoft/StringTie/current/stringtie   --merge -p 8 -G ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf  -o stringtie_merged.gtf  mergelist.txt</div>
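<div>The --merge step above reads mergelist.txt, a plain-text file listing the per-sample GTF files, one path per line. A minimal sketch of how it could be built, assuming the *.hg19.stringtie.gtf naming used in the per-sample loop; <code>make_mergelist</code> is a hypothetical helper name.</div>

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the paths of all per-sample StringTie GTFs
# found in directory $1 to the list file $2, one path per line.
# The *.hg19.stringtie.gtf pattern follows the -o naming used earlier.
make_mergelist() {
    ls -1 "$1"/*.hg19.stringtie.gtf > "$2"
}
```

<div>Calling <code>make_mergelist . mergelist.txt</code> in the working directory before the --merge command would produce the list.</div>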
<div></div>
<div></div>
<div>while read id</div>
<div>do</div>
<div>file=$(basename $id )</div>
<div>sample=${file%%.*}</div>
<div>echo $id $sample</div>
<div>nohup ~/biosoft/StringTie/current/stringtie -e -B  -G  $2  -o ballgown/$sample/$sample.hg19.stringtie.gtf   $id  &amp;</div>
<div>done &lt;$1</div>
</div>
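<div>I really cannot go on, because I honestly do not use this stuff. <strong><span style="color: #ff0000;">Once I have the sam/bam files I go straight to building the count matrix</span></strong>, and counting reads is very easy. The code:</div>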
<div>nohup samtools view   A.sorted.bam.Nsort.bam |  ~/.local/bin/htseq-count -f sam  -s no -i gene_name  -   ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf    1&gt;A.geneCounts 2&gt;A.HTseq.log &amp;</div>
<div>As for the files below, import them into R and process them with ballgown. Please stop asking me about this.</div>
<div><img class="alignnone size-full wp-image-2077" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/11/4.png" alt="4" width="608" height="548" /></div>
<div></div>
<div></div>
<div></div>
<div></div>
<div></div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2073.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The difference between htseq-count and bedtools</title>
		<link>http://www.bio-info-trainee.com/2022.html</link>
		<comments>http://www.bio-info-trainee.com/2022.html#comments</comments>
		<pubDate>Tue, 15 Nov 2016 03:55:21 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[转录组软件]]></category>
		<category><![CDATA[bedtools]]></category>
		<category><![CDATA[htseq-counts]]></category>
		<category><![CDATA[转录组]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2022</guid>
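		<description><![CDATA[I have written tutorials for both bedtools and htseq-count before; both can be used on aligned &#8230; <a href="http://www.bio-info-trainee.com/2022.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I have written tutorials for both bedtools and htseq-count before; both can count reads from aligned bam files. Someone in the chat group happened to ask how they differ, so I did a quick comparison. You may want to read my earlier tutorials first, although they are a bit rough:</p>
<p><a title="Read more: Counting genes in RNA-seq with Bedtools" href="http://www.bio-info-trainee.com/745.html" rel="bookmark">Counting genes in RNA-seq with Bedtools</a>,</p>
<p><a title="Read more: Counting gene expression with HTSeq" href="http://www.bio-info-trainee.com/244.html" rel="bookmark">Counting gene expression with HTSeq</a></p>
<p>Back to the point. I do not have the energy to dig into how each tool works internally; I only want to see how they differ when deciding whether a read belongs to a gene. See the figure below:<span id="more-2022"></span></p>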
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/11/bedtoos-vs-htseq.png"><img class="alignnone size-full wp-image-2023" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/11/bedtoos-vs-htseq.png" alt="bedtoos-vs-htseq" width="707" height="485" /></a></p>
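<div>Clearly, bedtools counts indiscriminately: as long as a read's alignment coordinates overlap the target gene's coordinates, it counts as one read, whether or not the read is multi-mapped.</div>
<div>htseq-count is much more cautious and also lets you choose the overlap mode. In general it puts multi-mapped reads into the alignment_not_unique category.</div>
<div>After any analysis, check the results again and again. If hisat reports a mapping rate above 90%, then even after setting aside the roughly 15% of multi-mapped reads, you should be able to assign at least 50 to 60 percent of the reads when counting!</div>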
<div></div>
<div>If no_feature turns out to be orders of magnitude too large, you should realize that something is wrong!</div>
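<div>To put a number on this, the special counters at the end of an htseq-count output file (the lines starting with <code>__</code>, such as <code>__no_feature</code> and <code>__ambiguous</code>) can be compared with the reads assigned to genes. This is only a sketch; <code>htseq_summary</code> is a hypothetical helper name.</div>

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise an htseq-count output file.
# Lines whose first field starts with "__" are the special counters;
# every other line is a per-gene count and is summed as "assigned".
htseq_summary() {
    awk '
        /^__/ { special[$1] = $2; total += $2; next }
              { assigned += $2; total += $2 }
        END {
            if (total == 0) { print "no reads counted"; exit }
            printf "assigned\t%d\t%.1f%%\n", assigned, 100 * assigned / total
            for (k in special)
                printf "%s\t%d\t%.1f%%\n", k, special[k], 100 * special[k] / total
        }
    ' "$1"
}
```

<div>Running <code>htseq_summary control.geneCounts</code> shows at a glance whether the assigned fraction is in the range you would expect from the mapping rate.</div>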
<div></div>
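<div>Finally, a few htseq-count options need particular attention:</div>
<div>Official documentation: <a href="http://www-huber.embl.de/HTSeq/doc/count.html">http://www-huber.embl.de/HTSeq/doc/count.html</a></div>
<div>The reference GTF can come from GENCODE or Ensembl, but pay special attention to the chr prefix on chromosome names and to the annotation version; whether the file is GTF or GFF does not matter. The alignment file must be sorted, and I strongly recommend sort -n, that is, sorting by read name.</div>
<div>-f sam/bam: be clear about this one. To count directly from a bam file, the Python on your server must have a working pysam module installed.</div>
<div>-r name/pos: our bam files are usually sorted by reference position, but this tool defaults to read-name order, which is a trap. I generally recommend re-sorting the bam file rather than choosing -r pos, because -r pos uses far too much memory.</div>
<div>-s yes/no/reverse: another big trap. The default is yes, but most people's data are unstranded (no), so be careful!</div>
<div>-t selects the feature type in column 3 of the GFF/GTF file. It is usually exon, but can also be gene or transcript; this rarely needs changing.</div>
<div>-i needs changing; otherwise the default is gene_id (the Ensembl gene ID), which most people cannot read. You can switch to gene_name, provided your GFF/GTF file has a gene_name attribute.</div>
<div>Nothing else needs changing.</div>
<div>My code:</div>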
<blockquote>
<div>nohup samtools view control.Nsort.bam | ~/.local/bin/htseq-count -f sam -s no -i gene_name - ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf 1&gt;control.geneCounts 2&gt;control.HTseq.log &amp;</div>
<div>nohup samtools view G34V.Nsort.bam | ~/.local/bin/htseq-count -f sam -s no -i gene_name - ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf 1&gt;G34V.geneCounts 2&gt;G34V.HTseq.log &amp;</div>
<div>nohup samtools view K27M.Nsort.bam | ~/.local/bin/htseq-count -f sam -s no -i gene_name - ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf 1&gt;K27M.geneCounts 2&gt;K27M.HTseq.log &amp;</div>
<div>nohup samtools view WT.Nsort.bam | ~/.local/bin/htseq-count -f sam -s no -i gene_name - ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf 1&gt;WT.geneCounts 2&gt;WT.HTseq.log &amp;</div>
<div></div>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2022.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A complete RNA-seq bioinformatics workflow, part 1: downloading data from the literature</title>
		<link>http://www.bio-info-trainee.com/1876.html</link>
		<comments>http://www.bio-info-trainee.com/1876.html#comments</comments>
		<pubDate>Tue, 09 Aug 2016 12:34:14 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[tutorial]]></category>
		<category><![CDATA[转录组软件]]></category>
		<category><![CDATA[--split-3]]></category>
		<category><![CDATA[airway]]></category>
		<category><![CDATA[fastq-dump]]></category>
		<category><![CDATA[SRA]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1876</guid>
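		<description><![CDATA[Here I use the airway dataset, the most commonly used one in Bioconductor, because differential expression &#8230; <a href="http://www.bio-info-trainee.com/1876.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Here I use the airway dataset, the most commonly used one in Bioconductor. Differential expression analysis is a major focus of Bioconductor, and its packages use this dataset when introducing their algorithms and giving demonstrations. The description is in GEO: <a href="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778">http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778</a>  The record shows Illumina HiSeq 2000 (Homo sapiens), 75bp paired-end. This information matters, because it determines how to unpack the sra files after download and how to align them. You can also see that the authors uploaded all of the raw sequencing data to SRA: <a href="http://www.ncbi.nlm.nih.gov/sra?term=SRP033351 ">http://www.ncbi.nlm.nih.gov/sra?term=SRP033351 </a>. On a Linux server you can write a simple script to download all of the sequencing data in bulk, and then rename the raw data according to the metadata described in GEO.</p>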
<blockquote><p>for ((i=508;i&lt;=523;i++)) ;do wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP033/SRP033351/<span style="color: #ff0000;"><strong>SRR1039$i/SRR1039$i.sra;done</strong></span><br />
ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 $id;done</p></blockquote>
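<p>You need to look at the data records in SRA yourself; after that the script above is not hard to write. Because this is Illumina paired-end sequencing, we use fastq-dump --split-3 to convert the sra files to fastq. Since there are 16 runs, it is best to rename them at the same time; I used a script to batch-generate the renaming commands, as follows:</p>
<p>To save space I used --gzip compression, and I changed the output file names with the -A option.</p>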
<blockquote><p>nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/<strong><span style="color: #ff0000;">fastq-dump --split-3 --gzip -A N61311_untreated</span></strong> SRR1039508.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N61311_Dex SRR1039509.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N61311_Alb SRR1039510.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N61311_Alb_Dex SRR1039511.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N052611_untreated SRR1039512.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N052611_Dex SRR1039513.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N052611_Alb SRR1039514.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N052611_Alb_Dex SRR1039515.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N080611_untreated SRR1039516.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N080611_Dex SRR1039517.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N080611_Alb SRR1039518.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N080611_Alb_Dex SRR1039519.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N061011_untreated SRR1039520.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N061011_Dex SRR1039521.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N061011_Alb SRR1039522.sra &amp;<br />
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 --gzip -A N061011_Alb_Dex SRR1039523.sra &amp;</p></blockquote>
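<p>Commands like the ones above do not have to be typed by hand. The sketch below is one possible way to generate them, not the script I actually used: it reads a two-column table (run accession, new sample name) and prints one fastq-dump command per row. <code>make_dump_commands</code> and the <code>FASTQ_DUMP</code> variable are hypothetical names, and the table itself is something you would prepare from the GEO metadata.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one fastq-dump command per row of a
# two-column table (run accession, new sample name).
# FASTQ_DUMP is an assumed variable; it falls back to "fastq-dump" on PATH.
make_dump_commands() {
    local run name
    while read -r run name; do
        [ -n "$run" ] || continue   # skip blank lines
        printf '%s --split-3 --gzip -A %s %s.sra\n' \
            "${FASTQ_DUMP:-fastq-dump}" "$name" "$run"
    done < "$1"
}
```

<p>With a table containing rows such as <code>SRR1039508 N61311_untreated</code>, <code>make_dump_commands runs.txt &gt; dump.sh</code> writes the commands to a file that can be reviewed before it is run.</p>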
<p>As you can see, the 16 samples come from the same 4 donors and are all HASM cells. The treatments are as follows:</p>
<div>Sequencing background:</div>
<div>HASM cells: human airway smooth muscle,</div>
<div>The Illumina TruSeq assay was used to prepare 75bp paired-end libraries for HASM cells from <b><span style="color: #ff0000;">four white male donors</span></b> under four treatment conditions:</div>
<blockquote>
<div>1) no treatment;</div>
<div>2) treatment with a β2-agonist (i.e. Albuterol, 1μM for 18h);</div>
<div>3) treatment with a glucocorticosteroid (i.e. Dexamethasone (Dex), 1μM for 18h);</div>
<div>4) simultaneous treatment with a β2-agonist and glucocorticoid</div>
</blockquote>
<div>and the libraries were sequenced with an Illumina Hi-Seq 2000 instrument.</div>
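<div>The design described above is 4 donors by 4 treatment conditions, which gives the 16 samples. As a minimal sketch, the sample names can be enumerated as below; the donor IDs and treatment labels are the ones used in the sample names of this post, while the function name <code>list_samples</code> is only an illustrative assumption.</div>

```shell
#!/usr/bin/env bash
# Enumerate the 4 donors x 4 treatments design as 16 sample names.
# Donor IDs and treatment labels are copied from the sample names in this post.
donors="N61311 N052611 N080611 N061011"
treatments="untreated Dex Alb Alb_Dex"

list_samples() {
    local d t
    for d in $donors; do
        for t in $treatments; do
            printf '%s_%s\n' "$d" "$t"
        done
    done
}

list_samples
```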
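<div>Here we only align the fastq data to the reference genome and then compute the expression level of each sample. The downstream grouping and differential expression analysis has to be tailored to the question at hand.</div>
<p>The sizes of the downloaded sra files are as follows:</p>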
<blockquote><p>-rw-rw-r-- 1 jmzeng jmzeng 1.6G Aug 9 04:21 SRR1039508.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.5G Aug 9 05:20 SRR1039509.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.6G Aug 9 06:14 SRR1039510.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.5G Aug 9 07:05 SRR1039511.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 2.1G Aug 9 08:07 SRR1039512.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 2.3G Aug 9 09:17 SRR1039513.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 3.1G Aug 9 10:56 SRR1039514.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.9G Aug 9 11:56 SRR1039515.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 2.1G Aug 9 13:02 SRR1039516.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 2.6G Aug 9 14:16 SRR1039517.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 2.3G Aug 9 15:17 SRR1039518.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 2.0G Aug 9 16:05 SRR1039519.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 2.1G Aug 9 16:56 SRR1039520.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 2.4G Aug 9 17:57 SRR1039521.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 2.0G Aug 9 18:46 SRR1039522.sra<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.4G Aug 9 19:28 SRR1039523.sra</p></blockquote>
<p>After conversion with fastq-dump, the paired-end fastq files are as follows:</p>
<blockquote><p> -rw-rw-r-- 1 jmzeng jmzeng 2.5G Aug 9 20:12 N052611_Alb_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 2.5G Aug 9 20:12 N052611_Alb_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.3G Aug 9 20:44 N052611_Alb_Dex_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.3G Aug 9 20:44 N052611_Alb_Dex_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 289M Aug 9 20:44 N052611_Alb_Dex.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 951M Aug 9 20:59 N052611_Dex_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 954M Aug 9 20:59 N052611_Dex_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.7G Aug 9 20:53 N052611_untreated_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.7G Aug 9 20:53 N052611_untreated_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.5G Aug 9 20:45 N061011_Alb_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.5G Aug 9 20:45 N061011_Alb_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.9G Aug 9 20:59 N061011_Alb_Dex_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.9G Aug 9 20:59 N061011_Alb_Dex_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 16M Aug 9 20:45 N061011_Alb.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.4G Aug 9 20:48 N061011_Dex_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.4G Aug 9 20:48 N061011_Dex_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.2G Aug 9 20:00 N061011_untreated_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.2G Aug 9 20:00 N061011_untreated_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 759M Aug 9 20:00 N061011_untreated.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.9G Aug 9 20:03 N080611_Alb_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.9G Aug 9 20:03 N080611_Alb_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.3G Aug 9 19:59 N080611_Alb_Dex_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.3G Aug 9 19:59 N080611_Alb_Dex_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 535M Aug 9 19:59 N080611_Alb_Dex.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 2.1G Aug 9 20:06 N080611_Dex_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 2.1G Aug 9 20:06 N080611_Dex_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.6G Aug 9 20:01 N080611_untreated_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.6G Aug 9 20:01 N080611_untreated_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.3G Aug 9 08:09 N61311_Alb_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.3G Aug 9 08:09 N61311_Alb_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.3G Aug 9 08:08 N61311_Alb_Dex_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.3G Aug 9 08:08 N61311_Alb_Dex_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.2G Aug 9 08:07 N61311_Dex_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.2G Aug 9 08:07 N61311_Dex_2.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.3G Aug 9 08:09 N61311_untreated_1.fastq.gz<br />
-rw-rw-r-- 1 jmzeng jmzeng 1.3G Aug 9 08:09 N61311_untreated_2.fastq.gz</p></blockquote>
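<p>The listing above contains a <code>_1</code>/<code>_2</code> pair for each of the 16 samples, plus four files without a suffix (for example <code>N052611_Alb_Dex.fastq.gz</code>). With <code>--split-3</code>, fastq-dump writes reads that have no mate into such a separate file. The following is a minimal sketch for checking a set of file names; the function name <code>classify_fastq</code> and its output labels are assumptions made for illustration.</p>

```shell
#!/usr/bin/env bash
# Classify fastq-dump --split-3 outputs read from stdin (one file name per line).
# Prints "paired SAMPLE" when both mates exist, "missing_mate SAMPLE" when only
# one of _1/_2 is present, and "unpaired_reads FILE" for a file without a
# _1/_2 suffix, which holds the reads that had no mate.
classify_fastq() {
    awk '
        /_1\.fastq\.gz$/ { s = $0; sub(/_1\.fastq\.gz$/, "", s); r1[s] = 1; seen[s] = 1; next }
        /_2\.fastq\.gz$/ { s = $0; sub(/_2\.fastq\.gz$/, "", s); r2[s] = 1; seen[s] = 1; next }
        /\.fastq\.gz$/   { print "unpaired_reads " $0; next }
        END {
            for (s in seen) {
                if (r1[s] && r2[s]) print "paired " s
                else                print "missing_mate " s
            }
        }
    ' | sort
}

# Example with three file names taken from the listing above.
printf '%s\n' \
    N052611_Alb_Dex_1.fastq.gz \
    N052611_Alb_Dex_2.fastq.gz \
    N052611_Alb_Dex.fastq.gz | classify_fastq
```

<p>Only the <code>paired</code> samples are ready for paired-end alignment; a <code>missing_mate</code> line would suggest an incomplete conversion that is worth rerunning.</p>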
<p>All subsequent analyses are based on this data.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1876.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
