<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 &#187; Genomics</title>
	<atom:link href="http://www.bio-info-trainee.com/category/omics/genomics/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
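	<description>Feel free to leave comments and join the discussion on the forum biotrainee.com, or follow the WeChat official account of the same name, biotrainee</description>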
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>Exploring the assembly of unmapped reads from genome resequencing</title>
		<link>http://www.bio-info-trainee.com/2523.html</link>
		<comments>http://www.bio-info-trainee.com/2523.html#comments</comments>
		<pubDate>Sat, 02 Sep 2017 12:16:55 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Genomics]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2523</guid>
		<description><![CDATA[Exploring the assembly of unmapped reads from genome resequencing. This post mainly follows &#8230; <a href="http://www.bio-info-trainee.com/2523.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<h2 class="md-end-block md-heading md-focus" contenteditable="true"><span class="md-expand">Exploring the assembly of unmapped reads from genome resequencing</span></h2>
<p><span class="md-line md-end-block">This post mainly follows Figure 4 of this paper: <span spellcheck="false"><a href="http://www.nature.com/ng/journal/v42/n11/fig_tab/ng.691_F4.html">http://www.nature.com/ng/journal/v42/n11/fig_tab/ng.691_F4.html</a></span> </span><span id="more-2523"></span></p>
<p><span class="md-line md-end-block" contenteditable="true"><span class="md-image md-img-loaded" contenteditable="false" data-src="http://www.nature.com/ng/journal/v42/n11/images/ng.691-F4.jpg"><img src="http://www.nature.com/ng/journal/v42/n11/images/ng.691-F4.jpg" alt="" /></span></span></p>
<p><span class="md-line md-end-block" contenteditable="true">This is the paper <span class=""><a spellcheck="false" href="http://www.nature.com/ng/journal/v42/n11/full/ng.691.html">Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing</a></span><span class="">, published in Nature Genetics in 2010. The paper assembled the reads with three programs (SOAPdenovo, ABySS, and Velvet), but it dates from 2010 and better options are available now, for example Minia.</span></span></p>
<h2 class="md-end-block md-heading">Assembling with Minia</h2>
<p><span class="md-line md-end-block" contenteditable="true"><span class="">Minia is also a short-read assembler built on the de Bruijn graph, and it is notably more memory-efficient than the older ABySS and SOAPdenovo, so I chose it here.</span></span></p>
<h3 class="md-end-block md-heading">Downloading and installing Minia</h3>
<p><span class="md-line md-end-block">Following the instructions on the official site, just download the binary release. The commands are:</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-comment">## Download and install Minia</span>
<span class="cm-comment"># http://minia.genouest.org/</span>
<span class="cm-builtin">cd</span> ~/biosoft
<span class="cm-builtin">mkdir</span> Minia &amp;&amp;  <span class="cm-builtin">cd</span> Minia
<span class="cm-builtin">wget</span> https://github.com/GATB/minia/releases/download/v2.0.7/minia-v2.0.7-bin-Linux.tar.gz 
tar <span class="cm-attribute">-zxvf</span> minia-v2.0.7-bin-Linux.tar.gz 
~/biosoft/Minia/minia-v2.0.7-bin-Linux/bin/minia <span class="cm-attribute">--help</span> 
<span class="cm-comment">## eg: ./minia -in reads.fa -kmer-size 31 -abundance-min 3 -out output_prefix </span></pre>
<p><span class="md-line md-end-block">Using the software is very simple, a single command. The best <span spellcheck="false"><code>-kmer-size</code></span> should be determined with <span class=""><a spellcheck="false" href="http://kmergenie.bx.psu.edu/">KmerGenie</a></span>.</span></p>
<h3 class="md-end-block md-heading">Usage</h3>
<h3 class="md-end-block md-heading">Step 1: Extract the reads that failed to map</h3>
<pre class="md-fences md-end-block" lang="Shell" contenteditable="false">
samtools view <span class="cm-attribute">-f4</span> jmzeng_recal.bam |perl <span class="cm-attribute">-alne</span> <span class="cm-string">'{print "\@$F[0]\n$F[9]\n+\n$F[10]" }'</span> &gt;unmapped.fq
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-lite.pl <span class="cm-attribute">-verbose</span> <span class="cm-attribute">-fastq</span> unmapped.fq <span class="cm-attribute">-graph_data</span> unmapped.gd <span class="cm-attribute">-out_good</span> null <span class="cm-attribute">-out_bad</span> null
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-graphs.pl <span class="cm-attribute">-i</span> unmapped.gd <span class="cm-attribute">-png_all</span> <span class="cm-attribute">-o</span> unmapped
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-graphs.pl <span class="cm-attribute">-i</span> unmapped.gd <span class="cm-attribute">-html_all</span> <span class="cm-attribute">-o</span> unmapped
<span class="cm-builtin">cd</span> ~/data/project/myGenome/gatk/jmzeng/unmapped</pre>
<p><span class="md-line md-end-block">The FASTQ file has 31,481,084 lines, so 31481084/4 = 7,870,271 reads, only about 7.8M reads.</span></p>
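<p><span class="md-line md-end-block">The Perl one-liner above prints four lines per alignment record, which is why the read count is the line count divided by four. A minimal Python sketch of the same conversion, assuming tab-separated SAM records; the function names and the example record are made up for illustration:</span></p>
<pre>
```python
def sam_record_to_fastq(sam_line):
    """Turn one SAM alignment line into a four-line FASTQ record."""
    fields = sam_line.rstrip("\n").split("\t")
    # SAM columns (0-based): 0 = read name, 9 = sequence, 10 = base qualities
    name, seq, qual = fields[0], fields[9], fields[10]
    return "@{}\n{}\n+\n{}\n".format(name, seq, qual)


def count_fastq_reads(n_lines):
    """Each FASTQ record spans exactly four lines."""
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    return n_lines // 4


# Hypothetical unmapped record (flag 4, no reference position)
example = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"
print(sam_record_to_fastq(example), end="")
print(count_fastq_reads(31481084))  # line count reported in this post
```
</pre>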
<h3 class="md-end-block md-heading">Step 2: Choose the k-mer size with KmerGenie</h3>
<p><span class="md-line md-end-block">KmerGenie estimates the best k-mer length for genome de novo assembly.</span></p>
<p><span class="md-line md-end-block"><span class="">KmerGenie predictions can be applied to single-k genome assemblers (e.g. Velvet, SOAPdenovo 2, ABySS, Minia).</span></span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false">
<span class="cm-comment">## http://kmergenie.bx.psu.edu/</span>
<span class="cm-builtin">cd</span> ~/biosoft
<span class="cm-builtin">mkdir</span> KmerGenie &amp;&amp;  <span class="cm-builtin">cd</span> KmerGenie
<span class="cm-builtin">wget</span> http://kmergenie.bx.psu.edu/kmergenie-1.7044.tar.gz
tar zxvf kmergenie-1.7044.tar.gz
<span class="cm-builtin">cd</span> kmergenie-1.7044
<span class="cm-builtin">make</span> 
python setup.py install <span class="cm-attribute">--user</span>
~/.local/bin/kmergenie <span class="cm-attribute">--help</span> 
<span class="cm-builtin">cd</span> ~/data/project/myGenome/gatk/jmzeng/unmapped
~/.local/bin/kmergenie unmapped.fq</pre>
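<p><span class="md-line md-end-block">KmerGenie works from k-mer abundance histograms computed for several values of k. The sketch below only shows what such a histogram is; it is not KmerGenie's model, and the reads and function names are made up:</span></p>
<pre>
```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map abundance to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


reads = ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT"]  # made-up toy reads
for k in (3, 4, 5):
    histogram = abundance_histogram(kmer_counts(reads, k))
    print(k, sorted(histogram.items()))
```
</pre>
<p><span class="md-line md-end-block">In real data, k-mers seen only once or twice are mostly sequencing errors, which is also the idea behind the <code>-abundance-min</code> cutoff passed to Minia.</span></p>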
<h3 class="md-end-block md-heading"><span class="">Step 3: Run Minia</span></h3>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-builtin">cd</span> ~/data/project/myGenome/gatk/jmzeng/unmapped
~/biosoft/Minia/minia-v2.0.7-bin-Linux/bin/minia  <span class="cm-attribute">-in</span> unmapped.fq <span class="cm-attribute">-kmer-size</span> <span class="cm-number">31</span> <span class="cm-attribute">-abundance-min</span> <span class="cm-number">3</span> <span class="cm-attribute">-out</span> output_prefix</pre>
<p><span class="md-line md-end-block" contenteditable="true"><span class="">Assembling the 7.8M reads yields 272,007 contigs.</span></span></p>
<h2 class="md-end-block md-heading">After assembly</h2>
<p><span class="md-line md-end-block">Prinseq v0.20.4 was used to calculate assembly statistics, including N50 contig size, GC content</span></p>
<pre class="md-fences md-end-block" lang="shell" contenteditable="false"><span class="cm-builtin">cd</span> ~/data/project/myGenome/gatk/jmzeng/unmapped
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-lite.pl <span class="cm-attribute">-verbose</span> <span class="cm-attribute">-fasta</span> output_prefix.contigs.fa  <span class="cm-attribute">-graph_data</span> contigs.gd <span class="cm-attribute">-out_good</span> null <span class="cm-attribute">-out_bad</span> null 
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-graphs.pl <span class="cm-attribute">-i</span> contigs.gd <span class="cm-attribute">-png_all</span> <span class="cm-attribute">-o</span> contigs
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-graphs.pl <span class="cm-attribute">-i</span> contigs.gd <span class="cm-attribute">-html_all</span> <span class="cm-attribute">-o</span> contigs
perl ~/biosoft/PRINSEQ/prinseq-lite-0.20.4/prinseq-lite.pl <span class="cm-attribute">-verbose</span> <span class="cm-attribute">-fasta</span> output_prefix.contigs.fa  <span class="cm-attribute">-stats_assembly</span></pre>
<p><span class="md-line md-end-block"><span class="">It simply reports a few metrics, as follows:</span></span></p>
<pre class="md-fences md-end-block" lang="" contenteditable="false">
stats_assembly  N50 176
stats_assembly  N75 113
stats_assembly  N90 78
stats_assembly  N95 70
</pre>
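<p><span class="md-line md-end-block">For reference, an Nx value such as N50 is the contig length at which the running total of lengths, taken from longest to shortest, first reaches x% of all assembled bases. A small sketch with made-up lengths; it follows the usual definition and has not been checked against how PRINSEQ computes it:</span></p>
<pre>
```python
def nx(lengths, fraction):
    """Return the contig length at which the running total of lengths,
    taken from longest to shortest, first reaches fraction * total."""
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("no contig lengths given")


contig_lengths = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]  # made-up lengths, total 55
for label, fraction in [("N50", 0.50), ("N75", 0.75), ("N90", 0.90)]:
    print(label, nx(contig_lengths, fraction))
```
</pre>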
<h3 class="md-end-block md-heading">Input Information</h3>
<table class="md-table" contenteditable="false">
<thead>
<tr class="md-end-block">
<th><span class="td-span" contenteditable="true">Input file(s):</span></th>
<th><span class="td-span" contenteditable="true">output_prefix.contigs.fa</span></th>
</tr>
</thead>
<tbody>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">Input format(s):</span></td>
<td><span class="td-span" contenteditable="true">FASTA</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true"><span class=""># Sequences:</span></span></td>
<td><span class="td-span" contenteditable="true">272,007</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">Total bases:</span></td>
<td><span class="td-span" contenteditable="true"><span class="">44,868,011</span></span></td>
</tr>
</tbody>
</table>
<h3 class="md-end-block md-heading">Length Distribution</h3>
<table class="md-table" contenteditable="false">
<thead>
<tr class="md-end-block">
<th><span class="td-span" contenteditable="true">Mean sequence length:</span></th>
<th><span class="td-span" contenteditable="true">164.95 ± 204.44 bp</span></th>
</tr>
</thead>
<tbody>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">Minimum length:</span></td>
<td><span class="td-span" contenteditable="true">63 bp</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">Maximum length:</span></td>
<td><span class="td-span" contenteditable="true">10,187 bp</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">Length range:</span></td>
<td><span class="td-span" contenteditable="true">10,125 bp</span></td>
</tr>
<tr class="md-end-block">
<td><span class="td-span" contenteditable="true">Mode length:</span></td>
<td><span class="td-span" contenteditable="true"><span class="">150 bp with 16,461 sequences</span></span></td>
</tr>
</tbody>
</table>
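<p><span class="md-line md-end-block">The numbers in the two tables are consistent with each other: 44,868,011 bases over 272,007 contigs gives the mean of 164.95 bp. A sketch of how such a summary can be computed from a list of contig lengths; the lengths are made up, and using the sample standard deviation is an assumption about what PRINSEQ reports:</span></p>
<pre>
```python
import statistics
from collections import Counter


def length_summary(lengths):
    """Summarise contig lengths roughly the way the tables above do."""
    mode_length, mode_count = Counter(lengths).most_common(1)[0]
    return {
        "sequences": len(lengths),
        "total_bases": sum(lengths),
        "mean": statistics.mean(lengths),
        "stdev": statistics.stdev(lengths),  # sample standard deviation
        "min": min(lengths),
        "max": max(lengths),
        "mode": (mode_length, mode_count),
    }


print(length_summary([63, 150, 150, 200, 1000]))  # made-up lengths
print(round(44868011 / 272007, 2))  # mean implied by the totals in this post
```
</pre>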
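<p><span class="md-line md-end-block" contenteditable="true"><span class="">The next step is to validate the contigs by aligning RNA-seq data to them! More on that in a later post.</span></span></p>
<p><span class="md-line md-end-block" contenteditable="true"><span class="">You can also BLAST the assembled contigs at NCBI to look at the species distribution (distribution of top nucleotide BLAST hits by species from the NCBI nr database for 1000 random contigs in the assembly). PRINSEQ, used above, also gives a simple table of the species distribution of contaminants, but it works on a different principle. More on that later.</span></span></p>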
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2523.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Plotting per-exon coverage for a gene</title>
		<link>http://www.bio-info-trainee.com/1392.html</link>
		<comments>http://www.bio-info-trainee.com/1392.html#comments</comments>
		<pubDate>Sun, 31 Jan 2016 07:15:20 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Genomics]]></category>
		<category><![CDATA[gene]]></category>
		<category><![CDATA[exon]]></category>
		<category><![CDATA[coverage plot]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=1392</guid>
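		<description><![CDATA[Usually we end up with a BAM file describing how the sequencing reads align to the genome, which holds &#8230; <a href="http://www.bio-info-trainee.com/1392.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div>Usually we end up with a BAM file describing how the sequencing reads align to the genome, and it holds a great deal of information. To inspect one particular gene we can use a visualization tool such as IGV, but that is no cure-all, because even a single gene can have several transcripts and many exons.</div>
<div>So we can plot its exon coverage, as below: the x-axis is the position along the exon, the y-axis is the sequencing depth, and each small panel is one exon.</div>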
<div> <a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/01/DMD.NM_000109.png"><img class="alignnone size-full wp-image-1394" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/01/DMD.NM_000109.png" alt="DMD.NM_000109" width="1080" height="1080" /></a></div>
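<div>From this plot it is obvious that exons 1 and 10-17 of transcript NM_000109 of the DMD gene are missing. Checking those exon regions one by one in IGV gives the same result! The cause may be that the capture chip fails to capture them, or that a variant in the sample itself produced a large deletion. Either way, the information in this plot is very useful!</div>
<div>So how do we draw a plot like this?</div>
<div>First, we need all the transcript and exon information for the gene of interest!</div>
<div>It is in hg19_refGene.txt, which can be downloaded from UCSC. That may be a hassle for beginners; if all else fails, you can also find the file in the annovar directory!</div>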
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/01/16.png"><img class="alignnone size-full wp-image-1395" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/01/16.png" alt="1" width="878" height="370" /></a></div>
<div>With this information we can work out the start and end positions of the gene.</div>
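<div>As a sketch of this step, the Python below reads one refGene.txt record and extracts the transcript span and exon list. It assumes the standard UCSC refGene column order (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, score, name2, ...), with 0-based starts; the record itself is made up and does not describe a real gene:</div>
<pre>
```python
def parse_refgene_line(line):
    """Parse one tab-separated record of UCSC refGene.txt."""
    f = line.rstrip("\n").split("\t")
    starts = [int(x) for x in f[9].rstrip(",").split(",")]
    ends = [int(x) for x in f[10].rstrip(",").split(",")]
    return {
        "transcript": f[1],
        "chrom": f[2],
        "strand": f[3],
        "tx_start": int(f[4]),  # 0-based, as stored by UCSC
        "tx_end": int(f[5]),
        "gene": f[12],
        "exons": list(zip(starts, ends)),  # in genomic order
    }


# Made-up record with three exons on the minus strand (not real coordinates)
record = "\t".join([
    "0", "NM_TOY", "chrX", "-", "1000", "5000", "1200", "4800", "3",
    "1000,2000,4000,", "1500,2500,5000,", "0", "TOYGENE", "cmpl", "cmpl", "0,1,2,",
])
gene = parse_refgene_line(record)
# 1-based inclusive region, the form a region string for samtools would use
print("{}:{}-{}".format(gene["chrom"], gene["tx_start"] + 1, gene["tx_end"]))
print(gene["exons"])
```
</pre>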
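<div>Then use the samtools depth command to get the sequencing depth across the whole gene region.</div>
<div>Finally, reformat the result into the three columns shown below.</div>
<div>The first column is the coordinate within the exon, from 1 to the length of the exon.</div>
<div>The second column is the sequencing depth at that coordinate, obtained with the samtools depth command.</div>
<div>The last column is the exon label, counting down from exon:79 to exon:1: the gene is on the minus strand of the chromosome, so the exons appear in reverse order!</div>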
<div>1 84 exon:79</div>
<div>2 84 exon:79</div>
<div>3 84 exon:79</div>
<div>4 85 exon:79</div>
<div>5 85 exon:79</div>
<div>6 86 exon:79</div>
<div>7 85 exon:79</div>
<div>8 87 exon:79</div>
<div>9 89 exon:79</div>
<div>10 91 exon:79</div>
<div>11 92 exon:79</div>
<div>12 95 exon:79</div>
<div>13 96 exon:79</div>
<div>14 96 exon:79</div>
<div>15 99 exon:79</div>
<div>16 99 exon:79</div>
<div>17 97 exon:79</div>
<div>Finally, from this txt file it is easy to draw the plot above with R!</div>
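<div>The three-column file itself can be produced along the following lines. This is only a sketch, not the script used for this post: it assumes exons as 0-based half-open (start, end) pairs in genomic order and the depths as a dictionary keyed by 1-based position. Positions absent from the dictionary are treated as depth 0, because samtools depth skips uncovered positions unless told otherwise. All coordinates and depths are made up:</div>
<pre>
```python
def exon_depth_rows(exons, strand, depth_by_pos):
    """Yield (offset, depth, label) rows, one per base of every exon.

    exons: list of (start, end) pairs, 0-based half-open, in genomic order.
    depth_by_pos: dict mapping 1-based position to depth; positions missing
    from the dict are treated as depth 0.
    """
    n = len(exons)
    for index, (start, end) in enumerate(exons):
        # On the minus strand the first exon in genomic order is the last exon
        number = n - index if strand == "-" else index + 1
        for offset, pos in enumerate(range(start + 1, end + 1), start=1):
            yield offset, depth_by_pos.get(pos, 0), "exon:{}".format(number)


exons = [(100, 103), (200, 202)]             # made-up coordinates
depth = {101: 84, 102: 84, 103: 85, 201: 7}  # made-up depths; 202 is missing
for row in exon_depth_rows(exons, "-", depth):
    print(*row)
```
</pre>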
<div>There is quite a lot of information in it!</div>
<div></div>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/01/APOB.NM_000384.png"><img class="alignnone  wp-image-1393" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/01/APOB.NM_000384.png" alt="APOB.NM_000384" width="921" height="921" /></a></div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/1392.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Repeat masking, part 2: building a repeat library with RepeatScout</title>
		<link>http://www.bio-info-trainee.com/611.html</link>
		<comments>http://www.bio-info-trainee.com/611.html#comments</comments>
		<pubDate>Thu, 02 Apr 2015 05:53:42 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Genomics]]></category>
		<category><![CDATA[repeatscount]]></category>
		<category><![CDATA[library]]></category>
		<category><![CDATA[repeats]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=611</guid>
		<description><![CDATA[Software homepage: http://bix.ucsd.edu/repeatscout/ w &#8230; <a href="http://www.bio-info-trainee.com/611.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Software homepage: <a href="http://bix.ucsd.edu/repeatscout/">http://bix.ucsd.edu/repeatscout/</a></p>
<p>wget <a href="http://bix.ucsd.edu/repeatscout/RepeatScout-1.0.5.tar.gz">http://bix.ucsd.edu/repeatscout/RepeatScout-1.0.5.tar.gz</a></p>
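<p>Unpack the archive, enter the directory, and run make.</p>
<p>For the strawberry genome, at about 215M, it runs quite fast!</p>
<p>Step 1: use the build_lmer_table command to build a frequency table for the whole genome, which finds every k-mer (l-mer) that occurs more than once.</p>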
<p>/opt/RepeatScount/build_lmer_table -l 14 -sequence strawberry.fa -freq strawberry.freq</p>
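<p>To illustrate what such a frequency table holds, here is a Python sketch that counts l-mers and keeps those seen at least twice. It is not RepeatScout's implementation: strand handling and the output file format are ignored, and the sequence and function name are made up:</p>
<pre>
```python
from collections import Counter


def repeated_lmers(sequence, l, min_count=2):
    """Return {l-mer: count} for l-mers seen at least min_count times."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + l] for i in range(len(sequence) - l + 1))
    return {lmer: n for lmer, n in counts.items() if n >= min_count}


toy_genome = "TTGACGTAGGCCGACGTAGGAAGACGTAGT"  # made-up sequence
for lmer, n in sorted(repeated_lmers(toy_genome, 8).items()):
    print(lmer, n)
```
</pre>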
<p>Step 2: use the RepeatScout command, which takes the frequency table and the genome sequence and writes a file containing every repeat element it can find.</p>
<p>RepeatScout -sequence strawberry.fa -freq strawberry.freq -l 14 -output strawberry_repeat</p>
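<p>Step 3: use the filter-stage-1.prl script to filter out low-complexity and tandem-repeat elements.</p>
<p>&nbsp;</p>
<p>The resulting file appears to be empty. Could everything have been filtered out???</p>
<p>Step 4: run RepeatMasker with the resulting repeat file as its library to produce an out file.</p>
<p>This software actually has quite a few parameters and I have only introduced some of them briefly. There is more detailed documentation on tuning them in my network drive, so I will not list it here!</p>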
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/611.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Repeat masking, part 1: experimenting with some RepeatMasker parameters</title>
		<link>http://www.bio-info-trainee.com/589.html</link>
		<comments>http://www.bio-info-trainee.com/589.html#comments</comments>
		<pubDate>Wed, 01 Apr 2015 13:52:39 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Genomics]]></category>
		<category><![CDATA[repeatmasker]]></category>
		<category><![CDATA[repeat masking]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=589</guid>
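		<description><![CDATA[This is a post I wrote a long time ago. I am posting it first for reference and will then walk through a real example. 1. RepeatM &#8230; <a href="http://www.bio-info-trainee.com/589.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>This is a post I wrote a long time ago. I am posting it first for reference and will then walk through a real example.</p>
<p>1. Comparing the results of some RepeatMasker options</p>
<p>The input is sequence.fasta, a zebrafish sequence downloaded more or less at random from NCBI.</p>
<p>Running without any options, RepeatMasker   sequence.fasta, gives:</p>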
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索138.png"><img class="alignnone size-full wp-image-590" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索138.png" alt="repeat-masker参数摸索138" width="505" height="598" /></a></p>
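<p>Adding the species option gives: RepeatMasker -species Danio  sequence.fasta</p>
<p>The effect shows immediately: before, less than 10% of the sequence was identified as repeats, and now it is over 70%. So you must pick the matching species, otherwise you will miss most of the repeats you are looking for.</p>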
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索267.png"><img class="alignnone size-full wp-image-591" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索267.png" alt="repeat-masker参数摸索267" width="536" height="525" /></a></p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索269.png"><img class="alignnone size-full wp-image-592" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索269.png" alt="repeat-masker参数摸索269" width="554" height="331" /></a></p>
<p>Adding the -low option as well: RepeatMasker -species Danio -low  sequence.fasta</p>
<p>Not much seems to change; only a few hits disappear.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索349.png"><img class="alignnone size-full wp-image-593" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索349.png" alt="repeat-masker参数摸索349" width="551" height="526" /></a> <a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索351.png"><img class="alignnone size-full wp-image-594" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索351.png" alt="repeat-masker参数摸索351" width="533" height="332" /></a></p>
<p>Comparing the -div option: RepeatMasker -species Danio  sequence.fasta</p>
<p>RepeatMasker -species Danio -div 10  sequence.fasta</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索459.png"><img class="alignnone size-full wp-image-595" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索459.png" alt="repeat-masker参数摸索459" width="554" height="197" /></a></p>
<p>And after adding -div 10:</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索475.png"><img class="alignnone size-full wp-image-596" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索475.png" alt="repeat-masker参数摸索475" width="553" height="218" /></a></p>
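<p>Hits whose second column (percent divergence) is 10% or higher are all dropped: -div 10 keeps only the repeats that are less than 10% diverged from the consensus.</p>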
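<p>The same cutoff can be applied afterwards to an existing .out file. A short sketch, assuming the RepeatMasker documentation's description of -div (keep only repeats less than the given percent diverged from the consensus) and whitespace-separated data lines. The first example line starts with the values from the table in this post; everything else in the two lines is made up:</p>
<pre>
```python
def keep_by_divergence(out_lines, max_div):
    """Keep .out data lines whose percent divergence (2nd column) is below max_div."""
    kept = []
    for line in out_lines:
        fields = line.split()
        if float(fields[1]) < max_div:
            kept.append(line)
    return kept


rows = [
    "2555  5.7 0.0 0.0 seq1 373 690 (50140) C DNA13TA1a_DR DNA/TcMar-Tc1 (0) 318 1 1",
    " 310 18.2 1.1 0.0 seq1 900 980 (49850) + ToyRepeat DNA/hAT 5 86 (20) 2",
]
for line in keep_by_divergence(rows, 10):
    print(line)
```
</pre>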
<p>Output options: by default the repeat regions are masked with N.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索518.png"><img class="alignnone size-full wp-image-597" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索518.png" alt="repeat-masker参数摸索518" width="505" height="436" /></a></p>
<p>With the -x option, every position that would have been N becomes X instead. This option does not seem very useful.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索560.png"><img class="alignnone size-full wp-image-598" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索560.png" alt="repeat-masker参数摸索560" width="498" height="287" /></a></p>
<p>There are a few similar options of limited use. With -xsmall the repeat regions are written in lowercase letters instead of being masked with N.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索613.png"><img class="alignnone size-full wp-image-599" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索613.png" alt="repeat-masker参数摸索613" width="507" height="403" /></a></p>
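<p>The three masking styles differ only in what replaces the repeat bases. A toy Python sketch with a made-up sequence and interval, using 1-based inclusive coordinates as in the .out file:</p>
<pre>
```python
def mask_repeats(sequence, intervals, mode="N"):
    """Mask 1-based inclusive intervals with N, X, or lowercase letters."""
    bases = list(sequence)
    for begin, end in intervals:
        for i in range(begin - 1, end):
            bases[i] = bases[i].lower() if mode == "lower" else mode
    return "".join(bases)


seq = "ACGTACGTACGT"
hits = [(3, 6)]  # made-up repeat interval
print(mask_repeats(seq, hits))           # default behaviour, N
print(mask_repeats(seq, hits, "X"))      # like -x
print(mask_repeats(seq, hits, "lower"))  # like -xsmall
```
</pre>
<p>Lowercase (soft) masking keeps the original bases, so downstream tools can still read the sequence under the mask.</p>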
<p>With the -a option there is one more output file:</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索637.png"><img class="alignnone size-full wp-image-600" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索637.png" alt="repeat-masker参数摸索637" width="230" height="76" /></a></p>
<p>Its content looks like this:</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索648.png"><img class="alignnone size-full wp-image-601" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索648.png" alt="repeat-masker参数摸索648" width="553" height="348" /></a></p>
<p>The alignments are in the cross_match/SWAT format, in which mismatches rather than matches are indicated: transitions with an i and transversions with a v. Note that there are some differences between the alignment file and the map file. The map file is produced by ProcessRepeats, whose main task is to defragment the original map file, and the alignment file is created from the original map file: the difference between them comes from the defragmented hits.</p>
<p>Adding -poly also produces one more file:</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索1139.png"><img class="alignnone size-full wp-image-602" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索1139.png" alt="repeat-masker参数摸索1139" width="232" height="40" /></a></p>
<p>Looking inside, it lists the microsatellites in a separate table.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索1159.png"><img class="alignnone size-full wp-image-603" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索1159.png" alt="repeat-masker参数摸索1159" width="554" height="137" /></a></p>
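<p>The ‘-xm’, ‘-ace’, and ‘-gff’ options create an additional output file in cross_match, ACeDB, and Gene Feature Finding format respectively. These options all exist to produce files suited to other downstream tools.</p>
<p>For large files you may also need -pa to run in parallel, or -s, -q, or -qq to trade sensitivity for speed.</p>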
<p>&nbsp;</p>
<p>2. Explanation of the output files</p>
<p>These files are written:</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索1387.png"><img class="alignnone size-full wp-image-604" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索1387.png" alt="repeat-masker参数摸索1387" width="553" height="26" /></a></p>
<p>1. The .out file</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索1399.png"><img class="alignnone size-full wp-image-605" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索1399.png" alt="repeat-masker参数摸索1399" width="554" height="201" /></a></p>
<table>
<tbody>
<tr>
<td width="142">SW score</td>
<td width="248">Smith-Waterman alignment score</td>
<td width="161">2555</td>
<td width="16"></td>
</tr>
<tr>
<td width="142">Div%</td>
<td width="248">Percent substitutions in the matching region compared with the consensus</td>
<td width="161">5.7</td>
<td width="16"></td>
</tr>
<tr>
<td width="142">Del%</td>
<td width="248">Percent of bases missing in the query sequence (deleted bp)</td>
<td width="161">0.0</td>
<td width="16"></td>
</tr>
<tr>
<td width="142">Ins%</td>
<td width="248">Percent of bases missing in the repeat library sequence (inserted bp)</td>
<td width="161"> 0.0</td>
<td width="16"></td>
</tr>
<tr>
<td width="142">Query sequence</td>
<td width="248">Name of the input sequence being masked</td>
<td width="161">gi|211853417|emb|CU633477.14|</td>
<td width="16"></td>
</tr>
<tr>
<td width="142">Position begin</td>
<td width="248">Start of the match in the query sequence</td>
<td width="161">373</td>
<td width="16"></td>
</tr>
<tr>
<td width="142">Position end</td>
<td width="248">End of the match in the query sequence</td>
<td width="161"> 690</td>
<td width="16"></td>
</tr>
<tr>
<td width="142">Query left</td>
<td width="248">Number of bases in the query sequence beyond the matched region</td>
<td width="161">(50140)</td>
<td width="16"></td>
</tr>
<tr>
<td width="142">Strand</td>
<td width="248">+ = match on the forward strand of the library repeat; a match on the complement is shown as “C”</td>
<td width="161">C</td>
<td width="16"></td>
</tr>
<tr>
<td width="142">Matching repeat</td>
<td width="248">Name of the matched repeat</td>
<td width="161">DNA13TA1a_DR</td>
<td width="16"></td>
</tr>
<tr>
<td width="142">Repeat family(class)</td>
<td width="248">Class of the matched repeat</td>
<td width="161">  DNA/TcMar-Tc1</td>
<td width="16"></td>
</tr>
<tr>
<td width="142">Position begin</td>
<td width="248">Start of the match in the repeat sequence</td>
<td width="161"></td>
<td width="16"></td>
</tr>
<tr>
<td width="142">Position end</td>
<td width="248">End of the match in the repeat sequence</td>
<td width="161"></td>
<td width="16"></td>
</tr>
<tr>
<td width="142">Repeat left</td>
<td width="248">Number of bases of the repeat sequence lying beyond the matched region</td>
<td width="161"></td>
<td width="16"></td>
</tr>
<tr>
<td width="142">ID</td>
<td width="248">Sequential ID of the match</td>
<td width="161"></td>
<td width="16"></td>
</tr>
</tbody>
</table>
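<p>To work with these columns programmatically, each data line of the .out file can be split on whitespace. A sketch with hypothetical field names: the first eleven values of the example line come from the table above, while the last four (repeat coordinates and ID) are invented for illustration. The three repeat coordinate columns are kept as raw text because their order depends on the strand:</p>
<pre>
```python
OUT_FIELDS = [
    "sw_score", "perc_div", "perc_del", "perc_ins", "query", "q_begin",
    "q_end", "q_left", "strand", "repeat", "class_family",
    "r_col12", "r_col13", "r_col14", "id",
]


def parse_out_line(line):
    """Split one data line of a RepeatMasker .out file into named fields."""
    rec = dict(zip(OUT_FIELDS, line.split()))
    for key in ("sw_score", "q_begin", "q_end"):
        rec[key] = int(rec[key])
    for key in ("perc_div", "perc_del", "perc_ins"):
        rec[key] = float(rec[key])
    # "(50140)" means 50140 query bases lie beyond the match
    rec["q_left"] = int(rec["q_left"].strip("()"))
    return rec


line = ("2555 5.7 0.0 0.0 gi|211853417|emb|CU633477.14| 373 690 (50140) "
        "C DNA13TA1a_DR DNA/TcMar-Tc1 (0) 318 1 1")
rec = parse_out_line(line)
print(rec["repeat"], rec["strand"], rec["q_end"] - rec["q_begin"] + 1)
```
</pre>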
<p>2. The .cat file is essentially similar to the .out file<br />
3. The .tbl file</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索1917.png"><img class="alignnone size-full wp-image-606" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索1917.png" alt="repeat-masker参数摸索1917" width="550" height="440" /></a> <a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索1919.png"><img class="alignnone size-full wp-image-607" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索1919.png" alt="repeat-masker参数摸索1919" width="554" height="395" /></a><br />
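4. The .masked file: the repeats that were found are replaced with N, or with another mask style selected through the options</p>
<p>The .polyout file lists the microsatellites in a separate table.</p>
<p>The .align file takes each record of the .out file separately and lays it out in detail.</p>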
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索2027.png"><img class="alignnone size-full wp-image-608" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/04/repeat-masker参数摸索2027.png" alt="repeat-masker参数摸索2027" width="554" height="360" /></a></p>
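<p>It lists the nucleotide sequence from 373 to 690 and shows exactly how it matches the DNA13TA1a_DR repeat.</p>
<p>At the time I did not understand what the i and v marks meant. Going by the format description quoted earlier, i marks a transition and v marks a transversion.</p>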
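<p>In the cross_match/SWAT alignment format quoted earlier, i marks a transition (purine to purine or pyrimidine to pyrimidine) and v marks a transversion (purine to pyrimidine or the reverse). A small Python sketch that derives such marks for two gap-free aligned strings; the sequences are made up and gaps or ambiguity codes are not handled:</p>
<pre>
```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def mismatch_marks(query, subject):
    """Return a mark string for two gap-free aligned sequences:
    ' ' for a match, 'i' for a transition, 'v' for a transversion."""
    marks = []
    for q, s in zip(query.upper(), subject.upper()):
        if q == s:
            marks.append(" ")
        elif {q, s} == PURINES or {q, s} == PYRIMIDINES:
            marks.append("i")  # purine-purine or pyrimidine-pyrimidine swap
        else:
            marks.append("v")  # purine-pyrimidine swap
    return "".join(marks)


print(repr(mismatch_marks("ACGTAC", "GCGAAT")))
```
</pre>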
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/589.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Simulating some sequencing data and assembling it into a genome with SOAPdenovo</title>
		<link>http://www.bio-info-trainee.com/486.html</link>
		<comments>http://www.bio-info-trainee.com/486.html#comments</comments>
		<pubDate>Wed, 25 Mar 2015 12:29:09 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Genomics]]></category>
		<category><![CDATA[basic software]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[soap]]></category>
		<category><![CDATA[simulation]]></category>
		<category><![CDATA[assembly]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=486</guid>
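		<description><![CDATA[Here I chose the E. coli genome for the assembly, because it has only a single chromosome and is not large! The file is only 4 &#8230; <a href="http://www.bio-info-trainee.com/486.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Here I chose the E. coli genome for the assembly, because it has only a single chromosome and is not large!</p>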
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/人为创造几个测序数据然后用soap组装成基因组130.png"><img class="alignnone size-full wp-image-488" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/人为创造几个测序数据然后用soap组装成基因组130.png" alt="人为创造几个测序数据然后用soap组装成基因组130" width="470" height="302" /></a></p>
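<p>The file is only 4.5M: the first line is the sequence name and the second line is the sequence itself, 4,641,652 bases in total.</p>
<p>I wrote a Perl script to create an artificial sequencing file:</p>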
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/人为创造几个测序数据然后用soap组装成基因组58.png"><img class="alignnone size-full wp-image-487" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/人为创造几个测序数据然后用soap组装成基因组58.png" alt="人为创造几个测序数据然后用soap组装成基因组58" width="554" height="171" /></a></p>
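<p>This way the 4.5M genome yields 486M of simulated single-end 100 bp sequencing data. The reads tile the genome without gaps, so in principle they should be easy to assemble.</p>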
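<p>The Perl script is only shown as a screenshot, so the following is a guess at the idea rather than a port of it: a Python sketch that cuts overlapping single-end reads from a genome by advancing the start position by a fixed step plus an optional random offset. The step of 80 is taken from the description later in this post; the jitter, seed, and toy genome are assumptions:</p>
<pre>
```python
import random


def simulate_single_end_reads(genome, read_len=100, base_step=80, jitter=0, seed=0):
    """Cut overlapping reads from genome, moving the start by base_step plus a
    random offset in [0, jitter] each time, so consecutive reads overlap."""
    rng = random.Random(seed)
    reads, start = [], 0
    while start + read_len < len(genome) + 1:
        reads.append(genome[start:start + read_len])
        start += base_step + rng.randint(0, jitter)
    return reads


toy_rng = random.Random(1)
genome = "".join(toy_rng.choice("ACGT") for _ in range(1000))  # made-up genome
reads = simulate_single_end_reads(genome, read_len=100, base_step=80, jitter=10)
print(len(reads), "reads of", len(reads[0]), "bp")
```
</pre>
<p>As long as base_step plus jitter stays below the read length, neighbouring reads always overlap, which is what makes the simulated data gap-free.</p>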
<p>/home/jmzeng/bio-soft/SOAPdenovo2-bin-LINUX-generic-r240/SOAPdenovo-127mer all -s config_file -K 63 -R -o graph_prefix 1&gt;ass.log 2&gt;ass.err</p>
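<p>The config_file passed with -s is not shown in this post. Below is a Python sketch that writes a minimal single-library configuration, using key names from the SOAPdenovo2 README; the values and the FASTQ file name are assumptions, not the settings actually used here:</p>
<pre>
```python
def soap_config(fastq_path, max_rd_len=100, avg_ins=200):
    """Build the text of a minimal single-library SOAPdenovo2 config file."""
    lines = [
        "max_rd_len={}".format(max_rd_len),  # maximal read length
        "[LIB]",
        "avg_ins={}".format(avg_ins),  # average insert size of the library
        "reverse_seq=0",               # 0 = forward-reverse library
        "asm_flags=3",                 # 3 = use reads for contigs and scaffolds
        "rank=1",                      # order in which libraries are used
        "q={}".format(fastq_path),     # single-end reads in FASTQ format
    ]
    return "\n".join(lines) + "\n"


config_text = soap_config("simulated_reads.fq")  # hypothetical file name
print(config_text, end="")
# To save it: open("config_file", "w").write(config_text)
```
</pre>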
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/人为创造几个测序数据然后用soap组装成基因组331.png"><img class="alignnone size-full wp-image-489" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/人为创造几个测序数据然后用soap组装成基因组331.png" alt="人为创造几个测序数据然后用soap组装成基因组331" width="507" height="506" /></a></p>
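<p>The assembly looks pretty good. I then simulated another test data set and assembled it again, and this time the result was even better!</p>
<p>Paired-end reads could also be simulated, which should get to a 100% complete assembly.</p>
<p>However, because my code uses a step of 80 with a random stagger, I tried setting the k-mer length to 81, hoping this would assemble everything into a single E. coli genome.</p>
<p>/home/jmzeng/bio-soft/SOAPdenovo2-bin-LINUX-generic-r240/SOAPdenovo-127mer all -s config_file -K 81 -R -o graph_prefix 1&gt;ass.log 2&gt;ass.err</p>
<p>But there was no substantial improvement, even though in theory the reads should certainly assemble into one piece!</p>
<p>So next I will simulate paired-end reads, with a 200 bp gap in between.</p>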
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/486.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Installing and using the genome assembler SOAPdenovo</title>
		<link>http://www.bio-info-trainee.com/476.html</link>
		<comments>http://www.bio-info-trainee.com/476.html#comments</comments>
		<pubDate>Wed, 25 Mar 2015 10:05:28 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Genomics]]></category>
		<category><![CDATA[basic software]]></category>
		<category><![CDATA[soap]]></category>
		<category><![CDATA[assembly]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=476</guid>
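		<description><![CDATA[1. Download and install the software. The download links are listed below, but installing from source kept failing, so I simply downloaded the b &#8230; <a href="http://www.bio-info-trainee.com/476.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>1. Download and install the software</p>
<p>The download links are listed below. Installing from source kept failing for me, so I downloaded the prebuilt binary instead.</p>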
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用104.png"><img class="alignnone size-full wp-image-477" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用104.png" alt="基因组组装软件SOAPdenovo安装使用104" width="554" height="145" /></a></p>
<p>Unpack the archive and enter the directory.</p>
<p>First run make.</p>
<p>Then run make install.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用731.png"><img class="alignnone size-full wp-image-478" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用731.png" alt="基因组组装软件SOAPdenovo安装使用731" width="490" height="257" /></a></p>
<p>The installation kept failing. I do not know why and did not bother to track it down.</p>
<p>I simply copied the program over from my teacher.</p>
<p><a href="https://github.com/aquaskyline/SOAPdenovo2/archive/master.zip">https://github.com/aquaskyline/SOAPdenovo2/archive/master.zip</a></p>
<p><a href="http://sourceforge.net/projects/soapdenovo2/files/SOAPdenovo2/bin/r240/SOAPdenovo2-bin-LINUX-generic-r240.tgz/download">http://sourceforge.net/projects/soapdenovo2/files/SOAPdenovo2/bin/r240/SOAPdenovo2-bin-LINUX-generic-r240.tgz/download</a></p>
<p><a href="http://sourceforge.net/projects/soapdenovo2/files/latest/download?source=files">http://sourceforge.net/projects/soapdenovo2/files/latest/download?source=files</a></p>
<p>Alternatively, download the binary directly:</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用1035.png"><img class="alignnone size-full wp-image-481" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用1035.png" alt="基因组组装软件SOAPdenovo安装使用1035" width="554" height="171" /></a></p>
<p>2. Prepare the test data</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用742.png"><img class="alignnone size-full wp-image-479" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用742.png" alt="基因组组装软件SOAPdenovo安装使用742" width="382" height="259" /></a></p>
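<p>The input is paired-end (left and right) read files from several libraries, like these.</p>
<p>Here I use a small single-end data set for testing.</p>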
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用783.png"><img class="alignnone size-full wp-image-480" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用783.png" alt="基因组组装软件SOAPdenovo安装使用783" width="349" height="156" /></a></p>
<p>3. Reference commands</p>
<p>You may run it like this:</p>
<p>Reference: <a href="http://www.plob.org/2012/07/06/2537.html">http://www.plob.org/2012/07/06/2537.html</a></p>
<p><a href="https://github.com/aquaskyline/SOAPdenovo2">https://github.com/aquaskyline/SOAPdenovo2</a></p>
<p>There are four steps in total, described below.</p>
<p>&nbsp;</p>
<table>
<tbody>
<tr>
<td width="569">./pregraph_sparse [parameters]</td>
</tr>
<tr>
<td width="569">./SOAPdenovo-63mer contig [parameters]</td>
</tr>
<tr>
<td width="569">./SOAPdenovo-63mer map [parameters]</td>
</tr>
<tr>
<td width="569">./SOAPdenovo-63mer scaff [parameters]</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<table>
<tbody>
<tr>
<td width="570">i) Preparing the pregraph. This step is similar to velveth for velvet.</td>
</tr>
<tr>
<td width="570">ii) Determining contigs. This step is similar to velvetg for velvet.</td>
</tr>
<tr>
<td width="570">iii) Mapping back reads on to contigs.</td>
</tr>
<tr>
<td width="570">iv) Assembling contigs into scaffolds.</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<table>
<tbody>
<tr>
<td width="568"> SOAPdenovo-63mer  sparse_pregraph  <b>-s config_file -K 45 -p 28 -z 1100000000 -o outPG</b></td>
</tr>
<tr>
<td width="568"> SOAPdenovo-63mer contig  <b>-g outPG</b></td>
</tr>
<tr>
<td width="568"> SOAPdenovo-63mer map <b> -s config_file -g outPG -p 28</b></td>
</tr>
<tr>
<td width="568"> SOAPdenovo-63mer  scaff <b>  -g outPG -p 28</b></td>
</tr>
</tbody>
</table>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用1629.png"><img class="alignnone size-full wp-image-482" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用1629.png" alt="基因组组装软件SOAPdenovo安装使用1629" width="554" height="161" /></a></p>
<p>The steps given on the official site are as follows:</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用1641.png"><img class="alignnone size-full wp-image-483" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用1641.png" alt="基因组组装软件SOAPdenovo安装使用1641" width="554" height="238" /></a></p>
<p>The command also needs a configuration file:</p>
<p>max_rd_len=99 (maximum read length; set it according to your data)</p>
<p>[LIB] (first library)</p>
<p>avg_ins=225</p>
<p>reverse_seq=0</p>
<p>asm_flags=3</p>
<p>rank=1</p>
<p>q1=runPE_1.fq</p>
<p>q2=runPE_2.fq</p>
<p>[LIB] (second library)</p>
<p>avg_ins=2000</p>
<p>reverse_seq=1</p>
<p>asm_flags=2</p>
<p>rank=2</p>
<p>q1=runMP_1.fq</p>
<p>q2=runMP_2.fq</p>
<p>Alternatively, all four steps can be run with a single command:</p>
<p>SOAPdenovo-63mer all -s config_file -K 63 -R -o graph_prefix 1&gt;ass.log 2&gt;ass.err</p>
<p>I lightly adapted the commands from the reference blog post and the official site, then ran my own version:</p>
<p>/home/jmzeng/bio-soft/SOAPdenovo2-bin-LINUX-generic-r240/SOAPdenovo-127mer <b>all -s config_file -K 63 -R -o </b>graph_prefix 1&gt;ass.log 2&gt;ass.err</p>
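<p>I do not fully understand the parameters yet, so I simply ran it to see what happens.</p>
<p>I picked 7 single-end datasets, so my config file was as follows. (Per the SOAPdenovo2 README, p= denotes a single FASTA file of interleaved paired reads, while single-end FASTA files are listed with f=, so f= would be the documented key for these files.)</p>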
<p>max_rd_len=500</p>
<p>[LIB]</p>
<p>avg_ins=225</p>
<p>reverse_seq=0</p>
<p>asm_flags=3</p>
<p>rank=1</p>
<p>p=SRR072005.fa</p>
<p>p=SRR072010.fa</p>
<p>p=SRR072011.fa</p>
<p>p=SRR072012.fa</p>
<p>p=SRR072013.fa</p>
<p>p=SRR072014.fa</p>
<p>p=SRR072029.fa</p>
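<p>Writing this file by hand gets tedious once there are several libraries. The following is a minimal Python sketch that renders a config from a list of library descriptions. The Library class and render_config function are names I made up for illustration; the keys (max_rd_len, avg_ins, reverse_seq, asm_flags, rank, and f= for single-end FASTA) are the ones described in the SOAPdenovo2 README. Note that it uses f= rather than the p= of my actual run above.</p>

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Library:
    """One [LIB] section; `files` holds (key, path) pairs such as ("f", "reads.fa")."""
    avg_ins: int
    reverse_seq: int
    asm_flags: int
    rank: int
    files: List[Tuple[str, str]] = field(default_factory=list)


def render_config(max_rd_len: int, libraries: List[Library]) -> str:
    """Render the text of a SOAPdenovo-style config file (hypothetical helper)."""
    lines = [f"max_rd_len={max_rd_len}"]
    for lib in libraries:
        lines.append("[LIB]")
        lines.append(f"avg_ins={lib.avg_ins}")
        lines.append(f"reverse_seq={lib.reverse_seq}")
        lines.append(f"asm_flags={lib.asm_flags}")
        lines.append(f"rank={lib.rank}")
        # One key=path line per input file, in the order given.
        for key, path in lib.files:
            lines.append(f"{key}={path}")
    return "\n".join(lines) + "\n"


# The seven single-end FASTA files from this post, listed with f=.
runs = ["SRR072005", "SRR072010", "SRR072011", "SRR072012",
        "SRR072013", "SRR072014", "SRR072029"]
single_end = Library(avg_ins=225, reverse_seq=0, asm_flags=3, rank=1,
                     files=[("f", f"{run}.fa") for run in runs])
config_text = render_config(500, [single_end])
print(config_text)
```

<p>Saving config_text to config_file gives a file that can be passed with -s. A paired library would be added as a second Library whose files are ("q1", ...) and ("q2", ...).</p>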
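<p>4. Interpreting the output</p>
<p>My dataset is fairly small, just 7 FASTA files of a bit over 300 MB each, so the run finished in a few hours.</p>
<p>Each of the four steps produces output files.</p>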
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用2446.png"><img class="alignnone size-full wp-image-484" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/基因组组装软件SOAPdenovo安装使用2446.png" alt="基因组组装软件SOAPdenovo安装使用2446" width="554" height="475" /></a></p>
<p>The assembly looks dismal: about 860,000 contigs and more than 500,000 scaffolds in total.</p>
<p>scaffolds&gt;100  505473 99.60%</p>
<p>scaffolds&gt;500  113523 22.37%</p>
<p>scaffolds&gt;1K   48283 9.51%</p>
<p>scaffolds&gt;10K  0 0.00%</p>
<p>scaffolds&gt;100K 0 0.00%</p>
<p>scaffolds&gt;1M   0 0.00%</p>
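<p>The size distribution above can be recomputed from a plain list of scaffold lengths. Here is a minimal Python sketch, not SOAPdenovo's own code. It assumes the thresholds are strict (longer than 100, 500, 1,000 bp) and that percentages are relative to the total number of scaffolds, which is what the numbers above appear to be. It also computes N50, which the table does not show. The lengths in the example are made up.</p>

```python
from typing import Dict, Iterable, List, Tuple


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length L or more hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def count_above(lengths: List[int], thresholds: List[int]) -> Dict[int, Tuple[int, float]]:
    """For each threshold, return (count of longer sequences, percent of all sequences)."""
    total = len(lengths)
    result = {}
    for threshold in thresholds:
        count = sum(1 for length in lengths if length > threshold)
        percent = 100 * count / total if total else 0.0
        result[threshold] = (count, percent)
    return result


# Toy scaffold lengths, for illustration only.
toy_lengths = [120, 450, 800, 1500, 90]
summary = count_above(toy_lengths, [100, 500, 1000])
for threshold, (count, percent) in summary.items():
    print(f"scaffolds>{threshold}\t{count}\t{percent:.2f}%")
print("N50:", n50(toy_lengths))
```

<p>For a real run, the lengths would come from the scaffold FASTA file written by the scaff step.</p>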
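<p>This amounts to almost no assembly at all, because many of my reads were already longer than 500 bp!</p>
<p>Perhaps my choice of k-mer size is wrong.</p>
<p>With K=63 the result is poor: 860,000 contigs and 500,000 scaffolds.</p>
<p>With K=35 it is even worse: 2.03 million contigs and nearly 600,000 scaffolds.</p>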
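<p>Trying several k-mer sizes like this is easier with a small script. Below is a minimal Python sketch that only builds the command lines and does not run anything. It assumes what I understand from the SOAPdenovo2 documentation: K must be odd, the 63mer binary handles K up to 63, and the 127mer binary handles K up to 127 (the lower bound of 13 is also an assumption). The function names and the _K suffix on the output prefix are my own.</p>

```python
from typing import List


def pick_binary(k: int) -> str:
    """Choose the SOAPdenovo binary for a k-mer size (assumed limits, see text)."""
    if k % 2 == 0:
        raise ValueError("K must be odd")
    if k > 127 or 13 > k:
        raise ValueError("K is outside the assumed 13..127 range")
    if k > 63:
        return "SOAPdenovo-127mer"
    return "SOAPdenovo-63mer"


def build_all_command(k: int, config: str = "config_file",
                      prefix: str = "graph_prefix") -> List[str]:
    """Build the 'all' command for one K, with a separate output prefix per K."""
    return [pick_binary(k), "all", "-s", config,
            "-K", str(k), "-R", "-o", f"{prefix}_K{k}"]


# Print one command per K so each can be inspected before running.
for k in (35, 63):
    print(" ".join(build_all_command(k)))
```

<p>Each list can be handed to subprocess.run as is, and keeping a separate output prefix per K avoids one run overwriting another.</p>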
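<p>I suspect the k-mer size is not the real problem. More likely it is that I did not use the 20 kb and 3 kb paired-end libraries. I am also more used to Illumina data and not very comfortable with this 454 data.</p>
<p>Assembly is hard to get right in my spare time. I would welcome discussion with anyone more experienced, and I will keep experimenting.</p>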
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/476.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Preprocessing the strawberry genome data</title>
		<link>http://www.bio-info-trainee.com/467.html</link>
		<comments>http://www.bio-info-trainee.com/467.html#comments</comments>
		<pubDate>Tue, 24 Mar 2015 10:03:34 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Genomics]]></category>
		<category><![CDATA[fastqc]]></category>
		<category><![CDATA[Genome]]></category>
		<category><![CDATA[Strawberry]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=467</guid>
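		<description><![CDATA[Today I start with the 7 single-end datasets. They are 454 data with an average length of about 300 bp; tomorrow I will process the 3 &#8230; <a href="http://www.bio-info-trainee.com/467.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Today I start with the 7 single-end datasets. They are 454 data with an average read length of about 300 bp; tomorrow I will process the 3 kb and 20 kb paired reads.</p>
<p>First, run FastQC.</p>
<p>Then open the reports one by one.</p>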
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/草莓基因组数据预处理28.png"><img class="alignnone size-full wp-image-468" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/草莓基因组数据预处理28.png" alt="草莓基因组数据预处理28" width="458" height="322" /></a></p>
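<p>Base quality at the start of the reads is good. Because this is 454 data, read lengths vary widely; the first 400 bp or so are generally of good quality, while very long reads are unreliable.</p>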
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/草莓基因组数据预处理39.png"><img class="alignnone size-full wp-image-469" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/草莓基因组数据预处理39.png" alt="草莓基因组数据预处理39" width="465" height="342" /></a></p>
<p>The key plot is the next one: the first 4 bases, TCAG, are adaptor sequence rather than part of our sequencing data, so they must be removed.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/草莓基因组数据预处理118.png"><img class="alignnone size-full wp-image-470" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/草莓基因组数据预处理118.png" alt="草莓基因组数据预处理118" width="453" height="336" /></a></p>
<p>So I used the SolexaQA package to filter the raw reads.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/草莓基因组数据预处理214.png"><img class="alignnone size-full wp-image-471" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/草莓基因组数据预处理214.png" alt="草莓基因组数据预处理214" width="553" height="232" /></a></p>
<p>The filtering is drastic, and one sample was almost entirely wiped out! I then checked my batch script and found that the -454 option in perl DynamicTrim.pl -454 $id may be the problem.</p>
<p>for id in *fastq</p>
<p>do</p>
<p>echo "$id"</p>
<p>perl DynamicTrim.pl -454 "$id"</p>
<p>done</p>
<p>for id in *trimmed</p>
<p>do</p>
<p>echo "$id"</p>
<p>perl LengthSort.pl "$id"</p>
<p>done</p>
<p>&nbsp;</p>
<p>The low-quality bases at the end of the reads have been trimmed, but the TCAG at the start is still there.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/草莓基因组数据预处理425.png"><img class="alignnone size-full wp-image-472" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/草莓基因组数据预处理425.png" alt="草莓基因组数据预处理425" width="553" height="259" /></a></p>
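<p>Since the trimming step leaves the leading TCAG in place, it has to be removed separately. The following is a minimal Python sketch of one way to do that, not part of SolexaQA. It assumes well-formed FASTQ with exactly four lines per record and removes the key only when a read actually starts with it. The function names and the two example reads are made up for illustration.</p>

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, quality


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records; a trailing incomplete record is dropped."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups consecutive lines.
    for header, seq, sep, qual in zip(stripped, stripped, stripped, stripped):
        yield header, seq, sep, qual


def strip_key(record: Record, key: str = "TCAG") -> Record:
    """Remove a leading key from the sequence and the matching quality characters."""
    header, seq, sep, qual = record
    if seq.startswith(key):
        seq = seq[len(key):]
        qual = qual[len(key):]
    return header, seq, sep, qual


# Two toy reads: the first starts with TCAG, the second does not.
raw = ["@read1", "TCAGACGTAC", "+", "IIIIHHHHGG",
       "@read2", "ACGTAC", "+", "IIIIII"]
trimmed = [strip_key(record) for record in parse_fastq(raw)]
for record in trimmed:
    print("\n".join(record))
```

<p>For real files, passing an open file handle to parse_fastq works the same way as the list used here.</p>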
<p>The data after processing look like this:</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/草莓基因组数据预处理475.png"><img class="alignnone size-full wp-image-473" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/草莓基因组数据预处理475.png" alt="草莓基因组数据预处理475" width="351" height="155" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/467.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reading the strawberry genome paper and downloading the raw sequencing data</title>
		<link>http://www.bio-info-trainee.com/318.html</link>
		<comments>http://www.bio-info-trainee.com/318.html#comments</comments>
		<pubDate>Tue, 17 Mar 2015 15:05:49 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Genomics]]></category>
		<category><![CDATA[Basic databases]]></category>
		<category><![CDATA[SRA]]></category>
		<category><![CDATA[Raw reads]]></category>
		<category><![CDATA[Genome]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=318</guid>
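		<description><![CDATA[I could not find the rubber tree sequencing data, so I turned to the strawberry they referenced (strawberry, Fr &#8230; <a href="http://www.bio-info-trainee.com/318.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>My search for the rubber tree sequencing data came up empty,</p>
<p>so I turned to the paper on the strawberry genome that they referenced (strawberry, Fragaria vesca, 2n = 2x = 14, a small genome of 240 Mb), which was published in Nature Genetics:</p>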
<p><a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3326587/">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3326587/</a></p>
<p>The paper lists its SRA accession numbers.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/研读橡胶的基因组文章1087.png"><img class="alignnone size-full wp-image-312" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/研读橡胶的基因组文章1087.png" alt="研读橡胶的基因组文章1087" width="554" height="225" /></a></p>
<p>Strawberry assembly results: Over 3,200 scaffolds were assembled with an N50 of 1.3 Mb.</p>
<p>Over 95% (209.8 Mb) of the total sequence is represented in 272 scaffolds.</p>
<p>Strawberry gene information: Gene prediction modeling identified 34,809 genes, with most being supported by transcriptome mapping.</p>
<p>Strawberry chromosome information: Paradoxically, the small basic (x = 7) genome size of the strawberry genus, ~240 Mb,</p>
<p>offers substantial advantages for genomic research.</p>
<p>Strawberry source material: diploid strawberry F. vesca ssp. vesca accession Hawaii 4</p>
<p>(National Clonal Germplasm Repository accession # PI551572).</p>
<p>Then I went to NCBI to download these three datasets.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/研读橡胶的基因组文章1664.png"><img class="alignnone size-full wp-image-313" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/研读橡胶的基因组文章1664.png" alt="研读橡胶的基因组文章1664" width="554" height="494" /></a></p>
<p>&nbsp;</p>
<p>SRA020125 contains four experiments:</p>
<p>&nbsp;</p>
<table>
<tbody>
<tr>
<td width="284"><a href="http://www.ncbi.nlm.nih.gov/sra/SRX030575[accn]">http://www.ncbi.nlm.nih.gov/sra/SRX030575[accn]</a></td>
<td width="284"><b>Total: </b>4 runs, 4.7M spots, 2.6G bases, <a href="ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX030/SRX030575">5.5Gb</a></td>
</tr>
<tr>
<td width="284"><a href="http://www.ncbi.nlm.nih.gov/sra/SRX030576[accn]">http://www.ncbi.nlm.nih.gov/sra/SRX030576[accn]</a>  (3 kb PE)</td>
<td width="284"><b>Total: </b>2 runs, 2.2M spots, 908.5M bases, <a href="ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX030/SRX030576">2.1Gb</a></td>
</tr>
<tr>
<td width="284"><a href="http://www.ncbi.nlm.nih.gov/sra/SRX030577[accn]">http://www.ncbi.nlm.nih.gov/sra/SRX030577[accn]</a> (20 kb fragments)</td>
<td width="284"><b>Total: </b>2 runs, 1.9M spots, 800M bases, <a href="ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX030/SRX030577">1.8Gb</a></td>
</tr>
<tr>
<td width="284"><a href="http://www.ncbi.nlm.nih.gov/sra/SRX030578[accn]">http://www.ncbi.nlm.nih.gov/sra/SRX030578[accn]</a></td>
<td width="284"><b>Total: </b>3 runs, 4M spots, 2.2G bases, <a href="ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX030/SRX030578">4.6Gb</a></td>
</tr>
</tbody>
</table>
<p>I left the downloads running in the background.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/研读橡胶的基因组文章2877.png"><img class="alignnone size-full wp-image-314" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/研读橡胶的基因组文章2877.png" alt="研读橡胶的基因组文章2877" width="554" height="39" /></a></p>
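<p>The download links in the table above all follow the same directory pattern, so the list for a batch download can be generated rather than typed. Here is a minimal Python sketch that rebuilds those URLs from the experiment accessions. The pattern is inferred only from the four links in the table, the function name is my own, and this FTP layout may no longer be served by NCBI.</p>

```python
from typing import List

# Base path copied from the links in the table above.
SRA_BASE = "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra"


def experiment_ftp_url(accession: str) -> str:
    """Build the per-experiment directory URL: base/SRX/SRX030/SRX030575."""
    return f"{SRA_BASE}/{accession[:3]}/{accession[:6]}/{accession}"


# The four experiments listed under SRA020125.
accessions: List[str] = ["SRX030575", "SRX030576", "SRX030577", "SRX030578"]
urls = [experiment_ftp_url(accession) for accession in accessions]
for url in urls:
    print(url)
```

<p>The printed list can be saved to a text file and handed to whatever download tool is running in the background.</p>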
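<p>With these data in hand, we can start on a whole series of genome analyses!</p>
<p>But first, let us look at what this research group has produced.</p>
<p>First, they built a website for strawberry genome information:</p>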
<p><a href="https://strawberry.plantandfood.co.nz/">https://strawberry.plantandfood.co.nz/</a></p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/研读橡胶的基因组文章3091.png"><img class="alignnone size-full wp-image-315" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/研读橡胶的基因组文章3091.png" alt="研读橡胶的基因组文章3091" width="554" height="446" /></a></p>
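<p>It is much like what I worked on earlier for crucian carp and common carp at the Academy of Fishery Sciences.</p>
<p>All of their processed data can be downloaded there directly, and the data can also be visualized.</p>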
<p>&nbsp;</p>
<p>Its chromosomes are shown below. The layout is very simple, with just seven chromosomes.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/研读橡胶的基因组文章3106.png"><img class="alignnone size-full wp-image-316" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/研读橡胶的基因组文章3106.png" alt="研读橡胶的基因组文章3106" width="554" height="146" /></a></p>
<p>&nbsp;</p>
<p><a href="http://www.rosaceae.org/species/fragaria/fragaria_vesca/genome_v1.1">http://www.rosaceae.org/species/fragaria/fragaria_vesca/genome_v1.1</a></p>
<p>I found the location of the assembled strawberry genome and downloaded everything with a batch script.</p>
<p><a href="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/研读橡胶的基因组文章3287.png"><img class="alignnone size-full wp-image-308" src="http://www.bio-info-trainee.com/wp-content/uploads/2015/03/研读橡胶的基因组文章3287.png" alt="研读橡胶的基因组文章3287" width="553" height="240" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/318.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
