<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>生信菜鸟团 &#187; picard</title>
	<atom:link href="http://www.bio-info-trainee.com/tag/picard/feed" rel="self" type="application/rss+xml" />
	<link>http://www.bio-info-trainee.com</link>
	<description>Join the discussion on the forum biotrainee.com, or follow the WeChat official account of the same name, biotrainee</description>
	<lastBuildDate>Sat, 28 Jun 2025 14:30:13 +0000</lastBuildDate>
	<language>zh-CN</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.1.33</generator>
	<item>
		<title>A close look at how picard's MarkDuplicates removes PCR-duplicate reads</title>
		<link>http://www.bio-info-trainee.com/2008.html</link>
		<comments>http://www.bio-info-trainee.com/2008.html#comments</comments>
		<pubDate>Sat, 12 Nov 2016 02:11:23 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Basic software]]></category>
		<category><![CDATA[Bioinformatics basics]]></category>
		<category><![CDATA[MarkDuplicates]]></category>
		<category><![CDATA[pcr]]></category>
		<category><![CDATA[picard]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2008</guid>
		<description><![CDATA[This post follows the previous one, a close look at how samtools rmdup removes PCR-duplicate reads &#8230; <a href="http://www.bio-info-trainee.com/2008.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>This post follows directly on the previous one, <a title="Read more: A close look at how samtools rmdup removes PCR-duplicate reads" href="http://www.bio-info-trainee.com/2003.html" rel="bookmark">A close look at how samtools rmdup removes PCR-duplicate reads</a>.</p>
<p>As before, we test single-end and paired-end data separately and compare the two tools!</p>
<p>First, for the single-end data, samtools reports: [bam_rmdupse_core] 25 / 53 = 0.4717 in library<span id="more-2008"></span></p>
<p>With picard I got:</p>
<blockquote><p>INFO 2016-11-12 09:48:29 MarkDuplicates <strong><span style="color: #ff00ff;">Read 53 records. 0 pairs never matched.</span></strong><br />
INFO 2016-11-12 09:48:31 MarkDuplicates After buildSortedReadEndLists freeMemory: 248541856; totalMemory: 3887595520; maxMemory: 57266405376<br />
INFO 2016-11-12 09:48:31 MarkDuplicates Will retain up to 1789575168 duplicate indices before spilling to disk.<br />
INFO 2016-11-12 09:49:14 MarkDuplicates Traversing read pair information and detecting duplicates.<br />
INFO 2016-11-12 09:49:15 MarkDuplicates Traversing fragment information and detecting duplicates.<br />
INFO 2016-11-12 09:49:15 MarkDuplicates Sorting list of duplicate records.<br />
INFO 2016-11-12 09:54:35 MarkDuplicates After generateDuplicateIndexes freeMemory: 3885082288; totalMemory: 18204327936; maxMemory: 57266405376<br />
INFO 2016-11-12 09:54:35 MarkDuplicates <span style="color: #ff00ff;"><strong>Marking 25 records as duplicates.</strong></span><br />
INFO 2016-11-12 09:54:35 MarkDuplicates Found 0 optical duplicate clusters.</p>
<p>&nbsp;</p></blockquote>
<p>No difference at all: the duplicates found are identical. The downside of this Java tool is that it is painfully slow~~~~</p>
<p>Also, picard has no separate options for single-end versus paired-end data; the same command works for both!</p>
<p>Next I tested the paired-end data. Again there was no difference: both removed the same 4 reads. Perhaps my test data set is simply too small.</p>
<blockquote><p>INFO 2016-11-12 09:57:45 MarkDuplicates<strong><span style="color: #ff00ff;"> Read 30 records. 3 pairs never matched.</span></strong><br />
INFO 2016-11-12 09:57:47 MarkDuplicates After buildSortedReadEndLists freeMemory: 248541896; totalMemory: 3887595520; maxMemory: 57266405376<br />
INFO 2016-11-12 09:57:47 MarkDuplicates Will retain up to 1789575168 duplicate indices before spilling to disk.<br />
INFO 2016-11-12 09:58:26 MarkDuplicates Traversing read pair information and detecting duplicates.<br />
INFO 2016-11-12 09:58:26 MarkDuplicates Traversing fragment information and detecting duplicates.<br />
INFO 2016-11-12 09:58:26 MarkDuplicates Sorting list of duplicate records.<br />
INFO 2016-11-12 10:02:59 MarkDuplicates After generateDuplicateIndexes freeMemory: 3885083112; totalMemory: 18204327936; maxMemory: 57266405376<br />
INFO 2016-11-12 10:02:59 MarkDuplicates <strong><span style="color: #ff00ff;">Marking 4 records as duplicates.</span></strong></p>
<p>&nbsp;</p></blockquote>
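<p>As an aside, the duplicate rate can be read straight off these INFO lines. A minimal parsing sketch (only the log wording quoted above is assumed):</p>

```python
import re

def duplicate_rate(log_text):
    """Extract the duplicate rate from MarkDuplicates INFO lines
    like 'Read 53 records.' and 'Marking 25 records as duplicates.'"""
    read = re.search(r"Read (\d+) records", log_text)
    marked = re.search(r"Marking (\d+) records as duplicates", log_text)
    return int(marked.group(1)) / int(read.group(1))

log = """INFO MarkDuplicates Read 53 records. 0 pairs never matched.
INFO MarkDuplicates Marking 25 records as duplicates."""
print(round(duplicate_rate(log), 4))  # prints 0.4717, matching samtools rmdup
```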
<p>The test data, including the scripts, can be downloaded here: <a href="http://www.biotrainee.com/jmzeng/rmDuplicate.zip" target="_blank">http://www.biotrainee.com/jmzeng/rmDuplicate.zip</a></p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2008.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A close look at how samtools rmdup removes PCR-duplicate reads</title>
		<link>http://www.bio-info-trainee.com/2003.html</link>
		<comments>http://www.bio-info-trainee.com/2003.html#comments</comments>
		<pubDate>Sat, 12 Nov 2016 01:51:30 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Basic software]]></category>
		<category><![CDATA[Bioinformatics basics]]></category>
		<category><![CDATA[pcr]]></category>
		<category><![CDATA[picard]]></category>
		<category><![CDATA[rmdup]]></category>
		<category><![CDATA[samtools]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=2003</guid>
		<description><![CDATA[Before removing PCR-duplicate reads you must understand why you are doing it. WGS? WES? &#8230; <a href="http://www.bio-info-trainee.com/2003.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<div>Before removing PCR-duplicate reads you must understand why you are doing it. WGS? WES? RNA-seq? ChIP-seq? Do they all need it? Only libraries made by random fragmentation? Not targeted capture?</div>
<div>Once that is clear, we start. First, a test on the alignment of a small single-end data set!</div>
<div>samtools rmdup -s tmp.sorted.bam tmp.rmdup.bam</div>
<div>[bam_rmdupse_core] 25 / 53 = 0.4717 in library</div>
<div>Our test data contains 53 records; the software decided that 25 of the reads are PCR duplicates, and removed them!</div>
<div></div>
<p><span id="more-2003"></span></p>
<div><a href="http://www.bio-info-trainee.com/wp-content/uploads/2016/11/1.png"><img class="alignnone size-full wp-image-2005" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/11/1.png" alt="1" width="981" height="804" /></a></div>
<div>The official samtools rmdup documentation is at <a href="http://www.htslib.org/doc/samtools.html">http://www.htslib.org/doc/samtools.html</a></div>
<div>samtools rmdup [-sS] &lt;input.srt.bam&gt; &lt;out.bam&gt;</div>
<div>Adding the -s flag is all it takes to remove PCR duplicates from single-end data. Single-end deduplication is actually simple~, because the only possible flag values are 0, 4 and 16, so the reads just need identical start and end coordinates on the chromosome, and matching flags come easily. Paired-end data is a bit trickier~</div>
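<div>The single-end rule just described can be sketched in a few lines. This is only an illustration of the idea, not samtools' actual implementation (real tools key on the unclipped 5' end, which this toy version ignores):</div>

```python
from collections import defaultdict

def mark_duplicates_se(reads):
    """reads: dicts with keys chrom, flag, pos, mapq, name.
    Reads sharing chromosome, strand and position form one duplicate
    group; keep the highest-mapq read and mark the rest."""
    groups = defaultdict(list)
    for r in reads:
        if r["flag"] & 4:            # flag 4: unmapped, never a duplicate
            continue
        strand = r["flag"] & 16      # 0 = forward strand, 16 = reverse
        groups[(r["chrom"], strand, r["pos"])].append(r)
    duplicates = set()
    for members in groups.values():
        members.sort(key=lambda r: r["mapq"], reverse=True)
        duplicates.update(r["name"] for r in members[1:])
    return duplicates
```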
<div>Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality. In the paired-end mode, this command ONLY works with FR orientation and requires ISIZE is correctly set. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads).</div>
<div></div>
<div>Then we test a small paired-end data set:</div>
<div>samtools rmdup tmp.sorted.bam tmp.rmdup.bam</div>
<div>[bam_rmdup_core] processing reference chr10...</div>
<div>[bam_rmdup_core] 2 / 12 = 0.1667 in library</div>
<div><img class="alignnone size-full wp-image-2004" src="http://www.bio-info-trainee.com/wp-content/uploads/2016/11/2.png" alt="2" width="1126" height="850" /></div>
<div>Clearly, removing a PCR duplicate takes more than identical start and end coordinates on the chromosome. The flag matters too, and paired-end data produces a whole pile of flag values, which is why none of the 5 reads at coordinate 94741 were removed!</div>
<div>So for paired-end data samtools rmdup works poorly, which is why many people recommend picard's MarkDuplicates~~~</div>
<div>The optimal solution depends on many factors - the consensus seems to be that the picard MarkDuplicates could be the best current solution.</div>
<div></div>
<div>The appropriateness of duplicate removal depends on coverage - one would want to only remove artificial duplicates and keep the natural duplicates.</div>
<div></div>
<div>MarkDuplicates is "more correct" in the strict sense. Rmdup is more efficient simply because it does not handle those tough cases. Rmdup works for single-end, too, but it cannot do paired-end and single-end at the same time. It does not work properly for mate-pair reads if read lengths are different.</div>
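<div>The paired-end rule quoted from the manual (identical external coordinates, keep the highest-quality pair) can be sketched the same way; the field names below are hypothetical, chosen just for the illustration:</div>

```python
from collections import defaultdict

def dedup_pairs(pairs):
    """pairs: dicts with keys chrom, start, end, mapq_sum, name, where
    start/end are the outer (external) coordinates of the whole fragment.
    Pairs sharing all three coordinates are duplicates of one another;
    only the pair with the highest total mapping quality is retained."""
    groups = defaultdict(list)
    for p in pairs:
        groups[(p["chrom"], p["start"], p["end"])].append(p)
    kept = []
    for members in groups.values():
        members.sort(key=lambda p: p["mapq_sum"], reverse=True)
        kept.append(members[0]["name"])
    return sorted(kept)
```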
<div></div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/2003.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Notes on using GATK</title>
		<link>http://www.bio-info-trainee.com/838.html</link>
		<comments>http://www.bio-info-trainee.com/838.html#comments</comments>
		<pubDate>Mon, 06 Jul 2015 23:27:05 +0000</pubDate>
		<dc:creator><![CDATA[ulwvfje]]></dc:creator>
				<category><![CDATA[Bioinformatics basics]]></category>
		<category><![CDATA[bwa]]></category>
		<category><![CDATA[gatk]]></category>
		<category><![CDATA[picard]]></category>
		<category><![CDATA[samtools]]></category>
		<category><![CDATA[snp]]></category>

		<guid isPermaLink="false">http://www.bio-info-trainee.com/?p=838</guid>
		<description><![CDATA[GATK sees extremely heavy use for SNP calling. I had previously only glanced at SNPs in a rough way &#8230; <a href="http://www.bio-info-trainee.com/838.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>GATK sees extremely heavy use for SNP calling. I had previously only glanced at SNPs in a quick and rough way, so I never studied it in detail.</p>
<p>These days I am working on some exome projects focused on finding SNPs, so I decided to use it properly; it threw a great many errors and took a long time to debug into working.</p>
<p>So here are some notes to record!</p>
<p>GATK itself is copyright-protected, so you must apply before downloading and using it; go to the Broad Institute and apply yourself.</p>
<p>Once downloaded it can be used directly: Java software needs no installation, though your machine must have Java. But the software is only the start; the key point is that you also need to download a lot of companion data, <a href="https://software.broadinstitute.org/gatk/download/bundle" target="_blank">https://software.broadinstitute.org/gatk/download/bundle</a> (ps: this link may go stale; please google the files listed below yourself), and at this point you must be clear about your reference genome build!!! <span style="color: #ff6600;">b36/b37/hg18/hg19/hg38, and remember that b37 and hg19 are not completely identical; there are subtle differences!!!</span><br />
<span id="more-838"></span></p>
<p>For example, I chose hg19.</p>
<p>The first point is downloading hg19: there are many mirrors, the usual ones being NCBI, Ensembl and UCSC, but I recommend using this script to download it:</p>
<p>for i in $(seq 1 22) X Y M;</p>
<p>do echo $i;</p>
<p>wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz;</p>
<p>done</p>
<p>gunzip *.gz</p>
<p>for i in $(seq 1 22) X Y M;</p>
<p>do cat chr${i}.fa &gt;&gt; hg19.fasta;</p>
<p>done</p>
<p>rm -f chr*.fa</p>
<p>Anyone who can read shell will see that this downloads the hg19 chromosomes one by one and then concatenates them with cat in chromosome order. Some of GATK's later steps are absurdly picky about chromosome order, and if you download hg19 as a single file it is hard to guarantee the order 1-22, X, Y, M.</p>
<p>Then index the downloaded hg19 (with bwa and samtools) and build the dict file (with picard):</p>
<p>bwa index -a bwtsw hg19.fasta</p>
<p>samtools faidx hg19.fasta</p>
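<p>Whether the concatenated genome really has the required 1-22, X, Y, M order can be checked from the hg19.fasta.fai index that samtools faidx just wrote; its first column is the sequence name. A small sketch:</p>

```python
# Check that the reference's chromosome order is chr1-chr22, chrX, chrY, chrM
# by reading the first column of the .fai index produced by samtools faidx.
EXPECTED = ["chr%s" % i for i in list(range(1, 23)) + ["X", "Y", "M"]]

def fai_order_ok(fai_path):
    with open(fai_path) as fh:
        names = [line.split("\t")[0] for line in fh if line.strip()]
    return names == EXPECTED
```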
<p>Next a few reference files still need downloading; which ones is up to you.</p>
<p>For my hg19 that means fetching them from ftp://ftp.broadinstitute.org/bundle/hg19/.</p>
<p><strong><span style="color: #ff6600;">In the end, all the required files are as follows:</span></strong></p>
<p>231M Jul 2 05:14 1000G_phase1.indels.hg19.sites.vcf<br />
1.2M Jul 2 10:45 1000G_phase1.indels.hg19.sites.vcf.idx<br />
11G Jul 2 08:05 dbsnp_138.hg19.vcf<br />
2.5K Jul 1 04:31 hg19.dict<br />
3.0G Jun 30 21:29 hg19.fasta<br />
6.6K Jun 30 22:54 hg19.fasta.amb<br />
944 Jun 30 22:54 hg19.fasta.ann<br />
2.9G Jun 30 22:54 hg19.fasta.bwt<br />
788 Jul 2 01:53 hg19.fasta.fai<br />
739M Jun 30 22:54 hg19.fasta.pac<br />
1.5G Jun 30 23:23 hg19.fasta.sa<br />
87M Jul 2 05:37 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf<br />
2.3M Jul 2 10:45 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx</p>
<p>&nbsp;</p>
<p>Now we start running the pipeline.</p>
<p>Step 1 generates the SAM file: bwa mem -t 12 -M  hg19.fasta tmp*fq &gt;tmp.sam</p>
<p>Step 2 is sorting; I use the picard tool: java  -Xmx100g -jar AddOrReplaceReadGroups.jar I=tmp.sam  O=tmp.sorted.bam</p>
<p>SORT_ORDER=coordinate</p>
<p>CREATE_INDEX=true</p>
<p>RGID=tmp</p>
<p>RGLB="pe"</p>
<p>RGPU="HiSeq-2000"</p>
<p>RGSM=PC3-2</p>
<p>RGCN="Human Genetics of Infectious Disease"</p>
<p>RGDS=hg19 RGPL=illumina</p>
<p>VALIDATION_STRINGENCY=SILENT</p>
<p>Step 3 removes the PCR duplicates; I again choose the picard tool:</p>
<p>java  -Xmx100g  -jar MarkDuplicates.jar</p>
<p>CREATE_INDEX=true REMOVE_DUPLICATES=True</p>
<p>ASSUME_SORTED=True VALIDATION_STRINGENCY=LENIENT</p>
<p>I=tmp.sorted.bam OUTPUT=tmp.dedup.bam METRICS_FILE=tmp.metrics</p>
<p>Step 4: at last GATK itself, mainly to determine the regions to realign. This splits into three sub-steps:</p>
<p>First use RealignerTargetCreator to find the regions needing realignment, writing an intervals file:</p>
<p>java -Xmx200g -jar ~/apps/gatk/GenomeAnalysisTK.jar</p>
<p>-R hg19.fasta  # the same reference genome must be used here, so the reference is crucial, and the dict must be generated following the workflow</p>
<p>-T RealignerTargetCreator</p>
<p>-I tmp.dedup.bam -o tmp.intervals</p>
<p>-known /home/ldzeng/EXON/ref/1000G_phase1.indels.hg19.sites.vcf</p>
<p>This step seems to take forever.</p>
<p>&nbsp;</p>
<p>As you can see, I tested a mere 5014 reads in total, yet it took nearly half an hour to finish, and only 947 reads were filtered.</p>
<p>The output tmp.intervals is a file of 1,404,946 lines:</p>
<p>chr1:13957-13958</p>
<p>chr1:46402-46403</p>
<p>chr1:47190-47191</p>
<p>chr1:52185-52188</p>
<p>chr1:53234-53236</p>
<p>chr1:55249-55250</p>
<p>chr1:63735-63738</p>
<p>The human exome has only two or three hundred thousand exons, so for now I am not sure what this file really is!</p>
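<p>Whatever the intervals represent, the format itself is easy to work with; a small hypothetical helper for parsing lines like those above and totalling the flagged bases:</p>

```python
def parse_interval(line):
    """Parse one GATK intervals line such as 'chr1:13957-13958'
    into (chrom, start, end)."""
    chrom, span = line.strip().split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)

def total_bases(lines):
    """Sum the widths of all intervals (end - start + 1 per line)."""
    return sum(e - s + 1 for _, s, e in map(parse_interval, lines))
```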
<p>&nbsp;</p>
<p>Then use the output tmp.intervals as input for the realignment itself: IndelRealigner re-aligns the reads within those regions.</p>
<p>java -Xmx150g -jar ~/apps/gatk/GenomeAnalysisTK.jar \</p>
<p>-R hg19.fasta \</p>
<p>-T IndelRealigner \</p>
<p>-targetIntervals tmp.intervals \</p>
<p>-I tmp.dedup.bam -o tmp.dedup.realgn.bam \</p>
<p>-known /home/ldzeng/EXON/ref/1000G_phase1.indels.hg19.sites.vcf</p>
<p>&nbsp;</p>
<p>I only needed the realignment, so I made little use of the later functions; one calls SNPs, the other computes quality scores.</p>
<p>java -Xmx200g -jar ~/apps/gatk/GenomeAnalysisTK.jar</p>
<p>-nct 20 -T HaplotypeCaller -R hg19.fasta</p>
<p>-I tmp.dedup.realgn.bam</p>
<p>-o tmp.gatk.vcf</p>
<p>The final output files are:</p>
<p>639K Jul 5 10:17 tmp1.fq<br />
639K Jul 5 10:19 tmp2.fq<br />
1.5M Jul 5 10:26 tmp.dedup.bai<br />
403K Jul 5 10:26 tmp.dedup.bam<br />
12K Jul 5 12:02 tmp.gatk.vcf<br />
3.4K Jul 5 12:02 tmp.gatk.vcf.idx<br />
32M Jul 5 11:24 tmp.intervals<br />
950 Jul 5 10:26 tmp.metrics<br />
1.5M Jul 5 11:31 tmp.realgn.bai<br />
409K Jul 5 11:31 tmp.realgn.bam<br />
1.6M Jul 5 10:20 tmp.sam<br />
1.5M Jul 5 10:23 tmp.sorted.bai<br />
399K Jul 5 10:23 tmp.sorted.bam</p>
<p>&nbsp;</p>
<p>Note: GATK requires a dictionary file for the reference genome.</p>
<p>Generate it with CreateSequenceDictionary.jar from the picard toolkit. Taking hg19.fa as an example, the command is:</p>
<div>    java -Xmx2g -jar /path_to_picard/CreateSequenceDictionary.jar R=hg19.fa O=hg19.dict</div>
]]></content:encoded>
			<wfw:commentRss>http://www.bio-info-trainee.com/838.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
