A closer look at how picard's MarkDuplicates removes PCR duplicate reads

ulwvfje — Sat, 12 Nov 2016 02:11:23 +0000

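This post follows directly on from the previous one, "A closer look at how samtools rmdup removes PCR duplicate reads".

As before, we look at single-end and paired-end sequencing separately, and compare the two tools.

First, for the single-end data, samtools reported: [bam_rmdupse_core] 25 / 53 = 0.4717 in library

The result I got with picard was: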

INFO 2016-11-12 09:48:29 MarkDuplicates Read 53 records. 0 pairs never matched.
INFO 2016-11-12 09:48:31 MarkDuplicates After buildSortedReadEndLists freeMemory: 248541856; totalMemory: 3887595520; maxMemory: 57266405376
INFO 2016-11-12 09:48:31 MarkDuplicates Will retain up to 1789575168 duplicate indices before spilling to disk.
INFO 2016-11-12 09:49:14 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2016-11-12 09:49:15 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2016-11-12 09:49:15 MarkDuplicates Sorting list of duplicate records.
INFO 2016-11-12 09:54:35 MarkDuplicates After generateDuplicateIndexes freeMemory: 3885082288; totalMemory: 18204327936; maxMemory: 57266405376
INFO 2016-11-12 09:54:35 MarkDuplicates Marking 25 records as duplicates.
INFO 2016-11-12 09:54:35 MarkDuplicates Found 0 optical duplicate clusters.

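There appears to be no difference: both tools flag 25 of the 53 records as duplicates. The drawback of a Java tool like this, though, is that it is painfully slow: going by the timestamps in the log above, this 53-record run took about six minutes.

picard also has no separate parameter for single-end versus paired-end data; the same command handles both.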
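The post does not show the exact picard command that was run. As a rough sketch, a typical MarkDuplicates invocation can be assembled as below; the jar location and all file names are placeholders I chose for illustration, and the same argument list would be used for a single-end or a paired-end BAM.

```python
def mark_duplicates_command(input_bam, output_bam, metrics_file,
                            picard_jar="picard.jar", remove=False):
    """Build a picard MarkDuplicates command line (KEY=value argument style).

    picard_jar and the file names are hypothetical; adjust them to your setup.
    """
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"I={input_bam}",        # coordinate-sorted input BAM
        f"O={output_bam}",       # output BAM with duplicates flagged
        f"M={metrics_file}",     # duplication metrics report
        # false: only set the duplicate flag; true: drop the reads entirely
        f"REMOVE_DUPLICATES={'true' if remove else 'false'}",
    ]


cmd = mark_duplicates_command("tmp.sorted.bam", "tmp.markdup.bam", "tmp.metrics.txt")
print(" ".join(cmd))
```

By default the duplicates are only flagged, not deleted, so setting `remove=True` is what makes the output comparable to what samtools rmdup writes.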

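Next I tested the paired-end data. Again there was no difference: both tools removed 4 reads, though that may be because my test data set is too small.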

INFO 2016-11-12 09:57:45 MarkDuplicates Read 30 records. 3 pairs never matched.
INFO 2016-11-12 09:57:47 MarkDuplicates After buildSortedReadEndLists freeMemory: 248541896; totalMemory: 3887595520; maxMemory: 57266405376
INFO 2016-11-12 09:57:47 MarkDuplicates Will retain up to 1789575168 duplicate indices before spilling to disk.
INFO 2016-11-12 09:58:26 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2016-11-12 09:58:26 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2016-11-12 09:58:26 MarkDuplicates Sorting list of duplicate records.
INFO 2016-11-12 10:02:59 MarkDuplicates After generateDuplicateIndexes freeMemory: 3885083112; totalMemory: 18204327936; maxMemory: 57266405376
INFO 2016-11-12 10:02:59 MarkDuplicates Marking 4 records as duplicates.
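Only two lines of this log matter for the comparison: the number of records read and the number marked as duplicates. A small sketch for pulling them out follows; the function name is my own, and the patterns simply mirror the log text shown in this post.

```python
import re


def picard_duplicate_summary(log_text):
    """Return (records_read, records_marked) parsed from MarkDuplicates log text."""
    read = re.search(r"MarkDuplicates Read (\d+) records", log_text)
    marked = re.search(r"Marking (\d+) records as duplicates", log_text)
    if read is None or marked is None:
        raise ValueError("expected MarkDuplicates log lines not found")
    return int(read.group(1)), int(marked.group(1))


log = """\
INFO 2016-11-12 09:57:45 MarkDuplicates Read 30 records. 3 pairs never matched.
INFO 2016-11-12 10:02:59 MarkDuplicates Marking 4 records as duplicates.
"""
total, dup = picard_duplicate_summary(log)
print(total, dup, round(dup / total, 4))
```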

The test data is available for download; the archive contains both the scripts and the test data: http://www.biotrainee.com/jmzeng/rmDuplicate.zip

A closer look at how samtools rmdup removes PCR duplicate reads

ulwvfje — Sat, 12 Nov 2016 01:51:30 +0000

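Before removing PCR duplicate reads, you need to understand why you are doing it at all. WGS? WES? RNA-seq? ChIP-seq? Do they all need it? Is it only needed for sequencing of randomly fragmented libraries, and not for targeted capture?

Once that is clear, we can begin. First, take the alignment of a small single-end sequencing data set as a test: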

samtools rmdup -s tmp.sorted.bam tmp.rmdup.bam

[bam_rmdupse_core] 25 / 53 = 0.4717 in library

Our test data contains 53 records; the tool judged 25 of them to be PCR duplicates and removed them.
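The summary line is just removed / total. A minimal sketch that parses it and recomputes the fraction is shown here; the function name is hypothetical, and the pattern follows the line exactly as printed in this post.

```python
import re


def parse_rmdup_line(line):
    """Return (removed, total) from a samtools rmdup summary line."""
    m = re.search(r"\[bam_rmdup(?:se)?_core\] (\d+) / (\d+) = [\d.]+ in library", line)
    if m is None:
        raise ValueError("not an rmdup summary line")
    return int(m.group(1)), int(m.group(2))


removed, total = parse_rmdup_line("[bam_rmdupse_core] 25 / 53 = 0.4717 in library")
print(removed, total, round(removed / total, 4))
```

The same pattern also accepts the paired-end `[bam_rmdup_core]` form of the line.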

The official samtools rmdup documentation is at: http://www.htslib.org/doc/samtools.html

samtools rmdup [-sS]

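Turning on the -s option is all it takes to remove PCR duplicates from single-end data. Deduplicating single-end reads is quite simple: the alignment flag can only be 0, 4 or 16, so reads only need to map to the same start and end coordinates on the chromosome, and their flags will readily match as well. Paired-end data is a bit more complicated.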
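The single-end rule as described above can be sketched in a few lines. This illustrates the idea only and is not samtools's actual implementation: the `Read` tuple and its values are made up, and keeping the read with the highest mapping quality is my assumption, borrowed from the rmdup documentation.

```python
from collections import namedtuple

# Hypothetical minimal read record; real BAM records carry many more fields.
Read = namedtuple("Read", "name chrom start end flag mapq")


def dedup_single_end(reads):
    """Keep one read per (chrom, start, end, strand) group, preferring higher MAPQ."""
    best = {}
    for r in reads:
        if r.flag & 4:                      # unmapped: no coordinates to compare
            continue
        key = (r.chrom, r.start, r.end, r.flag & 16)   # flag bit 16 = reverse strand
        if key not in best or r.mapq > best[key].mapq:
            best[key] = r
    kept = {r.name for r in best.values()}
    # Preserve input order; unmapped reads pass through untouched.
    return [r for r in reads if r.flag & 4 or r.name in kept]


reads = [
    Read("r1", "chr10", 100, 150, 0, 60),
    Read("r2", "chr10", 100, 150, 0, 30),    # same span and strand as r1: duplicate
    Read("r3", "chr10", 100, 150, 16, 60),   # same span, opposite strand: kept
    Read("r4", "chr10", 200, 250, 0, 60),
]
print([r.name for r in dedup_single_end(reads)])
```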

Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality. In the paired-end mode, this command ONLY works with FR orientation and requires ISIZE is correctly set. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads).
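To make "identical external coordinates" concrete, here is a rough sketch that derives the outer span of an FR pair from the leftmost mate's position and ISIZE (TLEN), then keeps the best pair per span. The `Mate` tuple, the insert sizes and the use of a single MAPQ per pair are assumptions for illustration, not how samtools represents pairs internally.

```python
from collections import namedtuple

# Hypothetical record for the leftmost mate of a pair (the one with positive TLEN).
Mate = namedtuple("Mate", "name chrom pos tlen mapq")


def external_coords(mate):
    """Outer span of the pair: leftmost base to rightmost base, inclusive."""
    return (mate.chrom, mate.pos, mate.pos + mate.tlen - 1)


def dedup_pairs(leftmost_mates):
    """Return names of pairs kept: one per external span, preferring higher MAPQ."""
    best = {}
    for m in leftmost_mates:
        if m.tlen <= 0:          # ISIZE not set, or not the leftmost mate: skip
            continue
        key = external_coords(m)
        if key not in best or m.mapq > best[key].mapq:
            best[key] = m
    return sorted(m.name for m in best.values())


mates = [
    Mate("p1", "chr10", 94741, 300, 60),
    Mate("p2", "chr10", 94741, 300, 20),   # same outer span as p1: duplicate
    Mate("p3", "chr10", 94741, 280, 60),   # same start, different end: kept
    Mate("p4", "chr10", 95000, 300, 60),
]
print(dedup_pairs(mates))
```

This also shows why a correctly set ISIZE matters: without it the right-hand end of the span cannot be computed from one mate alone.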

Next, take a small paired-end data set and test it:

samtools rmdup tmp.sorted.bam tmp.rmdup.bam

[bam_rmdup_core] processing reference chr10...

[bam_rmdup_core] 2 / 12 = 0.1667 in library

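Clearly, removing PCR duplicates takes more than reads mapping to the same start and end coordinates; the flag matters in particular, and paired-end data has a whole host of flag combinations. That is why none of the 5 reads at our coordinate 94741 was removed.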
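To see why paired-end flags vary so much more than the single-end values 0, 4 and 16, it helps to decode them bit by bit. The bit values below come from the SAM format specification; the short labels are my own.

```python
# SAM flag bits (values per the SAM specification; label names are informal).
SAM_FLAG_BITS = [
    (0x1, "paired"), (0x2, "proper_pair"), (0x4, "unmapped"),
    (0x8, "mate_unmapped"), (0x10, "reverse"), (0x20, "mate_reverse"),
    (0x40, "first_in_pair"), (0x80, "second_in_pair"), (0x100, "secondary"),
    (0x200, "qc_fail"), (0x400, "duplicate"), (0x800, "supplementary"),
]


def decode_flag(flag):
    """List the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


for flag in (0, 4, 16, 99, 147, 1123):
    print(flag, decode_flag(flag))
```

Flag 1123 is 99 with the duplicate bit (0x400) added, which is how a tool that marks rather than removes duplicates records its decision.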

So for paired-end data samtools rmdup performs poorly, which is why many people recommend the MarkDuplicates function of the picard tools instead.

The optimal solution depends on many factors - the consensus seems to be that picard MarkDuplicates could be the best current solution.

The appropriateness of duplicate removal depends on coverage - one would want to only remove artificial duplicates and keep the natural duplicates.

MarkDuplicates is "more correct" in the strict sense. Rmdup is more efficient simply because it does not handle those tough cases. Rmdup works for single-end, too, but it cannot do paired-end and single-end at the same time. It does not work properly for mate-pair reads if read lengths are different.
