使用trimmomatic对illumina数据做质控-去接头还有去除低质量碱基

ulwvfje — Sat, 22 Oct 2016 02:50:42 +0000

因为一直拿到的是公司给的特别好的数据，所以没太关注质控这个问题，最近拿到了raw data，才发现其实里面的门道挺多的。前面都是用cutadapt这个python软件来去除接头的，但是它有一个弊端，需要自己指定接头文件。正好朋友推荐了trimmomatic，是java软件，所以直接Google找到其官网，然后下载二进制版本解压即可使用！

反正对我的illumina测序数据来说，直接用它就可以把raw data 变成 clean data啦！

这个软件设计就是为了illumina的测序数据的，因为它自带的adaptor文件有限，上图可以看到！而且一般只去除TruSeq Universal Adapter 这个接头，运行的时候，不报错才算是成功的！

官网有例子，很简单的：http://www.usadellab.org/cms/?page=trimmomatic

Paired End:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 ## 所以只需要把参数放对位置即可！

This will perform the following:

Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
Remove leading low quality or N bases (below quality 3) (LEADING:3)
Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
Drop reads below the 36 bases long (MINLEN:36)

一般就使用这个默认参数就好啦，处理的时间会有一点慢，我取了10个线程也得十几分钟才搞定2G的fq.gz压缩格式的测序文件，文件的log日志如下：

TrimmomaticPE: Started with arguments:

-threads 10 -phred33 -trimlog tmp.log CHG006373_R1.fastq.gz CHG006373_R2.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:/home/jmzeng//biosoft/trimmomatic/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 LEADING:10 TRAILING:20 SLIDINGWINDOW:4:25 MINLEN:36

Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'

ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences

Input Read Pairs: 21427010 Both Surviving: 14507723 (67.71%) Forward Only Surviving: 5297811 (24.72%) Reverse Only Surviving: 375547 (1.75%) Dropped: 1245929 (5.81%)

TrimmomaticPE: Completed successfully

记住指定接头文件一定要用全路径哦！！！

可以看到它使用了自带的文件TruSeq3-PE.fa里面的接头 TACACTCTTTCCCTACACGACGCTCTTCCGATCT其实只是 TruSeq Universal Adapter (可以在https://github.com/csf-ngs/fastqc/blob/master/Contaminants/contaminant_list.txt 找到接头信息)的后半段，直接在R1测序文件里面搜索可以看到，距离AAAAAAAAAAAAATTTTTTTTTTTTTTTTT这样的字符串和它的接头 TACACTCTTTCCCTACACGACGCTCTTCCGATCT之间还有序列：

比如我们拿第一个序列举例，可以看到第一条序列被trimmomatic丢到了output_forward_unpaired.fq.gz，它就懒得给它去除接头了，因为右端序列更可怜！

检查文件，发现有的地方是根据质量值来去除的，因为跟接头没有半毛钱关系！

因为它是接头和低质量碱基一起去除，我很难探究它到底是如何去除接头的，非常郁闷，但是它对illumina的数据效果非常好！因为去除的百分比很高。

用cutadapt软件来对双端测序数据去除接头

ulwvfje — Thu, 06 Oct 2016 14:32:25 +0000

一般来讲，我们对测序数据进行QC，就三个大的方向：Quality trimming， Adapter removal， Contaminant filtering，当我们是双端测序数据的时候，去除接头时，也会丢掉太短的reads，就容易导致左右两端测序文件reads数量不平衡，有一个比较好的软件能解决这个问题，我比较喜欢的是cutadapt软件的PE模式来去除接头！尤其是做基因组或者转录组de novo 组装的时候，尤其要去掉接头，去的干干净净！
cutadapt是经典的python软件，但是因为我的linux服务器有点问题，可能是root权限问题，没有用pip install cutadapt 安装成功，我懒得搞这些了，其实可以自己去下载cutadapt的源码，然后进入源码文件夹里面 python setup.py install --user 到自己的 ~/.local/bin下面。
所以我用conda安装了cutadapt软件，http://www.bio-info-trainee.com/1906.html 所以我需要 python ~/miniconda2/pkgs/cutadapt-1.10-py27_0/bin/cutadapt --help 才能调用这个软件，不过，问题不大，我也就是试用一下。

首先用fastqc软件对测序数据进行简单检测，看看有什么接头：

既然fastqc能探测到你的接头，说明它里面内置了所有的接头序列，在github可以查到：https://github.com/csf-ngs/fastqc/blob/master/Contaminants/contaminant_list.txt

或者：Download common Illumina adapters from https://github.com/vsbuffalo/scythe/blob/master/illumina_adapters.fa

TruSeq Universal Adapter AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

Illumina Small RNA 3p Adapter 1 ATCTCGTATGCCGTCTTCTGCTTG

最严重的一般是TruSeq Universal Adapter ，而且它检测到你其它的接头可能就是这个 TruSeq Universal Adapter 的一部分而已~！我们可以先用cutadapt去除试一下

cutadapt软件支持对PE 测序数据的处理，基本的用法是：

cutadapt -a ADAPTER_FWD -A ADAPTER_REV -o out.1.fastq -p out.2.fastq reads.1.fastq reads.2.fastq

-a和-A是左右端测序数据的3端接头，-g和-G是左右端测序数据的5端接头。

支持fastq和fasta格式的gz压缩文件，必要时用-f参数指定测序文件数据格式即可。

我自己的实际例子如下：

python ~/miniconda2/pkgs/cutadapt-1.10-py27_0/bin/cutadapt \

-a AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT -a ATCTCGTATGCCGTCTTCTGCTTG \

-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -A CAAGCAGAAGACGGCATACGAGAT \

-e 0.1 -O 5 -m 50 -n 2 --pair-filter=both \

-o read_PE1.fq -p read_PE2.fq test1.fastq test2.fastq >& log.txt

参数解释如下：

两个-a 参数后面接的是两种接头，两个-A参加后面接的是同样的两个接头的反向互补序列！

加上 -n 2 是因为有两个接头，我的测序数据里面有可能一个read里面含有两种接头，所以需要检测两次。

-e 0.1 -O 5 -m 50 是标准参数，自己看readme就好了，其中-m 设置为50 是表示去除接头后如果read长度小于50我就不要了，因为我是PE150测序的，这也就是为什么要用PE模式来去除接头，保证过滤后的reads还是数量继续平衡的。

去除了接头之后的文件再用fastqc软件跑一下，很明显可以看到接头去除成功啦！

当然，这个软件还有很多其它的功能，我就不一一介绍了：参考：http://cutadapt.readthedocs.io/en/stable/guide.html

帮助文档还是蛮长的：

cutadapt version 1.9.1

cutadapt removes adapter sequences from high-throughput sequencing reads.

Usage:

cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

For paired-end reads:

cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq

Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard

characters are supported. The reverse complement is *not* automatically

searched. All reads from input.fastq will be written to output.fastq with the

adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter

sequences can be given (use further -a options), but only the best-matching

adapter will be removed.

Input may also be in FASTA format. Compressed input and output is supported and

auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for

standard input/output. Without the -o option, output is sent to standard output.

Citation:

Marcel Martin. Cutadapt removes adapter sequences from high-throughput

sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011.

http://dx.doi.org/10.14806/ej.17.1.200

Use "cutadapt --help" to see all command-line options.

See http://cutadapt.readthedocs.org/ for full documentation.

Options:

--version show program's version number and exit

-h, --help show this help message and exit

--debug Print debugging information.

-f FORMAT, --format=FORMAT

Input file format; can be either 'fasta', 'fastq' or

'sra-fastq'. Ignored when reading csfasta/qual files

(default: auto-detect from file name extension).

Options that influence how the adapters are found:

Each of the three parameters -a, -b, -g can be used multiple times and

in any combination to search for an entire set of adapters of possibly

different types. Only the best matching adapter is trimmed from each

read (but see the --times option). Instead of giving an adapter

directly, you can also write file:FILE and the adapter sequences will

be read from the given FASTA FILE.

-a ADAPTER, --adapter=ADAPTER

Sequence of an adapter that was ligated to the 3' end.

The adapter itself and anything that follows is

trimmed. If the adapter sequence ends with the '$'

character, the adapter is anchored to the end of the

read and only found if it is a suffix of the read.

-g ADAPTER, --front=ADAPTER

Sequence of an adapter that was ligated to the 5' end.

If the adapter sequence starts with the character '^',

the adapter is 'anchored'. An anchored adapter must

appear in its entirety at the 5' end of the read (it

is a prefix of the read). A non-anchored adapter may

appear partially at the 5' end, or it may occur within

the read. If it is found within a read, the sequence

preceding the adapter is also trimmed. In all cases,

the adapter itself is trimmed.

-b ADAPTER, --anywhere=ADAPTER

Sequence of an adapter that was ligated to the 5' or

3' end. If the adapter is found within the read or

overlapping the 3' end of the read, the behavior is

the same as for the -a option. If the adapter overlaps

the 5' end (beginning of the read), the initial

portion of the read matching the adapter is trimmed,

but anything that follows is kept.

-e ERROR_RATE, --error-rate=ERROR_RATE

Maximum allowed error rate (no. of errors divided by

the length of the matching region) (default: 0.1)

--no-indels Do not allow indels in the alignments (allow only

mismatches). (default: allow both mismatches and

indels)

-n COUNT, --times=COUNT

Remove up to COUNT adapters from each read (default:

-O LENGTH, --overlap=LENGTH

Minimum overlap length. If the overlap between the

read and the adapter is shorter than LENGTH, the read

is not modified. This reduces the no. of bases trimmed

purely due to short random adapter matches (default:

3).

--match-read-wildcards

Allow IUPAC wildcards in reads (default: False).

-N, --no-match-adapter-wildcards

Do not interpret IUPAC wildcards in adapters.

--no-trim Match and redirect reads to output/untrimmed-output as

usual, but do not remove adapters.

--mask-adapter Mask adapters with 'N' characters instead of trimming

them.

Additional read modifications:

-u LENGTH, --cut=LENGTH

Remove LENGTH bases from the beginning or end of each

read. If LENGTH is positive, bases are removed from

the beginning of each read. If LENGTH is negative,

bases are removed from the end of each read. This

option can be specified twice if the LENGTHs have

different signs.

-q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF

Trim low-quality bases from 5' and/or 3' ends of reads

before adapter removal. If one value is given, only

the 3' end is trimmed. If two comma-separated cutoffs

are given, the 5' end is trimmed with the first

cutoff, the 3' end with the second. See documentation

for the algorithm. (default: no trimming)

--quality-base=QUALITY_BASE

Assume that quality values in FASTQ are encoded as

ascii(quality + QUALITY_BASE). This needs to be set to

64 for some old Illumina FASTQ files. Default: 33

--trim-n Trim N's on ends of reads.

-x PREFIX, --prefix=PREFIX

Add this prefix to read names. Use {name} to insert

the name of the matching adapter.

-y SUFFIX, --suffix=SUFFIX

Add this suffix to read names; can also include {name}

--strip-suffix=STRIP_SUFFIX

Remove this suffix from read names if present. Can be

given multiple times.

--length-tag=TAG Search for TAG followed by a decimal number in the

description field of the read. Replace the decimal

number with the correct length of the trimmed read.

For example, use --length-tag 'length=' to correct

fields like 'length=123'.

Options for filtering of processed reads:

--discard-trimmed, --discard

Discard reads that contain an adapter. Also use -O to

avoid discarding too many randomly matching reads!

--discard-untrimmed, --trimmed-only

Discard reads that do not contain the adapter.

-m LENGTH, --minimum-length=LENGTH

Discard trimmed reads that are shorter than LENGTH.

Reads that are too short even before adapter removal

are also discarded. In colorspace, an initial primer

is not counted (default: 0).

-M LENGTH, --maximum-length=LENGTH

Discard trimmed reads that are longer than LENGTH.

Reads that are too long even before adapter removal

are also discarded. In colorspace, an initial primer

is not counted (default: no limit).

--max-n=COUNT Discard reads with too many N bases. If COUNT is an

integer, it is treated as the absolute number of N

bases. If it is between 0 and 1, it is treated as the

proportion of N's allowed in a read.

Options that influence what gets output to where:

--quiet Do not print a report at the end.

-o FILE, --output=FILE

Write modified reads to FILE. FASTQ or FASTA format is

chosen depending on input. The summary report is sent

to standard output. Use '{name}' in FILE to

demultiplex reads into multiple files. (default:

trimmed reads are written to standard output)

--info-file=FILE Write information about each read and its adapter

matches into FILE. See the documentation for the file

format.

-r FILE, --rest-file=FILE

When the adapter matches in the middle of a read,

write the rest (after the adapter) into FILE.

--wildcard-file=FILE

When the adapter has N bases (wildcards), write

adapter bases matching wildcard positions to FILE.

When there are indels in the alignment, this will

often not be accurate.

--too-short-output=FILE

Write reads that are too short (according to length

specified by -m) to FILE. (default: discard reads)

--too-long-output=FILE

Write reads that are too long (according to length

specified by -M) to FILE. (default: discard reads)

--untrimmed-output=FILE

Write reads that do not contain the adapter to FILE.

(default: output to same file as trimmed reads)

Colorspace options:

-c, --colorspace Enable colorspace mode: Also trim the color that is

adjacent to the found adapter.

-d, --double-encode

Double-encode colors (map 0,1,2,3,4 to A,C,G,T,N).

-t, --trim-primer Trim primer base and the first color (which is the

transition to the first nucleotide)

--strip-f3 Strip the _F3 suffix of read names

--maq, --bwa MAQ- and BWA-compatible colorspace output. This

enables -c, -d, -t, --strip-f3 and -y '/1'.

--no-zero-cap Do not change negative quality values to zero in

colorspace data. By default, they are changed to zero

since many tools have problems with negative

qualities.

-z, --zero-cap Change negative quality values to zero. This is

enabled by default when -c/--colorspace is also

enabled. Use the above option to disable it.

Paired-end options:

The -A/-G/-B/-U options work like their -a/-b/-g/-u counterparts.

-A ADAPTER 3' adapter to be removed from second read in a pair.

-G ADAPTER 5' adapter to be removed from second read in a pair.

-B ADAPTER 5'/3 adapter to be removed from second read in a pair.

-U LENGTH Remove LENGTH bases from the beginning or end of each

second read (see --cut).

-p FILE, --paired-output=FILE

Write second read in a pair to FILE.

--pair-filter=(any|both)

Which of the reads in a paired-end read have to match

the filtering criterion in order for it to be

filtered. Default: any.

--interleaved Read and write interleaved paired-end reads.

--untrimmed-paired-output=FILE

Write second read in a pair to this FILE when no

adapter was found in the first read. Use this option

together with --untrimmed-output when trimming paired-

end reads. (Default: output to same file as trimmed

reads.)

--too-short-paired-output=FILE

Write second read in a pair to this file if pair is

too short. Use together with --too-short-output.

--too-long-paired-output=FILE

Write second read in a pair to this file if pair is

too long. Use together with --too-long-output.

I think assembly is one of the things were you definitely want to remove adapters, if not you'll have spurious overlaps creating erroneous contigs/transcripts, just because they share an adapter.

去接头对基因组或者转录组组装是非常重要的：

http://www.opiniomics.org/we-need-to-stop-making-this-simple-fcking-mistake/

http://grahametherington.blogspot.jp/2014/09/why-you-should-qc-your-reads-and-your.html

生信菜鸟团 » 接头

使用trimmomatic对illumina数据做质控-去接头还有去除低质量碱基

用cutadapt软件来对双端测序数据去除接头