Quality control of Illumina data with Trimmomatic: removing adapters and low-quality bases

ulwvfje — Sat, 22 Oct 2016 02:50:42 +0000

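Because the data I had been getting from sequencing companies was always very clean, I never paid much attention to quality control. Recently I received raw data and found there is quite a lot to it. Until now I had removed adapters with cutadapt, a Python tool, but it has one drawback: you have to specify the adapter sequences yourself. A friend recommended Trimmomatic, which is written in Java, so I googled its website, downloaded the binary release, and unzipped it. It is ready to use right away!

For my Illumina sequencing data, at least, it turns raw data into clean data in a single step!

The tool is designed specifically for Illumina sequencing data: it ships with only a small set of adapter files, as the screenshot above shows, and in typical use it only removes the TruSeq Universal Adapter. A run counts as successful only if it finishes without errors!

The official website has an example, and it is very simple: http://www.usadellab.org/cms/?page=trimmomatic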

Paired End:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 ## so all you need to do is put the arguments in the right positions!

This will perform the following:

Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
Remove leading low quality or N bases (below quality 3) (LEADING:3)
Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
Drop reads shorter than 36 bases (MINLEN:36)
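
To make the SLIDINGWINDOW:4:15 step a bit more concrete, here is a minimal Python sketch of the idea: decode phred33 quality characters, slide a 4-base window from the 5' end, and cut at the first window whose mean quality falls below 15. The function names are my own, and this is a simplification for illustration only; Trimmomatic's real implementation may place the cut at a slightly different position.

```python
def phred33_scores(quality_string):
    """Decode a phred33 FASTQ quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_keep_length(scores, window=4, min_mean=15):
    """Return how many leading bases to keep (simplified sliding window).

    Scans from the 5' end and cuts at the start of the first window
    whose mean quality is below min_mean. Hypothetical helper, not
    Trimmomatic's own code.
    """
    last_start = max(len(scores) - window + 1, 1)
    for start in range(last_start):
        chunk = scores[start:start + window]
        if chunk and sum(chunk) / len(chunk) < min_mean:
            return start
    return len(scores)


def passes_minlen(kept_length, minlen=36):
    """MINLEN-style check: keep the read only if it is long enough."""
    return kept_length >= minlen


if __name__ == "__main__":
    quals = phred33_scores("IIIII####")  # 'I' decodes to 40, '#' to 2
    keep = sliding_window_keep_length(quals)
    print(keep, passes_minlen(keep))
```

With MINLEN:36, a read that is cut this short would then be dropped, which is how a read can end up missing from the paired output files.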

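These default parameters are usually good enough. Processing is a little slow: even with 10 threads it took more than ten minutes to get through a 2 GB fq.gz compressed sequencing file. The log of that run, which used stricter thresholds than the example (LEADING:10 TRAILING:20 SLIDINGWINDOW:4:25), is as follows: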

TrimmomaticPE: Started with arguments:

-threads 10 -phred33 -trimlog tmp.log CHG006373_R1.fastq.gz CHG006373_R2.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:/home/jmzeng//biosoft/trimmomatic/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 LEADING:10 TRAILING:20 SLIDINGWINDOW:4:25 MINLEN:36

Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'

ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences

Input Read Pairs: 21427010 Both Surviving: 14507723 (67.71%) Forward Only Surviving: 5297811 (24.72%) Reverse Only Surviving: 375547 (1.75%) Dropped: 1245929 (5.81%)

TrimmomaticPE: Completed successfully
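
If you want to double-check the summary line of the log, a small Python sketch like the one below can parse it and recompute the percentages. The function names are my own, and the regular expression assumes the summary line has exactly the layout shown in the log above.

```python
import re

SUMMARY = ("Input Read Pairs: 21427010 Both Surviving: 14507723 (67.71%) "
           "Forward Only Surviving: 5297811 (24.72%) "
           "Reverse Only Surviving: 375547 (1.75%) Dropped: 1245929 (5.81%)")

PATTERN = re.compile(
    r"Input Read Pairs: (\d+) Both Surviving: (\d+) .*?"
    r"Forward Only Surviving: (\d+) .*?"
    r"Reverse Only Surviving: (\d+) .*?"
    r"Dropped: (\d+)"
)


def parse_summary(line):
    """Extract the read-pair counts from a TrimmomaticPE summary line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a TrimmomaticPE summary line")
    keys = ("input", "both", "forward_only", "reverse_only", "dropped")
    return dict(zip(keys, map(int, match.groups())))


def percentages(counts):
    """Recompute each category as a percentage of the input pairs."""
    total = counts["input"]
    return {key: round(100 * value / total, 2)
            for key, value in counts.items() if key != "input"}


if __name__ == "__main__":
    counts = parse_summary(SUMMARY)
    print(counts)
    print(percentages(counts))
```

The four categories add up to the number of input pairs, so roughly a third of the pairs in this run lost at least one read.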

Remember: always give the full path to the adapter file!!!
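
One way to avoid forgetting the full path is to build the command in a small script. The sketch below is only an illustration: the function name, the output file naming and the default steps are my own choices (copied from the official example), and the command is only assembled here, not executed.

```python
import os


def build_pe_command(jar, r1, r2, adapters, threads=10,
                     steps=("LEADING:3", "TRAILING:3",
                            "SLIDINGWINDOW:4:15", "MINLEN:36")):
    """Assemble a Trimmomatic PE command line as a list of arguments.

    The adapter file is converted to an absolute path, following the
    advice above. Hypothetical helper; output names mirror the example.
    """
    adapters_abs = os.path.abspath(adapters)
    outputs = [
        "output_forward_paired.fq.gz",
        "output_forward_unpaired.fq.gz",
        "output_reverse_paired.fq.gz",
        "output_reverse_unpaired.fq.gz",
    ]
    return ["java", "-jar", jar, "PE",
            "-threads", str(threads), "-phred33",
            r1, r2, *outputs,
            "ILLUMINACLIP:" + adapters_abs + ":2:30:10",
            *steps]


if __name__ == "__main__":
    cmd = build_pe_command("trimmomatic-0.36.jar",
                           "CHG006373_R1.fastq.gz",
                           "CHG006373_R2.fastq.gz",
                           "adapters/TruSeq3-PE.fa")
    print(" ".join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```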

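As the log shows, it used the adapters from its bundled file TruSeq3-PE.fa. The sequence TACACTCTTTCCCTACACGACGCTCTTCCGATCT is actually only the second half of the TruSeq Universal Adapter (the adapter sequences can be found at https://github.com/csf-ngs/fastqc/blob/master/Contaminants/contaminant_list.txt). Searching the R1 file directly shows that there is still other sequence between strings like AAAAAAAAAAAAATTTTTTTTTTTTTTTTT and the adapter TACACTCTTTCCCTACACGACGCTCTTCCGATCT.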

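Take the first read as an example: Trimmomatic put it into output_forward_unpaired.fq.gz and did not bother to trim its adapter, because its mate, the reverse read, was in even worse shape!

Inspecting the files shows that some reads were trimmed purely on quality, at positions that have nothing to do with the adapter!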
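
The kind of check described above can be sketched in a few lines of Python. The TruSeq Universal Adapter sequence below is the one given in the FastQC contaminant list; the function name and the tiny in-memory FASTQ record are made up for illustration, and plain substring matching will of course miss adapters that contain sequencing errors.

```python
UNIVERSAL = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
PREFIX_R1 = "TACACTCTTTCCCTACACGACGCTCTTCCGATCT"  # from the Trimmomatic log


def count_reads_with(seq, fastq_lines):
    """Count FASTQ records whose sequence line contains seq exactly.

    fastq_lines is any iterable of text lines in 4-line FASTQ layout,
    for example an open gzip handle. Returns (hits, total_reads).
    """
    hits = total = 0
    for index, line in enumerate(fastq_lines):
        if index % 4 != 1:  # the sequence is the 2nd line of each record
            continue
        total += 1
        if seq in line:
            hits += 1
    return hits, total


if __name__ == "__main__":
    # The 34-base sequence is the tail end of the 58-base universal adapter.
    print(UNIVERSAL.endswith(PREFIX_R1), len(PREFIX_R1), len(UNIVERSAL))

    # Made-up record, just to show the call.
    demo = ["@read1", "ACGT" + PREFIX_R1 + "GG", "+", "I" * 40]
    print(count_reads_with(PREFIX_R1, demo))
    # On a real file:
    #   import gzip
    #   with gzip.open("CHG006373_R1.fastq.gz", "rt") as handle:
    #       print(count_reads_with(PREFIX_R1, handle))
```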

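Because it removes adapters and low-quality bases in the same pass, it is hard for me to work out exactly how the adapter removal itself behaves, which is frustrating. Still, it clearly does a lot on Illumina data: in the run above only 67.71% of read pairs survived with both reads intact.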

De novo transcriptome pipeline, including complete transcript annotation

ulwvfje — Tue, 12 Jul 2016 12:03:39 +0000

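A reader once asked how to do RNA-seq analysis for a species that has no reference genome or transcriptome. I think that question is too broad, and I honestly have no hands-on experience with it. But I once read a paper that described a very comprehensive de novo transcriptome assembly and annotation pipeline, so I have excerpted its bioinformatics processing section to share with everyone:

The paper is "RNA-seq analysis for plant carnivory gene discovery in Nepenthes × ventrata", by researchers in Malaysia.

The paper is so short that it startled me.

Journal: Frontiers in Plant Science. Impact factor (SCI 2014): 3.948. I have a feeling this journal's impact factor will keep rising.

The same experimental design and workflow produced two papers: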

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4707257/

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4778577/

Both used the same sequencing strategy: Illumina HiSeq 2500 sequencing platform, paired-end reads of 125 bp.

Data processing pipeline:

Trimmomatic -> Trinity (v2.0.6) -> TransDecoder -> Trinotate (v2.0.0)
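
As a rough sketch of how the middle of this pipeline might be chained together, the Python snippet below assembles the Trinity and TransDecoder command lines for a pair of already-trimmed read files. Everything here is an assumption on my part rather than the paper's actual commands: the function name, the file names, the CPU and memory values, and the location of Trinity.fasta inside the output directory. Check the flags against the versions you have installed before running anything. Trinotate is left out because it involves several separate steps.

```python
def build_denovo_commands(left, right, trinity_out="trinity_out_dir",
                          cpu=10, max_memory="50G"):
    """Assemble (not run) Trinity and TransDecoder command lines.

    Hypothetical helper: paths and resource values are placeholders,
    and the assembly is assumed to be written to <output>/Trinity.fasta.
    """
    assembly = trinity_out + "/Trinity.fasta"
    return [
        ["Trinity", "--seqType", "fq",
         "--left", left, "--right", right,
         "--CPU", str(cpu), "--max_memory", max_memory,
         "--output", trinity_out],
        ["TransDecoder.LongOrfs", "-t", assembly],
        ["TransDecoder.Predict", "-t", assembly],
    ]


if __name__ == "__main__":
    for cmd in build_denovo_commands("output_forward_paired.fq.gz",
                                     "output_reverse_paired.fq.gz"):
        print(" ".join(cmd))
```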

I have usage notes for all of these tools on my blog.

The technical details are as follows:

Raw reads from all three data sets were filtered to remove adapter sequences with sequence pre-processing tool, Trimmomatic [2]. High quality Illumina raw reads with phred score ≥ 25 were kept for assembly. De novo assembly of these processed reads was performed with Trinity (v2.0.6) [3]. Statistics of the assembly is showed in Table 1.

Protein coding sequences of unique transcripts were analyzed via Transdecoder version v2.0.1 as a part of Trinity analysis pipeline. Standard Trinotate (v2.0.0) annotation pipeline (https://trinotate.github.io/) was carried out to annotate the assembled unique transcripts against Swissprot [4], Pfam [5], eggNOG [6], Gene Ontology [7], SignalP [8], and Rnammer [9]. Summary of the annotation is showed in Table 2.

So the key is to learn the following tools:

Trinotate: http://trinotate.github.io
Trinity (includes support for expression and DE analysis using RSEM and Bioconductor): http://trinityrnaseq.github.io/ Note: Trinity is not absolutely required. It is possible to use Trinotate with other sources of transcript data as long as suitable inputs are available.
TransDecoder (for predicting coding regions in transcripts): http://transdecoder.github.io
sqlite (required for database integration): http://www.sqlite.org/
NCBI BLAST+ (BLAST database homology search): http://www.ncbi.nlm.nih.gov/books/NBK52640/
HMMER/PFAM (protein domain identification): http://hmmer.janelia.org/download.html

All of the data can be downloaded, which makes it good material to practice on:

Transcriptome profile of N. × ventrata were generated from the polyA-enriched cDNA libraries prepared from total RNA extracted from its pitcher. The short reads were filtered, processed, assembled and analyzed as describe in the next section. Raw data for this project were deposited at SRA database with the accession numbers SRX1389337 (http://www.ncbi.nlm.nih.gov/sra/SRX1389337) for day 0 control, SRX1389392 (http://www.ncbi.nlm.nih.gov/sra/SRX1389392) for day 3 longevity experiment, and SRX1389395 (http://www.ncbi.nlm.nih.gov/sra/SRX1389395) for day 3 chitin-treatment experiment.

Transcriptome profile of N. ampullaria was generated from the polyA-enriched cDNA libraries prepared from total RNA extracted from its pitcher. The short reads were filtered, processed, assembled, and analyzed as described in the next section. Raw data for this project were deposited at SRA database with the accession numbers SRX1400303 (http://www.ncbi.nlm.nih.gov/sra/SRX1400303) for day 0 control, SRX1400308 (http://www.ncbi.nlm.nih.gov/sra/SRX1400308) for day 3 longevity experiment, and SRX1400311 (http://www.ncbi.nlm.nih.gov/sra/SRX1400311) for day 3 fluid protein depletion experiment. Assembled transcriptome fasta sequences can be accessed at http://gohlab.researchfrontier.org/public-datasets/Nepenthes-ampullaria-Trinity-gohlab.fasta.
