Self-taught ChIP-seq analysis, part 4: installing the required software and downloading the paper's results

ulwvfje — Tue, 05 Jul 2016 00:34:53 +0000

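The order of these posts is a bit jumbled. I was worried that readers who had reached the earlier post on downloading public sequencing data would not understand how I call the various tools, so I am squeezing in this post to describe that, though only briefly. In principle all of my software is installed in the biosoft folder under my home directory, so my installs generally start like this: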

cd ~/biosoft
mkdir macs2 && cd macs2 ## each tool is installed in its own folder

This is just my personal habit. I am not root, so there is not much I can do system-wide on Linux. Here is the installation code for all of my software:

## pre-step: download sratoolkit /fastx_toolkit_0.0.13/fastqc/bowtie2/bwa/MACS2/HOMER/QuEST/mm9/hg19/bedtools
## http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
## http://www.ncbi.nlm.nih.gov/books/NBK158900/

## Download and install sratoolkit
cd ~/biosoft
mkdir sratoolkit && cd sratoolkit
wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.6.3/sratoolkit.2.6.3-centos_linux64.tar.gz
##
## Length: 63453761 (61M) [application/x-gzip]
## Saving to: "sratoolkit.2.6.3-centos_linux64.tar.gz"
tar zxvf sratoolkit.2.6.3-centos_linux64.tar.gz

## Download and install bedtools
cd ~/biosoft
mkdir bedtools && cd bedtools
wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
## Length: 19581105 (19M) [application/octet-stream]
tar -zxvf bedtools-2.25.0.tar.gz
cd bedtools2
make

## Download and install PeakRanger
cd ~/biosoft
mkdir PeakRanger && cd PeakRanger
wget -O PeakRanger-1.18-Linux-x86_64.zip https://sourceforge.net/projects/ranger/files/PeakRanger-1.18-Linux-x86_64.zip/
## Length: 1517587 (1.4M) [application/octet-stream]
unzip PeakRanger-1.18-Linux-x86_64.zip
~/biosoft/PeakRanger/bin/peakranger -h

## Download and install bowtie
cd ~/biosoft
mkdir bowtie && cd bowtie
wget https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.9/bowtie2-2.2.9-linux-x86_64.zip/download
#Length: 27073243 (26M) [application/octet-stream]
#Saving to: "download" ## I made a mistake here for downloading the bowtie2
mv download bowtie2-2.2.9-linux-x86_64.zip
unzip bowtie2-2.2.9-linux-x86_64.zip

mkdir -p ~/biosoft/bowtie/hg19_index
cd ~/biosoft/bowtie/hg19_index

# download hg19 chromosome fasta files
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
# unzip and concatenate chromosome and contig fasta files
tar zvfx chromFa.tar.gz
cat *.fa > hg19.fa
rm chr*.fa
## ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ~/biosoft/bowtie/hg19_index/hg19.fa ~/biosoft/bowtie/hg19_index/hg19

## Download and install BWA
cd ~/biosoft
mkdir bwa && cd bwa

## download bwa-0.7.12.tar.bz2 from http://sourceforge.net/projects/bio-bwa/files/ into this folder

tar xvfj bwa-0.7.12.tar.bz2 # x extracts, v is verbose (details of what it is doing), f names the archive file to read, and j decompresses bzip2 (.bz2) archives
cd bwa-0.7.12
make
export PATH=$PATH:/path/to/bwa-0.7.12 # Add bwa to your PATH by editing ~/.bashrc file (or .bash_profile or .profile file)
# /path/to/ is a placeholder. Replace it with the real path to BWA on your machine
source ~/.bashrc
# bwa index [-a bwtsw|is] index_prefix reference.fasta
bwa index -p hg19bwaidx -a bwtsw ~/biosoft/bowtie/hg19_index/hg19.fa
# -p index name (change this to whatever you want)
# -a index algorithm (bwtsw for long genomes, is for short genomes)

## Download and install macs2
## // https://pypi.python.org/pypi/MACS2/
cd ~/biosoft
mkdir macs2 && cd macs2
wget ~~~~~~~~~~~~~~~~~~~~~~MACS2-2.1.1.20160309.tar.gz
tar zxvf MACS2-2.1.1.20160309.tar.gz
cd MACS2-2.1.1.20160309
python setup.py install --user

#################### The log for installing MACS2:
Creating ~/.local/lib/python2.7/site-packages/site.py
Processing MACS2-2.1.1.20160309-py2.7-linux-x86_64.egg
Copying MACS2-2.1.1.20160309-py2.7-linux-x86_64.egg to ~/.local/lib/python2.7/site-packages
Adding MACS2 2.1.1.20160309 to easy-install.pth file
Installing macs2 script to ~/.local/bin
Finished processing dependencies for MACS2==2.1.1.20160309
############################################################
~/.local/bin/macs2 --help

Example for regular peak calling:
macs2 callpeak -t ChIP.bam -c Control.bam -f BAM -g hs -n test -B -q 0.01
Example for broad peak calling:
macs2 callpeak -t ChIP.bam -c Control.bam --broad -g hs --broad-cutoff 0.1

## Download and install homer (Hypergeometric Optimization of Motif EnRichment)
## // http://homer.salk.edu/homer/
## // http://blog.qiubio.com:8080/archives/3024
## pre-install: Ghostscript, seqlogo, blat
cd ~/biosoft
mkdir homer && cd homer
wget http://homer.salk.edu/homer/configureHomer.pl
perl configureHomer.pl -install
perl configureHomer.pl -install hg19

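For someone at my level, installing software is routine and rarely causes trouble. If you are a beginner it will certainly not be that easy, so keep studying; I cannot go into more detail here.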

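Once all the software is installed, we can download the paper's own processed results for these ChIP-seq data. This matters, because it is how we check whether we have reproduced the authors' analysis: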

## step3 : download the results from paper
## http://www.bio-info-trainee.com/1571.html
mkdir paper_results && cd paper_results
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE52nnn/GSE52964/suppl/GSE52964_RAW.tar
tar xvf GSE52964_RAW.tar

ls *gz |xargs gunzip

## step4 : run FastQC to check the sequencing quality.

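## Here we can see that the raw data we downloaded were already cleaned by the authors: adapters and low-quality sequences have been removed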

ls *.fastq | while read id ; do ~/biosoft/fastqc/FastQC/fastqc $id;done
## Sequence length 51
## %GC 39
## Adapter Content passed

The quality of the reads is pretty good, so we don't need to filter or trim them.

mkdir QC_results
mv *zip *html QC_results/

So we can take these data straight to alignment!
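With many samples it is tedious to open every HTML report. FastQC also writes a summary.txt inside each result zip, with one tab-separated line per module (status, module name, file name). The sketch below is a minimal Python illustration of reading those flags; the function name parse_fastqc_summary and the sample text are made up for this example and are not part of the original workflow.

```python
def parse_fastqc_summary(text):
    """Map each FastQC module name to its PASS/WARN/FAIL status."""
    status = {}
    for line in text.splitlines():
        fields = line.rstrip("\r").split("\t")
        if len(fields) >= 2:
            status[fields[1]] = fields[0]
    return status

# Made-up example in the layout of summary.txt (status, module, file name).
example = (
    "PASS\tBasic Statistics\tsample.fastq\n"
    "PASS\tAdapter Content\tsample.fastq\n"
    "WARN\tPer base sequence content\tsample.fastq\n"
)
flags = parse_fastqc_summary(example)
not_passed = sorted(m for m, s in flags.items() if s != "PASS")
print(not_passed)
```

In practice the text would come from the summary.txt file extracted from each zip in QC_results/.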

Preprocessing strawberry genome data

ulwvfje — Tue, 24 Mar 2015 10:03:34 +0000

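Today I started with the 7 single-end datasets. They are 454 data with an average read length of about 300 bp. Tomorrow I will process the 3 kb and 20 kb paired reads.

First, run FastQC

and open the results one by one.

The bases at the start of the reads are of decent quality. Because this is 454 data, read lengths vary widely; in general the first 400 bp are of good quality, and reads that run much longer are unreliable.

The key point is in the next plot: the first 4 bases are adaptor, not sequence from our sample. They are TCAG and must be removed.

So we used the SolexaQA suite to filter the raw sequencing data.

The effect of filtering is drastic! One sample was almost completely wiped out! I then checked my batch script and found that the -454 parameter in perl DynamicTrim.pl -454 $id may be the problem: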

for id in *fastq
do
echo $id
perl DynamicTrim.pl -454 $id
done

for id in *trimmed
do
echo $id
perl LengthSort.pl $id
done

The low-quality bases at the end of the reads were removed, but the TCAG at the head of the reads is still there.
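One way to deal with the remaining TCAG is to clip it directly from each record. The following is only a sketch in Python, not what was run for this post: it assumes plain 4-line FASTQ records (no wrapped sequences), and the helper names strip_key and strip_key_fastq are hypothetical.

```python
def strip_key(seq, qual, key="TCAG"):
    """Remove a leading key sequence from a read and its quality string."""
    if seq.startswith(key):
        return seq[len(key):], qual[len(key):]
    return seq, qual

def strip_key_fastq(lines, key="TCAG"):
    """Yield FASTQ lines with the key removed from every 4-line record."""
    record = []
    for line in lines:
        record.append(line.rstrip("\r\n"))
        if len(record) == 4:
            name, seq, plus, qual = record
            seq, qual = strip_key(seq, qual, key)
            yield from (name, seq, plus, qual)
            record = []

# Made-up record for illustration.
demo = ["@read1", "TCAGACGTACGT", "+", "IIIIHHHHGGGG"]
print(list(strip_key_fastq(demo)))
```

The quality string is clipped by the same amount as the sequence so that the two stay the same length.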

The data after processing look like this:

Re-implementing some FastQC features: R code

ulwvfje — Sun, 15 Mar 2015 02:53:11 +0000

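Re-implementing some FastQC features (part 2)

The input files are the output of the Perl code from part 1. The algorithm there seems to have a problem: a 26G file took nearly an hour to produce results!

R's built-in plotting is ugly, so I will not dwell on it. The plots could be redrawn with ggplot2, but since this is not a project requirement and nobody is paying me, I did not bother. Treat the plots here as a rough illustration of the idea.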

a=read.table("meanQ.txt")

The data look like this:

> head(a)
  V1    V2
1  2 93879
2  3 17800
3  4 25295
4  5 33259
5  6 55685
6  7 84866

plot(a,type='l',col='red',ylab='reads number',xlab='mean quality',main='mean Q distribution')

Most reads have a mean Q between 30 and 35, so this sequencing run meets the requirements, but reads with a mean Q below 20 should still be filtered out.
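The filtering step mentioned here can be sketched in a few lines of Python. This is an illustration under assumptions, not the pipeline used for this data: it assumes Phred+33 quality strings, represents each read as a (name, sequence, quality) tuple, and uses the hypothetical helper names mean_quality and filter_by_mean_q.

```python
def mean_quality(qual, offset=33):
    """Mean Phred score of one quality string (Phred+33 by default)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def filter_by_mean_q(records, min_q=20):
    """Keep (name, seq, qual) records whose mean Q is at least min_q."""
    return [r for r in records if r[2] and mean_quality(r[2]) >= min_q]

# Made-up reads: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
reads = [
    ("r1", "ACGT", "IIII"),
    ("r2", "ACGT", "####"),
    ("r3", "ACGT", "II##"),
]
kept = filter_by_mean_q(reads)
print([name for name, _, _ in kept])
```

Here r3 is kept because its mean is (40 + 40 + 2 + 2) / 4 = 21, which shows that a mean-based filter can let through reads with a few very poor bases.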

a=read.table('meanGC.txt')

The data look like this:

> head(a)
  V1  V2
1  0 503
2  1 151
3  2 163
4  3 179
5  4 315
6  5 443

plot(a,type='l',col='red',ylab='reads number',xlab='GC%',main='GC% distribution')

The GC content distribution looks roughly normal, and most reads have a GC content between 40% and 60%.

a=read.table('fivenum.txt',header=T)

The data look like this:

boxplot(t(a[,3:7]),xlab='reads bp',ylab='Q value',main='mean Q boxplot')

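Quality declines steadily from position 1 to 100, although most positions stay above Q30. Bases after position 88 are of poor quality and may need to be trimmed.

Another plot can be drawn from the same data: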

plot(a[,1:2],type='l',col='red',ylab='Q value',xlab='reads bp',main='mean Q value distribution')

After position 88 the mean Q drops below 30, so by our threshold we may need to trim roughly the last 10 bp from every read.
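Reading the cut-off from the plot can also be done programmatically. The Python sketch below assumes the per-position mean Q values (the mean column of fivenum.txt) have been loaded into a list; the function name first_low_position and the numbers are invented for illustration and are not the values from this dataset.

```python
def first_low_position(mean_q, threshold=30):
    """Return the 1-based position from which every mean Q stays below
    threshold, or None if the tail never drops below it."""
    cut = None
    for pos in range(len(mean_q), 0, -1):   # walk back from the 3' end
        if mean_q[pos - 1] < threshold:
            cut = pos
        else:
            break
    return cut

# Made-up per-position means, not the values from fivenum.txt.
means = [36, 35, 35, 34, 31, 29, 28, 25]
cut = first_low_position(means)
print(cut, len(means) - cut + 1)   # cut position, number of bases to trim
```

Walking back from the 3' end means an isolated dip in the middle of the read does not trigger trimming; only a low-quality tail does.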

Re-implementing part of FastQC: Perl code

ulwvfje — Sat, 14 Mar 2015 00:21:08 +0000

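Re-implementing part of FastQC (part 1)

An earlier post introduced how to use FastQC: http://www.bio-info-trainee.com/?p=95 . It is a Java program, and on some servers the Java environment is not set up properly, so it cannot be run. Here are a few Perl scripts that reproduce part of what FastQC does.

All scripts are tested on the same file: an Illumina FASTQ file in Phred+33 format with 100000/4 = 25000 reads, each 101 bases long.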
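For readers unfamiliar with the Phred+33 encoding, here is a minimal Python sketch of what it means: each quality character's ASCII code minus 33 is the Q value, and Q corresponds to an error probability of 10^(-Q/10). The function names phred33_to_q and error_probability are chosen for this example only.

```python
def phred33_to_q(qual):
    """Decode a Phred+33 quality string into a list of Q values."""
    return [ord(c) - 33 for c in qual]

def error_probability(q):
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

print(phred33_to_q("!5?I"))      # [0, 20, 30, 40]
print(error_probability(20))     # about 1 error in 100 base calls
```

All of the scripts in this post use the same subtraction of 33.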

Script: fastq2quality.pl

Usage: perl fastq2quality.pl SRR504517_1.fastq >quality.txt

Function: convert the ASCII quality string on the fourth line of each FASTQ record into Q values and print a matrix with one row per read and one column per base.

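[perl]
use strict;
use warnings;

while (<>) {
    next unless $. % 4 == 0;    # the quality string is every 4th line
    chomp;
    s/\r//g;                    # strip Windows line endings
    # Phred+33: Q = ASCII code - 33
    print join("\t", map { ord($_) - 33 } split //, $_), "\n";
}
[/perl]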

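The output looks like this:

Script: fastq2meanQ.pl

Usage: perl fastq2meanQ.pl SRR504517_1.fastq

Function: compute the mean Q value of each read and tabulate how many reads fall at each mean Q from 1 to 50, ready for plotting as a distribution.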

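[perl]
use strict;
use warnings;

my %hash;                       # mean Q (integer) => number of reads
while (<>) {
    next unless $. % 4 == 0;    # the quality string is every 4th line
    chomp;
    s/\r//g;
    my @F = split //, $_;
    next unless @F;             # skip empty quality lines
    my $sum = 0;
    $sum += ord($_) - 33 foreach @F;
    $hash{ int($sum / @F) }++;
}

print "$_ \t$hash{$_}\n" foreach sort { $a <=> $b } keys %hash;
[/perl]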

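The output looks like this:

Script: fastq2fivenum.pl

Usage: perl fastq2fivenum.pl SRR504517_1.fastq

Function: convert the ASCII quality string on the fourth line of each FASTQ record into Q values, then for every read position compute the quartiles across all reads, plus the mean.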
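The quartile step described above works on a histogram of Q values per position rather than on a sorted list of all values. A minimal Python sketch of that idea follows. It takes the first Q value whose cumulative count reaches the target fraction, which is one possible convention and an assumption of this example; it can differ slightly from R's fivenum. The function name quantile_from_hist and the histogram are made up.

```python
def quantile_from_hist(hist, fraction):
    """Smallest value whose cumulative count reaches fraction of the total.

    hist maps a Q value to the number of reads with that Q at one position.
    """
    total = sum(hist.values())
    target = fraction * total
    cumulative = 0
    for q in sorted(hist):
        cumulative += hist[q]
        if cumulative >= target:
            return q
    raise ValueError("empty histogram")

# Made-up histogram for one read position: 10 reads in total.
hist = {2: 1, 20: 2, 30: 3, 38: 4}
print([quantile_from_hist(hist, f) for f in (0.25, 0.5, 0.75)])   # [20, 30, 38]
```

Because Q values only range over a few dozen integers, the histogram stays tiny no matter how many reads the file contains.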

[perl]
use strict;
use warnings;
use List::Util qw/max min sum/;

my @tmp;    # $tmp[$pos]{$Q} = number of reads with quality $Q at position $pos
while (<>) {
    next unless $. % 4 == 0;    # the quality string is every 4th line
    chomp;
    s/\r//g;
    my @F = split //, $_;
    $tmp[$_]{ ord($F[$_]) - 33 }++ foreach 0 .. $#F;
}    # this closing brace was missing in the original listing

print "num\tmean\tmin\tq25\tq50\tq75\tmax\n";
my $i = 0;
foreach my $hash (@tmp) {
    my $sum_reads = sum values %$hash;
    # cumulative read counts at which each quartile is reached
    my %target = (q25 => $sum_reads / 4, q50 => $sum_reads / 2, q75 => 3 * $sum_reads / 4);
    my ($sum_Q, $sum_value) = (0, 0);
    my %q;
    foreach my $Q (sort { $a <=> $b } keys %$hash) {
        $sum_Q     += $Q * $hash->{$Q};
        $sum_value += $hash->{$Q};
        # the first Q whose cumulative count reaches the target is the quartile
        foreach my $k (keys %target) {
            $q{$k} = $Q if !defined($q{$k}) && $sum_value >= $target{$k};
        }
    }
    my $mean = $sum_Q / $sum_reads;
    my $min  = min keys %$hash;
    my $max  = max keys %$hash;
    $i++;
    print "$i\t$mean\t$min\t$q{q25}\t$q{q50}\t$q{q75}\t$max\n";
}
[/perl]

The output file looks like this:

Finally, GC content.

Script: fastq2meanGC.pl

Usage: perl fastq2meanGC.pl SRR504517_1.fastq

Function: compute the GC percentage of each read and tabulate how many reads fall at each GC% from 0 to 100.
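The per-read GC percentage that fastq2meanGC.pl tabulates can be sketched in Python as follows. This is an illustration only; the helper names gc_percent and gc_histogram are hypothetical, and unlike the Perl script this sketch also counts lowercase g and c.

```python
from collections import Counter

def gc_percent(seq):
    """Integer GC percentage of one read (uppercase and lowercase counted)."""
    seq = seq.upper()
    return int(100 * (seq.count("G") + seq.count("C")) / len(seq))

def gc_histogram(seqs):
    """Count how many non-empty reads fall at each integer GC percentage."""
    return Counter(gc_percent(s) for s in seqs if s)

# Made-up reads for illustration.
reads = ["GGCC", "ATAT", "ACGT", "AACG"]
for gc, n in sorted(gc_histogram(reads).items()):
    print(gc, n)
```

The two-column output (GC%, number of reads) has the same shape as the meanGC.txt table that the R code reads.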

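[perl]
use strict;
use warnings;

my %hash;                       # GC% (integer) => number of reads
while (<>) {
    next unless $. % 4 == 2;    # the sequence is the 2nd line of each record
    chomp;
    s/\r//g;
    next unless length($_);     # skip empty sequences
    my $GC = tr/GC//;           # tr returns the number of G and C characters
    $hash{ int(100 * $GC / length($_)) }++;
}

print "$_ \t$hash{$_}\n" foreach sort { $a <=> $b } keys %hash;
[/perl]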
The results look like this:

In the next post I will explain how to plot these with R.