生信菜鸟团 » 数据比对

自学miRNA-seq分析第四讲~测序数据比对

ulwvfje — Sat, 25 Jun 2016 09:25:10 +0000

序列比对是大多数类型数据分析的核心，如果要利用好测序数据，比对细节非常重要，我这里只是研读一篇文章也就没有对比对细节过多考虑，只是列出自己的代码和自己的几点思考，力求重现文章作者的分析结果。对miRNA-seq数据有两条比对策略，一种是下载miRBase数据库里面的已知miRNA序列来进行比对，一种直接比对到参考基因组(比如人类的是hg19/hg38)，前面的比对非常简单，而且很容易就可以数出已经的所以miRNA序列的表达量，后面的比对有点耗时，而且算表达量的时候也不是很方便，但是它有个有点是可以来预测新的miRNA，所以大多数文章都会把这两条路给走一下。

本文选择的是SHRiMP这个小众软件，起初我并没有在意，就用的bowtie2而已，参考基因组我这里因为服务器原因，就用了miRBase数据库下载的人类的参考序列，现在的miRNA版本来说，人类这个物种已知的成熟miRNA共有2588条序列，而前体miRNA共有1881条序列，我下载（下载时间2016年6月）的代码见自学miRNA-seq分析第二讲~学习资料的搜集，下面比对所用到的软件已经序列在我的：自学miRNA-seq分析第三讲~公共测序数据下载

## step5 : alignment to miRBase v21 (hairpin.human.fa/mature.human.fa )

#### step5.1 using bowtie2 to do alignment

mkdir bowtie2_index && cd bowtie2_index

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ../hairpin.human.fa hairpin_human

~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ../mature.human.fa mature_human

ls *_clean.fq.gz | while read id ; do ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -x miRBase/bowtie2_index/hairpin_human -U $id -S ${id%%.*}.hairpin.sam ; done

## overall alignment rate: 10.20% / 5.71%/ 10.18%/ 4.36% / 10.02% / 4.95% (before convert U to T )

## overall alignment rate: 51.77% / 70.38%/51.45% /61.14%/ 52.20% / 65.85% (after convert U to T )

ls *_clean.fq.gz | while read id ; do ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -x miRBase/bowtie2_index/mature_human -U $id -S ${id%%.*}.mature.sam ; done

## overall alignment rate: 6.67% / 3.78% / 6.70% / 2.80%/ 6.55% / 3.23% (before convert U to T )

## overall alignment rate: 34.94% / 46.16%/ 35.00%/ 38.50% / 35.46% /42.41%(after convert U to T )

#### step5.2 using SHRiMP to do alignment

## http://compbio.cs.toronto.edu/shrimp/README

## 3.5 Mapping cDNA reads against a miRNA database

cd ~/biosoft/SHRiMP/SHRiMP_2_2_3

export SHRIMP_FOLDER=$PWD

cd -

##　　We project the database with:

$SHRIMP_FOLDER/utils/project-db.py --seed 00111111001111111100,00111111110011111100,00111111111100111100,00111111111111001100,00111111111111110000 \

--h-flag --shrimp-mode ls miRBase/hairpin.human.fa

##

$SHRIMP_FOLDER/bin/gmapper-ls -L hairpin.human-ls SRR1542716.fastq --qv-offset 33 \

-o 1 -H -E -a -1 -q -30 -g -30 --qv-offset 33 --strata -N 8 >map.out 2>map.log

大家可以看到我们把测序reads比对到前体miRNA和成熟的miRNA结果是有略微区别的，因为一个前体miRNA可以形成多个成熟的miRNA，而并不是所有的成熟的miRNA形式都被记录在数据库，所以一般推荐我们比对到前体miRNA数据库，这样还可以预测新的成熟miRNA，也是非常有意义的。

而且有个非常重要的一点，就是大家可以看到我把U变成T前后比对率差异非常大，这其实是一个非常蠢的错误。我就不多说了。但是做到这一步，其实可以跟文章来做验证了，文章有提到比对率，比对的序列。

我也是在博客里面看到这个信息的：

Thank you so much!. Yes I contacted the lab-guy and he just said that trimmed the first 4 bp and last 4bp. ( as you found)

So I firstly trimmed the adapter sequences(TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC)

And then, trimmed the first 4bp and last 4bp from reads, which leads to the 22bp peak of read-length distribution(instead of 24bp)

Anyhow, I tried to map with bowtie2 again.

> bowtie2 --local -N 1 -L 16

-x ../miRNA_reference/hairpin_UtoT.fa

-U first4bptrimmed_A1-SmallRNA_S1_L001_R1_001_Illuminaadpatertrim.fastq

-S f4_trimmed.sam

I also changed hairpin.fa file (U to T)

Oh.. thank you David,

Finallly, I got

2565353 reads; of these:
2565353 (100.00%) were unpaired; of these:
479292 (18.68%) aligned 0 times
11959 (0.47%) aligned exactly 1 time
2074102 (80.85%) aligned >1 times
81.32% overall alignment rate