用broad出品的软件来处理bam文件几次遇到文件头错误

ulwvfje — Thu, 14 Jan 2016 12:41:28 +0000

报错如下：ERROR MESSAGE: SAM/BAM file input.marked.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups ！

有些人遇到的是bam的染色体顺序不一样，还有可能是染色体的名字不一样，比如>1和>chr1的区别，虽然很傻，但是遇到这样问题的还不少！

还有一些人是遇到基因组没有dict文件，也是用picard处理一下就好。

大部分人是在GATK遇到的，我是在RNA-SeQC遇到的，不过原理都是一样的。

都是因为做alignment的时候并未添加头信息，比如：

bwa samse ref.fa my.sai my.fastq > my.sam

samtools view -bS my.sam > my.bam

samtools sort my.bam my_sorted

java -jar ReordereSam.jar I=/path/my_sorted.bam O=/path/my_reordered.bam R=/path/ref.fa

通过这个代码可以得到排序好的bam，但是接下来用GATK就会报错

java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R /paht/ref.fa -I /path/aln_reordered.bam

就是因为没有头信息，group相关信息，解决方法有两种：

bwa samse -r @RG\tID:IDa\tSM:SM\tPL:Illumina ref.fa my.sai my.fastq > my.sam

java -jar AddOrReplaceReadGroups I=my.bam O=myGr.bam LB=whatever PL=illumina PU=whatever SM=whatever

一种是比对的时候就加入头信息，这个需要比对工具的支持。

第二种是用picard工具来修改bam，推荐用这个！虽然我其实并不懂这些头文件信息是干嘛的，但是broad开发的软件就是需要！希望将来去读PHD能系统性的学习一些基础知识！

参考：http://seqanswers.com/forums/showthread.php?t=17233

https://www.biostars.org/p/115819/

http://gatkforums.broadinstitute.org/gatk/discussion/2667/bam-is-malformed-depthofcoverage

用RNA-SeQC得到表达矩阵RPKM值

ulwvfje — Thu, 14 Jan 2016 12:40:14 +0000

这个软件不仅仅能做QC，而且可以统计各个基因的RPKM值！尤其是TCGA计划里面的都是用它算的

一、程序安装

直接在官网下载java版本软件即可使用：http://www.broadinstitute.org/cancer/cga/tools/rnaseqc/RNA-SeQC_v1.1.8.jar

但是需要下载很多注释数据

二、输入数据

箭头所指的文件，一个都不少，只有那个rRNA.tar我没有用，因为这个软件有两种使用方式，我用的是第一种

三、软件使用

软件的官网给力例子，很容易学习：

RNA-SeQC can be run with or without a BWA-based rRNA level estimation mode. To run without (less accurate, but faster) use the command:
java -jar RNASeQC.jar -n 1000 -s "TestId|ThousandReads.bam|TestDesc" -t gencode.v7.annotation_goodContig.gtf -r Homo_sapiens_assembly19.fasta -o ./testReport/ -strat gc -gc gencode.v7.gc.txt

我用的就是这个例子，这个例子需要的所有文件里面，染色体都是没有chr的，这个非常重要！！！

代码如下：

java -jar RNA-SeQC_v1.1.8.jar \
-n 1000 \
-s "TestId|ThousandReads.bam|TestDesc" \
-t gencode.v7.annotation_goodContig.gtf \
-r ~/ref-database/human_g1k_v37/human_g1k_v37.fasta \
-o ./testReport/ \
-strat gc \
-gc gencode.v7.gc.txt \

To run the more accurate but slower, BWA-based method :
java -jar RNASeQC.jar -n 1000 -s "TestId|ThousandReads.bam|TestDesc" -t gencode.v7.annotation_goodContig.gtf -r Homo_sapiens_assembly19.fasta -o ./testReport/ -strat gc -gc gencode.v7.gc.txt -BWArRNA human_all_rRNA.fasta
Note: this assumes BWA is in your PATH. If this is not the case, use the -bwa flag to specify the path to BWA

四、结果解读

运行要点时间，就那个一千条reads的测试数据都搞了10分钟！

出来一大堆突变，具体解释，官网上面很详细，不过，比较重要的当然是RPKM值咯，还有QC的信息

TCGA数据里面都会提供由RNA-SeQC软件处理得到的表达矩阵！

Expression

RPKM data are used as produced by RNA-SeQC.
Filter on >=10 individuals with >0.1 RPKM and raw read counts greater than 6.
Quantile normalization was performed within each tissue to bring the expression profile of each sample onto the same scale.
To protect from outliers, inverse quantile normalization was performed for each gene, mapping each set of expression values to a standard normal.

软件的主页是：

http://www.broadinstitute.org/cancer/cga/rnaseqc_run

http://www.broadinstitute.org/cancer/cga/rnaseqc_download

帮助文件：http://www.broadinstitute.org/cancer/cga/sites/default/files/data/tools/rnaseqc/RNA-SeQC_Help_v1.1.2.pdf