TPM is just RPKM expressed as a proportion!

ulwvfje — Mon, 14 Nov 2016 11:34:12 +0000

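Someone asked about this a long time ago. RPKM/FPKM is still the mainstream way to describe a gene's expression level, but since everyone says TPM is better, let me look into it too!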

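I don't like reading formulas, so let me go straight to an example. Say gene A has 5,000 reads that were sequenced and mapped to the genome in this sample's transcriptome data. Gene A is 10 kb long and the sample's total sequencing library is 50M reads. Its RPKM is therefore 5000 divided by 10 (gene length in kb), then divided by 50 (library size in millions of reads), which gives 10. In other words, the gene's read count is normalized by gene length and by the sample's library size.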
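The arithmetic above can be written as a few lines of Python. This is only a sketch of the worked example; the function name `rpkm` and its parameter names are my own, not from any library.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads."""
    kilobases = gene_length_bp / 1_000            # 10,000 bp -> 10 kb
    millions = total_mapped_reads / 1_000_000     # 50,000,000 reads -> 50 M
    return read_count / kilobases / millions


# Gene A from the example: 5,000 reads, 10 kb long, 50M-read library.
gene_a = rpkm(read_count=5_000, gene_length_bp=10_000, total_mapped_reads=50_000_000)
print(gene_a)  # 10.0
```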

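So what is its TPM value? The information above is no longer enough: we also need the RPKM values of the other genes in the same sample. Suppose the sample has only 3 genes and the other two have RPKM values of 5 and 35. Gene A's RPKM of 10 then converts to a TPM of 1,000,000 × 10 / (5 + 10 + 35) = 200,000. That looks rather large, but only because we assumed so few genes. A real sample typically has more than twenty thousand genes, so the RPKM total is far larger and TPM and RPKM values do not differ this dramatically.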
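The same conversion as a short Python sketch. The gene labels "B" and "C" are placeholders I made up for the two unnamed genes, and `rpkm_to_tpm` is an illustrative helper, not a library function.

```python
def rpkm_to_tpm(rpkm_by_gene):
    """Rescale one sample's RPKM values so that they sum to one million."""
    total = sum(rpkm_by_gene.values())
    return {gene: 1_000_000 * value / total for gene, value in rpkm_by_gene.items()}


# The three-gene toy sample from the text: A = 10, the other two are 5 and 35.
sample = {"A": 10.0, "B": 5.0, "C": 35.0}
tpm = rpkm_to_tpm(sample)
print(tpm["A"])  # 200000.0
```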

TPM is just RPKM expressed as a proportion (scaled to one million)!!!

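You will surely ask: what is the advantage of TPM? The obvious one is that the TPM values of all genes in a sample always add up to 1M, because proportions sum to 1. This holds regardless of the sample, so every sample is guaranteed the same TPM total, which makes comparisons between samples more meaningful!!!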
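To see that the total really is independent of the sample, here is a small sketch that computes TPM directly from read counts and gene lengths. The library size cancels out of the ratio, so it never appears. All counts and lengths below are invented for illustration, and `tpm_from_counts` is my own helper name.

```python
import math


def tpm_from_counts(counts, lengths_bp):
    """TPM from raw read counts and gene lengths (in bp) for one sample."""
    # Reads per kilobase; dividing every gene by the same library size
    # would not change the proportions, so that step is skipped.
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1_000) for gene in counts}
    total = sum(rpk.values())
    return {gene: 1_000_000 * value / total for gene, value in rpk.items()}


lengths = {"A": 10_000, "B": 2_000, "C": 4_000}           # hypothetical genes
shallow = {"A": 5_000, "B": 500, "C": 7_000}              # hypothetical counts
deep = {gene: 4 * n for gene, n in shallow.items()}        # same sample, 4x deeper

tpm_shallow = tpm_from_counts(shallow, lengths)
tpm_deep = tpm_from_counts(deep, lengths)

# Both totals are one million, and each gene gets the same TPM at either depth.
print(round(sum(tpm_shallow.values())), round(sum(tpm_deep.values())))
print(all(math.isclose(tpm_shallow[g], tpm_deep[g]) for g in lengths))
```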

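I have not covered FPKM here. It is the same calculation with fragments (read pairs) counted instead of individual reads, so look it up yourself; there is not much more to it.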

Finally, here are the formulas after all!
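Written out so that they match the worked example earlier in this post. The symbols are my own notation: $c_g$ is the number of reads mapped to gene $g$, $L_g$ is its length in bp, and $N$ is the total number of mapped reads in the sample.

```latex
\mathrm{RPKM}_g = \frac{c_g}{\left(L_g / 10^{3}\right)\left(N / 10^{6}\right)}
                = \frac{10^{9}\, c_g}{L_g\, N}
\qquad
\mathrm{TPM}_g = 10^{6} \cdot \frac{\mathrm{RPKM}_g}{\sum_{j} \mathrm{RPKM}_j}
```

With $c_g = 5000$, $L_g = 10^{4}$ and $N = 5 \times 10^{7}$ the first formula gives 10, and with an RPKM total of 50 the second gives 200,000, as in the example.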

A pile of references that I was too lazy to read:

http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702322/

https://www.biostars.org/p/88751/

https://www.biostars.org/p/133488/

https://www.biostars.org/p/115674/

Getting an RPKM expression matrix with RNA-SeQC

ulwvfje — Thu, 14 Jan 2016 12:40:14 +0000

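This tool does more than QC: it also computes an RPKM value for every gene! Large consortium projects such as GTEx compute their published RPKM matrices with it.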

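1. Installation

Download the Java version from the official site and it is ready to use: http://www.broadinstitute.org/cancer/cga/tools/rnaseqc/RNA-SeQC_v1.1.8.jar

You do, however, need to download quite a lot of annotation data.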

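2. Input data

Every file marked with an arrow is required; not one can be left out. The only one I did not use is rRNA.tar, because the software can be run in two modes and I used the first (without the BWA-based rRNA estimation).

3. Running the software

The official site gives examples that are easy to follow: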

RNA-SeQC can be run with or without a BWA-based rRNA level estimation mode. To run without (less accurate, but faster) use the command:
java -jar RNASeQC.jar -n 1000 -s "TestId|ThousandReads.bam|TestDesc" -t gencode.v7.annotation_goodContig.gtf -r Homo_sapiens_assembly19.fasta -o ./testReport/ -strat gc -gc gencode.v7.gc.txt

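This is the example I used. In every file this example needs, the chromosome names have no "chr" prefix. This is very important!!! The BAM, the GTF annotation and the reference FASTA all have to use the same naming.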
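A quick way to catch a naming mismatch before a long run is to compare the sequence names in the GTF with those in the reference FASTA. The sketch below uses only the standard library; the helper names are my own, and it assumes plain-text (uncompressed) GTF and FASTA files.

```python
def fasta_contigs(path):
    """Collect sequence names from the '>' header lines of a FASTA file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])   # name is the first word of the header
    return names


def gtf_contigs(path):
    """Collect sequence names from the first column of a GTF file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue                   # skip comments and blank lines
            names.add(line.split("\t", 1)[0])
    return names


def missing_from_fasta(gtf_path, fasta_path):
    """Return GTF sequence names that do not occur in the FASTA, sorted."""
    return sorted(gtf_contigs(gtf_path) - fasta_contigs(fasta_path))
```

For the files in this post the call would be `missing_from_fasta("gencode.v7.annotation_goodContig.gtf", "Homo_sapiens_assembly19.fasta")`; an empty list means every annotated contig exists in the reference. For the BAM side, the `@SQ` lines printed by `samtools view -H` show the names it uses.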

My command was as follows:

java -jar RNA-SeQC_v1.1.8.jar \
-n 1000 \
-s "TestId|ThousandReads.bam|TestDesc" \
-t gencode.v7.annotation_goodContig.gtf \
-r ~/ref-database/human_g1k_v37/human_g1k_v37.fasta \
-o ./testReport/ \
-strat gc \
-gc gencode.v7.gc.txt

To run the more accurate but slower, BWA-based method :
java -jar RNASeQC.jar -n 1000 -s "TestId|ThousandReads.bam|TestDesc" -t gencode.v7.annotation_goodContig.gtf -r Homo_sapiens_assembly19.fasta -o ./testReport/ -strat gc -gc gencode.v7.gc.txt -BWArRNA human_all_rRNA.fasta
Note: this assumes BWA is in your PATH. If this is not the case, use the -bwa flag to specify the path to BWA

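4. Interpreting the results

It takes a while to run: even the test data with only a thousand reads took 10 minutes!

It produces a whole pile of output files. The official site explains them in detail, but the most important ones are of course the RPKM values, plus the QC information.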
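To pull the RPKM values into a script, a small reader like the one below may help. It assumes the RPKM table is written in the Broad GCT format (a version line, a dimensions line, then a tab-separated table with Name and Description columns followed by one column per sample). I believe the file in the output directory is called genes.rpkm.gct, but that name is from memory, so check your own output. `read_gct` is my own helper, not part of RNA-SeQC.

```python
def read_gct(path):
    """Read a GCT file into (sample_names, {row_name: [values per sample]})."""
    with open(path) as handle:
        handle.readline()                          # version line, e.g. "#1.2"
        handle.readline()                          # dimensions line: rows, columns
        header = handle.readline().rstrip("\n").split("\t")
        samples = header[2:]                       # after Name and Description
        table = {}
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue                           # skip blank or truncated lines
            table[fields[0]] = [float(x) for x in fields[2:]]
    return samples, table
```

Usage would look like `samples, rpkm = read_gct("./testReport/genes.rpkm.gct")`, after which `rpkm[gene_id]` holds one value per sample.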

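Public consortium data ships with expression matrices produced by RNA-SeQC! The processing notes below mention per-tissue normalization, which matches how the GTEx project describes its RPKM data: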

Expression

RPKM data are used as produced by RNA-SeQC.
Filter on >=10 individuals with >0.1 RPKM and raw read counts greater than 6.
Quantile normalization was performed within each tissue to bring the expression profile of each sample onto the same scale.
To protect from outliers, inverse quantile normalization was performed for each gene, mapping each set of expression values to a standard normal.
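Two of the steps quoted above can be sketched in plain Python. This is my own reading of the description, not the project's actual code: I take the filter to mean that a gene is kept when at least 10 individuals have both RPKM > 0.1 and a raw count > 6, and for the per-gene transform I use the common rank-based form with a (rank - 0.5)/n offset, which is an assumption. The between-sample quantile normalization step is left out.

```python
from statistics import NormalDist


def passes_filter(rpkm_values, raw_counts, min_individuals=10,
                  min_rpkm=0.1, min_count=6):
    """Keep a gene if enough individuals exceed both thresholds (assumed reading)."""
    n_ok = sum(1 for r, c in zip(rpkm_values, raw_counts)
               if r > min_rpkm and c > min_count)
    return n_ok >= min_individuals


def inverse_normal_transform(values):
    """Map one gene's values to standard-normal quantiles by rank.

    Ties are broken by order of appearance; the 0.5 offset is an assumption.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()                      # mean 0, standard deviation 1
    result = [0.0] * n
    for rank, index in enumerate(order, start=1):
        result[index] = std_normal.inv_cdf((rank - 0.5) / n)
    return result


# Toy gene measured in three individuals: the middle value maps to about 0.
print(inverse_normal_transform([3.0, 1.0, 2.0]))
```

Because only ranks enter the transform, a single extreme RPKM value ends up no further out than the largest normal quantile, which is the outlier protection the quoted text refers to.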

The software's home pages:

http://www.broadinstitute.org/cancer/cga/rnaseqc_run

http://www.broadinstitute.org/cancer/cga/rnaseqc_download

Help file: http://www.broadinstitute.org/cancer/cga/sites/default/files/data/tools/rnaseqc/RNA-SeQC_Help_v1.1.2.pdf