十一 01

WES(七)看de novo变异情况

de novo变异寻找这个也属于snp-calling的一部分,但是有点不同的就是该软件考虑了一家三口的测序文件,找de novo突变

软件有介绍这个功能:http://varscan.sourceforge.net/trio-calling-de-novo-mutations.html

而且还专门有一篇文章讲ASD和autism与de novo变异的关系,但是文章不清不楚的,没什么意思

Trio Calling for de novo Mutations

image001

Min coverage:   10

Min reads2:     4

Min var freq:   0.2

Min avg qual:   15

P-value thresh: 0.05

Adj. min reads2:        2

Adj. var freq:  0.05

Adj. p-value:   0.15

Reading input from trio.filter.mpileup

1371416525 bases in pileup file (137M的序列)

83123183 met the coverage requirement of 10 (其中有83M的测序深度大于10X)

145104 variant positions (132268 SNP, 12836 indel) (共发现15.5万的变异位点)

4403 were failed by the strand-filter

139153 variant positions reported (126762 SNP, 12391 indel)

502 de novo mutations reported (376 SNP, 126 indel) (真正属于 de novo mutations只有502个)

1734 initially DeNovo were re-called Germline

12 initially DeNovo were re-called MIE

3 initially DeNovo were re-called MultAlleles

522 initially MIE were re-called Germline

1 initially MIE were re-called MultAlleles

3851 initially Untransmitted were re-called Germline

然后我看了看输出的文件trio.mpileup.output.snp.vcf

软件是这样解释的:The output of the trio subcommand is a single VCF in which all variants are classified as germline (transmitted or untransmitted), de novo, or MIE.

  • FILTER - mendelError if MIE, otherwise PASS
  • STATUS - 1=untransmitted, 2=transmitted, 3=denovo, 4=MIE
  • DENOVO - if present, indicates a high-confidence de novo mutation call

里面的信息量好还是不清楚

我首先对我们拿到的trio.de_novo.mutaion.snp.vcf文件进行简化,只看基因型!
head status.txt   (顺序是dad,mom,child)
STATUS=2 0/0 0/1 0/1
STATUS=2 1/1 1/1 1/1
STATUS=2 0/1 0/0 0/1
STATUS=2 1/1 1/1 1/1
STATUS=1 0/1 0/0 0/0
STATUS=1 0/1 0/0 0/0
STATUS=2 1/1 1/1 1/1
STATUS=2 1/1 1/1 1/1
STATUS=2 1/1 1/1 1/1
STATUS=2 0/1 0/1 0/1
那么总结如下:
  26564 STATUS=1 无所以无 (0/0 0/1 0/0或者 0/1 0/0 0/0等等)
  97764 STATUS=2 有所以有 (1/1 1/1 1/1 或者0/1 0/1 1/1等等)
    385 STATUS=3 无中生有 (0/0 0/0 0/1 或者0/0 0/0 1/1)
   1485 STATUS=4 有中生无 (1/1 0/1 0/0 等等)

我用annovar注释了一下

/home/jmzeng/bio-soft/annovar/convert2annovar.pl -format vcf4  trio.mpileup.output.snp.vcf > trio.snp.annovar

/home/jmzeng/bio-soft/annovar/annotate_variation.pl -buildver hg19  --geneanno --outfile  trio.snp.anno trio.snp.annovar /home/jmzeng/bio-soft/annovar/humandb

结果是:

A total of 132268 locus in VCF file passed QC threshold, representing 132809 SNPs (86633 transitions and 46176 transversions) and 3 indels/substitutions

可以看到最后被注释到外显子上面的突变有两万多个

23794  284345 3123333 trio.snp.anno.exonic_variant_function

这个应该是非常有意义的,但是我还没学会后面的分析。只能先做到这里了

 

十一 01

WES(六)用annovar注释

使用annovar软件参考自:http://www.bio-info-trainee.com/?p=641

/home/jmzeng/bio-soft/annovar/convert2annovar.pl -format vcf4  Sample3.varscan.snp.vcf > Sample3.annovar

/home/jmzeng/bio-soft/annovar/convert2annovar.pl -format vcf4  Sample4.varscan.snp.vcf > Sample4.annovar

/home/jmzeng/bio-soft/annovar/convert2annovar.pl -format vcf4  Sample5.varscan.snp.vcf > Sample5.annovar

然后用下面这个脚本批量注释

image001

Reading gene annotation from /home/jmzeng/bio-soft/annovar/humandb/hg19_refGene.txt ... Done with 50914 transcripts (including 11516 without coding sequence annotation) for 26271 unique genes

最后查看结果可知,真正在外显子上面的突变并不多

23515 Sample3.anno.exonic_variant_function

23913 Sample4.anno.exonic_variant_function

24009 Sample5.anno.exonic_variant_function

annovar软件就是把我们得到的十万多个snp分类了,看看这些snp分别是基因的哪些位置,是否引起蛋白突变

downstream

exonic

exonic;splicing

intergenic

intronic

ncRNA_exonic

ncRNA_intronic

ncRNA_splicing

ncRNA_UTR3

ncRNA_UTR5

splicing

upstream

upstream;downstream

UTR3

UTR5

UTR5;UTR3

 

十一 01

WES(五)不同软件比较

主要是画韦恩图看看,参考:http://www.bio-info-trainee.com/?p=893

对合并而且过滤的高质量snp信息来看看四种不同的snp calling软件的差异

我们用R语言来画韦恩图

image001

可以看出不同软件的差异还是蛮大的,所以我只选四个软件的公共snp来进行分析

首先是sample3

image002

然后是sample4

image003

然后是sample5

image004

可以看出,不同的软件差异还是蛮大的,所以我重新比较了一下,这次只比较,它们不同的软件在exon位点上面的snp的差异,毕竟,我们这次是外显子测序,重点应该是外显子snp

image005

然后我们用同样的程序,画韦恩图,这次能明显看出来了,大部分的snp位点都至少有两到三个软件支持

所以,只有测序深度达到一定级别,用什么软件来做snp-calling其实影响并不大。

 

image006

 

十一 01

WES(四)不同个体的比较

3-4-5分别就是孩子、父亲、母亲

我对每个个体取他们的四种软件的公共snp来进行分析,并且只分析基因型,看看是否符合孟德尔遗传定律

image001

结果如下:

image002

粗略看起来好像很少不符合孟德尔遗传定律耶

然后我写了程序计算

image003

总共127138个可以计算的位点,共有18063个位点不符合孟德尔遗传定律,而且它们在染色体的分布情况如下

我检查了一下,不符合的原因,发现我把

chr1 100617887 C T:DP4=0,0,36,3 T:1/1:40 T:1/1:0,40:40 miss T:DP4=0,0,49,9 T:1/1:59 T:1/1:0,58:59 miss T:DP4=0,0,43,8 T:1/1:53 T:1/1:0,53:53 T:1/1:50

计算成了chr1 100617887 C 0/0 0/0 1/1 所以认为不符合,因为我认为只有四个软件都认为是snp的我才当作是snp的基因型,否则都是0/0

那么我就改写了程序,全部用gatk结果来计算。这次可以计算的snp有个176036,不符合的有20309,而且我看了不符合的snp的染色体分布,Y染色体有点异常

image004 image005

但是很失败,没什么发现!

 

 

十一 01

WES(三)snp-filter

其中freebayes,bcftools,gatk都是把所有的snp细节都call出来了,可以看到下面这些软件的结果有的高达一百多万个snp,而一般文献都说外显子组测序可鉴定约8万个变异!

image001

这样得到突变太多了,所以需要过滤。这里过滤的统一标准都是qual大于20,测序深度大于10。过滤之后的snp数量如下

image002

perl -alne '{next if $F[5]<20;/DP=(\d+)/;next if $1<10;next if /INDEL/;/(DP4=.*?);/;print "$F[0]\t$F[1]\t$F[3]\t$F[4]:$1"}' Sample3.bcftools.vcf >Sample3.bcftools.vcf.filter

perl -alne '{next if $F[5]<20;/DP=(\d+)/;next if $1<10;next if /INDEL/;/(DP4=.*?);/;print "$F[0]\t$F[1]\t$F[3]\t$F[4]:$1"}' Sample4.bcftools.vcf >Sample4.bcftools.vcf.filter

perl -alne '{next if $F[5]<20;/DP=(\d+)/;next if $1<10;next if /INDEL/;/(DP4=.*?);/;print "$F[0]\t$F[1]\t$F[3]\t$F[4]:$1"}' Sample5.bcftools.vcf >Sample5.bcftools.vcf.filter

 

perl -alne '{next if $F[5]<20;/DP=(\d+)/;next if $1<10;next unless /TYPE=snp/;@tmp=split/:/,$F[9];print "$F[0]\t$F[1]\t$F[3]\t$F[4]:$tmp[0]:$tmp[1]"}'  Sample3.freebayes.vcf > Sample3.freebayes.vcf.filter

perl -alne '{next if $F[5]<20;/DP=(\d+)/;next if $1<10;next unless /TYPE=snp/;@tmp=split/:/,$F[9];print "$F[0]\t$F[1]\t$F[3]\t$F[4]:$tmp[0]:$tmp[1]"}'  Sample4.freebayes.vcf > Sample4.freebayes.vcf.filter

perl -alne '{next if $F[5]<20;/DP=(\d+)/;next if $1<10;next unless /TYPE=snp/;@tmp=split/:/,$F[9];print "$F[0]\t$F[1]\t$F[3]\t$F[4]:$tmp[0]:$tmp[1]"}'  Sample5.freebayes.vcf > Sample5.freebayes.vcf.filter

 

perl -alne '{next if $F[5]<20;/DP=(\d+)/;next if $1<10;next if length($F[3]) >1;next if length($F[4]) >1;@tmp=split/:/,$F[9];print "$F[0]\t$F[1]\t$F[3]\t$F[4]:$tmp[0]:$tmp[1]:$tmp[2]"}'  Sample3.gatk.UG.vcf  >Sample3.gatk.UG.vcf.filter

perl -alne '{next if $F[5]<20;/DP=(\d+)/;next if $1<10;next if length($F[3]) >1;next if length($F[4]) >1;@tmp=split/:/,$F[9];print "$F[0]\t$F[1]\t$F[3]\t$F[4]:$tmp[0]:$tmp[1]:$tmp[2]"}'  Sample4.gatk.UG.vcf  >Sample4.gatk.UG.vcf.filter

perl -alne '{next if $F[5]<20;/DP=(\d+)/;next if $1<10;next if length($F[3]) >1;next if length($F[4]) >1;@tmp=split/:/,$F[9];print "$F[0]\t$F[1]\t$F[3]\t$F[4]:$tmp[0]:$tmp[1]:$tmp[2]"}'  Sample5.gatk.UG.vcf  >Sample5.gatk.UG.vcf.filter

 

perl -alne '{@tmp=split/:/,$F[9];next if $tmp[3]<10;print "$F[0]\t$F[1]\t$F[3]\t$F[4]:$tmp[0]:$tmp[3]"}' Sample3.varscan.snp.vcf >Sample3.varscan.snp.vcf.filter

perl -alne '{@tmp=split/:/,$F[9];next if $tmp[3]<10;print "$F[0]\t$F[1]\t$F[3]\t$F[4]:$tmp[0]:$tmp[3]"}' Sample4.varscan.snp.vcf >Sample4.varscan.snp.vcf.filter

perl -alne '{@tmp=split/:/,$F[9];next if $tmp[3]<10;print "$F[0]\t$F[1]\t$F[3]\t$F[4]:$tmp[0]:$tmp[3]"}' Sample5.varscan.snp.vcf >Sample5.varscan.snp.vcf.filter

这样不同工具产生的snp记录数就比较整齐了,我们先比较四种不同工具的call snp的情况,然后再比较三个人的区别。

然后写了一个程序把所有的snp合并起来比较

image003

得到了一个很有趣的表格,我放在excel里面看了看 ,主要是要看生物学意义,但是我的生物学知识好多都忘了,得重新学习了 image004

十一 01

WES(二)snp-calling

准备文件:下载必备的软件和参考基因组数据

1、软件

ps:还有samtools,freebayes和varscan软件,我以前下载过,这次就没有再弄了,但是下面会用到

2、参考基因组

3、参考 突变数据

第一步,下载数据

第二步,bwa比对

第三步,sam转为bam,并sort好

第四步,标记PCR重复,并去除

第五步,产生需要重排的坐标记录

第六步,根据重排记录文件把比对结果重新比对

第七步,把最终的bam文件转为mpileup文件

第八步,用bcftools 来call snp

第九步,用freebayes来call snp

第十步,用gatk     来call snp

第十一步,用varscan来call snp

下面的图片是按照顺序来的,我就不整理了

image001 image002 image003 image004 image005

image006 image007 image008 image009 image010 image011 image012 image013 image014 image015 image016 image017 image018 image019 image020 image021

 

十一 01

基因及外显子到底占基因组多少比例

很多文献都提到外显子组的序列仅占全基因组序列的1%左右,但大多数与疾病相关的变异位于外显子区。通过外显子组测序可鉴定约8万个变异,全基因组测序可鉴定300万个变异,因此与全基因组测序相比,外显子组测序不仅费用较低,数据阐释也更为简单。外显子组测序技术以其经济有效的优势广泛应用于孟德尔遗传病、罕见综合征及复杂疾病的研究,并于2010年被Science杂志评为十大突破之一。所以我一直想验证一下外显子到底占基因组什么样的百分比,是哪些位点。

首先我们拿到外显子记录和基因信息,下面是通过R来从bioconductor中心提取数据

library("TxDb.Hsapiens.UCSC.hg19.knownGene")

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

txdb

exon_txdb=exons(txdb)

genes_txdb=genes(txdb)

tmp=as.data.frame(exon_txdb)

write.table(tmp,'exon.pos',row.names=F)

tmp=as.data.frame(genes_txdb)

write.table(tmp,'gene.pos',row.names=F)

首先我们看看我刚才从TxDb.Hsapiens.UCSC.hg19.knownGene包里面提取的外显子记录文件exon.pos

image001

我简单用脚本统计了一下:perl -alne '{$tmp+=$F[3]} END{print $tmp}' exon.pos

共289970个外显子记录,长度共98759283bp,也就是98.5Mb的区域都是外显子,感觉有点不大对,毕竟真正的外显子也就三十多M的,所以必须有些外显子是并列关系,也就是有的外显子包含着另外一些外显子,然后我写脚本检查了一下,的确太多了,这样的重复!

image002

也可以去NCBI里面拿到 consensus coding sequence (CCDS)记录:ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/

那么我再看看NCBI里面拿到 consensus coding sequence (CCDS)记录CCDS.20150512.txt文件。

共33140行,一个基因一行,所以需要格式化成外显子文件,简单看了一下文件的列,很容易格式化。每一行的信息如下

1 NC_000001.11 SAMD11 148398 CCDS2.2 Public + 925941 944152 [925941-926012, 930154-930335, 931038-931088, 935771-935895, 939039-939128, 939274-939459, 941143-941305, 942135-942250, 942409-942487, 942558-943057, 943252-943376, 943697-943807, 943907-944152] Identical

用下面这个脚本格式化

image003

格式化后的文件如下

image004

外显子记录增加到了346493个,我再用单行命令计算长度,是56321786bp,这次只有56M了,还是有点大,所以应该是还有重复!

所以,我写了一个脚本来精确计算外显子的总长度,不能那样马马虎虎的用单行命令了。

image005

这次才是34729283bp,也就是约35M,根很多文献里面说的都一样!

同理,我统计了一下外显子的侧翼长度

54692160 外显子加上前后50bp

73066288  外显子加上前后100bp

90362533  外显子加上前后150bp

 

而hg19版本基因组里面有着entrez gene ID号的基因是23056个基因,所以我接下来探究一下这些基因的信息!

我们首先看看基因与基因之间的交叉情况

其中有12454266bp的位点,是多个外显子共有的,可能是一个基因的多个外显子,或者是不同基因的外显子

我们主要看看不同基因的交叉情况

不同基因交叉情况共516种,共涉及498种基因!

ZNF341   NFS1

ZNF397   ZSCAN30

ZNF428   SRRM5

ZNF436-AS1    RPL11

ZNF578   ZNF137P

ZNF619   ZNF621

ZNF816   ZNF816-ZNF321P

ZSCAN26  ZSCAN31

ZSCAN5A  ZNF542P

ZZEF1    ANKFY1

我随便在数据库里面搜索一下验证,发现这些基因的确是交叉重叠的

 

30

R语言也可以用hash啦

本来想着是如何习惯R里面的list对象,来实现hash的功能,无意中发现了居然还有这个包

https://cran.r-project.org/web/packages/hash/hash.pdf

我简单看了看文档,的确跟perl里面的hash功能类似,非常好用。

创建一个hash很简单

h <- hash( keys=letters, values=1:26 )

h <- hash( letters, 1:26 )

类似于下面的列表

L=setNames(as.list(LETTERS),1:26)

如果是列表要提取keys需要用names函数,如果需要提取values,需要用unlist函数,而用了hash之后,这两个函数就可以独自运行啦

还有很多其它函数,其实在list中也可以实现,主要是看你对哪种语法更熟悉,感觉自己现在编程能力差不多算是小有所成了,看什么都一样了,要是以前学R的时候看到了这个hash包,我肯定会很兴奋的!

clear signature(x = "hash"): Remove all key-value pairs from hash
del signature(x = "ANY", hash = "hash"): Remove specified key-value pairs from hash
has.key signature(key = "ANY", hash = "hash"): Test for existence of key
is.empty signature(x = "hash"): Test if no key-values are assigned
length signature(x = "hash"): Return number of key-value pairs from the hash
keys signature(hash = "hash"): Retrieve keys from hash
values signature(x = "hash"): Retrieve values from hash
copy signature(x = "hash"): Make a copy of a hash using a new environment.
format signature(x = "hash"): Internal function for displaying hash

 

30

R里面list对象和AnnDbBimap对象区别

AnnDbBimap对象是R里面的bioconductor系列包的基础对象,在探针数据里面会包装成ProbeAnnDbMap,跟go通路相关又是GOTermsAnnDbBimap对象。

但是它都是AnnDbBimap对象衍生过来的

主要存在于芯片系列的包和org系列的包,其实AnnDbBimap对象就是list对象的包装,比如下面这些例子:

ls("package:hgu133plus2.db")

 [1] "hgu133plus2"              "hgu133plus2.db"           "hgu133plus2_dbconn"
 [4] "hgu133plus2_dbfile"       "hgu133plus2_dbInfo"       "hgu133plus2_dbschema"
 [7] "hgu133plus2ACCNUM"        "hgu133plus2ALIAS2PROBE"   "hgu133plus2CHR"
[10] "hgu133plus2CHRLENGTHS"    "hgu133plus2CHRLOC"        "hgu133plus2CHRLOCEND"
[13] "hgu133plus2ENSEMBL"       "hgu133plus2ENSEMBL2PROBE" "hgu133plus2ENTREZID"
[16] "hgu133plus2ENZYME"        "hgu133plus2ENZYME2PROBE"  "hgu133plus2GENENAME"
[19] "hgu133plus2GO"            "hgu133plus2GO2ALLPROBES"  "hgu133plus2GO2PROBE"
[22] "hgu133plus2MAP"           "hgu133plus2MAPCOUNTS"     "hgu133plus2OMIM"
[25] "hgu133plus2ORGANISM"      "hgu133plus2ORGPKG"        "hgu133plus2PATH"
[28] "hgu133plus2PATH2PROBE"    "hgu133plus2PFAM"          "hgu133plus2PMID"
[31] "hgu133plus2PMID2PROBE"    "hgu133plus2PROSITE"       "hgu133plus2REFSEQ"
[34] "hgu133plus2SYMBOL"        "hgu133plus2UNIGENE"       "hgu133plus2UNIPROT"

那么我们可以随便挑选包中的一个数据分析一下

x <- hgu133plus2SYMBOL

xlist=as.list(x)

我们查看X对象,可知,它是object of class "ProbeAnnDbMap",而这个对象,就是常见的list对象,被包装了一下, 只有我们明白了它和list对象到底有什么区别,才算是真正搞懂了这一系列包

但是这个ProbeAnnDbMap对象,在GO等包里面会被包装成更复杂的对象-GOTermsAnnDbBimap,但是对他们的理解都大同小异。

我们先回顾一下在R语言里面的list的基础知识:

R 的 列表(list)是一个以对象的有序集合构成的对象。 列表中包含的对象又称为它的分量(components)。

分量可以是不同的模式或者类型, 如一个列表可以包括数值向量,逻辑向量,矩阵,复向量, 字符数组,函数等等

如果你会perl的话,可以把它理解成hash。

分量常常会被编号的(numbered),并且可以利用它来 访问分量

列表的分量可以被命名,这种情况下 可以通过名字访问。

特别要注意一下 Lst[[1]] 和 Lst[1] 的差别。 [[...]] 是用来选择单个 元素的操作符,而 [...] 是一个更为一般 的下标操作符。

因此前者得到的是列表 Lst 中的第一个对象, 并且含有分量名字的命名列表(named list)中分量的名字会被排除在外的。

后者得到的则是列表 Lst 中仅仅由第一个元素 构成的子列表。如果是命名列表, 分量名字会传给子列表的。

那么接下来我们就看看x和xlist的区别。

它们里面的数据都是affymetrix公司出品的人类的hgu133plus2芯片的探针ID与基因symbol的对应关系

如果我想拿到所有的探针ID,那么对于AnnDbBimap对象,需要用mappedkeys(x),对于普通的list对象,需要用names(xlist).

对于普通的list对象,如果我想看前几个元素,直接head就可以了,但是对于AnnDbBimap对象,数据被封装好了,需要先as.list,然后才能head

mapped_probes <- mappedkeys(x)

PID2=head(mapped_probes)

那么,如果我们想根据以下探针ID来查看它们在这些数据里面被对应着哪些基因symbol 呢?

> PID2 #这是一串探针ID,后面的操作都需要用的

[1] "1053_at"   "117_at"    "121_at"    "1255_g_at" "1316_at"   "1320_at"

如果是对于AnnDbBimap对象,我们可以用mget函数来操作,取多个探针的对应基因symbol

accnum <- mget(PID2, env=hgu133plus2ACCNUM);

gname <- mget(PID2, env=hgu133plus2GENENAME)

gsymbol <- mget(PID2, env=hgu133plus2SYMBOL)

mget函数返回的就是普通的list函数了,可以直接查看了。

如果是对于普通的list对象,我们想取多个探针的对应基因symbol也是非常简单的,xlist[PID2]即可。

那么我们不禁有问了,既然它们两个功能完全一样,何苦把它包装成一个对象了,我直接操作list对象不就好了,学那么多规矩干嘛?

所以,重点就来了

>  length(mapped_probes)

[1] 42125

> length(names(xlist))

[1] 54675

看懂了吗?

 

但是,事实上用处也不大,你觉得下面这两个有区别吗?

SYMBOL <- AnnotationDbi::as.list(hgu133plus2SYMBOL)

SYMBOL <- SYMBOL[!is.na(SYMBOL)];

 

x <- hgu133plus2SYMBOL

mapped_probes <- mappedkeys(x)

SYMBOL <- AnnotationDbi::as.list(x[mapped_probes])

 

PS,在R里面创建一个list对象是非常简单的

The setNames() built-in function makes it easy to create a hash from given key and value lists.(Thanks to Nick K for the better suggestion.)

Usage: hh <- setNames(as.list(values), keys)

Example:

players <- c("bob", "tom", "tim", "tony", "tiny", "hubert", "herbert")
rankings <- c(0.2027, 0.2187, 0.0378, 0.3334, 0.0161, 0.0555, 0.1357)
league <- setNames(as.list(rankings), players)

29

gene的各种ID转换终结者-bioconductor系列包

经常会有人问这样的问题I have list of 10,000 Entrez IDs and i want to convert the multiple Entrez IDs into the respective gene names. Could someone suggest me the way to do this?等等类似的基因转换,能做的基因转换的方法非常多,以前不懂编程的时候,都是用各种网站,而最常用的就是ensembl的biomart了,它支持的ID非常多,高达几百种,好多ID我到现在都不知道是什么意思。

现在学会编程了,我比较喜欢的是R的一些包,是bioconductor系列,一般来说,其中有biomart,org.Hs.eg.db,annotate,等等。关于biomart我就不再讲了,我前面的博客至少有七八篇都提到了它。本次我们讲讲简单的, 我就以把gene entrez ID转换为gene symbol 为例子把。

当然,首先要安装这些包,并且加载。

if("org.Hs.eg.db" %in% rownames(installed.packages()) == FALSE) {source("http://bioconductor.org/biocLite.R");biocLite("org.Hs.eg.db")}
suppressMessages(library(org.Hs.eg.db))  我比较喜欢这样加载包

library(annotate) #一般都是这样加载包

如果是用org.Hs.eg.db包,首先你只需要读取你的待转换ID文件,构造成一个向量,tmp,然后只需要symbols <- org.Hs.egSYMBOL[as.character(tmp)]就可以得到结果了,返回的symbols是一个对象,需要用toTable这个函数变成数据框。但是这样转换容易出一些问题,比如如果你的输入数据tmp,里面含有一些无法转换的gene entrez ID,就会报错。

而且它支持的ID转换很有限,具体看看它的说明书即可:https://www.bioconductor.org/packages/release/data/annotation/manuals/org.Hs.eg.db/man/org.Hs.eg.db.pdf

org.Hs.eg.db
org.Hs.eg_dbconn
org.Hs.egACCNUM
org.Hs.egALIAS2EG
org.Hs.egCHR
org.Hs.egCHRLENGTHS
org.Hs.egCHRLOC
org.Hs.egENSEMBL
org.Hs.egENSEMBLPROT
org.Hs.egENSEMBLTRANS
org.Hs.egENZYME
org.Hs.egGENENAME
org.Hs.egGO
org.Hs.egMAP
org.Hs.egMAPCOUNTS
org.Hs.egOMIM
org.Hs.egORGANISM
org.Hs.egPATH
org.Hs.egPMID
org.Hs.egREFSEQ
org.Hs.egSYMBOL
org.Hs.egUCSCKG
org.Hs.egUNIGENE
org.Hs.egUNIPROT

如果是用annotate包,首先你还是需要读取你的待转换ID文件,构造成一个向量,tmp,然后用getSYMBOL(as.character(tmp), data='org.Hs.eg')这样直接就返回的还是以向量,只是在原来向量的基础上面加上了names属性。说明书:http://www.bioconductor.org/packages/3.3/bioc/manuals/annotate/man/annotate.pdf

然后你可以把转换好的向量写出去,如下:

1 A1BG
2 A2M
3 A2MP1
9 NAT1
10 NAT2
12 SERPINA3
13 AADAC
14 AAMP
15 AANAT
16 AARS

PS:如果是芯片数据,需要把探针的ID转换成gene,那么一般还需要加载特定芯片的数据包才行:

platformDB <- paste(eset.mas5@annotation, ".db", sep="") #这里需要确定你用的是什么芯片
cat("the annotation is ",platformDB,"\n")
if(platformDB %in% rownames(installed.packages()) == FALSE) {source("http://bioconductor.org/biocLite.R");tmp=try(biocLite(platformDB))}
library(platformDB, character.only=TRUE)
probeset <- featureNames(eset.mas5)
rowMeans <- rowMeans(exprSet)

library(annotate) # lookUp函数是属于annotate这个包的
EGID <- as.numeric(lookUp(probeset, platformDB, "ENTREZID"))

29

使用GEOmetadb包来获取对应GEO数据的实验信息

理论上我前面提到的GEOquery包就可以根据一个GSE索引号来获取NCBI提供的所有关于这个GSE索引号的数据了,包括metadata,表达矩阵,soft文件,还有raw data

但是很多时候,那个metadata并不是很整齐,而且一个个下载太麻烦了,所以就需要用R的bioconductor的另一个神奇的包了GEOmetadb

它的示例:http://bioconductor.org/packages/devel/bioc/vignettes/GEOmetadb/inst/doc/GEOmetadb.R

里面还是很多数据库基础知识的
代码托管在github,它的示例代码是这样连接数据库的:
library(GEOmetadb)
if(!file.exists('GEOmetadb.sqlite')) getSQLiteFile()
file.info('GEOmetadb.sqlite')
con <- dbConnect(SQLite(),'GEOmetadb.sqlite')
dbDisconnect(con)
但是一般不会成功,因为这个包把它的GEOmetadb.sqlite文件放在了国外网盘共享,在国内很难访问,推荐大家想办法下载到本地
tmp2
用这个代码就会成功了,需要自己下载GEOmetadb.sqlite文件然后放在指定目录:/path/GEOmetadb.sqlite 需要自己修改
我们的diabetes.GEO.list文件内容如下:
GSE1009
GSE10785
GSE1133
GSE11975
GSE121
GSE12409
那么会产生的表格文件如下:共有32列数据信息,算是蛮全面的了
tmp
28

模拟Y染色体测序判断,并比对到X染色体上面,看同源性

首先下载两条染色体序列

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrX.fa.gz;

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrY.fa.gz;

152M Mar 21  2009 chrX.fa

58M Mar 21  2009 chrY.fa

然后把X染色体构建bwa的索引

bwa index chrX.fa

[bwa_index] Pack FASTA... 1.97 sec

[bwa_index] Construct BWT for the packed sequence...

[BWTIncCreate] textLength=310541120, availableWord=33850812

[BWTIncConstructFromPacked] 10 iterations done. 55838672 characters processed.

[BWTIncConstructFromPacked] 20 iterations done. 103157920 characters processed.

[BWTIncConstructFromPacked] 30 iterations done. 145211344 characters processed.

[BWTIncConstructFromPacked] 40 iterations done. 182584528 characters processed.

[BWTIncConstructFromPacked] 50 iterations done. 215797872 characters processed.

[BWTIncConstructFromPacked] 60 iterations done. 245313968 characters processed.

[BWTIncConstructFromPacked] 70 iterations done. 271543920 characters processed.

[BWTIncConstructFromPacked] 80 iterations done. 294853104 characters processed.

[bwt_gen] Finished constructing BWT in 88 iterations.

[bwa_index] 98.58 seconds elapse.

[bwa_index] Update BWT... 0.96 sec

[bwa_index] Pack forward-only FASTA... 0.91 sec

[bwa_index] Construct SA from BWT and Occ... 33.18 sec

[main] Version: 0.7.8-r455

[main] CMD: /lrlhps/apps/bioinfo/bwa/bwa-0.7.8/bwa index chrX.fa

[main] Real time: 141.623 sec; CPU: 135.605 sec

由于X染色体也就152M,所以很快,两分钟解决战斗!

然后模拟Y染色体的测序判断(PE100insert400

209M Oct 28 10:19 read1.fa

209M Oct 28 10:19 read2.fa

模拟的程序很简单

tmp

 

while(<>){
chomp;
$chrY.=uc $_;
}
$j=0;
open FH_L,">read1.fa";
open FH_R,">read2.fa";
foreach (1..4){
for ($i=600;$i<(length($chrY)-600);$i = $i+50+int(rand(10))){
$up = substr($chrY,$i,100);
$down=substr($chrY,$i+400,100);
next unless $up=~/[ATCG]/;
next unless $down=~/[ATCG]/;
$down=reverse $down;
$down=~tr/ATCG/TAGC/;
$j++;
print FH_L ">read_$j/1\n";
print FH_L "$up\n";
print FH_R ">read_$j/2\n";
print FH_R "$down\n";
}
}
close FH_L;
close FH_R;

然后用bwa mem 来比对

bwa mem -t 12 -M chrX.fa read*.fa >read.sam

用了12个线层,所以也非常快

[main] Version: 0.7.8-r455

[main] CMD: /apps/bioinfo/bwa/bwa-0.7.8/bwa mem -t 12 -M chrX.fa read1.fa read2.fa

[main] Real time: 136.641 sec; CPU: 1525.360 sec

643M Oct 28 10:24 read.sam

然后统计比对结果

samtools view -bS read.sam >read.bam

158M Oct 28 10:26 read.bam

samtools flagstat read.bam

3801483 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates

2153410 + 0 mapped (56.65%:-nan%)

3801483 + 0 paired in sequencing

1900666 + 0 read1

1900817 + 0 read2

645876 + 0 properly paired (16.99%:-nan%)

1780930 + 0 with itself and mate mapped

372480 + 0 singletons (9.80%:-nan%)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

我自己看sam文件也发现真的同源性好高呀,总共就模拟了380万reads,就有120万是百分百比对上了。

所以对女性个体来说,测序判断比对到Y染色体是再正常不过的了。如果要判断性别,必须要找那些X,Y差异性区段

23

生物信息小白如何自学编程

这本来是我在知乎上面看到的问题,所以就抽空回答了一下:http://www.zhihu.com/question/36701137/answer/68928111

首先,你懂得想去看源码,这是一个很好的兆头,一些非常正规的源码的确是编程进阶的的捷径,毕竟我们大部分人都不可能得到别人的手把手指导,所以只能靠自己的悟性了。

我就以我自己的经历来回答这个问题吧,我作为一个纯生物出身的小白,现在编程技术应该还算可以了!

首先,不管是哪个语言,perl,python,R,matlab都好,它们都有一堆的基础书籍,你必须以囫囵吞枣的心态看完一两本书(书没有好坏,别要我给你推荐书名),必须看完,了解编程基础。

接下来的步骤最重要,就是实践,不停的实践,在实践中运用编程技术,这样是学的最快的,不然你看再多的书也只是一个概念。

我这里重点推荐一个工具集,它实现了很多生物信息学需要的常用操作,网址是:Bioinformatics Tools
包含以下64中工具,而且网页也很清楚的描述了它们的功能,其实非常简单,但是这样写程序非常有效。
"Combines multiple FASTA entries into a single sequence."
"Returns the entire sequence contained in an EMBL file in FASTA format."
"Parses the feature table of an EMBL file and returns the feature sequences."
"Parses the feature table of an EMBL file and returns the protein translations."
"Removes non-DNA characters from text."
"Removes non-protein characters from text."
"Returns the entire sequence contained in a GenBank file in FASTA format."
"Parses the feature table of a GenBank file and returns the feature sequences."
"Parses the feature table of a GenBank file and returns the protein translations."
"Converts single letter amino acid codes to three letter codes."
"Reads a list of positions and ranges and returns those parts of a DNA sequence."
"Reads a list of positions and ranges and returns those parts of a protein sequence."
"Determines the reverse-complement, reverse, or complement of the sequence you enter."
"Separates bases according to codon position."
"Converts a FASTA sequence into multiple sequences."
"Converts three letter amino acid codes to one letter codes."
"Returns DNA sequence segments specified by a position and window size."
"Returns protein sequence segments specified by a position and window size."
"Plots codon frequency (according to the codon table you enter) for each codon in a DNA sequence."
"Returns a standard codon usage table."
"Returns a list of potential CpG islands."
"Calculates the molecular weight of DNA sequences."
"Returns positions of the patterns you enter."
"Returns basic sequence statistics."
"Returns sequences that are identical or similar to a query sequence."
"Returns sequences that are identical or similar to a query sequence."
"Accepts aligned sequences in FASTA format and calculates the identity and similarity of each sequence pair."
"Can be used to predict a DNA sequence in another species using a protein sequence alignment."
"Finds DNA sequences that can easily be converted to a restriction site."
"Determines the positions of open reading frames."
"Returns the optimal global alignment for two coding DNA sequences."
"Returns the optimal global alignment for two DNA sequences."
"Returns the optimal global alignment for two protein sequences."
"Returns a report describing PCR primer properties"
"Generates PCR products from a template and two primer sequences."
"Returns the grand average of hydropathy value of protein sequences."
"Returns the predicted isoelectric point of protein sequences."
"Calculates the molecular weight of protein sequences."
"Returns positions of the patterns you enter."
"Returns basic sequence statistics."
"Converts the sequence you enter into restriction fragments."
"Returns the number and positions of restriction sites."
"Can be used to convert protein into DNA."
"Returns the translation in the reading frame you specify."
"Colors a sequence alignment based on sequence conservation."
"Colors a protein alignment based on biochemical properties of residues."
"Numbers and groups DNA according to your specifications."
"Numbers and groups amino acids according to your specifications."
"Shows PCR primer annealing sites, translations, and restriction sites."
"Shows restriction sites and protein translations."
"Shows protein translations."
"Introduces random mutations into DNA sequences."
"Introduces random mutations into protein sequences."
"Generates a random coding sequence of the length you specify."
"Generates a random DNA sequence of the length you specify."
"Replaces regions of the DNA sequences you enter with random bases."
"Generates a random protein sequence of the length you specify."
"Replaces regions of the protein sequences you enter with random residues."
"Samples bases from a DNA sequence with replacement."
"Samples residues from a protein sequence with replacement."
"Randomly shuffles the DNA sequences you enter."
"Randomly shuffles the protein sequences you enter."
"IUPAC codes for DNA and protein."
"The genetic codes used in the Sequence Manipulation Suite."
当你实现完了这些需求,你不仅仅学会了编程,而且是学会了编程该如何应用在生物信息学里面!
用perl,python,R,matlab中的任何一种都可以实现,它们没有任何区别的,别纠结语言的问题。
不推荐初学者看源代码,因为源代码太正规了,定义变量就几十行代码了,再定义函数又是几百行代码,而真正学生物信息学的压根写代码都不超过五十行的,比如我上面提到那64个生物数据处理需求,一般就七八行代码就可以(在perl里面)
不信你可以看看这个github里面托管的代码:trinityrnaseq/util/misc at master · trinityrnaseq/trinityrnaseq · GitHub
里面有很多perl代码,都是实现各种数据转换的,写的非常正规,甚至能把一行代码就能解决的问题写成几百甚至上千行,除非你想把自己的代码拿去发文章或者出售,否则正常的生物信息学研究根本用不着!
当然,回到你最初的问题,哪里能找到源码呢?
首先,你可以去图书馆看一堆书籍,它们都会有光盘,下载既有视频又有源码,或者书上一般会说源码在哪里下载,比如这个pleac/include/perl at master · pleac/pleac · GitHub
然后,你可以找一大堆的生物信息学软件,它们一般都托管在github上面,这个链接里面有三百多个生物信息学转录组领域的软件:List of RNA-Seq bioinformatics tools
这个链接有几百个生物信息学里面做alignment的软件:
甚至连常见的生物信息学数据库也有自己的源码包:例如NCBI,ensembl,UCSC
下面就是ENSEMBL数据库的:NGS数据比对工具持续收集
(记住,这些软件都是人家发表文章的,非常难,你一辈子能搞定一个就很了不起了,比如我,就搞了一下bowtie,也是一知半解的)
分享了所有的代码,实在是太方便了:Ensembl Project · GitHub
可以跟着这些代码学习编程:Ensembl/ensembl-pipeline · GitHub
它的官网的帮助文档也特别详细:Help & Documentation
你现在还缺资料吗?

22

R语言中的排序,集合运算,reshape,以及merge总结

一直以来我都是随便看了点R的编程教程,因为我学了一点点C,所以还算有基础,现在基本上简单看看教程就能懂一门语言了,区别只是熟练度而已。R用得比较多,所以还算擅长,但是很多快捷应用的地方,我总是寄希望于到时候再查资料,所以没能用心的记住,这次花了点时间好好整理了一下R里面关于数据操作的重点,我想,以后再碰到类似的数据处理要求,应该很快能解决了把。

首先看看排序:

在R中,和排序相关的函数主要有三个:sort(),rank(),order()。

sort(x)是对向量x进行排序,返回值排序后的数值向量。

rank()是求秩的函数,它的返回值是这个向量中对应元素的“排名”。

order()的返回值是对应“排名”的元素所在向量中的位置。

其中sort(x)等同于x[order(x)]

下面以一小段R代码来举例说明:

> x<-c(97,93,85,74,32,100,99,67)

> sort(x)

[1]  32  67  74  85  93  97  99 100

> order(x)

[1] 5 8 4 3 2 1 7 6

> rank(x)

[1] 6 5 4 3 1 8 7 2

> x[order(x)]

[1]  32  67  74  85  93  97  99 100

其中比较有用的order,它可以用来给数据框进行排序

dat[order(dat[,1]),] 以该数据框的第一列进行排序

dat[order(dat[,1],dat[,2]),] 以该数据框的第一列为主要次序,第二列为次要序列进行排序

 

然后我们看看集合运算:

在R里面除了简单的对两个向量求交集并集补集之外,比较重要的就是match和 %in% 了,需要重点讲讲。

#首先对集合A,B,C赋值

> A<-1:10

> B<-seq(5,15,2)

> C<-1:5

> #求A和B的并集

> union(A,B)

[1]  1  2  3  4  5  6  7  8  9 10 11 13 15

> #求A和B的交集

> intersect(A,B)

[1] 5 7 9

> #求A-B

> setdiff(A,B)

[1]  1  2  3  4  6  8 10

> #求B-A

> setdiff(B,A)

[1] 11 13 15

> #检验集合A,B是否相同

> setequal(A,B)

[1] FALSE

> #检验元素12是否属于集合C

> is.element(12,C)

[1] FALSE

> #检验集合A是否包含C

> all(C%in%A)

[1] TRUE

> all(C%in%B)

从上面可以看到%in%这个操作符只返回逻辑向量TRUE 或者FALSE,而且返回值应该与%in%这个操作符前面的向量程度相等。也就是说它相当于遍历了C里面的一个个元素,判断它们是否在B中出现过,然后返回是或者否即可。

而match(C,B)的结果就很不一样了,它的返回结果同样与前面的向量等长,但是它并非返回逻辑向量,而是遍历了C里面的一个个元素,判断它们是否在B中出现过,如果出现就返回在B中的索引号,如果没有出现,就返回NA

>B

[1]  5  7  9 11 13 15

>C

[1] 1 2 3 4 5

>match(C,B)

[1] NA NA NA NA  1

>C%in%B

[1] FALSE FALSE FALSE FALSE  TRUE

 接下来我们看看reshape:

这是一个需要安装的包,起得就是R里面最经典的把长型数据变宽,和把宽数据拉长的作用。

其中melt函数是把很宽的数据拉长,它就是需要指定几列数据是保证不被融合的, 其余每一列数据都必须被融合到一列了,融合后的这一列数据每个元素旁边就用列名来标记该数据来自于哪一列。

# example of melt function
library(reshape)
mdata <- melt(mydata, id=c("id","time"))

融合后的数据如下:

id time variable value
1 1 x1 5
1 2 x1 3
2 1 x1 6
2 2 x1 2
1 1 x2 6
1 2 x2 5
2 1 x2 1
2 2 x2 4

可以看到variable列里面有两个不同的元素,说明是把旧数据中的两列给融合了,融合后的一个很长的列就是value

而cast函数的功能就是把刚才融合好的数据给还原。

# cast the melted data
# cast(data, formula, function)
subjmeans <- cast(mdata, id~variable, mean)
timemeans <- cast(mdata, time~variable, mean)

可以看到它们返回的结果是:

subjmeans

id x1 x2
1 4 5.5
2 4 2.5

timemeans

time x1 x2
1 5.5 3.5
2 2.5 4.5

可以看到cast函数比较复杂一点,formula公式右边的变量是需要拆开的variable,这一列变量有多少不重复元素,就新增多少列,左边的变量是行标记了,其余所有数据都会被计算一下再放在合适的位置。

在reshape2这个包里面还进化出来dcast和acast函数,功能大同小异。

 

最后我们来看看merge函数:

这个函数的功能非常强大,类似于SQL语句里面的join系列函数

测试数据如下,它们这两个表的连接是作者名

Capture1

 

如果要实现类似sql里面的inner join 功能,则用代码

m1 <- merge(authors, books, by.x = "surname", by.y = "name")

如果要实现left join功能则用代码

m1 <- merge(authors, books, by.x = "surname", by.y = "name",all.x=TRUE)

right join功能代码

m1 <- merge(authors, books, by.x = "surname", by.y = "name",all.y=TRUE)

all join功能代码

m1 <- merge(authors, books, by.x = "surname", by.y = "name",all=TRUE)

关于单变量匹配的总结就是这些,但对于多变量匹配呢,例如下面两个表,需要对k1,k2两个变量都相等的情况下匹配

x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)

y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)

匹配代码如下merge(x, y, by = c("k1","k2"))  #inner join

另外一个多行匹配的例子如下:

Capture2

我们的测试数据如上,这两个表的连接在于作者名。下面这两个语句等价

merge(authors, books, by=1:2)

merge(authors, books, by.x=c("FirstName", "LastName"),

by.y=c("AuthorFirstName", "AuthorLastName"),

all.x=TRUE)

都可以把两张表关联起来。

当然,在我搜索资料的时候,发现了另外一个解决问题的方法:

A[with(A, paste(C1, C2, sep = "\r")) %in% with(B, paste(C1, C2, sep="\r")), ]

 

 

参考:http://blog.sina.com.cn/s/blog_6caea8bf0100spe9.html

http://blog.sina.com.cn/s/blog_6caea8bf010159dt.html

http://blog.csdn.net/u014801157/article/details/24372441

http://www.statmethods.net/management/reshape.html

https://docs.tibco.com/pub/enterprise-runtime-for-R/1.5.0_may_2013/TERR_1.5.0_LanguageRef/base/merge.html

http://r.789695.n4.nabble.com/Matching-multiple-columns-in-a-data-frame-td890832.html

http://bbs.pinggu.org/thread-3234639-1-1.html

http://www.360doc.com/content/14/0821/13/18951041_403561544.shtml

http://developer.51cto.com/art/201312/423612.htm

22

用R和mysql把基因对应的最大表达量探针挑出来

懂点芯片数据分析的都应该知道,芯片设计的时候,对一个基因设计了多个探针,这样设计是为了更好的捕获某些难以发现的基因,或者重点研究某些基因。
但是对我们的差异分析不方便,所以我们只分析哪些有对应了entrez ID的探针,而且对每个entrez ID,我们只需要挑选它表达量最高的那个探针。
所以就演化为一个编程问题:分组求最大值,多公共列合并!
如果是在R语言里面,那么首先这个table的表示形式如下
> esetDataTable[1:10,c(7,8)]
EGID rowMeans
1000_at 5595 1840.04259751826
1001_at 7075 799.075414422572
1002_f_at 1557 50.4884096416177
1003_s_at 643 142.372008051308
1004_at 643 211.65300963049
1005_at 1843 4281.29318032004
1006_at 4319 38.5784289213085
1007_s_at NA 1489.98158531843
1008_f_at 5610 4013.576753977
1009_at 3094 3070.50648167305
我们首先看看一个R语言函数处理方式吧,这个是比较容易想到的算法,但是用到了循环,非常的不经济,计算量很大。

Capture1
因为里面涉及到了双重循环。我进行了人工计时,这个程序耗费了一分钟十二秒,程序里面的计时器有点问题。

然后我再讲一个精简版的算法
dat=esetDataTable[,c(7,8)]
dat=as.data.frame(apply(dat,2,as.numeric))
dat$prob=rownames(esetDataTable)
首先可以得到需要提取的数据所在行的两个值
match_row=aggregate(dat,by=list(dat[,1]),max)
类似于这句SQL语句:SELECT max(a.rowMeans) as val, a.EGID FROM test.esetDataTable a group by a.EGID
现在要根据match_row表去原表esetDataTable里面提取我们的探针ID数据。

Capture2

这样就完美解决了问题,我们的合并后的数据的prob列,就是前面那个函数计算了一分多钟的返回的非冗余探针ID向量,relevantProbesets,但是这次只花了不到一秒钟。
tmp_prob=merge(dat,match_row,by=c("EGID","rowMeans"))
setdiff(as.character(relevantProbesets),as.character(tmp_prob$prob))
length(union(as.character(relevantProbesets),as.character(tmp_prob$prob)))
setdiff(as.character(tmp_prob$prob),as.character(relevantProbesets))
我顺便检查了一下上面那个复杂的R函数跟我这次精简版的结果,发现这次才是对的,上面那个错了。
你们能发现上面那个为什么错了吗?

如果是在mysql里面它的表现形式如下;
mysql> select row_names,EGID,rowMeans from esetDataTable order by EGID limit 10;
+------------+-------+------------------+
| row_names | EGID | rowMeans |
+------------+-------+------------------+
| 38912_at | 10 | 85.8820246154773 |
| 41654_at | 100 | 301.720067595558 |
| 907_at | 100 | 273.100008206028 |
| 2053_at | 1000 | 354.207060199715 |
| 2054_g_at | 1000 | 33.8472900312781 |
| 40781_at | 10000 | 425.007640082848 |
| 35430_at | 10001 | 152.885791914329 |
| 35431_g_at | 10001 | 181.915087187117 |
| 35432_at | 10001 | 184.869017764782 |
| 31366_at | 10002 | 44.9716205901791 |
+------------+-------+------------------+
10 rows in set (0.05 sec)

如果要用SQL来做同样的事情需要下面这个语句,这个就非常简单啦!
select b.*
from test.esetDataTable b
inner join (
SELECT max(a.rowMeans) as val, a.EGID FROM test.esetDataTable a group by a.EGID
) as c on b.EGID=c.EGID and b.rowMeans=c.val
结果是:8640 rows in set (4.56 sec) 看起来mysql还没有R语言快速,但是这个mysql语句很明显也不是很高效,但是我的mysql水平有限。
稍微解释一下这个mysql语句,其中SELECT max(a.rowMeans) as val, a.EGID FROM test.esetDataTable a group by a.EGID类似于R里面的match_row=aggregate(dat,by=list(dat[,1]),max),就是把每个entrez ID对应着的最大值提取出来成一个新的表
而inner join on b.EGID=c.EGID and b.rowMeans=c.val 就是我们的tmp_prob=merge(dat,match_row,by=c("EGID","rowMeans")) 根据两列来合并两个数据框

21

一个MIT的博士要离开学术圈,结果引发了上千人的热烈讨论(下)

Dr Petra EichelsdoerferFebruary 17, 2014 at 11:17 AM

My university lacked the funds to support me while applying for my first grant after my post-doc. So I was unable to re-write it after its first rejection. Although it was heartbreaking to walk away, I had a family to support. This was nearly four years ago. How many others have had to make the same decision I have? What will be the long-term consequences?

我们大学也是资金紧张,所以我直到博士后出站才拿到资金的第一笔科研基金。大约四年前把,当我第一次被基金会拒绝后,我实在难以鼓起勇气再次申请基金。尽管我必须去申请,因为我还得养家糊口。会有多少人曾与我经历过同样的处境呢,又有多少人与我一样做了同样的选择呢? Continue reading

20

一个MIT的博士要离开学术圈,结果引发了上千人的热烈讨论(中)

的确是上千人进行讨论,我这里分两次翻译网友们的讨论结果:

Lenny TeytelmanFebruary 15, 2014 at 7:53 AM

The reason I did not want to publish this - a single voice is invariably dismissed. So, I want to assemble in a central place as many essays like this from students, postdocs, and professors. The funding crisis will not be addressed until the severity of it is acknowledged and NIH, politicians, and scientists are alarmed enough. Please e-mail me your stories to lenny at zappylab dot com (whether new or published elsewhere). I will put together a site aggregating all of them.

以前我不发表这个观点的原因是-势单力孤。所以我希望等到足够多的类似的观点由不同的学生,博士后,甚至教授提出来之后,把它们整合在一起。学术圈的资金危机很难解决,除非NIH本身意识到问题的严重性,而且最好是政客,科学家们也对此有着足够的警醒。所以请把你的经历发邮件给我,不管你以前是否也在其它什么场合抱怨过,我将对他们做一个汇总,统一发表出来。

BioluminessaFebruary 17, 2014 at 10:55 AM

Hi Lenny. Thank you for sharing your perspective! Here is another to add to your collection of essays. I wrote this piece the other day coming to similar conclusions. http://bioluminate.blogspot.com/2014/02/the-seven-stages-of-grief-for-academic.html

谢谢你分享你的观点,这里有个故事可能比较符合你的要求。

sheiselsewhereFebruary 25, 2014 at 11:54 AM

I often get told that I shouldn't be so negative and that things will get better. Unfortunately, I don't have the time to wait. Here is my contribution.

http://sheiselsewhere.mosdave.com/2014/02/16/singing-for-supper/

长久以来,人们就告诉我我不应该如此的悲观,一切都会好起来的。不幸的是,我已经没有足够的时间去等待事情好转了。下面是我的经历。

Dregev21February 28, 2014 at 2:13 PM

Wow, thank you for posting this! I have gone through a very similar situation and have also decided to quit pursuing this dream. I was a 4th year PhD student at the University of Florida (where I had already had to change labs since my first mentor moved to UAB) and my project was going nowhere fast. I also started seeing academia for what it has become; an industry of cheap labor and false hopes. But like you, I stayed in it for as long as I could because of my love for science, learning and teaching. I quit and got out with a MS degree this past November and I am very happy with my decision. I began working as a research coordinator at UF, making more money and like you felt liberated and free from the constant stress of graduate work and research. I believe most students come in to graduate scholl for the same reasons, but it has become so disheartening and scary, that it didn't seem worth it to me anymore. I think it is important for current students to know and understand that there are other things to do in life that are more fruitfull, less stressful and just as intelectually stimulating and rewarding. In any case, thank you for sharing!

哇,谢谢你的分享!我有着同样的经历,也准备放弃一直追寻的梦想了。我是佛罗里达大学的一名在读博士生,这是我博士生涯的第四年,而博士期间,我已经换过一次导师了,因为我的前导师去了UAB,所以我的博士课题也换了。就我的经历来看,我也认为学生圈现在充斥着廉价劳动力和虚假的期望。然而就像你一样,凭着对科学的一腔热血,我坚持了很长时间。直到去年十一月,我辍学了,就拿到了一个硕士学位,并且我非常开心我做了这样的决定。我的第一份工作是在UF做一个研究协调员,可以挣很多钱了,也终于从无止境的毕业研究课题压力中解放出来了,跟你一样的自由了。我觉得是时候让现在的学生知道在我们的生命中还有这非常多的更有意义,会收获更大,但却压力更小的事情可以做。

Travels with Moby March 1, 2014 at 12:58 PM

Good Luck to you. I made this choice for many similar reasons about 6 years ago. But I was a college professor at a small college. You have aptly described the scenario. Even without the added stress of the grant machine, the choices that we are forces to make that divorce ourselves from family, friends to pursue this academic dream are incredibly costly. What I did find after a year in industry, is that I was not alone, I met former academics in industry and elsewhere that have expressed the same concerns. I wish you the best, and you are not alone.

祝你好运!我也做了同样的决定,在六年前。但是我曾经是一个小学院的一名教授。你非常精准的描述了我们这样的教授的状态。即使没有基金方面的压力,我们这样的选择也使得我们无法兼顾家庭,朋友。追求学术的道理牺牲真的好大。当我离开学术圈,进入产业界的第一年,我终于不再觉得孤单了,我遇到以前学术圈的好友的时候也听到过他们有着类似的抱怨。不管怎么样,还是希望你能好好的,毕竟,我们都不孤单。

Kevin ZelnioMarch 3, 2014 at 8:51 AM

Good luck! Life is better outside academia lol. I left 2 phds (got my masters after first one) and a decade long research career with 12 and then a 5+ year science communication career, left the country and started a microbrewery in Sweden. My skills as a scientist have been instrumental in my new profession as a beer maker (serious lab and sanitation skills here!) and a business person (improved and more diverse funding sources! AKA investors and people who drink BEER - which is like everybody). I cried a lot, I won't lie. Almost wrecked my marriage and the stress turned me into a horrible father for a while. Its just not a sustainable career for some types of people. Which is a shame, because the career is selecting for the same type of people and missing out on a diversity of life styles which could most likely benefit the scientific community in a number of ways. Here was my story: http://deepseanews.com/2013/02/19294/

祝你好运!学术圈外面的世界更精彩,O(∩_∩)O哈哈哈~我有着长达12年的科学研究经历和多于5年的科学普及经历,终于离开了我的祖国,来到了瑞典开始酿酒。在我还是科学家的时候掌握的各种技能对我现在作为一个啤酒酿造师还是蛮有帮助的,尤其是严肃认真的实验习惯以及无菌操作技巧。那时候申请各种稀奇古怪的基金所锻炼出来的沟通技巧也特别适合自己现在转行做商人。我曾经为这样的转变哭泣过,我承认。这几乎摧毁了我的婚姻生活,也在一段时间让我变成了一个糟糕的父亲。对某些人来说,这样的职业不是一辈子的事情,而且是某种意义上的丢脸。以下是我的故事。

UnknownMarch 3, 2014 at 2:10 PM

I don't see why I should view your departure as
a bad sign for the life sciences. As an engineer,
we celebrate when our students graduate, go
start a company or join an existing one, and
create products that make the world a better
place. Or, go work at a national laboratory,
the FCC, a non-profit, or any of the other types of
jobs where engineers make a contribution.

我不认为,应该把你的离开看作是科学界的损失。作为工程师,我们欢迎毕业了的同学开公司或者加入已有的公司,去创造一些让人们生活更美好的产品。或者,干脆去国家实验室工作,或者FCC也行,一个非盈利组织,或者可以参加很多其它类似的,对工程师有需求的工作

Lenny TeytelmanMarch 3, 2014 at 2:26 PM

First of all, it is terrific that you are supportive of graduate students who go on to be productive outside of academia! Unfortunately, in life sciences, you often lose support of your mentor the second you say that you do not plan to be a tenure track professor.
Second, and most importantly - the reason our departures and anxieties are cause for concern - being a professor, in the current funding climate, requires a level of sacrifice for science that fewer and fewer of the most talented and brightest scientists will make. Our taxpayers spend an extraordinary amount funding research. If the best scientists leave academia, research will suffer. Training of the future scientists will suffer. Science, inside and outside of academia will suffer.

首先,像你这样的毕业生,走出学术圈之后还能创造如此多的产品,必然是非常了不起的。不幸的是,在生命科学领域,你经常就会失去你导师的资助,就像你博文里面提到的第二点那样,你也不打算去争取一个终身教职了。

其次,最重要的是我们的离开仅仅是因为焦虑,在如今的基金资助条件下,成为一个教授,需要一定程度的牺牲,只有极少数的非常聪明,非常有才智的科学家能做到了。我们的纳税人还是投入了大笔资金在科学研究的,如果大量的优秀科学家都离开了学术圈,那谈何学术成果呢。一个科学家的训练是需要耗费大量时间和金钱的,他们都将因此受损,同时,不管是学术圈里面还是外面都会蒙受巨大的损失。

gregMarch 3, 2014 at 3:07 PM

Your story collection is a great idea. I hope you'll keep the sources of the site open? I bet a lot of people would like to contribute to making that project stand out - I would certainly be helping out.

Thanks a ton for your blog post. Your last point about leaving your wife if she'd treated you as badly as science does is awesome. I'm just coming to the realization that you seem to already have: the notion that "If you can see yourself possibly loving any profession as much as you love science, you're not cut out for science" is unhealthy - it's a mark of the sort of brainwashing that academia does to you.

Best wishes on your future path.

你这个想把所有类似的学术圈失意的故事收集起来的想法很棒。我希望你能持续收集,并且保持开放。我也确信会有非常多的人参与进来贡献他们的故事,让这个计划传播开来。我也会尽我的努力去推动它。

能看到你这个博文真的是三生有幸。你最后那句话(如果你妻子对你像科学对你那样你肯定把她给甩了)简直是太精彩了。我也开始有着你曾经有过的想法了,如果你对追求科学还抱有疑问,是因为你的爱不够坚定,这样的想法是不对的,这只是学术圈对你的洗脑。

Jessica WilsonMarch 6, 2014 at 11:43 AM

This is fantastic writing, despite the sadness. I sympathize (finishing PhD in neuroscience, considering heading out).

I'd love to try and make a video with some of the stories you've accumulated. I'm already looking through that Google Doc you posted right now, and my heart is breaking.

文章写的真棒,尽管让人感到莫名的悲伤和失望。作为一个刚刚结束了神经科学学习课程的博士生,我深有同感。

我很乐意为你收集的这些故事拍一个video,我也看完了你在google doc里面所表达的观点,实在是太震撼了。

CBMarch 12, 2014 at 6:09 PM

Really great stuff. I have reread this post a dozen times over the past couple weeks, as I am a postdoc currently on the precipice of throwing in the towel on my academic career. I find the last sentence particularly meaningful. I can't shake the feeling that giving up on this career that I have been laser-focused on for ten years feels an awful lot like a traumatic breakup. But the simple truth is exactly as you described, academic science simply doesn't respect its professionals nearly enough for the best of us to stick around.

Ugh, breakups suck!

 

Michael RuddyMarch 14, 2014 at 1:55 PM

How appropriate for a Valentine post ... if you do not love everything about what you are doing – move on until you find it!

O(∩_∩)O哈哈~在情人节发表这样的观点实在是太适合不过了。如果你实在是不喜欢你正在做的事情,果断的放弃,持续寻找直到找到你所爱的。

Nick EffordFebruary 15, 2014 at 8:53 AM

I sympathise and wish you a successful and fulfilling future, wherever that takes you. The pressures in UK academia are much the same, as is the relatively low pay. We've seen our pay fall 13% taking into account inflation over the last 5 or 6 years, and universities refuse to offer a decent pay increase despite increasing their income from students and despite the fact that they are sitting on huge cash reserves. My own institution would rather spend £50 million on new buildings than reward its staff for their dedication.

Like you and countless others, I'm reluctant to leave a job that can be very exciting and stimulating. But the truth is that the stress levels make it increasingly unsustainable. There is constant pressure to write papers and secure research funding and simultaneous pressure to improve teaching quality, but there is a failure to recognise that time is a finite resource, so one activity must inevitably be traded off against the other.

I don't expect to receive the same remuneration as I would in industry, but I do need one of two things to happen: either working conditions need to improve or the pay needs to improve to reflect the real pressures of the job. I've sacrificed too many evenings and weekends over the years, and that has had a negative impact on personal physical and mental health as well as family relationships. If something doesn't give soon, I could well end up following you out of academia.

The trade union for academics in the UK is currently locked in a bitter pay dispute with the universities. You can find out more about it at http://fairpay.web.ucu.org.uk/

Good luck!

Nick

我也深有同感,也期望你有个成功而且精彩的未来,不管你走向哪个领域。英国的学术圈压力也很类似,因为相对来说基金资助量都很小。在过去的5到6年间,我们的资金总量,考虑到通货膨胀,反而缩减了13%,大学却拒绝对研究基金做一个比较像样的提高,尽管他们向学生收取的学费更多了,而且他们的现金流也非常健康。我所在的研究所情愿花五千万欧元区修建一个大楼,也不会去奖励那些辛辛苦苦奉献着的教职工。

像你以及无数的其他类似经历的人一样,我也不情愿离开这个即兴奋又刺激的研究工作。但是压力的与日俱增让我的研究工作越来越难以为继。我们既要保证发表足够的文章,还要争取研究基金的支持,同时还要提高我们的教学质量,但是我们不得不承认,一个人的时间是有限的,所以我们必然会在有些方面做得不够。

我不期望能得到与工业界相当的报酬,我仅仅是期望我们的工作条件能得到改善,并且我们的报酬水平应该能对得住我们实际上所承受的工作压力。长年以来,我已经付出了无数个夜晚和周末,而这对我个人的身心健康是一个很大的影响,同时也极大的影响了我的家庭关系。如果这一切不尽快改善的话,我想,我应该马上就会追寻你的脚步,离开这学术圈了。

20

一个MIT的博士要离开学术圈,结果引发了上千人的热烈讨论(上)

看到了一篇好文,所以就抽空翻译了一下!比较搞笑的是这个博客在国内居然被墙了,所以我就只贴出我的译文吧。

再见吧,我的学术生涯!

在过去的12年间,我一直从事着科学研究和教学工作,而且我也乐于其中。然而,就在几个星期以前,我辞掉了MIT的博士后职位,放弃了长久以来追寻的学术梦想。随后,我觉得非常的自由快乐,而这对于美国的生命科学研究来说,的确是一非常不好的消息。

Michael Eisen,我的联合导师之一,他毕业于伯克利大学,最近在博客中写道:这是一个做科学研究非常好的时代,但对科学家来说确实一个非常烂的时代。几个月前,我与我的另一个联合导师Jasper Rine讨论了NIH研究基金会的资助风波。Jasper说道:除非NIH立刻醒悟,基金的管理方式需要重要的改变,否则,我们这一代的科学家命运将非常坎坷。而我就是这些命运坎坷的科学家的一员,所以我出局了。

2001年 我大学毕业之后,我拒绝了一个高薪的程序员职位。相反,我选择了在著名的冷泉港实验室做生物信息学,尽管薪水要低很多。我个人是非常兴奋能有这个机会用各 种计算技术来做生物学研究的。两年后,我如愿以偿的进入了该领域的研究生队伍,方向是分子生物学,与此同时,我的薪水在接下来的六年间都只有刚开始的一半 了。而在MIT做博士后的期间,我的薪水也没能回到十年前我作为一个初级程序员的时候,尽管那时的我技能欠佳,也没有什么拿得出手的专业领域知识。在商业领域中,个人报酬也当与他掌握的技能以及知识水平相当,但是,在学术领域,它们二者之间的关联度大打折扣。

幸运的是,我的妻子一直都很支持我对科学研究的热爱,从2006年起,她就在做物理系的助教来平衡我们的财政支持。她工作的报酬还行,所以我们能负担的起在剑桥的房租和日常开销。而月内我的另一个孩子即将来到这个世上,我还有一个孩子在托儿所,所以我必须得改变自己,来改善我们的财务状况,做一个居家好男人总比在MIT做一个日夜忙碌的博士后要强。

早在我还是 一名研究生的时候,我就意识到了追求学术生涯的种种弊端。我接受了微博的薪水,忍受了换学校单位的不稳定性以及教授们那近乎疯狂的工作量。我也接受了变幻 莫测的天气,在熬过来了十多年的研究生和博士后的日子,我本可以顺理成章的拿到教职。我甚至也能接受,在追寻着光荣而神圣的目标的过程中,五年后有可能会 被拒绝,从而不得不再次搬迁去另一个学校再寻找教职。我也能想象到,即使拿到了教职,我也不得不投入我的全部身心来为我的课题组争取研究基金。我能看到所 有的一切我将为我追求所爱而付出的代价。

然而,本来应该是五年之后选择作为一个教授的烦恼却提取让我害怕了,因为拿到NIH的资助的不确定性太大了。没什么好说的了,一想到我所提交的所有研究基金申请书都会被打道回府,我就恐惧万分。我本应该靠着研究基金来提升自己,但是现在却看起来更糟糕了。就我所认识的十多名在近两三年拿到教授职位的科学家,他们没有一个人拿到了NIH的巨大的资金支持,他们得到的只有拒绝,再拒绝,进一步的拒绝。其中有一个非常年轻,且富有才华的教授,申请基金已经高达13次了,但是也是一直被拒绝,尽管她的申请书写的非常棒,而且立题也非常新颖。因此,她马上就要失去她的实验室了,因为她的启动资金即将耗尽,而这一切对她来说都犹如噩梦般,甚至需要安眠药帮助度过。这一切,又如何能令我不害怕呢?

在青少年时 期,我就一直认为工作应该像度过周末一样愉快,而我也一直痴迷于此。在过去的十二年中,我一直试图在学术领域寻找这一点,而且认为只有学术这一条路能达到 这一点。幸运的是,一年前,我与一个生命科学家合作开发一个开放的,实时更新的中央资料库,我非常享受开发过程的每一个步骤,而且我也很确定,在公司也能 得到在科学界能得到的那种全心全意投入的感觉。

我现在还不 确定科学家们是否会用到我们所创造的产品,也不知道我在公司能否就有充足资金来追寻我的梦想。一个星期前的辞职,我的确是冒了很大的风险。风险是很大,但 这并不是疯狂的行为。真正疯狂的是按部就班的执着于学术圈。我可以这样说,通过这一年的创造各种科学性的产品,我比以前更接近于教授了。

我也明白, 很多读者都会认为我是一种吃不到葡萄就说葡萄酸的心态,或者认为我对学术的渴望并不是想象中的那么强烈,我真的企图是变得富有。如果这也是你所认为的,那 么你其实是抱着要科学家什么都不去想,只安心的做一个简简单单的教授的想法。我的确是热爱过我所从事的研究教学工作,我也好想念那些美好的日子,尽管想起 来很受伤。但我也爱我的妻子,如果她像学术界对待科学家那样对待我,我早就离开她了。

原文地址:http://anothersb.blogspot.com/2014/02/goodbye-academia.html

我又翻墙把原文拷贝出来了,如下:

Goodbye Academia

I have enjoyed research and teaching for the last twelve years. Yet, I have resigned from my postdoctoral position at MIT a week ago, giving up on the dream of an academic position. I feel liberated and happy, and this is a very bad sign for the future of life sciences in the United States.

Michael Eisen, my co-advisor from graduate school at Berkeley recently wrote that it is a great time to do science but a terrible time to be a scientist. A few months ago I was discussing with my other co-adviser Jasper Rine the crisis in NIH research funding awards (better known as "lottery"). Jasper said that unless NIH wakes up and there is a major restructuring, we will lose an entire generation of scientists. I am a member of this generation, and I am out.

In 2001, about to graduate from college, I turned down a programming position at a hedge fund. Instead, I chose to do bioinformatics at Cold Spring Harbor Laboratory for a much lower salary. I was excited about the possibilities of doing biological research using computational tools. Two years later, I enthusiastically entered graduate school in molecular biology, with my salary dropping by half for the next six years. As a postdoctoral researcher at MIT, I am not even back to earning what I did ten years ago as a junior programmer with no skills or domain-specific knowledge. In a commercial setting, my compensation would have kept pace with my knowledge and skills, but in academia, there seems to be a complete decoupling of the two.

Luckily, my wife has always been supportive of my passion for science and balanced my foolhardiness with a practical job as a physician’s assistant since 2006. She is well compensated, allowing us to pay off our loans and afford the monthly expenses in Cambridge. With a daughter in daycare and another child due in a month, we would certainly be in a better financial shape with me as a stay-at-home dad than a postdoctoral scientist at MIT.

Science has also meant wrenching moves across the country. In 2003, we moved to California for me to begin my graduate studies. We both love New York, and my wife was devastated to leave her family and friends. In 2009, after many tearful discussions, she agreed to move to Boston from California for my postdoc. The next move for a professor position would surely require moving to yet another new place in the country.

As a graduate student, I was well aware of all of the negatives of an academic career. I accepted the miniscule pay, the inability to choose where to live, and the insane workloads of professors. I accepted the uncertainty of whether, after 10-12 years as a graduate student and postdoc, I would actually get a job as a professor. I accepted that even after attaining this lofty goal, five years later, I could be denied tenure and would have to move to another university or go into industry. I accepted that even with tenure, I would have to worry my entire life about securing research funding for the lab. I saw all of these as the price to pay for doing something that I love.

However, one aspect of being a professor has been terrifying me for over five years now – the uncertainty of getting funding from NIH. No let me rephrase that. What is terrifying is the near-certainty that any grant I submit would be rejected. I have been waiting for the funding situation to improve, but it seems to only be getting worse. I personally know about ten scientists who have become professors in the last 3-4 years. Not a single one of them has been able to get a grant proposal funded; just rejection, after rejection, after rejection. One of these is a brilliant young professor who has applied for grants thirteen times and has been rejected consistently, despite glowing reviews and high marks for innovation. She is on the brink of losing her lab as her startup funds are running out and the prospect of this has literally led to sleepless nights and the need for sleeping pills. How can this not terrify me?

I have been obsessed since my teens with the idea that work should be something one desires to come back to after a weekend. For the last twelve years, being an academic was the only path I saw toward this. Fortunately, a year ago, I co-founded a startup to create an open, up-to-date, central protocol repository for life scientists. I have enjoyed every step of getting ZappyLab going, and I am certain that the company will give me the feeling that I still get from science - wanting to go into work every day.

I don’t know yet if scientists will use what we are building. I don’t know if we will be able to raise the capital needed to build what I dream of building. By resigning from my postdoc a week ago, I have done something very risky. Risky, but not crazy. What seems crazy is aiming to stay in the academic track. I say this despite having had the most scientifically productive year of my life; I am closer to getting a professorship than ever before.

I realize that many will dismiss my story as a tale of sour grapes, or say that my desire is not strong enough or my primary motivation is to get rich. If that is your position, you are simply hoping that future scientists will be unable to love anything other than being a professor. I do love research and teaching with every fiber of my being. I will miss them and it will hurt. But I also love my wife, and if she had treated me the way academia treats its scientists, I would have left her long ago.

他们对生物信息学的讨论有点类似于我国的科学网,但是我国的科学网博主水平很有限
16

最全面的转录组研究软件收集

能看到这个网站真的是一个意外,现在看来,还是外国人比较认真呀, 这份软件清单,能看出作者的确是花了大力气的,满满的都是诚意。from: https://en.wiki2.org/wiki/List_of_RNA-Seq_bioinformatics_tools
https://en.wiki2.org/wiki/List_of_RNA-Seq_bioinformatics_tools软件主要涵盖了转录组分析的以下18个方向,看我我才明白自己的水平的确没到家,印象中的转录组分析也就是差异表达,然后注释以下,最多分析一下融合基因,要不然就看看那些miRNA,和lncRNA咯,没想到里面的学问也大着呢,怪不得生物是一个大坑,来再多的学者也不怕,咱有的是研究方向给你。

    1 Quality control and pre-processing data
        1.1 Quality control and filtering data
        1.2 Detection of chimeric reads
        1.3 Errors Correction
        1.4 Pre-processing data
    2 Alignment Tools
        2.1 Short (Unspliced) aligners
        2.2 Spliced aligners
            2.2.1 Aligners based on known splice junctions (annotation-guided aligners)
            2.2.2 De novo Splice Aligners
                2.2.2.1 De novo Splice Aligners that also use annotation optionally
                2.2.2.2 Other Spliced Aligners
    3 Normalization, Quantitative analysis and Differential Expression
        3.1 Multi-tool solutions
    4 Workbench (analysis pipeline / integrated solutions)
        4.1 Commercial Solutions
        4.2 Open (free) Source Solutions
    5 Alternative Splicing Analysis
        5.1 General Tools
        5.2 Intron Retention Analysis
    6 Bias Correction
    7 Fusion genes/chimeras/translocation finders/structural variations
    8 Copy Number Variation identification
    9 RNA-Seq simulators
    10 Transcriptome assemblers
        10.1 Genome-Guided assemblers
        10.2 Genome-Independent (de novo) assemblers
            10.2.1 Assembly evaluation tools
    11 Co-expression networks
    12 miRNA prediction
    13 Visualization tools
    14 Functional, Network & Pathway Analysis Tools
    15 Further annotation tools for RNA-Seq data
    16 RNA-Seq Databases
    17 Webinars and Presentations
    18 References