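Oct 16

The most comprehensive collection of transcriptome analysis software

Posted on October 16, 2015 by ulwvfje

Stumbling on this site was a pure accident. Looking at it now, foreign authors really are meticulous: this software list clearly took a great deal of effort and shows real dedication. From: https://en.wiki2.org/wiki/List_of_RNA-Seq_bioinformatics_tools

The software on that page covers the 18 areas of transcriptome analysis listed below. Reading through it made me realize how far I still have to go. In my mind, transcriptome analysis meant differential expression, then annotation, at most some fusion gene analysis, or else a look at miRNA and lncRNA. I had no idea there was so much more to it. No wonder biology is called a bottomless pit: however many researchers arrive, there are always enough research directions to go around.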

1 Quality control and pre-processing data

1.1 Quality control and filtering data

1.2 Detection of chimeric reads

1.3 Errors Correction

1.4 Pre-processing data

2 Alignment Tools

2.1 Short (Unspliced) aligners

2.2 Spliced aligners

2.2.1 Aligners based on known splice junctions (annotation-guided aligners)

2.2.2 De novo Splice Aligners

2.2.2.1 De novo Splice Aligners that also use annotation optionally

2.2.2.2 Other Spliced Aligners

3 Normalization, Quantitative analysis and Differential Expression

3.1 Multi-tool solutions

4 Workbench (analysis pipeline / integrated solutions)

4.1 Commercial Solutions

4.2 Open (free) Source Solutions

5 Alternative Splicing Analysis

5.1 General Tools

5.2 Intron Retention Analysis

6 Bias Correction

7 Fusion genes/chimeras/translocation finders/structural variations

8 Copy Number Variation identification

9 RNA-Seq simulators

10 Transcriptome assemblers

10.1 Genome-Guided assemblers

10.2 Genome-Independent (de novo) assemblers

10.2.1 Assembly evaluation tools

11 Co-expression networks

12 miRNA prediction

13 Visualization tools

14 Functional, Network & Pathway Analysis Tools

15 Further annotation tools for RNA-Seq data

16 RNA-Seq Databases

17 Webinars and Presentations

18 References


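More than 3,000 rice whole-genome sequencing datasets shared, mainly variant data

Posted on October 16, 2015 by ulwvfje

The more bioinformatics I learn lately, the more I feel that the era of big data has arrived. Much of today's research can in fact be done from home, since most large datasets are public. Still, what one person can achieve working in isolation is limited, and the exchange of ideas with others remains important.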

https://aws.amazon.com/cn/blogs/aws/new-aws-public-data-set-3000-rice-genome/

https://aws.amazon.com/cn/public-data-sets/3000-rice-genome/

https://wiki.dnanexus.com/Featured-Projects/3000-rice-genomes

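These pages describe more than 3,000 whole-genome sequencing datasets for rice, all shared on the Amazon cloud. The data are paired-end whole-genome reads from 3,024 rice accessions, aligned to five different rice reference genomes, with variants called mainly using GATK.

The data providers also give a standard SNP-calling pipeline.

I used to run much the same pipeline myself: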
SNP Pipeline Commands

1. Index the reference genome using bwa index

   /software/bwa-0.7.10/bwa index /reference/japonica/reference.fa

2. Align the paired reads to reference genome using bwa mem. 
   Note: Specify the number of threads or processes to use using the -t parameter. The possible number of threads depends on the machine where the command will run.

   /software/bwa-0.7.10/bwa mem -M -t 8 /reference/japonica/reference.fa /reads/filename_1.fq.gz /reads/filename_2.fq.gz > /output/filename.sam

3. Sort SAM file and output as BAM file

   java -Xmx8g -jar /software/picard-tools-1.119/SortSam.jar INPUT=/output/filename.sam OUTPUT=/output/filename.sorted.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE

4. Fix mate information

   java -Xmx8g -jar /software/picard-tools-1.119/FixMateInformation.jar INPUT=/output/filename.sorted.bam OUTPUT=/output/filename.fxmt.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE

5. Mark duplicate reads

   java -Xmx8g -jar /software/picard-tools-1.119/MarkDuplicates.jar INPUT=/output/filename.fxmt.bam OUTPUT=/output/filename.mkdup.bam METRICS_FILE=/output/filename.metrics VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000

6. Add or replace read groups

   java -Xmx8g -jar /software/picard-tools-1.119/AddOrReplaceReadGroups.jar INPUT=/output/filename.mkdup.bam OUTPUT=/output/filename.addrep.bam RGID=readname PL=Illumina SM=readname CN=BGI VALIDATION_STRINGENCY=LENIENT SO=coordinate CREATE_INDEX=TRUE

7. Create index and dictionary for reference genome

   /software/samtools-1.0/samtools faidx /reference/japonica/reference.fa
   
   java -Xmx8g -jar /software/picard-tools-1.119/CreateSequenceDictionary.jar REFERENCE=/reference/japonica/reference.fa OUTPUT=/reference/japonica/reference.dict

8. Realign Target 

   java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /output/filename.addrep.bam -R /reference/japonica/reference.fa -o /output/filename.intervals -fixMisencodedQuals -nt 8

9. Indel Realigner

   java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T IndelRealigner -fixMisencodedQuals -I /output/filename.addrep.bam -R /reference/japonica/reference.fa -targetIntervals /output/filename.intervals -o /output/filename.realn.bam 

10. Merge individual BAM files if there are multiple read pairs per sample

   /software/samtools-1.0/samtools merge /output/filename.merged.bam /output/*.realn.bam

11. Call SNPs using Unified Genotyper

   java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /reference/japonica/reference.fa -I /output/filename.merged.bam -o filename.merged.vcf -glm BOTH -mbq 20 --genotyping_mode DISCOVERY -out_mode EMIT_ALL_SITES
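
Since the per-sample steps are identical apart from the file names, they lend themselves to a small wrapper. The following is only a dry-run sketch in Python, not part of the published pipeline: it builds the command strings for steps 2 to 5 without executing anything. The helper name `snp_pipeline_commands` and the sample name `sampleA` are made up for illustration; the tool paths and file layout are copied from the commands above, and `SORT_ORDER=coordinate` is passed to SortSam on the assumption that Picard requires an explicit sort order.

```python
# Dry-run sketch: build (but do not run) the shell commands for steps 2-5.
# Tool locations and file layout are copied from the commands in the post.
BWA = "/software/bwa-0.7.10/bwa"
PICARD = "java -Xmx8g -jar /software/picard-tools-1.119"
REF = "/reference/japonica/reference.fa"
COMMON = "VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE"


def snp_pipeline_commands(sample, threads=8, reads="/reads", out="/output"):
    """Return the commands for one sample (hypothetical helper) as strings."""
    prefix = f"{out}/{sample}"
    return [
        # Step 2: align the paired reads with bwa mem
        f"{BWA} mem -M -t {threads} {REF} "
        f"{reads}/{sample}_1.fq.gz {reads}/{sample}_2.fq.gz > {prefix}.sam",
        # Step 3: coordinate-sort the SAM file and write BAM
        f"{PICARD}/SortSam.jar INPUT={prefix}.sam OUTPUT={prefix}.sorted.bam "
        f"SORT_ORDER=coordinate {COMMON}",
        # Step 4: fix mate information
        f"{PICARD}/FixMateInformation.jar INPUT={prefix}.sorted.bam "
        f"OUTPUT={prefix}.fxmt.bam SO=coordinate {COMMON}",
        # Step 5: mark duplicate reads
        f"{PICARD}/MarkDuplicates.jar INPUT={prefix}.fxmt.bam "
        f"OUTPUT={prefix}.mkdup.bam METRICS_FILE={prefix}.metrics {COMMON} "
        "MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000",
    ]


for cmd in snp_pipeline_commands("sampleA"):
    print(cmd)
```

Printing the commands first makes it easy to check paths before handing them to a shell or a job scheduler.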


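An ongoing collection of NGS read alignment tools

Posted on October 16, 2015 by ulwvfje

I came across this site by chance; it is even more comprehensive and professional than the wiki list. It collects the better-known alignment tools currently available, lists them, and gives each a brief assessment. The collection is based mainly on a 2012 review, "Tools for mapping high-throughput sequencing data", which many people have probably read. Beginners in bioinformatics would do well to search the literature databases for reviews on keywords that interest them; reading widely never hurts.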

Features Comparison

The following Table enables a comparison of mappers based on different characteristics. The table can be sorted by column (just click on the column name). The data was collected from different sources and in some cases was provided by the developers. For execution times and memory requirements we refer to the above mentioned review (supplementary data is available here).


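Fetching a base sequence from chromosome start and end coordinates

Posted on October 16, 2015 by ulwvfje

This time I want to introduce a very handy tool. Often we have a chromosome name together with start and end positions, and we want to know the bases in that interval. The usual approach is to look it up in the UCSC Genome Browser, which returns a great deal of information, more than most people can fully digest. But what if all I want is the sequence?

Granted, we could download the roughly 3 GB hg19.fa file and write a script to extract the sequence, but that is a lot of trouble, and this kind of need is usually a one-off, so a much simpler method would be welcome.

The method I introduce here is very simple. It relies on a CGI program, but you do not need to write any code yourself: UCSC has already written the program, and all you have to do is construct the right URL. For example, for chr17:7676091,7676196 I only need to construct the following URL: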

http://genome.ucsc.edu/cgi-bin/das/hg38/dna?segment=chr17:7676091,7676196

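hg38 can be replaced with hg19, and the part after dna?segment= can be changed to any interval in the same format; the request then returns the sequence we want.

The page returns XML, which only needs to be parsed.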


<DASDNA>
<SEQUENCE>
<DNA>
aggggccaggagggggctggtgcaggggccgccggtgtaggagctgctgg
tgcaggggccacggggggagcagcctctggcattctgggagcttcatctg
gacctg
</DNA>
</SEQUENCE>
</DASDNA>

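Clearly, the aggggccaggagggggctggtgcaggggccgccggtgtaggagctgctgg tgcaggggccacggggggagcagcctctggcattctgggagcttcatctg gacctg in there is exactly the sequence we want.

Go ahead and give it a try.

Of course, you can query more than just DNA, and more than just the human genome.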
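
Building the URL and pulling the sequence out of the XML can be sketched in a few lines of Python using only the standard library. This is an illustration rather than an official client: the function names are made up, it assumes the endpoint still answers in the DASDNA format shown above, and it assumes the coordinates are 1-based and inclusive (which matches the 106 bases returned for 7676091 to 7676196).

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

DAS_BASE = "http://genome.ucsc.edu/cgi-bin/das"


def das_dna_url(chrom, start, end, assembly="hg38"):
    """Build the DAS URL for one interval, in the format used in the post."""
    return f"{DAS_BASE}/{assembly}/dna?segment={chrom}:{start},{end}"


def parse_das_dna(xml_text):
    """Return the bare sequence from a DASDNA response, without whitespace."""
    root = ET.fromstring(xml_text)
    dna = root.find(".//DNA")
    if dna is None or dna.text is None:
        raise ValueError("no <DNA> element found in the response")
    # The server wraps the sequence over several lines; join the pieces.
    return "".join(dna.text.split())


def fetch_das_dna(chrom, start, end, assembly="hg38"):
    """Download and parse one interval (needs network access)."""
    with urlopen(das_dna_url(chrom, start, end, assembly)) as resp:
        return parse_das_dna(resp.read())
```

Calling `fetch_das_dna("chr17", 7676091, 7676196)` would then be expected to return the sequence shown above as a single string, provided the server is reachable.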


X-DAS-Version: DAS/0.95
X-DAS-Status: 200
Content-Type:text
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: X-DAS-Version X-DAS-Status X-DAS-Capabilities

UCSC DAS Server.
See http://www.biodas.org for more info on DAS.
Try http://genome.ucsc.edu/cgi-bin/das/dsn for a list of databases.
See our DAS FAQ (http://genome.ucsc.edu/FAQ/FAQdownloads#download23)
for more information.  Alternatively, we also provide query capability
through our MySQL server; please see our FAQ for details
(http://genome.ucsc.edu/FAQ/FAQdownloads#download29).

Note that DAS is an inefficient protocol which does not support
all types of annotation in our database.  We recommend you
access the UCSC database by downloading the tab-separated files in
the downloads section (http://hgdownload.cse.ucsc.edu/downloads.html)
or by using the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables)
instead of DAS in most circumstances.


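Statistics in biology: the statistics collection published by Nature

Posted on October 16, 2015 by ulwvfje

In biology, about the only area with real technical depth and a barrier to entry is biostatistics, and it is also the pain point for the vast majority of researchers. Those who are up to it can read Nature's collection on statistics, which focuses on statistics as applied to the natural sciences.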

http://www.nature.com/collections/qghhqm

It includes a few memorable maxims on statistics:

Statistics does not tell us whether we are right. It tells us the chances of being wrong.

In other words, statistics cannot confirm that we are right; it only tells us how likely we are to be wrong.

Quality is often more important than quantity.

That is, the quality of the data often matters more than the amount of it.

The meaning of error bars is often misinterpreted, as is the statistical significance of their overlap.

Good experimental designs mitigate experimental error and the impact of factors not under study.

Article list:

Research methods: Know when your numbers are significant

Scientific method: Statistical errors

Weak statistical standards implicated in scientific irreproducibility

The fickle P value generates irreproducible results

Vital statistics

Experimental biology: Sometimes Bayesian statistics are better

A call for transparent reporting to optimize the predictive value of preclinical research

Power failure: why small sample size undermines the reliability of neuroscience

Basic statistical analysis in genetic case-control studies

Erroneous analyses of interactions in neuroscience: a problem of significance

Analyzing 'omics data using hierarchical models

Advantages and pitfalls in the application of mixed-model association methods

Quality control and conduct of genome-wide association meta-analyses

Circular analysis in systems neuroscience: the dangers of double dipping

A solution to dependency: using multilevel analysis to accommodate nested data

How does multiple testing correction work?

What is Bayesian statistics?

What is a hidden Markov model?

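The articles below really just cover the statistics topics found in any standard textbook, but once published in a Nature journal they suddenly seem far more impressive.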

Points of significance: Importance of being uncertain

Points of Significance: Error bars

Points of significance: Significance, P values and t-tests

Points of significance: Power and sample size

Points of Significance: Visualizing samples with box plots

Points of significance: Comparing samples part I

Points of significance: Comparing samples part II

Points of significance: Nonparametric tests

Points of significance: Designing comparative experiments

Points of significance: Analysis of variance and blocking

Points of Significance: Replication

Points of Significance: Nested designs

Points of Significance: Two-factor designs

Points of significance: Sources of variation

Points of Significance: Split plot design

Points of Significance: Bayes' theorem

Points of significance: Bayesian statistics

Points of Significance: Sampling distributions and the bootstrap

Points of Significance: Bayesian networks

A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.
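
The claim in the passage above, that low power also lowers the chance that a significant result is real, can be made concrete with the usual expression for the positive predictive value, PPV = (1 − β)R / ((1 − β)R + α), where 1 − β is the power, α the significance threshold and R the pre-study odds that an effect is real. A minimal Python sketch follows; the pre-study odds of 0.25 and the two power values are made-up numbers chosen only for illustration.

```python
def positive_predictive_value(power, alpha, prior_odds):
    """PPV = power * R / (power * R + alpha), with R the pre-study odds."""
    true_positive_rate = power * prior_odds
    return true_positive_rate / (true_positive_rate + alpha)


# Compare a low-powered and a well-powered study under the same assumptions.
for power in (0.2, 0.8):
    ppv = positive_predictive_value(power, alpha=0.05, prior_odds=0.25)
    print(f"power={power:.1f}  PPV={ppv:.2f}")
```

Under these assumed numbers, a significant result from the study with 20% power has only an even chance of reflecting a true effect, against 80% for the well-powered study.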


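A bioinformatician's path to learning MySQL

Posted on October 16, 2015 by ulwvfje

I have always known that MySQL is useful, even in bioinformatics. I had read quite a few MySQL tutorials on and off, but never had occasion to apply them. Practice is the best way to learn, and since I needed MySQL these past few days, I went over once more how someone working in bioinformatics should go about learning it.

Start with a Chinese-language tutorial: http://www.cnblogs.com/mr-wid/archive/2013/05/09/3068229.html

Then look up a handful of tips: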

https://dev.mysql.com/doc/refman/5.1/en/counting-rows.html

http://www.w3schools.com/sql/sql_func_count.asp

https://dev.mysql.com/doc/refman/5.0/en/pattern-matching.html

http://hahaxiao.techweb.com.cn/archives/477.html

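That is about enough to get started.

We are not using the database to build a website; all we need is to query data from public databases. Most people would choose to query through a web interface, or bulk-download from FTP and write their own scripts. I used to think the same way, and so felt MySQL was of little use, since anything it could do I could do with a script. But anything that has become this popular must have its merits.

As I see it, the advantage of MySQL is that there is no need to store large numbers of files: you query as you go. To keep a local copy of the data as files, we would have to create a pile of folders for refGene information, Entrez Gene information, transcripts, exons and so on, each folder holding a pile of files, and all of it stored separately per species. In short, it is a hassle, whereas with a database it is not.

For example, we can connect to the UCSC database (provided the mysql command runs on your machine and you have network access):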
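
The two techniques those links cover, counting rows and pattern matching, can be tried without any server. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (the SQL shown is the same in both for these simple queries). The `genes` table and its coordinates are invented purely for illustration and do not come from any real annotation.

```python
import sqlite3

# In-memory database with a tiny, made-up gene table.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE genes (name TEXT, chrom TEXT, tx_start INTEGER, tx_end INTEGER)"
)
con.executemany(
    "INSERT INTO genes VALUES (?, ?, ?, ?)",
    [
        ("TP53", "chr17", 100, 200),    # coordinates are placeholders
        ("BRCA1", "chr17", 300, 400),
        ("BRCA2", "chr13", 500, 600),
    ],
)

# Counting rows: how many genes per chromosome?
per_chrom = con.execute(
    "SELECT chrom, COUNT(*) FROM genes GROUP BY chrom ORDER BY chrom"
).fetchall()

# Pattern matching: gene names starting with BRCA.
brca = [
    row[0]
    for row in con.execute(
        "SELECT name FROM genes WHERE name LIKE 'BRCA%' ORDER BY name"
    )
]

print(per_chrom)
print(brca)
con.close()
```

The same `COUNT(*) ... GROUP BY` and `LIKE` queries can later be pointed at a real table once a MySQL connection is available.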

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A

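It is that simple: you are now logged in remotely to the UCSC database with mysql, and can run show databases; or use hg19; and so on.

There are more than two hundred databases, mostly covering different species and assembly versions. The hg19 database alone holds more than ten thousand tables, giving comprehensive coverage of hg19.

There are many other public databases to practice on.
From: https://www.biostars.org/p/474/#9095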

for example, I would cite:

UCSC http://genome.ucsc.edu/FAQ/FAQdownloads#download29
ENSEMBL http://uswest.ensembl.org/info/data/mysql.html
GO http://www.geneontology.org/GO.database.shtml#mirrors

1000 Genomes: since June 16, 2011: http://www.1000genomes.org/public-ensembl-mysql-instance

mysql -h mysql-db.1000genomes.org -u anonymous -P 4272

Flybase has direct access to its postgres chado database.
http://flybase.org/forums/viewtopic.php?f=14&t=114
hostname: flybase.org port: 5432 username: flybase password: no password database name: flybase
e.g. psql -h flybase.org -U flybase flybase

mysql -h database.nencki-genomics.org -u public
mysql -h useastdb.ensembl.org -u anonymous -P 5306

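You can log in to any of them to see what is inside and to practice MySQL syntax, although of the create, read, update and delete operations only reading (querying) is permitted.

We can then connect to the database from R, Perl or Python, which works quite well; I currently lean towards R.

So I skimmed the manual of the RMySQL package and connected successfully: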

# Equivalent command-line connection:
#   mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
# The -A flag is optional but is recommended for speed.

library(RMySQL)

my.host <- "genome-mysql.cse.ucsc.edu"
my.user <- "genome"   # public account; the password is left empty
my.db   <- "hg19"     # there are 203 databases, such as hg18, hg38, mm9, mm10, ce10

con <- dbConnect(MySQL(), host = my.host, user = my.user, dbname = my.db)

dbListTables(con)     # there are 11016 tables in this hg19 database

dbDisconnect(con)     # close the connection when finished

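Simple, isn't it? As long as you study carefully, these practical skills are all fairly easy.

The resource below is also good; it covers connecting to databases from R, Perl or Python quite thoroughly.

http://bioinformatics.risha.me/category/mysql/

And if you want to see how MySQL is applied in bioinformatics, there is plenty more study material below: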

http://www.biomedcentral.com/1471-2105/11/342

http://bioinformatics.oxfordjournals.org/content/28/14/1947.full.pdf

https://rostlab.org/owiki/images/7/73/Protocol_goldberg.pdf

http://webdoc.nyumc.org/nyumc/files/sun-lab/attachments/CPBI.Ch9.Biol.DB.pdf

http://www.bsi.umn.edu/resources/perl3.pdf

http://www.cs.toronto.edu/~leijiang/ta/mie453/tutorial/tut5/

This course is fairly comprehensive: Biological Databases in Bioinformatics (BioE 594)

http://bioinformatics.bioe.uic.edu/online/BioE594_db.shtml

For more advanced study, look at a concrete example, the design of the GO database: http://geneontology.org/page/lead-database-schema

Judging from this one, Python is much better than Perl: http://www.personal.psu.edu/iua1/courses/files/2010/week15.pdf


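You can actually sell TCGA data, as long as you do a little analysis on it

Posted on October 16, 2015 by ulwvfje

This was an eye-opener: so you can make money this way too.

The authors of this database published a paper in 2011 on how to find fusion genes: Edgren, Henrik, et al. "Identification of fusion genes in breast cancer by paired-end RNA-sequencing." Genome Biol 12.1 (2011): R6.

Building on that, they processed all the cancer sample data in the TCGA project, obtained a fusion gene dataset, and now sell it:

http://medisapiens.com/products/fusion-scout/fusionscout-cancer-datasets (from China the site seems to open only through a proxy)

The price goes up to about 10,000 euros, more than 70,000 yuan. It is a hugely profitable business, and since the TCGA data are public and free, the fact that they can make money after only secondary processing leaves me quite annoyed.

So far they have processed 7,625 cancer samples from the TCGA project, built a fusion gene dataset covering 28 cancer types, and packaged it as a product called FusionSCOUT for sale.

The prices are as follows: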

Pricing of FusionSCOUT datasets:

Single gene in one cancer set 490€ / 580$ per dataset
Single gene fusions across all cancers 4900€ / 5800$ dataset
Individual cancer set 990 € / 1250 $ per dataset
Full TCGA dataset 9900€ / 12500$ per dataset

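This is how the site describes the product. It claims that 3,500 research groups already use their data, but I think that is pure boasting: the paper itself has only a hundred-odd citations, and 3,500 purchases of this one product alone would make the author enormously rich, which is frightening to think about. Besides, the site is so poor and so slow to load from China that they have effectively lost all the wealthy Chinese customers, so how could they possibly have 3,500 sales? Ridiculous!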

One of the latest therapeutics angles in the fight against cancer is fusion genes and their regulation. To aid in fusion gene research and reveal the multitude of gene fusion event in cancer samples MediSapiens has developed a proprietary FusionSCOUT pipeline for identifying fusion genes from RNA sequencing datasets.

Currently we have analysed 7625 tumour samples from the TCGA project building a fusion gene dataset covering 28 different cancers within the TCGA project which can be accessed through our FusionSCOUT product.

Using this pipeline, we have discovered 3930 samples with gene fusions with 9667 different fusion genes. We´ve discovered numerous novel gene fusions as well as new cancer types in which previously known fusions appear.

You can now purchase these gene fusions datasets with few mouse clicks and get the worlds most comprehensive gene fusions from cancer sets within days

FusionSCOUT cancer Reports

With FusionSCOUT you can access the full listings of all fusion genes in specific cancer datasets. Find new leads for possible cause of the cancer, examine the pathways that are affected by different fusions, stratify patients by shared fusion genes or search for potential target for drugs and companion diagnostics.

Once you purchase a FusionSCOUT dataset we will send you a detailed report with information on the fused genes, sample ID from the TCGA dataset, fusion frequencies across the dataset as well as fusion mRNA sequences and lists of protein domains present in the fusion transcripts.

By ordering the MediSapiens FusionSCOUT dataset, you´ll get:

A list of all gene fusions that involve your gene of interest, across all TCGA cancer types
TCGA sample IDs for the samples with fusions
Exact exon junctions for the fusions, including alternatively spliced variants and data on whether reading frame is retained
Detailed list of protein domains retained in the fusion genes
cDNA sequence for the fusion mRNAs

Contact us to access the most up-to-date and comprehensive datasets of fusion gene events in different cancers! contact@medisapiens.com

Check out also our Fusion Gene Detection pipeline service for your samples!

Dataset missing? Email us and we'll add your favorite dataset to FusionSCOUT!

FusionSCOUT Cancer sets, March 2015

Cancer type	Number of samples	Number of fusion genes
Acute Myeloid Leukemia, LAML	153	69
Adrenocortical carcinoma, ACC	79	115
Bladder Urothelial Carcinoma, BLCA	273	473
Brain Lower Grade Glioma, LGG	467	309
Breast Invasive Carcinoma, BRCA	1029	3267
Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma, CESC	195	190
Colon Adenocarcinoma, COAD	287	212
Glioblastoma multiforme, GBM	170	379
Head and Neck Squamous Cell Carcinoma, HNSC	412	386
Kidney Chromophobe, KICH	66	19
Kidney Renal Clear Cell Carcinoma, KIRC	523	217
Kidney Renal Papillary Cell Carcinoma, KIRP	226	145
Liver Hepatocellular Carcinoma, LIHC	198	317
Lung Adenocarcinoma, LUAD	456	991
Lung Squamous Cell Carcinoma, LUSC	482	1374
Lymphoid Neoplasm Diffuse Large B-cell Lymphoma, DLBC	28	18
Mesothelioma, MESO	36	26
Ovarian Serous Cystadenocarcinoma, OV	420	1166
Pancreatic Adenocarcinoma, PAAD	84	46
Pheochromocytoma and Paraganglioma, PCPG	184	83
Prostate Adenocarcinoma, PRAD	336	859
Rectum Adenocarcinoma, READ	85	74
Sarcoma, SARC	161	799
Skin Cutaneous Melanoma, SKCM	355	620
Stomach Adenocarcinoma, STAD	190	311
Thyroid Carcinoma, THCA	506	195
Uterine Carcinosarcoma, UCS	57	229
Uterine Corpus Endometrial Carcinoma, UCEC	167	422

生信菜鸟团

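You are welcome to leave comments and join the discussion on the forum at biotrainee.com, or to follow the WeChat official account of the same name, biotrainee.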
