Daily Archives: October 16, 2015
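Shared whole-genome sequencing data for more than 3,000 rice genomes, mostly variant data
The more bioinformatics I pick up lately, the more I feel that the era of big data really has arrived. Many researchers today could actually do their work from home, since most of the large datasets are public. Still, what one person can achieve working in isolation is limited, and the exchange of ideas with other people remains important.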
I used to use the same kind of pipeline myself:

SNP Pipeline Commands

1. Index the reference genome using bwa index
    /software/bwa-0.7.10/bwa index /reference/japonica/reference.fa

2. Align the paired reads to the reference genome using bwa mem. Note: specify the number of threads or processes with the -t parameter. The possible number of threads depends on the machine where the command will run.
    /software/bwa-0.7.10/bwa mem -M -t 8 /reference/japonica/reference.fa /reads/filename_1.fq.gz /reads/filename_2.fq.gz > /output/filename.sam

3. Sort the SAM file and output it as a BAM file
    java -Xmx8g -jar /software/picard-tools-1.119/SortSam.jar INPUT=/output/filename.sam OUTPUT=/output/filename.sorted.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE

4. Fix mate information
    java -Xmx8g -jar /software/picard-tools-1.119/FixMateInformation.jar INPUT=/output/filename.sorted.bam OUTPUT=/output/filename.fxmt.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE

5. Mark duplicate reads
    java -Xmx8g -jar /software/picard-tools-1.119/MarkDuplicates.jar INPUT=/output/filename.fxmt.bam OUTPUT=/output/filename.mkdup.bam METRICS_FILE=/output/filename.metrics VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000

6. Add or replace read groups
    java -Xmx8g -jar /software/picard-tools-1.119/AddOrReplaceReadGroups.jar INPUT=/output/filename.mkdup.bam OUTPUT=/output/filename.addrep.bam RGID=readname PL=Illumina SM=readname CN=BGI VALIDATION_STRINGENCY=LENIENT SO=coordinate CREATE_INDEX=TRUE

7. Create the index and dictionary for the reference genome
    /software/samtools-1.0/samtools faidx /reference/japonica/reference.fa
    java -Xmx8g -jar /software/picard-tools-1.119/CreateSequenceDictionary.jar REFERENCE=/reference/japonica/reference.fa OUTPUT=/reference/reference.dict

8. Realign Target
    java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /output/filename.addrep.bam -R /reference/japonica/reference.fa -o /output/filename.intervals -fixMisencodedQuals -nt 8

9. Indel Realigner
    java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T IndelRealigner -fixMisencodedQuals -I /output/filename.addrep.bam -R /reference/japonica/reference.fa -targetIntervals /output/filename.intervals -o /output/filename.realn.bam

10. Merge individual BAM files if there are multiple read pairs per sample
    /software/samtools-1.0/samtools merge /output/filename.merged.bam /output/*.realn.bam

11. Call SNPs using Unified Genotyper
    java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /reference/japonica/reference.fa -I /output/filename.merged.bam -o filename.merged.vcf -glm BOTH -mbq 20 --genotyping_mode DISCOVERY -out_mode EMIT_ALL_SITES
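Running the steps by hand for many samples gets tedious, so here is a minimal Python sketch of how the first two steps (bwa index and bwa mem) might be scripted. The tool and reference paths are copied from the commands above; the function names, the `sample` argument, and the dry-run behavior are my own assumptions for illustration, not part of the original pipeline. The remaining steps could be appended to the list in the same way.

```python
import shlex
import subprocess

# Paths copied from the pipeline commands; adjust them for your own machine.
BWA = "/software/bwa-0.7.10/bwa"
REF = "/reference/japonica/reference.fa"


def alignment_steps(sample, threads=8, reads_dir="/reads", out_dir="/output"):
    """Return (argv, stdout_path) pairs for steps 1 and 2 (hypothetical helper)."""
    index = [BWA, "index", REF]
    mem = [BWA, "mem", "-M", "-t", str(threads), REF,
           f"{reads_dir}/{sample}_1.fq.gz", f"{reads_dir}/{sample}_2.fq.gz"]
    # bwa mem writes SAM to stdout, so the second step carries a redirect target.
    return [(index, None), (mem, f"{out_dir}/{sample}.sam")]


def run_steps(steps, dry_run=True):
    """Print each command line; also execute it when dry_run is False."""
    printed = []
    for argv, stdout_path in steps:
        line = shlex.join(argv)
        if stdout_path:
            line += " > " + shlex.quote(stdout_path)
        print(line)
        printed.append(line)
        if dry_run:
            continue
        if stdout_path:
            # Equivalent of the shell redirect "> file".
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)
        else:
            subprocess.run(argv, check=True)
    return printed


if __name__ == "__main__":
    # Dry run only: shows the commands without executing them.
    run_steps(alignment_steps("filename"))
```

With `dry_run=True` nothing is executed, which makes it easy to compare the generated lines against the commands listed above before committing to a real run.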
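An ongoing collection of NGS read-mapping tools
I stumbled on this website by accident, and it is more comprehensive and professional than the wiki list. It collects and lists the better-known mapping tools currently available and gives each a brief assessment. The collection is based mainly on a 2012 review, "Tools for mapping high-throughput sequencing data", which I expect many people have already read. Beginners in bioinformatics should search the literature databases for reviews on keywords that interest them and read widely; broad reading never hurts.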
<img src="http://www.ebi.ac.uk/~nf/hts_mappers/mappers_timeline.jpeg" alt="Mappers Timeline" width="800">
Features Comparison
The following table enables a comparison of mappers based on different characteristics. The table can be sorted by column (just click on the column name). The data was collected from different sources and in some cases was provided by the developers. For execution times and memory requirements we refer to the above-mentioned review (supplementary data is available here).
Retrieving a base sequence from chromosome start and end coordinates
X-DAS-Version: DAS/0.95
X-DAS-Status: 200
Content-Type:text
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: X-DAS-Version X-DAS-Status X-DAS-Capabilities

UCSC DAS Server. See http://www.biodas.org for more info on DAS. Try http://genome.ucsc.edu/cgi-bin/das/dsn for a list of databases. See our DAS FAQ (http://genome.ucsc.edu/FAQ/FAQdownloads#download23) for more information. Alternatively, we also provide query capability through our MySQL server; please see our FAQ for details (http://genome.ucsc.edu/FAQ/FAQdownloads#download29).

Note that DAS is an inefficient protocol which does not support all types of annotation in our database. We recommend you access the UCSC database by downloading the tab-separated files in the downloads section (http://hgdownload.cse.ucsc.edu/downloads.html) or by using the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables) instead of DAS in most circumstances.
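To make the coordinate lookup concrete, here is a small Python sketch that builds a DAS "dna" request and pulls the bases out of the XML reply. The URL pattern (`/das/<assembly>/dna?segment=chr:start,end`), the `hg19` assembly name, and the shape of the response (a `DASDNA` root containing `SEQUENCE/DNA`) are assumptions based on how the DAS protocol is commonly described, not something shown in the server message above, and the endpoint may no longer be served. The function names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Base URL taken from the server message above.
DAS_BASE = "http://genome.ucsc.edu/cgi-bin/das"


def das_dna_url(assembly, chrom, start, end):
    """Build a DAS 'dna' request URL (assumed 1-based, inclusive coordinates)."""
    return f"{DAS_BASE}/{assembly}/dna?segment={chrom}:{start},{end}"


def parse_das_dna(xml_text):
    """Return the sequence inside <DNA> as one string without whitespace."""
    root = ET.fromstring(xml_text)
    dna = root.find("./SEQUENCE/DNA")
    if dna is None or dna.text is None:
        raise ValueError("no <DNA> element found in the DAS response")
    # The server wraps the sequence over several lines; join them.
    return "".join(dna.text.split())


def fetch_sequence(assembly, chrom, start, end, timeout=30):
    """Download and parse one region; requires network access."""
    url = das_dna_url(assembly, chrom, start, end)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_das_dna(resp.read())


if __name__ == "__main__":
    # "hg19" is only an example; the dsn URL above lists the available databases.
    print(das_dna_url("hg19", "chr1", 100000, 100100))
```

As the server message itself warns, DAS is inefficient, so this approach is only sensible for a handful of short regions; for bulk work the downloadable files are the better route.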
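Nature's statistics collection: Statistics in biology
In biology, biostatistics is about the only area with real technical depth and a barrier to entry, and it is also a pain point for the vast majority of researchers. If you are up to it, have a look at Nature's collection of articles on statistics, which mainly covers statistics as applied to the natural sciences.
A bioinformatician's path to learning MySQL
Source: https://www.biostars.org/p/474/#9095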
for example, I would cite:
UCSC http://genome.ucsc.edu/FAQ/FAQdownloads#download29
ENSEMBL http://uswest.ensembl.org/info/data/mysql.html
GO http://www.geneontology.org/GO.database.shtml#mirrors
1000 Genomes: since June 16, 2011: http://www.1000genomes.org/public-ensembl-mysql-instance
mysql -h mysql-db.1000genomes.org -u anonymous -P 4272
Flybase has direct access to its postgres chado database.
http://flybase.org/forums/viewtopic.php?f=14&t=114
hostname: flybase.org
port: 5432
username: flybase
password: no password
database name: flybase
e.g. psql -h flybase.org -U flybase flybase
mysql -h database.nencki-genomics.org -u public
mysql -h useastdb.ensembl.org -u anonymous -P 5306
Simple, isn't it? As long as you study seriously, these practical skills are really quite easy.
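For anyone who wants to script these connections, the following Python sketch wraps the `mysql` command lines listed above. Host names, users, and ports are copied from those commands; the dictionary keys, the function names, and the use of the client's `-e` option to pass a query are my own choices for illustration. It assumes the `mysql` client is installed, and some of these public servers may have moved or been retired since 2015.

```python
import subprocess

# Connection details copied from the commands above; the keys are hypothetical labels.
PUBLIC_SERVERS = {
    "1000genomes": {"host": "mysql-db.1000genomes.org", "user": "anonymous", "port": 4272},
    "nencki": {"host": "database.nencki-genomics.org", "user": "public", "port": None},
    "ensembl_useast": {"host": "useastdb.ensembl.org", "user": "anonymous", "port": 5306},
}


def mysql_argv(name, query=None):
    """Build the mysql client argument list for one of the servers above."""
    server = PUBLIC_SERVERS[name]
    argv = ["mysql", "-h", server["host"], "-u", server["user"]]
    if server["port"] is not None:
        argv += ["-P", str(server["port"])]
    if query is not None:
        # -e runs a single statement and exits instead of opening a prompt.
        argv += ["-e", query]
    return argv


def run_query(name, query):
    """Run a query through the mysql client; requires network access."""
    result = subprocess.run(mysql_argv(name, query),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command line; call run_query() to actually connect.
    print(" ".join(mysql_argv("ensembl_useast", "SHOW DATABASES")))
```

Building the argument list separately from running it keeps the sketch easy to check without a network connection.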
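You can even sell TCGA data, as long as you analyze it a little first
Building on this, they processed all of the cancer sample data in the TCGA project, produced a fusion-gene dataset, and now sell it.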
Pricing of FusionSCOUT datasets:
- Single gene in one cancer set 490€ / 580$ per dataset
- Single gene fusions across all cancers 4900€ / 5800$ per dataset
- Individual cancer set 990€ / 1250$ per dataset
- Full TCGA dataset 9900€ / 12500$ per dataset
One of the latest therapeutic angles in the fight against cancer is fusion genes and their regulation. To aid fusion gene research and reveal the multitude of gene fusion events in cancer samples, MediSapiens has developed a proprietary FusionSCOUT pipeline for identifying fusion genes from RNA sequencing datasets.
Currently we have analysed 7625 tumour samples from the TCGA project building a fusion gene dataset covering 28 different cancers within the TCGA project which can be accessed through our FusionSCOUT product.
Using this pipeline, we have discovered 3930 samples with gene fusions with 9667 different fusion genes. We've discovered numerous novel gene fusions as well as new cancer types in which previously known fusions appear.
You can now purchase these gene fusion datasets with a few mouse clicks and get the world's most comprehensive gene fusions from cancer sets within days.
FusionSCOUT cancer Reports
With FusionSCOUT you can access the full listings of all fusion genes in specific cancer datasets. Find new leads for possible causes of the cancer, examine the pathways that are affected by different fusions, stratify patients by shared fusion genes, or search for potential targets for drugs and companion diagnostics.
Once you purchase a FusionSCOUT dataset we will send you a detailed report with information on the fused genes, sample ID from the TCGA dataset, fusion frequencies across the dataset as well as fusion mRNA sequences and lists of protein domains present in the fusion transcripts.
By ordering the MediSapiens FusionSCOUT dataset, you'll get:
- A list of all gene fusions that involve your gene of interest, across all TCGA cancer types
- TCGA sample IDs for the samples with fusions
- Exact exon junctions for the fusions, including alternatively spliced variants and data on whether reading frame is retained
- Detailed list of protein domains retained in the fusion genes
- cDNA sequence for the fusion mRNAs
Contact us to access the most up-to-date and comprehensive datasets of fusion gene events in different cancers! contact@medisapiens.com
Check out also our Fusion Gene Detection pipeline service for your samples!
Dataset missing? Email us and we'll add your favorite dataset to FusionSCOUT!
FusionSCOUT Cancer sets, March 2015
Cancer type | Number of samples | Number of fusion genes |
Acute Myeloid Leukemia, LAML | 153 | 69 |
Adrenocortical carcinoma, ACC | 79 | 115 |
Bladder Urothelial Carcinoma, BLCA | 273 | 473 |
Brain Lower Grade Glioma, LGG | 467 | 309 |
Breast Invasive Carcinoma, BRCA | 1029 | 3267 |
Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma, CESC | 195 | 190 |
Colon Adenocarcinoma, COAD | 287 | 212 |
Glioblastoma multiforme, GBM | 170 | 379 |
Head and Neck Squamous Cell Carcinoma, HNSC | 412 | 386 |
Kidney Chromophobe, KICH | 66 | 19 |
Kidney Renal Clear Cell Carcinoma, KIRC | 523 | 217 |
Kidney Renal Papillary Cell Carcinoma, KIRP | 226 | 145 |
Liver Hepatocellular Carcinoma, LIHC | 198 | 317 |
Lung Adenocarcinoma, LUAD | 456 | 991 |
Lung Squamous Cell Carcinoma, LUSC | 482 | 1374 |
Lymphoid Neoplasm Diffuse Large B-cell Lymphoma, DLBC | 28 | 18 |
Mesothelioma, MESO | 36 | 26 |
Ovarian Serous Cystadenocarcinoma, OV | 420 | 1166 |
Pancreatic Adenocarcinoma, PAAD | 84 | 46 |
Pheochromocytoma and Paraganglioma, PCPG | 184 | 83 |
Prostate Adenocarcinoma, PRAD | 336 | 859 |
Rectum Adenocarcinoma, READ | 85 | 74 |
Sarcoma, SARC | 161 | 799 |
Skin Cutaneous Melanoma, SKCM | 355 | 620 |
Stomach Adenocarcinoma, STAD | 190 | 311 |
Thyroid Carcinoma, THCA | 506 | 195 |
Uterine Carcinosarcoma, UCS | 57 | 229 |
Uterine Corpus Endometrial Carcinoma, UCEC | 167 | 422 |