16

3000多份水稻全基因组测序数据共享-主要是突变数据

感觉最近接触的生物信息学知识越多,越对大数据时代的到来更有同感了。现在的研究者,其实很多都可以自己在家里做了,大量的数据基本都是公开的, 但是一个人闭门造车成就真的有限,与他人交流的思想碰撞还是蛮重要的。

这里面列出了3000多份水稻全基因组测序数据,都共享在亚马逊云上面,是全基因组的双端测序数据,共3,024个水稻数据,比对到了五种不同的水稻参考基因组上面,而且主要是用GATK来找差异基因的。
而且,数据收集者还给出了一个snp calling的标准流程
我以前也是用这样的流程
SNP Pipeline Commands

1. Index the reference genome using bwa index

   /software/bwa-0.7.10/bwa index /reference/japonica/reference.fa

2. Align the paired reads to reference genome using bwa mem. 
   Note: Specify the number of threads or processes to use using the -t parameter. The possible number of threads depends on the machine where the command will run.

   /software/bwa-0.7.10/bwa mem -M -t 8 /reference/japonica/reference.fa /reads/filename_1.fq.gz /reads/filename_2.fq.gz > /output/filename.sam

3. Sort SAM file and output as BAM file

   java -Xmx8g -jar /software/picard-tools-1.119/SortSam.jar INPUT=/output/filename.sam OUTPUT=/output/filename.sorted.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE

4. Fix mate information

   java -Xmx8g -jar /software/picard-tools-1.119/FixMateInformation.jar INPUT=/output/filename.sorted.bam OUTPUT=/output/filename.fxmt.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE

5. Mark duplicate reads

   java -Xmx8g -jar /software/picard-tools-1.119/MarkDuplicates.jar INPUT=/output/filename.fxmt.bam OUTPUT=/output/filename.mkdup.bam METRICS_FILE=/output/filename.metrics VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000

6. Add or replace read groups

   java -Xmx8g -jar /software/picard-tools-1.119/AddOrReplaceReadGroups.jar INPUT=/output/filename.mkdup.bam OUTPUT=/output/filename.addrep.bam RGID=readname PL=Illumina SM=readname CN=BGI VALIDATION_STRINGENCY=LENIENT SO=coordinate CREATE_INDEX=TRUE

7. Create index and dictionary for reference genome

   /software/samtools-1.0/samtools faidx /reference/japonica/reference.fa
   
   java -Xmx8g -jar /software/picard-tools-1.119/CreateSequenceDictionary.jar REFERENCE=/reference/japonica/reference.fa OUTPUT=/reference/reference.dict

8. Realign Target 

   java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /output/filename.addrep.bam -R /reference/japonica/reference.fa -o /output/filename.intervals -fixMisencodedQuals -nt 8

9. Indel Realigner

   java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T IndelRealigner -fixMisencodedQuals -I /output/filename.addrep.bam -R /reference/japonica/reference.fa -targetIntervals /output/filename.intervals -o /output/filename.realn.bam 

10. Merge individual BAM files if there are multiple read pairs per sample

   /software/samtools-1.0/samtools merge /output/filename.merged.bam /output/*.realn.bam

11. Call SNPs using Unified Genotyper

   java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /reference/japonica/reference.fa -I /output/filename.merged.bam -o filename.merged.vcf -glm BOTH -mbq 20 --genotyping_mode DISCOVERY -out_mode EMIT_ALL_SITES
16

NGS数据比对工具持续收集

无意中看到了这个网站,比wiki的还有全面和专业。搜集了现有还算比较出名的比对软件,并且列出来了,还做了简单评价,里面对比对工具的收集,主要是基于2012年的一个综述《Tools for mapping high-throughput sequencing data》,相信应该是有不少人都看过这篇综述的,其实生物信息初学者应该自己去文献数据库找点感兴趣的关键词的综述多看看,广泛涉猎总没有坏处的。

<img src="http://www.ebi.ac.uk/~nf/hts_mappers/mappers_timeline.jpeg" alt="Mappers Timeline" width="800">

Features Comparison

The following Table enables a comparison of mappers based on different characteristics. The table can be sorted by column (just click on the column name). The data was collected from different sources and in some cases was provided by the developers. For execution times and memory requirements we refer to the above mentioned review (supplementary data is available here).

The Data column indicates if the mapper is specifically tailored for DNA, RNA, miRNA, or bisulfite sequences.The Seq.Plat. column indicates if the mapper supports natively reads from a specific sequencing platform or not (N). The version column indicates the version of the mapper considered. Read length limits are showed in two columns: minimum read length (Min. RL) and maximum read length (Max. RL.). Unless otherwise stated the unit is base pairs. The support for mismatches and short indels is also presented including, when possible, the maximum number of allowed mismatches and indels: by default the value is presented in bases; in some cases the value is presented as a percentage of the read size; or as score, meaning that mapper uses a score function. The alignments reported column indicate the alignments reported when a read maps to multiple locations. The alignment column indicates if the reads are aligned end-to-end (Globally) or not (Locally). The Parallel column indicates if the mapper can be run in parallel and, if yes, how: using a shared-memory (SM) or/and a distributed memory (DM) computer. The QA (quality awareness) column indicates if the mapper uses read quality information during the mapping. The support for paired-end/mate-pair reads is indicated in the PE column. The Splicing column indicates, for the RNA mappers, if the detection of splice junctions is made de novo or/and through user provided libraries (Lib). The Index column indicates if the reads or/and the reference are indexed. The number of citations was obtained from Google Scholar on 13 June 2015.
16

根据染色体起始终止点坐标来获取碱基序列

这次要介绍一个非常实用的工具,很多时候,我们有一个染色体编号已经染色体起始终止为止,我们想知道这段序列是什么样的碱基。当然我们一般用去UCSC的genome browser里面去查询,而且可以得到非常多的信息,多到正常人根本就无法完全理解。但是我如果仅仅是想要一段序列呢?
诚然,我们可以下载3G的那个hg19.fa文件,然后写一个脚本去拿到序列,但是毕竟太麻烦,而且一般这种需求都是临时性的需要,我们当然想要一个非常简便的方法咯。
我这里介绍一个非常简单的方法,是基于perl的cgi编程,当然,不需要你编程了。人家UCSC已经写好了程序,你只需要把网页地址构造好即可,比如chr17:7676091,7676196 ,那么我只需要构造下面一个网页地址
hg38可以更换成hg19,dna?segment= 后面可以按照标准格式更换,既可以返回我们想要的序列了。
网页会返回 一个xml格式的信息,解析一下即可。
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<DASDNA>
<SEQUENCE id="chr17" start="7676091" stop="7676196" version="1.00">
<DNA length="106">
aggggccaggagggggctggtgcaggggccgccggtgtaggagctgctgg tgcaggggccacggggggagcagcctctggcattctgggagcttcatctg gacctg
</DNA>
</SEQUENCE>
</DASDNA>
很明显里面的aggggccaggagggggctggtgcaggggccgccggtgtaggagctgctgg tgcaggggccacggggggagcagcctctggcattctgggagcttcatctg gacctg 就是我们想要的序列啦。
赶快去试一试吧
当然你不仅可以搜索DNA,还可以搜索很多其它的,你也不只是可以搜索人类的
See http://www.biodas.org for more info on DAS.
Try http://genome.ucsc.edu/cgi-bin/das/dsn for a list of databases.
X-DAS-Version: DAS/0.95
X-DAS-Status: 200
Content-Type:text
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: X-DAS-Version X-DAS-Status X-DAS-Capabilities

UCSC DAS Server.
See http://www.biodas.org for more info on DAS.
Try http://genome.ucsc.edu/cgi-bin/das/dsn for a list of databases.
See our DAS FAQ (http://genome.ucsc.edu/FAQ/FAQdownloads#download23)
for more information.  Alternatively, we also provide query capability
through our MySQL server; please see our FAQ for details
(http://genome.ucsc.edu/FAQ/FAQdownloads#download29).

Note that DAS is an inefficient protocol which does not support
all types of annotation in our database.  We recommend you
access the UCSC database by downloading the tab-separated files in
the downloads section (http://hgdownload.cse.ucsc.edu/downloads.html)
or by using the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables)
instead of DAS in most circumstances.

 

16

nature发表的统计学专题Statistics in biology

生物学里面,唯一还算有点技术含量,和有点门槛,就是生物统计了,而这也是绝大部分研究者的痛点,有能力的,可以看看nature上面关于统计学的专题讨论,而且主要是应用于自然科学的统计学讨论。

里面有几句统计学名言警句:
Statistics does not tell us whether we are right. It tells us the chances of being wrong.
统计学并不会告诉我们是否正确,而只是说明我们错误的可能性是多少。
Quality is often more important than quantity.
数据的质量远比数量要重要的多
The meaning of error bars is often misinterpreted, as is the statistical significance of their overlap.
Good experimental designs mitigate experimental error and the impact of factors not under study.
文章列表:
Research methods: Know when your numbers are significant
Scientific method: Statistical errors
Weak statistical standards implicated in scientific irreproducibility
The fickle P value generates irreproducible results
Vital statistics
Experimental biology: Sometimes Bayesian statistics are better
A call for transparent reporting to optimize the predictive value of preclinical research
Power failure: why small sample size undermines the reliability of neuroscience
Basic statistical analysis in genetic case-control studies
Erroneous analyses of interactions in neuroscience: a problem of significance
Analyzing 'omics data using hierarchical models
Advantages and pitfalls in the application of mixed-model association methods
Quality control and conduct of genome-wide association meta-analyses
Circular analysis in systems neuroscience: the dangers of double dipping
A solution to dependency: using multilevel analysis to accommodate nested data
How does multiple testing correction work?
What is Bayesian statistics?
What is a hidden Markov model?
下面的这些文章,其实就是我们正常课本里面统计学的知识点,但是放在nature杂志发表,就顿时高大上了好多
Points of significance: Importance of being uncertain
Points of Significance: Error bars
Points of significance: Significance, P values and t-tests
Points of significance: Power and sample size
Points of Significance: Visualizing samples with box plots
Points of significance: Comparing samples €”part I
Points of significance: Comparing samples part II
Points of significance:  Nonparametric tests
Points of significance: Designing comparative experiments
Points of significance: Analysis of variance and blocking
Points of Significance:  Replication
Points of Significance:  Nested designs
Points of Significance: Two-factor designs
Points of significance: Sources of variation
Points of Significance: Split plot design
Points of Significance: Bayes' theorem
Points of significance: Bayesian statistics
Points of Significance: Sampling distributions and the bootstrap
Points of Significance: Bayesian networks
A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.
16

生物信息学学者学习mysql之路

我一直都知道mysql其实很有用的, 哪怕是在bioinformatics领域。也断断续续的看过不少mysql教程,只是苦于没有机会应用。毕竟应用才是最好的学习方法,正好这些天需要用了,我就又梳理了一遍作为一个生物信息学学者,该如何学习mysql数据库。
然后再搜搜一堆技巧
差不多就可以开始啦。
我们不拿数据库来做网页,所以需要的仅仅是查询公共数据库的数据,当然,一般人都会选择直接去网页可视化的查询,或者去ftp批量下载后自己写脚本来查询,我以前也是这样想的,所以感觉mysql没什么用,因为它能做的, 我写一个脚本都能做到。但是任何事物能发展到如此流行的程度毕竟还是有它的优点的。
而在我看来,mysql的优点就是,不需要存储大量的文件信息,随查随用,如果我们想把数据库备份到本地,就要建立一大堆的文件夹,存放各种refgene信息呀,entrez gene信息呀,转录本,外显子等等各个文件夹,每个文件夹下面还有一堆文件,而且还要分物种存储,总之就是很麻烦,但是在数据库就不一样啦。
比如我们可以连接UCSC的数据库 (前提是你的机器里面可以允许mysql这个命令,而且你可以联网)
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
就这么简单, 你就用mysql远程登录了UCSC的数据库,可以show databases;或者use database hg19 ; 等等
里面有两百多个数据库,主要是多物种多版本,然后如果我们看hg19这个数据库,里面还有一万多个数据表,包含着hg19的全面信息。
还有很多其它的公共数据库可以练习
来自于:https://www.biostars.org/p/474/#9095

for example, I would cite:

UCSC http://genome.ucsc.edu/FAQ/FAQdownloads#download29
ENSEMBL http://uswest.ensembl.org/info/data/mysql.html
GO http://www.geneontology.org/GO.database.shtml#mirrors

1000 Genomes: since June 16, 2011: http://www.1000genomes.org/public-ensembl-mysql-instance

mysql -h mysql-db.1000genomes.org -u anonymous -P 4272

Flybase has direct access to its postgres chado database.
http://flybase.org/forums/viewtopic.php?f=14&t=114
hostname: flybase.org port: 5432 username: flybase password: no password database name: flybase
e.g. psql -h flybase.org -U flybase flybase

mysql -h database.nencki-genomics.org -u public
mysql -h useastdb.ensembl.org -u anonymous -P 5306

你都可以登录进去看看里面有什么,也可以练习练习mysql的语法,但是增删改查种的查是可以用的
然后我们可以用R或者perl或者Python来连接数据库,也是蛮好用的, 我现在比较倾向于R
所以我就简单看了一下这个包的说明书,然后成功连接了
#Connect to the MySQL server using the command:
#mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
#The -A flag is optional but is recommended for speed
library(RMySQL)
my.port="";
my.user="genome";
my.password="";
my.db="hg19";
#there are 203 databases,such as hg18,hg38,mm9,mm10,ce10
con <- dbConnect(MySQL(), host=my.host, user=my.user,dbname=my.db)
dbListTables(con) # there are 11016 tables in this hg19 database;

是不是很简单呀,只有你认真的学习,其实这些应用的东西都还是蛮简单的。

下面这本书也比较好,就讲了R或者perl或者Python来连接数据库,很全面

当然,如果想看mysql在bioinformatics方面的应用,下面还有很多学习资料
进阶版还可以看看具体事例,GO数据库的设计:http://geneontology.org/page/lead-database-schema
从这个来看,python要比perl 好很多http://www.personal.psu.edu/iua1/courses/files/2010/week15.pdf
16

居然还可以出售TCGA的数据,只有你稍微进行分析一下即可

亮瞎了我的双眼,原来还可以这样挣钱。
这个数据库的作者在2011年发了一篇如何寻找融合基因的文章:*Edgren, Henrik, et al. "Identification of fusion genes in breast cancer by paired-end RNA-sequencing." Genome Biol 12.1 (2011): R6.

然后基于此,把TCGA计划里面的所有癌症样本数据都处理了,并且得到了融合基因数据集,然后就以此出售

价格高达一万欧元,折合人民币七万多,一本万利,而且人家TCGA计划的数据的公开而且免费的,他做了二次处理就可以拿来挣钱,让我感觉很不爽。
到目前为止他们处理了TCGA计划里面的7652个癌症样本的数据,建立了一个囊括28种癌症的融合基因数据集,并且打包成了一个叫做FusionSCOUT 的产品来出售。
价格如下:

Pricing of FusionSCOUT datasets:

  • Single gene in one cancer set                        490€    /  580$ per dataset
  • Single gene fusions across all cancers          4900€  /  5800$ dataset
  • Individual cancer set                                       990 €   /  1250 $ per dataset
  • Full TCGA dataset                                          9900€  /  12500$ per dataset
该网站是这样介绍他们的产品的,号称有3500个研究团体已经使用了他们的数据,但是我感觉纯粹是吹牛,毕竟他这篇文献也就一百多的引用量,再说3500次购买,就这一个产品就能让他成为亿万富翁了,想想都觉得可怕。而且这网站这么烂,中国访问速度是渣渣,也就是相当于失去了中国的所有土豪客户了,怎么可能还有3500的销量,搞笑!

One of the latest therapeutics angles in the fight against cancer is fusion genes and their regulation. To aid in fusion gene research and reveal the multitude of gene fusion event in cancer samples MediSapiens has developed a proprietary FusionSCOUT pipeline for identifying fusion genes from RNA sequencing datasets.

Currently we have analysed 7625 tumour samples from the TCGA project building a fusion gene dataset covering 28 different cancers within the TCGA project which can be accessed through our FusionSCOUT product.

Using this pipeline, we have discovered 3930 samples with gene fusions with 9667 different fusion genes. We´ve discovered numerous novel gene fusions as well as new cancer types in which previously known fusions appear.

You can now purchase these gene fusions datasets with few mouse clicks and get the worlds most comprehensive gene fusions from cancer sets within days

FusionSCOUT cancer Reports

With FusionSCOUT you can access the full listings of all fusion genes in specific cancer datasets. Find new leads for possible cause of the cancer, examine the pathways that are affected by different fusions, stratify patients by shared fusion genes or search for potential target for drugs and companion diagnostics.

Once you purchase a FusionSCOUT dataset we will send you a detailed report with information on the fused genes, sample ID from the TCGA dataset, fusion frequencies across the dataset as well as fusion mRNA sequences and lists of protein domains present in the fusion transcripts.

By ordering the MediSapiens FusionSCOUT dataset, you´ll get:

  • A list of all gene fusions that involve your gene of interest, across all TCGA cancer types
  • TCGA sample ID: s of the for the samples with fusions
  • Exact exon junctions for the fusions, including alternatively spliced variants and data on whether reading frame is retained
  • Detailed list of protein domains retained in the fusion genes
  • cDNA sequence for the fusion mRNAs

Contact us to access the most up-to-date and comprehensive datasets of fusion gene events in different cancers!contact@medisapiens.com

Check out also our Fusion Gene Detection pipeline service for your samples!

Dataset missing? Email us and well add your favorite dataset to FusionSCOUT!

FusionSCOUT Cancer sets, March 2015

Cancer type Number of samples Number of fusion genes
Acute Myeloid Leukemia, LAML 153 69
Adrenocortical carcinoma, ACC 79 115
Bladder Urothelial Carcinoma, BLCA 273 473
Brain Lower Grade Glioma, LGG 467 309
Breast Invasive Carcinoma, BRCA 1029 3267
Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma, CESC 195 190
Colon Adenocarcinoma, COAD 287 212
Glioblastoma multiforme, GBM 170 379
Head and Neck Squamous Cell Carcinoma, HNSC 412 386
Kidney Chromophobe, KICH 66 19
Kidney Renal Clear Cell Carcinoma, KIRC 523 217
Kidney Renal Papillary Cell Carcinoma, KIRP 226 145
Liver Hepatocellular Carcinoma, LIHC 198 317
Lung Adenocarcinoma, LUAD 456 991
Lung Squamous Cell Carcinoma, LUSC 482 1374
Lymphoid Neoplasm Diffuse Large B-cell Lymphoma, DLBC 28 18
Mesothelioma, MESO 36 26
Ovarian Serous Cystadenocarcinoma, OV 420 1166
Pancreatic Adenocarcinoma, PAAD 84 46
Pheochromocytoma and Paraganglioma, PCPG 184 83
Prostate Adenocarcinoma, PRAD 336 859
Rectum Adenocarcinoma, READ 85 74
Sarcoma, SARC 161 799
Skin Cutaneous Melanoma, SKCM 355 620
Stomach Adenocarcinoma, STAD 190 311
Thyroid Carcinoma, THCA 506 195
Uterine Carcinosarcoma, UCS 57 229
Uterine Corpus Endometrial Carcinoma, UCEC 167 422
14

几个国外出名的跟生物信息学相关的会议

会议列表如下:
ASGH会议-Annual Meeting of the American Society of Human Genetics
AGBT会议-Advances in Genome Biology & Technology (AGBT)
ASM会议-annual meeting of the American Society for Microbiology
ASHI会议-The American Society for Histocompatibility and Immunogenetics 
BOSC-生物信息开放会议:Bioinformatics Open Source Conference
ISMB/ECCB会议
ACMG会议-The ACMG Annual Clinical Genetics Meeting
annual Biology of Genomes (BoG) meeting at Cold Spring Harbor
以上排名不分先后,
一年一度的美国人类遗传学协会(ASHG)年会是遗传学界的盛事,也是目前规模最大的人类遗传学会议。2015年的年会于10月6-10日在马里兰州的巴尔的摩举行,吸引了6500多名科学家参与。他们将在会议上介绍和讨论人类遗传学各个方面的最新进展。
会议官网是:http://www.ashg.org/
非常隆重,也受到业界追捧!
会议ppt均可下载,但是要翻墙

https://storify.com/andrewsu/ashg14-speaker-slides

基因组生物学技术进展大会(AGBT)

中文介绍:

The 23rd Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2015)
14th Annual European Conference on Computational Biology (ECCB 2015)
这也是一个老牌会议了,会议官网是:https://www.iscb.org/
2015年的会议资料可以直接下载了:
ASM会议就比较水一点:
会议官网是:http://www.asm.org/

WHO WE ARE

The American Society for Microbiology (ASM) is the oldest and largest single life science membership organization in the world. Membership has grown from 59 scientists in 1899 to more than 39,000 members today, with more than one third located outside the United States. The members represent all aspects of the microbial sciences including microbiology educators.

The mission of ASM is to promote and advance the microbial sciences.

ASM accomplishes this mission through a variety of products, services and activities.

  • We provide a platform for sharing the latest scientific discoveries through our books, journals, meetings and conferences.
  • We help strengthen sustainable health systems around the world though our laboratory capacity building and global engagement programs.
  • We advance careers through our professional development programs and certifications.
  • We train and inspire the next generation of scientists through our outreach and educational programs.

ASM members have a passion for the microbial sciences, a desire to connect with their colleagues and a drive to be involved with the profession. Whether it is publishing in an ASM Journal, attending an ASM meeting or volunteering on one of the Society's many boards and committees.

Big parts of our everyday lives, from energy production, waste recycling, new sources of food, new drug development and infectious diseases to environmental problems and industrial processes-are studied in the microbial sciences.

Microbiology boasts some of the most illustrious names in the history of science--Pasteur, Koch, Fleming, Leeuwenhoek, Lister, Jenner and Salk--and some of the greatest achievements for mankind. Within the 20th century, a third of all Nobel Prizes in Physiology or Medicine have been awarded to microbiologists.

ASHI主要是免疫学相关的:

官网是:http://www.ashi-hla.org/

The ASHI 41st Annual Meeting site is now live, for the latest updates visit 2015.ashi-hla.org.
About ASHI

The American Society for Histocompatibility and Immunogenetics (ASHI) is a not-for-profit association of clinical and research professionals including immunologists, geneticists, molecular biologists, transplant physicians and surgeons, pathologists and technologists. As a professional society involved in histocompatibility, immunogenetics and transplantation, ASHI is dedicated to advancing the science and application of histocompatibility and immunogenetics; providing a forum for the exchange of information; and advocating the highest standards of laboratory testing in the interest of optimal patient care.

BOSC-生物信息开放会议

在wiki里面有详细的介绍:

The Bioinformatics Open Source Conference (BOSC) is an academic conference on open source programming in bioinformatics organised by the Open Bioinformatics Foundation. The conference has been held annually since 2000 and is run as a two-day satellite meeting preceding the Intelligent Systems for Molecular Biology (ISMB) conference.

annual Biology of Genomes (BoG) meeting

ACMG会议是临床相关的,报道的比较少
会议官网是;http://www.acmgmeeting.net/

ABOUT

The ACMG Annual Clinical Genetics Meeting provides genetics professionals with the opportunity to learn how genetics and genomics are being integrated into medical or clinical practice. The ACMG Annual Meeting Program Committee has developed a high caliber scientific program that will present the latest developments and research in clinical genetics and genomics

 

14

用wget批量下载需要认证的网页或者ftp站点里面的pdf文档

这是一个终极命令,比写什么爬虫简单多了
wget -c -r -np -k -L -p  -A.pdf --http-user=CS374-2011 --http-passwd=AlgorithmsInBiology http://ai.stanford.edu/~serafim/CS374_2011/papers/
我这里的例子是斯坦福大学的生物信息学算法课程里面推荐阅读的的所有pdf格式的paper
课程的网址是:http://ai.stanford.edu/~serafim/CS374_2011/
可以看到,这个网站推荐的文献分成8大类,本身这个网站打开就需要登录用户名和密码
--http-user=CS374-2011 --http-passwd=AlgorithmsInBiology
每一篇文献的单独地址是http://ai.stanford.edu/~serafim/CS374_2011/papers/Miscellaneous_topics/Self-assembly_of_DNA/self_healing_and_proofreading.pdf 类似的格式。
我这里简单解释一下这些参数的意思:
-c -r -np -k -L -p  -A.pdf
-c 断点续传
-r 递归下载,下载指定网页某一目录下(包括子目录)的所有文件
-nd 递归下载时不创建一层一层的目录,把所有的文件下载到当前目录(特殊要求会选择这个参数)
-np 递归下载时不搜索上层目录,如wget -c -r www.xxx.org/pub/path/
没有加参数-np,就会同时下载path的上一级目录pub下的其它文件 (所以一定要加上这个参数,不然会下载太多东西的)
-k 将绝对链接转为相对链接,下载整个站点后脱机浏览网页,最好加上这个参数
-L 递归时不进入其它主机,如wget -c -r www.xxx.org/
-p 下载网页所需的所有文件,如图片等
-A 指定要下载的文件样式列表,多个样式用逗号分隔
至于最后的--http-user=CS374-2011 --http-passwd=AlgorithmsInBiology 就是登录该课程网站需要的用户名和密码
12

生物信息学工程师在美帝的工资水平

今天逛论坛的时候,我看了一个宾夕法尼亚大学的生物信息学招聘启事:https://psu.jobs/job/60050

很有趣的是,我看到了他们的工资层级,而他们要招聘的生物信息学工程师的待遇是K,L级别的,也就是最低也是5万美金的年薪,折合成人民币还是蛮可观的,虽然我不是很清楚这个待遇在美帝属于什么样的水平,当然跟美帝的程序员肯定是没得比的,但是比国内的大部分程序员都还有好了。
Salary Band Minimum Midpoint Maximum
A $16,104 $23,748 $31,392
B $17,712 $26,124 $34,524
C $19,152 $28,728 $38,304
D $21,072 $31,620 $42,156
E $23,604 $35,400 $47,196
F $26,436 $39,660 $52,872
G $29,136 $44,412 $59,712
H $33,192 $50,616 $68,040
I $37,848 $57,696 $77,580
J $42,444 $65,772 $89,136
K $49,236 $76,308 $103,392
L $57,120 $88,524 $119,928
M $66,240 $102,672 $139,116
N $78,168 $121,152 $164,148
O $90,768 $142,968 $195,168
P $107,124 $168,696 $230,280
Q $126,396 $199,056 $271,728
R $151,668 $238,872 $326,088
A Bioinformatics Analyst position is available within the Bioinformatics Consulting Center at The Pennsylvania State University.
 The position is supported by the Huck Institutes for the Life Sciences and requires the candidate to work with multiple project investigators to design and implement computational pipelines for data produced by high throughput sequencing instruments and others, with particular emphasis on metagenomics and microbiome analyses.
 Responsibilities include the following: developing and/or maintaining existing software pipelines for analyzing high throughput sequencing data; identifying, evaluating and documenting new methodologies to support ongoing research needs; writing code and developing solutions to computational biology problems, with particular emphasis on microbiome and related samples. The Bioinformatics Analyst will become part of an interdisciplinary team composed of other bioinformatics staff, students and researchers and is expected to interact with other life scientists at Penn State and our international partner institutions in Africa and Asia to assist them with identifying research goals, analytical support needs, while carrying out computational data analysis as needed. It is anticipated that approximately 50% of your effort will initially be dedicated to providing bioinformatics support and microbiome analysis pipeline development for high-profile collaborative infectious disease surveillance research and training projects in Tanzania as well as other countries in East Africa and South Asia and may involve a limited amount of international travel (once per year). This job will be filled as a level 3, or level 4, depending upon the successful candidate's competencies, education, and experience. Typically requires a Master's degree or higher in a field of study with focus on computational research methods or higher plus four years of related experience, or an equivalent combination of education and experience for a level 3. Additional experience and/or education and competencies are required for higher level jobs. In-depth understanding of the computational analysis required for processing data from genomic technologies and their applications: Microbiome, metagenomics, RNA-Seq, genome assembly, genomic data visualization, or others. Expertise in handling and processing data in common bioinformatics formats; knowledge of available bioinformatics tools and genomic data repositories; proven track record of delivering bioinformatics solutions; demonstrated programming skills in one or more programming languages: Python, Perl, Java, C and/or numerical platforms: R, MATLAB, Mathematica. Experience handling large data sets generated from sequencing instruments. Excellent communication skills. This is a fixed-term appointment funded for one year from date of hire with excellent possibility of re-funding.
09

对vcf突变数据与公开发表的进行比对

当我们对NGS数据call了snp之后一般会输出成vcf格式的数据,一行代表一个突变,例如
20      2451451 .       G       T       1939.77 .
AC=1;AF=0.500;AN=2;BaseQRankSum=-10.134;DP=239;Dels=0.00;FS=2.276;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.258;QD=8.12;ReadPosRankSum=0.823;SOR=0.870
GT:AD:DP:GQ:PL  0/1:150,89:239:99:1968,0,3874
#前几列记录着该突变发生在第几号染色体以及该染色体的哪个坐标,我们的参考基因组在该位点是什么碱基,我们测到的突变成了什么碱基。
最后两列是测序深度以及正负测序深度,或者ref和allele的测序深度。
只有第8列是最复杂的,可以有高达几百个数据信息,取决于我们用什么样的软件来call的snp,以及call了snp之后用什么样的软件做的注释。
接下来我们还需要探究我们找到的突变是否在其它以及公开发表的数据库里面被找到过,所以可以下载非常多的公共数据库进行比对,我所见过的有一下一些,估计完全下载有0.5T
dbsnp144 (这个是ncbi提供的最权威的啦)
cgi69
ExAC.vcf.gz(这个是broadinstitute提供的外显子联盟)
Cosmic_v73.ann.vcf.gz (这个是癌症突变信息集)
finalTCGA.vcf.gz (TCGA计划也是癌症相关的)
icgc.vcf.gz
dbNSFP2.6vcf
SCLP.ann.vcf.gz
CCLE.ann.vcf.gz
ESP6500-SIv2.vcf.gz (Variants from the Exome Sequencing Project (ESP))
adni-sum
safs-sum.indel.vcf.gz
gonl.vcf.gz
ssm.vcf.gz
ssi.vcf.gz
uk10k.vcf.gz
1000g-ph3v5.gff.gz  (千人基因组计划)
gwasCatalog.gff.gz  \
phewascatalog.gff.gz  \
dbgap-gwas.gff.gz  \
interproDomain.gff.gz \
clinvar.gff.gz \
RegulomeDB.gff.gz \
CancerGAMAdb.gff.gz \
29

推荐5个生物信息学领域的教授

排名不分先后:

推荐宾夕法尼亚州立大学的一个教授Istvan Albert

他写了一本书是: https://www.biostarhandbook.com/
他还可以授予网上课程学位:http://www.personal.psu.edu/iua1/certificate.html
他还推荐了一本R语言书籍:http://onepager.togaware.com/

关注一下华盛顿大学医学院的教授Obi L. Griffith

他的主页:http://www.obigriffith.org/

他的一个比较出名的的贡献是 www.rnaseq.wiki
他在 Biostars bioinformatics forum 非常活跃
他的课程包括Molecular Basis of Cancer (BIO5288) and Genetics and Genomics of Disease (BIO5487) at Washington University School of Medicine.
I was a TA for Genome Analysis (MEDG505) and the bioinformatics section of Advanced Human Molecular Genetics (MEDG520) and a guest instructor for Cell Biology For Biomedical Engineering Graduate Students (APSC552), Cell and Organismal Biology (BIOL111) and Cell Biology (BIOL200) at UBC.

关注一下华盛顿大学医学院的教授Malachi Griffith

他的个人主页是:http://www.malachigriffith.org/index.htm

他的github主页是:https://github.com/malachig

WashU TGI Faculty page: Profile
Linked In: Profile
Twitter: Feed
Google Scholar: Citations
Research Gate: Profile
Scopus: Profile
Open Research ID: Profile
Github: Profile
BioStar: Profile
SeqAnswers: Profile
Code Academy: Profile
Iterative Genomics Consulting: Company website
Flickr: Photostream
www.dgidb.org
www.alexaplatform.org

关注一下麦吉尔大学的Pablo Cingolani教授

他是snpeff的作者

他的github是:https://github.com/pcingola
现就职于McGill University

推荐弗吉尼亚大学的stephen教授

他是个人主页:http://stephenturner.us/

他所有公开的ppt : https://speakerdeck.com/stephenturner
stephen教授我要重点提一下,因为他的教育资源特别多。
29

关于qsub和condor两种在集群上面提交任务的方式比对

毕竟不是计算机科班出身,很多计算机概念我其实并不清楚,很多时候对一个任务的解决只是凭着自己的悟性来模仿接触到的做法,大多数情况下并无可厚非,能完成任务即可,但是碰到多人协作的环境,往往就出了问题。

我一直都知道可以 cat /proc/cpuinfo 来查看cpu数量,用free -g 来查看内存的总量及消耗,用top来查看当前运行的任务情况。
以前都是三到五人的环境公用一个服务器,一般都有几十个cpu和几百G的内存,而且服务器还经常空闲着,所以不会面临资源的问题。
而最近接触的项目是多个小组公用服务器,所以面临计算资源的分配,接触了qsub和condor两种不同的任务提交模式。
qsub和condor其实都是用来运行一个脚本的,它与我们用bash或者sh来运行一个脚本的区别在于是否立即执行我们的脚本。
比如我写了一个脚本myscript.sh
我可以 bash myscript.sh 然后里面看到这个脚本被运行了
但是用 qsub myscript.sh, 就只是向我们的集群提交了一个任务而已,这个脚本什么时候运行,占用多少内存和cpu来运行,需要特定的设置好。
这里有个并行计算的ppt:http://www.marquette.edu/mugrid/bootcamp/2010Summer/Parallel.pdf 从计算机专业的角度来讲解了两者的区别。
其实只要看了这个https://wikis.nyu.edu/display/NYUHPC/Tutorial+-+Submitting+a+job+using+qsub 就能很容易明白qsub的用法,毕竟它只是一个命令而已,但是我们需要记住的是下面几个命令,因为经常用
PBS commands covered in this section
qsub
submit a job for execution
qstat
examine the status of a job (we have discussed what this status may be)
qhold
put a job on hold
qrls
release a job
qsig
send a signal to a job
qdel
delete a job

而condor其实很少见的。

An example Condor submission file is shown below:
Executable = ./myexecutable
Universe = vanilla
Log = condor.log
Output = out.log
Error = err.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = mydata.txt
Save your script to a file with a name such as mysubmitfile.condor.
Once you have saved your submission script you can submit using the following command:
/dcs/condor/condor/bin/condor_submit mysubmitfile.condor

condor_q 可以用来查看任务提交情况

condor_rm 可以用来杀掉提交的任务。

任务提交模式还是挺有用的

25

RNA-seq流程需要进化啦!

Tophat 首次被发表已经是6年前

Cufflinks也是五年前的事情了

Star的比对速度是tophat的50倍,hisat更是star的1.2倍。

stringTie的组装速度是cufflinks的25倍,但是内存消耗却不到其一半。

Ballgown在差异分析方面比cuffdiff更高的特异性及准确性,且时间消耗不到cuffdiff的千分之一

Bowtie2+eXpress做质量控制优于tophat2+cufflinks和bowtie2+RSEM

Sailfish更是跳过了比对的步骤,直接进行kmer计数来做QC,特异性及准确性都还行,但是速度提高了25倍

kallisto同样不需要比对,速度比sailfish还要提高5倍!!!

参考:https://speakerdeck.com/stephenturner/rna-seq-qc-and-data-analysis-using-the-tuxedo-suite

25

Shell里面的各种括号的区别

[],[[]],(),(()),{},{{}},以及在前面加上$的区别,以及它们互相杂交组合的区别!!!

[[ ]] double brackets

(())Double parentheses

{{}}double curly brackets

我们必须要记住的是下面

[] 相当于test,作逻辑判断

$( ) 与` ` (反引号) 都是用来做命令替换用

${ } 吧... 它其实就是用来作变量替换用的啦

(())就是用来计算的,相当于expr函数。

参考:http://sayle.net/book/

http://tldp.org/LDP/abs/html/index.html

 

我们首先看看一对的括号

首先[]是用来逻辑判断的,必须有空格

if [ -f binom.py ]

then

echo 'binom.py exists'

fi

或者

nub=$((i%4))

#echo $nub

if [ $nub == 0 ];then

echo "we need to sleep 4 hours"

sleep 14000

fi

这个[]操作符等价于test函数

if test $1 -gt 0
then
echo "$1 number is positive"
fi

但是都必须有空格!!!

参考:http://www.freeos.com/guides/lsst/ch03sec02.html

关于shell的test操作符还有很多http://tldp.org/LDP/abs/html/fto.html

( ) 将command group 置于 sub-shell 去执行,也称 nested sub-shell。

{ } 则是在同一个 shell 内完成,也称为non-named command group。

补充一个: {} 还可以做变量扩展 {5..9}  或者 {abcd}e, 自己运行一下就知道效果啦

这两个差异很小,而且一般用不着,就不讲了。

那么这一对的括号加上了$符号后又变成了上面鬼东西呢?

当然,只有$( ) ${ }才是合法的。

在 bash shell 中,$( ) 与` ` (反引号) 都是用来做命令替换用(command substitution)的。

在操作上,用$( ) 或` ` 都无所谓,用$( )的优点是:

1, ` ` 很容易与' ' ( 单引号)搞混乱,尤其对初学者来说

2, 在多层次的复合替换中,` ` 须要额外的跳脱( \` )处理,而$( ) 则比较直观

再让我们看${ } 吧... 它其实就是用来作变量替换用的啦。

一般情况下,$var 与${var} 并没有啥不一样。

但是用${ } 会比较精确的界定变量名称的范围,比方说:

[code][/code]

$ A=B

$ echo $AB

还可以用来截取变量,这个就很多花样啦

# 是去掉左边(在鉴盘上# 在$ 之左边)

% 是去掉右边(在鉴盘上% 在$ 之右边)

单一符号是最小匹配﹔两个符号是最大匹配

 

然后我们看看两对的括号:

nub=$((i%4)) 等价于$nub=`expr $i % 1` ;

((i++)) 等价于$i=`expr $i + 1` ;

所以(())就是用来计算的,而且里面的变量不需要$来标记啦

(在 $(( )) 中的变量名称,可于其前面加$ 符号来替换,也可以不用)

在(())前面加上$只是为了把计算结果给保存而已。

而两个中括号和两个大括号都是不合法的!

 

24

用 GMAP/GSNAP软件进行RNA-seq的alignment

软件的解说ppt :http://www.mi.fu-berlin.de/wiki/pub/ABI/CompMethodsWS11/MHuska_GSNAP.pdf

软件的下载地址: http://research-pub.gene.com/gmap/
有研究者认为这个软件的比对效果要比tophat要好,虽然现在已经多出来了非常多的RNA-seq的alignment软件,我还是简单看看这个软件吧,它本来是2005就出来的一个专门比对低通量的est序列,叫GMAP,后来进化成了GSNAP
step1:下载安装GMAP/GSNAP
是一个标准的linux源码程序,安装之前一定要看readme  ,http://research-pub.gene.com/gmap/src/README
解压进去,然后源码安装三部曲,首先 ./configu  然后make 最后make install
会默认安装在 /usr/local/bin 下面,这里需要修改,因为你可能没有 /usr/local/bin 权限,安装到自己的目录,然后把它添加到环境变量!
step2 :准备数据
比对一般都只需要两个数据,一是索引好的参考基因组,另一个是需要比对的测序数据。
但是这个GSNAP,还需要对应的GTF注释文件。
首先需要参考基因组:虽然软件本身提供了一个hg19的参考基因组,并且已经索引好了Human genome, version hg19 (5.5 GB)(http://research-pub.gene.com/gmap/genomes/hg19.tar.gz) ,但是下载很慢,而且不是对所有版本的GSNAP都适用。所以我这里对我自己的参考基因组进行索引。
gmap_build -D ./ -d  my_hg19.fa
然后取ensemble下载hg19的gtf文件。
然后还需要把自己下载的gtf文件也构建索引,需要两个步骤
cat my_hg19.gtf |  ~/software/gmap-2011-10-16/util/gtf_splicesites > my_hg19.splicesites
cat  my_hg19.splicesites  |   iit_store -o my_hg19.gtf.index
然后拷贝需要比对的RNA-seq测序文件
step3: 运行程序
就是一步比对而已
gsnap
-D /home/jschnable/gsnap_indexes/
-d arabidopsisv10
--nthreads=50
-B 5
-s  /home/jschnable/gsnap_indexes/arabidopsisv10.iit
-n 2
-Q
--nofails
--format=sam temp.fastq
> results.sam
参数有点多,自己看看说明书吧http://qteller.com/RNAseq-analysis-recipe.pdf 讲的非常详细。
24

用freebayes来call snps

step1:,软件安装
wget http://clavius.bc.edu/~erik/freebayes/freebayes-5d5b8ac0.tar.gz
tar xzvf freebayes-5d5b8ac0.tar.gz
cd freebayes
make
一个小插曲,安装的过程报错:/bin/sh: 1: cmake: not found
所以我需要自己下载安装cmake,然后把cmake添加到环境变量

首先下载源码包http://www.cmake.org/cmake/resources/software.html

wget http://cmake.org/files/v3.3/cmake-3.3.2.tar.gz

 解压进去,然后源码安装三部曲,首先 ./configu  然后make 最后make install  

cmake 会默认安装在 /usr/local/bin 下面,这里需要修改,因为你可能没有 /usr/local/bin 权限,安装到自己的目录,然后把它添加到环境变量!

step2:准备数据

an alignment file (in BAM format)
a reference genome in (uncompressed) FASTA format.
正好我的服务器里面有很多
不过,该软件也可以出了一个测试数据集
wget http://bioinformatics.bc.edu/marthlab/download/gkno-cshl-2013/chr20.fa
wget ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
wget ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam.bai

用这个代码就可以下载千人基因组计划的NA12878样本的第20号染色体相关数据啦

step3:运行命令
网站给出的实例是:
freebayes -f chr20.fa \
    NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam >NA12878.chr20.freebayes.vcf

一般就用默认参数即可

 
step4:输出结果解读
没什么好解读的了,反正是vcf文件,都看烂了,就那些东西
不过该软件的作者倒是拿该软件与broad用GATK做出的NA12878样本的突变数据做了比较

 

24

broad_institute收集的癌症数据

肾上腺皮质 Adrenocortical carcinoma ACC 92 Browse Browse
膀胱,尿路上皮 Bladder urothelial carcinoma BLCA 412 Browse Browse
乳腺癌 Breast invasive carcinoma BRCA 1098 Browse Browse
子宫颈 Cervical and endocervical cancers CESC 307 Browse Browse
胆管癌 Cholangiocarcinoma CHOL 36 Browse Browse
结肠腺癌 Colon adenocarcinoma COAD 460 Browse Browse
大肠腺癌 Colorectal adenocarcinoma COADREAD 631 Browse Browse
淋巴肿瘤弥漫性大B细胞淋巴瘤 Lymphoid Neoplasm Diffuse Large B-cell Lymphoma DLBC 58 Browse Browse
食管 Esophageal carcinoma ESCA 185 Browse Browse
FFPE试点二期 FFPE Pilot Phase II FPPP 38 None Browse
胶质母细胞瘤 Glioblastoma multiforme GBM 613 Browse Browse
脑胶质瘤 Glioma GBMLGG 1129 Browse Browse
头颈部鳞状细胞癌 Head and Neck squamous cell carcinoma HNSC 528 Browse Browse
肾嫌色 Kidney Chromophobe KICH 113 Browse Browse
泛肾 Pan-kidney cohort (KICH+KIRC+KIRP) KIPAN 973 Browse Browse
肾透明细胞癌 Kidney renal clear cell carcinoma KIRC 537 Browse Browse
肾乳头细胞癌 Kidney renal papillary cell carcinoma KIRP 323 Browse Browse
急性髓系白血病 Acute Myeloid Leukemia LAML 200 Browse Browse
脑低级神经胶质瘤 Brain Lower Grade Glioma LGG 516 Browse Browse
肝癌 Liver hepatocellular carcinoma LIHC 377 Browse Browse
肺腺癌 Lung adenocarcinoma LUAD 585 Browse Browse
肺鳞状细胞癌 Lung squamous cell carcinoma LUSC 504 Browse Browse
间皮瘤 Mesothelioma MESO 87 Browse Browse
卵巢浆液性囊腺癌 Ovarian serous cystadenocarcinoma OV 602 Browse Browse
胰腺癌 Pancreatic adenocarcinoma PAAD 185 Browse Browse
嗜铬细胞瘤和副神经节瘤 Pheochromocytoma and Paraganglioma PCPG 179 Browse Browse
前列腺癌 Prostate adenocarcinoma PRAD 499 Browse Browse
直肠腺癌 Rectum adenocarcinoma READ 171 Browse Browse
肉瘤 Sarcoma SARC 260 Browse Browse
皮肤皮肤黑色素瘤 Skin Cutaneous Melanoma SKCM 470 Browse Browse
胃腺癌 Stomach adenocarcinoma STAD 443 Browse Browse
胃和食管癌 Stomach and Esophageal carcinoma STES 628 Browse Browse
睾丸生殖细胞肿瘤 Testicular Germ Cell Tumors TGCT 150 Browse Browse
甲状腺癌 Thyroid carcinoma THCA 503 Browse Browse
胸腺瘤 Thymoma THYM 124 Browse Browse
子宫内膜癌 Uterine Corpus Endometrial Carcinoma UCEC 560 Browse Browse
子宫癌肉瘤 Uterine Carcinosarcoma UCS 57 Browse Browse
葡萄膜黑色素瘤 Uveal Melanoma UVM 80 Browse Browse

看起来癌症很多呀,任重道远

24

perl的模块组织方式

如何使用自己写的私人模块

模块通俗来讲,就是一堆函数的集合。

Personally I prefer to keep my modules (those that I write for myself or for systems I can control) in a certain directory, and also to place them in a subdirectory. As in:

/www/modules/MyMods/Foo.pm
/www/modules/MyMods/Bar.pm

And then where I use them:

use lib qw(/www/modules);useMyMods::Foo; 

useMyMods::Bar;

As reported by "perldoc -f use":

It is exactly equivalent to
BEGIN { require Module; import Module LIST; }
except that Module must be a bareword.

Putting that another way, "use" is equivalent to:

  • running at compile time,
  • converting the package name to a file name,
  • require-ing that file name, and
  • import-ing that package.

So, instead of calling use, you can call require and import inside a BEGIN block:

BEGIN{require'../EPMS.pm';
  EPMS->import();}

And of course, if your module don't actually do any symbol exporting or other initialization when you call import, you can leave that line out:

BEGIN{require'../EPMS.pm';}
比如我的一个模块如下,命名为my_stat.pm
package my_stat;
sub mean{
my $sum=0;
$sum+=$_ foreach @_;
$sum/($#_+1);
}
#print &mean(1..10),"\n";
sub stddev{
$avg=&mean(@_);
#print "$avg\n";
my $sum=0;
$sum+=($_-$avg)**2 foreach @_;
sqrt($sum/($#_));
#It will be different if you use $#_+1;
#sqrt($sum/($#_+1));
}
#print &stddev(1..10),"\n";
1;
里面有我定义好的两个函数 mean 和 stddev , 那么我就可以在我的其它perl程序里面直接引用这个模块,从而使用我的两个自定义函数。
use lib "./";  #取决于你把自定义模块my_stat.pm放在哪个目录
use my_stat;
print my_stat::stddev(1..10),"\n";
18

一个基因坐标定位到具体基因的程序的改进

这是为了回答以前的一个疑问:任意给定基因组的 chr:pos, 判断它在哪个基因上面?这个程序难吗?

基因的chr,start,end都是已知的

学术一点讲述这个问题:已知CNV数据在染色体上的position如chr1:2075000-2930999,怎样批量获取其对应的Gene Symbol呢(批量)

数据如下:

head gene_position.hg19 //21629

1 chr19 58858171 58874214 A1BG ENSG00000121410

2 chr12 9220303 9268558 A2M ENSG00000175899

3 chr12 9381128 9386803 A2MP1 ENSG00000256069

9 chr8 18027970 18081198 NAT1 ENSG00000171428

10 chr8 18248754 18258723 NAT2 ENSG00000156006

12 chr14 95058394 95090390  ENSG00000273259

13 chr3 151531860 151546276 AADAC ENSG00000114771

14 chr2 219128851 219134893 AAMP ENSG00000127837

15 chr17 74449432 74466199 AANAT ENSG00000129673

16 chr16 70286296 70323412 AARS ENSG00000090861

head pfam.df.hg19.bed   //340960

chr1  12190          12689          Helicase_C_2     0        +        12190          12689          255,255,0

chr1  69157          69220          7tm_4        0        +        69157          69220          255,255,0

chr1  69184          69817          7TM_GPCR_Srsx         0        +        69184          69817          255,255,0

chr1  69190          69931          7tm_1        0        +        69190          69931          255,255,0

chr1  69490          69910          7tm_4        0        +        69490          69910          255,255,0

现在需要对我们的pfam数据进行注释,根据每一行的chrpos来看看是属于哪一个基因

总共会有338879 pfam记录可以注释上基因。

注释之后应该是 head pfam.gene.df.hg19  这个样子

CDK11B       chr1  1571423     1573930     Pkinase        0        -         1571423     1573930     255,255,0

CDK11B       chr1  1572048     1573921     Pkinase_Tyr          0        -         1572048     1573921     255,255,0

CDK11B       chr1  1572120     1572823     Kinase-like  0        -         1572120     1572823     255,255,0

CDK11B       chr1  1572120     1572820     Kinase-like  0        -         1572120     1572820     255,255,0

CDK11B       chr1  1572120     1572817     Kinase-like  0        -         1572120     1572817     255,255,0

CDK11B       chr1  1573173     1573918     Kinase-like  0        -         1573173     1573918     255,255,0

CDK11B       chr1  1575747     1577317     Daxx  0        -         1575747     1577317     255,255,0

CDK11B       chr1  1576417     1577347     Nop14         0        -         1576417     1577347     255,255,0

CDK11B       chr1  1576423     1577332     Mitofilin      0        -         1576423     1577332     255,255,0

CDK11B       chr1  1576432     1577317     SAPS  0        -         1576432     1577317     255,255,0

我的第一个程序用的是全基因位点扫描到hash的方法。这样需要扫描13,1390,4974个位点,多于三分之一的基因组,这样是非常浪费内存的,尤其是keys需要多个字节。

我用了256G的服务器都没有运行完。

后来我取巧了把我的gene_position.hg19文件用split命令分成了25个,然后循环25次对pfam.df.hg19.bed  文件进行注释。

 

这样的确可以解决问了,而且只需要32G的内存的服务器即可,时间也很快,就十多分钟吧。

但这只是取巧的方法,应该要从算法上面优化,首先我仅仅做一个改动,就是不再扫描全基因的位点,对每个基因,我以1K的窗口来取位点进行扫描。这样我判断pfam的坐标时候,也以1K为最小单位进行判断。

这样只需要不到30s就可以出结果,总共注释了303474pfam记录,还不是最终的338879,因为我这次只注释了基因的1000整数倍基因区间,这样如果pfam记录落在一个基因的起始终止点不到1K位置时就不会被注释。这时候需要对代码进行继续优化。

 脚步懒得上传了,在我的有道云笔记里面。

http://note.youdao.com/share/?id=58e66d138e9434284ffa61c53b65abdc&type=note