
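How a bioinformatics beginner can teach themselves programming

I originally came across this as a question on Zhihu, so I took some time to answer it: http://www.zhihu.com/question/36701137/answer/68928111

First, the fact that you want to read source code is a good sign. Well-written source code really is a shortcut to becoming a better programmer. After all, most of us will never get hands-on guidance from anyone, so we have to rely on working things out ourselves.

I'll answer from my own experience. I started out as a pure biologist with no programming background, and my programming skills are now reasonably good!

First, whichever language you pick (Perl, Python, R, or MATLAB, any of them is fine), there is a pile of introductory books for it. Read one or two of them quickly, without trying to digest every detail (no book is better or worse than another, so don't ask me for titles). You must finish them, so that you understand the basics of programming.

The next step is the most important: practice, and keep practicing. Applying programming to real tasks is the fastest way to learn. Otherwise, no matter how many books you read, it all stays abstract.

Here I especially recommend a collection of tools that implements many of the common operations needed in bioinformatics. The site is: Bioinformatics Tools
It contains the following 64 tools, and the web page describes clearly what each one does. They are actually very simple, but writing programs like these is a very effective way to learn.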
"Combines multiple FASTA entries into a single sequence."
"Returns the entire sequence contained in an EMBL file in FASTA format."
"Parses the feature table of an EMBL file and returns the feature sequences."
"Parses the feature table of an EMBL file and returns the protein translations."
"Removes non-DNA characters from text."
"Removes non-protein characters from text."
"Returns the entire sequence contained in a GenBank file in FASTA format."
"Parses the feature table of a GenBank file and returns the feature sequences."
"Parses the feature table of a GenBank file and returns the protein translations."
"Converts single letter amino acid codes to three letter codes."
"Reads a list of positions and ranges and returns those parts of a DNA sequence."
"Reads a list of positions and ranges and returns those parts of a protein sequence."
"Determines the reverse-complement, reverse, or complement of the sequence you enter."
"Separates bases according to codon position."
"Converts a FASTA sequence into multiple sequences."
"Converts three letter amino acid codes to one letter codes."
"Returns DNA sequence segments specified by a position and window size."
"Returns protein sequence segments specified by a position and window size."
"Plots codon frequency (according to the codon table you enter) for each codon in a DNA sequence."
"Returns a standard codon usage table."
"Returns a list of potential CpG islands."
"Calculates the molecular weight of DNA sequences."
"Returns positions of the patterns you enter."
"Returns basic sequence statistics."
"Returns sequences that are identical or similar to a query sequence."
"Returns sequences that are identical or similar to a query sequence."
"Accepts aligned sequences in FASTA format and calculates the identity and similarity of each sequence pair."
"Can be used to predict a DNA sequence in another species using a protein sequence alignment."
"Finds DNA sequences that can easily be converted to a restriction site."
"Determines the positions of open reading frames."
"Returns the optimal global alignment for two coding DNA sequences."
"Returns the optimal global alignment for two DNA sequences."
"Returns the optimal global alignment for two protein sequences."
"Returns a report describing PCR primer properties"
"Generates PCR products from a template and two primer sequences."
"Returns the grand average of hydropathy value of protein sequences."
"Returns the predicted isoelectric point of protein sequences."
"Calculates the molecular weight of protein sequences."
"Returns positions of the patterns you enter."
"Returns basic sequence statistics."
"Converts the sequence you enter into restriction fragments."
"Returns the number and positions of restriction sites."
"Can be used to convert protein into DNA."
"Returns the translation in the reading frame you specify."
"Colors a sequence alignment based on sequence conservation."
"Colors a protein alignment based on biochemical properties of residues."
"Numbers and groups DNA according to your specifications."
"Numbers and groups amino acids according to your specifications."
"Shows PCR primer annealing sites, translations, and restriction sites."
"Shows restriction sites and protein translations."
"Shows protein translations."
"Introduces random mutations into DNA sequences."
"Introduces random mutations into protein sequences."
"Generates a random coding sequence of the length you specify."
"Generates a random DNA sequence of the length you specify."
"Replaces regions of the DNA sequences you enter with random bases."
"Generates a random protein sequence of the length you specify."
"Replaces regions of the protein sequences you enter with random residues."
"Samples bases from a DNA sequence with replacement."
"Samples residues from a protein sequence with replacement."
"Randomly shuffles the DNA sequences you enter."
"Randomly shuffles the protein sequences you enter."
"IUPAC codes for DNA and protein."
"The genetic codes used in the Sequence Manipulation Suite."
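Once you have implemented all of these, you will not only have learned to program, you will have learned how programming is applied in bioinformatics!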
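As an illustration of what implementing one of these tools might look like, here is a minimal Python sketch of the first item in the list ("Combines multiple FASTA entries into a single sequence"). The function names `read_fasta` and `combine_fasta` and the default title `combined` are hypothetical choices made for this example. This is not the code behind the website.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def combine_fasta(text, title="combined"):
    """Concatenate every sequence in the input into one FASTA entry."""
    merged = "".join(seq for _, seq in read_fasta(text))
    return f">{title}\n{merged}"


example = ">seq1\nACGT\nAC\n>seq2\nGGTT\n"
print(combine_fasta(example))
```

Once `read_fasta` exists, many of the other tools in the list (statistics, splitting, filtering) can reuse it, which is part of why working through the whole list is good practice.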
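Any of Perl, Python, R, or MATLAB can do the job. For this purpose there is no real difference between them, so don't agonize over the choice of language.
I don't recommend that beginners read source code, because production source code is too formal: the variable definitions alone run to dozens of lines, and the function definitions to hundreds more. People who actually do bioinformatics rarely write scripts longer than fifty lines. For example, each of the 64 data-processing tasks I listed above usually takes just seven or eight lines of code (in Perl).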
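To give a sense of how short such a script can be, here is a minimal sketch of the reverse-complement tool from the list, written in Python rather than Perl. The name `reverse_complement` is only illustrative, and the sketch assumes plain A/C/G/T input; any other character (such as N) is passed through unchanged rather than handled as an IUPAC ambiguity code.

```python
# Translation table mapping each base to its complement, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Complement every base, then reverse the resulting string."""
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("AAGCT"))  # prints AGCTT
```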
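If you don't believe me, look at the code hosted in this GitHub repository: trinityrnaseq/util/misc at master · trinityrnaseq/trinityrnaseq · GitHub
It contains a lot of Perl code for all kinds of data conversion. It is written very formally, to the point of turning a problem that one line could solve into hundreds or even thousands of lines. Unless you want to publish your code or sell it, ordinary bioinformatics research simply doesn't need that!
Of course, back to your original question: where can you find source code?
First, you can go to the library and look through a pile of books. They usually come with a CD containing both videos and source code, or the book will say where the source code can be downloaded, for example: pleac/include/perl at master · pleac/pleac · GitHub
Second, you can find a large number of bioinformatics programs, which are generally hosted on GitHub. This link lists more than three hundred tools for transcriptome analysis: List of RNA-Seq bioinformatics tools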
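This link lists several hundred bioinformatics tools for alignment: NGS数据比对工具持续收集 (an ongoing collection of NGS alignment tools)
(Remember, these tools all came out of published papers and are very hard to master. Getting to the bottom of even one of them in a lifetime would be quite an achievement. I, for example, have only dug into bowtie, and I still only half understand it.)
Even the common bioinformatics databases have their own source packages, for example NCBI, Ensembl, and UCSC.
Here is the one for Ensembl, which shares all of its code, which is really convenient: Ensembl Project · GitHub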
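You can learn programming by following this code: Ensembl/ensembl-pipeline · GitHub
The help documentation on its official site is also very detailed: Help & Documentation
Are you still short of material now?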


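Salary levels for bioinformatics engineers in the US

While browsing forums today, I saw a bioinformatics job posting from Pennsylvania State University (Penn State): https://psu.jobs/job/60050

Interestingly, the posting shows their salary bands, and the bioinformatics engineer they want to hire is paid at band K or L. That means at least about $50,000 a year (band K starts at $49,236), which is quite substantial when converted to RMB. I'm not sure where this pay sits within the US, and it certainly can't compare with what American programmers earn, but it is still better than what most programmers in China make.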
Salary Band   Minimum    Midpoint   Maximum
A             $16,104    $23,748    $31,392
B             $17,712    $26,124    $34,524
C             $19,152    $28,728    $38,304
D             $21,072    $31,620    $42,156
E             $23,604    $35,400    $47,196
F             $26,436    $39,660    $52,872
G             $29,136    $44,412    $59,712
H             $33,192    $50,616    $68,040
I             $37,848    $57,696    $77,580
J             $42,444    $65,772    $89,136
K             $49,236    $76,308    $103,392
L             $57,120    $88,524    $119,928
M             $66,240    $102,672   $139,116
N             $78,168    $121,152   $164,148
O             $90,768    $142,968   $195,168
P             $107,124   $168,696   $230,280
Q             $126,396   $199,056   $271,728
R             $151,668   $238,872   $326,088
A Bioinformatics Analyst position is available within the Bioinformatics Consulting Center at The Pennsylvania State University.
The position is supported by the Huck Institutes for the Life Sciences and requires the candidate to work with multiple project investigators to design and implement computational pipelines for data produced by high throughput sequencing instruments and others, with particular emphasis on metagenomics and microbiome analyses.
Responsibilities include the following: developing and/or maintaining existing software pipelines for analyzing high throughput sequencing data; identifying, evaluating and documenting new methodologies to support ongoing research needs; writing code and developing solutions to computational biology problems, with particular emphasis on microbiome and related samples. The Bioinformatics Analyst will become part of an interdisciplinary team composed of other bioinformatics staff, students and researchers and is expected to interact with other life scientists at Penn State and our international partner institutions in Africa and Asia to assist them with identifying research goals, analytical support needs, while carrying out computational data analysis as needed.
It is anticipated that approximately 50% of your effort will initially be dedicated to providing bioinformatics support and microbiome analysis pipeline development for high-profile collaborative infectious disease surveillance research and training projects in Tanzania as well as other countries in East Africa and South Asia and may involve a limited amount of international travel (once per year).
This job will be filled as a level 3, or level 4, depending upon the successful candidate's competencies, education, and experience. Typically requires a Master's degree or higher in a field of study with focus on computational research methods or higher plus four years of related experience, or an equivalent combination of education and experience for a level 3. Additional experience and/or education and competencies are required for higher level jobs.
In-depth understanding of the computational analysis required for processing data from genomic technologies and their applications: Microbiome, metagenomics, RNA-Seq, genome assembly, genomic data visualization, or others. Expertise in handling and processing data in common bioinformatics formats; knowledge of available bioinformatics tools and genomic data repositories; proven track record of delivering bioinformatics solutions; demonstrated programming skills in one or more programming languages: Python, Perl, Java, C and/or numerical platforms: R, MATLAB, Mathematica. Experience handling large data sets generated from sequencing instruments. Excellent communication skills.
This is a fixed-term appointment funded for one year from date of hire with excellent possibility of re-funding.

Five professors in bioinformatics worth following

In no particular order:

I recommend Istvan Albert, a professor at Pennsylvania State University

He wrote a book: https://www.biostarhandbook.com/
He also offers an online course that leads to a certificate: http://www.personal.psu.edu/iua1/certificate.html
He also recommends a book on R: http://onepager.togaware.com/

Follow Obi L. Griffith, a professor at Washington University School of Medicine

His homepage: http://www.obigriffith.org/

One of his better-known contributions is www.rnaseq.wiki
He is very active on the Biostars bioinformatics forum
His courses include Molecular Basis of Cancer (BIO5288) and Genetics and Genomics of Disease (BIO5487) at Washington University School of Medicine.
According to his homepage, he was also a TA for Genome Analysis (MEDG505) and the bioinformatics section of Advanced Human Molecular Genetics (MEDG520), and a guest instructor for Cell Biology For Biomedical Engineering Graduate Students (APSC552), Cell and Organismal Biology (BIOL111) and Cell Biology (BIOL200) at UBC.

Follow Malachi Griffith, a professor at Washington University School of Medicine

His personal homepage: http://www.malachigriffith.org/index.htm

His GitHub page: https://github.com/malachig

Related project sites: www.dgidb.org and www.alexaplatform.org

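Follow Professor Pablo Cingolani of McGill University

He is the author of SnpEff

His GitHub: https://github.com/pcingola
He currently works at McGill University

I recommend Professor Stephen Turner of the University of Virginia

His homepage: http://stephenturner.us/

All of his public slides: https://speakerdeck.com/stephenturner
I want to single out Professor Stephen in particular, because he provides an especially large amount of teaching material.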