五 05

国外最出名的R语言大会-useR

Posted on 2015年5月5日 by ulwvfje

这是2014年的会议报告以及ppt，但是好像很多ppt都是需要翻墙才能下载

http://user2014.stat.ucla.edu/#tutorials

Morning Tutorials Monday, 9:15

Room	Presenter	Title
Palisades Salon A+B	Max Kuhn	Applied Predictive Modeling in R
Palisades Salon C+F	Winston Chang	Interactive graphics with ggvis
Palisades Salon D+E	Yihui Xie	Dynamic Documents with R and knitr [Slides] [Examples]
Hermosa	Romain Francois	C++ and Rcpp11 for beginners [slides]
Venice	Bob Muenchen	Managing Data with R
Sproul-Landing building, 3rd floor	Matt Dowle	Introduction to data.table [Tutorial] [Talk]
Sproul-Landing building, 4th floor	Virgilio Gomez Rubio	Applied Spatial Data Analysis with R
Sproul-Landing building, 5th floor	Martin Morgan	Bioconductor

Afternoon Tutorials Monday, 14:00

Room	Presenter	Title
Palisades Salon A+B	Hadley Wickham	Data manipulation with dplyr
Palisades Salon C+F	Garrett Grolemund	Interactive data display with Shiny and R
Palisades Salon D+E	Drew Schmidt	Programming with Big Data in R
Hermosa	S繪ren H繪jsgaard	Graphical Models and Bayesian Networks with R
Venice	John Nash	Nonlinear parameter optimization and modeling in R [slides]
Sproul-Landing building, 3rd floor	Dirk Eddelbuettel	An Example-Driven Hands-on Introduction to Rcpp [slides]
Sproul-Landing building, 4th floor	Ramnath Vaidyanathan	Interactive Documents with R
Sproul-Landing building, 5th floor	Thomas Petzoldt	Simulating differential equation models in R

然后2015年的也要开始了，有兴趣的朋友可以June 30 - July 3, 2015
Aalborg, Denmark看看，有很多干货分享！

http://user2015.math.aau.dk/#BN

2015的内容如下

五 04

topGO简单使用

Posted on 2015年5月4日 by ulwvfje

首先载入这个包

source("http://bioconductor.org/biocLite.R")

biocLite("topGO")

biocLite("ALL")

library(topGO)

library(ALL)

data(ALL)

data(geneList)

data(GOdatat)

这样就载入了很多变量， ls()查看如下

[1] "affyLib" "ALL" "geneList" "topDiffGenes"

其中ALL这个数据我在另一篇日志里面重点介绍了一下。

然后我简单提一下"geneList"

head(geneList)

1095_s_at 1130_at 1196_at 1329_s_at 1340_s_at 1342_g_at

1.0000000 1.0000000 0.6223795 0.5412240 1.0000000 1.0000000

str(geneList) 是一个向量，有323个数字。

Named num [1:323] 1 1 0.622 0.541 1 ...

- attr(*, "names")= chr [1:323] "1095_s_at" "1130_at" "1196_at" "1329_s_at" ...

然后简单查询该包的安装地址和一些文件

system.file(package = "topGO")

[1] "C:/Program Files/R/R-3.1.1/library/topGO"

在这个目录下面可以找到文件"examples/geneid2go.map"

里面的内容格式如下，第一列是gene的ID号，一般是entrez ID ，第二列是该基因所对应的GO所有的条目，用逗号分隔。

068724 GO:0005488, GO:0003774, GO:0001539, GO:0006935, GO:0009288

119608 GO:0005634, GO:0030528, GO:0006355, GO:0045449, GO:0003677, GO:0007275

此处省略一万行。

readMappings(file = system.file("examples/geneid2go.map", package = "topGO"))

这个函数可以读取我们的文件，返回一个list。是gene到go的映射，每个基因都有一个或者多个go条目。

这个list可以用inverseList这个函数反转，变成每个go条目到基因的映射。

构建topGO这个大全，需要的数据包括：

基因identifier，可以附上某种分数以便后面施用某种统计处理，分数可以是t检验的p值或者与某个表型的correlation等；
identifier和GO term间的map，如果是芯片数据的话BioC里有多种注释包，声明包的名称即可。至于我等蛋白界苦人，也能自己构建map，见下；
GO的层级结构，由GO.db提供，目前这个包只支持GO.db提供的结构：GOslim就再说了。

topGOdata对象构建函数的参数包括：

ontology，可指定要分析的GO term的类型，即BP、CC之类；
description：topGOdata的描述，可选；
allGenes：基因identifier的原始列表，和后面的geneSelectionFun联合作用，得出参与分析的基因，可以是numeric或factor。
geneSelectionFun：基因选择函数，如果前面allGenes是numeric的话就必须得指明此参数；
nodeSize：被认为富集的GO term辖下基因的最小数目（>=），默认为1。
annotationFun：基因identifier map到GO term的函数。

代码如下

BPterms <- ls(GOBPTerm)

geneID2GO=readMappings(file = system.file("examples/geneid2go.map", package = "topGO"))

直接使用系统自带的data(GOdata)数据，自己构建太麻烦了！

其实就是这就对ALL这个数据集来进行分析！！！

GOdata包含实例topGOdata对象。它可以用来直接运行富集分析。

topGOdata对象构建好后，即可利用这个包里的各种方法和函数做分析。

numGenes(GOdata) 查看对象包含的基因的数目

[1] 318

> description(GOdata)

[1] "Simple topGOdata object"

genes(GOdata)可以得到该对象里面所有的318个基因

geneScore() 可以得到想318个基因的分数

函数sigGenes()返回一个character vector，为各显著变化基因identifier。函数numSigGenes()则用于查看显著变化基因的数目。

函数updateGenes()可以修改topGOdata对象里所包含的基因。

想要看全部基因都对应上了哪些GO term，可用函数usedGO()得到一个character

基因集富集分析（gene set enrichment analysis）

首先看看GSEA的三种方式：

基于count，即仅要求输入一组基因，此种方式最为流行，可用Fisher's exact test、Hypegeometric test和binomial test进行检验；
基于基因的score或rank，可用Kolmogorov-Smirnov like tests（即05年那篇PNAS的GSEA文章所用方法），Gentleman's Category、t-test等方法；
基于基因的表达，可从expression matrix直接分析，如Goeman's globaltest，以及GlobalAncova。

topGO提供两种分析方法，一种自由度更高但上手不易，本菜鸟还是跟着第二种走吧，较为用户友好但集成度较高。简单来说，就是用runTest()这个函数，要求三个主要的argument，一个是之前构建好的topGOdata类实例，第二个参数algorithm用于指定生成GO graph的方法，而参数statistic用于指定所用的检验方法，比如：

> resultFis <- runTest(GOdata, algorithm = "classic", statistic = "fisher")

> resultWeight <- runTest(GOdata, algorithm = "weight", statistic = "fisher")

> resultKS <- runTest(GOdata, algorithm = "classic", statistic = "ks")

> resultKS.elim <- runTest(GOdata, algorithm = "elim", statistic = "ks.elim")

runTest这一锤子买卖敲定后就能开始解读和展示结果了，得到的结果是topGOresult类的一个实例，其组成很简单，为对象的基本信息，以及各基因的分数（p值或者其他统计参数

我这里随便挑一个富集结果来看看

resultFis <- runTest(GOdata, algorithm = "classic", statistic = "fisher")

-- Classic Algorithm --

the algorithm is scoring 590 nontrivial nodes

parameters:

test statistic: fisher

resultWeight <- runTest(GOdata, algorithm = "weight", statistic = "fisher")

-- Weight Algorithm --

The algorithm is scoring 590 nontrivial nodes

parameters:

test statistic: fisher : ratio

然后我们对这两种富集方式来画图

pvalFis=score(resultFis) 得到矫正的P值

pvalWeight <- score(resultWeight , whichGO = names(pvalFis))

返回resultFis和resultWeight共有的基因在resultWeight中的分数。有了这两组分数，可以做一些比较，比如关联分析：

cor(pvalFis, pvalWeight)

[1] 0.370151

library(lattice)

xyplot(pvalWeight ~ pvalFis) 画了一个散点图

五 04

R语言里面的一个数据集ALL(Acute Lymphoblastic Leukemia)简介

Posted on 2015年5月4日 by ulwvfje

这个数据内容太多了，我感觉自己也理解的不是很清楚！

非常多的R的bioconductor包都是拿这个数据集来举例子的，所以我简单的介绍一下这个数据集。

这个数据集是对ALL这个病的研究数据，共涉及到了128个ALL病人，其中95个是B细胞的ALL，剩余33个是T细胞的ALL。

是一个芯片数据，同时还包含有其它的病人信息。

大家首先要在R里面安装这个数据集

source("http://bioconductor.org/biocLite.R")

biocLite("ALL")

library(ALL)

data(ALL)

data(geneList)

在R里面输入str(ALL)可以看到这个数据的具体信息，但是非常多!

ALL

ExpressionSet (storageMode: lockedEnvironment)

assayData: 12625 features, 128 samples

element names: exprs

protocolData: none

phenoData

sampleNames: 01005 01010 ... LAL4 (128 total)

varLabels: cod diagnosis ... date last seen (21 total)

varMetadata: labelDescription

featureData: none

experimentData: use 'experimentData(object)'

pubMedIds: 14684422 16243790

Annotation: hgu95av2

我们首先它的BT变量记录的是什么

可以看出它记录的是数据病人的分组信息。

bcell = grep("^B", as.character(ALL$BT))通过这句话可以挑选出B细胞病人

然后我们看看它的ALL$mol.biol变量记录是是什么

可以看到它记录的是这些病人的几种突变情况（molecular biology testing）

types = c("NEG", "BCR/ABL")

moltyp = which(as.character(ALL$mol.biol) %in% types)

用这个命令就能挑选出我们想研究的两组突变的病人。

然后我们还可以把刚才的两个标准用来从ALL数据集里面取想要的子集

ALL_bcrneg = ALL[, intersect(bcell, moltyp)]

同理我们可以查看这个数据集的非常多的变量信息。

包括sex，age，cod，diagnosis，等等，这个'data.frame':共有128 obs. of 21 variables:

我们除了可以查看这个ALL数据集自带的变量，还可以通过一些方法来访问它的其它信息。

Exprs这个方法可以查看它的表达数据，可以看到有128个变量，12625行的探针数据。

str(exprs(ALL))

num [1:12625, 1:128] 7.6 5.05 3.9 5.9 5.93 ...

- attr(*, "dimnames")=List of 2

..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...

..$ : chr [1:128] "01005" "01010" "03002" "04006" ...

还有很多很多函数都可以操作这个数据集，这样可以得到非常多的信息！我就不一一列举了

对这个数据的一系列操作可以画热图，见下面的教程！！！

http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/heatmap/

五 03

生物信息入门视频推荐-新一代测序数据分析

Posted on 2015年5月3日 by ulwvfje

新一代测序数据分析-在优酷里面可以搜索到，一下是配套视频的讲义及下载地址！
Lecture Notes
Lectures will appear below as they are presented. Homeworks are specified in each handout.
Lecture 1 - slides, handouts. course information, homework and project information, introduction to computing, setting up you computer, basic unix command line usage, organizing your projects, homework 1.
- Suggested reading: A Quick Guide to Organizing Computational Biology Projects, Plos Comp Bio, 2009,
- Other blog posts on organizing projects and research workflow.
Lecture 2 - slides, handouts, The GFF format, sequence ontologies, basic Unix commands: wc, grep, cut, sort, redirecting input and output streams, piping commands, processing a tabular file with UNIX tools, homework 2
- dataset for this lecture, the latest yeast genome feature file: saccharomyces_cerevisiae.gff
- Unix flags explained: Explain Shell
- Suggested reading: Unix for Biologists
Lecture 3 - slides, handouts. programming languages, download and install an proper editor, introduction to the AWK programming language, tabular file processing, filtering by feature types, Awk onliners explained, another collections of AWK oneliners, homework 3.
- Recommended editors Komodo Edit or Sublime Text
- Suggested reading The Art of Unix Programming,
- Biostar question of the day, iterating through files: Very Bad Things
Lecture 4 - slides, handouts, sequencing technologies, sequence representations, the FASTA format, processing FASTA files at the command line, homework 4.
- dataset for this lecture, sequencing reads from a 454 instrument: lec4.fa (6Mb)
- The FASTA format and the IUPAC codes
- Suggested reading: Molecular Biology for Computer Scientists,
- Biostar question of the day: Sequencing Technology Reviews
Lecture 5 - slides, handouts, string matching, edit distances, regular expressions, local and global alignments, homework 5.
- Regular Expression tester
- Pairwise sequence alignment tools at EMBL
- Book: comprehensive discussion of basic bioinformatics concepts: Understanding Bioinformatics by Dr. Marketa Zvelebil and Jeremy Baum.
Lecture 6 - slides, handouts, introduction to using blast, legacy blast and blast+, preparing blast databases, performing a blastn query, formatting blast output, homework 6.
- Download blast executables
- Download: saccharomyces_cerevisiae.gff
- Book: detailed information on "legacy" blast (published in 2003): BLAST by Ian Korf, Mark Yandell and Joseph Bedell
Lecture 7 - slides, handouts, using blast, formatting databases, using the blastdbcmd, extract sequences, batch operations, formatting blast queries, homework 7.
- Official Blast documentation: BLAST Command Line Applications User Manual
Lecture 8 - slides, handouts, blast score and E-values, search strategies, usage examples for blastn, blastp, blastx, tblastn, and tblastx, homework 8.
- Download the required dataset
- Optional: Join the Applied Bioinformatics Study Room find the course called Applied Bioinformatics at Penn State
- NCBI guide to BLAST E-Values: The Statistics of Sequence Similarity Scores
Lecture 9 - slides, handouts, quality encodings, phred scales, the FASTQ format, homework 9.
- The FASTQ format wiki wiki page
- The Phred quality score
Lecture 10 - slides, handouts, file compression, gzip, zip, bz2, file archives, tarbombs, plotting fastq qualities homework 10.
- Download the lecture-10.tar.gz dataset
- A tarbomb, handle with care ...
- The FastQC toolkit
Lecture 11 - slides, handouts installing tools, quality control, adapter trimming, error corrections
- SeqTK gitHub page
- Biostar Question of the Day: Fastq Quality Control Shootout
- Package manager Homebrew for Mac
- Suggested reading: ${link('illumina')}.
- Simple sequence related utilities in the Sequence Manipulation Suite
- Software tools for adapter trimming:
  - CutAdapt application note in Embnet Journal, 2011
  - fastq-mcf published in The Open Bioinformatics Journal, 2013
  - PrinSeq application note in Bioinformatics, 2011
  - Trimmomatic application note in Nucleic Acid Research, 2012, web server issue
  - Trim Galore
  - NGS Toolkit published in Plos One, 2012
  - Fastx Toolkit
  - BioPieces a suite of programs for sequence preprocessing
  - Scythe a bayesian adaptor trimmer
  - FlexBar, Flexible barcode and adapter removal published in Biology, 2012
  - SeqPrep
  - TagDust published in Bioinformatics, 2009
  - TagCleaner published in BMC Bioinformatics, 2010
  - Libraries via R (Bioconductor): PIQA, ShortRead
Lecture 12 - slides, handouts paired end sequencing, quality control for paired end sequencing, the bioawk language
- download the dataset for lecture 12: lec12.tar.gz (20Mb)
- bioawk GitHub page
- Paried end aware software tools for adapter trimming:
  - Trimmomatic application note in Nucleic Acid Research, 2012, web server issue
  - fastq-mcf published in The Open Bioinformatics Journal, 2013
Lecture 13 - slides, handouts paired end sequencing, read stiching, automating tasks with shell scripts
- for the homework use the dataset for lecture 12: lec12.tar.gz (20Mb)
- Trimmomatic application note in Nucleic Acid Research, 2012, web server issue
- Fast Length Adjustment of SHort reads
- Musket - a multistage k-mer spectrum based corrector
- Bash programming HOWTO
Lecture 14 - slides, handouts short read alignments, bwa, bowtie and other tools.
- official bwa website
- the bwa-mem paper currently on arxiv
- bwa-mem rejection from Bioinformatics and bwa-mem open review
- Heng Li on wikipedia and on sourceforge
- The bowtie1 aligner and the bowtie2 aligner
Lecture 15 - slides, handouts the sequence alignment map SAM format
- dataset for the homework: lec15.fq.gz (3.8Mb)
- official SAM specification
- What do flags mean: Explain SAM flags
Lecture 16 - slides, handouts the SAM/BAM format, sorting and indexing BAM files, using the samtools program
- The Samtools page
- What do SAM flags mean: Explain SAM flags
Lecture 17 - slides, handouts aligning paired end reads, comparing and evaluating aligners, simulating sequencing reads with the wgsim tool
- The wgsim github page
- The dwgsim sourceforge page
- The simNGS tool
- ART: a next-generation sequencing read simulator Bioinformatics, 2012
- Benchmarking short sequence mapping tools, BMC Bioinformatics. 2013
- pIRS: Profile-based Illumina pair-end reads simulator, Bioinformatics. 2012
- GemSIM: general, error-model based simulator of next-generation sequencing data, BMC Genomics. 2012
- Optimizing information in Next-Generation-Sequencing (NGS) reads for improving de novo genome assembly
Lecture 18 - slides, handouts read duplication, visualizing alignments with IGV and IGB
- FastUniq: A Fast De Novo Duplicates Removal Tool for Paired Short Reads, PLoS One. 2012;
- The old Pileup format
- Samtools text alignment viewer
- Integrative Genomics Viewer by the Broad Institute, Integrated Genome Browser by UNC Charlotte, Tablet - Next Generation Sequence Assembly Visualization
- Visualizing genomes, techniques and challenges, Nat Methods. 2010
- Neat visualizations via DNA Skittle
- Biostar question of the day What tools/libraries do you use to visualize genomic feature data?
Lecture 19, guest lecture by Nicholas Stoler - slides, the variant call format (VCF), calling variants with samtools mpileup
- The multi sample pileup format
- 5 Things to Know About SAMtools Mpileup
- The VCF format 4.1 specification
- The The Variant Call Format and VCFtools Poster
- Suggested reading: A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data., Bioinformatics, 2011
- Biostar question of the day How to distinguish heterozygotes and homozygotes from variants in VCF format
Lecture 20,- slides, handouts origins of genome variations, more on SNP calling, successes and failures
- Genome Analysis Toolkit
- Recommended reading: Genomes Unzipped blog:
Lecture 21,- slides, handouts interval representation, BED and GFF formats, representing data
- USCS Format FAQ
- The GFF3 format
Lecture 22,- slides, handouts interval operations: complement, extension, flanking, Using the BedTools package
- Bedtools homepage and the latest documentation
- Bedtools old pdf documentation contains many details that the new documentation does not
- Biostar question of the day What are the most common stupid mistakes in bioinformatics?
Lecture 23,- slides, handouts interval operations: intersect, window, selecting closest features
- Bedtools homepage and the latest documentation
- The bedops tool and a nice usage example
Lecture 24,- slides, handouts an introduction to genome assembly, using the velvet assembler, evaluating genome assemblies with QUAST
- The Velvet assembler, see also: Velvet: Algorithms for de novo short read assembly using de Bruijn graphs
- The Minia assembler, see also Space-efficient and exact de Bruijn graph representation based on a Bloom filter, WABI 2012
- The Fermi assembler, see also Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly, Bioinformatics, 2012
- Quast assembly evaluator, see also QUAST: quality assessment tool for genome assemblies
Lecture 25,- slides, handouts, meta.tar.gz (25MB) an introduction to metagenomics, software packages mothur, QIIME and MetaSim, online tools RDP, MG-RAST
- So how many bacteria are in the human body
- the mothur package
- QIIME: Quantitative Insights Into Microbial Ecology
- MEGAN: the Metagenome Analyzer
- MetaSim: A Sequencing Simulator for Genomics and Metagenomics
- RDP: classifier
Lecture 26,- slides, handouts, lec26.tar.gz (25MB) an introduction to Chip-Seq technology, peak calling concepts, preprocessing and peak calling methods (part 1)
- Search google for "Introduction to Chip-Seq", quite a few resources
- Recommended reading Applications of next-generation sequencing (Nature, resources)
- the bioawk-tools utilities
Lecture 27,- slides, handouts, Chip-Seq peak calling sofware, preprocessing and peak calling methods (part 2)
- The peak caller used in the Encode Project Model-based Analysis for ChIP-Seq and Identifying ChIP-seq enrichment using MACS in Nature Protocols, 2012
- Site Identification from Short Sequence Reads
- A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs, Nature Protocols, 2012
Lecture 28,- slides, handouts, lec28.tar.gz basic RNA-Seq data analysis concepts, split read mapping
- The Tuxedo suite: Bowtie, Tophat, Cufflinks
- The rlsim simulation package and related paper Realistic simulations reveal extensive sample-specificity of RNA-seq biases
- The FLUX read simulator and related paper: Modelling and simulating generic RNA-Seq experiments with the flux simulator NAR, 2012
Lecture 29, slides, handouts, lec29.tar.gz RNA-Seq (part 2)
- BLOG: RPKM measure is inconsistent among samples, Damian Kao provides a simple and logical description of a paper that talks about the problems with RPKM
- CSHL Keynote 2013; Dr. Lior Pachter, UC Berkeley TLDR: use TPM (Transcript per million) instead of RPKM, FPKM
Lecture 30, slides, handouts, bioinformatics beyond the command line: using R for data analysis
- The Bioconductor package
- A brief introduction to "apply" in R
- Hadley Wickham's homepage
- The ggplot2 visualiztion library
Final Project 30, final-project, data for final project pony.tar.gz (17Mb) BMB 597D: Final project, 50% of the final grade, due 5pm Saturday Dec 14th, 2013

五 01

脚本作业-解读NCBI的ftp里面关于人的一些基因信息

Posted on 2015年5月1日 by ulwvfje

为了感谢大家对我博客的关注，我在这里发布一个作业，适合菜鸟做的。里面有十几个类似的问题，大家可以下载数据自行处理，如果是问这些问题，我优先回答！

NCBI的ftp里面关于人的一些基因信息

我在NCBI的ftp服务器里面下载了这些数据，时间是2015年，大多是hg19系列的，文件名如下：

CDS.fa 这个是ensembl中人的CDS碱基序列文件，hg38

entrez2go.gene 这个是有go注释的基因情况，有一万八的基因都有go注释

entrez2name.gene 这个是NCBI的entrez ID号对应着基因名的文件

entrez2pubmed.gene 这个是NCBI的entrez ID号对应着该基因发表过的文章的ID号

entrez2refseq2ensembl.gene 这个是NCBI的entrez ID号对应着基因名的refseq的ID号和ensembl数据库的ID号

human_gene_info这个是基因的详细信息，包括基因的起始终止点坐标等等

Protein.fa 这个是ensembl中人的蛋白的氨基酸序列文件，有十万多个蛋白hg38

ref2ensembl.txt 这个是基因名的refseq的ID号和ensembl数据库的ID号

自行去NCBI的ftp服务器里面下载这些数据。

然后好好熟悉这些数据信息，回答一下几个问题：

人总的基因有多少个，它们分别分布在哪些染色体上面，基因的转录本分布情况如何，基因的长度分布如何，基因的外显子个数如何。

CD分子的基因有多少个，它们分别分布在哪些染色体上面，基因的转录本分布情况如何，基因的长度分布如何，基因的外显子个数如何。它们有没有氨基酸偏好性？？

MHC系列基因信息？CCL系列基因信息如何？CXCL系列信息如何？或者你感兴趣的基因家族信息？

现在研究最热门的基因是什么？发表文章最多的前十个基因是什么？

基因长度情况如何？最长的基因多长？最短的基因多少bp，可靠吗？

蛋白质长度情况如何？

每条染色体的基因分别情况？基因在染色体那个地方分别最多？

请用图形展示你的结论！！！

如果你能回答以上问题，证明你的脚本水平不错了。

如果找不到我，看旁边的公告，加入生信菜鸟群，我就在里面！！！

五 01

CHIP-seq第三讲之使用MACS软件寻找peaks

Posted on 2015年5月1日 by ulwvfje

在使用Bowtie比对于完Chip-Seq的结果后，就需要用到MACS或者ERANGE来找出峰所在的位置了。但是由于ERANGE的设置比较复杂，所以最为流行的还是MACS。

一．首先安装MACS软件

MACS有两个版本，分别是MACS14和MACS2。MACS2在很多方面都对MACS14做了重大改进，但目前还在测试阶段。我们依然以MACS14为例进行说明。

MACS软件的下载地址在wget https://codeload.github.com/taoliu/MACS/zip/master

这是一个python软件，有152M，已经算是很大了！所以需要按照安装python的方法来安装它！但是，好像这个是最新版的，我们还是用1.4版本吧

wget http://github.com/downloads/taoliu/MACS/MACS-1.4.2-1.tar.gz

其实它的readme已经把这个软件的各种安装使用方法讲的很清楚了。

https://github.com/taoliu/MACS/blob/master/README.rst

MACS软件的具体原理，大家去看文献，或者参考这篇文章

http://www.plob.org/2014/05/08/7227.html

很简单的一个python命令即可安装该软件python setup.py install --user

二．然后准备该软件所需要的数据

是我们在前两篇文章中提到的数据

三．接着运行MACS的命令

/home/jmzeng/.local/bin/macs14 -t Xu_WT_rep2_BAF155.fastq.trimmed.single.bam \

> -c Xu_WT_rep2_Input.fastq.trimmed.single.bam \

> -f BAM -g hs --bw 300 -w -S -n Xu_WT_rep2

四．最后解读一下结果

56K Apr 30 21:54 Xu_WT_rep2_model.r

5.5K Apr 30 22:21 Xu_WT_rep2_negative_peaks.xls

783K Apr 30 22:21 Xu_WT_rep2_peaks.bed

865K Apr 30 22:21 Xu_WT_rep2_peaks.xls

766K Apr 30 22:21 Xu_WT_rep2_summits.bed

唉，反正这也不是我的课题，懒得解释这些结果啦，等后来有机会再慢慢玩吧

参考 http://www.plob.org/2014/05/08/7227.html

附录：我们现在来了解如何设置参数。

参考自 http://www.plob.org/2014/01/26/7118.html

-t TFILE, –treatment=TFILE 输入文件名

-c CFILE, –control=CFILE 输入阴对文件名

-n NAME, –name=NAME 输入出文件名前缀

-f FORMAT, –format=FORMAT 输入文件格式，默认值为AUTO，可选的值为”BEG”,”ELAND”,”ELANDMULTI”,”ELANDMULTIPER”,”ELANDEXPORT”,”SAM”,”BAM”,”BOWTIE”等。

-g GSIZE, –gsize=GSIZE 比对模板大小。格式可以是：1.0e+9，或者1000000000，也可以缩写：’hs’ for 人类 (2.7e9), ‘mm’ for 大鼠(1.87e9), ‘ce’ for 线虫 (9e7) and ‘dm’ for 果蝇 (1.2e8), 默认值:hs

-s TSIZE, –tsize=TSIZE 设置为短序列的长度，默认值为25

-p PVALE, –pvalue=PVALUE 非峰可能性截取值，默认值为1e-5，这个值不能大太，超过0.9的话，可能无法输出正确的结果

-m MFOLD, –mfold=MFOLD 峰值高度相对于本底的比值，默认值为10,30。也就是说，最低值不能少于10，但比值超过30也不认为它是正常的一个峰。一般而言，低值设置为10是一个很好的区分点。如果这个值还是无法得到满意的结果，那么可以设置得更低，但最好还是使用–nomodel参数，使–nomodel设置为True，然后再传递–shiftsize及–bw参数给MACS。–shiftsize默认值为100，而–bw的默认值为300。

–diag 生成完整报表，会包括是否为真峰的可能性，但会严重拖累运算速度。

生信菜鸟团

欢迎去论坛biotrainee.com留言参与讨论，或者关注同名微信公众号biotrainee

Monthly Archives: 5月 2015