Reading Affymetrix gene expression array data (CEL files) with the oligo package

ulwvfje — Sat, 23 Apr 2016 14:58:31 +0000

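As mentioned in the affy post, the affy package only handles a limited set of array platforms, mostly the HG-U95 and HG-U133 series. [HuGene-1_1-st] Affymetrix Human Gene 1.1 ST Array is also an Affymetrix platform, but affy cannot process it, and that is where the oligo package comes in!

oligo is a Bioconductor package for R. For our purposes it does one job: read Affymetrix gene expression CEL files and turn them into an expression matrix!

As before, we start by downloading the raw data. The example here is GSE48452.

After downloading, extract the archive into a directory of your choice and oligo can read it directly!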

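library(oligo)
# Use full paths. CEL files are usually gzipped, and there is no need to decompress them.
geneCELs <- list.celfiles('/path/GSE48452/cel_files/', listGzipped = TRUE, full.names = TRUE)
affyGeneFS <- read.celfiles(geneCELs)  # read the CEL files
# RMA normalization; this step is fairly slow
geneCore <- rma(affyGeneFS, target = "core")     # summarized at the transcript level
genePS <- rma(affyGeneFS, target = "probeset")   # summarized at the probeset level
# These are two summarization levels of the same normalization; we usually pick the transcript-level one
# On this platform you have to attach the probe annotation to the expression set yourself
featureData(genePS) <- getNetAffx(genePS, "probeset")
featureData(geneCore) <- getNetAffx(geneCore, "transcript")
# The probe IDs still need to be mapped to gene IDs, which is not covered here!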

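The resulting expression matrix should agree with the one on the GEO website, provided the submitters normalized the data the same way. You can check this yourself: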

ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE48nnn/GSE48452/matrix/GSE48452_series_matrix.txt.gz
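For the comparison itself, here is a minimal base R sketch. It assumes the series matrix file is a tab-separated table wrapped in metadata lines that start with "!". The helper name read_series_matrix and all values are made up for illustration, and my_mat only stands in for something like exprs(geneCore).

```r
# Toy stand-in for an uncompressed *_series_matrix.txt file (values are invented).
toy_lines <- c(
  '!Series_title\t"toy series"',
  '!series_matrix_table_begin',
  '"ID_REF"\t"GSM0001"\t"GSM0002"',
  '"probe_a"\t7.1\t7.3',
  '"probe_b"\t5.0\t4.8',
  '!series_matrix_table_end'
)

# Hypothetical helper: drop the "!" metadata lines and parse the table that is left.
read_series_matrix <- function(lines) {
  table_lines <- lines[!startsWith(lines, "!")]
  tab <- read.delim(text = table_lines, header = TRUE, row.names = 1,
                    check.names = FALSE)
  as.matrix(tab)
}

geo_mat <- read_series_matrix(toy_lines)

# Stand-in for the matrix computed locally (assumption: same probe and sample IDs).
my_mat <- geo_mat + 0.01

# Compare only the probes present in both matrices, with columns in the same order.
common <- intersect(rownames(my_mat), rownames(geo_mat))
max_abs_diff <- max(abs(my_mat[common, colnames(geo_mat)] - geo_mat[common, ]))
print(max_abs_diff)
```

For a real file, the lines could come from readLines(gzfile("GSE48452_series_matrix.txt.gz")). The column names of a locally computed matrix are usually CEL file names, so they would likely need trimming to GSM IDs before the comparison.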

Reading Affymetrix gene expression array data (CEL files) with the affy package

ulwvfje — Sat, 23 Apr 2016 14:50:46 +0000

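An Affymetrix probe is usually a 25-base oligonucleotide. Probes always come in perfect match / mismatch pairs, whose signals are called PM and MM, and the paired perfect match and mismatch probes share a common affyID.
CEL file: the signal intensities together with their positions on the array.
CDF file: where each probe pair sits on the array.

affy is a Bioconductor package for R. For our purposes it does one job: read Affymetrix gene expression CEL files and turn them into an expression matrix!

We usually look up the CEL download address in the GEO database. Take GSE1428: it profiled samples from 10 young (19-25 years old) and 12 older (70-80 years old) males in order to find differentially expressed genes. From GEO, the CEL download address is: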

ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1428/suppl/GSE1428_RAW.tar

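We download the raw data only to demonstrate affy; GEO also offers an already processed expression matrix for download.

After downloading, extract the archive into a directory of your choice.

Once the files are on disk, you can read them with the following code!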

library(affy)
dir_cels='D:\\test_analysis\\TNBC\\cel_files'
affy_data = ReadAffy(celfile.path=dir_cels)
eset.mas5 = mas5(affy_data)

Reading the files takes quite a while. You can also normalize the expression data with the rma function instead of mas5.
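One thing to watch when swapping the two: rma returns values on the log2 scale, while mas5 returns values on the original intensity scale. A small sketch with invented numbers, where the two toy matrices only stand in for exprs(mas5(affy_data)) and exprs(rma(affy_data)):

```r
# Toy matrices; the numbers are invented for illustration.
mas5_like <- matrix(c(128, 1024, 64, 2048), nrow = 2,
                    dimnames = list(c("probe_a", "probe_b"), c("s1", "s2")))
rma_like  <- matrix(c(7.2, 9.9, 6.1, 11.0), nrow = 2,
                    dimnames = dimnames(mas5_like))

# Bring the MAS5-style values onto the log2 scale before comparing them with RMA.
mas5_log2 <- log2(mas5_like)

print(mas5_log2)
print(round(mas5_log2 - rma_like, 2))
```

The two methods will still not give identical numbers after the transform, since they use different background correction and summarization, but at least the values are on a comparable scale.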

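The expression matrix after reading looks like this (figure in the original post):

In principle, the processed data should match the expression values downloaded directly from GEO, as long as the same normalization method was used. The download links all follow a regular pattern!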

ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1428/matrix/GSE1428_series_matrix.txt.gz
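The pattern in these links can be written down as a small helper. This is only a sketch: geo_series_urls is a made-up name, and the file names are assumed to follow the single-platform pattern of the links in this post (series with several platforms name their matrix files differently).

```r
# Hypothetical helper: build GEO FTP URLs from a series accession.
geo_series_urls <- function(acc) {
  # Replace the last three digits with "nnn", e.g. GSE1428 -> GSE1nnn
  stub <- sub("[0-9]{1,3}$", "nnn", acc)
  base <- paste0("ftp://ftp.ncbi.nlm.nih.gov/geo/series/", stub, "/", acc)
  c(raw    = paste0(base, "/suppl/", acc, "_RAW.tar"),
    matrix = paste0(base, "/matrix/", acc, "_series_matrix.txt.gz"))
}

print(geo_series_urls("GSE1428"))
print(geo_series_urls("GSE48452"))
```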

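Of course, the affy package supports only a limited set of array platforms!

Mostly the HG-U95 and HG-U133 series.

Strictly speaking, the expression matrix obtained from this kind of array should also be filtered.

For example, with code like this: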

setwd('../')
library(affy)
dir_cels <- 'GSE34824_RAW'
affy_data <- ReadAffy(celfile.path = dir_cels)
eset <- rma(affy_data)
calls <- mas5calls(affy_data)  # get PMA (present / marginal / absent) calls
calls <- exprs(calls)
n_absent <- rowSums(calls == 'A')      # number of samples in which each probe set is called 'absent'
all_absent <- n_absent == ncol(calls)  # TRUE for probe sets that are 'absent' in all samples
rmaFiltered <- eset[!all_absent, ]     # logical index, so nothing is lost when no probe set is absent everywhere

After filtering, 42482 of the 54675 features remain.
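To see what the filter does without loading any CEL files, here is a toy version in base R. The call matrix is invented. The last part shows why a logical index is safer than negating the result of which(): when nothing matches, a negated empty index drops every row.

```r
# Toy PMA call matrix (P = present, M = marginal, A = absent); values are invented.
calls <- matrix(c("P", "P", "A",
                  "A", "A", "A",
                  "A", "M", "P"),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("ps1", "ps2", "ps3"), c("s1", "s2", "s3")))

# A probe set is dropped only when it is called absent in every sample.
all_absent <- rowSums(calls == "A") == ncol(calls)
kept <- rownames(calls)[!all_absent]
print(kept)

# Pitfall: with no matches, which() returns integer(0) and x[-integer(0)] is empty.
x <- c(a = 1, b = 2, c = 3)
none <- rep(FALSE, length(x))
dropped_all <- x[-which(none)]
kept_all    <- x[!none]
print(length(dropped_all))
print(length(kept_all))
```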

A worked example of expression array data processing

ulwvfje — Fri, 25 Sep 2015 14:53:39 +0000

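This example covers:

How to download GEO data with an R package (single-platform series only; series with several platforms need changes to the code below)

How to normalize GEO array data and obtain the expression matrix,

How to run a differential expression analysis with the limma package,

How to annotate the resulting differential genes with GO and KEGG

First, download two GEO datasets:

Platform: Affymetrix U133 gene chips

67 diseased triple negative breast cancer samples (GSE31519) and 42 control samples (GSE20437)

Both are expression data from the same array type. They fall into two groups, which is just right for a differential expression analysis.

The data come from the following papers:

Paper title: A clinically relevant gene signature in triple negative and basal-like breast cancer

Conclusion: (We describe a ratio of high B-cell presence and low IL-8 activity as a powerful new prognostic marker for TNBC.)

Link: http://www.breast-cancer-research.com/content/13/5/R97

GEO record: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE31519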

Platform: GPL96 67 Samples

Download data: GEO (CEL, TXT)

SeriesAccession: GSE31519 ID: 200031519

Paper title: Histologically normal epithelium from breast cancer patients and cancer-free prophylactic mastectomy patient

GEO record: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20437

Platform: GPL96 Series: GSE20437 42 Samples

Download data: GEO (CEL)

DataSetAccession: GDS3716 ID: 3716

I first download the data with R's GEOquery package. (You can just as well download it from the GEO website and extract it yourself.)

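suppressMessages(library(GEOquery))
setwd("D:\\test_analysis\\TNBC")
# Series matrix (processed expression data) for each dataset
gse31519 <- getGEO("GSE31519", GSEMatrix = TRUE, destdir = "./")
# Supplementary files, which include the raw CEL archive
getGEOSuppFiles("GSE31519", baseDir = "./")
gse20437 <- getGEO("GSE20437", GSEMatrix = TRUE, destdir = "./")
getGEOSuppFiles("GSE20437", baseDir = "./")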

After this, all the downloaded data sit under D:\\test_analysis\\TNBC

Next we use the affy and limma packages for the differential analysis:

library(affy)
library(limma)
affy.data <- ReadAffy(celfile.path = "./cel_files")

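Make sure you understand how the ReadAffy function works first!

Is there a cel_files folder under the current working directory?

Does the cel_files folder actually contain any files?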
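Both questions can be answered from R before calling ReadAffy. A minimal base R sketch, where find_cel_files is a hypothetical helper name and the demo uses empty placeholder files in a temporary directory rather than real CEL files:

```r
# Hypothetical helper: list CEL files (plain or gzipped) in a directory,
# stopping with a clear message when the directory or the files are missing.
find_cel_files <- function(dir_cels) {
  if (!dir.exists(dir_cels)) {
    stop("Directory not found: ", normalizePath(dir_cels, mustWork = FALSE))
  }
  cels <- list.files(dir_cels, pattern = "\\.cel(\\.gz)?$",
                     ignore.case = TRUE, full.names = TRUE)
  if (length(cels) == 0) {
    stop("No CEL files found in: ", dir_cels)
  }
  cels
}

# Demo on a temporary directory with placeholder files.
demo_dir <- file.path(tempdir(), "cel_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("GSM1.CEL", "GSM2.cel.gz", "notes.txt"))))
print(basename(find_cel_files(demo_dir)))
```

In the real workflow you would call find_cel_files("./cel_files") and only go on to ReadAffy when it returns the files you expect.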

eset.rma <- rma(affy.data)
exprSet <- exprs(eset.rma)
write.table(exprSet, "expr_rma_matrix.txt", quote = FALSE, sep = "\t")
# Assumes the first 42 columns are controls and the next 67 are cases; check this against sampleNames(affy.data)
group <- factor(c(rep("control", 42), rep("case", 67)))
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)  # "case", "control": factor levels are sorted alphabetically
rownames(design) <- sampleNames(affy.data)
fit <- lmFit(exprSet, design)
cont.matrix <- makeContrasts(contrasts = "case-control", levels = design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
diff_dat <- topTable(fit2, coef = 1, number = Inf)
write.table(diff_dat, "diff_dat.txt", quote = FALSE)

The resulting diff_dat is the output of our differential analysis.
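The design matrix is the step that is easiest to get wrong, because the column order follows the factor levels rather than the order in which the groups were written. A small base R sketch with invented group sizes (3 controls, 2 cases) shows what model.matrix produces:

```r
# Toy grouping: the first three samples are controls, the last two are cases.
group <- factor(c(rep("control", 3), rep("case", 2)))

# Factor levels are sorted alphabetically, so "case" comes before "control".
print(levels(group))

design <- model.matrix(~0 + group)
colnames(design) <- levels(group)
print(design)

# Each column should sum to the size of its group.
print(colSums(design))
```

Naming the columns from levels(group) keeps the labels tied to the factor, so a contrast such as "case-control" refers to the intended columns.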

The original text says: we choose the log fold cut off change to be “2” to get a manageable set of genes.
It also says: we were able to get a list of 2567 genes after removing the duplicates and the not available genes

Picking differential genes by that single criterion alone, a log fold change cutoff of 2, I got only 782 probes.
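Here is that cutoff applied to a toy table in base R. The data frame only imitates the logFC and adj.P.Val columns of a topTable result, and all numbers are invented. The extra adjusted p-value filter is my own assumption, shown as an option rather than something the workflow above does.

```r
# Toy stand-in for the topTable() output.
diff_dat <- data.frame(
  logFC     = c(2.5, -3.1, 0.4, 1.9, -2.0),
  adj.P.Val = c(0.001, 0.0004, 0.8, 0.03, 0.01),
  row.names = c("p1", "p2", "p3", "p4", "p5")
)

lfc_cutoff <- 2

# Keep probes whose absolute log fold change is strictly above the cutoff.
diff_probe <- rownames(diff_dat)[abs(diff_dat$logFC) > lfc_cutoff]
print(diff_probe)

# Optional stricter variant (assumption): also require a small adjusted p-value.
keep_strict <- abs(diff_dat$logFC) > lfc_cutoff & diff_dat$adj.P.Val < 0.05
print(rownames(diff_dat)[keep_strict])

# RMA values are on the log2 scale, so this cutoff corresponds to a fold change of:
print(2^lfc_cutoff)
```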

Next, annotate these probes to get gene names. Here I use the biomaRt package.

Our platform is Affymetrix U133 gene chips: it has 22283 probes but only 13908 genes.

So the code is:

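library(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Map the probe IDs to HGNC gene symbols
gene_probe <- getBM(attributes = c("hgnc_symbol", "affy_hg_u133a"), filters = "affy_hg_u133a", values = rownames(diff_dat), mart = ensembl)
# Probes with |logFC| > 2 (the first column of the topTable output is logFC)
diff_probe <- rownames(diff_dat[abs(diff_dat[, 1]) > 2, ])
diff_gene <- gene_probe[match(diff_probe, gene_probe[, 2]), 1]
diff_gene <- na.omit(diff_gene)  # drop probes without a match
diff_gene <- unique(diff_gene)   # several probes can map to the same gene
length(diff_gene)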

This gives 604 differential genes.
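The match / na.omit / unique chain is easy to check on a toy mapping table. In this base R sketch the gene symbols and probe IDs are invented. It also drops empty strings, on the assumption that a probe without a symbol may come back from the query as "" rather than NA.

```r
# Toy probe-to-gene table shaped like the getBM() result.
gene_probe <- data.frame(
  hgnc_symbol   = c("GENE_A", "GENE_A", "GENE_B", ""),
  affy_hg_u133a = c("p1", "p2", "p3", "p4"),
  stringsAsFactors = FALSE
)

# Differential probes; "p9" has no entry in the table at all.
diff_probe <- c("p1", "p2", "p4", "p9")

# Look up the symbol of each probe; unmatched probes give NA.
diff_gene <- gene_probe$hgnc_symbol[match(diff_probe, gene_probe$affy_hg_u133a)]

# Drop missing and empty symbols, then collapse probes that hit the same gene.
diff_gene <- diff_gene[!is.na(diff_gene) & diff_gene != ""]
diff_gene <- unique(diff_gene)
print(diff_gene)
```

Note that match only returns the first hit for each probe, so a probe that maps to several genes contributes just one of them.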

Then I run GO and KEGG enrichment analyses.

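# Recent Ensembl releases name this attribute "entrezgene_id"; older ones used "entrezgene"
gene_entrez <- getBM(attributes = c("hgnc_symbol", "entrezgene_id"), filters = "hgnc_symbol", values = diff_gene, mart = ensembl)
require(DOSE)
require(clusterProfiler)
require(org.Hs.eg.db)
gene_entrez <- na.omit(gene_entrez)
gene <- as.character(gene_entrez[, 2])
# These calls follow recent clusterProfiler releases; older releases took organism = "human" in both functions
ego <- enrichGO(gene = gene, OrgDb = org.Hs.eg.db, ont = "CC", pvalueCutoff = 0.01, readable = TRUE)
ekk <- enrichKEGG(gene = gene, organism = "hsa", pvalueCutoff = 0.01)
write.csv(as.data.frame(ekk), "KEGG-enrich.csv", row.names = FALSE)
write.csv(as.data.frame(ego), "GO-enrich.csv", row.names = FALSE)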

I can't be bothered to upload the figures; you can reproduce the whole workflow yourself with the same code.
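If you want to sanity-check an enrichment result by hand, the statistic behind this kind of over-representation analysis is a hypergeometric test. Here is a base R sketch with invented counts for a single gene set. It illustrates the statistic only and does not reproduce what clusterProfiler does internally, which also involves gene set size filters and its own choice of background.

```r
# Invented counts for one gene set.
N <- 12000  # annotated background genes
M <- 150    # background genes that belong to the gene set
n <- 600    # genes in the differential list
k <- 20     # differential genes that belong to the gene set

# Probability of seeing k or more hits when drawing n genes without replacement.
p_enrich <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)
print(p_enrich)

# With many gene sets, adjust the p-values for multiple testing
# (set_b and set_c are made-up values for illustration).
p_values <- c(set_a = p_enrich, set_b = 0.2, set_c = 0.04)
p_adjusted <- p.adjust(p_values, method = "BH")
print(p_adjusted)
```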