单细胞差异分析之pseudobulk的3种实现方法

之前分享了:单细胞层面的表达量差异分析到底如何做,提到了pseudobulks方法,因为找各个单细胞亚群特异性高表达量基因(FindAllMarkers函数)以及两个亚群针对性差异分析(FindMarkers函数)都不符合需求,所以才有pseudobulks的流行。之前我们在《单细胞天地》公众号分享过一个文献 ,解读在:https://cloud.tencent.com/developer/article/1901064

里面提到的目前主流的单细胞差异分析方法都是Wilcoxon rank−sum test,但是它其实表现还不如pseudobulks 的方法。。。

所以有必要从代码角度看看单细胞差异分析之pseudobulk的3种实现方法。

首先是rowSums方法

这个是非常容易理解的,我在之前分享了:单细胞层面的表达量差异分析到底如何做,也是这样举例:

前面的 compSce是一个seurat对象 ,它里面的comp是表型是两个分组,然后mice是表型是十几个小鼠。也就是说十几个小鼠各自的单细胞转录组样品是两分组,需要做差异分析。我实际上是创造了一个do.call( cbind,lapply 的复杂语法,熟悉这些函数的小伙伴就容易理解。

bs = split(colnames(compSce),compSce$mice)
names(bs)
ct = do.call(
 cbind,lapply(names(bs), function(x){ 
 # x=names(bs)[[1]]
 kp =colnames(compSce) %in% bs[[x]]
 rowSums( as.matrix( compSce@assays$RNA@counts[, kp] ))
 })
)

phe = unique(compSce@meta.data[,c('mice','comp')])
phe

# 每次都要检测数据 
group_list = phe[match(names(bs),phe$mice),'comp']
table(group_list) 
exprSet = ct
exprSet[1:4,1:4] 
dim(exprSet) 
exprSet=exprSet[apply(exprSet,1, function(x) sum(x>1) > 1),]
dim(exprSet) 
table(group_list)
save(exprSet,group_list,file = 'input_for_deg.Rdata')

load(file = 'input_for_deg.Rdata')

有了表达量矩阵以及分组信息,后面就是常规的转录组差异分析啦。

其实 https://jef.works/blog/2020/04/06/quickly-creating-pseudobulks/ 也是提出来了类似的代码实现,居然跟我说一模一样的!!!

## pseudobulk gene expression per cell-type
getPseudobulk <- function(mat, celltype) {
 mat.summary <- do.call(cbind, lapply(levels(celltype), function(ct) {
 cells <- names(celltype)[celltype==ct]
 pseudobulk <- rowSums(mat[, cells])
 return(pseudobulk)
 }))
 colnames(mat.summary) <- levels(celltype)
 return(mat.summary)
}

## test runtime
start_time1 <- Sys.time()
## call function
mat.summary <- getPseudobulk(mat.sparse, celltype)
end_time1 <- Sys.time()

## take a look
dim(mat.summary)

然后 https://github.com/neurorestore/DE-analysis/blob/master/R/functions/run_DE.R 就是使用了另外一套语法体系:

# process data into gene X replicate X cell_type matrices 
 mm = model.matrix(~ 0 + replicate:label, data = meta_sub)
 mat_mm = GetAssayData(sc_sub, slot = 'counts') %*% mm
 keep_genes = rowSums(mat_mm > 0) > 0
 mat_mm = mat_mm[keep_genes, ] %>% as.data.frame()
 mat_mm %<>% as.data.frame()
 colnames(mat_mm) = gsub("replicate|label", "", colnames(mat_mm))
 # drop empty columns
 keep_samples = colSums(mat_mm) > 0
 mat_mm %<>% extract(, keep_samples)
 return(mat_mm)
 }) %>%
 setNames(keep)
 # drop NAs
 pseudobulks %<>% extract(!is.na(.))

这个代码实在是太复杂了,我仅仅是节选部分给大家看看,因为它考虑到的各种因素非常多,但是本质上还是表达量矩阵的提取和加和,是rowSums方法。。。

scater包的aggregateAcrossCells函数

我在文章看到了scater包的aggregateAcrossCells函数也可以做类似的:《Transcriptional changes in the mammary gland during lactation revealed by single cell sequencing of cells from human milk》,链接 https://www.nature.com/articles/s41467-021-27895-0

  • DEGs were identified between subsetted groups by first generating pseudobulk samples using “aggregateAcrossCells” function in the scater package.
  • edgeR version 3.34.3 was used to compute DEGs between groups by filtering and scaling sample library size using the “filterByExpr” and “calcNormFactors” functions.
  • Next the common, trended and tagwise negative binomial dispersions of the genes were calculated using “estimateDisp”.
  • Quasi-likelihood negative binomial generalized log-linear models were fitted using “glmQLFit” and “glmQLFTest”.
  • FDR corrections were applied to the resulting p values using the Benjamini–Hochberg method.
  • To visualize the DEGs, volcano plots were generated using the EnhancedVolcano package the FDR corrected p value cut off FDR < 1 × 10−8.

我还没有测试scater包的aggregateAcrossCells函数,因为我比较喜欢自己的代码。

Matrix.utils包的aggregate.Matrix函数

这个是出自于 https://hbctraining.github.io/scRNA-seq/lessons/pseudobulk_DESeq2_scrnaseq.html

如下所示:

image-20231112152410668

 

Comments are closed.