Exploring 10X Genomics single-cell datasets

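The official site publishes a large number of datasets: https://support.10xgenomics.com/single-cell-gene-expression/datasets To download them, you only need to enter an email address.

Each dataset comes with the raw FASTQ data, the aligned BAM file, the quantified expression matrix, and the clustering results, all produced by 10X Genomics' own bioinformatics pipeline: Single Cell Gene Expression Dataset by Cell Ranger 2.1.0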

1k Brain Cells from an E18 Mouse

Cells from a combined cortex, hippocampus and sub ventricular zone of an E18 mouse.

  • 931 cells detected
  • Sequenced on Illumina HiSeq2500 with approximately 56,000 reads per cell
  • 26bp read1 (16bp Chromium barcode and 10bp UMI), 98bp read2 (transcript), and 8bp I7 sample barcode
  • Analysis run with --cells=2000

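The raw data is very large, so I chose the dataset above, with close to 1,000 mouse brain cells, to test the Cell Ranger 2.1.0 pipeline.

Downloading the processed results directly

Because the other datasets contain many cells, even the bare expression matrices are large. I downloaded several to explore, listed below: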

├── [ 76M] neuron_9k_analysis.tar
├── [112M] neuron_9k_cloupe.cloupe
├── [284M] neuron_9k_filtered_gene_bc_matrices.tar
├── [507M] neuron_9k_raw_gene_bc_matrices.tar
├── [ 67M] pbmc4k_analysis.tar
├── [ 35M] pbmc4k_cloupe.cloupe
├── [ 69M] pbmc4k_filtered_gene_bc_matrices.tar
├── [133M] pbmc4k_raw_gene_bc_matrices.tar
├── [ 78M] pbmc8k_analysis.tar
├── [ 63M] pbmc8k_cloupe.cloupe
├── [143M] pbmc8k_filtered_gene_bc_matrices.tar
├── [253M] pbmc8k_raw_gene_bc_matrices.tar
├── [ 58M] t_4k_analysis.tar
├── [ 29M] t_4k_cloupe.cloupe
├── [ 60M] t_4k_filtered_gene_bc_matrices.tar
└── [131M] t_4k_raw_gene_bc_matrices.tar

Downloading the raw FASTQ sequencing data

Here I again download 1k Brain Cells from an E18 Mouse, the smallest dataset, for testing:

├── [237M] neurons_900_S1_L001_I1_001.fastq.gz
├── [642M] neurons_900_S1_L001_R1_001.fastq.gz
├── [1.8G] neurons_900_S1_L001_R2_001.fastq.gz
├── [238M] neurons_900_S1_L002_I1_001.fastq.gz
├── [646M] neurons_900_S1_L002_R1_001.fastq.gz
└── [1.8G] neurons_900_S1_L002_R2_001.fastq.gz

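Note that the two reads differ in size and that each lane has three files. This follows from the read layout: read1 is 26bp (16bp Chromium barcode plus 10bp UMI), read2 is 98bp (transcript), and the I1 file holds the 8bp I7 sample index. Only the read2 FASTQ contains actual transcript sequence; the other two files hold barcodes only.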
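The read1 layout described above can be made concrete with a short sketch. It is written in Python purely for illustration and is not how Cell Ranger itself is implemented; the function names `split_read1` and `count_reads_per_barcode` are hypothetical, and the 16bp and 10bp lengths are taken from the dataset description. No barcode whitelist matching or error correction is attempted.

```python
import gzip
from collections import Counter

BARCODE_LEN = 16  # Chromium cell barcode length, from the dataset description
UMI_LEN = 10      # UMI length, from the dataset description


def split_read1(seq):
    """Split a read1 sequence into (cell barcode, UMI)."""
    if len(seq) < BARCODE_LEN + UMI_LEN:
        raise ValueError("read1 is shorter than barcode + UMI")
    return seq[:BARCODE_LEN], seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]


def count_reads_per_barcode(r1_path):
    """Count reads per raw cell barcode in a gzipped read1 FASTQ file."""
    counts = Counter()
    with gzip.open(r1_path, "rt") as handle:
        for line_no, line in enumerate(handle):
            # In a 4-line FASTQ record, the sequence is the second line.
            if line_no % 4 == 1:
                barcode, _umi = split_read1(line.strip())
                counts[barcode] += 1
    return counts
```

Calling `count_reads_per_barcode("neurons_900_S1_L001_R1_001.fastq.gz").most_common(5)` would list the five raw barcodes with the most reads in lane 1.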

Alignment and quantification

The analysis can be run directly with Cell Ranger, using the following command:

/home/jianmingzeng/biosoft/10xgenomic/cellranger-2.1.0/cellranger count --id=neurons \
--localcores 5 \
--transcriptome=/home/jianmingzeng/biosoft/10xgenomic/db/refdata-cellranger-mm10-1.2.0 \
--fastqs=/home/jianmingzeng/data/public/10x/neurons_900_fastqs \
--expect-cells=900

The resulting output is as follows:

├── [ 18M] cloupe.cloupe
├── [  17] filtered_gene_bc_matrices
│   └── [  58] mm10
│       ├── [ 15K] barcodes.tsv
│       ├── [723K] genes.tsv
│       └── [ 29M] matrix.mtx
├── [4.1M] filtered_gene_bc_matrices_h5.h5
├── [ 680] metrics_summary.csv
├── [ 96M] molecule_info.h5
├── [5.4G] possorted_genome_bam.bam
├── [3.5M] possorted_genome_bam.bam.bai
├── [  17] raw_gene_bc_matrices
│   └── [  58] mm10
│       ├── [ 13M] barcodes.tsv
│       ├── [723K] genes.tsv
│       └── [ 70M] matrix.mtx
├── [ 10M] raw_gene_bc_matrices_h5.h5
└── [3.2M] web_summary.html

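The analysis folder contains many files, so it is not listed here. The only output that takes up much space is the aligned BAM file; everything else is small enough to download to a local computer and inspect.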
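The matrix.mtx files in the listing are Matrix Market coordinate files: a header line, optional comment lines starting with `%`, one line giving the number of rows, columns, and stored entries, and then one `row column value` line per non-zero count, with 1-based indices. Rows correspond to the lines of genes.tsv and columns to the lines of barcodes.tsv. As a rough sketch of that format, here is a small standard-library Python parser; the function name `read_mtx_triplets` is hypothetical, and it assumes integer counts.

```python
def read_mtx_triplets(lines):
    """Parse Matrix Market coordinate text.

    Returns (n_rows, n_cols, triplets), where each triplet is
    (row, col, value) using the 1-based indices found in the file.
    """
    lines = iter(lines)
    header = next(lines)
    if not header.lower().startswith("%%matrixmarket matrix coordinate"):
        raise ValueError("not a Matrix Market coordinate file")

    # Skip comment and blank lines until the size line is found.
    size_line = None
    for line in lines:
        if line.startswith("%") or not line.strip():
            continue
        size_line = line
        break
    if size_line is None:
        raise ValueError("missing size line")
    n_rows, n_cols, n_entries = (int(x) for x in size_line.split())

    triplets = []
    for line in lines:
        if not line.strip():
            continue
        row, col, value = line.split()
        triplets.append((int(row), int(col), int(value)))
    if len(triplets) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, triplets
```

With a real file this could be used as `with open("matrix.mtx") as fh: n_genes, n_cells, entries = read_mtx_triplets(fh)`. For the 29M filtered matrix a plain list of tuples is workable, but a sparse matrix library would be the more practical choice.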

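The most important output is the expression matrix under the filtered_gene_bc_matrices folder, which can be read directly by the R package Seurat for downstream processing: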

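library(Seurat)
library(dplyr)
library(Matrix)
# Read the filtered gene-barcode matrix written by cellranger count (a sparse matrix)
neurons.data <- Read10X(data.dir = "~/outs/filtered_gene_bc_matrices/mm10/")
# Examine the memory savings between regular and sparse matrices
dense.size <- object.size(x = as.matrix(x = neurons.data))
dense.size
sparse.size <- object.size(x = neurons.data)
sparse.size
dense.size / sparse.size
# Keep genes detected in at least 3 cells and cells with at least 200 detected genes
# (raw.data and min.genes are argument names from the Seurat 2.x API)
neurons <- CreateSeuratObject(raw.data = neurons.data, min.cells = 3, min.genes = 200,
                              project = "10X_neurons")
neurons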
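To make the effect of the min.cells and min.genes thresholds easier to see, here is a small Python sketch of that kind of filtering on sparse (gene, cell, count) triplets. It is an illustration only, not Seurat's code: the function name `filter_counts` is hypothetical, and both the order of the two steps (cells first, then genes) and the use of `>=` are assumptions of this sketch.

```python
from collections import defaultdict


def filter_counts(triplets, min_cells=3, min_genes=200):
    """Filter sparse (gene, cell, count) triplets.

    Keeps cells with at least `min_genes` detected genes, then keeps
    genes detected in at least `min_cells` of the remaining cells.
    """
    # Step 1: count detected genes per cell.
    genes_per_cell = defaultdict(set)
    for gene, cell, count in triplets:
        if count > 0:
            genes_per_cell[cell].add(gene)
    kept_cells = {c for c, genes in genes_per_cell.items()
                  if len(genes) >= min_genes}

    # Step 2: count, per gene, the kept cells in which it is detected.
    cells_per_gene = defaultdict(set)
    for gene, cell, count in triplets:
        if count > 0 and cell in kept_cells:
            cells_per_gene[gene].add(cell)
    kept_genes = {g for g, cells in cells_per_gene.items()
                  if len(cells) >= min_cells}

    return [t for t in triplets
            if t[0] in kept_genes and t[1] in kept_cells]
```

On a toy input with much smaller thresholds, for example `filter_counts(data, min_cells=2, min_genes=2)`, a cell expressing a single gene is dropped, along with any gene seen only in that cell.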

For the full notes, see: Seurat, one of the three major R packages for single-cell transcriptomics

 
