Inferring CNV from single-cell transcriptome data

ulwvfje — Sat, 17 Feb 2018 10:17:02 +0000

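A series of papers, all from Aviv Regev's lab, have used single-cell transcriptome data to infer CNV.

The 2014 Science paper on GBM

First is the 2014 Science paper on GBM (PMID: 24925914), which introduced this analysis and then used the CCLE database to check its reliability.

The paper's own single-cell transcriptome libraries were built with the SMART-seq method, and the data are available as GSE57872: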

430(576) single glioblastoma cells isolated from 5 individual tumors
102(192) single cells from gliomasphere cells lines

This single-cell library-prep approach is somewhat dated by now:

SMART-seq protocol was implemented to generate single cell full length transcriptomes (modified from Shalek, et al Nature 2013) and sequenced using 25 bp paired end reads. Single cell cDNA libraries for MGH30 were resequenced using 100 bp paired end reads to allow for isoform and splice junction reconstruction (96 samples, annotated MGH30L).

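The authors therefore filtered quite strictly. You can download their processed expression matrix directly, or download the raw sequencing data and run the transcriptome pipeline yourself.

The formula, as first proposed, was as follows: [formula image not preserved]

The 2016 Science paper on melanoma

Next is the 2016 Science paper on melanoma (PMID: 27124452), which also inferred CNV from single-cell transcriptome data. Its data are available as GSE72056. This time the libraries were built with Smart-seq2, for 4645 cells in total. The expression matrix alone is 71 Mb, but the raw sequencing data are in dbGaP and can only be downloaded after applying for access.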

Supplementary file	Size	Download	File type/resource
GSE72056_melanoma_single_cell_revised_v2.txt.gz	71.6 Mb	(ftp)(http)	TXT

we applied single-cell RNA sequencing (RNA-seq) to 4645 single cells isolated from 19 patients, profiling malignant, immune, stromal, and endothelial cells.

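Note that the authors also did bulk RNA-seq on 6 cases before and after treatment with RAF or RAF+MEK inhibitors, 12 samples in total, available as GSE77940.

The formula changed slightly this time: [formula image not preserved]

The 2017 Cell paper on head and neck cancer

Then comes the paper on head and neck cancer published in Cell in 2017: Single-Cell Transcriptomic Analysis of Primary and Metastatic Tumor Ecosystems in Head and Neck Cancer. The sequencing was as follows: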

We profiled transcriptomes of ∼6,000 single cells from 18 head and neck squamous cell carcinoma (HNSCC) patients, including five matched pairs of primary tumors and lymph node metastases.

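The same patients also have whole-exome sequencing (WES) and targeted genotyping (SNaPshot) data, but those are deposited under phs001474.v1.p1 and are not easy to download.

The single-cell libraries were built with Smart-seq2, and all the data are available as GSE103322. The expression matrix alone is close to 100 Mb.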

GSE103322_HNSCC_all_data.txt.gz | 86.0 Mb |

Download links: (ftp)(http)

Validation with CCLE data

The 2014 Science paper on GBM (PMID: 24925914) says:

We downloaded the CCLE gene-centric RMA-normalized Affymetrix data (http://www.broadinstitute.org/ccle/), and centered the expression of each gene across all cell lines at zero.

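A simple registration is required before downloading: https://portals.broadinstitute.org/ccle/users/sign_in

In theory you should be able to reproduce the figure below:

[Figure: highly correlated CNV by SNP6 array and RNA-seq] http://www.bio-info-trainee.com/wp-content/uploads/2018/02/highly-correlated-CNV-by-SNP6array-and-RNA-seq.png

It shows that CNV inferred from transcriptome data differs little from the SNP6.0 array results.

There is also a validation against the GTEx database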

To compare these patterns to an external reference of normal cells we downloaded RNA-Seq data from the GTEX portal (http://www.gtexportal.org/; gene read counts file from Jan. 2013), and estimated CNV values as above: we normalized the read counts into log2(TPM+1), averaged all brain samples, restricted the data to the ~6,000 analyzed genes, subtracted for each gene the average normalized expression from the GBM single-cell data (this step is comparable to the centering of the single cell data) and then used a moving average of 100 genes over the genomically-ordered list of genes to define CNV-cont.
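The moving-average step described in the quote above can be sketched with a little awk. This is only an illustration under my own assumptions, not the authors' code: the function name `cnv_moving_average`, the input layout (tab-separated, no header, one row per gene already in genomic order, values already centred) and the choice to print only positions that have a full window are all hypothetical.

```shell
#!/usr/bin/env bash
# cnv_moving_average FILE [HALF]   (hypothetical helper, not from the papers)
# FILE: tab-separated, no header; column 1 = gene name, remaining columns =
#       centred log2 expression, one column per cell. Rows are assumed to be
#       sorted by genomic position already.
# HALF: number of genes on each side of the centre gene
#       (default 50, i.e. a window of 101 genes).
cnv_moving_average() {
    local file=$1 half=${2:-50}
    awk -F'\t' -v OFS='\t' -v h="$half" '
        { gene[NR] = $1; nf = NF
          for (c = 2; c <= NF; c++) val[NR, c] = $c }   # keep the whole matrix in memory
        END {
            w = 2 * h + 1
            # only positions with a complete window are printed (an assumption)
            for (i = h + 1; i <= NR - h; i++) {
                line = gene[i]
                for (c = 2; c <= nf; c++) {
                    s = 0
                    for (j = i - h; j <= i + h; j++) s += val[j, c]
                    line = line OFS sprintf("%.4f", s / w)
                }
                print line
            }
        }' "$file"
}
```

Usage would be something like `cnv_moving_average centred_matrix.tsv 50 > cnv.tsv`, where `centred_matrix.tsv` is a made-up file name. Holding the whole matrix in memory is fine for a sketch on a few thousand genes, but a real analysis would more likely be done in R.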

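Summary

All of the papers above provide downloadable expression matrices, so the formulas published in their supplementary materials are enough to reproduce the whole workflow.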

An RNA-seq hands-on walkthrough: super simple, done in 2 hours!

ulwvfje — Fri, 30 Dec 2016 08:38:33 +0000

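Please don't copy my code directly. Understand it, type it out yourself, and think about why I wrote it this way.

Use the latest versions of the software, especially tools such as samtools that I call from my PATH. Because there are many readers, for most other tools I include the version in the path.

It took me two hours, but that doesn't mean you will learn it in two hours. Some readers told me it took them two weeks, which is perfectly normal. Don't expect to reach my level in two hours.

If all you want from a transcriptome is expression levels, it really is very simple. The authors sequenced single-end 50 bp reads anyway, and data like that is only good for measuring expression.

First, the result of the authors' analysis: [figure not preserved]

The data are in GEO at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50177

The RNA-seq data we need to download:

https://www.ncbi.nlm.nih.gov/sra/?term=SRP029245

https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP029245

ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP029/SRP029245

The download addresses are easy to work out: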

for ((i=677;i<=680;i++)) ;do wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP029/SRP029245/SRR957$i/SRR957$i.sra;done

ls *sra |while read id; do ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --split-3 $id;done

I checked the data quality with FastQC and found no problems. The code:

ls *fastq |xargs ~/biosoft/fastqc/FastQC/fastqc -t 10

So I aligned the FASTQ files directly to the hg19 reference genome with hisat2:

reference=/home/jianmingzeng/reference/index/hisat/hg19/genome

~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR957677.fastq -S control_1.sam 2>control_1.log

~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR957678.fastq -S control_2.sam 2>control_2.log

~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR957679.fastq -S siSUZ12_1.sam 2>siSUZ12_1.log

~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR957680.fastq -S siSUZ12_2.sam 2>siSUZ12_2.log

The logs show that the alignment went very well:

93.10% overall alignment rate
92.44% overall alignment rate
92.36% overall alignment rate
93.22% overall alignment rate
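Rather than opening each log by hand, the four rates above can be pulled out in one go. A minimal sketch, assuming the logs are named `<sample>.log` as in the hisat2 commands earlier and that the summary line ends with "overall alignment rate"; the function name `alignment_rates` is my own.

```shell
#!/usr/bin/env bash
# alignment_rates LOG...   (hypothetical helper)
# Prints "<sample><TAB><overall alignment rate>" for each HISAT2 log given.
# Assumes the log contains a line of the form "93.10% overall alignment rate".
alignment_rates() {
    local log
    for log in "$@"; do
        awk -v name="$(basename "$log" .log)" \
            '/overall alignment rate/ { print name "\t" $1 }' "$log"
    done
}
```

For the four samples here that would be `alignment_rates control_*.log siSUZ12_*.log`.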

Then sort the SAM files by read name and convert them to BAM to save space:

ls *sam |while read id;do (nohup samtools sort -n -@ 5 -o ${id%%.*}.Nsort.bam $id &);done

Finally, use htseq-count to quantify gene expression for each sample:

ls *.Nsort.bam |while read id;do (nohup samtools view $id | ~/.local/bin/htseq-count -f sam -s no -i gene_name - ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf 1>${id%%.*}.geneCounts 2>${id%%.*}.HTseq.log&);done

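The resulting files: [figure not preserved]

The gene count data for these 4 samples can then be fed to any of a number of R packages for differential analysis, including limma's voom, DESeq2, edgeR and so on. Tutorials for these packages are everywhere, so I won't go into them.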
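Those R packages all want a single gene-by-sample count table, so the per-sample files have to be merged first. A minimal sketch, assuming each file is two tab-separated columns (gene, count) named `<sample>.geneCounts` as produced above, and that the htseq-count summary rows start with `__`. The function name `merge_counts` is hypothetical.

```shell
#!/usr/bin/env bash
# merge_counts FILE...   (hypothetical helper)
# Merges htseq-count outputs into one tab-separated matrix on stdout.
# Column names are the file names with any directory and ".geneCounts" removed.
merge_counts() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 { nfile++; name = FILENAME
                   sub(/.*\//, "", name); sub(/\.geneCounts$/, "", name)
                   sample[nfile] = name }
        /^__/ { next }                       # skip summary rows such as __no_feature
        { if (!($1 in seen)) { seen[$1] = 1; order[++ngene] = $1 }
          count[$1, nfile] = $2 }
        END {
            line = "gene"
            for (s = 1; s <= nfile; s++) line = line OFS sample[s]
            print line
            for (g = 1; g <= ngene; g++) {
                line = order[g]
                for (s = 1; s <= nfile; s++) {
                    # a gene missing from one file is written as 0
                    v = ((order[g], s) in count) ? count[order[g], s] : 0
                    line = line OFS v
                }
                print line
            }
        }' "$@"
}
```

For example, `merge_counts control_1.geneCounts control_2.geneCounts siSUZ12_1.geneCounts siSUZ12_2.geneCounts > counts.tsv`, where `counts.tsv` is a made-up output name; the table can then be read into R with `read.delim`.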

Once the differential analysis is done, compare it with the authors' results to check whether yours is correct.

hisat2+stringtie+ballgown

ulwvfje — Fri, 25 Nov 2016 15:06:23 +0000

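Back in September last year I wrote a post saying the RNA-seq pipeline needed to evolve (http://www.bio-info-trainee.com/1022.html), mainly into the hisat2+stringtie+ballgown pipeline. I never covered that pipeline systematically, because I honestly don't find it useful; I only use hisat from it, for alignment. But people in my group keep asking, so I'll reluctantly write a tutorial. You can copy my code directly to install the software, then find some test data yourself. My scripts are easy to use!

I really like papers like this one: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html The authors even provide all of the code, so I don't see why people still have questions: http://www.nature.com/nprot/journal/v11/n9/extref/nprot.2016.095-S1.zip

They have already laid out the pipeline clearly, so I'll just add my own take:

Software installation: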

## Download and install HISAT

# https://ccb.jhu.edu/software/hisat2/index.shtml

cd ~/biosoft

mkdir HISAT && cd HISAT

#### readme: https://ccb.jhu.edu/software/hisat2/manual.shtml

wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.0.4-Linux_x86_64.zip

unzip hisat2-2.0.4-Linux_x86_64.zip

ln -s hisat2-2.0.4 current

## ~/biosoft/HISAT/current/hisat2-build

## ~/biosoft/HISAT/current/hisat2

## Download and install StringTie

## https://ccb.jhu.edu/software/stringtie/ ## https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual

cd ~/biosoft

mkdir StringTie && cd StringTie

wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-1.2.3.Linux_x86_64.tar.gz

tar zxvf stringtie-1.2.3.Linux_x86_64.tar.gz

ln -s stringtie-1.2.3.Linux_x86_64 current

# ~/biosoft/StringTie/current/stringtie

For running the software I prefer shell scripts, and simple ones at that:

while read id

do

sample=$(echo $id |cut -d" " -f 1 )

file1=$(echo $id |cut -d" " -f 2 )

file2=$(echo $id |cut -d" " -f 3 )

echo $sample

echo $file1

echo $file2

~/biosoft/HISAT/current/hisat2 -p 4 --dta -x ~/reference/index/hisat/hg19/genome -1 $file1 -2 $file2 -S $sample.hisat2.hg19.sam 2>$sample.hisat2.hg19.log &

done <$1

The script above takes a three-column input file (sample name, read 1 file, read 2 file, separated by single spaces) and produces one SAM file per sample.
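The three-column file does not have to be typed by hand. A minimal sketch that builds it from a directory of paired FASTQ files, assuming they are named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`; that naming scheme and the function name `make_sample_sheet` are my assumptions, so adjust the suffixes to your own data.

```shell
#!/usr/bin/env bash
# make_sample_sheet DIR   (hypothetical helper)
# Prints "sample read1 read2" (space-separated) for every pair found in DIR.
# Assumes files are named <sample>_1.fastq.gz / <sample>_2.fastq.gz and that
# paths contain no spaces, since the alignment script splits on single spaces.
make_sample_sheet() {
    local dir=$1 r1 r2 sample
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue             # no match: the glob stays literal
        r2=${r1%_1.fastq.gz}_2.fastq.gz
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2  # report and skip unpaired files
            continue
        fi
        sample=$(basename "$r1" _1.fastq.gz)
        echo "$sample $r1 $r2"
    done
}
```

Something like `make_sample_sheet ~/data/fastq > config.txt` (both names made up) gives a file that can be passed to the alignment script as `$1`.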

while read id

do

file=$(basename $id )

sample=${file%%.*}

echo $id $sample

nohup samtools sort -@ 4 -o ${sample}.sorted.bam $id &

done <$1

Recent versions of samtools can sort a SAM file straight into a sorted BAM file.

while read id

do

file=$(basename $id )

sample=${file%%.*}

echo $id $sample

nohup ~/biosoft/StringTie/current/stringtie -p 4 -G ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf -o $sample.hg19.stringtie.gtf -l $sample $id &

done <$1

That is all there is to using StringTie; nothing more to explain.

~/biosoft/StringTie/current/stringtie --merge -p 8 -G ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf -o stringtie_merged.gtf mergelist.txt
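The `mergelist.txt` passed to `stringtie --merge` is just a text file with one per-sample GTF path per line. A minimal sketch for generating it, assuming the per-sample assemblies carry the `.hg19.stringtie.gtf` suffix used in the loop above; the function name `make_mergelist` is hypothetical.

```shell
#!/usr/bin/env bash
# make_mergelist DIR [OUT]   (hypothetical helper)
# Writes the path of every *.hg19.stringtie.gtf in DIR to OUT, one per line.
# Returns non-zero if no GTF file was found.
make_mergelist() {
    local dir=$1 out=${2:-mergelist.txt} gtf
    : > "$out"                               # start from an empty list
    for gtf in "$dir"/*.hg19.stringtie.gtf; do
        if [ -e "$gtf" ]; then
            echo "$gtf" >> "$out"
        fi
    done
    [ -s "$out" ]                            # succeed only if the list is non-empty
}
```

Running `make_mergelist .` in the directory holding the assemblies would produce the `mergelist.txt` used above.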

while read id

do

file=$(basename $id )

sample=${file%%.*}

echo $id $sample

nohup ~/biosoft/StringTie/current/stringtie -e -B -G $2 -o ballgown/$sample/$sample.hg19.stringtie.gtf $id &

done <$1

I really can't go on, because I honestly don't use this. Once I have the SAM/BAM files I go straight to building the count matrix, and counting reads is very easy. The code:

nohup samtools view A.sorted.bam.Nsort.bam | ~/.local/bin/htseq-count -f sam -s no -i gene_name - ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf 1>A.geneCounts 2>A.HTseq.log &

Import the resulting files into R and process them with ballgown, and please don't ask me about this again.

The difference between htseq-count and bedtools

ulwvfje — Tue, 15 Nov 2016 03:55:21 +0000

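I have written tutorials on both bedtools and htseq-count before. Both can count reads from an aligned BAM file. Someone in my group asked how they differ, so I made a quick comparison. You may want to read my earlier tutorials first, rough as they are:

Gene counting for RNA-seq with Bedtools

Counting gene expression with HTSeq for transcriptome data

Back to the point. I don't have the energy to dig into how they work internally; I only looked at how each decides whether a read belongs to a gene. See the figure below: [figure not preserved]

Clearly, bedtools counts a read for a gene whenever the read's aligned coordinates overlap the gene's coordinates at all, regardless of whether the read is multi-mapped.

htseq-count is much more cautious, and it lets you choose the overlap mode. In general it puts multi-mapped reads into the not-uniquely-aligned category.

Also, always double-check after the analysis. If hisat reports a mapping rate above 90%, then even after removing roughly 15% multi-mapped reads, at least 50-60% of the reads should end up counted.

If you see a very large number of no_feature reads, you should realise that something is wrong.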
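That sanity check is easy to script. A minimal sketch, assuming the htseq-count output ends with the summary rows `__no_feature`, `__ambiguous`, `__too_low_aQual`, `__not_aligned` and `__alignment_not_unique` (some older HTSeq versions write these without the `__` prefix, which this sketch does not handle). The function name `htseq_summary` is my own, and the percentages are only a rough guide.

```shell
#!/usr/bin/env bash
# htseq_summary FILE   (hypothetical helper)
# Prints how many reads were assigned to genes and how many fell into each
# htseq-count summary category, with a percentage of the grand total.
htseq_summary() {
    awk -F'\t' '
        /^__/ { special[$1] = $2; total += $2; next }   # summary rows
        { assigned += $2; total += $2 }                 # ordinary gene rows
        END {
            if (total == 0) { print "no reads counted"; exit 1 }
            printf "assigned\t%d\t%.1f%%\n", assigned, 100 * assigned / total
            n = split("__no_feature __ambiguous __too_low_aQual __not_aligned __alignment_not_unique", keys, " ")
            for (i = 1; i <= n; i++)
                printf "%s\t%d\t%.1f%%\n", keys[i], special[keys[i]], 100 * special[keys[i]] / total
        }' "$1"
}
```

Running `htseq_summary control.geneCounts` and finding `__no_feature` as the largest row is a strong hint that the strand setting or the chromosome naming is wrong.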

Finally, a few htseq-count parameters need particular attention:

Official documentation: http://www-huber.embl.de/HTSeq/doc/count.html

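The reference GTF can come from GENCODE or Ensembl, but pay particular attention to the chr prefix on chromosome names and to the annotation version; GTF versus GFF format does not matter. The alignment file must be sorted, and I strongly recommend sort -n, that is, sorting by read name.

-f sam/bam: be clear about which one you pass. To count directly from a BAM file, the Python on your server must have a working pysam module installed.

-r name/pos: BAM files are usually sorted by reference position, but the default here is read name, which is a trap. I recommend re-sorting the BAM by name rather than using -r pos, because -r pos uses far too much memory.

-s yes/no/reverse: another big trap. The default is yes, but most people's data are unstranded (no), so be careful!

-t: selects the feature type in column 3 of the GFF/GTF file, usually exon; gene or transcript also work. This is rarely changed.

-i: this needs changing, otherwise the default is the Ensembl gene ID, which most people can't read. You can set it to gene_name, provided your GFF file has a gene_name attribute.

Nothing else needs changing.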
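The chr-prefix problem mentioned at the top of this list can be checked before counting. A minimal sketch that lists the sequence names used in a GTF that do not occur in a list of reference names; the function name `gtf_names_missing` is hypothetical, and the name list is assumed to be one name per line, for example taken from column 1 of `samtools idxstats` on an indexed BAM.

```shell
#!/usr/bin/env bash
# gtf_names_missing GTF NAMES   (hypothetical helper)
# Prints every sequence name (GTF column 1) that is absent from NAMES,
# a plain text file with one reference name per line.
gtf_names_missing() {
    local gtf=$1 names=$2
    awk -F'\t' '
        FILENAME == ARGV[1] { ref[$1] = 1; next }   # first file: reference names
        /^#/ { next }                               # skip GTF comment lines
        !($1 in ref) && !($1 in reported) { reported[$1] = 1; print $1 }
    ' "$names" "$gtf"
}
```

If the output lists names such as `1`, `2`, `X` while the BAM uses `chr1`, `chr2`, `chrX` (or the other way round), every read will end up as no_feature.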

My code:

nohup samtools view control.Nsort.bam | ~/.local/bin/htseq-count -f sam -s no -i gene_name - ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf 1>control.geneCounts 2>control.HTseq.log &

nohup samtools view G34V.Nsort.bam | ~/.local/bin/htseq-count -f sam -s no -i gene_name - ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf 1>G34V.geneCounts 2>G34V.HTseq.log &

nohup samtools view K27M.Nsort.bam | ~/.local/bin/htseq-count -f sam -s no -i gene_name - ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf 1>K27M.geneCounts 2>K27M.HTseq.log &

nohup samtools view WT.Nsort.bam | ~/.local/bin/htseq-count -f sam -s no -i gene_name - ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf 1>WT.geneCounts 2>WT.HTseq.log &

Using samtools idxstats to compute expression for de novo transcriptome data

ulwvfje — Mon, 31 Oct 2016 09:16:48 +0000

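For de novo transcriptome data, the reference for alignment is usually your own assembled trinity.fasta (transcript sequences selected for the longest protein), and the raw reads are aligned directly with a tool such as bowtie2. The SAM/BAM file therefore already contains the ID of every reference transcript, and counting directly on the SAM/BAM file gives the expression of each gene.

I was going to write my own script to count from the SAM file, but found that samtools already ships such a tool: http://www.htslib.org/doc/samtools.html

Against a genome this feature is of little use, but against transcript sequences the numbers it reports are exactly the per-transcript expression we want.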

samtools idxstats tmp.bowtie2.sorted.bam |head
TR3|c0_g1_i1 1276 418 0
TR6|c0_g1_i1 1271 10 0
TR6|c0_g1_i2 944 5 0
TR6|c0_g1_i3 1281 4 0
TR6|c0_g1_i4 1224 53 0
TR6|c0_g1_i5 855 16 0
TR19|c0_g1_i2 1428 19 0
TR19|c0_g1_i3 2536 624 0
TR19|c0_g1_i4 3072 105 0
TR19|c0_g1_i5 1685 0 0

The official documentation is clear:

samtools idxstats in.sam|in.bam|in.cram

Retrieve and print stats in the index file corresponding to the input file. Before calling idxstats, the input BAM file must be indexed by samtools index.

The output is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads. It is written to stdout.

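The third column is the expression data we want: the number of reads aligned to each transcript.

If you can spot a problem in my transcript IDs, feel free to get in touch; just email me at jmzeng1314@163.com

Now that we have the expression of each transcript, running this on every sample gives the expression matrix, and differential analysis is then straightforward. The result, however, is a list of differential transcripts whose IDs mean nothing on their own; they need annotation before any further analysis.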

ls *sorted.bam |while read id
do
echo $id ${id%%.*}.t.counts
nohup samtools idxstats $id 1>${id%%.*}.t.counts 2>/dev/null &
done
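The `.t.counts` files are per transcript. If gene-level numbers are wanted instead, the isoform counts can be summed per Trinity gene. A minimal sketch, assuming the IDs follow the Trinity convention in which a trailing `_i<N>` marks an isoform of the gene named by the rest of the ID; the function name `idxstats_to_gene_counts` is hypothetical.

```shell
#!/usr/bin/env bash
# idxstats_to_gene_counts FILE   (hypothetical helper)
# FILE: tab-separated output of samtools idxstats
#       (name, length, mapped reads, unmapped reads).
# Sums column 3 over all isoforms of a gene, where the gene ID is the
# transcript ID with its trailing "_i<N>" removed (Trinity naming assumed).
idxstats_to_gene_counts() {
    awk -F'\t' -v OFS='\t' '
        $1 == "*" { next }                   # idxstats row for reads with no reference
        { gene = $1; sub(/_i[0-9]+$/, "", gene)
          if (!(gene in sum)) order[++n] = gene
          sum[gene] += $3 }
        END { for (k = 1; k <= n; k++) print order[k], sum[order[k]] }
    ' "$1"
}
```

Applied to the ten transcripts in the example output above, this gives 418 reads for `TR3|c0_g1`, 88 for `TR6|c0_g1` and 748 for `TR19|c0_g1`.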

The most comprehensive collection of transcriptome analysis software

ulwvfje — Fri, 16 Oct 2015 11:40:49 +0000

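Finding this site was pure luck. People abroad do seem to take this seriously: the software list clearly took its author a great deal of effort and care. From: https://en.wiki2.org/wiki/List_of_RNA-Seq_bioinformatics_tools

The software covers the following 18 areas of transcriptome analysis. Reading the list made me realise how far I still have to go. To me, transcriptome analysis meant differential expression, then some annotation, at most fusion genes, or perhaps a look at miRNA and lncRNA. I had no idea there was so much more to it. No wonder biology is a bottomless pit: however many researchers arrive, there are always research directions to hand out.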

1 Quality control and pre-processing data

1.1 Quality control and filtering data

1.2 Detection of chimeric reads

1.3 Errors Correction

1.4 Pre-processing data

2 Alignment Tools

2.1 Short (Unspliced) aligners

2.2 Spliced aligners

2.2.1 Aligners based on known splice junctions (annotation-guided aligners)

2.2.2 De novo Splice Aligners

2.2.2.1 De novo Splice Aligners that also use annotation optionally

2.2.2.2 Other Spliced Aligners

3 Normalization, Quantitative analysis and Differential Expression

3.1 Multi-tool solutions

4 Workbench (analysis pipeline / integrated solutions)

4.1 Commercial Solutions

4.2 Open (free) Source Solutions

5 Alternative Splicing Analysis

5.1 General Tools

5.2 Intron Retention Analysis

6 Bias Correction

7 Fusion genes/chimeras/translocation finders/structural variations

8 Copy Number Variation identification

9 RNA-Seq simulators

10 Transcriptome assemblers

10.1 Genome-Guided assemblers

10.2 Genome-Independent (de novo) assemblers

10.2.1 Assembly evaluation tools

11 Co-expression networks

12 miRNA prediction

13 Visualization tools

14 Functional, Network & Pathway Analysis Tools

15 Further annotation tools for RNA-Seq data

16 RNA-Seq Databases

17 Webinars and Presentations

18 References

The complete RNA-seq study guide!

ulwvfje — Tue, 05 May 2015 04:57:08 +0000

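This takes about two months! If the cloud-drive links here have expired, contact me directly at 1227278128 or in my group 201161227. All the resources can be found at http://pan.baidu.com/s/1jIvwRD8

A search turns up a great many pipelines. Here I share a few references I found earlier.

Peking University also has lectures on the principles of RNA-seq:

Link: http://pan.baidu.com/s/1kTmWmv9 Password: 6yaz

I even have a BGI training course! It's a five-day course, and I think the materials cost over 5,000 yuan at the time.

Link: http://pan.baidu.com/s/1nt5OV5B Password: gyul

Youku has videos too; search for them yourself.

Then there are a few pipelines, that is, bioinformatics analysis workflows. Even if you know nothing, following a pipeline will get you through.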

export PATH=/share/software/bin:$PATH

bowtie2-build ./data/GRCh37_chr21.fa chr21

tophat -p 1 -G ./data/genes.gtf -o P460.thout chr21 ./data/P460_R1.fq ./data/P460_R2.fq

tophat -p 1 -G ./data/genes.gtf -o C460.thout chr21 ./data/C460_R1.fq ./data/C460_R2.fq

cufflinks -p 1 -o P460.clout P460.thout/accepted_hits.bam

cufflinks -p 1 -o C460.clout C460.thout/accepted_hits.bam

samtools view -h P460.thout/accepted_hits.bam > P460.thout/accepted_hits.sam

samtools view -h C460.thout/accepted_hits.bam > C460.thout/accepted_hits.sam

echo ./P460.clout/transcripts.gtf > assemblies.txt

echo ./C460.clout/transcripts.gtf >> assemblies.txt

cuffmerge -p 1 -g ./data/genes.gtf -s ./data/GRCh37_chr21.fa assemblies.txt

cuffdiff -p 1 -u merged_asm/merged.gtf -b ./data/GRCh37_chr21.fa -L P460,C460 -o P460-C460.diffout P460.thout/accepted_hits.bam C460.thout/accepted_hits.bam

samtools index P460.thout/accepted_hits.bam

samtools index C460.thout/accepted_hits.bam

And another one:

#!/bin/bash

# Approx 75-80m to complete as a script

cd ~/RNA-seq

ls -l data

tophat --help

head -n 20 data/2cells_1.fastq

time tophat --solexa-quals \
  -g 2 \
  --library-type fr-unstranded \
  -j annotation/Danio_rerio.Zv9.66.spliceSites \
  -o tophat/ZV9_2cells \
  genome/ZV9 \
  data/2cells_1.fastq data/2cells_2.fastq # 17m30s

time tophat --solexa-quals \
  -g 2 \
  --library-type fr-unstranded \
  -j annotation/Danio_rerio.Zv9.66.spliceSites \
  -o tophat/ZV9_6h \
  genome/ZV9 \
  data/6h_1.fastq data/6h_2.fastq # 17m30s

samtools index tophat/ZV9_2cells/accepted_hits.bam

samtools index tophat/ZV9_6h/accepted_hits.bam

cufflinks --help

time cufflinks -o cufflinks/ZV9_2cells_gff \
  -G annotation/Danio_rerio.Zv9.66.gtf \
  -b genome/Danio_rerio.Zv9.66.dna.fa \
  -u \
  --library-type fr-unstranded \
  tophat/ZV9_2cells/accepted_hits.bam # 2m

time cufflinks -o cufflinks/ZV9_6h_gff \
  -G annotation/Danio_rerio.Zv9.66.gtf \
  -b genome/Danio_rerio.Zv9.66.dna.fa \
  -u \
  --library-type fr-unstranded \
  tophat/ZV9_6h/accepted_hits.bam # 2m

# guided assembly

time cufflinks -o cufflinks/ZV9_2cells \
  -g annotation/Danio_rerio.Zv9.66.gtf \
  -b genome/Danio_rerio.Zv9.66.dna.fa \
  -u \
  --library-type fr-unstranded \
  tophat/ZV9_2cells/accepted_hits.bam # 16m

time cufflinks -o cufflinks/ZV9_6h \
  -g annotation/Danio_rerio.Zv9.66.gtf \
  -b genome/Danio_rerio.Zv9.66.dna.fa \
  -u \
  --library-type fr-unstranded \
  tophat/ZV9_6h/accepted_hits.bam # 13m

time cuffdiff -o cuffdiff/ \
  -L ZV9_2cells,ZV9_6h \
  -T \
  -b genome/Danio_rerio.Zv9.66.dna.fa \
  -u \
  --library-type fr-unstranded \
  annotation/Danio_rerio.Zv9.66.gtf \
  tophat/ZV9_2cells/accepted_hits.bam \
  tophat/ZV9_6h/accepted_hits.bam # 7m

head -n 20 cuffdiff/gene_exp.diff

sort -t$'\t' -g -k 13 cuffdiff/gene_exp.diff \
  > cuffdiff/gene_exp_qval.sorted.diff

head -n 20 cuffdiff/gene_exp_qval.sorted.diff
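Sorting by column 13 only orders the table; to keep just the genes that pass a cutoff, a small filter helps. A minimal sketch, assuming the usual cuffdiff `gene_exp.diff` layout in which column 7 is the test status and column 13 is `q_value`. The function name `significant_genes` and the default cutoff of 0.05 are my own choices.

```shell
#!/usr/bin/env bash
# significant_genes FILE [QCUT]   (hypothetical helper)
# FILE: cuffdiff gene_exp.diff (tab-separated, with a header line).
# Keeps the header plus rows whose status is OK and whose q_value
# (column 13) is at or below QCUT (default 0.05, an arbitrary choice).
significant_genes() {
    local file=$1 qcut=${2:-0.05}
    awk -F'\t' -v q="$qcut" '
        NR == 1 { print; next }              # keep the header line
        $7 == "OK" && $13 + 0 <= q + 0       # test succeeded and q_value passes the cutoff
    ' "$file"
}
```

For example, `significant_genes cuffdiff/gene_exp.diff 0.01 | cut -f 3,10,13` would list gene name, log2 fold change and q-value for the genes passing a stricter cutoff.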

Transcriptome summary

ulwvfje — Thu, 19 Mar 2015 14:22:12 +0000

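The site is nearly a month old, and I have finally covered one area of bioinformatics completely. Only at beginner level, of course: there are many deeper applications to explore, and even the tools I covered have countless parameters to vary. Transcriptomics is actually not what I'm best at, but for various reasons I happened to do three transcriptome projects, so I have a lot of material on it to share. I will post a site update plan later, which will cover what I do know well: genomics and immune repertoires. Here is a brief summary of the transcriptome posts:

First, of course, the transcriptome category on my site:

http://www.bio-info-trainee.com/?cat=18

As before, I used a script to compile the list: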

Transcriptome: clusterProfiler, an R package for GO and KEGG enrichment: http://www.bio-info-trainee.com/?p=370

Transcriptome: GO enrichment with the WEGO website: http://www.bio-info-trainee.com/?p=359

Transcriptome: TransDecoder, annotating Trinity results: http://www.bio-info-trainee.com/?p=346

cummeRbund notes for transcriptome analysis: http://www.bio-info-trainee.com/?p=271

Differential gene analysis with edgeR: http://www.bio-info-trainee.com/?p=255

Counting gene expression with HTSeq: http://www.bio-info-trainee.com/?p=244

Using the cufflinks suite: http://www.bio-info-trainee.com/?p=166

Using the tophat aligner: http://www.bio-info-trainee.com/?p=156

Using Trinity for transcriptome assembly: http://www.bio-info-trainee.com/?p=125

RNA-seq quality control with RSeQC: http://www.bio-info-trainee.com/?p=113

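I also covered how to download the data:

http://www.bio-info-trainee.com/?p=32

The raw SRA data are first unpacked with SRAtoolkit, then filtered and quality-checked, then assembled with Trinity, and the assembly is annotated. The other route is differential expression, either with tophat+cufflinks+cummeRbund or with HTSeq and edgeR, followed by GO and KEGG pathway annotation, and so on.

All the code and post contents are shared in my group. You're welcome to join group 201161227, 生信菜鸟团!

http://www.bio-info-trainee.com/?p=1

Offline meetups: bioinformatics
You're also welcome to download my Android app:

http://www.cutt.com/app/down/840375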

cummeRbund notes for transcriptome analysis

ulwvfje — Tue, 17 Mar 2015 01:34:16 +0000

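This is an R package designed to work closely with the tophat and cufflinks suite. It produces most of the standard figures that papers require.

1. Install and load the R package

To install, run source("http://bioconductor.org/biocLite.R"); biocLite("cummeRbund"). On current Bioconductor releases, where biocLite has been retired, use BiocManager::install("cummeRbund") instead. If installation fails, download the source package and install it manually.

Then copy the cuffdiff output directory into R's working directory, or set the working directory yourself.

2. Read all the files under the FN directory

All files in the cuffdiff output folder are read in, covering differential results at four levels: genes, isoforms, CDS and TSS.

3. Expression level distribution plot

4. Expression level box plot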

csBoxplot(genes(cuff_data))

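5. Heatmap of differential gene expression

The resulting heatmap: [figure not preserved]

6. Get the differential genes, isoforms, TSS, CDS and so on

Get the lists of up- and down-regulated genes: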

diffData <- diffData(myGenes )

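Only a hundred genes are differentially expressed.

I was going to finish with the complete script, but never mind: it takes too much space and would clutter the page, so I won't paste it.

That script runs and produces the figures automatically.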