What does quantile normalization actually do to the data?

ulwvfje — Wed, 23 Nov 2016 11:48:51 +0000

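Many people are tired of hearing about normalization: there are dozens of methods. For microarray and other expression data, though, the most common one is quantile normalization. So what does it actually do to our expression data? First, be clear about the layout: each column of the expression matrix is a sample, each row is a gene or probe, and the values are expression levels. Quantile normalization sorts each column independently, takes the row-wise mean of the sorted matrix to get a vector of means (one per rank), and then replaces each value in the original matrix with the mean for its rank within its own column. After normalization, therefore, the only values left are those means, and every sample ends up with the same distribution of values. The figure below walks through the details: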
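The sort, average, and substitute steps described above can also be sketched in plain Python. This is only a minimal illustration, not the preprocessCore implementation: the function name `quantile_normalize` and the toy matrix are made up for this example, and ties are simply broken by row order, which real implementations may handle differently.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a matrix given as a list of rows.

    Rows are genes/probes, columns are samples (hypothetical helper,
    written for illustration only).
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])

    # Pull out each sample (column) as its own list.
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Step 1: sort every column independently.
    sorted_cols = [sorted(col) for col in columns]

    # Step 2: average across columns at each rank (row-wise mean of the
    # sorted matrix), giving one mean per rank.
    rank_means = [
        sum(col[i] for col in sorted_cols) / n_cols for i in range(n_rows)
    ]

    # Step 3: put the rank means back in the original row order, so each
    # value is replaced by the mean for its rank within its column.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # Row indices ordered by value; ties keep their original row order.
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


# Toy expression matrix: 4 genes (rows) x 3 samples (columns).
example = [
    [5, 4, 3],
    [2, 1, 4],
    [3, 3, 5],
    [4, 2, 9],
]
normalized = quantile_normalize(example)
for row in normalized:
    print(row)
```

In this toy case the rank means are 2, 3, 4 and 6, so every column of the result contains exactly those four values, just arranged according to that sample's original ordering.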

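In R, the preprocessCore package is the recommended way to do quantile normalization, so there is no need to reinvent the wheel!

Knowing when quantile normalization should be used and when it should not is a much more complicated question, though. Read this for yourself: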

http://biorxiv.org/content/biorxiv/early/2014/12/04/012203.full.pdf

Statistics in biology, the statistics collection published by Nature

ulwvfje — Fri, 16 Oct 2015 11:14:19 +0000

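In biology, biostatistics is about the only area with some real technical depth and a real barrier to entry, and it is also a pain point for the vast majority of researchers. If you are up to it, have a look at Nature's collection of articles on statistics, which focuses mainly on statistics as applied to the natural sciences.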

http://www.nature.com/collections/qghhqm

It includes a few memorable lines about statistics:

Statistics does not tell us whether we are right. It tells us the chances of being wrong.

In other words, statistics cannot confirm that we are right; it only says how likely we are to be wrong.

Quality is often more important than quantity.

In other words, the quality of the data usually matters more than how much of it there is.

The meaning of error bars is often misinterpreted, as is the statistical significance of their overlap.

Good experimental designs mitigate experimental error and the impact of factors not under study.

Article list:

Research methods: Know when your numbers are significant

Scientific method: Statistical errors

Weak statistical standards implicated in scientific irreproducibility

The fickle P value generates irreproducible results

Vital statistics

Experimental biology: Sometimes Bayesian statistics are better

A call for transparent reporting to optimize the predictive value of preclinical research

Power failure: why small sample size undermines the reliability of neuroscience

Basic statistical analysis in genetic case-control studies

Erroneous analyses of interactions in neuroscience: a problem of significance

Analyzing 'omics data using hierarchical models

Advantages and pitfalls in the application of mixed-model association methods

Quality control and conduct of genome-wide association meta-analyses

Circular analysis in systems neuroscience: the dangers of double dipping

A solution to dependency: using multilevel analysis to accommodate nested data

How does multiple testing correction work?

What is Bayesian statistics?

What is a hidden Markov model?

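The articles below cover what are really just standard textbook statistics topics, but being published in Nature instantly makes them look a lot more impressive.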

Points of significance: Importance of being uncertain

Points of Significance: Error bars

Points of significance: Significance, P values and t-tests

Points of significance: Power and sample size

Points of Significance: Visualizing samples with box plots

Points of significance: Comparing samples part I

Points of significance: Comparing samples part II

Points of significance: Nonparametric tests

Points of significance: Designing comparative experiments

Points of significance: Analysis of variance and blocking

Points of Significance: Replication

Points of Significance: Nested designs

Points of Significance: Two-factor designs

Points of significance: Sources of variation

Points of Significance: Split plot design

Points of Significance: Bayes' theorem

Points of significance: Bayesian statistics

Points of Significance: Sampling distributions and the bootstrap

Points of Significance: Bayesian networks

A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.

