GWAS Still Has Plenty of Life | 生信菜鸟团

In February 2020, a research team from the Department of Radiation Oncology at the Chinese PLA General Hospital published a paper in the Journal of Cancer titled "Precise prediction of the radiation pneumonitis in lung cancer: an explorative preliminary mathematical model using genotype information".

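The paper reports that radiation sensitivity is predictable at the genetic level and builds a prediction model whose sensitivity and specificity both exceed 90%. The result is described as world-leading and is presented as a sound scientific basis for stratifying radiotherapy patients in the clinic.
Skimming it, I was surprised to find that it follows the classic GWAS way of thinking. The differences are a very small sample size and a fairly customized genotyping chip; what makes the paper stand out is its patient cohort and the clinical question it asks.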

Study outline

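The design is easy to follow: the authors collected just over a hundred lung cancer patients (different stages, different histologies), about half of whom had RP grade ≥2, and genotyped them before radiotherapy, leaving roughly 300,000 sites after quality control.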

Radiation pneumonitis (RP) is the most significant dose-limiting toxicity and is one major obstacle for lung cancer radiotherapy.
Grade ≥2 RP usually needs clinical interventions and severe RP could be life threatening.
The purpose of this study is to develop an approach for the personalized RP risk prediction.
A multiple linear regression model named Radiation Pneumonitis Index (RPI) was built, for the assessment of Grade ≥2 RP risk.
The RP grade is the core variable of the study: Once diagnosed, RP was further graded by at least two radiation oncologists following the Common Toxicity Criteria for Adverse Events (CTCAE) version 4.03.

Methods

Genotyping used the Infinium® Global Screening Array system (Illumina, San Diego, CA, USA), a chip with roughly 700,000 sites.
Samples: Peripheral blood leukocytes from patients before the radiotherapy was used for genomic DNA extraction using the Maxwell system (Promega, Madison, WI, USA).
Patient cohort: Archived information of 118 lung cancer patients was obtained from the People’s Liberation Army General Hospital.
The GWAS-style analysis steps were:
We excluded SNPs in each individual dataset that had a mean GenCall score < 0.7, missingness >5%, MAF < 0.01 or a Hardy-Weinberg equilibrium test P < 10^-6 using PLINK.
We also excluded variants with multiple alleles. A total of 720,078 SNPs in the genotypic data set and 299,054 SNPs in the dataset passed this process for further prediction.
After QC, about 299,000 sites remain.
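The per-SNP filters quoted above can be sketched in a few lines of Python. This is only an illustration of the thresholds, not the authors' pipeline: the function name `snp_qc` and its arguments are made up here, the GenCall score filter is left out because it needs the raw chip intensities, and the Hardy-Weinberg check uses a simple chi-square approximation, whereas PLINK itself reports an exact test.

```python
from math import erfc, sqrt

def snp_qc(genotypes, max_missing=0.05, min_maf=0.01, hwe_p=1e-6):
    """QC one biallelic SNP (hypothetical helper, thresholds from the paper).

    genotypes: non-empty list of alt-allele counts (0, 1, 2) per sample,
               with None for a missing call.
    Returns (passes, stats).
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    missing = 1.0 - n / len(genotypes)
    if n == 0:
        return False, {"missing": missing, "maf": 0.0, "hwe_p": 1.0}

    # Observed genotype counts: hom-ref, het, hom-alt
    obs = [called.count(0), called.count(1), called.count(2)]
    p_alt = (obs[1] + 2 * obs[2]) / (2 * n)
    maf = min(p_alt, 1.0 - p_alt)

    # Expected counts under Hardy-Weinberg equilibrium
    exp = [n * (1 - p_alt) ** 2, 2 * n * p_alt * (1 - p_alt), n * p_alt ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2.0))

    passes = missing <= max_missing and maf >= min_maf and p_value >= hwe_p
    return passes, {"missing": missing, "maf": maf, "hwe_p": p_value}
```

Calling `snp_qc` on each column of the genotype matrix and keeping the columns where `passes` is true mimics the filtering step; with real data one would simply let PLINK do this.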
Genotypes are coded into three classes, 0, 1, and 2, for wild-type homozygous, heterozygous, and variant homozygous (We assigned value 0 to ‘WW’ genotype, value 1 to ‘WA’/‘AW’ genotype and value 2 to ‘AA’ genotype.)
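As a small illustration of this additive coding, here is a sketch that counts copies of the variant allele in a two-letter genotype call. The function name `encode_genotype`, the use of nucleotide letters, and the treatment of anything else as missing are assumptions made for the example; the paper only states the 0/1/2 mapping.

```python
def encode_genotype(call, alt_allele):
    """Additive coding: number of copies of alt_allele in a call like 'AG'.

    Returns 0 for wild-type homozygous, 1 for heterozygous,
    2 for variant homozygous, and None for a missing or malformed call.
    """
    if call is None or len(call) != 2 or any(a not in "ACGT" for a in call):
        return None
    return sum(1 for allele in call if allele == alt_allele)


# One SNP across four patients, with 'A' as the variant allele
calls = ["GG", "AG", "GA", "AA"]
row = [encode_genotype(c, "A") for c in calls]
```

Doing this for every SNP gives the patients-by-sites matrix of 0/1/2 values that the regression model takes as input.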
After modeling: Thirty-nine effective SNP sites were discovered after applying the GLMNET regression on 90 sets of random training data.
The key ingredient is the GLMNET algorithm, whose full name is Generalized Linear Models via Lasso and Elastic-Net Regularization.
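To make the idea concrete, here is a minimal pure-Python sketch of an elastic-net fit by coordinate descent, repeated on random training subsets while counting how often each SNP keeps a non-zero coefficient. Everything specific in it is an assumption: the paper does not say how λ and α were chosen or how the 90 runs were combined, and the names `elastic_net` and `selection_counts`, the 80% subset size, and the fixed λ are invented for illustration. Unlike the R package glmnet, this sketch does not standardize predictors or fit a whole λ path.

```python
import random

def soft_threshold(z, g):
    """Shrink z toward zero by g (the lasso proximal step)."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def elastic_net(X, y, lam, alpha, n_iter=200):
    """Minimize (1/2n)*||y - b0 - X b||^2
    + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||_2^2) by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    b0 = sum(y) / n
    resid = [yi - b0 for yi in y]  # residuals while all coefficients are zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        # Unpenalized intercept update
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # monomorphic column carries no information
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return b0, beta

def selection_counts(X, y, lam, alpha, n_rounds=90, frac=0.8, seed=0):
    """Refit on random subsets; count how often each column is selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_rounds):
        idx = rng.sample(range(n), int(frac * n))
        _, beta = elastic_net([X[i] for i in idx], [y[i] for i in idx], lam, alpha)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    return counts
```

SNPs that are selected in most rounds would be the natural candidates for a final model. With some 300,000 columns this pure-Python loop is far too slow; it is here only to show the mechanics.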
The resulting figure is shown below:

Genotypes at 300,000 sites for 100 patients

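At this point the analysis looks a lot like a classic survival analysis on an expression matrix, except that the "expression values" in genotype data take only three forms, 0, 1, and 2, rather than the continuous values of a real RNA-seq or microarray expression matrix. The clinical endpoint is very similar too: in survival analysis each patient is usually either alive or dead, and here each patient is either Grade ≥2 RP or not.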
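The authors are vague about the GLMNET algorithm. The R package glmnet covers:
Ridge regression
LASSO (least absolute shrinkage and selection operator)
It is used mainly from R: LASSO regression corresponds to α=1, ridge regression to α=0, and a general elastic net model to 0<α<1. The parameter α controls how the model behaves when predictors are highly correlated. Before learning it, you should be comfortable with simple linear regression and logistic regression; one resource I recommend: https://rstudio-pubs-static.s3.amazonaws.com/208326_53c039603b3c45619f9fb2c0baf5fa28.html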
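For reference, the role of α is easiest to see in the objective that glmnet minimizes for a linear (Gaussian) model. The symbols N (number of patients), x_i (the 0/1/2 genotype vector of patient i), y_i (the outcome), and λ (the overall penalty strength) are notation introduced here, not taken from the paper:

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2
+\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2+\alpha\,\lVert\beta\rVert_1\right]
```

Setting α=1 leaves only the L1 term (LASSO, which drives many coefficients exactly to zero), α=0 leaves only the squared L2 term (ridge, which shrinks correlated predictors together), and values in between mix the two.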
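GLMNET as mentioned in the paper, in full Generalized Linear Models via Lasso and Elastic-Net Regularization, names a whole family of models, so the authors never say which specific method they used. Do they not want us to reproduce the result? Besides, they provide neither the genotype matrix (100 patients by 300,000 sites) nor the per-patient clinical information, so all we can do is look and keep quiet.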
Finally, 生信技能树 indeed has no GWAS tutorials, but here at 生信菜鸟团 we do have a GWAS series:

Whole-genome data analysis index
