Statistics versus machine learning | Nature Methods
一. Introduction
(Image unrelated to the text; source: Elon Musk on X, satirizing the overhyped marketing of AI/machine learning.)
This article discusses the differences and connections between statistics and machine learning from the perspective of biological research.
Below is a basic translation of the main text, for reference only.
Statistics draws population inferences from a sample, and machine learning finds generalizable predictive patterns.
二. Main text
Two major goals in the study of biological systems are inference and prediction. Inference creates a mathematical model of the data-generation process to formalize understanding or test a hypothesis about how the system behaves. Prediction aims at forecasting unobserved outcomes or future behavior, such as whether a mouse with a given gene expression pattern has a disease. Prediction makes it possible to identify best courses of action (e.g., treatment choice) without requiring understanding of the underlying mechanisms. In a typical research project, both inference and prediction can be of value: we want to know how biological processes work and what will happen next. For example, we might want to infer which biological processes are associated with the dysregulation of a gene in a disease, as well as detect whether a subject has the disease and predict the best therapy.
Many methods from statistics and machine learning (ML) may, in principle, be used for both prediction and inference. However, statistical methods have a long-standing focus on inference, which is achieved through the creation and fitting of a project-specific probability model. The model allows us to compute a quantitative measure of confidence that a discovered relationship describes a ‘true’ effect that is unlikely to result from noise. Furthermore, if enough data are available, we can explicitly verify assumptions (e.g., equal variance) and refine the specified model, if needed.
By contrast, ML concentrates on prediction by using general-purpose learning algorithms to find patterns in often rich and unwieldy data. ML methods are particularly helpful when one is dealing with ‘wide data’, where the number of input variables exceeds the number of subjects, in contrast to ‘long data’, where the number of subjects is greater than that of input variables. ML methods make minimal assumptions about the data-generating system; they can be effective even when the data are gathered without a carefully controlled experimental design and in the presence of complicated nonlinear interactions. However, despite convincing prediction results, the lack of an explicit model can make ML solutions difficult to relate directly to existing biological knowledge.
Translator's note: a key measure of a machine-learning model's quality is its generalization ability, so ML algorithms are designed with considerable robustness to noise in mind. On the last sentence: in many cases ML makes good predictions from messy or even seemingly unrelated data (the fitted values deviate little from the observations), yet it cannot explain why such a fit works and so offers little real-world guidance; with a linear model, by contrast, the fitted values may be secondary, and what matters more is clarifying which relationships hold between the variables and the outcome.
Classical statistics and ML vary in computational tractability as the number of variables per subject increases. Classical statistical modeling was designed for data with a few dozen input variables and sample sizes that would be considered small to moderate today. In this scenario, the model fills in the unobserved aspects of the system. However, as the numbers of input variables and possible associations among them increase, the model that captures these relationships becomes more complex. Consequently, statistical inferences become less precise and the boundary between statistical and ML approaches becomes hazier.
Translator's note: different data scales call for different tools.
To compare traditional statistics to ML approaches, we’ll use a simulation of the expression of 40 genes in two phenotypes (−/+). Mean gene expression will differ between phenotypes, but we’ll set up the simulation so that the mean difference for the first 30 genes is not related to phenotype. The last ten genes will be dysregulated, with systematic differences in mean expression between phenotypes. To achieve this, we assign each gene an average log expression that is the same for both phenotypes. The dysregulated genes (31−40, labeled A−J) have their mean expression perturbed in the + phenotype (Fig. 1a). Using these average expression values, we simulate an RNA-seq experiment in which the observed counts for each gene are sampled from a Poisson distribution with mean exp(x + ε), where x is the mean log expression, unique to the gene and phenotype, and ε ~ N(0, 0.15) acts as biological variability that varies from subject to subject (Fig. 1b). For genes 1−30, which do not have differential expression, the z-scores are approximately N(0, 1). For the dysregulated genes, which do have differential expression, the z-scores in one phenotype tend to be positive, and the z-scores in the other tend to be negative.
Translator's note: this paragraph describes how the authors designed the simulation.
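The sampling scheme described above can be sketched with the Python standard library. The Poisson mean exp(x + ε), the noise ε ~ N(0, 0.15), the ten dysregulated genes, and the ten subjects per phenotype come from the text; the baseline mean log expression (4.0) and the size of the perturbation (1.0) are illustrative assumptions, not values from the paper.

```python
import math
import random

random.seed(0)

N_GENES, N_SUBJECTS = 40, 10   # 40 genes, 10 subjects per phenotype
DYSREGULATED = range(30, 40)   # genes 31-40 (A-J) are perturbed in the "+" phenotype

def poisson(mean):
    # Knuth's algorithm: sample a Poisson-distributed count with the given mean.
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def simulate_counts(phenotype):
    """Observed counts for one subject: Poisson with mean exp(x + eps)."""
    counts = []
    for g in range(N_GENES):
        x = 4.0                          # baseline mean log expression (assumed value)
        if phenotype == "+" and g in DYSREGULATED:
            x += 1.0                     # perturbation of the dysregulated genes (assumed size)
        eps = random.gauss(0, 0.15)      # subject-to-subject biological variability
        counts.append(poisson(math.exp(x + eps)))
    return counts

minus = [simulate_counts("-") for _ in range(N_SUBJECTS)]
plus = [simulate_counts("+") for _ in range(N_SUBJECTS)]
```

With this setup the last ten genes have systematically higher counts in the "+" phenotype, while the first thirty differ only by noise, matching the design described in the text.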
Our goal in the simulation is to identify which genes are associated with the abnormal phenotype. We’ll formally test the null hypothesis that the mean expression differs by phenotype with a widely used generalized linear negative binomial model that allows for biological variability among subjects with the same phenotype. We’ll perform a test for each gene and identify those that show statistically significant differences in mean expression, based on P values adjusted for multiple testing via the Benjamini–Hochberg method. In an alternative Bayesian approach, we would compute the posterior probability of having differential expression specific to the phenotype. Figure 2a shows the P values of the tests between phenotypes as a function of the log fold change in gene expression. The ten dysregulated genes are highlighted in red; our inference flagged nine out of the ten (except F, with the smallest log fold change) as significant with adjusted P < 0.05. We could use the fold change as a measure of effect size, with a confidence interval or highest posterior region used to indicate the uncertainty in the estimate. In a realistic setting, genes identified by the analysis would then be validated experimentally or compared with data from other sources such as proposed gene networks or annotations.
Translator's note: this paragraph describes how traditional statistical inference is applied in this experiment.
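In practice the negative binomial tests would be run with a dedicated package (e.g., edgeR or DESeq2 in R), so only the multiple-testing step is sketched here: a plain-Python Benjamini–Hochberg adjustment applied to per-gene P values. The raw P values below are made-up numbers for illustration.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up FDR control)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity of the adjustment.
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, p_values[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene raw P values from the per-gene tests:
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
```

With these inputs, only the two smallest P values survive the 0.05 cutoff after adjustment, illustrating why a raw P < 0.05 rule and an adjusted one select different gene sets.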
To ask a similar biological question using ML, we would typically try several algorithms evaluated by cross-validation on independent test subjects, or bootstrap methods with ‘out-of-sample’ evaluation to select one with good prediction accuracy. Let’s use a random forest (RF) classifier that will simultaneously consider all genes and grow multiple decision trees to predict the phenotype without assuming a probabilistic model for the read counts. The result of this RF classification with 100 trees is shown in Figure 2b, where the P values from the classical inference are plotted as a function of feature (gene) importance. This score quantifies a given gene’s contribution to the average classification improvement within a partition when the tree is split selecting that gene. Many ML algorithms have analogous measures that allow some quantification of the contribution of each input variable to the classification. In our simulation, eight of ten genes with the largest importance measures were from the dysregulated set. Not in the top ten were genes D and F, which had the smallest fold changes (Fig. 2a).
Translator's note: this paragraph describes the use of a machine-learning algorithm, the random forest, in this experiment.
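The importance score described above is, in its common impurity-based form, the decrease in Gini impurity achieved by splits on a given gene, averaged over the trees of the forest. A full random forest is too long to sketch here, so the per-split quantity is shown for a single decision stump; the two features and labels below are made-up toy data.

```python
def gini(labels):
    """Gini impurity of a set of binary (0/1) labels."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def stump_importance(values, labels):
    """Best achievable Gini decrease when splitting once on a single feature.
    A random forest averages this per-split quantity over many trees and
    nodes to obtain its feature-importance score."""
    n = len(labels)
    parent = gini(labels)
    best = 0.0
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        if not left or not right:
            continue
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best

# Toy data: feature_a separates the two classes cleanly, feature_b does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
feature_a = [1, 2, 2, 3, 8, 9, 9, 10]
feature_b = [5, 9, 1, 8, 6, 2, 9, 4]
ranked = sorted(["a", "b"],
                key=lambda f: -stump_importance({"a": feature_a, "b": feature_b}[f], labels))
```

A practical analysis would instead grow a forest (e.g., scikit-learn's `RandomForestClassifier` exposes `feature_importances_`) and rank genes by that score, as the text describes.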
If we perform the simulation 1,000 times and count the number of dysregulated genes correctly identified by both approaches (on the basis of either classical null-hypothesis rejection with an adjusted P value cutoff, or predictive pattern generalization with RF and a top-ten feature-importance ranking), then we find that the two methods yield similar results. The average number of dysregulated genes identified is 7.4/10 for inference and 7.7/10 for RF (Fig. 2c). Both methods have a median of 8/10, but we find more instances of simulations for which only 2−5 dysregulated genes were identified with inference. This is because the selection process we designed differs between the two approaches: inference selects by an adjusted P value cutoff, so that the number of selected genes varies, whereas for the RF we select the top ten genes. We could have applied a cutoff to the importance score, but the scores do not have an objective scale on which to base the threshold.
Translator's note: the experiment above illustrates that traditional statistics and machine learning show both differences and similarities in certain respects.
We’ve used pre-existing knowledge about RNA-seq data to design a statistical model of the process and draw inference to estimate unknown parameters in the model from the data. In our simulation, the model encapsulates the relationship between the mean number of reads (the parameter) for each gene for each phenotype and the observed read counts for each subject. The output of the statistical analysis is a test statistic for a specific hypothesis and confidence bounds of the parameter (mean fold change, in this example). In our example, the model’s parameters relate explicitly to aspects of gene expression: the counts of molecules produced at a certain rate in a cell can be directly interpreted.
Translator's note: this paragraph emphasizes the interpretability of the model.
To apply ML, we don’t need to know any of the details about RNA-seq measurements; all that matters is which genes are more useful for phenotype discrimination based on gene expression. Such generalization greatly helps when we have a large number of variables, such as in a typical RNA-seq experiment that may have hundreds to hundreds of thousands of features (e.g., transcripts) but a much smaller sample size.
Translator's note: often we do not need to know the relationships among a model's variables, only the final result. In image recognition, for example, how the process is realized matters less than whether the model recognizes well, with acceptable precision, recall, and so on.
Now consider a more complex experiment in which each subject contributes multiple observations from different tissues. Even if we only conduct a formal statistical test that compares the two phenotypes for each tissue, the multiple testing problem is greatly complicated. The increase in data complexity may make classical statistical inference less tractable. Instead we could use an ML approach such as clustering of genes or tissues or both to extract the main patterns in the data, classify subjects, and make inferences about the biological processes that give rise to the phenotype. To simplify the analysis, we could perform a dimension reduction such as averaging the measurements over the ten subjects with each phenotype for each gene and each tissue.
Translator's note: in real production settings the number of variables can be staggering, and traditional statistical methods struggle to handle data on that scale.
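The averaging mentioned above (collapsing the ten subjects per phenotype into one mean per gene and tissue) is an ordinary group-by. A minimal standard-library sketch, with hypothetical long-format records:

```python
from collections import defaultdict

# Hypothetical records: (subject, phenotype, tissue, gene, count)
records = [
    ("s1", "+", "liver", "geneA", 120), ("s2", "+", "liver", "geneA", 132),
    ("s3", "-", "liver", "geneA", 61),  ("s4", "-", "liver", "geneA", 55),
    ("s1", "+", "brain", "geneA", 80),  ("s2", "+", "brain", "geneA", 76),
]

def phenotype_tissue_means(records):
    """Collapse subjects: one mean count per (phenotype, tissue, gene) cell."""
    sums = defaultdict(lambda: [0, 0])           # key -> [total, n]
    for _, phenotype, tissue, gene, count in records:
        cell = sums[(phenotype, tissue, gene)]
        cell[0] += count
        cell[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

means = phenotype_tissue_means(records)
```

The reduced table of means per phenotype, tissue, and gene can then be fed to clustering or classification instead of the raw per-subject counts.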
The boundary between statistical inference and ML is subject to debate: some methods fall squarely into one or the other domain, but many are used in both. For example, the bootstrap method can be used for statistical inference but also serves as the basis for ensemble methods, such as the RF algorithm. Statistics requires us to choose a model that incorporates our knowledge of the system, and ML requires us to choose a predictive algorithm by relying on its empirical capabilities. Justification for an inference model typically rests on whether we feel it adequately captures the essence of the system. The choice of pattern-learning algorithms often depends on measures of past performance in similar scenarios. Inference and ML are complementary in pointing us to biologically meaningful conclusions.
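The bootstrap idea referred to above can be sketched in a few lines of standard-library Python; the sample values and the 1,000-resample count are illustrative choices.

```python
import random

random.seed(1)

def bootstrap_means(sample, n_resamples=1000):
    """Resample with replacement and recompute the estimator (here, the mean) each time."""
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(sample) for _ in sample]
        means.append(sum(resample) / len(resample))
    return means

sample = [2.1, 2.5, 2.3, 3.0, 2.8, 2.6, 2.4, 2.9, 2.2, 2.7]
boot = sorted(bootstrap_means(sample))
# A simple 95% percentile interval for the mean:
low, high = boot[25], boot[974]
```

The spread of the resampled means approximates the sampling distribution of the estimator; the same resampling machinery, applied to training data, underlies bagging and hence the random forest.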
三. Summary
Based on the above, here is a brief recap of the differences between machine learning and traditional statistics.
Common concepts go by different names in the two fields:
Statistics | Machine learning
---|---
Estimation | Learning
Classifier | Hypothesis
Data point | Example / Instance
Regression | Supervised learning
Classification | Supervised learning
Covariate | Feature
Response | Label
Common differences between the two:
Aspect | Machine learning | Statistics
---|---|---
Assumptions | Minimal, sometimes none | Usually requires specific assumptions to hold
Data size | In principle, the more the better | Performs well even on small data; very large data can become hard to handle
Interpretability | Many models are black boxes | Generally highly interpretable
Typical setting | Classification / discrete outcomes (emphasis on the final prediction) | Regression / continuous outcomes (emphasis on interpretability)
Goal | Prediction | Inference
In short, machine learning divorced from mathematical statistics is unreliable, and statistics that ignores machine learning is stagnant; the two complement each other rather than forming a dichotomy.
四. Notes
- Differential expression analysis: fold change, significance (P value), and volcano plots (Life·Intelligence, 博客园)
- bootstrap: the basic idea is to repeatedly draw small random samples from the available data, compute the estimator on each, and thereby learn the estimator's distribution. It is a resampling method: samples of equal size are drawn independently, with replacement, from the existing sample, and inference is carried out on the resampled data.
- random forest: an ensemble algorithm built from decision trees that performs well in many settings. Further reading: Random Forest (4 implementation steps + 10 advantages and disadvantages)