FAFU机器学习 5-1-2中文.pptx

Evaluation Methods
- Holdout method (留出法)
- K-fold cross-validation (K折交叉验证法)
- Bootstrapping (自助法)

Holdout method
- Arguably the simplest model-evaluation technique: split the dataset into two disjoint parts, a training set and a test set.
- Keep in mind: there are many ways to split a dataset, and different splits lead to different performance estimates. Variation of the underlying sample statistics along the feature axes also remains an issue, and it becomes more pronounced with small datasets.
- Stratified sampling (分层抽样) keeps the class proportions the same in both parts.
- Repeat the holdout k times with different random seeds and report the average performance over the k repetitions:
  Acc_avg = (1/k) * Σ_{j=1..k} Acc_j
- Keep in mind: the size of the training set affects performance; using roughly 2/3 to 4/5 of the dataset as training data is a common choice. A minimal split is sketched below.
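The slides stop at the description, so the following is only a sketch of a repeated, stratified holdout, assuming scikit-learn is available; the iris dataset and the logistic-regression model are placeholders, not part of the original material.

```python
# Repeated stratified holdout sketch (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

accuracies = []
for seed in range(5):                      # repeat the holdout k = 5 times
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=1 / 3,                   # keep ~2/3 of the data for training
        stratify=y,                        # stratified sampling: preserve class ratios
        random_state=seed,                 # a different random seed per repetition
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

print("mean holdout accuracy:", np.mean(accuracies))
```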

K-fold cross-validation (K折交叉验证法)
- Probably the most common, but more computationally intensive, approach.
- Split the dataset into k disjoint parts, called folds; typical choices for k are 5, 10 or 20.
- K-fold cross-validation is a special case of cross-validation in which we iterate over the dataset k times: in each round, one fold is used for validation and the remaining k-1 folds are merged into a training subset for model evaluation.
- Keep in mind: the larger the number of folds, the better the error estimate, but the longer the program takes to run. A practical rule is to use at least 10 folds when you can.
- Leave-One-Out (留一法, LOO) is the special case where k equals the number of samples; LOOCV can be useful for small datasets. A sketch of both follows.
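A k-fold loop can be written by hand, but the sketch below leans on scikit-learn (an assumed dependency); cross_val_score handles the fold bookkeeping, and LeaveOneOut is the k = n special case. The dataset and model are again placeholders.

```python
# K-fold cross-validation sketch (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 10-fold CV: each sample is used for validation exactly once.
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold mean accuracy:", scores.mean())

# Leave-One-Out: the special case where k equals the number of samples.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOO mean accuracy:", loo_scores.mean())
```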

Bootstrapping (自助法)
- A bootstrap sampling technique for estimating a sampling distribution: the idea is to generate new data from a population by repeatedly sampling from the original dataset with replacement.
- In each iteration, approximately 0.632n samples are selected as the bootstrap training set, and the remaining ~0.368n out-of-bag samples are reserved for testing. A NumPy sketch follows.
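A single bootstrap iteration only needs NumPy: draw n indices with replacement for training and keep the indices that were never drawn as the out-of-bag test set. The data here is synthetic, purely for illustration.

```python
# Bootstrap / out-of-bag sketch (NumPy only, synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n = 1000
data = rng.normal(size=n)                  # stand-in dataset

boot_idx = rng.integers(0, n, size=n)      # sample n indices with replacement
oob_mask = np.ones(n, dtype=bool)
oob_mask[boot_idx] = False                 # indices never drawn are out-of-bag

train, test = data[boot_idx], data[oob_mask]
print("out-of-bag fraction:", oob_mask.mean())   # ~0.368 on average
```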

Evaluation Metrics

Metrics for binary classification: accuracy
- Measuring model performance with accuracy: the fraction of correctly classified samples,
  Acc(y, ŷ) = (1/n) * Σ_{i=1..n} 1(ŷ_i = y_i)
- Accuracy is really only suitable when there is an equal number of observations in each class (which is rarely the case) and when all predictions and prediction errors are equally important, which is often not the case either.
- It is therefore not always a useful metric and can be misleading. Example: spam classification. If 99% of emails are real and 1% are spam, we could build a model that predicts every email is real; its accuracy is 99%, but it is horrible at actually classifying spam and fails at its original purpose, as the sketch below shows.
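The spam example can be reproduced in a few lines; the labels below are synthetic, chosen only to match the 99%/1% split described above.

```python
# Why accuracy misleads on imbalanced data (synthetic labels).
import numpy as np

y_true = np.array([0] * 990 + [1] * 10)   # 99% real (0), 1% spam (1)
y_pred = np.zeros_like(y_true)            # a "model" that predicts everything as real

accuracy = (y_pred == y_true).mean()
spam_caught = ((y_pred == 1) & (y_true == 1)).sum()
print(accuracy)      # 0.99 -- looks great
print(spam_caught)   # 0    -- but no spam is ever detected
```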

Confusion matrix
- One of the most comprehensive ways to represent the result of evaluating a binary classifier: it tabulates true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).

Error rate and accuracy
- The error rate is the sum of all false predictions divided by the total number of predictions, and the accuracy is the sum of correct predictions divided by the total number of predictions:
  Err = (FP + FN) / (TP + FP + FN + TN)
  Acc = (TP + TN) / (TP + FP + FN + TN)

Metrics from the confusion matrix: Precision (查准率)
- Precision measures how many of the samples predicted as positive are actually positive:
  P = TP / (TP + FP)
- Precision is used as a performance metric when the goal is to limit the number of false positives. Example: predicting whether a new drug will be effective in treating a disease in clinical trials.

Metrics from the confusion matrix: Recall (查全率, 召回率)
- Recall measures how many of the positive samples are captured by the positive predictions:
  R = TP / (TP + FN)
- Recall is used as a performance metric when we need to identify all positive samples. Example: finding people who are sick. A short computation of these quantities follows.
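These quantities can be read off scikit-learn's confusion_matrix, or computed with precision_score and recall_score; the labels below are made up for illustration and are not from the slides.

```python
# Confusion matrix, precision and recall sketch (assumes scikit-learn).
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                      # 5 1 2 2
print(precision_score(y_true, y_pred))     # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))        # TP / (TP + FN) = 2/4
```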

16、lTo get higher precision by increasing threshold lTo get higher recall by reducing threshold 2023-11-4Model EvaluationLesson 4-19Metrics from the confusion matrixTradeoff between Precision and RecallTradeoff between Precision and Recall2023-11-4Model EvaluationLesson 4-20Metrics from the confusion m

17、atrixTradeoff between Precision and RecallTradeoff between Precision and RecallF1:F-score or F-measurelF-score:is with the harmonic mean(调和平均数)of precision and recallRPRPF 21AlgorithmPRAverageF1A10.50.40.450.444A20.70.10.40.175A30.0210.510.03922023-11-4Model EvaluationLesson 4-21Metrics from the con

18、fusion matrixGeneral F-measure:FFPFNTPTPRPRPF22222)1()1()1(lWhen=1,becoming F1lWhen 1,placing more emphasis on false negative,and weighing recall higher than precisionlWhen 1,attenuating the influence of false negative,and weighing recall lower than precision2023-11-4Model EvaluationLesson 4-22Metri
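The F1 column of the table can be verified numerically. The small helper below is not from the slides; it simply evaluates the Fβ formula above from precision and recall, so it matches the table rows directly.

```python
# F1 and general F-beta from precision and recall (checks the table above).
def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    return (1 + beta**2) * p * r / (beta**2 * p + r)

for name, p, r in [("A1", 0.5, 0.4), ("A2", 0.7, 0.1), ("A3", 0.02, 1.0)]:
    print(name, round(f_beta(p, r), 3))        # 0.444, 0.175, 0.039

# beta > 1 weighs recall more heavily; beta < 1 weighs precision more heavily.
print(round(f_beta(0.5, 0.4, beta=2.0), 3))    # recall-oriented F2 for row A1
```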

Receiver operating characteristic (ROC, 受试者工作特征)
- The ROC curve considers all possible thresholds for a given classifier and plots the true positive rate (TPR) against the false positive rate (FPR):
  TPR = TP / (TP + FN)
  FPR = FP / (FP + TN)
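Given predicted scores rather than hard labels, the curve and its area can be obtained with scikit-learn (an assumed dependency): roc_curve sweeps every threshold and roc_auc_score summarizes the curve. The scores below are synthetic.

```python
# ROC curve and AUC sketch (assumes scikit-learn, synthetic scores).
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.4, 0.8, 0.35, 0.6, 0.7, 0.9]   # classifier scores, not labels

fpr, tpr, thresholds = roc_curve(y_true, scores)     # FPR/TPR at every threshold
print(list(zip(fpr, tpr)))
print("AUC:", roc_auc_score(y_true, scores))
```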

Area under the ROC curve (AUC)
- The AUC summarizes classifier performance across all thresholds in a single number.

Model Selection
- With an evaluation method and performance metrics in hand, it looks as if we can now evaluate and compare learners: measure some performance metric with an experimental evaluation method, then compare the results. However:
  - First, what we want to compare is generalization performance, while an experimental evaluation method gives us performance on a test set; the two comparisons may not agree.
  - Second, test-set performance depends heavily on the choice of the test set itself: not only do test sets of different sizes give different results, even test sets of the same size give different results if they contain different samples.
  - Third, many machine learning algorithms have inherent randomness; even with the same parameter settings on the same test set, repeated runs can produce different results.
- Statistical hypothesis testing (hypothesis test) gives us an important basis for comparing learner performance. Based on the result of a hypothesis test we can:
  - test a hypothesis about the generalization performance of a single learner;
  - compare the performance of multiple learners: if learner A looks better than learner B on the test set, is A's generalization performance statistically significantly better than B's, and how much confidence can we place in that conclusion?

A hypothesis-testing problem
- Consider a model evaluated with the holdout method. Suppose the evaluation was performed 5 times and the accuracies are 0.99, 0.98, 0.99, 0.94, 0.95.

- Can we say that the mean accuracy is different from 0.97?
- Consider the grades of two models: A had 15, 10, 12, 19, 5, 7 and B had 14, 11, 11, 12, 6, 7. Can we say A had better grades than B?
- A statistical test aims to answer such questions; a first sketch is given below.
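One way to frame the first question is a one-sample t-test of the five accuracies against the hypothesized mean 0.97. This is an illustration using SciPy, not something shown on the slides, and it treats the five repetitions as independent samples.

```python
# One-sample t-test sketch: is the mean accuracy different from 0.97? (assumes SciPy)
from scipy import stats

accuracies = [0.99, 0.98, 0.99, 0.94, 0.95]
t_stat, p_value = stats.ttest_1samp(accuracies, popmean=0.97)
print(t_stat, p_value)   # a large p-value means we cannot reject the null hypothesis
```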

Confidence interval (置信区间): point estimation vs. interval estimation
- Point estimation uses a sample statistic to estimate a population parameter; since the statistic is a single value on the number line, the result is also expressed as a single point, hence the name. A point estimate gives a value for the unknown parameter but says nothing about how reliable that value is, i.e. how far the estimate may deviate from the parameter's true value.
- Interval estimation: given a confidence level, determine from the estimate a range in which the true value is likely to lie; the range is usually centered on the estimate and is called the confidence interval.
- Key notions: standard deviation (标准差) vs. standard error (标准误差), and the 95% confidence interval.
- Suppose X follows a normal distribution, X ~ N(μ, σ²). Repeatedly draw samples of size n; the sample mean is M = (X1 + X2 + ... + Xn) / n, and by the law of large numbers and the central limit theorem M follows M ~ N(μ, σ²/n). A sketch of the interval computation follows.
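A 95% confidence interval for the mean can be built from the sample mean and the standard error, M ± z · s/√n. The sketch below uses the normal approximation (z ≈ 1.96) implied by the slide, and also shows the t-based interval that is more usual for small samples; the data reuses the five accuracies from the earlier example.

```python
# 95% confidence interval for a mean (NumPy/SciPy assumed).
import numpy as np
from scipy import stats

sample = np.array([0.99, 0.98, 0.99, 0.94, 0.95])
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(len(sample))    # standard error of the mean

z = stats.norm.ppf(0.975)                          # ~1.96 for a 95% interval
print(mean - z * sem, mean + z * sem)

# For small samples, the t distribution is the safer choice:
t = stats.t.ppf(0.975, df=len(sample) - 1)
print(mean - t * sem, mean + t * sem)
```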

Hypothesis testing and statistical significance
- The process of hypothesis testing:
  - Null hypothesis: the null hypothesis is a model of the system based on the assumption that the apparent effect was actually due to chance.
  - p-value: the p-value is the probability of the apparent effect under the null hypothesis.
  - Interpretation: based on the p-value, we conclude that the effect is either statistically significant or not.

Paired t-test
- The t-test is an example of a parametric test. It is applicable when the null hypothesis states that the difference between two responses has mean zero and unknown variance. The t-test assumes that the data are distributed according to a Gaussian distribution. A sketch is given below.
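If two models are scored on the same folds, their per-fold scores are paired and scipy.stats.ttest_rel applies (scipy.stats.ttest_ind is for independent samples). The grades from the earlier A/B example are reused here under the assumption that they are paired measurements.

```python
# Paired t-test sketch on per-fold scores of two models (assumes SciPy).
from scipy import stats

model_a = [15, 10, 12, 19, 5, 7]   # per-fold grades of model A
model_b = [14, 11, 11, 12, 6, 7]   # the same folds scored by model B

t_stat, p_value = stats.ttest_rel(model_a, model_b)
print(t_stat, p_value)   # reject the null of "no difference" only if p is small
```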

- For example, we might run 5-fold cross-validation and compute an F-score for every fold. Perhaps the F-scores are 92.4, 93.9, 96.1, 92.2 and 94.4, which gives an average F-score of 93.8 over the 5 folds. We can compute the standard deviation of this set of F-scores, assume that the distribution of scores is approximately Gaussian, and calculate the 95% confidence interval.
- In Python, the relevant tests are scipy.stats.ttest_ind and scipy.stats.ttest_rel:
  https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind

Further reading on the t-test:
- How the t-test works and how to implement it in Python (https:/)
- Running a t-test with Python (https:/)

Summary
- Evaluation methods: Holdout method (留出法), K-fold cross-validation (K折交叉验证法), Bootstrapping (自助法)
- Evaluation metrics: Accuracy, Precision, Recall, F-score, AUC
- Model selection
