c) Results of 3-fold cross-validation
library(e1071)  # provides svm()
n <- 699; zz1 <- 1:n
zz2 <- rep(1:3, ceiling(699/3))[1:n]
set.seed(100)
zz2 <- sample(zz2, n)        # random fold labels, one per observation
nmse <- list(NULL, NULL)     # [[1]] training errors, [[2]] test errors
c <- breast_cancer_median
for (i in 1:3) {
  data.train <- c[-which(zz2 == i), ]
  data.test  <- c[which(zz2 == i), ]
  d.train <- svm(factor(btype) ~ ., data = data.train)
  table1 <- table(predict(d.train, data.train, type = "class"), data.train$btype)
  table2 <- table(predict(d.train, data.test, type = "class"), data.test$btype)
  nmse[[1]][i] <- 1 - sum(diag(table1)) / nrow(data.train)
  nmse[[2]][i] <- 1 - sum(diag(table2)) / sum(table2)
  cat("svm method fold", i, ":\n")
  cat("training-set error rate:", nmse[[1]][i], "\n")
  cat("test-set error rate:", nmse[[2]][i], "\n")
}
NMSE <- array()
NMSE[1] <- sum(nmse[[1]]) / 3
NMSE[2] <- sum(nmse[[2]]) / 3
cat("svm method average error rate on the training set:", NMSE[1], "\n")
cat("svm method average error rate on the test set:", NMSE[2], "\n")
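The fold assignment used above (`rep(1:3, ...)` followed by `sample()`) can be checked in isolation: with n = 699 and ceiling(699/3) = 233 repetitions, each fold label appears exactly 233 times, and `sample()` only permutes which observation gets which label.

```r
# Sketch: verify that the 3-fold assignment yields balanced folds.
n <- 699
zz2 <- rep(1:3, ceiling(n / 3))[1:n]  # exactly 233 copies each of 1, 2, 3
set.seed(100)
zz2 <- sample(zz2, n)                 # permute the fold labels
print(table(zz2))                     # each of the 3 folds has 233 observations
```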
Results:
svm method fold 1:
training-set error rate: 0.02575107  test-set error rate: 0.03433476
svm method fold 2:
training-set error rate: 0.027897    test-set error rate: 0.04291845
svm method fold 3:
training-set error rate: 0.02145923  test-set error rate: 0.03862661
svm method average error rate on the training set: 0.02503577
svm method average error rate on the test set: 0.03862661
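The reported averages can be reproduced directly from the three per-fold error rates (values copied from the output above):

```r
# Average the per-fold error rates reported above for the SVM model.
train_err <- c(0.02575107, 0.02789700, 0.02145923)
test_err  <- c(0.03433476, 0.04291845, 0.03862661)
mean(train_err)  # 0.02503577, matching the reported training average
mean(test_err)   # 0.03862661, matching the reported test average
```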
4) Random forest method (package: randomForest)
a) Fit a model on the median-imputed data and output the variable importance
# Install the randomForest package first
library(randomForest)
# Note: oob.times is a component of the fitted object, not an argument,
# so it is dropped from the call here.
r.breast <- randomForest(factor(btype) ~ ., data = breast_cancer_median,
                         ntree = 2000, importance = TRUE, replace = TRUE,
                         keep.inbag = TRUE, norm.votes = FALSE, proximity = TRUE)
r.breast
summary(r.breast)
imp <- importance(r.breast); imp
impvar <- imp[order(imp[, 3], decreasing = TRUE), ]; impvar
varImpPlot(r.breast)
getTree(r.breast, k = 1, labelVar = FALSE)
Results:
Confusion matrix:
      2   4  class.error
2   446  12   0.02620087
4     9 232   0.03734440
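Each class.error value is that class's off-diagonal count divided by its row total, which can be verified from the confusion-matrix counts above:

```r
# Recompute class.error from the confusion-matrix counts reported above.
benign    <- c(446, 12)       # class 2: correctly classified, misclassified
malignant <- c(9, 232)        # class 4: misclassified, correctly classified
benign[2] / sum(benign)       # 12/458 = 0.02620087
malignant[1] / sum(malignant) # 9/241  = 0.03734440
```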
Variable importance (columns: class 2, class 4, MeanDecreaseAccuracy):
                                   2         4  MeanDecreaseAccuracy
bare_nuclei                2.3098674 3.9670653             2.2676480
uniformity_cell_size       1.8742531 2.7070189             1.9318088
clump_thickness            1.8751611 3.5029239             1.9280315
bland_chromatin            1.1095789 3.0180401             1.8088237
uniformity_cell_shape      1.2252772 2.8570378             1.7930345
normal_nucleoli            1.6026017 1.8652728             1.4735789
marginal_adhesion          0.9889558 2.0822943             1.3521871
single_epithelialcell_size 1.1464565 1.0688383             1.0714372
mitoses                    1.0574280 0.9529673             0.9765416
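The ordering step in the code (`imp[order(imp[, 3], decreasing = TRUE), ]`) can be sketched on the MeanDecreaseAccuracy values reported above (names and numbers copied from the output):

```r
# Sketch: rank variables by MeanDecreaseAccuracy, as impvar does.
mda <- c(bare_nuclei = 2.2676480, uniformity_cell_size = 1.9318088,
         clump_thickness = 1.9280315, bland_chromatin = 1.8088237,
         uniformity_cell_shape = 1.7930345, normal_nucleoli = 1.4735789,
         marginal_adhesion = 1.3521871, single_epithelialcell_size = 1.0714372,
         mitoses = 0.9765416)
names(sort(mda, decreasing = TRUE))[1]  # "bare_nuclei" ranks as most important
```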
(Figure: variable importance displayed graphically via varImpPlot.)
b) Results of 3-fold cross-validation
n <- 699; zz1 <- 1:n
zz2 <- rep(1:3, ceiling(699/3))[1:n]
set.seed(100)
zz2 <- sample(zz2, n)
nmse <- list(NULL, NULL)     # [[1]] training errors, [[2]] test errors
c <- breast_cancer_median
for (i in 1:3) {
  data.train <- c[-which(zz2 == i), ]
  data.test  <- c[which(zz2 == i), ]
  d.train <- randomForest(factor(btype) ~ ., data = data.train, ntree = 2000,
                          importance = TRUE, replace = TRUE, keep.inbag = TRUE,
                          norm.votes = FALSE, proximity = TRUE)
  table1 <- table(predict(d.train, data.train, type = "class"), data.train$btype)
  table2 <- table(predict(d.train, data.test, type = "class"), data.test$btype)
  nmse[[1]][i] <- 1 - sum(diag(table1)) / nrow(data.train)
  nmse[[2]][i] <- 1 - sum(diag(table2)) / sum(table2)
  cat("randomForest method fold", i, ":\n")
  cat("training-set error rate:", nmse[[1]][i], "\n")
  cat("test-set error rate:", nmse[[2]][i], "\n")
}
NMSE <- array()
NMSE[1] <- sum(nmse[[1]]) / 3
NMSE[2] <- sum(nmse[[2]]) / 3
cat("randomForest method average error rate on the training set:", NMSE[1], "\n")
cat("randomForest method average error rate on the test set:", NMSE[2], "\n")
Results:
randomForest method fold 1:
training-set error rate: 0  test-set error rate: 0.02575107
randomForest method fold 2:
training-set error rate: 0  test-set error rate: 0.04291845
randomForest method fold 3:
training-set error rate: 0  test-set error rate: 0.03433476
randomForest method average error rate on the training set: 0
randomForest method average error rate on the test set: 0.03433476
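As with the SVM, the reported average follows from the per-fold values above. (The zero training error is expected: predicting on the training set reuses trees that have already seen those rows, so it is not a useful performance estimate; the test-fold error is.)

```r
# Average the per-fold test error rates reported above for the random forest.
test_err <- c(0.02575107, 0.04291845, 0.03433476)
mean(test_err)  # 0.03433476, matching the reported test average
```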
5) Nearest-neighbor method (package: kknn)
a) Select k by looping over candidate values and taking the best accuracy on a held-out test fold (the code below holds out fold 3).
library(kknn)
corr <- array()
m <- 699; zz1 <- 1:m
zz2 <- rep(1:3, ceiling(699/3))[1:m]
set.seed(100)
zz2 <- sample(zz2, m)
data.test  <- breast_cancer_median[zz2 == 3, ]
data.train <- breast_cancer_median[-which(zz2 == 3), ]
for (i in 7:100) {
  a <- kknn(factor(btype) ~ ., test = data.test, train = data.train, k = i)
  tab <- table(data.test$btype, a$fit)
  corr[i] <- sum(diag(tab)) / sum(tab)
}
plot(0, 0.3, xlim = c(6, 100), ylim = c(0.5, 1), xlab = "k", ylab = "accuracy")
for (i in 7:100) points(i, corr[i])
identify(corr)  # click a point to read off its k interactively
The results show that accuracy is highest at k = 11.
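Instead of clicking with identify(), the best k can also be read off programmatically with which.max(); a minimal sketch, using a hypothetical accuracy vector in place of the real `corr` computed above:

```r
# Sketch: pick the k with the highest accuracy without interactive identify().
# `corr` here is a made-up accuracy vector indexed by k, standing in for the
# one filled by the kknn loop (entries below k = 7 stay NA, as in that loop).
corr <- rep(NA_real_, 100)
corr[7:100] <- 0.90
corr[11] <- 0.97           # pretend k = 11 scored best, as the report found
best_k <- which.max(corr)  # which.max skips the NA entries
best_k                     # 11
```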