Data Mining Final Exam Online Test — Answers

2018-12-04 21:40

A food chain store's weekly transactions are recorded in the table below, where each transaction lists the items sold in one cash-register sale. Assuming supmin = 20% and confmin = 40%, use the Apriori algorithm to compute the generated association rules, indicating the candidate set and the large (frequent) itemset for each database scan.

Transaction  Items
T1           Bread, Jelly, Peanut Butter
T2           Bread, Peanut Butter
T3           Bread, Milk, Peanut Butter
T4           Beer, Bread
T5           Beer, Milk

Solution: 1) Scan the database and count the support of each candidate.

C1
Itemset          Support
{Bread}          4/5
{Peanut Butter}  3/5
{Milk}           2/5
{Beer}           2/5
{Jelly}          1/5

2) Compare each candidate's support with the minimum support to obtain the frequent itemset L1.

L1
Itemset          Support
{Bread}          4/5
{Peanut Butter}  3/5
{Milk}           2/5
{Beer}           2/5
{Jelly}          1/5

(Since supmin = 20%, even {Jelly} with support 1/5 = 20% qualifies, so all five 1-itemsets are frequent.)
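The support counts in C1 and L1 can be reproduced with a short script. This is a minimal sketch: the item names are English renderings of the originals and all function names are our own.

```python
from fractions import Fraction

# The five transactions from the table above.
transactions = [
    {"bread", "jelly", "peanut butter"},   # T1
    {"bread", "peanut butter"},            # T2
    {"bread", "milk", "peanut butter"},    # T3
    {"beer", "bread"},                     # T4
    {"beer", "milk"},                      # T5
]
SUPMIN = Fraction(1, 5)  # supmin = 20%

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))

# C1: every single item with its support; L1: those meeting supmin.
c1 = {frozenset([i]): support(frozenset([i])) for i in set().union(*transactions)}
l1 = {s: v for s, v in c1.items() if v >= SUPMIN}
```

With supmin at exactly 20%, the filter keeps all five singletons, matching the L1 table.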

3) Generate the candidate set C2 from L1.

C2
Itemset
{Bread, Peanut Butter}
{Bread, Milk}
{Bread, Beer}
{Bread, Jelly}
{Peanut Butter, Milk}
{Peanut Butter, Beer}
{Peanut Butter, Jelly}
{Milk, Beer}
{Milk, Jelly}
{Beer, Jelly}

4) Scan the database and count the support of each candidate.

C2
Itemset                   Support
{Bread, Peanut Butter}    3/5
{Bread, Milk}             1/5
{Bread, Beer}             1/5
{Bread, Jelly}            1/5
{Peanut Butter, Milk}     1/5
{Peanut Butter, Beer}     0
{Peanut Butter, Jelly}    1/5
{Milk, Beer}              1/5
{Milk, Jelly}             0
{Beer, Jelly}             0

5) Compare candidate supports with the minimum support to obtain the frequent itemset L2.

L2
Itemset                   Support
{Bread, Peanut Butter}    3/5
{Bread, Milk}             1/5
{Bread, Beer}             1/5
{Bread, Jelly}            1/5
{Peanut Butter, Milk}     1/5
{Peanut Butter, Jelly}    1/5
{Milk, Beer}              1/5

6) Generate the candidate set C3 from L2.

C3
Itemset
{Bread, Peanut Butter, Milk}
{Bread, Peanut Butter, Beer}
{Bread, Peanut Butter, Jelly}
{Bread, Milk, Beer}
{Bread, Milk, Jelly}
{Bread, Beer, Jelly}
{Peanut Butter, Milk, Jelly}
{Peanut Butter, Milk, Beer}

7) Scan the database and count the support of each candidate.

C3
Itemset                        Support
{Bread, Peanut Butter, Milk}   1/5
{Bread, Peanut Butter, Beer}   0
{Bread, Peanut Butter, Jelly}  1/5
{Bread, Milk, Beer}            0
{Bread, Milk, Jelly}           0
{Bread, Beer, Jelly}           0
{Peanut Butter, Milk, Jelly}   0
{Peanut Butter, Milk, Beer}    0

8) Compare candidate supports with the minimum support to obtain the frequent itemset L3.

L3
Itemset                        Support
{Bread, Peanut Butter, Milk}   1/5
{Bread, Peanut Butter, Jelly}  1/5

Next, compute the association rules.
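The level-wise generate-and-count loop above (C1 → L1 → C2 → L2 → C3 → L3) can be sketched as follows. This is a minimal illustration with assumed English item names; it uses a simple union-based join and omits Apriori's subset-pruning step, so it enumerates the same eight C3 candidates as the solution.

```python
from fractions import Fraction

transactions = [
    {"bread", "jelly", "peanut butter"},
    {"bread", "peanut butter"},
    {"bread", "milk", "peanut butter"},
    {"beer", "bread"},
    {"beer", "milk"},
]
SUPMIN = Fraction(1, 5)  # supmin = 20%

def support(itemset):
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))

def apriori():
    frequent = {}  # maps each frequent itemset to its support, across all levels
    level = [frozenset([i]) for i in set().union(*transactions)]
    k = 1
    while level:
        counted = {s: support(s) for s in level}           # scan + count (Ck)
        current = {s: v for s, v in counted.items() if v >= SUPMIN}  # Lk
        frequent.update(current)
        # join step: unions of frequent k-itemsets that yield (k+1)-itemsets
        k += 1
        level = list({a | b for a in current for b in current if len(a | b) == k})
    return frequent

freq = apriori()
```

The loop terminates when no candidate of the next size is frequent; here it stops after the single 4-itemset candidate has support 0.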

<1> The nonempty proper subsets of {Bread, Peanut Butter, Milk} are {Bread, Peanut Butter}, {Bread, Milk}, {Peanut Butter, Milk}, {Bread}, {Peanut Butter}, {Milk}.

{Bread, Peanut Butter} → {Milk}: confidence = (1/5)/(3/5) = 33.3%
{Bread, Milk} → {Peanut Butter}: confidence = (1/5)/(1/5) = 100%
{Peanut Butter, Milk} → {Bread}: confidence = (1/5)/(1/5) = 100%
{Bread} → {Peanut Butter, Milk}: confidence = (1/5)/(4/5) = 25%
{Peanut Butter} → {Bread, Milk}: confidence = (1/5)/(3/5) = 33.3%
{Milk} → {Bread, Peanut Butter}: confidence = (1/5)/(2/5) = 50%

Since confmin = 40%, the strong association rules are {Bread, Milk} → {Peanut Butter}, {Peanut Butter, Milk} → {Bread}, and {Milk} → {Bread, Peanut Butter}.

<2> The nonempty proper subsets of {Bread, Peanut Butter, Jelly} are {Bread, Peanut Butter}, {Bread, Jelly}, {Peanut Butter, Jelly}, {Bread}, {Peanut Butter}, {Jelly}.

{Bread, Peanut Butter} → {Jelly}: confidence = (1/5)/(3/5) = 33.3%
{Bread, Jelly} → {Peanut Butter}: confidence = (1/5)/(1/5) = 100%
{Peanut Butter, Jelly} → {Bread}: confidence = (1/5)/(1/5) = 100%
{Bread} → {Peanut Butter, Jelly}: confidence = (1/5)/(4/5) = 25%
{Peanut Butter} → {Bread, Jelly}: confidence = (1/5)/(3/5) = 33.3%
{Jelly} → {Bread, Peanut Butter}: confidence = (1/5)/(1/5) = 100%

So the strong association rules are {Bread, Jelly} → {Peanut Butter}, {Peanut Butter, Jelly} → {Bread}, and {Jelly} → {Bread, Peanut Butter}.
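The confidence arithmetic in <1> and <2> can be checked mechanically: the confidence of X → Y is support(X ∪ Y) / support(X), and a rule is strong when this reaches confmin = 40%. A sketch with assumed English item names and our own helper names:

```python
from fractions import Fraction
from itertools import combinations

transactions = [
    {"bread", "jelly", "peanut butter"},
    {"bread", "peanut butter"},
    {"bread", "milk", "peanut butter"},
    {"beer", "bread"},
    {"beer", "milk"},
]

def support(itemset):
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))

def strong_rules(itemset, confmin=Fraction(2, 5)):
    """All rules lhs -> (itemset - lhs) whose confidence meets confmin (40%)."""
    rules = []
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = support(itemset) / support(lhs)
            if conf >= confmin:
                rules.append((lhs, itemset - lhs, conf))
    return rules

rules = strong_rules(frozenset({"bread", "peanut butter", "milk"}))
```

For {Bread, Peanut Butter, Milk} this keeps exactly the three strong rules listed above.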

The following shows a history of customers with their incomes, ages and an attribute called "Have_iPhone" indicating whether they have an iPhone. We also indicate whether they will buy an iPad or not in the last column.

No.  Income  Age    Have_iPhone  Buy_iPad
1    high    young  yes          yes
2    high    old    yes          yes
3    medium  young  no           yes
4    high    old    no           yes
5    medium  young  no           no
6    medium  young  no           no
7    medium  old    no           no
8    medium  old    no           no

(a) We want to train a CART decision tree classifier to predict whether a new customer will buy an iPad or not. The value of the attribute Buy_iPad is defined as the label of each record.

(i) Please find a CART decision tree according to the above example. In the decision tree, whenever we reach a node containing at most 3 records, we stop splitting that node.

(ii) Consider a new young customer whose income is medium and who has an iPhone. Please predict whether this new customer will buy an iPad or not.

(b) What is the difference between the C4.5 decision tree and the ID3 decision tree? Why is there a difference?

Solution: (a)(i) The expected information of the given sample is:

Info(D) = -(4/8) log2(4/8) - (4/8) log2(4/8) = 1

For the attribute Income:

Info(high) = -(3/3) log2(3/3) = 0 (all three high-income records are labeled "yes")
Info(medium) = -(1/5) log2(1/5) - (4/5) log2(4/5) = 0.72193

Expected information: E(Income) = (3/8) × 0 + (5/8) × 0.72193 = 0.45121

Information gain: Gain(Income) = 1 - E(Income) = 0.54879

Computing in the same way:

Gain(Age) = 0 (both young and old split the labels 2 "yes" / 2 "no")
Gain(Have_iPhone) = 0.31128

Income has the largest gain of the three attributes, so Income is selected as the best feature, and the root generates two child nodes. The high branch (records 1, 2, 4, all "yes", at most 3 records) is a leaf. For the medium node, the same method is applied to choose the best of the remaining features Age and Have_iPhone; the result is Age, and both of its children (young: 3 records, majority "no"; old: 2 records, all "no") stop splitting. The resulting CART tree is:

Income
  high   -> Yes
  medium -> Age
              young -> No
              old   -> No
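The gain values above can be verified with a short script. Note that this solution ranks attributes by entropy-based information gain (ID3-style) even though the question names CART; the sketch below follows the solution's method, with names of our own choosing.

```python
from math import log2
from collections import Counter

rows = [  # (income, age, have_iphone, buy_ipad) from the table above
    ("high", "young", "yes", "yes"),
    ("high", "old", "yes", "yes"),
    ("medium", "young", "no", "yes"),
    ("high", "old", "no", "yes"),
    ("medium", "young", "no", "no"),
    ("medium", "young", "no", "no"),
    ("medium", "old", "no", "no"),
    ("medium", "old", "no", "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(attr_index):
    """Information gain of the attribute in column attr_index."""
    labels = [r[-1] for r in rows]
    total = entropy(labels)
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        total -= len(subset) / len(rows) * entropy(subset)
    return total

# gain(0) ≈ 0.5488 (Income), gain(1) = 0 (Age), gain(2) ≈ 0.3113 (Have_iPhone)
```

Income wins by a clear margin, which is why it becomes the root split.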

(ii) Following the tree, this new young, medium-income customer with an iPhone falls on the medium branch and then the young branch, so the prediction is that the customer will not buy an iPad.

(b) The C4.5 decision tree algorithm is similar to ID3, but it is an improvement on it. When growing the tree, ID3 performs feature selection using information gain, choosing the feature with the largest information gain; C4.5 instead selects features using the information gain ratio, choosing the feature with the largest gain ratio. The difference exists because the magnitude of information gain is relative to the training set and has no absolute meaning: when classification is difficult, that is, when the empirical entropy of the training set is large, information gains tend to be large, and conversely they tend to be small. Using the gain ratio corrects this problem.
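The correction C4.5 makes can be shown concretely: the gain ratio divides an attribute's information gain by the attribute's own split information. A sketch for the Income attribute from part (a); the variable names and the decision to reuse the entropy helper are ours.

```python
from math import log2
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

# Income column and Buy_iPad labels from the table above.
income = ["high", "high", "medium", "high", "medium", "medium", "medium", "medium"]
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]

# ID3: information gain of Income.
gain = entropy(labels) - sum(
    (income.count(v) / len(income))
    * entropy([l for i, l in zip(income, labels) if i == v])
    for v in set(income)
)

# C4.5: divide by the split information of the attribute itself, which
# penalises attributes that fragment the data into many values.
split_info = entropy(income)   # -(3/8)log2(3/8) - (5/8)log2(5/8) ≈ 0.9544
gain_ratio = gain / split_info # ≈ 0.575
```

An attribute with many distinct values has a large split information, so its gain ratio is pulled down relative to its raw gain.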

Consider the following eight two-dimensional data points:

x1: (23, 12), x2: (6, 6), x3: (15, 0), x4: (15, 28), x5: (20, 9), x6: (8, 9), x7: (20, 11), x8: (8, 13). Consider the k-means algorithm.

Please answer the following questions. You are required to show the information about each final cluster (including the mean of the cluster and all data points in this cluster). You can consider writing a program for this part but you are not required to submit the program.

(a) If k = 2 and the initial means are (20, 9) and (8, 9), what is the output of the algorithm?

(b) If k = 2 and the initial means are (15, 0) and (15, 29), what is the output of the algorithm?

Solution: (a) With k = 2 and initial means (20, 9) and (8, 9), the iterations are:

Iteration  M1          M2            K1                  K2
1          (20, 9)     (8, 9)        x1, x3, x4, x5, x7  x2, x6, x8
2          (18.6, 12)  (7.33, 9.33)  x1, x4, x5, x7      x2, x3, x6, x8
3          (19.5, 15)  (9.25, 7)     x1, x4, x5, x7      x2, x3, x6, x8

The assignments no longer change, so the algorithm outputs two clusters:

K1 = {x1, x4, x5, x7} = {(23, 12), (15, 28), (20, 9), (20, 11)}, mean (19.5, 15)
K2 = {x2, x3, x6, x8} = {(6, 6), (15, 0), (8, 9), (8, 13)}, mean (9.25, 7)

(b) With k = 2 and initial means (15, 0) and (15, 29), the iterations are:

Iteration  M1             M2        K1                          K2
1          (15, 0)        (15, 29)  x1, x2, x3, x5, x6, x7, x8  x4
2          (14.29, 8.57)  (15, 28)  x1, x2, x3, x5, x6, x7, x8  x4

The assignments no longer change, so the algorithm outputs two clusters:

K1 = {x1, x2, x3, x5, x6, x7, x8} = {(23, 12), (6, 6), (15, 0), (20, 9), (8, 9), (20, 11), (8, 13)}, mean (14.29, 8.57)
K2 = {x4} = {(15, 28)}, mean (15, 28)
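Both runs can be reproduced with a small Lloyd-style k-means sketch using squared Euclidean distance; the code and names below are illustrative, not part of the exam.

```python
# The eight data points x1..x8.
points = [(23, 12), (6, 6), (15, 0), (15, 28), (20, 9), (8, 9), (20, 11), (8, 13)]

def kmeans(means):
    """Lloyd iterations: assign to nearest mean, recompute means, repeat."""
    while True:
        clusters = [[] for _ in means]
        for p in points:
            d = [(p[0] - m[0]) ** 2 + (p[1] - m[1]) ** 2 for m in means]
            clusters[d.index(min(d))].append(p)
        new_means = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            for c in clusters
        ]
        if new_means == means:        # assignments stable -> converged
            return means, clusters
        means = new_means

means_a, clusters_a = kmeans([(20.0, 9.0), (8.0, 9.0)])    # part (a)
means_b, clusters_b = kmeans([(15.0, 0.0), (15.0, 29.0)])  # part (b)
```

Run (b) illustrates how a poor initialisation can leave a singleton cluster: x4 = (15, 28) ends up alone while the other seven points share one cluster.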

