forked from PnYuan/Machine-Learning_ZhouZhihua
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path周志华《机器学习》课后习题解答系列(五):Ch4.3 - 编程实现ID3算法.html
508 lines (427 loc) · 16.7 KB
/
周志华《机器学习》课后习题解答系列(五):Ch4.3 - 编程实现ID3算法.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
<!DOCTYPE html>
<html>
<head>
<title>周志华《机器学习》课后习题解答系列(五):Ch4.3 - 编程实现ID3算法</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style type="text/css">
/* GitHub stylesheet for MarkdownPad (http://markdownpad.com) */
/* Author: Nicolas Hery - http://nicolashery.com */
/* Version: b13fe65ca28d2e568c6ed5d7f06581183df8f2ff */
/* Source: https://github.com/nicolahery/markdownpad-github */
/* RESET
=============================================================================*/
html, body, div, span, applet, object, iframe, h1, h2, h3, h4, h5, h6, p, blockquote, pre, a, abbr, acronym, address, big, cite, code, del, dfn, em, img, ins, kbd, q, s, samp, small, strike, strong, sub, sup, tt, var, b, u, i, center, dl, dt, dd, ol, ul, li, fieldset, form, label, legend, table, caption, tbody, tfoot, thead, tr, th, td, article, aside, canvas, details, embed, figure, figcaption, footer, header, hgroup, menu, nav, output, ruby, section, summary, time, mark, audio, video {
margin: 0;
padding: 0;
border: 0;
}
/* BODY
=============================================================================*/
body {
font-family: Helvetica, arial, freesans, clean, sans-serif;
font-size: 14px;
line-height: 1.6;
color: #333;
background-color: #fff;
padding: 20px;
max-width: 960px;
margin: 0 auto;
}
body>*:first-child {
margin-top: 0 !important;
}
body>*:last-child {
margin-bottom: 0 !important;
}
/* BLOCKS
=============================================================================*/
p, blockquote, ul, ol, dl, table, pre {
margin: 15px 0;
}
/* HEADERS
=============================================================================*/
h1, h2, h3, h4, h5, h6 {
margin: 20px 0 10px;
padding: 0;
font-weight: bold;
-webkit-font-smoothing: antialiased;
}
h1 tt, h1 code, h2 tt, h2 code, h3 tt, h3 code, h4 tt, h4 code, h5 tt, h5 code, h6 tt, h6 code {
font-size: inherit;
}
h1 {
font-size: 28px;
color: #000;
}
h2 {
font-size: 24px;
border-bottom: 1px solid #ccc;
color: #000;
}
h3 {
font-size: 18px;
}
h4 {
font-size: 16px;
}
h5 {
font-size: 14px;
}
h6 {
color: #777;
font-size: 14px;
}
body>h2:first-child, body>h1:first-child, body>h1:first-child+h2, body>h3:first-child, body>h4:first-child, body>h5:first-child, body>h6:first-child {
margin-top: 0;
padding-top: 0;
}
a:first-child h1, a:first-child h2, a:first-child h3, a:first-child h4, a:first-child h5, a:first-child h6 {
margin-top: 0;
padding-top: 0;
}
h1+p, h2+p, h3+p, h4+p, h5+p, h6+p {
margin-top: 10px;
}
/* LINKS
=============================================================================*/
a {
color: #4183C4;
text-decoration: none;
}
a:hover {
text-decoration: underline;
}
/* LISTS
=============================================================================*/
ul, ol {
padding-left: 30px;
}
ul li > :first-child,
ol li > :first-child,
ul li ul:first-of-type,
ol li ol:first-of-type,
ul li ol:first-of-type,
ol li ul:first-of-type {
margin-top: 0px;
}
ul ul, ul ol, ol ol, ol ul {
margin-bottom: 0;
}
dl {
padding: 0;
}
dl dt {
font-size: 14px;
font-weight: bold;
font-style: italic;
padding: 0;
margin: 15px 0 5px;
}
dl dt:first-child {
padding: 0;
}
dl dt>:first-child {
margin-top: 0px;
}
dl dt>:last-child {
margin-bottom: 0px;
}
dl dd {
margin: 0 0 15px;
padding: 0 15px;
}
dl dd>:first-child {
margin-top: 0px;
}
dl dd>:last-child {
margin-bottom: 0px;
}
/* CODE
=============================================================================*/
pre, code, tt {
font-size: 12px;
font-family: Consolas, "Liberation Mono", Courier, monospace;
}
code, tt {
margin: 0 0px;
padding: 0px 0px;
white-space: nowrap;
border: 1px solid #eaeaea;
background-color: #f8f8f8;
border-radius: 3px;
}
pre>code {
margin: 0;
padding: 0;
white-space: pre;
border: none;
background: transparent;
}
pre {
background-color: #f8f8f8;
border: 1px solid #ccc;
font-size: 13px;
line-height: 19px;
overflow: auto;
padding: 6px 10px;
border-radius: 3px;
}
pre code, pre tt {
background-color: transparent;
border: none;
}
kbd {
-moz-border-bottom-colors: none;
-moz-border-left-colors: none;
-moz-border-right-colors: none;
-moz-border-top-colors: none;
background-color: #DDDDDD;
background-image: linear-gradient(#F1F1F1, #DDDDDD);
background-repeat: repeat-x;
border-color: #DDDDDD #CCCCCC #CCCCCC #DDDDDD;
border-image: none;
border-radius: 2px 2px 2px 2px;
border-style: solid;
border-width: 1px;
font-family: "Helvetica Neue",Helvetica,Arial,sans-serif;
line-height: 10px;
padding: 1px 4px;
}
/* QUOTES
=============================================================================*/
blockquote {
border-left: 4px solid #DDD;
padding: 0 15px;
color: #777;
}
blockquote>:first-child {
margin-top: 0px;
}
blockquote>:last-child {
margin-bottom: 0px;
}
/* HORIZONTAL RULES
=============================================================================*/
hr {
clear: both;
margin: 15px 0;
height: 0px;
overflow: hidden;
border: none;
background: transparent;
border-bottom: 4px solid #ddd;
padding: 0;
}
/* TABLES
=============================================================================*/
table th {
font-weight: bold;
}
table th, table td {
border: 1px solid #ccc;
padding: 6px 13px;
}
table tr {
border-top: 1px solid #ccc;
background-color: #fff;
}
table tr:nth-child(2n) {
background-color: #f8f8f8;
}
/* IMAGES
=============================================================================*/
img {
max-width: 100%
}
</style>
</head>
<body>
<p>这里采用<strong>Python-sklearn</strong>的方式,环境搭建可参考<a href="http://blog.csdn.net/snoopy_yuan/article/details/61211639"> 数据挖掘入门:Python开发环境搭建(eclipse-pydev模式)</a>.</p>
<p>相关答案和源代码托管在我的Github上:<a href="https://github.com/PY131/Machine-Learning_ZhouZhihua">PY131/Machine-Learning_ZhouZhihua</a>.</p>
<h2>4.3 编程实现ID3算法</h2>
<blockquote>
<p><img src="Ch4/4.3.png" />
<img src="Ch4/4.3.1.png" /></p>
</blockquote>
<p>这里采用了自己编程实现和调用sklearn库函数两种不同的方式,详细解答和编码过程如下:(<a href="https://github.com/PY131/Machine-Learning_ZhouZhihua/tree/master/ch4_decision_tree/4.3_ID3">查看完整代码</a>):</p>
<h3>1.获取数据、查看数据、预分析</h3>
<p>这里由数据表生成.csv文件(注意<strong>中文字符的编码格式</strong>)。采用<strong>pandas<strong>.read_csv()读取数据,然后采用</strong>seaborn</strong>可视化部分数据来进行初步判断。</p>
<p>观察数据可知,变量包含'色泽'等8个属性,其中6个<strong>标称属性</strong>,2个<strong>连续属性</strong>。类标签为西瓜是否好瓜(两类):</p>
<p>下面是一些数据可视化图像:</p>
<p>下图是含糖率和密度两个连续变量的可视化图,这个图在之前的练习中也有出现:</p>
<p><img src="Ch4/4.3.2.png" /></p>
<p>下图则是更多的变量两两组合可视化图:</p>
<p><img src="Ch4/4.3.3.png" /></p>
<p>基于可视化手段进行一些分析,看以大概了解数据的分布及其与类别的关系。</p>
<h3>2.自己编程实现ID3算法</h3>
<p>在进行编程之前,先做一些分析如下:</p>
<ul>
<li>对于树形结构,可以创建<strong>节点类</strong>来进行操作,根据书p74决策树基本算法,广泛采用<strong>递归操作</strong>;</li>
<li>对于连续属性(密度、含糖率),参考书p83,对其进行<strong>离散化</strong>(可采用二分法);</li>
<li>对于标称属性(色泽、纹理等),考虑<strong>Python字典</strong>,方便操作其特征类别(如:'色泽' -> '青绿','乌黑','浅白');</li>
<li>由于数据集完整,这里暂不考虑缺值问题(若需考虑,可参考书p86-87的加权思路 -> C4.5算法);</li>
</ul>
<p>下面是实现过程:</p>
<h5>2.1 建立决策树节点类</h5>
<p>该节点类包含当前节点的属性,向下划分的属性取值,节点的类标签(叶节点有效,为通用性而保留所有节点类标签);</p>
<p>样例代码如下:</p>
<pre><code>'''
definition of decision node class
@variable attr: attribution as parent for a new branching
@variable attr_down: dict: {key, value}
key: categoric: categoric attr_value
continuous: '<=div_value' for small part
'>div_value' for big part
value: children (Node class)
@variable label: class label (the majority of current sample labels)
'''
class Node(object):
def __init__(self, attr_init=None, attr_down_init={}, label_init=None):
self.attr = attr_init
self.attr_down = attr_down_init
self.label = label_init
</code></pre>
<h5>2.2 递归实现决策树生成算法主体</h5>
<p>基本的决策树生成算法参照书p74-图4.2,如下所示:</p>
<blockquote>
<p><img src="Ch4/4.3.4.png" /></p>
</blockquote>
<p>下面是算法主体代码,注意对连续变量和离散变量的不同操作:</p>
<pre><code>'''
Branching for decision tree using recursion
@param df: the pandas dataframe of the data_set
@return root: Node, the root node of decision tree
'''
def TreeGenerate(df):
# generating a new root node
new_node = Node(None, None, {})
label_arr = df[df.columns[-1]]
label_count = NodeLabel(label_arr)
if label_count: # assert the label_count isn's empty
new_node.label= max(label_count, key=label_count.get)
# end if there is only 1 class in current node data
# end if attribution array is empty
if len(label_count) == 1 or len(label_arr) == 0:
return new_node
# get the optimal attribution for a new branching
new_node.attr, div_value = OptAttr(df)
# recursion
if div_value == 0: # categoric variable
value_count = ValueCount(df[new_node.attr])
for value in value_count:
df_v = df[ df[new_node.attr].isin([value]) ] # get sub set
# delete current attribution
df_v = df_v.drop(new_node.attr, 1)
new_node.attr_down[value] = TreeGenerate(df_v)
else: # continuous variable # left and right child
value_l = "<=%.3f" % div_value
value_r = ">%.3f" % div_value
df_v_l = df[ df[new_node.attr] <= div_value ] # get sub set
df_v_r = df[ df[new_node.attr] > div_value ]
new_node.attr_down[value_l] = TreeGenerate(df_v_l)
new_node.attr_down[value_r] = TreeGenerate(df_v_r)
return new_node
</code></pre>
<h5>2.3 最优划分属性选择-信息增益判断</h5>
<p><strong>ID3算法</strong>采用<strong>信息增益最大化<strong>来实现最优划分属性的选择,这里主要的挑战是离散和连续两种属性变量的分别操作。对于离散变量(categoric variable),参考书p75-77内容实现,对于连续变量(continuous variable),采用书p83-85所介绍的</strong>二分法</strong>实现。</p>
<p>相关内容如<strong>信息熵、信息增益最大化、二分法</strong>等可参考书p75-76及p84页内容。</p>
<p>具体的实现代码:<a href="https://github.com/PY131/Machine-Learning_ZhouZhihua/blob/master/ch4_decision_tree/4.3_ID3/src/decision_tree.py">查看完整代码</a>。</p>
<p>如下所示为综合了离散类别变量和连续变量的信息增益计算代码实现:</p>
<pre><code>'''
calculating the information gain of an attribution
@param df: dataframe, the pandas dataframe of the data_set
@param attr_id: the target attribution in df
@return info_gain: the information gain of current attribution
@return div_value: for discrete variable, value = 0
for continuous variable, value = t (the division value)
'''
def InfoGain(df, index):
info_gain = InfoEnt(df.values[:,-1]) # info_gain for the whole label
div_value = 0 # div_value for continuous attribute
n = len(df[index]) # the number of sample
# 1.for continuous variable using method of bisection
if df[index].dtype == (float, int):
sub_info_ent = {} # store the div_value (div) and it's subset entropy
df = df.sort([index], ascending=1) # sorting via column
df = df.reset_index(drop=True)
data_arr = df[index]
label_arr = df[df.columns[-1]]
for i in range(n-1):
div = (data_arr[i] + data_arr[i+1]) / 2
sub_info_ent[div] = ( (i+1) * InfoEnt(label_arr[0:i+1]) / n ) \
+ ( (n-i-1) * InfoEnt(label_arr[i+1:-1]) / n )
# our goal is to get the min subset entropy sum and it's divide value
div_value, sub_info_ent_max = min(sub_info_ent.items(), key=lambda x: x[1])
info_gain -= sub_info_ent_max
# 2.for discrete variable (categoric variable)
else:
data_arr = df[index]
label_arr = df[df.columns[-1]]
value_count = ValueCount(data_arr)
for key in value_count:
key_label_arr = label_arr[data_arr == key]
info_gain -= value_count[key] * InfoEnt(key_label_arr) / n
return info_gain, div_value
</code></pre>
<h5>2.4 划分训练集和测试集进行测试</h5>
<p>首先给出<strong>预测函数</strong>,采用简单的向下搜索即可实现。<a href="https://github.com/PY131/Machine-Learning_ZhouZhihua/blob/master/ch4_decision_tree/4.3_ID3/src/decision_tree.py">查看完整代码</a>。</p>
<p>通过多次划分训练集和测试集,评估出所生成的完全决策树的预测精度,下面是几种不同情况下的结果和分析:</p>
<ul>
<li>
<p>训练集为整个数据样本的情况:</p>
<pre><code>accuracy: 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
average accuracy: 1.000
</code></pre>
<p>此时预测准度为100%,结果证明了<strong>题4-1</strong>。</p>
<p>采用<strong>pydotplus.graphviz</strong>绘图库,可以画出这样一棵完全决策树如下图(<a href="https://github.com/PY131/Machine-Learning_ZhouZhihua/blob/master/ch4_decision_tree/4.3_ID3/src/decision_tree.py">点击查看绘图程序</a>):</p>
<p><img src="Ch4/4.3_decision_tree_ID3.png" /></p>
</li>
<li>
<p>训练集与测试集各取一半:</p>
<pre><code>accuracy: 0.556 0.778 0.333 0.778 0.444 0.333 0.667 0.444 0.778 0.333
average accuracy: 0.544
</code></pre>
<p>多次实验预测准度均在55%左右(几乎等同于随机预测),说明此时模型无实用性。</p>
</li>
<li>
<p>按照K折交叉验证模型(k=5):</p>
<pre><code>accuracy: 1.000 0.000 0.667 1.000 0.000
average accuracy: 0.533
</code></pre>
<p>结果精度依然很不理想,可以看到,不同的训练集与测试集的划分导致结果差别很大,这主要是由于数据量太少的缘故。</p>
</li>
<li>
<p>只分析离散属性:</p>
<pre><code>accuracy: 1.000 0.000 0.333 0.667 0.000
average accuracy: 0.400
</code></pre>
<p>此时由于特征信息减少,模型更差了。</p>
</li>
</ul>
<p>综上,为提高决策树泛化预测精度,需要进一步对其进行<strong>剪枝</strong>。</p>
<p>关于剪枝实现可参考下一题。</p>
<hr />
<p>相关参考:</p>
<p>1.seaborn/matplotlib可视化、中文字符处理:</p>
<ul>
<li><a href="http://seaborn.pydata.org/tutorial/categorical.html">seaborn官网 - Plotting with categorical data</a></li>
<li><a href="https://github.com/mwaskom/seaborn/issues/1009">seaborn显示中文 - The solution of display chinese in seaborn plot</a></li>
<li><a href="https://segmentfault.com/a/1190000000621721">Linux下解决matplotlib中文乱码的方法</a></li>
<li><a href="http://www.cnblogs.com/TsengYuen/archive/2012/05/22/2513290.html">Unicode和Python的中文处理</a></li>
</ul>
<p>2.python基础知识: </p>
<ul>
<li><a href="http://blog.csdn.net/moodytong/article/details/7647684">Python学习 - python字典</a></li>
<li><a href="https://zhidao.baidu.com/question/328905513600600605.html">Python从字符串中提取数字 - 正则表达式</a></li>
</ul>
<p>3.决策树绘制相关:</p>
<ul>
<li><a href="http://scikit-learn.org/stable/modules/tree.html">sklearn官网 - Decision Trees</a></li>
<li><a href="http://pydotplus.readthedocs.io/reference.html#module-pydotplus.graphviz">pydotplus官网 - graphviz的API</a></li>
<li><a href="https://wenku.baidu.com/view/207d623b76a20029bc642de0.html">GraphViz常用属性学习笔记</a></li>
<li><a href="http://www.aiuxian.com/article/p-835489.html">pydot包的安装和使用</a></li>
</ul>
<hr />
</body>
</html>
<!-- This document was created with MarkdownPad, the Markdown editor for Windows (http://markdownpad.com) -->