1
00:00:17,320 --> 00:00:24,240
Hi! In the last class, we looked at a bare-bones
algorithm for constructing decision trees.

2
00:00:24,240 --> 00:00:29,450
To get an industrial strength decision tree
induction algorithm, we need to add some more

3
00:00:29,450 --> 00:00:32,870
complicated stuff, notably pruning.

4
00:00:32,870 --> 00:00:37,949
We're going to talk in this [lesson] about
pruning decision trees.

5
00:00:37,949 --> 00:00:41,600
Here's a guy pruning a tree, and that's a
good image to have in your mind when we're

6
00:00:41,600 --> 00:00:43,110
talking about decision trees.

7
00:00:43,110 --> 00:00:47,460
We're looking at those little twigs and little
branches around the edge of the tree, seeing

8
00:00:47,460 --> 00:00:52,340
if they're worthwhile, and snipping them off
if they're not contributing.

9
00:00:52,340 --> 00:00:58,570
That way, we'll get a decision tree that might
perform worse on the training data, but perhaps

10
00:00:58,579 --> 00:01:02,680
generalizes better to independent test data.

11
00:01:02,680 --> 00:01:07,100
That's what we want.

12
00:01:07,100 --> 00:01:08,350
Here's the weather data again.

13
00:01:08,350 --> 00:01:12,490
I'm sorry to keep harking back to the weather
data, but it's just a nice simple example

14
00:01:12,490 --> 00:01:13,970
that we all know now.

15
00:01:13,970 --> 00:01:15,920
I've added here a new attribute.

16
00:01:15,920 --> 00:01:21,090
I call it an ID code attribute, which is different
for each instance.

17
00:01:21,090 --> 00:01:25,530
I've just given them an identification code:
a, b, c, and so on.

18
00:01:25,530 --> 00:01:29,119
Let's think back to the last lesson: what's
going to happen when we consider which is

19
00:01:29,119 --> 00:01:34,170
the best attribute to split on at the root,
the first decision.

20
00:01:34,170 --> 00:01:39,759
We're going to be looking for the information
gain from each of our attributes separately.

21
00:01:39,759 --> 00:01:43,630
We're going to gain a lot of information by
choosing the ID code.

22
00:01:43,630 --> 00:01:48,040
Actually, if you split on the ID code, that
tells you everything about the instance we're

23
00:01:48,040 --> 00:01:49,049
looking at.

24
00:01:49,049 --> 00:01:54,149
That's going to be a maximal amount of information
gain, and clearly we're going to split on

25
00:01:54,149 --> 00:01:58,590
that attribute at the root node of the decision
tree.

26
00:01:58,590 --> 00:02:05,590
But that's not going to generalize at all
to new weather instances.
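
A rough sketch of why the ID code wins at the root. It assumes the standard 14-instance weather data (9 "yes", 5 "no") and the usual per-value class counts for outlook; the function names are illustrative and are not part of Weka.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def info_gain(parent_counts, branch_counts):
    """Parent entropy minus the size-weighted entropy of the branches."""
    total = sum(parent_counts)
    remainder = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - remainder


# Assumed class counts [yes, no] for the standard weather data.
parent = [9, 5]
outlook = [[2, 3], [4, 0], [3, 2]]      # sunny, overcast, rainy
id_code = [[1, 0]] * 9 + [[0, 1]] * 5   # one instance per branch, so each branch is pure

gain_outlook = info_gain(parent, outlook)
gain_id_code = info_gain(parent, id_code)
print(f"outlook: {gain_outlook:.3f} bits, ID code: {gain_id_code:.3f} bits")
```

Because every ID code branch holds a single instance, its entropy is zero and the gain equals the full entropy of the class (about 0.940 bits), which no other attribute can beat.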

27
00:02:07,119 --> 00:02:12,790
To get around this problem, having constructed
a decision tree, decision tree algorithms

28
00:02:12,790 --> 00:02:15,230
then automatically prune it back.

29
00:02:15,230 --> 00:02:22,230
You don't see any of this; it just happens
when you start the algorithm in Weka.

30
00:02:22,350 --> 00:02:28,419
How do we prune? There are some simple techniques
for pruning, and some more complicated techniques

31
00:02:28,419 --> 00:02:29,410
for pruning.

32
00:02:29,410 --> 00:02:36,410
A very simple technique is to not continue
splitting if the nodes get very small.

33
00:02:37,079 --> 00:02:43,850
I said in the last lesson that we're going
to keep splitting until each node has just

34
00:02:43,850 --> 00:02:46,130
one class associated with it.

35
00:02:46,130 --> 00:02:51,070
Perhaps that's not such a good idea. If we
have a very small node with a couple of instances,

36
00:02:51,070 --> 00:02:53,470
it's probably not worth splitting that node.

37
00:02:53,470 --> 00:02:56,160
That's actually a parameter in J48.

38
00:02:56,160 --> 00:03:02,690
I've got Weka going here. I'm going to choose J48
and look at the parameters.

39
00:03:08,370 --> 00:03:12,480
There's a parameter called minNumObj.

40
00:03:12,480 --> 00:03:18,190
If I mouse over that parameter, it says "The
minimum number of instances per leaf".

41
00:03:18,190 --> 00:03:22,669
The default value for that is 2.

42
00:03:22,669 --> 00:03:26,560
The second thing we do is to build a full
tree and then work back from the leaves.

43
00:03:26,560 --> 00:03:31,500
It turns out to be better to build a full
tree and prune back rather than trying to

44
00:03:31,500 --> 00:03:35,040
do forward pruning as you're building the
tree.

45
00:03:35,040 --> 00:03:37,880
We apply a statistical test at each stage.

46
00:03:37,880 --> 00:03:39,570
That's the confidenceFactor parameter.

47
00:03:39,570 --> 00:03:40,590
It's here.

48
00:03:40,590 --> 00:03:42,979
The default value is 0.25.

49
00:03:42,979 --> 00:03:48,190
"The confidence factor used for pruning [smaller
values incur more pruning]."
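
A minimal sketch of how a confidence factor can drive pruning, assuming the normal-approximation upper confidence limit on a node's error rate that is commonly given for C4.5-style pruning. The function name and the example counts are hypothetical, and J48's own code may differ in detail.

```python
from math import sqrt
from statistics import NormalDist


def pessimistic_error(errors, n, confidence=0.25):
    """Upper confidence limit on the error rate of a node that
    misclassifies `errors` of its `n` training instances."""
    f = errors / n                             # observed error rate
    z = NormalDist().inv_cdf(1 - confidence)   # smaller confidence gives larger z
    numerator = (f + z * z / (2 * n)
                 + z * sqrt(f / n - f * f / n + z * z / (4 * n * n)))
    return numerator / (1 + z * z / n)


# Hypothetical leaf: 2 errors out of 6 training instances.
default_estimate = pessimistic_error(2, 6)           # confidence factor 0.25
stricter_estimate = pessimistic_error(2, 6, 0.10)    # smaller confidence factor
print(default_estimate, stricter_estimate)
```

A smaller confidence factor inflates the estimated error of small nodes, so a subtree is less likely to look better than the single leaf that would replace it, which fits the tooltip's note that smaller values incur more pruning.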

50
00:03:48,190 --> 00:03:53,519
Then, sometimes it's good to prune an interior
node, and to raise the subtree beneath that

51
00:03:53,519 --> 00:03:57,269
interior node up one level.

52
00:03:57,269 --> 00:03:59,130
That's called subtreeRaising.

53
00:03:59,130 --> 00:04:01,220
That's this parameter here.

54
00:04:01,220 --> 00:04:02,880
We can switch it on or switch it off.

55
00:04:02,880 --> 00:04:09,570
"Whether to consider the subtree raising operation
during pruning." Subtree raising actually

56
00:04:09,570 --> 00:04:18,130
increases the complexity of the algorithm,
so it would work faster if you turned off

57
00:04:18,130 --> 00:04:21,479
subtree raising on a large problem.

58
00:04:21,479 --> 00:04:24,820
I'm not going to talk about the details of
these methods.

59
00:04:24,820 --> 00:04:29,009
Pruning is a messy and complicated subject,
and it's not particularly illuminating.

60
00:04:29,009 --> 00:04:33,229
Actually, I don't really recommend playing
around with these parameters here.

61
00:04:33,229 --> 00:04:38,130
The default values on J48 tend to do a pretty
good job.

62
00:04:40,300 --> 00:04:45,700
Of course, it's become apparent to you now
that the need to prune is really a result

63
00:04:45,700 --> 00:04:51,800
of the original unpruned tree overfitting
the training dataset.

64
00:04:51,800 --> 00:04:54,080
This is another instance of overfitting.

65
00:04:54,080 --> 00:05:00,010
Sometimes simplifying a decision tree gives
better results, not just a smaller, more manageable

66
00:05:00,010 --> 00:05:03,030
tree, but actually better results.

67
00:05:03,030 --> 00:05:05,530
I'm going to open the diabetes data.

68
00:05:13,360 --> 00:05:20,440
I'm going to choose J48, and I'm just going
to run it with the default parameters.

69
00:05:20,440 --> 00:05:28,480
I get an accuracy of 73.8%, evaluated using
cross-validation.

70
00:05:28,480 --> 00:05:37,020
The size of the tree is 20 leaves, and a total
of 39 nodes.

71
00:05:37,020 --> 00:05:42,950
That's 19 interior nodes and 20 leaf nodes.

72
00:05:43,500 --> 00:05:45,320
Let's switch off pruning.

73
00:05:45,320 --> 00:05:47,260
J48 prunes by default.

74
00:05:47,260 --> 00:05:48,940
We're going to switch off pruning.

75
00:05:48,940 --> 00:05:53,140
We've got an unpruned option here, which is
false, which means it's pruning.

76
00:05:53,140 --> 00:05:59,850
I'm going to change that to true -- which
means it's not pruning any more -- and run

77
00:05:59,850 --> 00:06:00,750
it again.

78
00:06:00,750 --> 00:06:07,750
Now we get a slightly worse result, 72.7%,
probably not significantly worse.

79
00:06:07,750 --> 00:06:13,760
We get a slightly larger tree -- 22 leaves and
43 nodes.

80
00:06:13,760 --> 00:06:15,420
That's a double whammy, really.

81
00:06:15,420 --> 00:06:19,280
We've got a bigger tree, which is harder to
understand, and we've got a slightly worse

82
00:06:19,280 --> 00:06:20,460
prediction result.

83
00:06:20,460 --> 00:06:25,490
We would prefer the pruned [tree] in this
example on this dataset.

84
00:06:26,240 --> 00:06:30,580
I'm going to show you a more extreme example
with the breast cancer data.

85
00:06:30,580 --> 00:06:36,770
I don't think we've looked at the breast cancer data before.

86
00:06:36,770 --> 00:06:45,460
The class is no-recurrence-events versus recurrence-events,
and there are attributes like age, menopause,

87
00:06:45,460 --> 00:06:49,360
tumor size, and so on.

88
00:06:49,360 --> 00:06:53,560
I'm going to go classify this with J48 in
the default configuration.

89
00:06:53,560 --> 00:07:02,460
I need to switch on pruning -- that is, make
unpruned false -- and then run it.

90
00:07:04,000 --> 00:07:11,710
I get an accuracy of 75.5%, and I get a fairly
small tree with 4 leaves and 2 internal nodes.

91
00:07:11,710 --> 00:07:18,700
I can look at that tree here, or I can visualize
the tree.

92
00:07:22,440 --> 00:07:27,680
We get this nice, simple little decision structure
here, which is quite comprehensible and performs

93
00:07:27,680 --> 00:07:30,490
pretty well, 75% accuracy.

94
00:07:30,490 --> 00:07:35,740
I'm going to switch off pruning.

95
00:07:35,740 --> 00:07:39,930
Make unpruned true, and run it again.

96
00:07:41,700 --> 00:07:49,910
First of all, I get a much worse result, 69.6%
-- probably significantly worse than the 75.5%

97
00:07:49,910 --> 00:07:51,870
I had before.

98
00:07:51,870 --> 00:07:58,870
More importantly, I get a huge tree, with
152 leaves and [179] [total] nodes.

99
00:07:59,510 --> 00:08:00,460
It's massive.

100
00:08:00,460 --> 00:08:04,720
If I try to visualize that, I probably won't
be able to see very much.

101
00:08:04,720 --> 00:08:09,530
I can try to fit that to my screen,

102
00:08:12,360 --> 00:08:14,860
and it's still impossible to see what's going on here.

103
00:08:14,860 --> 00:08:21,860
In fact, if I look at the textual description
of the tree, it's just extremely complicated.

104
00:08:22,880 --> 00:08:24,949
That's a bad thing.

105
00:08:24,949 --> 00:08:27,810
Here, an unpruned tree is a very bad idea.

106
00:08:27,810 --> 00:08:34,810
We get a huge tree which does quite a bit
worse than a much simpler decision structure.
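
The four runs above can be put side by side in a short sketch. The figures are the ones quoted in this lesson; the dictionary layout and the `compare` helper are illustrative only.

```python
# Accuracy (%) and tree size as reported in the lesson.
runs = {
    ("diabetes", "pruned"): {"accuracy": 73.8, "leaves": 20, "nodes": 39},
    ("diabetes", "unpruned"): {"accuracy": 72.7, "leaves": 22, "nodes": 43},
    # 4 leaves plus 2 internal nodes gives 6 nodes in total.
    ("breast-cancer", "pruned"): {"accuracy": 75.5, "leaves": 4, "nodes": 6},
    ("breast-cancer", "unpruned"): {"accuracy": 69.6, "leaves": 152, "nodes": 179},
}


def compare(dataset):
    """Accuracy lost and size gained when pruning is switched off."""
    pruned = runs[(dataset, "pruned")]
    unpruned = runs[(dataset, "unpruned")]
    return {
        "accuracy_drop": round(pruned["accuracy"] - unpruned["accuracy"], 1),
        "size_ratio": unpruned["nodes"] / pruned["nodes"],
    }


for name in ("diabetes", "breast-cancer"):
    print(name, compare(name))
```

On the diabetes data the unpruned tree is only slightly larger, while on the breast cancer data it has close to thirty times as many nodes and a lower accuracy.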

107
00:08:35,930 --> 00:08:42,919
J48 does pruning by default and, in general,
you should let it do pruning according to

108
00:08:42,919 --> 00:08:44,079
the default parameters.

109
00:08:44,079 --> 00:08:46,940
That would be my recommendation.

110
00:08:48,870 --> 00:08:52,920
We've talked about J48, or, in other words,
C4.5.

111
00:08:52,920 --> 00:08:59,589
Remember, in Lesson 1.4, we talked about the
progression from C4.5 by Ross Quinlan.

112
00:08:59,589 --> 00:09:05,240
Here is a picture of Ross Quinlan, an Australian
computer scientist, at the bottom of the screen.

113
00:09:05,240 --> 00:09:11,500
The progression goes from Ross's C4.5 to J48,
which is the Java implementation essentially

114
00:09:11,500 --> 00:09:14,100
equivalent to C4.5.

115
00:09:14,100 --> 00:09:15,690
It's a very popular method.

116
00:09:15,690 --> 00:09:17,520
It's a simple method and easy to use.

117
00:09:17,520 --> 00:09:21,900
Decision trees are very attractive because
you can look at them and see what the structure

118
00:09:21,900 --> 00:09:24,740
of the decision is, see what's important about
your data.

119
00:09:25,850 --> 00:09:31,740
There are many different pruning methods,
and their main effect is to change the size

120
00:09:31,790 --> 00:09:32,410
of the tree.

121
00:09:32,410 --> 00:09:36,360
They have a small effect on the accuracy,
and they often make the accuracy worse.

122
00:09:36,360 --> 00:09:42,650
They often have a huge effect on the size of
the tree, as we just saw with the breast cancer data.

123
00:09:42,650 --> 00:09:47,670
Pruning is actually a general technique to
guard against overfitting, and it can be applied

124
00:09:47,670 --> 00:09:52,540
to structures other than trees, like decision rules.

125
00:09:52,540 --> 00:09:56,940
There's a lot more we could say about decision trees.

126
00:09:56,940 --> 00:10:01,730
For example, we've been talking about univariate
decision trees -- that is, ones that have

127
00:10:01,730 --> 00:10:04,450
a single test at each node.

128
00:10:04,450 --> 00:10:08,550
You can imagine a multivariate tree, where
there is a compound test.

129
00:10:08,550 --> 00:10:14,110
The test of the node might be 'if this attribute
is that AND that attribute is something else'.

130
00:10:14,110 --> 00:10:19,370
You can imagine more complex decision trees
produced by more complex decision tree algorithms.

131
00:10:19,370 --> 00:10:26,370
In general, C4.5/J48 is a popular and useful
workhorse algorithm for data mining.

132
00:10:27,740 --> 00:10:32,310
You can read a lot more about decision trees
if you go to the course text.

133
00:10:32,310 --> 00:10:37,610
Section 6.1 tells you about pruning and gives
you the mathematical details of the pruning

134
00:10:37,610 --> 00:10:43,310
methods that I've just sketched here.

135
00:10:43,310 --> 00:10:46,240
It's time for you to do the activity, and
I'll see you in the next lesson.

136
00:10:46,240 --> 00:10:47,900
Bye for now!

