1
00:00:17,199 --> 00:00:24,199
Hi! Here in Lesson 3.4, we're continuing our
exploration of simple classifiers by looking

2
00:00:24,230 --> 00:00:29,219
at classifiers that produce decision trees.

3
00:00:29,219 --> 00:00:33,269
We're going to look at J48.

4
00:00:33,269 --> 00:00:35,920
We've used this classifier quite a bit so far.

5
00:00:35,920 --> 00:00:38,399
Let's have a look at how it works inside.

6
00:00:38,399 --> 00:00:45,399
J48 is based on a top-down strategy, a recursive
divide and conquer strategy.

7
00:00:46,479 --> 00:00:51,680
You select which attribute to split on at
the root node, and then you create a branch

8
00:00:51,680 --> 00:00:57,960
for each possible attribute value, and that
splits the instances into subsets, one for

9
00:00:57,960 --> 00:01:01,589
each branch that extends from the root node.

10
00:01:01,589 --> 00:01:06,210
Then you repeat the the procedure recursively
for each branch, selecting an attribute at each node,

11
00:01:06,210 --> 00:01:12,180
and you use only instances that
reach that branch to make the selection.

12
00:01:12,180 --> 00:01:19,180
At the end you stop, perhaps you might continue
until all instances have the same class.

13
00:01:19,439 --> 00:01:26,439
The trick is, the question is, how do you
select a good attribute for the root node.

14
00:01:29,490 --> 00:01:36,119
This is the weather data, and as you can see,
outlook has been selected for the root node.

15
00:01:37,140 --> 00:01:42,899
Here are the four possibilities: outlook,
windy, humidity, and temperature.

16
00:01:42,899 --> 00:01:46,819
These are the consequences of splitting on
each of these attributes.

17
00:01:48,000 --> 00:01:53,819
What we're really looking for is a pure split,
a split into pure nodes.

18
00:01:54,709 --> 00:02:00,709
We would be delighted if we found an attribute
that split exactly into one node where they

19
00:02:00,709 --> 00:02:04,749
are all yeses, another node where they
are all nos, and perhaps a third node where

20
00:02:04,749 --> 00:02:05,779
they are all yeses again.

21
00:02:05,779 --> 00:02:06,990
That would be the best thing.

22
00:02:06,990 --> 00:02:12,340
What we don't want is mixtures, because when
we get mixtures of yeses and nos at a node,

23
00:02:12,340 --> 00:02:14,740
then we've got to split again.

24
00:02:14,740 --> 00:02:17,040
You can see that splitting on outlook looks
pretty good.

25
00:02:17,040 --> 00:02:24,040
We get one branch with two yeses and three
nos, then we get a pure yes branch for overcast,

26
00:02:24,209 --> 00:02:29,950
and, when outlook is rainy, we get three yeses
and two nos.

27
00:02:29,950 --> 00:02:35,120
How are we going to quantify this to decide
which one of these attributes produces the

28
00:02:35,120 --> 00:02:42,120
purest nodes? We're on a quest here for purity.

29
00:02:42,269 --> 00:02:51,110
The aim is to get the smallest tree, and top-down
tree induction methods use some kind of heuristic.

30
00:02:51,110 --> 00:02:58,110
The most popular heuristic to produce pure
nodes is an information theory-based heuristic.

31
00:03:00,269 --> 00:03:04,030
I'm not going to explain information theory
to you, that would be another MOOC of its

32
00:03:04,030 --> 00:03:06,939
own -- quite an interesting one, actually.

33
00:03:06,939 --> 00:03:12,909
Information theory was founded by Claude Shannon,
an American mathematician and scientist who

34
00:03:12,909 --> 00:03:14,880
died about 12 years ago.

35
00:03:14,880 --> 00:03:16,439
He was an amazing guy.

36
00:03:16,439 --> 00:03:17,680
He did some amazing things.

37
00:03:17,680 --> 00:03:23,000
One of the most amazing things, I think, is
that he could ride a unicycle and juggle clubs

38
00:03:23,000 --> 00:03:26,799
at the same time when he was in his 80's.

39
00:03:26,799 --> 00:03:29,829
That's pretty impressive.

40
00:03:29,829 --> 00:03:35,519
He came up the whole idea of information theory
and quantifying entropy, which measures information

41
00:03:35,519 --> 00:03:37,019
in bits.

42
00:03:37,019 --> 00:03:44,019
This is the formula for entropy: the sum of
p log p's for each of the possible outcomes.

43
00:03:44,659 --> 00:03:47,299
I'm not really going to explain it to you.

44
00:03:47,299 --> 00:03:51,299
All of those minus signs are there because
logarithms are negative if numbers are less

45
00:03:51,299 --> 00:03:53,930
than 1 and probabilities always are less than
1.

46
00:03:53,930 --> 00:03:56,709
So, the entropy comes out to be a positive
number.

47
00:03:57,460 --> 00:03:59,939
What we do is we look at the information gain.

48
00:03:59,939 --> 00:04:07,170
How much information in bits do you gain by
knowing the value of an attribute? That is,

49
00:04:07,170 --> 00:04:12,180
the entropy of the distribution before the
split minus the entropy of the distribution

50
00:04:12,180 --> 00:04:14,359
after the split.

51
00:04:14,359 --> 00:04:17,480
Here's how it works out for the weather data.

52
00:04:18,440 --> 00:04:19,620
These are the number of bits.

53
00:04:19,620 --> 00:04:24,150
If you split on outlook, you gain 0.247 bits.

54
00:04:24,150 --> 00:04:28,750
I know you might be surprise to see fractional
numbers of bits, normally we think of 1 bit

55
00:04:28,750 --> 00:04:34,690
or 8 bits or 32 bits, but information theory
shows how you can regard bits as fractions.

56
00:04:34,690 --> 00:04:36,330
These produce fractional numbers of bits.

57
00:04:36,330 --> 00:04:38,820
I don't want to go into the details.

58
00:04:38,820 --> 00:04:46,310
You can see, knowing the value for windy gives
you only 0.048 bits of information.

59
00:04:46,310 --> 00:04:53,310
Humidity is quite a bit better; temperature
is way down there at 0.029 bits.

60
00:04:53,310 --> 00:04:58,630
We're going to choose the attribute that gains
the most bits of information, and that, in

61
00:04:58,630 --> 00:05:00,460
this case, is outlook.

62
00:05:00,460 --> 00:05:05,610
At the top level of this tree, the root node,
we're going to split on outlook.

63
00:05:05,610 --> 00:05:11,000
Having decided to split on outlook, we need
to look at each of 3 branches that emanate

64
00:05:11,000 --> 00:05:16,250
from outlook corresponding to the 3 possible
values of outlook, and consider what to do

65
00:05:16,250 --> 00:05:17,600
at each of those branches.

66
00:05:17,600 --> 00:05:22,710
At the first branch, we might split on temperature,
windy or humidity.

67
00:05:22,710 --> 00:05:25,980
We're not going to split on outlook again
because we know that outlook is sunny.

68
00:05:25,980 --> 00:05:30,020
For all instances that reach this place, the
outlook is sunny.

69
00:05:30,020 --> 00:05:32,950
For the other 3 things, we do exactly the
same thing.

70
00:05:32,950 --> 00:05:38,750
We evaluate the information gain for temperature
at that point, for windy and humidity, and

71
00:05:38,750 --> 00:05:39,640
we choose the best.

72
00:05:39,640 --> 00:05:44,230
In this case, it's humidity with a gain of
0.971 bits.

73
00:05:44,230 --> 00:05:50,710
You can see that, if we branch on humidity,
then we get pure nodes: 3 nos in one and 2

74
00:05:50,710 --> 00:05:52,010
yeses in the other.

75
00:05:52,010 --> 00:05:54,120
When we get that, we don't need to split anymore.

76
00:05:54,850 --> 00:06:00,800
We're on a quest for purity.

77
00:06:00,800 --> 00:06:01,520
That's how it works.

78
00:06:01,520 --> 00:06:05,990
It just carries on until it reaches the end,
until it has pure nodes.

79
00:06:05,990 --> 00:06:11,860
Let's open up Weka, and just do this with
the nominal weather data.

80
00:06:11,860 --> 00:06:15,520
Of course, we've done this before, but I'll
just do it again.

81
00:06:15,520 --> 00:06:17,650
It won't take long.

82
00:06:18,140 --> 00:06:22,640
J48 is the workhorse data mining algorithm.

83
00:06:22,640 --> 00:06:23,910
There's the data.

84
00:06:23,910 --> 00:06:26,430
We're going to choose J48.

85
00:06:26,430 --> 00:06:28,500
It's a tree classifier.

86
00:06:29,840 --> 00:06:38,110
We're going to run this, and we get a tree
-- the very tree I showed you before -- split

87
00:06:38,110 --> 00:06:42,150
first on outlook: sunny, overcast and rainy.

88
00:06:42,150 --> 00:06:47,360
Then, if it's sunny, split on humidity, 3
instances reach that node.

89
00:06:47,360 --> 00:06:51,610
Then split on normal, 3 yes instances reach that node,
and so on.

90
00:06:51,610 --> 00:06:58,610
We can look at the tree using Visualize
the tree in the right-click menu.

91
00:06:58,990 --> 00:07:03,060
Here it is.

92
00:07:03,060 --> 00:07:08,140
These are the number of yes instances that
reach this node and the number of no instances.

93
00:07:08,140 --> 00:07:12,900
In the case of this particular tree, of course
we're using cross validation here.

94
00:07:12,900 --> 00:07:16,250
It's done an 11th run on the whole dataset.

95
00:07:16,250 --> 00:07:19,750
It's given us these numbers by looking at
the training set.

96
00:07:19,750 --> 00:07:24,520
In fact, this becomes a pure node here.

97
00:07:24,520 --> 00:07:29,950
Sometimes you get 2 numbers here -- 3/2 or 3/1.

98
00:07:29,950 --> 00:07:35,670
The first number indicates the number of correct
things that reach that node, so in this case

99
00:07:35,670 --> 00:07:36,980
the number of nos.

100
00:07:36,980 --> 00:07:41,090
If there was another number following the
3, that would indicate the number of yeses,

101
00:07:41,090 --> 00:07:43,460
that is, incorrect things that reach that
node.

102
00:07:43,460 --> 00:07:47,850
But that doesn't occur in this very simple
situation.

103
00:07:50,740 --> 00:07:55,560
There you have it, J48: top-down induction
of decision trees.

104
00:07:56,490 --> 00:07:59,360
It's soundly based in information theory.

105
00:07:59,360 --> 00:08:02,230
It's a pretty good data mining algorithm.

106
00:08:02,230 --> 00:08:08,590
10 years ago I might have said it's the best
data mining algorithm, but some even better

107
00:08:08,590 --> 00:08:11,280
ones, I think, have been produced since then.

108
00:08:11,280 --> 00:08:18,200
However, the real advantage of J48 is that
it's reliable and robust, and, most importantly,

109
00:08:18,200 --> 00:08:20,980
it produces a tree that people can understand.

110
00:08:20,980 --> 00:08:23,850
It's very easy to understand the output of J48.

111
00:08:23,850 --> 00:08:28,090
That's really important when you're applying
data mining.

112
00:08:28,090 --> 00:08:31,430
There are a lot of different criteria you
could use for attribute selection.

113
00:08:31,430 --> 00:08:33,459
Here we're using information gain.

114
00:08:33,459 --> 00:08:38,000
Actually, in practice, these don't normally
make a huge difference.

115
00:08:38,000 --> 00:08:42,269
There are some important modifications that
need to be done to this algorithm to be useful

116
00:08:42,269 --> 00:08:42,820
in practice.

117
00:08:42,820 --> 00:08:45,310
I've only really explained the basic principles.

118
00:08:45,310 --> 00:08:51,590
The actual J48 incorporates some more complex
stuff to make it work under different circumstances

119
00:08:51,590 --> 00:08:52,300
in practice.

120
00:08:52,300 --> 00:08:56,370
We'll talk about those in the next lesson.

121
00:08:56,370 --> 00:09:02,580
Section 4.3 of the text Divide-and-conquer:
Constructing decision trees explains the simple

122
00:09:02,580 --> 00:09:05,360
version of J48 that I've explained here.

123
00:09:06,100 --> 00:09:09,590
Now you should go and do the activity associated
with this lesson.

124
00:09:09,590 --> 00:09:12,220
Good luck! See you next time!