1
00:00:18,109 --> 00:00:20,490
Hi! You probably learned a bit about flowers if you did the activity

2
00:00:20,490 --> 00:00:23,150
associated with the last lesson.

3
00:00:23,150 --> 00:00:26,509
Now, we're going to actually build
a classifier: Lesson 1.4

4
00:00:26,509 --> 00:00:28,810
Building a classifier.

5
00:00:28,810 --> 00:00:30,349
We're going to use a

6
00:00:30,349 --> 00:00:35,730
system called J48—I'll
tell you why it's called J48 in a minute—

7
00:00:35,730 --> 00:00:38,450
to analyze the glass dataset

8
00:00:38,450 --> 00:00:41,820
that we looked at in the last lesson.

9
00:00:41,820 --> 00:00:44,990
I've got the glass dataset open here.

10
00:00:44,990 --> 00:00:49,180
I'm going to go to the Classify panel.

11
00:00:49,180 --> 00:00:52,590
I choose a classifier here.

12
00:00:52,590 --> 00:00:56,750
There are different kinds of
classifiers. Weka has

13
00:00:56,750 --> 00:01:01,310
bayes classifiers, functions classifiers,
lazy classifiers, meta classifiers, and so on.

14
00:01:02,000 --> 00:01:08,890
We're going to use a tree classifier. J48 is
a tree classifier. I'm going to open trees and click

15
00:01:08,890 --> 00:01:10,700
J48.

16
00:01:10,700 --> 00:01:15,240
Here is the J48 classifier.

17
00:01:15,240 --> 00:01:19,590
Let's run it. We've got the dataset and
the classifier, so we just press Start,

18
00:01:19,590 --> 00:01:21,040
and lo and behold,

19
00:01:21,040 --> 00:01:22,430
it's done it.

20
00:01:22,430 --> 00:01:24,290
It's a bit of an anticlimax, really.

21
00:01:24,290 --> 00:01:26,270
Weka makes things very easy

22
00:01:26,270 --> 00:01:27,650
for you to do.

23
00:01:27,650 --> 00:01:30,130
The problem is understanding what
it is that you have done.

24
00:01:30,130 --> 00:01:32,190
Let's take a look.

25
00:01:32,190 --> 00:01:35,390
Here is some information
about the dataset,

26
00:01:35,390 --> 00:01:38,890
the glass dataset: the number of
instances and attributes.

27
00:01:38,890 --> 00:01:42,720
Then it's printed out a
representation of a tree here.

28
00:01:43,860 --> 00:01:46,900
We'll look at these trees later on,

29
00:01:46,900 --> 00:01:50,159
but just note that this tree has

30
00:01:50,159 --> 00:01:54,880
30 leaves and 59 nodes altogether.

31
00:01:54,880 --> 00:01:57,120
The overall accuracy

32
00:01:57,120 --> 00:02:00,180
is 66.8%.

33
00:02:00,180 --> 00:02:01,330
So, it's done pretty well.

34
00:02:02,619 --> 00:02:05,410
Down at the bottom,

35
00:02:05,410 --> 00:02:08,929
we've got a confusion matrix.
Remember there were about seven different

36
00:02:08,929 --> 00:02:10,260
kinds of glass.

37
00:02:10,260 --> 00:02:11,320
This is

38
00:02:11,320 --> 00:02:15,329
building windows made of float glass.

39
00:02:15,329 --> 00:02:19,109
You can see that 50 of these
have been classified as 'a', which is

40
00:02:19,109 --> 00:02:20,959
correctly classified.

41
00:02:20,959 --> 00:02:23,369
15 of them have been classified as 'b',

42
00:02:23,369 --> 00:02:26,719
which is building windows non-float glass,
so those are errors,

43
00:02:26,719 --> 00:02:28,579
and 3 have been classified as 'c',

44
00:02:28,579 --> 00:02:29,649
and so on.

45
00:02:29,649 --> 00:02:32,619
This is a confusion matrix.

46
00:02:32,619 --> 00:02:36,019
Most of the weight is down the
main diagonal, which

47
00:02:36,019 --> 00:02:39,780
we like to see because that
indicates correct classifications.

48
00:02:39,780 --> 00:02:41,549
Everything off the main diagonal

49
00:02:41,549 --> 00:02:46,049
indicates a misclassification.

50
00:02:46,049 --> 00:02:50,360
That's the confusion matrix.
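
To make the link between the confusion matrix and the accuracy figure concrete, here is a minimal sketch. The 3-class matrix is a made-up toy example, not the glass output shown in the lesson, and the function name is hypothetical. Rows are actual classes and columns are predicted classes, so the main diagonal holds the correct classifications.

```python
def accuracy(matrix):
    """Fraction of instances on the main diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

# Toy matrix (assumed numbers, for illustration only).
toy_matrix = [
    [8, 2, 0],  # actual class a
    [1, 7, 2],  # actual class b
    [0, 1, 9],  # actual class c
]

print(f"accuracy = {accuracy(toy_matrix):.1%}")
```

Everything off the diagonal (here 2 + 1 + 2 + 1 = 6 instances) counts as a misclassification.
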

51
00:02:50,360 --> 00:02:52,260
Let's investigate this a bit further.

52
00:02:52,260 --> 00:02:55,950
We're going to open a configuration
panel for J48.

53
00:02:55,950 --> 00:02:57,689
Remember I chose it

54
00:02:57,689 --> 00:03:00,979
by clicking the Choose button.

55
00:03:00,979 --> 00:03:03,099
Now, if I click it here,

56
00:03:03,099 --> 00:03:05,489
I get a configuration panel.

57
00:03:05,489 --> 00:03:10,559
I clicked J48 in this menu,
and I get a configuration panel, which

58
00:03:10,559 --> 00:03:12,969
gives a bunch of parameters.

59
00:03:12,969 --> 00:03:14,359
I'm not going to

60
00:03:14,359 --> 00:03:18,659
really talk about these parameters.
Let's just look at one of them, the unpruned

61
00:03:18,659 --> 00:03:20,849
parameter, which by default is false.

62
00:03:20,849 --> 00:03:22,730
What we've just done is to build a

63
00:03:22,730 --> 00:03:26,659
pruned tree, because unpruned is false.

64
00:03:26,659 --> 00:03:28,709
We can change this to

65
00:03:28,709 --> 00:03:31,949
make it true and build an unpruned tree.

66
00:03:31,949 --> 00:03:33,499
We've changed the configuration.

67
00:03:33,499 --> 00:03:36,059
We can run it again.

68
00:03:36,059 --> 00:03:38,999
It just ran again, and now we have

69
00:03:38,999 --> 00:03:43,209
a potentially different result.

70
00:03:43,209 --> 00:03:48,149
Let's just have a look. We have
67% correct classification.

71
00:03:48,149 --> 00:03:49,739
What did we have before?

72
00:03:49,739 --> 00:03:52,579
These are the runs.
This is the previous run,

73
00:03:52,579 --> 00:03:54,040
and there we had

74
00:03:54,040 --> 00:03:57,139
66.8%.

75
00:03:57,139 --> 00:04:01,109
Now, in this run that we've just done with

76
00:04:01,109 --> 00:04:06,939
the unpruned tree, we've got 67% accuracy,

77
00:04:06,939 --> 00:04:11,559
and the tree is the same size.

78
00:04:11,559 --> 00:04:14,619
That's one option.

79
00:04:14,619 --> 00:04:18,239
I'm just going to look at another option,
and then we'll look at some trees.

80
00:04:18,239 --> 00:04:20,430
I'm going to click the
configuration panel again,

81
00:04:20,430 --> 00:04:24,930
and I'm going to change

82
00:04:26,330 --> 00:04:30,439
the minNumObj parameter.

83
00:04:30,439 --> 00:04:32,229
What is that?

84
00:04:32,229 --> 00:04:36,470
That is the minimum number of
instances per leaf.

85
00:04:36,470 --> 00:04:38,969
I'm going to change that from 2

86
00:04:38,969 --> 00:04:41,169
up to 15

87
00:04:41,169 --> 00:04:44,599
to have larger leaves.

88
00:04:44,599 --> 00:04:47,090
These are the leaves of the tree here,

89
00:04:47,090 --> 00:04:49,610
and these numbers in
brackets are the number of

90
00:04:49,610 --> 00:04:53,419
instances that get to the leaf. When
there are two numbers, this means that one

91
00:04:53,419 --> 00:04:56,699
incorrectly classified instance
got to this leaf and five correctly

92
00:04:56,699 --> 00:04:59,159
classified instances got there.
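
The bracketed numbers can be read mechanically. The sketch below assumes the leaf format "(total)" or "(total/errors)", where the first number counts the instances reaching the leaf and the second counts those that are misclassified. The example line and the helper name are made up for illustration and are not copied from the lesson's output.

```python
import re

# Assumed leaf format: "<label> (<total>)" or "<label> (<total>/<errors>)".
LEAF_COUNTS = re.compile(r"\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")

def leaf_counts(line):
    """Return (total, errors, correct) for one leaf line, or None if it has no counts."""
    match = LEAF_COUNTS.search(line)
    if match is None:
        return None
    total = float(match.group(1))
    errors = float(match.group(2)) if match.group(2) else 0.0
    return total, errors, total - errors

example_leaf = "tableware (6.0/1.0)"  # hypothetical line
print(leaf_counts(example_leaf))
```

With six instances reaching the leaf and one of them wrong, five are correctly classified, which matches the reading given above.
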

93
00:04:59,159 --> 00:05:00,000
You can see that all of

94
00:05:00,000 --> 00:05:01,730
these leaves are pretty small,

95
00:05:01,730 --> 00:05:03,810
with sometimes just two or three

96
00:05:03,810 --> 00:05:05,530
or here is one with 31

97
00:05:05,530 --> 00:05:09,630
instances. We've now constrained this number:

98
00:05:09,630 --> 00:05:12,730
when the tree is generated,
this number is always going to be

99
00:05:12,730 --> 00:05:16,670
15 or more. Let's run it again.

100
00:05:16,670 --> 00:05:17,630
Now we've got

101
00:05:17,630 --> 00:05:22,080
a worse result, 61%
correct classification, but a much

102
00:05:22,080 --> 00:05:25,920
smaller tree,

103
00:05:25,920 --> 00:05:30,920
with only eight leaves.

104
00:05:32,470 --> 00:05:35,630
Now, we can visualize this tree.

105
00:05:35,630 --> 00:05:37,660
If I right click

106
00:05:37,660 --> 00:05:41,910
on the line—these are the lines that describe
each of the runs that we've done, and this

107
00:05:41,910 --> 00:05:45,360
is the third run—if I right
click on that, I get a little menu,

108
00:05:45,360 --> 00:05:49,220
and I can visualize the tree.

109
00:05:49,220 --> 00:05:53,660
There it is. If I right click on empty
space, I can fit this to the screen.

110
00:05:54,880 --> 00:05:57,940
This is the decision tree.
This says first look at the

111
00:05:57,940 --> 00:05:59,850
barium (Ba) content.

112
00:05:59,850 --> 00:06:02,910
If it's large, then it must be headlamps.

113
00:06:02,910 --> 00:06:05,700
If it's small, then look at magnesium (Mg).

114
00:06:05,700 --> 00:06:11,280
If that's small, then let's look at potassium (K),
and if that's small, then we've got tableware.

115
00:06:11,280 --> 00:06:16,320
That sounds like a pretty good thing to me;
I don't want too much potassium in my tableware.

116
00:06:16,320 --> 00:06:18,560
This is a visualization of the tree

117
00:06:18,560 --> 00:06:24,470
and it's the same tree that you
can see by looking here.

118
00:06:24,470 --> 00:06:30,580
This is a different representation
of the same tree.
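
A third way to write down the branch that was read out is as nested conditions. This is only a sketch: the lesson says "large" and "small" without giving the split values, so the thresholds below are hypothetical placeholders, and the branches that were not described return None.

```python
# Hypothetical split values; the real ones appear in the tree output on screen.
BA_SPLIT = 1.0
MG_SPLIT = 1.0
K_SPLIT = 1.0

def classify_glass(ba, mg, k):
    """Follow only the path described in the lesson; None means 'not covered'."""
    if ba > BA_SPLIT:
        return "headlamps"   # large barium content
    if mg > MG_SPLIT:
        return None          # large magnesium: branch not described
    if k > K_SPLIT:
        return None          # large potassium: branch not described
    return "tableware"       # small Ba, small Mg, small K

print(classify_glass(ba=2.0, mg=0.0, k=0.0))
print(classify_glass(ba=0.0, mg=0.0, k=0.0))
```

Each test on the way down corresponds to one node of the tree, and each return corresponds to a leaf.
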

119
00:06:30,580 --> 00:06:33,540
I'll just show you one more
thing about this configuration panel,

120
00:06:33,540 --> 00:06:36,930
the More button. This
gives you more information

121
00:06:36,930 --> 00:06:39,350
about the classifier,

122
00:06:39,350 --> 00:06:41,190
about J48.

123
00:06:41,190 --> 00:06:44,230
It's always useful to look at that to
see where these classifiers have come from.

124
00:06:47,970 --> 00:06:49,000
In this case,

125
00:06:49,000 --> 00:06:52,910
let me explain why it's called
J48. It's based on a famous

126
00:06:52,910 --> 00:06:56,070
system that's called C4.5,
which was described in a book.

127
00:06:56,070 --> 00:06:57,880
The book is referenced here.

128
00:06:57,880 --> 00:06:59,260
In fact, I think I've got it

129
00:06:59,260 --> 00:07:01,290
on my shelf here. This book here,

130
00:07:01,290 --> 00:07:05,830
"C4.5: Programs for Machine Learning"
by an Australian

131
00:07:05,830 --> 00:07:09,250
computer scientist called Ross Quinlan.

132
00:07:09,250 --> 00:07:12,460
He started out with a system called ID3—

133
00:07:12,460 --> 00:07:14,740
I think that might have
been in his PhD thesis—

134
00:07:14,740 --> 00:07:18,630
and then C4.5 became quite famous.
ID3 kind of morphed through various

135
00:07:18,630 --> 00:07:20,750
versions into C4.5.

136
00:07:20,750 --> 00:07:25,080
It became famous; the book came out,
and so on. He continued to work on this system.

137
00:07:25,080 --> 00:07:26,880
It went up to C4.8,

138
00:07:26,880 --> 00:07:30,950
and then he went commercial. Up until
then, these were all open source

139
00:07:30,950 --> 00:07:32,070
systems.

140
00:07:32,070 --> 00:07:33,890
When we built Weka,

141
00:07:33,890 --> 00:07:37,420
we took the latest version

142
00:07:37,420 --> 00:07:39,900
of C4.5,

143
00:07:39,900 --> 00:07:41,380
which was C4.8,

144
00:07:41,380 --> 00:07:45,500
and we rewrote it. Weka's written
in Java, so we called it J48.

145
00:07:45,500 --> 00:07:47,410
Maybe it's not a

146
00:07:47,410 --> 00:07:48,810
very good name,

147
00:07:48,810 --> 00:07:50,500
but that's the name that stuck.

148
00:07:50,500 --> 00:07:54,240
There's a little bit of history for you.

149
00:07:54,240 --> 00:07:57,950
We've talked about classifiers in Weka.

150
00:07:57,950 --> 00:08:00,380
I've shown you where you find the
classifiers. We classified the glass

151
00:08:00,380 --> 00:08:04,260
dataset. We looked at how to interpret
the output from J48, in

152
00:08:04,260 --> 00:08:09,170
particular the confusion matrix.
We looked at the configuration panel for J48.

153
00:08:09,170 --> 00:08:12,810
We looked at a couple of options: pruned
versus unpruned trees and the option to

154
00:08:12,810 --> 00:08:14,330
avoid small leaves.

155
00:08:14,330 --> 00:08:15,530
I told you how

156
00:08:15,530 --> 00:08:18,850
J48 really corresponds to the
machine learning system that

157
00:08:18,850 --> 00:08:24,670
most people know as C4.5.
C4.5 and C4.8 were really pretty similar,

158
00:08:24,670 --> 00:08:26,030
so we just talk

159
00:08:26,030 --> 00:08:30,450
about J48 as if it's synonymous with C4.5.

160
00:08:30,450 --> 00:08:32,220
You can read about this in the book—

161
00:08:32,220 --> 00:08:35,930
Section 11.1 about Building a
decision tree and Examining the output.

162
00:08:35,930 --> 00:08:40,520
Now, off you go, and do the
activity associated with this lesson.

163
00:08:40,520 --> 00:08:47,520
See you again soon!

