1
00:00:18,500 --> 00:00:21,340
Hi! Well, it's summertime here in New Zealand.

2
00:00:21,349 --> 00:00:28,310
Summer's just arrived, and, as you can see,
I'm sitting outside for a change of venue.

3
00:00:28,310 --> 00:00:35,280
This is Class 5 of the MOOC -- the last class!
Here are a few comments on Class 4, some issues

4
00:00:35,280 --> 00:00:37,160
that came up.

5
00:00:37,160 --> 00:00:40,270
We had a couple of errors in the activities;
we corrected those pretty quickly.

6
00:00:40,270 --> 00:00:44,210
Some of the activities are getting harder
-- you will have noticed that! But I think

7
00:00:44,210 --> 00:00:46,129
if you're doing the activities you'll be learning
a lot.

8
00:00:46,129 --> 00:00:50,600
You learn a lot through doing the activities,
so keep it up! And the Class 5 activities

9
00:00:50,600 --> 00:00:51,829
are much easier.

10
00:00:53,220 --> 00:00:58,829
There was a question about converting nominal
variables to numeric in Activity 4.2.

11
00:01:00,560 --> 00:01:04,660
Someone said the result of the supervised
nominal binary filter was weird.

12
00:01:05,460 --> 00:01:07,160
Yes, well, it is a little bit weird.

13
00:01:07,160 --> 00:01:12,380
If you click the "More" button for that filter,
it says that k-1 new binary attributes are

14
00:01:12,380 --> 00:01:16,370
generated in the manner described in this
book (if you can get hold of it).

15
00:01:16,370 --> 00:01:19,140
Let me just tell you a little bit more about
this.

16
00:01:20,200 --> 00:01:27,100
I've come up with an example of a nominal
attribute called "fruit", and it has 3 values:

17
00:01:27,100 --> 00:01:29,390
orange, apple, and banana.

18
00:01:29,390 --> 00:01:33,120
In this dataset, the class is "juicy"; it's
a numeric measure of juiciness.

19
00:01:33,120 --> 00:01:38,080
I don't know about where you live, but in
New Zealand oranges are juicier than apples,

20
00:01:38,080 --> 00:01:39,920
and apples are juicier than bananas.

21
00:01:39,920 --> 00:01:46,000
I'm assuming that in this dataset, if you
average the juiciness of all the instances

22
00:01:46,000 --> 00:01:51,830
where the fruit attribute equals orange you
get a larger value than if you do this with

23
00:01:51,830 --> 00:01:58,050
all the instances where the fruit attribute
equals apple, and that's larger than for banana.

24
00:01:58,050 --> 00:02:02,110
That sort of orders these values.

25
00:02:02,110 --> 00:02:08,399
Let's consider ways of making "fruit" into
a set of binary attributes.

26
00:02:08,399 --> 00:02:14,940
The simplest method, and the one that's used
by the unsupervised conversion filter, is

27
00:02:14,940 --> 00:02:15,790
Method 1 here.

28
00:02:15,790 --> 00:02:21,540
We create 3 new binary attributes; I've just
called them "fruit=orange", "fruit=apple",

29
00:02:21,540 --> 00:02:22,970
and "fruit=banana".

30
00:02:22,970 --> 00:02:27,530
The first attribute value is 1 if it's an
orange and 0 otherwise.

31
00:02:27,530 --> 00:02:31,970
The second attribute, "fruit=apple", is 1
if it's an apple and 0 otherwise, and the

32
00:02:31,970 --> 00:02:34,470
same for banana.

33
00:02:34,470 --> 00:02:40,520
Of course, of these three binary attributes,

34
00:02:40,520 --> 00:02:45,790
exactly one of them has to be "1" for any instance.
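Method 1 can be sketched in a few lines. This is only an illustration of the idea, not Weka's filter code; the function name `one_per_value` is made up for this example.

```python
# Method 1: one binary attribute per nominal value (illustrative sketch only).
FRUIT_VALUES = ["orange", "apple", "banana"]

def one_per_value(value, values=FRUIT_VALUES):
    """Return a dict of binary attributes, one for each possible value."""
    if value not in values:
        raise ValueError(f"unknown value: {value!r}")
    return {f"fruit={v}": int(v == value) for v in values}

for fruit in FRUIT_VALUES:
    encoded = one_per_value(fruit)
    # Exactly one of the binary attributes is 1 for any instance.
    assert sum(encoded.values()) == 1
    print(fruit, encoded)
```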

35
00:02:45,790 --> 00:02:47,430
Here's another way of doing it, Method 2.

36
00:02:47,430 --> 00:02:52,790
We take each possible subset: as well as "orange",
"apple" and "banana", we have another binary

37
00:02:52,790 --> 00:02:59,790
variable for "orange_or_apple", another for
"orange_or_banana", and another for "apple_or_banana".

38
00:03:01,069 --> 00:03:06,880
For example, if the value of fruit was "orange",
then the first attribute ("fruit=orange")

39
00:03:06,880 --> 00:03:12,380
would be 1, the fourth attribute ("orange_or_apple")
would be 1, and the fifth attribute ("orange_or_banana")

40
00:03:12,380 --> 00:03:13,180
would be 1.

41
00:03:13,180 --> 00:03:15,380
All of the others would be 0.

42
00:03:15,380 --> 00:03:23,370
This effectively creates a binary attribute
for each subset of possible values of the

43
00:03:23,370 --> 00:03:24,810
"fruit" attribute.

44
00:03:25,840 --> 00:03:31,810
Actually, we don't create one for the empty
subset or the full subset (with all 3 of the values in).

45
00:03:41,120 --> 00:03:46,240
We get 2^k-2 new attributes for a k-valued attribute.

46
00:03:46,240 --> 00:03:51,720
That's impractical in general, because 2^k
grows very fast as k grows.
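To see how quickly Method 2 blows up, here is a small sketch that enumerates the subsets. The helper names `proper_subsets` and `one_per_subset` are invented for this illustration; Weka has no such filter.

```python
from itertools import combinations

def proper_subsets(values):
    """All subsets except the empty one and the full one, smallest first."""
    subsets = []
    for size in range(1, len(values)):
        subsets.extend(combinations(values, size))
    return subsets

def one_per_subset(value, values):
    """Method 2: one binary attribute per proper, non-empty subset."""
    return {"_or_".join(s): int(value in s) for s in proper_subsets(values)}

values = ["orange", "apple", "banana"]
print(len(proper_subsets(values)))        # 6, i.e. 2**3 - 2
print(one_per_subset("orange", values))   # first, fourth and fifth are 1

# The attribute count for larger k shows why this is impractical.
for k in (3, 5, 10, 20):
    print(k, 2**k - 2)
```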

47
00:03:51,720 --> 00:03:55,370
The third method is the one that is actually
used, and this is the one that's described

48
00:03:55,370 --> 00:03:57,170
in that book.

49
00:03:57,170 --> 00:04:04,170
We create 2 new attributes (k-1, in general,
for a k-valued attribute):

50
00:04:04,170 --> 00:04:08,540
"fruit=orange_or_apple" and "fruit=apple".

51
00:04:08,540 --> 00:04:14,450
For oranges, the first attribute is 1 and
the second is 0; for apples, they're both 1; 

52
00:04:14,450 --> 00:04:18,079
and for bananas, they're both 0.

53
00:04:18,079 --> 00:04:24,070
That's assuming this ordering of class values:
orange is largest in juiciness, and banana

54
00:04:24,070 --> 00:04:25,680
is smallest in juiciness.

55
00:04:25,680 --> 00:04:29,770
There's a theorem that, if you're making a
decision tree, the best way of splitting a

56
00:04:29,770 --> 00:04:36,530
node for a nominal variable with k values
is one of the k-1 positions -- well, you can

57
00:04:36,530 --> 00:04:37,570
read this.

58
00:04:37,570 --> 00:04:41,660
In fact, this theorem is reflected in Method 3.

59
00:04:41,660 --> 00:04:46,770
That is the best way of splitting these attribute
values.
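The ordering idea behind Method 3 can be sketched like this: sort the nominal values by their average class value, then look only at the k-1 split positions along that ordering. The juiciness numbers below are invented for illustration, and this is not the code of Weka's supervised filter.

```python
from collections import defaultdict

# Hypothetical (fruit, juiciness) instances, made up for this sketch.
instances = [
    ("orange", 9.0), ("orange", 8.0),
    ("apple", 6.0), ("apple", 5.0),
    ("banana", 2.0), ("banana", 3.0),
]

def order_by_mean_class(rows):
    """Sort nominal values by average class value, largest first."""
    totals = defaultdict(lambda: [0.0, 0])
    for value, target in rows:
        totals[value][0] += target
        totals[value][1] += 1
    means = {v: s / n for v, (s, n) in totals.items()}
    return sorted(means, key=means.get, reverse=True)

def candidate_splits(ordered):
    """The k-1 ways of cutting the ordered list into two groups."""
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]

ordered = order_by_mean_class(instances)
print(ordered)
for left, right in candidate_splits(ordered):
    print(left, "|", right)
```

With 3 values there are only 2 candidate splits to consider, rather than every possible subset.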

60
00:04:46,770 --> 00:04:50,620
Whether it's a good thing in practice or not,
well, I don't know.

61
00:04:50,620 --> 00:04:52,280
You should try it and see.

62
00:04:52,280 --> 00:04:59,090
Perhaps you can try Method 3 for the supervised
conversion filter and Method 1 for the unsupervised

63
00:04:59,090 --> 00:05:04,400
conversion filter and see which produces the
best results on your dataset.

64
00:05:04,400 --> 00:05:08,800
Weka doesn't implement Method 2, because the
number of attributes explodes with the number

65
00:05:08,800 --> 00:05:15,800
of possible values, and you could end up with
some very large datasets.

66
00:05:16,660 --> 00:05:23,660
The next question is about simulating multiresponse
linear regression: "Please explain!" Well,

67
00:05:24,790 --> 00:05:30,290
we're looking at a Weka screen like this.

68
00:05:30,290 --> 00:05:43,310
We're running linear regression on the iris
dataset where we've mapped the values so that

69
00:05:43,310 --> 00:05:50,310
the class for any Virginica instance is 1
and 0 for the others.

70
00:05:50,770 --> 00:05:54,030
We've done it with this kind of configuration.

71
00:05:54,030 --> 00:05:57,830
This is the default configuration of the MakeIndicator
filter.

72
00:05:57,830 --> 00:06:00,500
It's working on the last attribute -- that's
the class.

73
00:06:00,500 --> 00:06:09,550
In this case, the value index is last, which
means we're looking at the last value, which,

74
00:06:09,550 --> 00:06:11,340
in fact, is Virginica.

75
00:06:11,340 --> 00:06:17,140
We could put a number here to get the first,
second, or third values.

76
00:06:17,140 --> 00:06:25,780
That's how we get the dataset, and then we
run linear regression on this to get a linear model.

77
00:06:26,360 --> 00:06:30,650
Now, I want to look at the output for the
first 4 instances.

78
00:06:30,650 --> 00:06:37,010
We've got an actual class of 1, 1, 0, 0 and
the predicted value of these numbers.

79
00:06:37,010 --> 00:06:42,919
I've written those down in this little table
over here: 1, 1, 0, 0 and these numbers.

80
00:06:42,919 --> 00:06:49,910
That's for the dataset where all of the Virginicas
are mapped to 1 and the other irises are mapped to 0.

81
00:06:49,919 --> 00:06:53,830
When we do the corresponding mapping with
Versicolors, we get this as the actual class

82
00:06:53,830 --> 00:06:59,210
-- we just run Weka and look at what appeared
on the screen -- and this is the predicted value.

83
00:06:59,210 --> 00:07:01,320
We get these for Setosa.

84
00:07:01,320 --> 00:07:08,020
So, you can see that the first instance is
actually a Virginica -- 1, 0, 0.

85
00:07:08,020 --> 00:07:11,919
I've put in bold the largest of these 3 numbers.

86
00:07:11,919 --> 00:07:18,020
This is the largest, 0.966, which is bigger
than 0.117 and -0.065, so multiresponse linear

87
00:07:18,020 --> 00:07:22,919
regression is going to predict Virginica for
instance 1.

88
00:07:22,919 --> 00:07:25,360
It's got the largest value.

89
00:07:25,360 --> 00:07:27,150
And that's correct.

90
00:07:27,150 --> 00:07:32,070
For the second instance, it's also a Virginica,
and it's also the largest of the 3 values

91
00:07:32,070 --> 00:07:33,080
in its row.

92
00:07:33,080 --> 00:07:36,210
For the third instance, it's actually a Versicolor.

93
00:07:36,210 --> 00:07:42,680
The actual output is 1 for the Versicolor
model, but the largest prediction is still

94
00:07:42,680 --> 00:07:44,020
for the Virginica model.

95
00:07:44,020 --> 00:07:48,520
It's going to predict Virginica for an iris
that's actually Versicolor.

96
00:07:48,520 --> 00:07:51,270
That's going to be a mistake.

97
00:07:51,270 --> 00:07:57,100
In the fourth case, it's actually a Setosa
-- the actual column is 1 for Setosa -- and

98
00:07:57,100 --> 00:08:01,970
this is the largest value in the row, so it's
going to correctly predict Setosa.

99
00:08:01,970 --> 00:08:08,669
That's how multiresponse linear regression
works.
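Here is a minimal sketch of that decision step. The three numbers for instance 1 are the ones read out above; which of the two smaller numbers belongs to Versicolor and which to Setosa is an assumption, and the function names are invented for this example.

```python
def make_indicator(labels, positive):
    """Map one class to 1 and all the others to 0."""
    return [1 if label == positive else 0 for label in labels]

def predict_class(scores):
    """Pick the class whose regression model gives the largest output."""
    return max(scores, key=scores.get)

# Actual classes of the first 4 instances, as described in the talk.
actual = ["Iris-virginica", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]
print(make_indicator(actual, "Iris-virginica"))  # [1, 1, 0, 0]

# Outputs of the three regression models for instance 1.
# Assumption: 0.117 is the Versicolor model and -0.065 the Setosa model.
instance_1 = {
    "Iris-virginica": 0.966,
    "Iris-versicolor": 0.117,
    "Iris-setosa": -0.065,
}
print(predict_class(instance_1))  # Iris-virginica
```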

100
00:08:11,840 --> 00:08:15,669
"How does OneR use the rules it generates?
Please explain!" 

101
00:08:21,900 --> 00:08:23,860
Well, here's the rule generated by OneR.

102
00:08:23,860 --> 00:08:26,640
It hinges on attribute 6.

103
00:08:26,640 --> 00:08:31,320
Of course, if you click the "Edit" button
in the Preprocess panel, you can see the value

104
00:08:31,320 --> 00:08:38,320
of this attribute for each instance.

105
00:08:43,579 --> 00:08:49,839
This is what we see in the Explorer when we
run OneR.

106
00:08:49,839 --> 00:08:53,649
You can see the predicted instances here.

107
00:08:53,649 --> 00:08:58,740
These are the predicted instances -- g, b,
g, b, g, g, etc.

108
00:08:58,740 --> 00:08:59,920
These are the predictions.

109
00:08:59,920 --> 00:09:02,410
The question is, how does it get these predictions?

110
00:09:02,410 --> 00:09:06,900
This is the value of attribute 6 for instance 1.

111
00:09:06,900 --> 00:09:14,080
What the OneR code does is go through each
of these conditions and look to see if it's satisfied.

112
00:09:14,089 --> 00:09:19,929
Is 0.02 less than -0.2? -- no, it's not.

113
00:09:19,929 --> 00:09:22,839
Is it less than -0.01? -- no, it's not.

114
00:09:22,839 --> 00:09:26,319
Is it less than 0.001? -- no, it's not.

115
00:09:26,319 --> 00:09:29,740
(It's surprisingly hard to get these right,
especially when you've got all of the other

116
00:09:29,740 --> 00:09:36,530
decimal places in the list here.) Is it less
than 0.1? -- yes, it is.

117
00:09:36,530 --> 00:09:40,869
So rule 4 fires -- this is rule 4 -- and predicts
"g".

118
00:09:43,170 --> 00:09:47,820
I've written down here the number of the rule
clause that fires.

119
00:09:47,820 --> 00:09:55,390
In this case, for instance 2, the value of
the attribute is -0.4, and that satisfies

120
00:09:55,399 --> 00:09:56,649
the first rule.

121
00:09:56,649 --> 00:10:00,300
So this satisfies number 1, and we predict "b".

122
00:10:00,300 --> 00:10:02,610
And so on down the list.

123
00:10:02,610 --> 00:10:03,559
That's what OneR does.

124
00:10:03,559 --> 00:10:10,670
It goes through the rule evaluating each of
these clauses until it finds one that is true,

125
00:10:10,679 --> 00:10:17,679
and then it uses the corresponding prediction
as its output.
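The clause-by-clause check can be sketched as follows. The thresholds are the rounded ones read out above; the predictions for clauses 2 and 3 and the final default are assumptions (only clause 1 giving "b" and clause 4 giving "g" come from the talk), and the real rule may have more clauses.

```python
# (upper bound, prediction) pairs, checked in order.
# Assumed: the labels for clauses 2 and 3, and the default label.
CLAUSES = [(-0.2, "b"), (-0.01, "g"), (0.001, "b"), (0.1, "g")]
DEFAULT = "b"

def one_r_predict(x):
    """Return (clause number, prediction) for the first clause that is true."""
    for number, (upper, label) in enumerate(CLAUSES, start=1):
        if x < upper:
            return number, label
    return len(CLAUSES) + 1, DEFAULT

print(one_r_predict(0.02))   # (4, 'g')  instance 1
print(one_r_predict(-0.4))   # (1, 'b')  instance 2
```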

126
00:10:17,860 --> 00:10:22,050
Moving on to ensemble learning questions.

127
00:10:22,050 --> 00:10:28,779
There were some questions on ensemble learning,
about these ten OneR models.

128
00:10:28,779 --> 00:10:37,510
"Are these ten alternative ways of classifying
the data?" Well, in a sense, but they are used together: 

129
00:10:37,519 --> 00:10:39,470
AdaBoost.M1 combines them.

130
00:10:39,470 --> 00:10:44,769
In practice you don't just pick one of them
and use that: AdaBoost combines these models

131
00:10:44,769 --> 00:10:50,550
inside itself -- the predictions it prints
are produced by its combined model.

132
00:10:50,550 --> 00:10:55,800
The weights are used in the combination to
decide how much weight to give each of these models.

133
00:10:55,809 --> 00:11:00,720
And when Weka reports a certain accuracy,
that's for the combined model.

134
00:11:00,720 --> 00:11:08,480
It's not the average; it's not the best; it's
combined in the way that AdaBoost combines them.

135
00:11:08,489 --> 00:11:10,970
That's all done internally in the algorithm.
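As a rough picture of what "combined" means here, this is a simplified weighted vote. The predictions and weights are invented for illustration, and this sketch leaves out how AdaBoost.M1 actually trains the models and derives the weights.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Add up the weight behind each class and return the heaviest class."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Hypothetical outputs of three models for one instance, with made-up weights.
member_predictions = ["g", "b", "b"]
member_weights = [2.0, 0.7, 0.9]

# Two of the three models say "b", but the heavily weighted model wins.
print(weighted_vote(member_predictions, member_weights))  # g
```

So the result is neither the average nor the single best model: a model with a large weight can outvote several models with small weights.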

136
00:11:10,970 --> 00:11:17,870
I didn't really explain the details of how the algorithm
works; you'll have to look that up, I guess.

137
00:11:17,879 --> 00:11:21,790
The point is AdaBoostM1 combines these models for you.

138
00:11:21,790 --> 00:11:23,679
You don't have to think of them as separate
models.

139
00:11:23,679 --> 00:11:27,559
They're all combined by AdaBoostM1.

140
00:11:27,559 --> 00:11:31,029
Someone complained that we're supposed to
be looking for simplicity, and this seems

141
00:11:31,029 --> 00:11:32,259
pretty complicated.

142
00:11:32,259 --> 00:11:33,579
That's true.

143
00:11:33,579 --> 00:11:38,929
The real disadvantage of these kinds of models,
ensemble models, is that it's hard to look

144
00:11:38,929 --> 00:11:39,420
at the rules.

145
00:11:39,420 --> 00:11:42,439
It's hard to see inside to see what they're doing.

146
00:11:42,439 --> 00:11:44,980
Perhaps you should be a bit wary of that.

147
00:11:44,980 --> 00:11:46,899
But they can produce very good results.

148
00:11:46,899 --> 00:11:51,929
You know how to test machine learning methods
reliably using cross-validation or whatever.

149
00:11:51,929 --> 00:11:57,249
So, sometimes they're good to use.

150
00:11:58,400 --> 00:12:03,929
"How does Weka make predictions? How can you
use Weka to make predictions?" You can use

151
00:12:03,929 --> 00:12:09,059
the "Supplied test set" option on the Classify
panel to put in a test set and see the predictions

152
00:12:09,059 --> 00:12:09,839
on that.

153
00:12:09,839 --> 00:12:15,250
Or, alternatively, there is a program -- if
you can run Java programs -- there's a program here.

154
00:12:15,259 --> 00:12:23,220
This is how you run it: "java weka.classifiers.trees.J48"
with your ARFF data file, and you put question

155
00:12:23,220 --> 00:12:25,600
marks there to indicate the class.

156
00:12:25,600 --> 00:12:32,029
Then you give it the model, which you've output
from the Explorer.

157
00:12:32,029 --> 00:12:40,000
You can look at how to do this on the Weka
Wiki on the FAQ list: "using Weka to make predictions".

158
00:12:43,790 --> 00:12:49,339
Can you bootstrap learning? Someone talked
about some friends of his who were using training

159
00:12:49,339 --> 00:12:54,199
data to train a classifier and using the results
of the classification to create further training

160
00:12:54,199 --> 00:12:57,410
data, and continuing the cycle -- kind of
bootstrapping.

161
00:12:57,410 --> 00:13:01,410
That sounds very attractive, but it can also
be unstable.

162
00:13:01,410 --> 00:13:06,069
It might work, but I think you'd be pretty
lucky for it to work well.

163
00:13:06,069 --> 00:13:11,220
It's a potentially rather unreliable way of
doing things -- believing the classifications

164
00:13:11,220 --> 00:13:16,959
on new data and using that to further train
the classifier.

165
00:13:16,959 --> 00:13:21,019
He also said these friends of his don't really
look into the classification algorithm.

166
00:13:21,019 --> 00:13:24,790
I guess I'm trying to tell you a little bit
about how each classification algorithm works,

167
00:13:24,790 --> 00:13:26,639
because I think it really does help to know
that.

168
00:13:26,639 --> 00:13:32,440
You should be looking inside and thinking
about what's going on inside your data mining method.

169
00:13:32,449 --> 00:13:38,489
A couple of suggestions of things not covered
in this MOOC: FilteredClassifier and association

170
00:13:38,489 --> 00:13:40,939
rules, the Apriori association rule learner.

171
00:13:40,939 --> 00:13:47,939
As I said before, maybe we'll produce a follow-up
MOOC and include topics like this in it.

172
00:13:48,619 --> 00:13:49,579
That's it for now.

173
00:13:49,579 --> 00:13:51,350
Class 5 is the last class.

174
00:13:51,350 --> 00:13:52,569
It's a short class.

175
00:13:52,569 --> 00:13:54,040
Go ahead and do it.

176
00:13:54,040 --> 00:13:57,879
Please complete the assessments and finish
off the course.

177
00:13:57,879 --> 00:14:03,260
It'll be open this week, and it'll remain
open for one further week if you're getting behind.

178
00:14:03,269 --> 00:14:05,320
But after that, it'll be closed.

179
00:14:05,320 --> 00:14:07,300
So, you need to get on with it.

180
00:14:07,300 --> 00:14:09,140
We'll talk to you later. Bye!