1
00:00:18,000 --> 00:00:24,420
Hi! Well, Class 2 has gone flying by, and
here are some things I'd like to discuss.

2
00:00:24,420 --> 00:00:28,640
First of all, we made some mistakes in the
answers to the activities.

3
00:00:28,640 --> 00:00:29,640
Sorry about that.

4
00:00:29,640 --> 00:00:32,480
We've corrected them.

5
00:00:32,480 --> 00:00:37,530
Secondly -- a general point -- some people
have been asking questions, for example, about

6
00:00:37,530 --> 00:00:38,870
huge datasets.

7
00:00:38,870 --> 00:00:44,059
How big a dataset can Weka deal with? The
answer is pretty big, actually.

8
00:00:44,059 --> 00:00:47,710
But it depends on what you do, and it's a
fairly complicated question to discuss.

9
00:00:47,710 --> 00:00:52,570
If it's not big enough, there are ways of
improving things.

10
00:00:52,570 --> 00:00:57,619
Anyway, issues like that should be discussed
on the Weka mailing list, or you should look

11
00:00:57,619 --> 00:01:03,920
in the Weka FAQ, where there's quite a lot
of discussion on this particular issue.

12
00:01:03,920 --> 00:01:07,939
The Weka API: the programming interface to
Weka.

13
00:01:07,939 --> 00:01:11,430
You can incorporate the Weka routines in your
program.

14
00:01:11,430 --> 00:01:14,220
It's wonderful stuff, but it's not covered
in this MOOC.

15
00:01:14,220 --> 00:01:19,430
So the right place to discuss those issues
is the Weka mailing list.

16
00:01:19,430 --> 00:01:22,240
Finally, personal emails to me.

17
00:01:22,240 --> 00:01:27,360
You know, there are 5,000 people on this MOOC,
and I can't cope with personal emails, so

18
00:01:27,360 --> 00:01:33,960
please send them to the mailing list and not
to me personally.

19
00:01:33,960 --> 00:01:38,460
I'd like to discuss the issues of numeric
precision in Weka.

20
00:01:38,460 --> 00:01:44,670
Weka prints percentages to 4 decimal places;
it prints most numbers to 4 decimal places.

21
00:01:44,670 --> 00:01:47,520
That's misleadingly high precision.

22
00:01:47,520 --> 00:01:49,350
Don't take these at face value.

23
00:01:49,350 --> 00:01:59,250
For example, here we've done an experiment
using a 40% percentage split, and we get 92.3333%

24
00:01:59,250 --> 00:02:00,659
accuracy printed out.

25
00:02:00,659 --> 00:02:07,240
Well, that's exactly the right answer to the
wrong question.

26
00:02:07,240 --> 00:02:11,540
We're not interested in the performance on
this particular test set.

27
00:02:11,540 --> 00:02:18,530
What we're interested in is how Weka will
do in general on data from this source.

28
00:02:18,530 --> 00:02:25,110
We certainly can't infer that it's this
percentage to 4-decimal-place accuracy.

29
00:02:25,110 --> 00:02:29,430
In Class 2, we're trying to sensitize you
to the fact that these figures aren't to be

30
00:02:29,430 --> 00:02:31,300
taken at face value.

31
00:02:31,300 --> 00:02:35,090
For example, there we are with a 40% split.

32
00:02:35,090 --> 00:02:42,090
If we do a 30% split we get 92.381%.

33
00:02:43,530 --> 00:02:46,590
The difference between these two numbers is
completely insignificant.

34
00:02:46,590 --> 00:02:50,260
You shouldn't be saying this is better than
the other number.

35
00:02:50,260 --> 00:02:56,270
They are both the same, really, within the
amount of statistical fuzz that's involved

36
00:02:56,270 --> 00:02:57,660
in the experiment.
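A rough sketch of that statistical fuzz, treating each test instance as an independent trial. The test-set sizes (900 and 1050) are assumptions, not figures given in the talk: they are sizes for which the two printed percentages correspond to whole numbers of correct instances (831/900 and 970/1050).

```python
import math

def accuracy_standard_error(correct, total):
    """Binomial standard error of an accuracy estimate, as a fraction."""
    p = correct / total
    return math.sqrt(p * (1 - p) / total)

# Assumed counts: 831/900 prints as 92.3333%, 970/1050 as 92.381%.
acc_40 = 831 / 900
acc_30 = 970 / 1050
se_40 = accuracy_standard_error(831, 900)
se_30 = accuracy_standard_error(970, 1050)

difference = abs(acc_40 - acc_30)
print(f"40% split: {acc_40:.4%} +/- {se_40:.2%}")
print(f"30% split: {acc_30:.4%} +/- {se_30:.2%}")
print(f"difference: {difference:.4%}")
```

Under these assumptions the gap between the two figures is far smaller than either standard error, which is why neither split can be called better.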

37
00:02:57,660 --> 00:03:06,760
We're trying to train you to write your answers
to the nearest percentage point, or perhaps

38
00:03:06,760 --> 00:03:07,900
1 decimal place.

39
00:03:07,900 --> 00:03:12,489
Those are the answers that are being accepted
as correct.

40
00:03:12,489 --> 00:03:17,090
The reason we're doing that is to try to train
you to think about these numbers and what

41
00:03:17,090 --> 00:03:22,090
they really represent, rather than just copy/pasting
whatever Weka prints out.

42
00:03:22,090 --> 00:03:25,520
These numbers need to be interpreted.

43
00:03:25,520 --> 00:03:37,840
For example, in Activity 2.6 in question 2,
the 4-digit answer would be 0.7354%, and 0.7

44
00:03:37,840 --> 00:03:41,520
and 0.74 are the only accepted answers.

45
00:03:41,520 --> 00:03:51,810
In question 5, the 4-decimal place accuracy
is 1.7256%, and we would accept 1.73%, 1.7% and 2%.

46
00:03:51,819 --> 00:03:55,790
We're a bit selective in what we'll accept
here.
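A small sketch of the rounding being asked for, using the two figures quoted above. The helper name `round_to` is made up for illustration; which roundings are accepted for a given question is set by the activity, not by this function.

```python
def round_to(value, decimals):
    """Format a number rounded to the given number of decimal places."""
    return f"{value:.{decimals}f}"

# Question 5: 1.7256 at 2, 1 and 0 decimal places.
question_5 = [round_to(1.7256, d) for d in (2, 1, 0)]

# Question 2: 0.7354 at 2 and 1 decimal places.
question_2 = [round_to(0.7354, d) for d in (2, 1)]

print(question_5)
print(question_2)
```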

47
00:03:58,740 --> 00:04:02,790
I want to move on to the user classifier now.

48
00:04:04,280 --> 00:04:10,030
Some people got some confusing results, because
they created splits that involved the class

49
00:04:10,030 --> 00:04:13,330
attribute.

50
00:04:13,330 --> 00:04:16,739
When you're dealing with the test set, you
don't know the class attribute -- that's what

51
00:04:16,739 --> 00:04:18,120
you're trying to find out.

52
00:04:18,120 --> 00:04:22,750
So it doesn't make sense to create splits
in the decision tree that involve testing

53
00:04:22,750 --> 00:04:24,889
the class attribute.

54
00:04:24,889 --> 00:04:31,819
If you do that, you're going to get 0 accuracy
on test data, because the class value cannot

55
00:04:31,819 --> 00:04:37,259
be evaluated on the test data.

56
00:04:37,259 --> 00:04:40,800
That was the cause of that confusion.

57
00:04:40,800 --> 00:04:44,080
Here's the league table for the user classifier.

58
00:04:44,080 --> 00:04:47,909
J48 gets 96.2%, just as a reference point.

59
00:04:47,909 --> 00:04:51,719
Magda did really well and got very close to
that, with 93.9%.

60
00:04:51,719 --> 00:04:58,719
It took her 6.5-7 minutes, according to the
script that she mailed in.

61
00:05:01,409 --> 00:05:04,909
Myles did pretty well -- 93.5%.

62
00:05:04,909 --> 00:05:09,369
In the class, I got 78% in just a few seconds.

63
00:05:09,369 --> 00:05:14,710
I think if you get over 90% you're doing pretty
well on this dataset for the user classifier.

64
00:05:14,710 --> 00:05:21,710
The point is not to get a good result, it's
to think about the process of classification.

65
00:05:23,680 --> 00:05:30,050
Let's move to Activity 2.2, partitioning the
datasets for training and testing.

66
00:05:30,050 --> 00:05:40,080
Question 1 asked you to evaluate J48 with
percentage split, using 10% for the training

67
00:05:40,080 --> 00:05:43,650
set, 20%, 40%, 60%, and 80%.

68
00:05:43,650 --> 00:05:50,650
What you observed is that the accuracy increases
as we go through that set of numbers.

69
00:05:51,960 --> 00:05:55,169
"Performance always increases" for those numbers.

70
00:05:55,169 --> 00:05:57,939
It doesn't always increase in general.

71
00:05:57,939 --> 00:06:03,979
In general, you would expect an increasing
trend -- the more training data the better

72
00:06:03,979 --> 00:06:08,559
the performance, asymptoting off at some point.

73
00:06:08,559 --> 00:06:12,569
You would expect some fluctuation, though,
so sometimes you would expect it to go down

74
00:06:12,569 --> 00:06:13,499
and up again.

75
00:06:13,499 --> 00:06:20,119
In this particular case, performance always
increases.

76
00:06:20,119 --> 00:06:28,500
You were asked to estimate J48's true accuracy
on the segment-challenge dataset in Question 4.

77
00:06:28,509 --> 00:06:34,240
Well, "true accuracy" -- what do we mean by
"true accuracy"? I guess maybe it's not very

78
00:06:34,240 --> 00:06:40,770
well defined, but what one thinks of is if
you have a large enough training set, the

79
00:06:40,770 --> 00:06:45,300
performance of J48 is going to increase up
to some kind of point, and what would that

80
00:06:45,300 --> 00:06:55,780
point be? Actually, if you do this -- in fact,
you've done it! -- you found that

81
00:06:55,789 --> 00:07:05,370
training sets of between 60% and 97-98%,
using the percentage split option, consistently

82
00:07:05,379 --> 00:07:10,619
yield correctly classified instances in the
range 94-97%.

83
00:07:10,619 --> 00:07:15,960
So 95% is probably the best fit from this
selection of possible numbers.

84
00:07:15,960 --> 00:07:22,339
It's true, by the way, that greater weight
is normally given to the training portion

85
00:07:22,339 --> 00:07:23,240
of this split.

86
00:07:23,240 --> 00:07:31,330
Usually when we use percentage split, we would
use 2/3, or maybe 3/4, or maybe 90% of the

87
00:07:31,339 --> 00:07:34,909
data for training, and the smaller amount for
the test data.

88
00:07:36,600 --> 00:07:41,520
Questions 6 and 7 were confusing, and we've
changed those.

89
00:07:41,520 --> 00:07:48,890
The issue there was how a classifier's performance,
and secondly the reliability of the estimate

90
00:07:48,899 --> 00:07:53,490
of the classifier's performance, is expected
to increase as the volume of the training

91
00:07:53,490 --> 00:07:54,699
data increases.

92
00:07:56,020 --> 00:07:59,949
Or, how they change with the size of the dataset.

93
00:07:59,949 --> 00:08:05,249
The performance is expected to increase as
the volume of training data increases, and

94
00:08:05,249 --> 00:08:11,490
the reliability of the estimate is also expected
to increase as the volume of test data increases.

95
00:08:11,490 --> 00:08:14,689
With the percentage split option, there's
a trade-off between the amount of test data

96
00:08:14,689 --> 00:08:16,289
and the amount of training data.

97
00:08:16,289 --> 00:08:22,669
That's what that question is trying to get
at.
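A minimal sketch of that trade-off. The dataset size of 1500 and the fixed accuracy of 0.95 are assumptions chosen for illustration; in practice the accuracy itself also changes with the amount of training data, so this isolates only the effect of test-set size on the reliability of the estimate.

```python
import math

def split_sizes(total, train_percent):
    """Return (training size, test size) for a percentage split."""
    train = round(total * train_percent / 100)
    return train, total - train

def estimate_standard_error(accuracy, test_size):
    """Binomial standard error of an accuracy measured on test_size instances."""
    return math.sqrt(accuracy * (1 - accuracy) / test_size)

TOTAL = 1500             # assumed dataset size
ASSUMED_ACCURACY = 0.95  # assumed and held fixed for every split

for percent in (10, 20, 40, 60, 80):
    train, test = split_sizes(TOTAL, percent)
    se = estimate_standard_error(ASSUMED_ACCURACY, test)
    print(f"{percent:2d}% training: train={train:4d} test={test:4d} std error={se:.4f}")
```

More training data leaves fewer test instances, so the standard error of the estimate grows as the training percentage goes up.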

98
00:08:22,669 --> 00:08:31,030
Activity 2.3 Question 5: "How do the mean
and standard deviation estimates depend on

99
00:08:31,039 --> 00:08:40,900
the number of samples?" Well, the answer is
that roughly speaking both stay the same.

100
00:08:40,900 --> 00:08:45,460
Let me find Activity 2.3, Question 5.

101
00:08:46,340 --> 00:08:57,740
As you increase the number of samples, you
expect the estimated mean to converge to the true

102
00:08:57,740 --> 00:09:02,850
value of the mean, and the estimated standard
deviation to converge to the true standard

103
00:09:02,850 --> 00:09:04,150
deviation.

104
00:09:04,150 --> 00:09:09,050
So, they would both stay about the same.

105
00:09:09,050 --> 00:09:14,160
This is, in fact, now marked as correct.

106
00:09:14,160 --> 00:09:24,080
Actually, because of the "n - 1" in the denominator
of the formula for variance, it's true that

107
00:09:24,080 --> 00:09:29,820
the standard deviation decreases a tiny bit,
but it's a very small effect.

108
00:09:29,820 --> 00:09:34,770
So we've also accepted that answer as correct.

109
00:09:34,770 --> 00:09:38,630
That's how the mean and standard deviation
estimates depend on the number of samples.

110
00:09:38,630 --> 00:09:45,630
Perhaps a more important question is how the
reliability of the mean would change.

111
00:09:46,340 --> 00:09:51,710
What decreases is the standard error of the
estimate of the mean, which is the standard

112
00:09:51,710 --> 00:09:57,740
deviation of the theoretical distribution
of the large population of such estimates.

113
00:09:57,740 --> 00:10:04,740
The estimate of the mean is a better, more
reliable estimate with a larger number of samples.
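A short sketch of the difference between the standard deviation and the standard error of the mean. The samples are simulated from a normal distribution with an assumed mean of 0.95 and standard deviation of 0.02; they are not results from Weka.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed so the sketch is repeatable

def summarize(n, true_mean=0.95, true_sd=0.02):
    """Draw n simulated accuracy figures and summarize them."""
    samples = [random.gauss(true_mean, true_sd) for _ in range(n)]
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)   # uses n - 1 in the denominator
    std_error = sd / math.sqrt(n)    # standard error of the mean
    return mean, sd, std_error

results = {n: summarize(n) for n in (10, 100, 1000)}
for n, (mean, sd, std_error) in results.items():
    print(f"n={n:5d}  mean={mean:.4f}  sd={sd:.4f}  std error={std_error:.4f}")
```

The mean and standard deviation hover around their true values at every n, while the standard error shrinks roughly in proportion to 1 over the square root of n.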

114
00:10:10,160 --> 00:10:17,610
"The supermarket dataset is weird." Yes, it
is weird: it's intended to be weird.

115
00:10:17,610 --> 00:10:25,960
Actually, in the supermarket dataset, each
instance represents a supermarket trolley,

116
00:10:25,960 --> 00:10:30,450
and, instead of putting a 0 for every item
you don't buy -- of course, when we go to

117
00:10:30,450 --> 00:10:36,660
the supermarket, we don't buy most of the
items in the supermarket -- the ARFF file

118
00:10:36,660 --> 00:10:39,800
codes that as a question mark, which stands
for "missing value".

119
00:10:39,800 --> 00:10:43,380
We're going to discuss missing values in Class 5.

120
00:10:44,320 --> 00:10:49,990
This dataset is suitable for association rule
learning, which we're not doing in this course.

121
00:10:49,990 --> 00:10:54,570
The message I'm trying to emphasize here is
that you need to understand what you're doing,

122
00:10:54,570 --> 00:10:57,220
not just process datasets blindly.

123
00:10:57,220 --> 00:10:59,250
Yes, it is weird.

124
00:11:00,520 --> 00:11:06,990
There's been some discussion on the mailing
list about cross-validation and the extra model.

125
00:11:06,990 --> 00:11:10,500
When you do cross-validation, you're trying
to do two things.

126
00:11:10,500 --> 00:11:19,540
You're trying to get an estimate of the expected
accuracy of a classifier, and you're trying

127
00:11:19,540 --> 00:11:21,930
to actually produce a really good classifier.

128
00:11:21,930 --> 00:11:27,090
To produce a really good classifier to use
in the future, you want to use the entire

129
00:11:27,090 --> 00:11:30,880
training set to train up the classifier.

130
00:11:30,880 --> 00:11:35,070
To get an estimate of its accuracy, however,
you can't do that unless you have an independent

131
00:11:35,070 --> 00:11:36,680
test set.

132
00:11:36,680 --> 00:11:44,190
So cross-validation takes 90% for training
and 10% for testing, repeats that 10 times,

133
00:11:44,190 --> 00:11:46,700
and averages the results to get an estimate.

134
00:11:46,700 --> 00:11:53,440
Once you've got the estimate, if you want
an actual classifier to use, the best classifier

135
00:11:53,440 --> 00:11:56,960
is one built on the full training set.
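A minimal sketch of the two jobs described here: ten models built only to estimate accuracy, plus one extra model built on all the data. The toy nearest-class-mean classifier and the data are invented for illustration, and the folds are taken in simple round-robin order with no shuffling or stratification, so this is not how Weka itself partitions the data.

```python
import statistics

def train(instances):
    """Toy classifier: remember the mean feature value of each class."""
    by_class = {}
    for x, label in instances:
        by_class.setdefault(label, []).append(x)
    return {label: statistics.mean(xs) for label, xs in by_class.items()}

def predict(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

def cross_validate(instances, folds=10):
    """Estimate accuracy by averaging over held-out folds."""
    accuracies = []
    for i in range(folds):
        test = instances[i::folds]  # every folds-th instance, starting at i
        training = [inst for j, inst in enumerate(instances) if j % folds != i]
        model = train(training)
        correct = sum(predict(model, x) == label for x, label in test)
        accuracies.append(correct / len(test))
    return statistics.mean(accuracies)

# Hypothetical data: class "a" clusters near 2, class "b" near 12.
data = [(float(i % 5), "a") for i in range(50)]
data += [(10.0 + i % 5, "b") for i in range(50)]

estimate = cross_validate(data)  # ten models, used only for the estimate
final_model = train(data)        # the extra model, built on all the data
print(f"estimated accuracy: {estimate:.2f}")
print(final_model)
```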

136
00:11:56,960 --> 00:12:00,760
The same is true with a percentage split option.

137
00:12:00,760 --> 00:12:05,190
Weka will evaluate the percentage split, but
then it will print the classifier that it

138
00:12:05,190 --> 00:12:10,600
produces from the entire training set to give
you a classifier to use on your problem in

139
00:12:10,600 --> 00:12:11,410
the future.

140
00:12:12,920 --> 00:12:16,310
There's been a little bit of discussion on
advanced stuff.

141
00:12:16,310 --> 00:12:19,570
I think maybe a follow-up course might be
a good idea here.

142
00:12:19,570 --> 00:12:24,430
Someone noticed that if you apply a filter
to the training set, you need to apply exactly

143
00:12:24,430 --> 00:12:28,690
the same filter to the test set, which is
sometimes a bit difficult to do, particularly

144
00:12:28,690 --> 00:12:33,220
if the training and test sets are produced
by cross-validation.

145
00:12:33,220 --> 00:12:40,010
There's an advanced classifier called the
"FilteredClassifier" which addresses that problem.
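A sketch of the underlying idea only, not of Weka's implementation: the filter's parameters are learned from the training data and then reused unchanged on the test data. Standardization is an assumed example of a filter, and the function names are made up for illustration.

```python
import statistics

def fit_standardizer(training_values):
    """Learn the filter's parameters from the training data only."""
    return statistics.mean(training_values), statistics.pstdev(training_values)

def apply_standardizer(params, values):
    """Apply the already-fitted filter to any data, training or test."""
    mean, sd = params
    return [(v - mean) / sd for v in values]

training = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

params = fit_standardizer(training)  # fitted once, on the training data
filtered_training = apply_standardizer(params, training)
filtered_test = apply_standardizer(params, test)  # same parameters reused
print(params)
print(filtered_test)
```

In cross-validation the fitting step has to be repeated inside every fold, which is the bookkeeping that wrapping the filter and the classifier together takes care of.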

146
00:12:40,010 --> 00:12:45,160
In his response to a question on the supermarket
dataset, Peter mentioned "unbalanced" datasets,

147
00:12:45,160 --> 00:12:47,110
and the cost of different kinds of error.

148
00:12:47,110 --> 00:12:51,900
This is something that Weka can take into
account with a cost-sensitive evaluation,

149
00:12:51,900 --> 00:12:58,090
and there is a classifier called the CostSensitiveClassifier
that allows you to do that.
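A minimal sketch of what a cost-sensitive evaluation computes: each cell of the confusion matrix is weighted by the cost of that kind of error. The counts and the cost values are hypothetical, and the function is an illustration rather than Weka's own code.

```python
def total_cost(confusion, costs):
    """Sum of counts weighted by the cost of each (actual, predicted) pair."""
    size = len(confusion)
    return sum(
        confusion[actual][predicted] * costs[actual][predicted]
        for actual in range(size)
        for predicted in range(size)
    )

# Hypothetical two-class results: rows are actual classes, columns predicted.
confusion = [[90, 10],
             [5, 95]]

equal_costs = [[0, 1],
               [1, 0]]     # every error costs 1, so cost = number of errors
unequal_costs = [[0, 1],
                 [10, 0]]  # assumed: missing class 1 is ten times worse

cost_equal = total_cost(confusion, equal_costs)
cost_unequal = total_cost(confusion, unequal_costs)
print(cost_equal)
print(cost_unequal)
```

With equal costs the total is just the error count; with unequal costs the five misses on the second class dominate, even though they are the rarer error.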

150
00:12:58,090 --> 00:13:03,490
Finally, someone just asked a question on
attribute selection: how do you select a good

151
00:13:03,490 --> 00:13:09,050
subset of attributes? Excellent question!
There's a whole attribute selection panel,

152
00:13:09,050 --> 00:13:11,490
which we're not able to talk about in this
MOOC.

153
00:13:11,490 --> 00:13:15,100
This is just an introductory MOOC on Weka.

154
00:13:15,100 --> 00:13:20,680
Maybe we'll come up with an advanced, follow-up
MOOC where we're able to discuss some of these

155
00:13:20,680 --> 00:13:22,030
more advanced issues.

156
00:13:23,340 --> 00:13:24,400
That's it.

157
00:13:24,400 --> 00:13:29,940
I just want to finish with a picture that
someone sent in of two wekas in an enclosure.

158
00:13:30,610 --> 00:13:36,350
It's rare to see wekas in the wild -- I've seen
them a couple of times myself, but not very often.

159
00:13:36,350 --> 00:13:43,270
More likely, to see a weka you need to go
to a place where they keep captured wekas

160
00:13:43,270 --> 00:13:45,170
for you to look at.

161
00:13:45,170 --> 00:13:48,450
Here are two wekas that Leah from Vancouver
sent in.

162
00:13:50,000 --> 00:13:50,980
That's it.

163
00:13:50,980 --> 00:13:55,240
Class 3 is up now, and off you go with
Class 3.

164
00:13:55,240 --> 00:13:56,960
Good luck! We'll talk to you later.

165
00:13:56,960 --> 00:13:58,100
Bye for now!