1
00:00:18,000 --> 00:00:23,040
Hi! We've just finished Class 3, and here
are some of the issues that arose.

2
00:00:23,040 --> 00:00:27,949
I have a list of them here, so let's start
at the top.

3
00:00:27,949 --> 00:00:34,800
Numeric precision in activities has caused
a little bit of unnecessary angst.

4
00:00:34,800 --> 00:00:36,570
So we've simplified our policy.

5
00:00:36,570 --> 00:00:41,440
In general, we're asking you to round your
percentages to the nearest integer.

6
00:00:41,440 --> 00:00:51,940
We certainly don't want you typing in those
4 decimal places, because that accuracy is misleading.

7
00:00:51,940 --> 00:00:54,250
Some people are getting the wrong results
in Weka.

8
00:00:54,250 --> 00:00:59,309
One reason you might get the wrong results
is that the random seed is not set to the

9
00:00:59,309 --> 00:01:00,559
default value.

10
00:01:00,559 --> 00:01:04,670
Whenever you change the random seed, it stays
there until you change it back or until you

11
00:01:04,670 --> 00:01:06,960
restart Weka.

12
00:01:06,960 --> 00:01:10,960
Just restart Weka or reset the random seed
to 1.

13
00:01:10,960 --> 00:01:14,420
Another thing you should do is check your
version of Weka.

14
00:01:14,420 --> 00:01:17,820
We asked you to download 3.6.10.

15
00:01:17,820 --> 00:01:22,450
There have been some bug fixes since the previous
version, so you really do need to use this

16
00:01:22,450 --> 00:01:25,590
new version.

17
00:01:25,590 --> 00:01:33,560
One of the activities asked you to copy an
attribute, and some people found some surprising

18
00:01:33,560 --> 00:01:37,840
things with Weka claiming 100% accuracy.

19
00:01:37,840 --> 00:01:43,970
If you accidentally ask Weka to predict something
that's already there as an attribute, it will

20
00:01:43,970 --> 00:01:49,780
do very well, with very high accuracy! It's
very easy to mislead yourself when you're

21
00:01:49,780 --> 00:01:51,039
doing data mining.

22
00:01:51,039 --> 00:01:56,530
You just need to make sure you know what you're
trying to predict, you know what the attributes

23
00:01:56,530 --> 00:02:02,640
are, and you haven't accidentally included
a copy of the class attribute as one of the

24
00:02:02,640 --> 00:02:05,720
attributes that's being used for prediction.

25
00:02:07,480 --> 00:02:11,770
There's been some discussion on the mailing
list about whether OneR is really always better

26
00:02:11,770 --> 00:02:13,819
than ZeroR on the training set.

27
00:02:13,819 --> 00:02:16,140
In fact, it is.

28
00:02:16,140 --> 00:02:16,939
Someone proved it.

29
00:02:16,939 --> 00:02:22,579
(Thank you, Jurek, for sharing that proof with
us.)

30
00:02:22,579 --> 00:02:29,370
Someone else found a counterexample! "If we
had a dataset with 10 instances, 6 belonging

31
00:02:29,370 --> 00:02:36,370
to Class A and 4 belonging to Class B, with
attribute values selected randomly, wouldn't

32
00:02:36,450 --> 00:02:43,450
ZeroR outperform OneR? -- OneR would be fooled
by the randomness of attribute values."

33
00:02:45,250 --> 00:02:49,329
It's kind of anthropomorphic to talk about OneR
being "fooled by" things.

34
00:02:49,329 --> 00:02:50,920
It's not fooled by anything.

35
00:02:50,920 --> 00:02:55,379
It's not a person; it's not a being: it's
just an algorithm.

36
00:02:55,379 --> 00:03:01,400
It just gets an input and does its thing with
the data.

37
00:03:01,400 --> 00:03:08,300
If you think that OneR might be fooled, then
why don't you try it? Set up this dataset

38
00:03:08,300 --> 00:03:14,100
with 10 instances, 6 in A and 4 in B, select
the attributes randomly, and see what happens.

39
00:03:14,109 --> 00:03:19,030
I think you'll be able to convince yourself
quite easily that this counterexample isn't

40
00:03:19,030 --> 00:03:20,489
a counterexample at all.

41
00:03:20,489 --> 00:03:26,129
It is definitely true that OneR is always
better than ZeroR on the training set.

42
00:03:26,129 --> 00:03:33,129
That doesn't necessarily mean it's going to
be better on an independent test set, of course.
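
The experiment suggested above can be sketched in a few lines of Python. This is a simplified stand-in rather than Weka's OneR implementation: it assumes nominal attributes only, and the attribute values ("x", "y", "z"), the number of attributes, and the function names are all invented for illustration. It counts how often the best single-attribute rule scores lower than the majority-class baseline on the training data.

```python
import random
from collections import Counter, defaultdict


def zero_r_accuracy(labels):
    """Training accuracy of always predicting the majority class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def one_r_accuracy(rows, labels):
    """Training accuracy of the best single-attribute rule (nominal attributes)."""
    best_correct = 0
    for a in range(len(rows[0])):
        # For each value of attribute a, count how often each class occurs.
        by_value = defaultdict(Counter)
        for row, label in zip(rows, labels):
            by_value[row[a]][label] += 1
        # The rule predicts the majority class for each attribute value.
        correct = sum(c.most_common(1)[0][1] for c in by_value.values())
        best_correct = max(best_correct, correct)
    return best_correct / len(labels)


def run_trials(n_trials=1000, n_attributes=3, values=("x", "y", "z"), seed=1):
    """Count trials in which OneR does worse than ZeroR on the training set."""
    rng = random.Random(seed)
    labels = ["A"] * 6 + ["B"] * 4  # 6 instances in class A, 4 in class B
    worse = 0
    for _ in range(n_trials):
        rows = [tuple(rng.choice(values) for _ in range(n_attributes))
                for _ in labels]
        if one_r_accuracy(rows, labels) < zero_r_accuracy(labels):
            worse += 1
    return worse


print(run_trials())  # prints 0
```

The count is 0 because, for any attribute, adding up the majority count within each attribute value can never give less than the overall majority count. In this sketch the two methods can tie on the training set, but OneR never comes out behind.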

43
00:03:33,780 --> 00:03:40,780
The next thing is Activity 3.3, which asks
you to repeat attributes with NaiveBayes.

44
00:03:41,370 --> 00:03:48,280
Some people asked "why are we doing this?"
It's just an exercise! We're just trying to

45
00:03:48,280 --> 00:03:54,540
understand NaiveBayes a bit better, and what
happens when you get highly correlated attributes,

46
00:03:54,540 --> 00:03:57,790
like repeated attributes.

47
00:03:57,790 --> 00:04:03,010
With NaiveBayes, enough repetitions mean that
the other attributes won't matter at all.

48
00:04:03,010 --> 00:04:08,129
This is because all attributes contribute
equally to the decision, so multiple copies

49
00:04:08,129 --> 00:04:10,469
of an attribute skew it in that direction.

50
00:04:10,469 --> 00:04:13,639
This is not true with other learning algorithms.

51
00:04:13,639 --> 00:04:17,870
It's true for NaiveBayes, but it's not true
for OneR or J48, for example.

52
00:04:17,870 --> 00:04:22,360
Copied attributes don't affect OneR at all.

53
00:04:22,360 --> 00:04:27,000
The copying exercise is just to illustrate
what happens with NaiveBayes when you have

54
00:04:27,000 --> 00:04:28,280
non-independent attributes.

55
00:04:28,280 --> 00:04:30,770
It's not something you do in real life.

56
00:04:30,770 --> 00:04:37,770
Although you might copy an attribute in order
to transform it in some way, for example.

57
00:04:38,750 --> 00:04:40,449
Someone asked about the mathematics.

58
00:04:40,449 --> 00:04:50,090
In Bayes' formula you get Pr[E|H]^k, if the
attribute was repeated k times, in the top line.

59
00:04:50,090 --> 00:04:57,720
How does this work mathematically? First of
all, I'd just like to say that the Bayes formulation

60
00:04:57,720 --> 00:05:01,750
assumes independent attributes.

61
00:05:01,750 --> 00:05:05,509
Bayes expansion is not true if the attributes
are dependent.

62
00:05:05,509 --> 00:05:09,349
But the algorithm works off that, so let's
see what would happen.

63
00:05:09,349 --> 00:05:17,900
If you can stomach a bit of mathematics, here's
the equation for the probability of the hypothesis

64
00:05:17,900 --> 00:05:19,490
given the evidence (Pr[H|E]).

65
00:05:19,490 --> 00:05:24,610
H might be Play is "yes" or Play is "no",
for example, in the weather data.

66
00:05:24,610 --> 00:05:28,310
It's equal to this fairly complicated formula
at the top, which, let me just simplify it

67
00:05:28,310 --> 00:05:32,360
by writing "..." for all the bits after here.

68
00:05:32,360 --> 00:05:44,370
So Pr[E1|H]^k, where E1 is repeated k times,
times all the other stuff, divided by Pr[E].

69
00:05:44,370 --> 00:05:51,349
What the algorithm does: because we don't
know Pr[E], we normalize the 2 probabilities

70
00:05:51,349 --> 00:06:00,340
by calculating Pr[yes|E] using this formula
and Pr[no|E], and normalizing them so that

71
00:06:00,340 --> 00:06:02,599
they add up to 1.

72
00:06:02,599 --> 00:06:12,400
That then computes Pr[yes|E] as this thing
here -- which is at the top, up here -- Pr[E1|yes]^k,

73
00:06:12,400 --> 00:06:17,099
divided by that same thing, plus the corresponding
thing for "no".

74
00:06:17,099 --> 00:06:23,580
If you look at this formula and just forget
about the "...", what's going to happen is

75
00:06:23,580 --> 00:06:26,259
that these probabilities are less than 1.

76
00:06:26,259 --> 00:06:31,240
If we take them to the k'th power, they are
going to get very small as k gets bigger.

77
00:06:31,240 --> 00:06:33,159
In fact, they're going to approach 0.

78
00:06:33,159 --> 00:06:37,389
But one of them is going to approach 0 faster
than the other one.

79
00:06:37,389 --> 00:06:42,610
Whichever one is bigger -- for example, if
the "yes" one is bigger than the "no" one

80
00:06:42,610 --> 00:06:45,849
-- then it's going to dominate.

81
00:06:45,849 --> 00:06:53,690
The normalized probability then is going to
be 1 if the "yes" probability is bigger than

82
00:06:53,699 --> 00:06:57,050
the "no" probability, otherwise 0.

83
00:06:57,050 --> 00:07:01,569
That's what's actually going to happen in
this formula as k approaches infinity.

84
00:07:01,569 --> 00:07:05,909
The result is as though there is only one
attribute: E1.

85
00:07:05,909 --> 00:07:11,300
That's a mathematical explanation of what
happens when you copy attributes in NaiveBayes.
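
A small numerical sketch of the argument above, with invented numbers: Pr[E1|yes] = 0.6 and Pr[E1|no] = 0.4, while the "..." part (the prior and all other attributes) is set to favor "no". The function name and every value here are assumptions chosen only to illustrate the normalization step.

```python
def normalized_yes(p_e1_yes, p_e1_no, rest_yes, rest_no, k):
    """Pr[yes|E] after normalization when attribute E1 is repeated k times.

    rest_yes and rest_no stand for the "..." in the formula: the prior
    times the likelihoods of all the other attributes.
    """
    yes = p_e1_yes ** k * rest_yes
    no = p_e1_no ** k * rest_no
    return yes / (yes + no)


# Hypothetical values: E1 favors "yes", everything else favors "no".
for k in (1, 2, 5, 10, 20, 100):
    print(k, normalized_yes(0.6, 0.4, 0.01, 0.05, k))
```

With a single copy of E1 the other attributes win and Pr[yes|E] is below 0.5. As k grows, both terms shrink toward 0, but the "no" term shrinks faster, so the normalized value climbs toward 1 and the other attributes stop mattering.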

86
00:07:12,380 --> 00:07:18,300
Don't worry if you didn't follow that; that
was just for someone who asked.

87
00:07:19,650 --> 00:07:24,729
Decision trees and bits.

88
00:07:24,729 --> 00:07:28,069
Someone said on the mailing list that in the
lecture there was a condition that resulted

89
00:07:28,069 --> 00:07:32,979
in branches with all "yes" or all "no" results
completely determining things.

90
00:07:32,979 --> 00:07:39,639
Why was the information gain only [0.971] and
not the full 1 bit? This is the picture they

91
00:07:39,639 --> 00:07:41,159
were talking about.

92
00:07:41,159 --> 00:07:48,280
Here, "humidity" determines these are all "no" and these
are all "yes" for high and normal humidity, respectively.

93
00:07:48,280 --> 00:07:57,520
When you calculate the information gain -- and
this is the formula for information gain -- you

94
00:07:57,520 --> 00:08:00,870
get 0.971 bits.

95
00:08:00,870 --> 00:08:06,789
You might expect 1 (and I would agree), and
you would get 1 if you had 3 no's and 3 yes's

96
00:08:06,789 --> 00:08:11,280
here, or if you had 2 no's and 2 yes's.

97
00:08:11,280 --> 00:08:16,030
But because there is a slight imbalance between
the number of no's and the number of yes's,

98
00:08:16,030 --> 00:08:20,819
you don't actually get 1 bit under these circumstances.
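
The figure can be checked with a short calculation. The counts below (3 no's under high humidity, 2 yes's under normal humidity) are an assumption: they are not stated in the transcript, but they match the 0.971 value and the "slight imbalance" described. The function names are illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def info_gain(parent, children):
    """Entropy of the parent node minus the weighted entropy of its children."""
    total = sum(parent)
    remainder = sum(sum(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - remainder


# Assumed node: 2 yes and 3 no, split perfectly by humidity.
print(round(info_gain([2, 3], [[0, 3], [2, 0]]), 3))  # 0.971

# A perfectly balanced node (3 yes, 3 no) gives the full bit.
print(info_gain([3, 3], [[0, 3], [3, 0]]))  # 1.0
```

Both splits are perfect, so the children have zero entropy and the gain equals the entropy of the parent node. That entropy reaches 1 bit only when the two classes are equally frequent.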

99
00:08:23,940 --> 00:08:30,750
There were some questions on Class 2 about
stratified cross-validation, which tries to

100
00:08:30,750 --> 00:08:34,409
get the same proportion of class values in
each fold.

101
00:08:34,409 --> 00:08:39,169
Some suggested maybe you should choose the
number of folds so that it can do this exactly,

102
00:08:39,169 --> 00:08:40,690
instead of approximately.

103
00:08:40,690 --> 00:08:47,590
If you chose as the number of folds an exact
divisor of the number of elements in each class,

104
00:08:47,590 --> 00:08:50,380
we'd be able to do this exactly.

105
00:08:50,380 --> 00:08:52,320
"Would that be a good thing to do?" was the
question.

106
00:08:53,480 --> 00:08:55,000
The answer is no, not really.

107
00:08:55,000 --> 00:08:59,960
These things are all estimates, and you're
treating them as though they were exact answers.

108
00:08:59,960 --> 00:09:01,200
They are all just estimates.

109
00:09:01,200 --> 00:09:05,140
There are more important considerations to
take into account when determining the number

110
00:09:05,140 --> 00:09:07,760
of folds to do in your cross-validation.

111
00:09:07,760 --> 00:09:13,670
Like: you want a large enough test set to
get an accurate estimate of the classification

112
00:09:13,860 --> 00:09:20,340
performance, and you want a large enough training
set to train the classifier adequately.

113
00:09:20,340 --> 00:09:23,020
Don't worry about stratification being approximate.

114
00:09:23,020 --> 00:09:28,100
The whole thing is pretty approximate actually.

115
00:09:28,100 --> 00:09:33,350
Someone else asked "Why is there a 'Use training
set' option on the Classify tab?"

116
00:09:33,350 --> 00:09:40,100
It's very misleading to take the evaluation
you get on the training data seriously, as

117
00:09:40,100 --> 00:09:40,890
we know.

118
00:09:40,890 --> 00:09:46,270
So why is it there in Weka? Well, we might
want it for some purposes.

119
00:09:46,270 --> 00:09:51,480
For example, it does give you a quick upper
bound on an algorithm's performance: it couldn't

120
00:09:51,480 --> 00:09:54,210
possibly do better than it would do on the
training set.

121
00:09:54,210 --> 00:09:59,630
That might be useful, allowing you to quickly
reject a learning algorithm.

122
00:09:59,630 --> 00:10:04,130
The important thing here is to understand
what is wrong with using the training set

123
00:10:04,130 --> 00:10:08,120
for a performance estimate, and what overfitting
is.

124
00:10:08,120 --> 00:10:13,930
Rather than changing the interface so you
can't do bad things, I would rather protect

125
00:10:13,930 --> 00:10:18,190
you by educating you about what the issues
are here.

126
00:10:20,340 --> 00:10:24,940
There have been quite a few suggested topics
for a follow-up course: attribute selection,

127
00:10:24,940 --> 00:10:30,900
clustering, the Experimenter, parameter optimization,
the KnowledgeFlow interface, and simple command

128
00:10:30,900 --> 00:10:32,390
line interface.

129
00:10:32,390 --> 00:10:36,660
We're considering a follow-up course, and we'll
be asking you for feedback on that at the

130
00:10:36,660 --> 00:10:38,600
end of this course.

131
00:10:38,600 --> 00:10:44,480
Finally, someone said "Please let me know
if there is a way to make a small donation"

132
00:10:44,480 --> 00:10:47,540
-- he's enjoying the course so much! Well,
thank you very much.

133
00:10:47,540 --> 00:10:52,880
We'll make sure there is a way to make a small
donation at the end of the course.

134
00:10:52,880 --> 00:10:53,760
That's it for now.

135
00:10:53,760 --> 00:10:54,900
On with Class 4.

136
00:10:54,900 --> 00:10:58,500
I hope you enjoy Class 4, and we'll talk again
later.

137
00:10:58,500 --> 00:11:05,500
Bye for now!

