1
00:00:17,560 --> 00:00:21,910
Hi! This is Lesson 3.3 on using probabilities.

2
00:00:21,910 --> 00:00:26,160
It's the one bit of Data Mining with Weka
where we're going to see a little bit of mathematics,

3
00:00:26,160 --> 00:00:31,609
but don't worry, I'll take you through it
gently.

4
00:00:31,609 --> 00:00:36,879
The OneR strategy that we've just been studying
assumes that there is one of the attributes

5
00:00:36,879 --> 00:00:40,930
that does all the work, that takes the responsibility
of the decision.

6
00:00:41,520 --> 00:00:43,120
That's a simple strategy.

7
00:00:43,120 --> 00:00:48,420
Another simple strategy is the opposite, to
assume all of the attributes contribute equally

8
00:00:48,420 --> 00:00:51,659
and independently to the decision.

9
00:00:51,659 --> 00:00:54,229
This is called the "Naive Bayes" method --

10
00:00:54,229 --> 00:00:55,869
I'll explain the name later on.

11
00:00:56,580 --> 00:01:02,159
There are two assumptions that underlie Naive
Bayes: that the attributes are equally important

12
00:01:02,159 --> 00:01:05,070
and that they are statistically independent,

13
00:01:05,070 --> 00:01:09,909
that is, knowing the value of one of the attributes
doesn't tell you anything about the value

14
00:01:09,909 --> 00:01:12,619
of any of the other attributes.

15
00:01:12,619 --> 00:01:17,780
This independence assumption is never actually
correct, but the method based on it often

16
00:01:17,780 --> 00:01:23,509
works well in practice.

17
00:01:23,509 --> 00:01:30,159
There's a theorem in probability called "Bayes
Theorem" after this guy Thomas Bayes from the

18
00:01:30,159 --> 00:01:33,030
18th century.

19
00:01:33,030 --> 00:01:39,369
It's about the probability of a hypothesis
H given evidence E.

20
00:01:39,369 --> 00:01:46,100
In our case, the hypothesis is the class of
an instance and the evidence is the attribute

21
00:01:46,100 --> 00:01:48,899
values of the instance.

22
00:01:48,899 --> 00:01:55,319
The theorem is that Pr[H|E] -- the probability of the class
given the instance, the hypothesis

23
00:01:55,319 --> 00:02:02,109
given the evidence -- is equal to Pr[E|H] times Pr[H] divided

24
00:02:02,109 --> 00:02:06,119
by Pr[E].

25
00:02:06,119 --> 00:02:13,119
Pr[H] by itself is called the [prior] probability
of the hypothesis H.

26
00:02:13,290 --> 00:02:18,480
That's the probability of the event before
any evidence is seen.

27
00:02:18,480 --> 00:02:22,800
That's really the baseline probability of
the event.

28
00:02:22,800 --> 00:02:29,370
For example, in the weather data, I think
there are 9 yeses and 5 nos, so the baseline

29
00:02:29,370 --> 00:02:38,280
probability of the hypothesis "play equals
yes" is 9/14 and "play equals no" is 5/14.

30
00:02:38,280 --> 00:02:44,920
What this equation says is how to update that
probability Pr[H] when you see some evidence,

31
00:02:44,920 --> 00:02:51,340
to get what's called the "a posteriori" probability
of H, that means after the evidence.

32
00:02:51,340 --> 00:02:58,340
The evidence in our case is the attribute
values of an unknown instance. That's E.

33
00:03:01,159 --> 00:03:02,129
That's Bayes Theorem.

34
00:03:02,129 --> 00:03:08,430
Now, what makes this method "naive"? The naive
assumption is -- I've said it before -- that the

35
00:03:08,430 --> 00:03:13,140
evidence splits into parts that are statistically
independent.

36
00:03:13,140 --> 00:03:19,390
The parts of the evidence in our case are
the four different attribute values in the

37
00:03:19,390 --> 00:03:20,950
weather data.

38
00:03:20,950 --> 00:03:28,280
When you have independent events, the probabilities
multiply, so Pr[H|E],

39
00:03:28,280 --> 00:03:33,719
according to the top equation, is the product
of Pr[E|H] times the prior probability

40
00:03:33,719 --> 00:03:37,379
Pr[H] divided by Pr[E].

41
00:03:37,379 --> 00:03:43,079
Pr[E|H] splits up into
these parts: Pr[E1|H],

42
00:03:43,079 --> 00:03:48,030
the first attribute value; Pr[E2|H],
the second attribute value; and so on for all

43
00:03:48,030 --> 00:03:51,030
of the attributes.
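
Written as equations, the two statements just described look roughly like this. It is a sketch of the spoken description, where the symbol n (the number of attributes, 4 in the weather data) is introduced here only for illustration:

```latex
\Pr[H \mid E] = \frac{\Pr[E \mid H]\,\Pr[H]}{\Pr[E]},
\qquad
\Pr[E \mid H] = \Pr[E_1 \mid H] \times \Pr[E_2 \mid H] \times \cdots \times \Pr[E_n \mid H]
```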

44
00:03:51,030 --> 00:03:56,650
That's maybe a bit abstract, so let's look at
the actual weather data.

45
00:03:56,650 --> 00:03:59,829
On the right-hand side is the weather data.

46
00:03:59,829 --> 00:04:03,930
In the large table at the top, we've taken
each of the attributes.

47
00:04:03,930 --> 00:04:09,799
Let's start with "outlook". Under the "yes" hypothesis and the "no" hypothesis, we've looked at

48
00:04:09,799 --> 00:04:11,959
how many times the outlook is "sunny".

49
00:04:11,959 --> 00:04:14,849
It's sunny twice under yes and 3 times under no.

50
00:04:14,849 --> 00:04:18,220
That comes straight from the data in the table.

51
00:04:18,220 --> 00:04:19,840
Overcast.

52
00:04:19,840 --> 00:04:25,120
When the outlook is overcast, it's always
a "yes" instance, so there were 4 of those,

53
00:04:25,120 --> 00:04:26,950
and zero "no" instances.

54
00:04:26,950 --> 00:04:31,250
Then, rainy is 3 "yes" instances and 2 "no"
instances.

55
00:04:31,250 --> 00:04:35,979
Those numbers just come straight from the
data table given the instance values.

56
00:04:35,979 --> 00:04:40,380
Then, we take those numbers and underneath
we make them into probabilities.

57
00:04:40,380 --> 00:04:43,259
Let's say we know the hypothesis.

58
00:04:43,259 --> 00:04:46,160
Let's say we know it's a "yes".

59
00:04:46,160 --> 00:04:52,960
Then the probability of it being "sunny" is
2/9ths, "overcast" is 4/9ths, and "rainy" 3/9ths,

60
00:04:52,960 --> 00:04:56,460
simply because when you add up 2 plus 4 plus
3 you get 9.

61
00:04:56,460 --> 00:04:59,400
Those are the probabilities.

62
00:04:59,400 --> 00:05:06,860
If we know that the outcome is "no", the probabilities
are "sunny" 3/5ths, "overcast" 0/5ths, and "rainy"

63
00:05:06,860 --> 00:05:08,340
2/5ths.

64
00:05:08,340 --> 00:05:10,169
That's for the "outlook" attribute.

65
00:05:11,740 --> 00:05:18,060
That's what we're looking for, you see, the
probability of each of these attribute values

66
00:05:18,060 --> 00:05:21,729
given the hypothesis H.
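
The step from counts to probabilities for "outlook" can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the slide; the names `outlook_counts` and `conditional_probs` are made up for this example and have nothing to do with Weka.

```python
from fractions import Fraction

# Outlook counts per class, as read off the weather-data table.
outlook_counts = {
    "yes": {"sunny": 2, "overcast": 4, "rainy": 3},
    "no": {"sunny": 3, "overcast": 0, "rainy": 2},
}

def conditional_probs(counts):
    """Turn the value counts for one class into Pr[value | class]."""
    total = sum(counts.values())
    return {value: Fraction(n, total) for value, n in counts.items()}

outlook_probs = {cls: conditional_probs(c) for cls, c in outlook_counts.items()}

print(outlook_probs["yes"]["sunny"])    # 2/9
print(outlook_probs["no"]["overcast"])  # 0
```

Fractions are used rather than floats so the values match the ones read out above (note that 3/9 is shown in lowest terms as 1/3).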

67
00:05:21,729 --> 00:05:25,889
The next attribute is temperature, and we
just do the same thing with that to get the

68
00:05:25,889 --> 00:05:30,729
probabilities of the 3 values -- hot, mild,
and cool -- under the "yes" hypothesis or the

69
00:05:30,729 --> 00:05:32,199
"no" hypothesis.

70
00:05:32,199 --> 00:05:39,960
The same with humidity and windy. Play,
that's the prior probability -- Pr[H].

71
00:05:39,960 --> 00:05:45,669
It's "yes" 9/14ths of the time, "no" 5/14ths of the
time, even if you don't know anything about

72
00:05:45,669 --> 00:05:47,810
the attribute values.

73
00:05:47,810 --> 00:05:52,669
The equation we're looking at is this one
below, and we just need to work it out.

74
00:05:52,669 --> 00:05:54,090
Here's an example.

75
00:05:54,090 --> 00:05:56,970
Here's an unknown day, a new day.

76
00:05:56,970 --> 00:06:03,970
We don't know what the value of "play" is, but
we know it's sunny, cool, high, and windy.

77
00:06:05,280 --> 00:06:07,509
We can just multiply up these probabilities.

78
00:06:07,509 --> 00:06:13,819
If we multiply for the yes hypothesis, we
get 2/9ths times 3/9ths times 3/9ths times

79
00:06:13,819 --> 00:06:22,300
3/9ths -- those are just the numbers on the
previous slide: Pr[E1|H], Pr[E2|H], Pr[E3|H],

80
00:06:22,300 --> 00:06:28,400
Pr[E4|H] -- finally Pr[H], that is 9/14ths.

81
00:06:28,400 --> 00:06:36,560
That gives us a likelihood of 0.0053 when
you multiply them.

82
00:06:36,560 --> 00:06:43,560
Then, for the "no" class, we do the same to
get a likelihood of 0.0206.

83
00:06:44,120 --> 00:06:46,720
These numbers are not probabilities.

84
00:06:46,720 --> 00:06:48,129
Probabilities have to add up to 1.

85
00:06:48,129 --> 00:06:49,639
They are likelihoods.

86
00:06:49,639 --> 00:06:55,610
But we can get the probabilities from them
by using a straightforward technique of normalization.

87
00:06:55,610 --> 00:06:56,500
Take those likelihoods for "yes"

88
00:06:56,500 --> 00:07:02,440
and "no" and normalize them as shown below
to make them add up to 1.

89
00:07:02,440 --> 00:07:09,440
That's how we get the probability of "play"
on a new day with different attribute values.
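
The multiplication and normalization for the new day (sunny, cool, high, windy) can be checked with a short Python sketch. The "yes" factors and the two priors are the ones quoted above; the "no" factors for cool, high, and windy (1/5, 4/5, 3/5) are not read out in the lecture and are assumed here from the standard nominal weather data. They do reproduce the quoted likelihood of 0.0206.

```python
from fractions import Fraction as F

# Pr[E1|H] * Pr[E2|H] * Pr[E3|H] * Pr[E4|H] * Pr[H] for each class.
likelihood_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
# Assumed "no" factors: sunny 3/5, cool 1/5, high 4/5, windy 3/5, prior 5/14.
likelihood_no = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

# Normalize so the two values add up to 1; Pr[E] never has to be computed.
total = likelihood_yes + likelihood_no
p_yes = likelihood_yes / total
p_no = likelihood_no / total

print(round(float(likelihood_yes), 4), round(float(likelihood_no), 4))  # 0.0053 0.0206
print(round(float(p_yes), 3), round(float(p_no), 3))                    # 0.205 0.795
```

Under these assumptions "no" comes out as the more probable class for this day.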

90
00:07:10,030 --> 00:07:11,380
Just to go through that again.

91
00:07:11,380 --> 00:07:17,340
The evidence is "outlook" is "sunny", "temperature"
is "cool", "humidity" is "high", "windy" is "true" --

92
00:07:17,340 --> 00:07:19,550
and we don't know what play is.

93
00:07:19,550 --> 00:07:26,990
The [likelihood] of a "yes", given the evidence
is the product of those 4 probabilities -- one

94
00:07:26,990 --> 00:07:33,000
for outlook, temperature, humidity and windy
-- times the prior probability, which is

95
00:07:33,000 --> 00:07:37,000
just the baseline probability of a "yes".

96
00:07:37,000 --> 00:07:40,650
That product of fractions is divided by Pr[E].

97
00:07:40,650 --> 00:07:45,160
We don't know what Pr[E] is, but it doesn't
matter, because we can do the same calculation

98
00:07:45,160 --> 00:07:52,240
for Pr[no|E], which gives us another
equation just like this, and then we can calculate

99
00:07:52,240 --> 00:07:56,870
the actual probabilities by normalizing them
so that the two probabilities add up to 1.

100
00:07:56,870 --> 00:08:01,560
Pr[yes|E] plus Pr[no|E] equals 1.

101
00:08:02,220 --> 00:08:07,850
It's actually quite simple when you look at
it in numbers, and it's simple when you look

102
00:08:07,850 --> 00:08:09,660
at it in Weka, as well.

103
00:08:09,660 --> 00:08:15,490
I'm going to go to Weka here, and I'm going
to open the nominal weather data,

104
00:08:15,490 --> 00:08:19,920
which is here.

105
00:08:19,920 --> 00:08:22,540
We've seen that before, of course, many times.

106
00:08:22,540 --> 00:08:25,590
I'm going to go to Classify.

107
00:08:25,590 --> 00:08:29,150
I'm going to use the NaiveBayes method.

108
00:08:29,150 --> 00:08:30,800
It's under this bayes category here.

109
00:08:30,800 --> 00:08:34,280
There are a lot of implementations of different
variants of Bayes.

110
00:08:34,280 --> 00:08:38,240
I'm just going to use the straightforward
NaiveBayes method here.

111
00:08:38,650 --> 00:08:42,480
I'll just run it.

112
00:08:42,480 --> 00:08:43,960
This is what we get.

113
00:08:44,870 --> 00:08:48,170
There's the success probability, calculated according
to cross-validation.

114
00:08:48,170 --> 00:08:51,570
More interestingly, we get the model.

115
00:08:51,570 --> 00:08:56,900
The model is just like the table I showed
you before divided under the "yes" class and

116
00:08:56,900 --> 00:08:58,320
the "no" class.

117
00:08:58,320 --> 00:09:04,600
We've got the four attributes -- outlook,
temperature, humidity, and windy -- and then,

118
00:09:04,600 --> 00:09:10,020
for each of the attribute values, we've got
the number of times that attribute value appears.

119
00:09:10,630 --> 00:09:15,400
Now, there's one little and important difference
between this table and the one I showed you before.

121
00:09:15,420 --> 00:09:18,490
Let me go back to my slide and look at these
numbers.

122
00:09:18,490 --> 00:09:26,670
You can see that for outlook under "yes" on
my slide, I've got 2, 4, and 3, and Weka has

123
00:09:26,670 --> 00:09:29,410
got 3, 5, and 4.

124
00:09:29,410 --> 00:09:35,960
That's 1 more each time for a total of 12,
instead of a total of 9.

125
00:09:35,960 --> 00:09:39,410
Weka adds 1 to all of the counts.

126
00:09:39,410 --> 00:09:42,990
The reason it does this is to get
rid of the zeros.

127
00:09:42,990 --> 00:09:50,580
In the original table under outlook, under
"no", the probability of overcast given "no" is

128
00:09:50,580 --> 00:09:53,670
zero, and we're going to be multiplying that
into things.

129
00:09:53,670 --> 00:09:58,200
What that would mean in effect, if we took
that zero at face value, is that the probability

130
00:09:58,200 --> 00:10:06,050
of the class being "no" given any day for which
the outlook was overcast would be zero.

131
00:10:06,050 --> 00:10:09,230
Anything multiplied by zero is zero.

132
00:10:09,230 --> 00:10:13,970
These zeros in probability terms have sort
of a veto over all of the other numbers, and

133
00:10:13,970 --> 00:10:14,940
we don't want that.

134
00:10:14,940 --> 00:10:21,010
We don't want to categorically conclude that
it can't be a "no" day on the basis that it's overcast,

135
00:10:21,010 --> 00:10:25,590
and we've never seen an overcast outlook on
a "no" day before.

136
00:10:26,270 --> 00:10:30,800
That's called a "zero-frequency problem", and
Weka's solution -- the most common solution

137
00:10:30,800 --> 00:10:34,650
-- is very simple: we just add 1 to all the
counts.

138
00:10:34,650 --> 00:10:39,690
That's why all those numbers in the Weka table
are 1 bigger than the numbers in the table

139
00:10:39,690 --> 00:10:41,290
on the slide.

140
00:10:42,030 --> 00:10:45,540
Aside from that, it's all exactly the same.

141
00:10:45,540 --> 00:10:50,780
We're avoiding zero frequencies by effectively
starting all counts at 1 instead of starting

142
00:10:50,780 --> 00:10:56,480
them at 0, so they can't end up at 0.
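
Here is a minimal Python sketch of the add-1 idea, applied to the "outlook" counts from the slide. It illustrates the arithmetic only and is not Weka's actual code; the function name `smoothed_probs` is invented for this example.

```python
from fractions import Fraction

def smoothed_probs(counts):
    """Add 1 to every count, then normalize to get Pr[value | class]."""
    smoothed = {value: n + 1 for value, n in counts.items()}
    total = sum(smoothed.values())
    return {value: Fraction(n, total) for value, n in smoothed.items()}

outlook_yes = {"sunny": 2, "overcast": 4, "rainy": 3}
outlook_no = {"sunny": 3, "overcast": 0, "rainy": 2}

# Counts become 3, 5, 4 out of 12 under "yes", matching the Weka table.
print(smoothed_probs(outlook_yes)["overcast"])  # 5/12
# The zero under "no" becomes 1 out of 8, so it no longer vetoes the product.
print(smoothed_probs(outlook_no)["overcast"])   # 1/8
```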

143
00:10:57,090 --> 00:10:59,480
That's the Naive Bayes method.

144
00:10:59,480 --> 00:11:04,210
The assumption is that all attributes contribute
equally and independently to the outcome.

145
00:11:04,210 --> 00:11:09,710
That works surprisingly well, even in situations
where the independence assumption is clearly violated.

146
00:11:11,040 --> 00:11:13,520
Why does it work so well when the assumption
is wrong?

147
00:11:13,520 --> 00:11:15,450
That's a good question.

148
00:11:15,450 --> 00:11:19,170
Basically, classification doesn't need accurate
probability estimates.

149
00:11:19,170 --> 00:11:25,110
We're just going to choose as the class the
outcome with the largest probability.

150
00:11:25,110 --> 00:11:29,600
As long as the greatest probability is assigned
to the correct class, it doesn't matter if

151
00:11:29,600 --> 00:11:33,540
the probability estimates are all that accurate.

152
00:11:33,540 --> 00:11:38,330
This actually means that if you add redundant
attributes you get problems with Naive Bayes.

153
00:11:38,330 --> 00:11:44,630
The extreme case of dependence is where two
attributes have the same values, identical

154
00:11:44,630 --> 00:11:46,160
attributes.

155
00:11:46,160 --> 00:11:49,780
That will cause havoc with the Naive Bayes
method.
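
To see why an identical attribute causes trouble, here is a hedged Python sketch that reuses the sunny, cool, high, windy day and then adds a hypothetical exact copy of "outlook". The copy carries no new information, yet its factor gets multiplied in a second time. The "no" factors for cool, high, and windy are assumed from the standard nominal weather data, and `prob_yes` is a name invented for this example.

```python
from fractions import Fraction as F

PRIOR_YES, PRIOR_NO = F(9, 14), F(5, 14)

def prob_yes(factors_yes, factors_no):
    """Multiply the per-attribute factors into each prior, then normalize."""
    like_yes, like_no = PRIOR_YES, PRIOR_NO
    for f in factors_yes:
        like_yes *= f
    for f in factors_no:
        like_no *= f
    return like_yes / (like_yes + like_no)

# Factors for sunny, cool, high, windy=true under each class.
yes_factors = [F(2, 9), F(3, 9), F(3, 9), F(3, 9)]
no_factors = [F(3, 5), F(1, 5), F(4, 5), F(3, 5)]

original = prob_yes(yes_factors, no_factors)
# Hypothetical redundant attribute: an exact copy of "outlook" (sunny).
with_copy = prob_yes(yes_factors + [F(2, 9)], no_factors + [F(3, 5)])

print(round(float(original), 3))   # 0.205
print(round(float(with_copy), 3))  # 0.087
```

The evidence from "outlook" is counted twice, so the estimate is pushed further towards "no" even though nothing new was learned.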

156
00:11:49,780 --> 00:11:54,550
However, Weka contains methods for attribute
selection to allow you to select a subset

157
00:11:54,550 --> 00:12:00,100
of fairly independent attributes after which
you can safely use Naive Bayes.

158
00:12:01,610 --> 00:12:07,100
There's quite a bit of stuff on statistical
modeling in Section 4.2 of the course text.

159
00:12:07,890 --> 00:12:12,530
Now you need to go and do that activity.

160
00:12:12,530 --> 00:12:14,070
See you soon!