1
00:00:17,789 --> 00:00:20,869
Hi! Welcome back to Data Mining with Weka.

2
00:00:20,869 --> 00:00:26,449
In the last lesson, we looked at classification
by regression, how to use linear regression

3
00:00:26,449 --> 00:00:33,079
to perform classification tasks. In this lesson
we're going to look at a more powerful way

4
00:00:33,079 --> 00:00:37,059
of doing the same kind of thing. It's called
"logistic regression". It's fairly mathematical,

5
00:00:37,059 --> 00:00:43,399
and we're not going to go into the dirty details
of how it works, but I'd like to give you

6
00:00:43,399 --> 00:00:48,879
a flavor of the kinds of things it does and
the basic principles that underlie logistic

7
00:00:48,879 --> 00:00:53,629
regression. Then, of course, you can use it
yourself in Weka without any problem.

8
00:00:55,560 --> 00:00:59,750
One of the things about data mining is that
you can sometimes do better by using prediction

9
00:00:59,750 --> 00:01:05,970
probabilities rather than actual classes.
Instead of predicting whether it's going to

10
00:01:05,970 --> 00:01:10,820
be a "yes" or a "no", you might do better
to predict the probability with which you

11
00:01:10,820 --> 00:01:15,790
think it's going to be a "yes" or a "no".
For example, the weather is 95% likely to

12
00:01:15,790 --> 00:01:21,420
be rainy tomorrow, or 72% likely to be sunny,
instead of saying it's definitely going to

13
00:01:21,420 --> 00:01:26,080
be rainy or it's definitely going to be sunny.

14
00:01:26,080 --> 00:01:32,110
Probabilities are really useful things in
data mining. NaiveBayes produces probabilities;

15
00:01:32,110 --> 00:01:36,360
it works in terms of probabilities. We've
seen that in an earlier lesson.

16
00:01:36,360 --> 00:01:43,360
I'm going to open diabetes and run NaiveBayes.

17
00:01:49,640 --> 00:01:55,660
I'm going to use a percentage split with 90%,

18
00:01:55,660 --> 00:02:07,280
so that leaves 10% as a test set. Then I'm
going to make sure I output the predictions

19
00:02:07,280 --> 00:02:14,280
on those 10%, and run it. I want to look at
the predictions that have been output.

20
00:02:14,960 --> 00:02:20,840
This is a 2-class dataset; the classes are tested_negative
and tested_positive, and these are the instances

21
00:02:20,840 --> 00:02:25,569
-- number 1, number 2, number 3, etc. This
is the actual class -- tested_negative, tested_positive,

22
00:02:25,569 --> 00:02:29,959
tested_negative, etc. This is the predicted
class -- tested_negative, tested_negative,

23
00:02:29,959 --> 00:02:34,819
tested_negative, tested_negative, etc. This
is a plus under the error column to say where

24
00:02:34,819 --> 00:02:41,459
there's an error, so there's an error with
instance number 2. These are the actual probabilities

25
00:02:41,459 --> 00:02:43,019
that come out of NaiveBayes.

26
00:02:43,019 --> 00:02:51,020
So for instance 1 we've got a 99% probability
that it's negative, and a 1% probability that

27
00:02:51,029 --> 00:02:56,340
it's positive. So we predict it's going to
be negative; that's why that's tested_negative.

28
00:02:56,340 --> 00:03:02,489
And in fact we're correct; it is tested_negative.
This instance, which is actually incorrect,

29
00:03:02,489 --> 00:03:07,909
we're predicting 67% for negative
and 33% for positive, so we decide it's a

30
00:03:07,909 --> 00:03:14,549
negative, and we're wrong. We might have been
better saying that here we're really sure

31
00:03:14,549 --> 00:03:18,760
it's going to be a negative, and we're right;
here we think it's going to be a negative,

32
00:03:18,760 --> 00:03:24,260
but we're not sure, and it turns out that
we're wrong. Sometimes it's a lot better to

33
00:03:24,260 --> 00:03:31,150
think in terms of the output as probabilities,
rather than being forced to make a binary,

34
00:03:31,150 --> 00:03:34,620
black-or-white classification.

35
00:03:34,620 --> 00:03:41,620
Other data mining methods produce probabilities,
as well. If I look at ZeroR, and run that,

36
00:03:46,689 --> 00:03:53,689
these are the probabilities -- 65% versus
35%. All of them are the same.

37
00:03:55,000 --> 00:04:00,650
Of course, it's ZeroR! -- it always produces the same
thing. In this case, it always says tested_negative

38
00:04:00,650 --> 00:04:05,699
and always has the same probabilities. The
reason why the numbers are like that, if you

39
00:04:05,699 --> 00:04:11,650
look at the slide here, is that we've chosen
a 90% training set and a 10% test set, and

40
00:04:11,650 --> 00:04:18,650
the training set contains 448 negative instances
and 243 positive instances.

41
00:04:18,650 --> 00:04:28,180
Remember the "Laplace Correction" in Lesson 3.2? -- we add 1 to
each of those counts to get 449 and 244.

42
00:04:29,560 --> 00:04:37,620
That gives us a 65% probability for being a negative
instance. That's where these numbers come from.
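
The arithmetic behind those ZeroR numbers can be sketched in a few lines of Python. This is only an illustration, not Weka's own code: the function name `laplace_class_probabilities` is made up for this example, and the counts are the ones quoted in the lesson.

```python
def laplace_class_probabilities(counts):
    """Add 1 to every class count (Laplace correction), then normalize."""
    smoothed = {label: n + 1 for label, n in counts.items()}
    total = sum(smoothed.values())
    return {label: n / total for label, n in smoothed.items()}


# Training-set class counts quoted in the lesson (90% split of diabetes).
probs = laplace_class_probabilities(
    {"tested_negative": 448, "tested_positive": 243}
)

for label, p in probs.items():
    # 449/693 and 244/693, i.e. roughly 65% versus 35%
    print(f"{label}: {p:.3f}")
```

Because ZeroR ignores the attributes, every test instance gets this same pair of probabilities.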

43
00:04:40,150 --> 00:04:50,800
If we look at J48 and run that, then we get
more interesting probabilities here --

44
00:04:51,920 --> 00:04:56,560
the negative and positive probabilities, respectively.

45
00:04:56,560 --> 00:04:58,200
You can see where the errors are.

46
00:04:58,200 --> 00:05:00,430
These probabilities are all different.

47
00:05:00,430 --> 00:05:06,110
Internally, J48 uses probabilities in order
to do its pruning operations.

48
00:05:06,110 --> 00:05:11,820
We talked about that when we discussed J48's
pruning, although I didn't explain explicitly

49
00:05:11,820 --> 00:05:15,400
how the probabilities are derived.

50
00:05:15,400 --> 00:05:21,380
The idea of logistic regression is to make
linear regression produce probabilities, too.

51
00:05:21,380 --> 00:05:23,990
This gets a little bit hairy.

52
00:05:23,990 --> 00:05:29,380
Remember, when we use linear regression for
classification, we calculate a linear function

53
00:05:29,380 --> 00:05:36,380
using regression and then apply a threshold
to decide whether it's a 0 or a 1.

54
00:05:36,650 --> 00:05:41,200
It's tempting to imagine that you can interpret
these numbers as probabilities, instead of

55
00:05:41,200 --> 00:05:43,660
thresholding like that, but that's a mistake.

56
00:05:43,660 --> 00:05:45,690
They're not probabilities.

57
00:05:45,690 --> 00:05:48,960
These numbers that come out on the regression
line are sometimes negative, and sometimes

58
00:05:48,960 --> 00:05:50,100
greater than 1.

59
00:05:50,100 --> 00:05:54,710
They can't be probabilities, because probabilities
don't work like that.

60
00:05:54,710 --> 00:06:01,660
In order to get better probability estimates,
a slightly more sophisticated technique is used.

61
00:06:01,660 --> 00:06:04,350
In linear regression, we have a linear sum.

62
00:06:04,350 --> 00:06:10,020
In logistic regression, we have the same linear
sum down here -- the same kind of linear sum

63
00:06:10,020 --> 00:06:13,540
that we saw before -- but we embed it in this
kind of formula.

64
00:06:13,540 --> 00:06:16,120
This is called a "logit transform".

65
00:06:16,120 --> 00:06:21,460
A logit transform -- this is multi-dimensional
with a lot of different a's here.

66
00:06:21,460 --> 00:06:27,340
If we've got just one dimension, one variable,
a1, then if this is the input to the logit

67
00:06:27,340 --> 00:06:32,360
transform, the output looks like this: it's
between 0 and 1.

68
00:06:32,360 --> 00:06:36,090
It's sort of an S-shaped curve that applies
a softer function.

69
00:06:36,090 --> 00:06:42,540
Rather than just 0 and then a step function,
it's a soft version of a step function that

70
00:06:42,540 --> 00:06:49,800
never gets below 0, never gets above 1, and
has a smooth transition in between.
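
The S-shaped curve described above can be sketched as follows. This is a minimal illustration, not Weka's implementation: it assumes the usual form 1 / (1 + e^(-z)), often called the logistic or sigmoid function, applied to the linear sum z. The weights and bias below are hypothetical values chosen only to show the shape.

```python
import math


def logistic(z):
    """Soft step function: maps any real z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form for negative z that avoids overflow in exp().
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, bias, attributes):
    """Embed the linear sum w0 + w1*a1 + ... + wk*ak in the logistic function."""
    z = bias + sum(w * a for w, a in zip(weights, attributes))
    return logistic(z)


# One-dimensional case with hypothetical weights: a single attribute a1.
for a1 in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(a1, predict_probability([1.5], 0.0, [a1]))
```

However large or small the linear sum gets, the output stays within 0 and 1, which is what lets it be read as a probability.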

71
00:06:49,800 --> 00:06:54,930
When you're working with a logit transform,
instead of minimizing the squared error (remember,

72
00:06:54,930 --> 00:07:00,460
when we do linear regression we minimize the
squared error), it's better to choose weights

73
00:07:00,460 --> 00:07:05,860
to maximize a probabilistic function called
the "log-likelihood function", which is this

74
00:07:05,860 --> 00:07:10,210
pretty scary looking formula down at the bottom.
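
For readers who want to see that formula as code, here is a hedged sketch. It assumes the standard two-class log-likelihood, the sum over instances of x*log(p) + (1 - x)*log(1 - p), where x is the 0/1 class and p is the predicted probability of class 1. The tiny dataset and the weights are invented for illustration only, and this is not how Weka's Logistic searches for weights.

```python
import math


def logistic(z):
    """Soft step function with outputs between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(weights, bias, instances, labels):
    """Sum of x*log(p) + (1 - x)*log(1 - p) over all training instances."""
    total = 0.0
    for attributes, x in zip(instances, labels):
        z = bias + sum(w * a for w, a in zip(weights, attributes))
        p = logistic(z)
        # Keep p strictly inside (0, 1) so that log() is always defined.
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total += x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return total


# Hypothetical one-attribute data: small values are class 0, large are class 1.
instances = [[0.5], [1.5], [3.0], [4.0]]
labels = [0, 0, 1, 1]

flat = log_likelihood([0.0], 0.0, instances, labels)   # always predicts 0.5
good = log_likelihood([2.0], -4.5, instances, labels)  # separates the classes

print("flat weights:", flat)
print("better weights:", good)
```

The weights that fit the data better give a higher (less negative) log-likelihood, and choosing weights to maximize this quantity is the goal the lesson describes.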

75
00:07:10,210 --> 00:07:12,620
That's the basis of logistic regression.

76
00:07:12,620 --> 00:07:15,889
We won't talk about the details any more:
let me just do it.

77
00:07:15,889 --> 00:07:19,139
We're going to use the diabetes dataset.

78
00:07:19,139 --> 00:07:23,360
In the last lesson we got 76.8% with classification
by regression.

79
00:07:23,360 --> 00:07:29,370
Let me tell you that if you run ZeroR, NaiveBayes,
and J48, you get these numbers here.

80
00:07:29,370 --> 00:07:35,460
I'm going to find the logistic regression
scheme.

81
00:07:35,460 --> 00:07:38,310
It's in "functions", and called "Logistic".

82
00:07:38,310 --> 00:07:41,620
I'm going to use 10-fold cross-validation.

83
00:07:41,620 --> 00:07:43,540
I'm not going to output the predictions.

84
00:07:45,360 --> 00:07:50,540
I'll just run it -- and I get 77.2% accuracy.

85
00:07:52,080 --> 00:07:59,290
That's the best figure in this column, though
it's not much better than NaiveBayes, so you

86
00:07:59,290 --> 00:08:02,070
might be a bit skeptical about whether it
really is better.

87
00:08:02,070 --> 00:08:07,639
I did this 10 times and calculated the means
myself, and we get these figures for the mean

88
00:08:07,639 --> 00:08:08,930
of 10 runs.

89
00:08:08,930 --> 00:08:15,480
ZeroR stays the same, of course, at 65.1%;
it produces the same accuracy on each run.

90
00:08:15,480 --> 00:08:21,910
NaiveBayes and J48 are different, and here
logistic regression gets an average of 77.5%,

91
00:08:21,910 --> 00:08:27,970
which is appreciably better than the other
figures in this column.

92
00:08:27,970 --> 00:08:30,880
You can extend the idea to multiple classes.

93
00:08:30,880 --> 00:08:37,880
When we did this in the previous lesson, we
performed a regression for each class, a multi-response

94
00:08:37,880 --> 00:08:38,810
regression.

95
00:08:38,810 --> 00:08:44,209
That actually doesn't work well with logistic
regression, because you need the probabilities

96
00:08:44,209 --> 00:08:48,149
to sum to 1 over the various different classes.

97
00:08:48,149 --> 00:08:50,700
That introduces more computational complexity

98
00:08:50,700 --> 00:08:55,000
and needs to be tackled as a joint optimization problem.

99
00:08:57,040 --> 00:09:02,850
The result is logistic regression, a popular
and powerful machine learning method that

100
00:09:02,850 --> 00:09:07,009
uses the logit transform to predict probabilities directly.

101
00:09:07,009 --> 00:09:12,749
It works internally with probabilities, like
NaiveBayes does.

102
00:09:12,749 --> 00:09:17,250
We also learned in this lesson about prediction
probabilities that can be obtained from other

103
00:09:17,250 --> 00:09:21,699
methods, and how to calculate probabilities
from ZeroR.

104
00:09:21,699 --> 00:09:26,520
You can read in the course text about logistic
regression in Section 4.6.

105
00:09:26,520 --> 00:09:30,500
Now you should go and do the activity associated
with this lesson.

106
00:09:30,500 --> 00:09:31,500
See you soon.

107
00:09:31,500 --> 00:09:33,000
Bye for now!