1
00:00:16,779 --> 00:00:24,500
Hi! Before we go on to talk about some more
simple classifier methods, we need to talk about overfitting.

2
00:00:25,890 --> 00:00:30,190
Any machine learning method may 'overfit'
the training data; that's when it produces

3
00:00:30,190 --> 00:00:36,760
a classifier that fits the training data too
tightly and doesn't generalize well to independent

4
00:00:36,760 --> 00:00:39,460
test data.

5
00:00:39,460 --> 00:00:43,470
Remember the user classifier that you built
at the beginning of Class 2, when you built

6
00:00:43,470 --> 00:00:49,440
a classifier yourself? Imagine tediously putting
a tiny circle around every single training

7
00:00:49,440 --> 00:00:51,280
data point.

8
00:00:51,280 --> 00:00:56,890
You could build a classifier very laboriously
that would be 100% correct on the training

9
00:00:56,890 --> 00:01:01,280
data, but probably wouldn't generalize very
well to independent test data.

10
00:01:01,280 --> 00:01:03,320
That's overfitting.
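A minimal Python sketch of the "tiny circle around every point" idea, assuming made-up toy data and a hypothetical memorize helper (neither is from the lecture): a lookup table is 100% correct on the training data but can only guess on values it has never seen.
```python
# Hypothetical toy data: (temperature, play) pairs invented for illustration.
train = [(64, "yes"), (65, "no"), (68, "yes"), (71, "no"), (72, "yes"), (75, "yes")]
test = [(66, "yes"), (69, "yes"), (70, "yes"), (73, "yes"), (80, "no")]
#
def memorize(data):
    """Build a 'classifier' that is just a lookup table of the training points."""
    table = dict(data)
    # Any value not seen during training falls back to a fixed guess of "no".
    return lambda x: table.get(x, "no")
#
def accuracy(classify, data):
    """Fraction of (value, label) pairs the classifier gets right."""
    return sum(classify(x) == y for x, y in data) / len(data)
#
model = memorize(train)
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)
print(train_acc, test_acc)
```
On this toy data the training accuracy is perfect while the test accuracy is poor, which is the gap the lecture calls overfitting.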

11
00:01:03,320 --> 00:01:04,900
It's a general problem.

12
00:01:04,900 --> 00:01:07,040
We're going to illustrate it with OneR.

13
00:01:09,020 --> 00:01:14,960
We're going to look at the numeric version
of the weather problem, where temperature

14
00:01:14,960 --> 00:01:18,630
and humidity are numbers and not nominal values.

15
00:01:18,630 --> 00:01:25,630
If you think about how OneR works, when it
comes to make a rule on the attribute temperature,

16
00:01:25,640 --> 00:01:31,430
it's going to make a complex rule that branches
14 different ways, perhaps, for the 14 different

17
00:01:31,430 --> 00:01:33,710
instances of the dataset.

18
00:01:33,710 --> 00:01:39,290
Each rule is going to have zero errors; it's
going to get it exactly right.

19
00:01:39,290 --> 00:01:44,070
If we branch on temperature, we're going to
get a perfect rule, with a total error count

20
00:01:44,070 --> 00:01:45,210
of zero.

21
00:01:48,020 --> 00:01:52,990
In fact, OneR has a parameter that limits
the complexity of rules.

22
00:01:53,920 --> 00:01:56,140
I'm not going to talk about how it works.

23
00:01:56,140 --> 00:02:00,390
It's pretty simple, but it's just a bit distracting
and not very important.

24
00:02:00,390 --> 00:02:06,890
The point is that the parameter allows you
to limit the complexity of the rules that are

25
00:02:06,890 --> 00:02:09,250
produced by OneR.
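The lecture skips how the parameter works, so the following is only a simplified Python sketch of one plausible bucketing scheme, not Weka's actual OneR code. The function name bucket_rule and the toy data are hypothetical. It shows how a larger minimum bucket size gives fewer branches at the cost of some training errors.
```python
from collections import Counter
#
def bucket_rule(values, labels, min_bucket_size):
    """Group sorted values into buckets whose majority class has at least
    min_bucket_size members, then merge neighbours that predict the same class.
    Returns a list of (majority_label, labels_in_bucket) pairs."""
    # Simplifying assumption: no value is repeated, so any cut point is allowed.
    pairs = sorted(zip(values, labels))
    buckets, current = [], []
    for _, label in pairs:
        current.append(label)
        if Counter(current).most_common(1)[0][1] >= min_bucket_size:
            buckets.append(current)
            current = []
    if current:  # leftover instances join the last bucket
        if buckets:
            buckets[-1].extend(current)
        else:
            buckets.append(current)
    rule = []
    for bucket in buckets:
        majority = Counter(bucket).most_common(1)[0][0]
        if rule and rule[-1][0] == majority:
            rule[-1][1].extend(bucket)
        else:
            rule.append((majority, list(bucket)))
    return rule
#
# Hypothetical toy attribute with 12 distinct values.
values = list(range(1, 13))
labels = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "no", "no", "no"]
for size in (1, 3):
    rule = bucket_rule(values, labels, size)
    errors = sum(lbl != maj for maj, bucket in rule for lbl in bucket)
    print(size, len(rule), errors)
```
With a bucket size of 1 the sketch produces 8 branches and no training errors; with a bucket size of 3 it produces 2 branches and 3 training errors.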

26
00:02:09,250 --> 00:02:16,250
Let's open the numeric weather data.

27
00:02:20,020 --> 00:02:26,330
We can go to OneR, and choose it.

28
00:02:26,330 --> 00:02:31,720
There's OneR, and let's just create a rule.

29
00:02:31,720 --> 00:02:35,610
Here the rule is based on the outlook attribute.

30
00:02:35,610 --> 00:02:38,050
This is exactly what happened in the last
lesson with

31
00:02:38,050 --> 00:02:42,990
the nominal version of the weather data.

32
00:02:42,990 --> 00:02:49,990
Let's just remove the outlook attribute, and
try it again.

33
00:02:51,700 --> 00:02:57,310
Now let's see what happens when we classify
with OneR.

34
00:03:02,840 --> 00:03:04,710
Now it branches on humidity.

35
00:03:04,880 --> 00:03:11,080
If humidity is less than 82.5%, it's a yes day;
if it's greater than 82.5%, it's a no day, and

36
00:03:11,080 --> 00:03:15,100
that gets 10 out of 14 instances correct.

37
00:03:15,100 --> 00:03:22,060
So far so good. That's using the default setting
of OneR's parameter that controls the complexity

38
00:03:22,060 --> 00:03:24,420
of the rules it generates.

39
00:03:24,420 --> 00:03:30,240
We can go and look at OneR, and remember you
can configure a classifier by clicking on it.

40
00:03:30,240 --> 00:03:36,880
We see that there's a parameter called minBucketSize,

41
00:03:36,880 --> 00:03:39,570
and it's set to 6 by default, which is a good

42
00:03:39,570 --> 00:03:41,460
compromise value.

43
00:03:41,460 --> 00:03:47,960
I'm going to change that value to 1, and then
see what happens.

44
00:03:48,570 --> 00:03:54,320
Run OneR again, and now I get a different
kind of rule.

45
00:03:54,320 --> 00:03:58,740
It's branching many different ways on the
temperature attribute.

46
00:03:58,740 --> 00:04:05,740
This rule is overfitted to the dataset.

47
00:04:06,120 --> 00:04:11,260
It's a very accurate rule on the training
data, but it won't generalize well to independent

48
00:04:11,260 --> 00:04:14,910
test data.

49
00:04:15,280 --> 00:04:18,729
Now let's see what happens with a more realistic
dataset.

50
00:04:18,729 --> 00:04:25,729
I'll open diabetes, which is a numeric dataset.

51
00:04:26,849 --> 00:04:32,910
All the attributes are numeric, and the class
is either tested_negative or tested_positive.

52
00:04:32,910 --> 00:04:37,860
Let's run ZeroR to get a baseline figure for
this dataset.

53
00:04:37,860 --> 00:04:41,180
Here I get 65% for the baseline.
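As a rough illustration of where a baseline figure like this comes from, here is a minimal Python sketch of majority-class prediction. The zero_r helper is a hypothetical name, and the class counts are invented so that the majority class is 65% of the data; they are not the real diabetes counts.
```python
from collections import Counter
#
def zero_r(labels):
    """Return the most frequent class; a majority-class baseline predicts it for every instance."""
    return Counter(labels).most_common(1)[0][0]
#
# Hypothetical labels, chosen so the majority class makes up 65% of the data.
labels = ["tested_negative"] * 65 + ["tested_positive"] * 35
prediction = zero_r(labels)
baseline = sum(y == prediction for y in labels) / len(labels)
print(prediction, baseline)
```
Any classifier worth using should beat this figure on independent data.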

54
00:04:41,180 --> 00:04:44,430
We really ought to be able to do better than
that.

55
00:04:44,430 --> 00:04:47,090
Let's run OneR.

56
00:04:47,090 --> 00:04:52,020
This uses the default parameter setting, that is, a value
of 6 for OneR's parameter that controls rule

57
00:04:52,020 --> 00:04:53,389
complexity.

58
00:04:53,389 --> 00:04:56,150
We get 71.5%.

59
00:04:56,150 --> 00:04:58,310
That's pretty good.

60
00:04:58,310 --> 00:05:00,560
We're evaluating using cross-validation.
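For reference, a bare-bones Python sketch of how cross-validation splits the data. The k_fold_indices helper is a hypothetical name, and the sketch does no shuffling or stratification, which a real tool may well add.
```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n instances."""
    for fold in range(k):
        # Every k-th instance, starting at position 'fold', is held out for testing.
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        yield train_idx, test_idx
#
folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print(test_idx, len(train_idx))
```
Each instance is used for testing exactly once, and never by a model that was trained on it.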

61
00:05:00,560 --> 00:05:05,560
OneR outperforms the baseline accuracy by
quite a bit -- 71.5% versus 65%.

62
00:05:05,560 --> 00:05:10,020
If we look at the rule, it branches on "plas".

63
00:05:10,020 --> 00:05:11,729
This is the plasma-glucose concentration.

64
00:05:11,729 --> 00:05:16,620
So, depending on which of these regions the
plasma-glucose concentration falls into,

65
00:05:16,620 --> 00:05:19,400
we're going to predict a negative or
a positive outcome.

66
00:05:19,400 --> 00:05:21,650
That seems like quite a sensible rule.

67
00:05:21,650 --> 00:05:25,770
Now, let's change OneR's parameter to make
it overfit.

68
00:05:25,770 --> 00:05:31,280
We'll configure OneR, find the minBucketSize parameter,
and change it to 1.

69
00:05:33,720 --> 00:05:41,440
When we run OneR again, we get 57% accuracy,
quite a bit lower than the ZeroR baseline

70
00:05:41,440 --> 00:05:41,750
of 65%.

71
00:05:41,750 --> 00:05:45,850
Let's look at the rule.

72
00:05:45,850 --> 00:05:49,470
Here it is.

73
00:05:49,470 --> 00:05:53,229
It's testing a different attribute, pedi,
which -- if you look at the comments of the

74
00:05:53,229 --> 00:05:58,630
ARFF file -- happens to be the diabetes pedigree
function, whatever that is.

75
00:05:58,630 --> 00:06:02,039
You can see that this attribute has a lot
of different values, and it looks like we're

76
00:06:02,039 --> 00:06:04,660
branching on pretty well every single one.

77
00:06:04,660 --> 00:06:10,069
That gives us lousy performance when evaluated
by cross-validation, which is what we're doing now.

78
00:06:10,930 --> 00:06:16,849
If you were to evaluate it on the training
set, you would expect to see very good performance.

79
00:06:16,849 --> 00:06:23,849
Yes, here we get 87.5% accuracy on the training
set, which is very good for this dataset.

80
00:06:24,110 --> 00:06:29,110
Of course, that figure is completely misleading;
the rule is strongly overfitted to the training

81
00:06:29,110 --> 00:06:33,160
dataset, and doesn't generalize well to independent
test sets.

82
00:06:33,160 --> 00:06:36,340
That's a good example of overfitting.

83
00:06:36,340 --> 00:06:39,819
Overfitting is a general phenomenon that plagues
all machine learning methods.

84
00:06:39,819 --> 00:06:45,280
We've illustrated it by playing around with
the parameter of the OneR method, but it happens

85
00:06:45,280 --> 00:06:47,319
with all machine learning methods.

86
00:06:47,319 --> 00:06:51,190
It's one reason why you should never evaluate
on the training set.

87
00:06:51,190 --> 00:06:54,380
Overfitting can occur in more general contexts.

88
00:06:54,380 --> 00:06:59,229
Let's suppose you've got a dataset and you
choose a very large number of machine learning methods,

89
00:06:59,229 --> 00:07:04,789
say a million different machine learning
methods, and choose the best for your dataset

90
00:07:04,789 --> 00:07:06,370
using cross-validation.

91
00:07:06,370 --> 00:07:10,800
Well, because you've used so many machine
learning methods, you can't expect to get

92
00:07:10,800 --> 00:07:13,610
the same performance on new test data.

93
00:07:13,610 --> 00:07:18,490
You've chosen so many that the one you've
ended up with is going to be overfitted to

94
00:07:18,490 --> 00:07:20,669
the dataset you're using.

95
00:07:20,669 --> 00:07:24,669
It's not sufficient just to use cross-validation
and believe the results.

96
00:07:24,669 --> 00:07:30,720
In this case, you might divide the data three
ways, into a training set, a test set, and

97
00:07:30,720 --> 00:07:33,289
a validation set.

98
00:07:33,289 --> 00:07:36,120
Choose the method using the training and test sets.

99
00:07:36,120 --> 00:07:40,210
By all means, use your million machine learning
methods and choose the best on the training

100
00:07:40,210 --> 00:07:44,180
and test sets, or the best using cross-validation
on the training set.

101
00:07:44,180 --> 00:07:50,050
But then, leave aside this separate validation
set for use at the end, once you've chosen

102
00:07:50,050 --> 00:07:56,479
your machine learning method, and evaluate
it on that to get a much more realistic assessment

103
00:07:56,479 --> 00:08:00,870
of how it would perform on independent test data.
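The three-way split can be sketched in Python as follows, using the lecture's naming where the validation set is the part held back until the end. The helpers three_way_split and pick_best, the split sizes, and the score function are all hypothetical.
```python
import random
#
def three_way_split(data, n_train, n_test, seed=0):
    """Shuffle a copy of data and cut it into training, test, and validation parts.
    Whatever is left after the first two parts becomes the validation set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation
#
def pick_best(methods, score, train, test):
    """Choose the method with the highest score, looking only at training and test data."""
    return max(methods, key=lambda m: score(m, train, test))
#
train, test, validation = three_way_split(range(100), n_train=60, n_test=20)
print(len(train), len(test), len(validation))
```
The validation part never appears in pick_best, so a final score measured on it is not inflated by the selection step.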

104
00:08:00,870 --> 00:08:04,770
Overfitting is a really big problem in machine learning.

105
00:08:04,770 --> 00:08:10,560
You can read a bit more about OneR and what
this parameter actually does in the course

106
00:08:10,560 --> 00:08:13,449
text in Section 4.1.

107
00:08:13,449 --> 00:08:17,620
Off you go now and do the activity associated
with this class.

108
00:08:17,620 --> 00:08:18,600
Bye for now.

