1
00:00:15,920 --> 00:00:22,260
Hello again! In this lesson we're going to
look at an important new concept called baseline

2
00:00:22,260 --> 00:00:30,610
accuracy. We're going to actually use a new
dataset, the diabetes dataset.

3
00:00:30,610 --> 00:00:37,260
I've got Weka here, and I'm going to open
diabetes.arff.

4
00:00:37,260 --> 00:00:38,600
There it is.

5
00:00:39,000 --> 00:00:40,700
Have a quick look at this dataset.

6
00:00:40,700 --> 00:00:47,700
The class is tested_negative or tested_positive
for diabetes.

7
00:00:48,990 --> 00:00:55,670
We've got attributes like preg, which I think
has to do with the number of times they've

8
00:00:55,670 --> 00:00:59,020
been pregnant; age, which is the age.

9
00:00:59,020 --> 00:01:06,020
Of course, we can learn more about this dataset
by looking at the ARFF file itself.

10
00:01:07,590 --> 00:01:10,990
Here is the diabetes dataset.

11
00:01:10,990 --> 00:01:14,020
You can see it's diabetes in Pima Indians.

12
00:01:17,500 --> 00:01:19,510
There's a lot of information here.

13
00:01:19,510 --> 00:01:26,510
The attributes: number of times pregnant,
plasma glucose concentration, and so on.

14
00:01:26,510 --> 00:01:29,830
Diabetes pedigree function.

15
00:01:31,070 --> 00:01:35,070
I'm going to use percentage split.

16
00:01:36,140 --> 00:01:38,970
I'm going to try a few different classifiers.

17
00:01:38,970 --> 00:01:45,970
Let's look at J48 first, our old friend J48.

18
00:01:50,380 --> 00:01:55,010
We get [76%] with J48.

19
00:01:55,600 --> 00:01:57,280
I'm going to look at some other classifiers.

20
00:01:57,280 --> 00:02:01,120
You'll learn about these classifiers later on
in this course, but right now we're just going

21
00:02:01,120 --> 00:02:02,600
to look at a few.

22
00:02:02,600 --> 00:02:09,300
Look at the NaiveBayes classifier in the bayes
category, and run that.

23
00:02:09,300 --> 00:02:15,640
Here we get 77%, a little bit better, but
probably not significant.

24
00:02:15,640 --> 00:02:20,640
Let's choose in the lazy category IBk.

25
00:02:20,640 --> 00:02:25,170
Again, we'll learn about this later on.

26
00:02:25,170 --> 00:02:29,220
Here we get 73%, quite a bit worse.

27
00:02:29,220 --> 00:02:36,220
We'll use one final one, PART (partial
rules), in the rules category.

28
00:02:40,110 --> 00:02:43,060
Here we get 74%.

29
00:02:43,060 --> 00:02:47,840
We'll learn about these classifiers later,
but they are just different classifiers, alternatives

30
00:02:47,840 --> 00:02:49,750
to J48.

31
00:02:49,750 --> 00:02:54,520
You can see that J48 and NaiveBayes are pretty
good, probably about the same.

32
00:02:54,520 --> 00:02:57,370
The 1% difference between them probably isn't
significant.

33
00:02:57,370 --> 00:03:00,110
IBk and PART are probably about the same performance.

34
00:03:00,110 --> 00:03:01,610
Again, 1% between them.

35
00:03:01,610 --> 00:03:06,720
There is a fair gap, I guess, between those
bottom two and the top two, which probably

36
00:03:06,720 --> 00:03:07,760
is significant.

37
00:03:08,890 --> 00:03:10,900
I'd like to think about these figures.

38
00:03:10,900 --> 00:03:15,590
Is it good to get 76% accuracy?

39
00:03:15,590 --> 00:03:21,670
If we go back and look at this dataset, the class,

40
00:03:21,670 --> 00:03:28,720
we see that there are 500 negative instances
and 268 positive instances.

41
00:03:28,720 --> 00:03:35,720
If you had to guess, you'd guess it would
be negative, and you'd be right 500/768

42
00:03:35,720 --> 00:03:38,800
(the sum of these two things, the total number
of instances).

43
00:03:39,000 --> 00:03:41,390
You'd be right that fraction of the time.

44
00:03:41,390 --> 00:03:48,390
500/768 if you always guess [negative], and
that works out to 65%.
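
The arithmetic just described can be checked with a short Python sketch. The class counts are the ones read off the diabetes dataset in the lesson; the variable names are only illustrative.

```python
# Class counts as read off the diabetes dataset in the lesson.
class_counts = {"tested_negative": 500, "tested_positive": 268}

total = sum(class_counts.values())                   # 768 instances in all
majority = max(class_counts, key=class_counts.get)   # the most frequent class
baseline = class_counts[majority] / total            # fraction right if we always guess it

print(f"always guess {majority}: {class_counts[majority]}/{total} = {baseline:.1%}")
# prints: always guess tested_negative: 500/768 = 65.1%
```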

45
00:03:48,950 --> 00:04:00,670
Actually, there's a rules classifier called
ZeroR, which does exactly that.

46
00:04:00,670 --> 00:04:07,670
The ZeroR classifier just looks for the most
popular class and guesses that all the time.
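
As a rough illustration of the idea (not Weka's own ZeroR code), a majority-class classifier might be sketched in Python as below. The class name `ZeroRSketch` and the helper `accuracy` are hypothetical names made up for this example.

```python
from collections import Counter


class ZeroRSketch:
    """Toy majority-class classifier; an illustration, not Weka's implementation."""

    def fit(self, labels):
        # Remember only the most frequent label; attributes are never consulted.
        self.prediction = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n):
        # Give the same answer for every one of the n instances.
        return [self.prediction] * n


def accuracy(predicted, actual):
    # Fraction of positions where the prediction matches the true label.
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Labels with the class counts quoted in the lesson for the diabetes dataset.
labels = ["tested_negative"] * 500 + ["tested_positive"] * 268
model = ZeroRSketch().fit(labels)
train_accuracy = accuracy(model.predict(len(labels)), labels)
print(model.prediction, round(train_accuracy, 3))
```

Because the model stores a single label and nothing else, evaluating it on the same data it was fitted on tells you little more than the class proportions.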

47
00:04:08,420 --> 00:04:15,420
If I run this on the training set, that will
give us the exact same number, 500/768,

48
00:04:16,300 --> 00:04:17,330
which is 65%.

49
00:04:19,470 --> 00:04:23,830
It's a very, very simple, kind of trivial
classifier, that always just guesses the most

50
00:04:23,830 --> 00:04:25,650
popular class.

51
00:04:25,650 --> 00:04:29,680
It's ok to evaluate that on the training set,
because it's hardly using the training set

52
00:04:29,680 --> 00:04:32,120
at all to form the classifier.

53
00:04:32,120 --> 00:04:37,240
That's what we would call the baseline.

54
00:04:37,240 --> 00:04:43,540
The baseline gives 65% accuracy, and J48 gives
76% accuracy.

55
00:04:43,540 --> 00:04:47,830
It's significantly above the baseline, but
not all that much above the baseline.

56
00:04:47,830 --> 00:04:52,990
It's always good when you're looking at these
figures to consider what the very simplest kind of classifier,

57
00:04:52,990 --> 00:04:56,240
the baseline classifier, would get you.

58
00:04:56,240 --> 00:05:01,350
Sometimes, baseline might give you the best
results.

59
00:05:01,350 --> 00:05:03,110
I'm going to open a dataset here.

60
00:05:03,110 --> 00:05:05,050
We're not going to discuss this dataset.

61
00:05:05,050 --> 00:05:11,660
It's a bit of a strange dataset, not really
designed for this kind of classification.

62
00:05:11,660 --> 00:05:12,940
It's called supermarket.

63
00:05:12,940 --> 00:05:18,630
I'm going to open supermarket, and without
even looking at it, I'm just going to apply

64
00:05:18,630 --> 00:05:19,950
a few schemes here.

65
00:05:19,950 --> 00:05:26,930
I'm going to apply ZeroR, and I get 64%.

66
00:05:27,530 --> 00:05:32,130
I'm going to apply J48,

67
00:05:34,530 --> 00:05:38,790
and I think I'll use a percentage split for evaluation because

68
00:05:38,790 --> 00:05:41,020
it's not fair to use the training set here.

69
00:05:41,020 --> 00:05:43,720
Now I get 63%.

70
00:05:43,720 --> 00:05:46,580
That's worse than the baseline.

71
00:05:46,580 --> 00:05:48,180
If I try NaiveBayes.

72
00:05:49,990 --> 00:05:53,520
These are the ones I tried before.

73
00:05:53,520 --> 00:05:57,070
I get again 63%, worse than the baseline.

74
00:05:57,070 --> 00:06:04,070
If I choose IBk, this is going to take a little
while here, it's a rather slow scheme.

75
00:06:09,910 --> 00:06:11,670
Here we are; it's finished now.

76
00:06:11,670 --> 00:06:13,500
Only 38%.

77
00:06:13,500 --> 00:06:17,010
That is way, way worse than the baseline.

78
00:06:17,010 --> 00:06:24,010
We'll just try PART, partial decision rules.

79
00:06:26,200 --> 00:06:28,000
Here we get 63%.

80
00:06:30,160 --> 00:06:36,350
The upshot is that the baseline actually gave
a better performance than any of these classifiers,

81
00:06:36,350 --> 00:06:41,580
and one of them was really atrocious compared
with the baseline.
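
One way to make this comparison a habit is to line every result up against the baseline. The sketch below uses the supermarket accuracies exactly as quoted in the lesson; the dictionary layout and variable names are just one possible arrangement.

```python
# Accuracies (in percent) as quoted in the lesson for the supermarket dataset.
baseline_name = "ZeroR"
results = {"ZeroR": 64, "J48": 63, "NaiveBayes": 63, "IBk": 38, "PART": 63}

baseline = results[baseline_name]
for name, acc in results.items():
    if name == baseline_name:
        continue
    gap = acc - baseline  # negative means worse than always guessing the majority
    verdict = "beats" if gap > 0 else "does not beat"
    print(f"{name}: {acc}% {verdict} the baseline ({gap:+d} points)")

# Schemes that actually improve on the baseline (none, for these figures).
better_than_baseline = [n for n, a in results.items() if a > baseline]
```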

82
00:06:41,580 --> 00:06:47,030
This is because, for this dataset, the attributes
are not really informative.

83
00:06:47,030 --> 00:06:52,310
The rule here is, don't just apply Weka to
a dataset blindly.

84
00:06:52,310 --> 00:06:54,900
You need to understand what's going on.

85
00:06:54,900 --> 00:07:02,970
When you do apply Weka to a dataset, always
make sure that you try the baseline classifier,

86
00:07:02,970 --> 00:07:06,360
ZeroR, before doing anything else.

87
00:07:06,360 --> 00:07:09,100
In general, simplicity is best.

88
00:07:09,100 --> 00:07:14,250
Always try simple classifiers before you try
more complicated ones.

89
00:07:14,250 --> 00:07:18,210
Also, you should consider, when you get these
small differences, whether the differences

90
00:07:18,210 --> 00:07:19,830
are likely to be significant.

91
00:07:19,830 --> 00:07:24,820
We saw these 1% differences in the last lesson
that were probably not at all significant.
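
A rough way to judge such a gap is the binomial standard error of an accuracy estimate. The sketch below is only a back-of-the-envelope check, not a proper significance test. It ASSUMES a 66% percentage split, which would leave about 261 of the 768 diabetes instances for testing; the lesson does not state the split size, and the function name is hypothetical.

```python
import math


def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test instances."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# ASSUMPTION: a 66% percentage split of 768 instances leaves about 261 for testing.
n_test = round(768 * 0.34)
se = accuracy_standard_error(0.76, n_test)
print(f"standard error at 76% on {n_test} test instances: {se:.1%}")
# prints: standard error at 76% on 261 test instances: 2.6%
```

Under that assumption, a 1-point gap sits well inside one standard error, which fits the remark that such differences are probably not significant.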

92
00:07:24,820 --> 00:07:27,430
You should always try a simple baseline.

93
00:07:27,430 --> 00:07:29,180
You should look at the dataset.

94
00:07:29,180 --> 00:07:36,070
We shouldn't blindly apply Weka to a dataset;
we should try to understand what's going on.

95
00:07:36,070 --> 00:07:37,140
That's this lesson.

96
00:07:37,140 --> 00:07:44,140
Off you go and do the activity associated
with this lesson, and I'll see you soon!