1
00:00:16,460 --> 00:00:20,980
Hi! Welcome back for another few minutes in
New Zealand.

2
00:00:20,980 --> 00:00:27,410
In the last lesson, Lesson 5.1, we learned
that Weka only helps you with a small part

3
00:00:27,410 --> 00:00:33,370
of the overall data mining process, the technical
part, which is perhaps the easy part.

4
00:00:33,370 --> 00:00:38,690
In this lesson, we're going to learn that
there are many pitfalls and pratfalls even

5
00:00:38,690 --> 00:00:40,470
in that part.

6
00:00:41,860 --> 00:00:43,149
Let me just define these for you.

7
00:00:43,149 --> 00:00:48,840
A "pitfall" is a hidden or unsuspected danger
or difficulty, and there are plenty of those

8
00:00:48,840 --> 00:00:51,059
in the field of machine learning.

9
00:00:51,059 --> 00:00:57,690
A "pratfall" is a stupid and humiliating action,
which is very easy to do when you're working

10
00:00:57,690 --> 00:01:01,870
with data.

11
00:01:01,870 --> 00:01:04,710
The first lesson is that you should be skeptical.

12
00:01:04,710 --> 00:01:08,860
In data mining it's very easy to cheat.

13
00:01:08,860 --> 00:01:14,659
Whether you're cheating consciously or unconsciously,
it's easy to mislead yourself or mislead others

14
00:01:14,659 --> 00:01:18,440
about the significance of your results.

15
00:01:18,440 --> 00:01:25,440
For a reliable test, you should use a completely
fresh sample of data that has never been seen before.

16
00:01:25,440 --> 00:01:29,390
You should save something for the very end,
that you don't use until you've selected your

17
00:01:29,390 --> 00:01:33,579
algorithm, decided how you're going to apply
it, and the filters, and so on.

18
00:01:33,579 --> 00:01:39,659
At the very, very end, having done all that,
run it on some fresh data to get an estimate

19
00:01:39,659 --> 00:01:41,570
of how it will perform.

20
00:01:41,570 --> 00:01:47,500
Don't be tempted to then change it to improve
it so that you get better results on that data.

21
00:01:47,500 --> 00:01:51,659
Always do your final run on fresh data.

22
00:01:51,659 --> 00:01:56,189
We've talked a lot about overfitting, and
this is basically the same kind of problem.

23
00:01:56,189 --> 00:02:00,820
Of course, you know not to test on the training
set.

24
00:02:00,820 --> 00:02:05,030
We've talked about that endlessly throughout
this course.

25
00:02:05,030 --> 00:02:09,370
Data that's been used for development in any
way is tainted.

26
00:02:09,370 --> 00:02:14,650
Any time you use some data to help you make
a choice of the filter, or the classifier,

27
00:02:14,650 --> 00:02:20,250
or how you're going to treat your problem,
then that data is tainted.

28
00:02:20,250 --> 00:02:24,470
You should be using completely fresh data
to get evaluation results.

29
00:02:24,470 --> 00:02:29,400
Leave some evaluation data aside for the very
end of the process.

30
00:02:29,400 --> 00:02:34,239
That's the first piece of advice.
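
The advice above can be sketched in a few lines of Python. This is only an illustration, not something shown in the lesson: the function name `split_off_holdout`, the 20% holdout fraction, and the fixed seed are all assumptions made for the example.

```python
import random

def split_off_holdout(instances, holdout_fraction=0.2, seed=1):
    """Shuffle once, then reserve a holdout set that development never touches."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(instances)         # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    holdout = shuffled[:n_holdout]     # used exactly once, at the very end
    development = shuffled[n_holdout:] # used for choosing filters, classifiers, and so on
    return development, holdout

data = list(range(100))                # stand-in for 100 instances
development, holdout = split_off_holdout(data)
```

All choices (filter, classifier, parameters) would be made using `development` only; `holdout` is evaluated once, and the result is reported without further tuning.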

31
00:02:34,239 --> 00:02:38,280
Another thing I haven't told you about in
this course so far is missing values.

32
00:02:38,280 --> 00:02:45,280
In real datasets, it's very common that some
of the data values are missing.

33
00:02:45,370 --> 00:02:46,579
They haven't been recorded.

34
00:02:48,220 --> 00:02:53,579
They might be unknown; we might have forgotten
to record them; they might be irrelevant.

35
00:02:55,810 --> 00:03:00,310
There are two basic strategies for dealing
with missing values in a dataset.

36
00:03:00,310 --> 00:03:05,970
You can omit instances where the attribute
value is missing, or somehow find a way of

37
00:03:05,970 --> 00:03:08,780
omitting that particular attribute in that
instance.

38
00:03:08,780 --> 00:03:13,260
Or you can treat missing as a separate possible
value.
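
As a rough sketch of the two strategies just described, assuming instances are held as Python dictionaries and a missing value is represented by `None`. The three rows and the helper names `drop_incomplete` and `missing_as_value` are made up for illustration.

```python
MISSING = None  # assumed in-memory marker for a missing value

# Toy rows invented for this example
instances = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": MISSING, "windy": "true", "play": "no"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
]

def drop_incomplete(rows):
    """Strategy 1: omit every instance that has any missing attribute value."""
    return [row for row in rows if MISSING not in row.values()]

def missing_as_value(rows, label="missing"):
    """Strategy 2: treat 'missing' as one more possible value of the attribute."""
    return [
        {name: (label if value is MISSING else value) for name, value in row.items()}
        for row in rows
    ]

complete_only = drop_incomplete(instances)
with_missing_label = missing_as_value(instances)
```

The first strategy throws information away; the second keeps every instance and lets a learner use the fact that the value was absent.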

39
00:03:15,060 --> 00:03:20,790
You need to ask yourself, is there significance
in the fact that a value is missing? They

40
00:03:20,799 --> 00:03:24,419
say that if you've got something wrong with
you and go to the doctor, and he does some

41
00:03:24,419 --> 00:03:30,370
tests on you: if you just record the tests
that he does -- not the results of the test,

42
00:03:30,370 --> 00:03:34,669
but just the ones he chooses to do -- there's
a very good chance that you can work out what's

43
00:03:34,669 --> 00:03:39,919
wrong with you just from the existence of
the tests, not from their results.

44
00:03:39,919 --> 00:03:43,180
That's because the doctor chooses tests intelligently.

45
00:03:43,180 --> 00:03:48,680
The fact that he doesn't choose a test doesn't
mean that that value is missing, or accidentally

46
00:03:48,680 --> 00:03:49,660
not there.

47
00:03:49,660 --> 00:03:54,139
There's huge significance in the fact that
he's chosen not to do certain tests.

48
00:03:54,139 --> 00:03:59,019
This is a situation where "missing" should
be treated as a separate possible value.

49
00:03:59,019 --> 00:04:03,709
There's significance in the fact that a value
is missing.

50
00:04:03,709 --> 00:04:06,959
But in other situations, a value might be
missing simply because a piece of equipment

51
00:04:06,959 --> 00:04:11,180
malfunctioned, or for some other reason -- maybe
someone forgot something.

52
00:04:11,180 --> 00:04:16,799
Then there's no significance in the fact that
it's missing.

53
00:04:16,799 --> 00:04:20,850
Pretty well all machine learning algorithms
deal with missing values.

54
00:04:20,850 --> 00:04:25,889
In an ARFF file, if you put a question mark
as a data value, that's treated as a missing

55
00:04:25,889 --> 00:04:27,600
value.
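
To make the question-mark convention concrete, here is a minimal sketch that reads one ARFF data row. It assumes a simple comma-separated row with no quoted values, which real ARFF files are not limited to; the function name `parse_arff_row` is hypothetical and this is not Weka's own parser.

```python
def parse_arff_row(line):
    """Split a simple ARFF data row, mapping '?' to None (missing)."""
    fields = [field.strip() for field in line.split(",")]
    return [None if field == "?" else field for field in fields]

# A weather-style row whose first attribute value is missing
row = parse_arff_row("?,hot,high,FALSE,no")
```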

56
00:04:27,600 --> 00:04:30,530
All methods in Weka can deal with missing
values.

57
00:04:30,530 --> 00:04:33,759
But they make different assumptions about
them.

58
00:04:33,759 --> 00:04:39,460
If you don't appreciate this, it's easy to
get misled.

59
00:04:39,460 --> 00:04:45,550
Let me just take two simple and well-known
(to us) examples -- OneR and J48.

60
00:04:45,550 --> 00:04:47,460
They deal with missing values in different
ways.

61
00:04:47,460 --> 00:05:00,740
I'm going to load the nominal weather data
and run OneR on it: I get 43%.

62
00:05:00,740 --> 00:05:10,600
Let me run J48 on it, to get 50%.

63
00:05:10,600 --> 00:05:11,750
I'm going to

64
00:05:11,750 --> 00:05:21,940
edit this dataset by changing the value of
"outlook" for the first four "no" instances

65
00:05:21,940 --> 00:05:24,040
to "missing".

66
00:05:24,040 --> 00:05:26,580
That's how we do it here in this editor.

67
00:05:26,580 --> 00:05:32,060
If we were to write this file out in ARFF
format, we'd find that these values are written

68
00:05:32,060 --> 00:05:36,600
into the file as question marks.

69
00:05:37,380 --> 00:05:42,870
Now, if we look at "outlook", you can see
that it says here there are 4 missing values.

70
00:05:42,870 --> 00:05:49,870
If you count up these labels -- 2, 4, and
4 -- that's 10 labels.

71
00:05:50,350 --> 00:05:54,370
Plus another 4 that are missing, to make the
14 instances.

72
00:05:54,370 --> 00:06:00,120
Let's go back to J48 and run it again.

73
00:06:00,120 --> 00:06:02,400
We still get 50%, the same result.

74
00:06:03,400 --> 00:06:09,620
Of course, this is a tiny dataset, but the
fact is that the results here are not affected

75
00:06:09,620 --> 00:06:12,530
by the fact that a few of the values are missing.

76
00:06:12,530 --> 00:06:22,280
However, if we run OneR, I get a much higher
accuracy, a 93% accuracy.

77
00:06:26,370 --> 00:06:31,660
The rule that I've got is "branch on outlook",
which is what we had before, I think.

78
00:06:31,660 --> 00:06:36,590
Here it says there are 4 possibilities: if
it's sunny, it's a yes; if it's overcast, it's

79
00:06:36,590 --> 00:06:41,130
a yes; if it's rainy, it's a yes; and if it's
missing, it's a no.

80
00:06:41,130 --> 00:06:45,870
Here, OneR is using the fact that a value
is missing as significant, as something you

81
00:06:45,870 --> 00:06:46,970
can branch on.

82
00:06:46,970 --> 00:06:53,010
Whereas if you were to look at a J48 tree,
it would never have a branch that corresponded

83
00:06:53,010 --> 00:06:54,280
to a missing value.

84
00:06:54,280 --> 00:06:56,160
It treats them differently.

85
00:06:56,160 --> 00:07:00,910
This is very important to know and remember.
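
The OneR behavior described above can be imitated with a simplified one-attribute rule learner. This is a sketch, not Weka's OneR: it looks only at "outlook", treats "missing" as an ordinary value, and scores itself on the training data rather than using Weka's evaluation. The (outlook, play) pairs are assumed to follow the order of the standard 14-instance nominal weather file.

```python
from collections import Counter, defaultdict

# (outlook, play) pairs, assumed to match the standard nominal weather data
weather = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

def blank_first_no(rows, how_many=4):
    """Set 'outlook' to 'missing' for the first `how_many` instances of class 'no'."""
    edited, remaining = [], how_many
    for outlook, play in rows:
        if play == "no" and remaining > 0:
            outlook, remaining = "missing", remaining - 1
        edited.append((outlook, play))
    return edited

def one_attribute_rule(rows):
    """Map each attribute value (including 'missing') to its majority class."""
    counts = defaultdict(Counter)
    for value, cls in rows:
        counts[value][cls] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

edited = blank_first_no(weather)
value_counts = Counter(value for value, _ in edited)
rule = one_attribute_rule(edited)
correct = sum(rule[value] == cls for value, cls in edited)
print(rule)
print(f"{correct}/{len(edited)} correct on the training data")  # prints 13/14
```

Because all four missing values belong to "no" instances, the "missing" branch alone classifies them correctly, which is why a learner that branches on missing values looks so much better here than one that does not.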

86
00:07:00,910 --> 00:07:07,910
The final thing I want to tell you about in
this lesson is the "no free lunch" theorem.

87
00:07:08,290 --> 00:07:11,930
There's no free lunch in data mining.

88
00:07:11,930 --> 00:07:13,440
Here's a way to illustrate it.

89
00:07:13,440 --> 00:07:17,430
Suppose you've got a 2-class problem with
100 binary attributes.

90
00:07:17,430 --> 00:07:22,260
Let's say you've got a huge training set with
a million instances and their classifications

91
00:07:22,260 --> 00:07:25,690
in the training set.

92
00:07:25,690 --> 00:07:31,910
The number of possible instances is 2 to the
100 (2^100), because there are 100 binary

93
00:07:31,910 --> 00:07:33,120
attributes.

94
00:07:33,120 --> 00:07:34,980
And you know 10^6 of them.

95
00:07:34,980 --> 00:07:40,160
So you don't know the classes of 2^100 - 10^6
examples.

96
00:07:40,160 --> 00:07:47,780
Let me tell you that 2^100 - 10^6 is 99.999...%
of 2^100.
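
The arithmetic can be checked exactly with Python's `fractions` module. The variable names are just for this example.

```python
from fractions import Fraction

total = 2 ** 100          # possible instances with 100 binary attributes
known = 10 ** 6           # instances in the training set

known_fraction = Fraction(known, total)
unknown_fraction = Fraction(total - known, total)

# The known share is below 1e-24 of instance space
print(f"known share of instance space: {float(known_fraction):.1e}")
print(known_fraction < Fraction(1, 10 ** 24))
```

Exact fractions are used because, in ordinary floating point, the unknown share is so close to 1 that it rounds to exactly 1.0.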

97
00:07:47,780 --> 00:07:52,220
There's this huge number of examples that
you just don't know the classes of.

98
00:07:52,220 --> 00:07:56,780
How could you possibly figure them out? If
you apply a data mining scheme to this, it

99
00:07:56,780 --> 00:08:02,130
will figure them out, but how could you possibly
figure out all of those things just from the

100
00:08:02,130 --> 00:08:06,750
tiny amount of data that you've been given?

101
00:08:06,750 --> 00:08:11,220
In order to generalize, every learner must
embody some knowledge or assumptions beyond

102
00:08:11,220 --> 00:08:14,440
the data it's given.

103
00:08:14,440 --> 00:08:18,680
Each learning algorithm implicitly provides
a set of assumptions.

104
00:08:18,680 --> 00:08:23,400
The best way to think about those assumptions
is to think back to the Boundary Visualizer

105
00:08:23,400 --> 00:08:26,320
we looked at in Lesson 4.1.

106
00:08:26,320 --> 00:08:30,150
You saw that different machine learning schemes
are capable of drawing different kinds of

107
00:08:30,150 --> 00:08:33,230
boundaries in instance space.

108
00:08:33,230 --> 00:08:39,530
These boundaries correspond to a set of assumptions
about the sort of decisions we can make.

109
00:08:39,530 --> 00:08:44,350
There's no universal best algorithm; there's
no free lunch.

110
00:08:44,350 --> 00:08:46,900
There's no single best algorithm.

111
00:08:46,900 --> 00:08:52,080
Data mining is an experimental science, and
that's why we've been teaching you how to

112
00:08:52,080 --> 00:08:55,010
experiment with data mining yourself.

113
00:08:56,240 --> 00:08:57,920
This is just a summary.

114
00:08:57,920 --> 00:09:02,250
Be skeptical: when people tell you about data
mining results and they say that it gets this

115
00:09:02,250 --> 00:09:07,450
kind of accuracy, then to be sure about that
you want to have them test their classifier

116
00:09:07,450 --> 00:09:12,570
on your new, fresh data that they've never
seen before.

117
00:09:12,570 --> 00:09:15,480
Overfitting has many faces.

118
00:09:15,480 --> 00:09:19,640
Different learning schemes make different
assumptions about missing values, which can

119
00:09:19,640 --> 00:09:21,400
really change the results.

120
00:09:21,400 --> 00:09:26,950
There is no universal best learning algorithm.

121
00:09:26,950 --> 00:09:32,240
Data mining is an experimental science, and
it's very easy to be misled by people quoting

122
00:09:32,240 --> 00:09:37,160
the results of data mining experiments.

123
00:09:37,160 --> 00:09:37,890
That's it for now.

124
00:09:37,890 --> 00:09:40,540
Off you go and do the activity.

125
00:09:40,540 --> 00:09:42,080
We'll see you in the next lesson.

126
00:09:42,080 --> 00:09:43,670
Bye for now!