1
00:00:16,670 --> 00:00:21,500
Hi! This is the third class of Data Mining
with Weka, and in this class,

2
00:00:21,500 --> 00:00:26,130
we're going to look at some simple machine
learning methods and how they work.

3
00:00:26,130 --> 00:00:33,130
We're going to start out emphasizing the message
that simple algorithms often work very well.

4
00:00:34,420 --> 00:00:38,210
In data mining, maybe in life in general,

5
00:00:38,210 --> 00:00:43,480
you should always try simple things before
you try more complicated things.

6
00:00:43,480 --> 00:00:45,980
There are many different kinds of simple structure.

7
00:00:45,980 --> 00:00:49,930
For example, it might be that one attribute in
the dataset does all the work,

8
00:00:49,930 --> 00:00:52,489
everything depends on the value of one of
the attributes.

9
00:00:52,489 --> 00:00:57,989
Or, it might be that all of the attributes
contribute equally and independently.

10
00:00:57,989 --> 00:01:02,769
Or a simple structure might be a decision
tree that tests just a few of the attributes.

11
00:01:02,769 --> 00:01:09,769
We might calculate the distance from an unknown
sample to the nearest training sample,

12
00:01:10,460 --> 00:01:14,970
or a result might depend on a linear combination
of attributes.

13
00:01:14,970 --> 00:01:21,630
We're going to look at all of these simple
structures in the next few lessons.

14
00:01:21,630 --> 00:01:23,850
There's no universally best learning algorithm.

15
00:01:23,850 --> 00:01:27,469
The success of a machine learning method depends
on the domain.

16
00:01:27,469 --> 00:01:33,700
Data mining really is an experimental science.

17
00:01:33,700 --> 00:01:37,259
We're going to look at the OneR rule learner,

18
00:01:37,259 --> 00:01:39,950
where one attribute does all the work.

19
00:01:39,950 --> 00:01:42,770
It's extremely simple, very trivial, actually,

20
00:01:42,770 --> 00:01:47,570
but we're going to start with simple things
and build up to more complex things.

21
00:01:47,570 --> 00:01:52,439
OneR learns what you might call a one-level
decision tree,

22
00:01:52,439 --> 00:01:56,509
or a set of rules that all test one particular
attribute.

23
00:01:56,509 --> 00:02:03,170
A tree that branches only at the root node
depending on the value of a particular attribute,

24
00:02:03,170 --> 00:02:09,819
or, equivalently, a set of rules that test
the value of that particular attribute.

25
00:02:09,819 --> 00:02:11,230
In the basic version of OneR,

26
00:02:11,230 --> 00:02:14,400
there's one branch for each value of the attribute.

27
00:02:14,400 --> 00:02:17,680
We choose which attribute first,

28
00:02:17,680 --> 00:02:20,900
and we make one branch for each possible value
of the attribute.

29
00:02:20,900 --> 00:02:26,090
Each branch assigns the most frequent class
that comes down that branch.

30
00:02:26,090 --> 00:02:30,739
The error rate is the proportion of instances
that don't belong to the majority class of

31
00:02:30,739 --> 00:02:32,319
their corresponding branch.

32
00:02:32,319 --> 00:02:36,640
We choose the attribute with the smallest
error rate.

33
00:02:36,640 --> 00:02:39,190
Let's look at what this actually means.

34
00:02:39,190 --> 00:02:41,310
Here's the algorithm.

35
00:02:41,310 --> 00:02:46,150
For each attribute, we're going to make some
rules.

36
00:02:46,150 --> 00:02:47,870
For each value of the attribute,

37
00:02:47,870 --> 00:02:52,599
we're going to make a rule that counts how
often each class appears,

38
00:02:52,599 --> 00:02:54,560
finds the most frequent class,

39
00:02:54,560 --> 00:02:59,090
makes the rule assign that most frequent class
to this attribute value combination,

40
00:02:59,090 --> 00:03:03,030
and then we're going to calculate the error
rate of this attribute's rules.

41
00:03:03,030 --> 00:03:07,439
We're going to repeat that for each of the
attributes in the dataset,

42
00:03:07,439 --> 00:03:10,760
and choose the attribute with the smallest
error rate.
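
As a rough sketch of the procedure just described (not Weka's actual implementation), here it is in Python. The function name `one_r`, the list-of-dicts data layout, and the way ties are broken (first one seen wins) are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, class_name):
    """Sketch of OneR for nominal data.

    rows is a list of dicts mapping attribute names to values.
    Returns (best_attribute, rules, errors), where rules maps each
    value of the chosen attribute to the class it predicts.
    """
    best = None
    for attr in attributes:
        # Count how often each class appears for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[class_name]] += 1

        # Each value predicts its most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}

        # Errors are the instances that are not in their branch's majority class.
        errors = sum(sum(c.values()) - c[rules[value]]
                     for value, c in counts.items())

        # Keep the attribute with the fewest errors (the earlier one wins a tie).
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best
```

Dividing the returned error count by `len(rows)` gives the error rate on the training data.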

43
00:03:10,760 --> 00:03:15,049
Here's the weather data again.

44
00:03:15,049 --> 00:03:18,099
What OneR does, is it looks at each attribute
in turn,

45
00:03:18,099 --> 00:03:23,409
outlook, temperature, humidity, and wind,
and forms rules based on that.

46
00:03:23,409 --> 00:03:30,409
For outlook, there are three possible values:
sunny, overcast, and rainy.

47
00:03:30,470 --> 00:03:35,000
We just count: out of the 5 sunny instances,

48
00:03:35,000 --> 00:03:42,000
2 of them are yeses and 3 of them are nos.

49
00:03:51,730 --> 00:03:53,469
We're going to choose a rule,

50
00:03:53,469 --> 00:03:55,640
if it's sunny choose no.

51
00:03:55,640 --> 00:03:58,459
We're going to get 2 errors out of 5.

52
00:03:58,459 --> 00:04:07,110
For overcast, all of the 4 overcast values
of outlook lead to yes values for the class play.

53
00:04:07,110 --> 00:04:09,170
So, we're going to choose the rule,

54
00:04:09,170 --> 00:04:15,280
if outlook is overcast, then yes, giving us
0 errors.

55
00:04:15,280 --> 00:04:17,269
Finally, for outlook is rainy,

56
00:04:17,269 --> 00:04:18,220
we're going to choose yes,

57
00:04:18,220 --> 00:04:22,490
as well, and that would also give us 2 errors
out of the 5 instances.

58
00:04:22,490 --> 00:04:26,890
If we branch on outlook, we've got a total
of 4 errors.

59
00:04:26,890 --> 00:04:32,970
We can branch on temperature and do the same
thing.

60
00:04:32,970 --> 00:04:34,220
When temperature is hot,

61
00:04:34,220 --> 00:04:36,220
there are 2 nos and 2 yeses.

62
00:04:36,220 --> 00:04:38,300
We just choose arbitrarily in the case of
a tie,

63
00:04:38,300 --> 00:04:40,020
so we'll choose if it's hot,

64
00:04:40,020 --> 00:04:43,410
let's predict no, getting 2 errors.

65
00:04:43,410 --> 00:04:44,720
If temperature is mild,

66
00:04:44,720 --> 00:04:47,660
we'll predict yes, getting 2 errors out of 6,

67
00:04:47,660 --> 00:04:49,760
and if the temperature is cool,

68
00:04:49,760 --> 00:04:54,990
we'll predict yes, getting 1 out of the 4
instances as an error.

69
00:04:54,990 --> 00:04:58,260
And the same for humidity and wind.

70
00:04:58,260 --> 00:05:04,100
We look at the total error values; we choose
the rule with the lowest total error value -- either

71
00:05:04,100 --> 00:05:05,970
outlook or humidity.

72
00:05:05,970 --> 00:05:07,860
That's a tie, so we'll just choose arbitrarily,

73
00:05:07,860 --> 00:05:09,150
and choose outlook.
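
To check those totals, here is a small Python sketch that counts the errors for each attribute. It assumes the standard 14-instance nominal weather data, typed in by hand, with the last attribute named `windy`; the helper name `total_errors` is made up for this example.

```python
from collections import Counter, defaultdict

# Nominal weather data: outlook, temperature, humidity, windy, play
WEATHER = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".splitlines()

ATTRS = ["outlook", "temperature", "humidity", "windy"]

def total_errors(attr_index):
    """Errors made if every value of this attribute predicts its majority class."""
    counts = defaultdict(Counter)
    for line in WEATHER:
        fields = line.split(",")
        counts[fields[attr_index]][fields[-1]] += 1
    # In each branch, everything outside the majority class is an error.
    return sum(sum(c.values()) - max(c.values()) for c in counts.values())

errors = {name: total_errors(i) for i, name in enumerate(ATTRS)}
print(errors)  # {'outlook': 4, 'temperature': 5, 'humidity': 4, 'windy': 5}

# min() returns the first of the tied attributes, so outlook wins over humidity.
best = min(errors, key=errors.get)
print(best)  # outlook
```

Outlook and humidity both make 4 errors out of 14; which one wins is only a matter of how the tie is broken.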

74
00:05:09,150 --> 00:05:11,370
That's how OneR works,

75
00:05:11,370 --> 00:05:14,300
it's as simple as that.

76
00:05:14,300 --> 00:05:15,100
Let's just try it.

77
00:05:15,100 --> 00:05:15,760
Here's Weka.

78
00:05:15,760 --> 00:05:22,760
I'm going to open the
nominal weather data.

79
00:05:24,590 --> 00:05:26,520
I'm going to go to Classify.

80
00:05:26,520 --> 00:05:32,480
This is such a trivial dataset that the results
aren't very meaningful,

81
00:05:32,480 --> 00:05:36,000
but if I just run ZeroR to start off with,

82
00:05:36,000 --> 00:05:39,860
I get a [success] rate of 64%.

83
00:05:39,860 --> 00:05:43,620
If I now choose OneR,

84
00:05:45,450 --> 00:05:47,370
and run that.

85
00:05:47,370 --> 00:05:51,320
I get a rule, and the rule I get branches
on outlook:

86
00:05:51,320 --> 00:05:53,070
if it's sunny then choose no,

87
00:05:53,070 --> 00:05:56,460
overcast choose yes, and rainy choose yes.

88
00:05:56,460 --> 00:06:01,370
We get 10 out of 14 instances correct on the
training set.

89
00:06:01,370 --> 00:06:03,780
We're evaluating this using cross-validation.

90
00:06:03,780 --> 00:06:06,210
Doesn't really make much sense on such a small
dataset.

91
00:06:06,210 --> 00:06:09,160
Interesting, though, that the [success] rate
we get,

92
00:06:09,160 --> 00:06:12,100
42% is pretty bad, worse than ZeroR.

93
00:06:12,100 --> 00:06:14,440
Actually, with any 2-class problem,

94
00:06:14,440 --> 00:06:19,700
you would expect to get a success rate of
at least 50%.

95
00:06:19,700 --> 00:06:22,110
Tossing a coin would give you 50%.

96
00:06:22,110 --> 00:06:27,440
This OneR scheme is not performing very well
on this trivial dataset.

97
00:06:27,440 --> 00:06:34,290
Notice the rule it finally prints out.
Since we're using 10-fold cross-validation,

98
00:06:34,290 --> 00:06:38,940
it does the whole thing 10 times and then
on the 11th time calculates a rule from the

99
00:06:38,940 --> 00:06:43,710
entire dataset and that's what it prints out.

100
00:06:43,710 --> 00:06:48,000
That's where this rule comes from.
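
Here is a hedged Python sketch of that idea: evaluate on 10 folds, then build the printed rule from all the data. The function names and the simple every-tenth-instance fold assignment are assumptions for illustration; Weka chooses its folds differently, so the count here need not match the 42% above.

```python
from collections import Counter, defaultdict

# Nominal weather data: outlook, temperature, humidity, windy, play
ROWS = [line.split(",") for line in """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".splitlines()]

def train_one_r(rows):
    """Return (attribute index, {value: class}, default class)."""
    default = Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = None
    for a in range(len(rows[0]) - 1):
        counts = defaultdict(Counter)
        for r in rows:
            counts[r[a]][r[-1]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[0]:
            best = (errors, a, rules)
    return best[1], best[2], default

def predict(model, row):
    a, rules, default = model
    # A value never seen in training falls back to the majority class.
    return rules.get(row[a], default)

def cross_validate(rows, k=10):
    """Count correct predictions over k folds (fold = instance index mod k)."""
    correct = 0
    for fold in range(k):
        test = rows[fold::k]
        train = [r for i, r in enumerate(rows) if i % k != fold]
        model = train_one_r(train)  # a fresh rule for every fold
        correct += sum(predict(model, r) == r[-1] for r in test)
    return correct

correct = cross_validate(ROWS)   # the estimate comes from the 10 fold models
final_model = train_one_r(ROWS)  # the "11th" run: the rule that gets printed
print(f"{correct} of {len(ROWS)} correct in cross-validation")
print("final rule tests attribute", final_model[0], "with", final_model[1])
```

The ten fold models are only used to estimate performance and are then thrown away; the rule that gets printed is the one built from all 14 instances.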

101
00:06:48,000 --> 00:06:51,170
OneR, one attribute does all the work.

102
00:06:51,170 --> 00:06:55,730
This is a very simple method of machine learning
described in 1993,

103
00:06:55,730 --> 00:07:02,420
20 years ago in a paper called "Very Simple
Classification Rules Perform Well on Most

104
00:07:02,420 --> 00:07:04,390
Commonly Used Datasets"

105
00:07:04,390 --> 00:07:10,300
by a guy called Rob Holte, who lives in Canada.

106
00:07:10,300 --> 00:07:15,880
He did an experimental evaluation of the OneR
method on 16 commonly used datasets.

107
00:07:15,880 --> 00:07:20,850
He used cross-validation, just like we've told
you to, to evaluate these things,

108
00:07:20,850 --> 00:07:26,440
and he found that the simple rules from OneR
often outperformed far more complex methods

109
00:07:26,440 --> 00:07:30,910
that had been proposed for these datasets.

110
00:07:30,910 --> 00:07:34,410
How can such a simple method work so well?

111
00:07:34,410 --> 00:07:37,230
Some datasets really are simple,

112
00:07:37,230 --> 00:07:39,950
and others are so small, noisy, or complex

113
00:07:39,950 --> 00:07:42,010
that you can't learn anything from them.

114
00:07:42,010 --> 00:07:46,200
So, it's always worth trying the simplest
things first.

115
00:07:46,200 --> 00:07:50,850
Section 4.1 of the course text talks about
OneR.

116
00:07:50,850 --> 00:07:55,190
Now it's time for you to go and do the activity
associated with this lesson.

117
00:07:55,190 --> 00:07:56,770
Bye for now!