1
00:00:16,520 --> 00:00:23,450
Hello again, and welcome to Data Mining with
Weka, back here in New Zealand. In this class,

2
00:00:23,450 --> 00:00:26,780
Class 4, we're going to look at some pretty
cool machine learning methods.

3
00:00:26,780 --> 00:00:33,210
We're going to look at linear regression,
classification by regression, logistic regression,

4
00:00:33,210 --> 00:00:38,210
support vector machines, and ensemble learning.
The last few of these are contemporary methods,

5
00:00:38,210 --> 00:00:42,000
which haven't been around very long. They
are kind of state-of-the-art machine learning

6
00:00:42,000 --> 00:00:43,390
methods.

7
00:00:43,390 --> 00:00:49,960
Remember, there are 5 classes in this course,
so next week is Class 5, the last class. We'll

8
00:00:49,960 --> 00:00:56,260
be tidying things up and summarizing things
then. You're well over halfway through; you're

9
00:00:56,260 --> 00:00:59,210
doing well. Just hang on in there.

10
00:00:59,210 --> 00:01:05,420
In this lesson, we're going to start by looking
at classification boundaries for different

11
00:01:05,420 --> 00:01:10,470
machine learning methods. We're going to use
Weka's Boundary Visualizer, which is another

12
00:01:10,470 --> 00:01:12,690
Weka tool that we haven't encountered yet.

13
00:01:12,690 --> 00:01:16,080
I'm going to use a 2-dimensional dataset.

14
00:01:16,080 --> 00:01:26,170
I've prepared iris.2d.arff. It's a
2-dimensional version of the iris dataset.

15
00:01:26,170 --> 00:01:31,170
I took the regular iris dataset and deleted
a couple of attributes -- sepallength and

16
00:01:31,170 --> 00:01:36,920
sepalwidth -- leaving me with this 2D dataset,
and the class.

17
00:01:36,920 --> 00:01:44,850
We're going to look at that using the Boundary
Visualizer. You get that from this Visualization

18
00:01:44,850 --> 00:01:51,409
menu on the Weka GUI Chooser. There are a lot
of tools in Weka, and we're just going to

19
00:01:51,409 --> 00:01:57,009
look at this one here, the Boundary Visualizer.
I'm going to open the same file in the Boundary

20
00:01:57,009 --> 00:02:06,000
Visualizer, the 2-dimensional iris dataset.
Here we've got a plot of the data.

21
00:02:06,000 --> 00:02:11,200
You can see that we're plotting petalwidth on the
y-axis against petallength on the x-axis.

22
00:02:11,200 --> 00:02:17,489
This is a picture of the dataset with the
3 classes: setosa in red, versicolor in green,

23
00:02:17,489 --> 00:02:20,859
and virginica in blue.

24
00:02:20,859 --> 00:02:26,309
I'm going to choose a classifier. Let's begin
with the OneR classifier, which is in rules.

25
00:02:30,700 --> 00:02:33,700
I'm going to turn on "plot training data" and
just let it rip.

26
00:02:34,880 --> 00:02:41,420
The color diagram shows the decision boundaries,
with the training data superimposed on it.

27
00:02:43,489 --> 00:02:44,260
Let's look at what

28
00:02:44,260 --> 00:02:51,260
OneR does to this dataset in the Explorer.

29
00:02:53,220 --> 00:02:56,380
OneR has chosen to split on petalwidth.

30
00:02:56,389 --> 00:03:00,140
If it's less than a certain amount, we get a
setosa; if it's intermediate, we get a versicolor;

31
00:03:00,140 --> 00:03:02,769
and if it's greater than the upper boundary,
we get a virginica.

32
00:03:02,769 --> 00:03:08,949
It's the same as what's being shown here.
We're splitting on petalwidth. If it's less

33
00:03:08,949 --> 00:03:14,650
than a certain amount, we get a setosa; in
the middle, a versicolor; and at the top,

34
00:03:14,650 --> 00:03:16,790
a virginica.

35
00:03:17,850 --> 00:03:22,379
This is a spatial representation of the decision
boundary that OneR creates on this dataset.

36
00:03:22,379 --> 00:03:26,069
That's what the Boundary Visualizer does;
it draws decision boundaries.

37
00:03:26,069 --> 00:03:30,859
It shows here that OneR chooses an attribute
-- in this case petalwidth -- to split on.

38
00:03:30,859 --> 00:03:35,309
It might have chosen petallength, in which
case we'd have vertical decision boundaries.

39
00:03:35,309 --> 00:03:40,669
Either way, we're going to get stripes from
OneR.
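
As a rough sketch of why OneR gives stripes: a rule that tests a single attribute never looks at the other axis, so the predicted class depends only on height in the plot. The function name and the cut points `low` and `high` below are hypothetical values chosen for illustration; they are not the thresholds Weka reports for this dataset.

```python
def one_r_predict(petalwidth, low=0.8, high=1.75):
    """Classify from petalwidth alone; low/high are hypothetical cut points."""
    if petalwidth < low:
        return "setosa"
    if petalwidth < high:
        return "versicolor"
    return "virginica"

# petallength (the x-axis) is never consulted, so every point at the same
# height gets the same class, which is what produces horizontal stripes.
for petallength in (1.0, 4.0, 6.5):
    print(petallength, one_r_predict(petalwidth=1.2))
```

Had the rule tested petallength instead, the same code shape would give vertical stripes.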

40
00:03:40,669 --> 00:03:45,779
I'm going to go ahead and look at some boundaries
for other schemes.

41
00:03:45,779 --> 00:03:50,790
Let's look at IBk, which is a "lazy" classifier.

42
00:03:50,790 --> 00:03:55,629
That's the instance-based learner we looked
at in the last class.

43
00:03:55,629 --> 00:03:58,749
I'm going to run that.

44
00:03:58,749 --> 00:04:02,139
Here we get a different kind of pattern.

45
00:04:02,139 --> 00:04:03,489
I'll just stop it there.

46
00:04:03,489 --> 00:04:05,849
We've got diagonal lines.

47
00:04:05,849 --> 00:04:12,409
Down here are the setosas underneath this
diagonal line; the versicolors in the intermediate

48
00:04:12,409 --> 00:04:16,979
region; and the virginicas, by and large,
in the top right-hand corner.

49
00:04:16,979 --> 00:04:18,979
Remember what [IBk] does.

50
00:04:18,979 --> 00:04:21,759
It takes a test instance.

51
00:04:21,759 --> 00:04:28,800
Let's say we had an instance here, just on
this side of the boundary, in the red.

52
00:04:28,900 --> 00:04:31,810
Then it chooses the nearest instance to that.

53
00:04:31,810 --> 00:04:36,150
That would be this one, I guess.

54
00:04:36,150 --> 00:04:38,199
That's kind of nearer than this one here.

55
00:04:38,199 --> 00:04:38,939
This is a red point.

56
00:04:38,939 --> 00:04:45,449
If I were to cross over the boundary here,
it would choose a green class, because this

57
00:04:45,449 --> 00:04:47,169
would be the nearest instance then.

58
00:04:47,169 --> 00:04:54,110
If you think about it, this boundary goes
halfway between this nearest red point and

59
00:04:54,110 --> 00:04:55,780
this nearest green point.

60
00:04:55,780 --> 00:05:01,689
Similarly, if I take a point up here, I guess
the two nearest instances are this blue one

61
00:05:01,689 --> 00:05:03,659
and this green one.

62
00:05:03,659 --> 00:05:05,710
This blue one is closer.

63
00:05:05,710 --> 00:05:08,710
In this case, the boundary goes along this
straight line here.

64
00:05:08,710 --> 00:05:14,669
You can see that it's not just a single line:
this is a piecewise linear line, so this part

65
00:05:14,669 --> 00:05:19,189
of the boundary goes exactly halfway between
these two points quite close to it.

66
00:05:19,189 --> 00:05:24,300
Down here, the boundary goes exactly halfway
between these two points.

67
00:05:24,300 --> 00:05:27,949
It's the perpendicular bisector of the line
joining these points.

68
00:05:27,949 --> 00:05:31,990
So we get a piecewise linear boundary made
up of little pieces.
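
The perpendicular-bisector idea can be checked with a small sketch of a 1-nearest-neighbor rule. This is plain Python rather than Weka's IBk, and the two training points are made-up coordinates, not rows from iris.2d.arff.

```python
import math

def nearest_neighbor(point, training):
    """Return the label of the training instance closest to point (1-NN)."""
    return min(training, key=lambda item: math.dist(point, item[0]))[1]

# Two hypothetical training instances: (petallength, petalwidth) -> class.
red = ((1.9, 0.4), "setosa")
green = ((3.3, 1.0), "versicolor")
training = [red, green]

# The midpoint of the segment joining them is equidistant from both, so it
# lies on the perpendicular bisector, which is the decision boundary here.
mid = ((red[0][0] + green[0][0]) / 2, (red[0][1] + green[0][1]) / 2)
assert math.isclose(math.dist(mid, red[0]), math.dist(mid, green[0]))

# Nudging the test point toward either instance flips the prediction.
toward_red = (mid[0] - 0.1, mid[1])
toward_green = (mid[0] + 0.1, mid[1])
print(nearest_neighbor(toward_red, training))
print(nearest_neighbor(toward_green, training))
```

With many training points, each pair of neighboring instances from different classes contributes one such bisector segment, which is where the piecewise linear shape comes from.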

69
00:05:31,990 --> 00:05:36,919
It's kind of interesting to see what happens
if we change the parameter: if we look at,

70
00:05:36,919 --> 00:05:41,849
say, 5 nearest neighbors instead of just 1.

71
00:05:41,849 --> 00:05:53,870
Now we get a slightly blurry picture, because
whereas down here in the pure red region the

72
00:05:53,870 --> 00:06:00,199
5 nearest neighbors to a point are all red
points, if we look in the intermediate region

73
00:06:00,199 --> 00:06:05,520
here, then the nearest neighbors to a point
here -- this is going to be in the 5, and

74
00:06:05,520 --> 00:06:08,969
this might be another one in the 5, and there
might be a couple more down here in the 5.

75
00:06:08,969 --> 00:06:14,509
So we get an intermediate color here, and
IBk takes a vote.

76
00:06:14,509 --> 00:06:21,400
If we had 3 reds and 2 greens, then we'd be
in the red region and that would be depicted

77
00:06:21,400 --> 00:06:24,659
as this darker red here.

78
00:06:24,659 --> 00:06:29,300
If it had been the other way round with more
greens than reds, we'd be in the green region.

79
00:06:29,300 --> 00:06:33,710
So we've got a blurring of these boundaries.

80
00:06:33,710 --> 00:06:36,979
These are probabilistic descriptions of the
boundary.
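
A minimal sketch of the vote, assuming simple unweighted voting among the k nearest instances. The function name and the six training points are hypothetical; the point is only that the class proportions, rather than a single label, are what a blurred color would encode.

```python
import math
from collections import Counter

def knn_class_proportions(point, training, k):
    """Return {label: fraction of the k nearest neighbors with that label}."""
    nearest = sorted(training, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / k for label, count in votes.items()}

# Hypothetical (petallength, petalwidth) instances on either side of a boundary.
training = [
    ((1.4, 0.2), "setosa"), ((1.5, 0.3), "setosa"), ((1.7, 0.4), "setosa"),
    ((3.0, 1.1), "versicolor"), ((3.3, 1.0), "versicolor"),
    ((3.5, 1.0), "versicolor"),
]

# Deep inside the setosa cluster all neighbors agree.
pure = knn_class_proportions((1.5, 0.2), training, k=3)
# Between the clusters the 5 nearest neighbors split 3 to 2.
mixed = knn_class_proportions((2.3, 0.7), training, k=5)
print(pure)
print(mixed)
```

The majority class still wins in the mixed case, but the 3-to-2 split is what shows up as a less saturated color.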

81
00:06:36,979 --> 00:06:43,979
Let me just change k to 20 and see what happens.

82
00:06:46,949 --> 00:06:52,219
Now we get the same shape, but even more blurry
boundaries.

83
00:06:52,219 --> 00:06:57,469
The Boundary Visualizer reveals the way that
machine learning schemes are thinking, if

84
00:06:57,469 --> 00:06:58,210
you like.

85
00:06:58,210 --> 00:07:02,180
It shows their internal representation of the dataset.

86
00:07:02,180 --> 00:07:08,520
These pictures help you think about the sorts of things
that machine learning methods do.

87
00:07:08,520 --> 00:07:11,340
Let's choose another scheme.

88
00:07:11,340 --> 00:07:13,529
I'm going to choose NaiveBayes.

89
00:07:13,529 --> 00:07:19,529
When we talked about NaiveBayes, we only talked
about discrete attributes.

90
00:07:19,529 --> 00:07:27,000
With continuous attributes, I'm going to choose
a supervised discretization method.

91
00:07:27,009 --> 00:07:32,550
Don't worry about this detail; it's the most
common way of using NaiveBayes with

92
00:07:32,550 --> 00:07:34,720
numeric attributes.

93
00:07:36,120 --> 00:07:38,430
Let's look at that picture.

94
00:07:40,120 --> 00:07:41,370
This is interesting.

95
00:07:41,370 --> 00:07:46,080
When you think about NaiveBayes, it treats
each of the two attributes as contributing

96
00:07:46,080 --> 00:07:48,550
equally and independently to the decision.

97
00:07:48,550 --> 00:07:53,099
It sort of decides what it should be along
this dimension and decides what it should

98
00:07:53,099 --> 00:07:56,490
be along this dimension and multiplies the
two together.

99
00:07:56,490 --> 00:08:00,499
Remember the multiplication that went on in
NaiveBayes.

100
00:08:00,499 --> 00:08:05,499
When you multiply these things together, you
get a checkerboard pattern of probabilities,

101
00:08:05,499 --> 00:08:06,840
multiplying up the probabilities.

102
00:08:06,840 --> 00:08:10,289
That's because the attributes are being treated
independently.
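
The checkerboard can be reproduced with a toy calculation. This sketch assumes each attribute has been discretized into three bins (low, mid, high) and that the classes have equal priors; all the probabilities are invented for illustration and are not estimated from the iris data.

```python
# Hypothetical likelihoods P(bin | class) for bins [low, mid, high].
p_length = {
    "setosa":     [0.90, 0.08, 0.02],
    "versicolor": [0.05, 0.85, 0.10],
    "virginica":  [0.02, 0.18, 0.80],
}
p_width = {
    "setosa":     [0.92, 0.06, 0.02],
    "versicolor": [0.04, 0.86, 0.10],
    "virginica":  [0.02, 0.10, 0.88],
}

def score(cls, length_bin, width_bin, prior=1 / 3):
    """Naive Bayes score: prior times the product of per-attribute likelihoods."""
    return prior * p_length[cls][length_bin] * p_width[cls][width_bin]

def predict(length_bin, width_bin):
    """Pick the class with the highest score for this grid cell."""
    return max(p_length, key=lambda cls: score(cls, length_bin, width_bin))

# Print the 3x3 grid with the high petalwidth bin on top, like the plot.
for width_bin in (2, 1, 0):
    print([predict(length_bin, width_bin) for length_bin in range(3)])
```

Every cell of the grid gets one score per class, and that score is constant across the whole cell, which is why the picture is made of rectangles rather than diagonal pieces.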

103
00:08:10,289 --> 00:08:16,629
That's a very different kind of decision boundary
from what we saw with instance-based learning.

104
00:08:16,629 --> 00:08:21,370
That's what's so good about the Boundary Visualizer:
it helps you think about how things are working

105
00:08:21,370 --> 00:08:21,870
inside.

106
00:08:21,870 --> 00:08:24,809
I'm going to do one more example.

107
00:08:24,809 --> 00:08:30,650
I'm going to do J48, which is in trees.

108
00:08:30,650 --> 00:08:37,650
Here we get this kind of structure.

109
00:08:39,190 --> 00:08:46,190
Let's take a look at what happens in the Explorer
if we choose J48.

110
00:08:48,230 --> 00:08:55,230
We get this little decision tree: split first
on petalwidth; if it's less than 0.6, it's

111
00:08:57,300 --> 00:08:59,260
a setosa for sure.

112
00:08:59,260 --> 00:09:05,620
Then split again on petalwidth; if it's greater
than 1.7, it's a virginica for sure.

113
00:09:05,620 --> 00:09:11,540
Then, in between, split on petallength and
then again on petalwidth, getting a mixture

114
00:09:11,540 --> 00:09:14,339
of versicolors and virginicas.

115
00:09:14,339 --> 00:09:20,010
We split first on petalwidth; that's this
split here.

116
00:09:20,010 --> 00:09:22,690
Remember the vertical axis is the petalwidth
axis.

117
00:09:22,690 --> 00:09:26,510
If it's less than a certain amount, it's a
setosa for sure.

118
00:09:26,510 --> 00:09:28,930
Then we split again on the same axis.

119
00:09:28,930 --> 00:09:32,899
If it's greater than a certain amount, it's
a virginica for sure.

120
00:09:32,899 --> 00:09:38,980
If it's in the intermediate region, we split
on the other axis, which is petallength.

121
00:09:38,980 --> 00:09:47,000
Down here, it's a versicolor for sure, and
here we're going to split again on the petalwidth attribute.
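
The tree just described can be written as nested tests, which makes it clear why each region in the picture is an axis-parallel rectangle. The thresholds 0.6 and 1.7 are the ones quoted in the lesson. The two inner cut points (`length_cut`, `width_cut`), the class assigned to each of the last two leaves, and the use of `<=` at the boundaries are assumptions made for this sketch.

```python
def j48_like_tree(petallength, petalwidth, length_cut=4.9, width_cut=1.5):
    """Nested tests mirroring the tree in the lesson.

    0.6 and 1.7 come from the lesson; length_cut, width_cut and the labels
    of the final two leaves are hypothetical placeholders.
    """
    if petalwidth <= 0.6:
        return "setosa"          # bottom stripe
    if petalwidth > 1.7:
        return "virginica"       # top stripe
    # Intermediate band: now split on the other axis.
    if petallength <= length_cut:
        return "versicolor"
    # Right-hand part of the band: split once more on petalwidth.
    return "virginica" if petalwidth <= width_cut else "versicolor"

print(j48_like_tree(petallength=1.4, petalwidth=0.2))
print(j48_like_tree(petallength=4.0, petalwidth=1.3))
print(j48_like_tree(petallength=6.0, petalwidth=2.3))
```

Each `if` tests one attribute against one constant, so every boundary is either horizontal or vertical.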

122
00:09:49,670 --> 00:10:04,700
Let's change the minNumObj parameter, which
controls the minimum size of the leaves.

123
00:10:04,709 --> 00:10:08,430
If we increase that, we're going to get a
simpler tree.

124
00:10:08,430 --> 00:10:12,100
We discussed this parameter in one of the
lessons of Class 3.

125
00:10:12,100 --> 00:10:19,100
If we run now, then we get a simpler version,
corresponding to the simpler rules we get

126
00:10:19,209 --> 00:10:20,399
with this parameter set.

127
00:10:20,399 --> 00:10:27,399
Or we can set the parameter to a higher value,
say 10, and run it again.

128
00:10:28,610 --> 00:10:34,089
We get even simpler rules, very similar to
the rules produced by OneR.

129
00:10:37,200 --> 00:10:40,410
We've looked at classification boundaries.

130
00:10:40,410 --> 00:10:45,990
Classifiers create boundaries in instance
space and different classifiers have different

131
00:10:45,990 --> 00:10:48,410
capabilities for carving up instance space.

132
00:10:48,410 --> 00:10:53,630
That's called the "bias" of the classifier
-- the way in which it's capable of carving

133
00:10:53,630 --> 00:10:56,560
up the instance space.

134
00:10:56,560 --> 00:11:02,300
We looked at OneR, IBk, NaiveBayes, and J48,
and found completely different biases, completely

135
00:11:02,300 --> 00:11:04,579
different ways they carve up the instance
space.

136
00:11:04,579 --> 00:11:11,440
Of course, this kind of visualization is restricted
to numeric attributes and 2-dimensional plots,

137
00:11:11,440 --> 00:11:17,529
so it's not a very general tool, but it certainly
helps you think about these different classifiers.

138
00:11:17,529 --> 00:11:23,930
You can read about classification boundaries
in Section 17.3 of the course text.

139
00:11:23,930 --> 00:11:27,840
Now off you go and do the activity associated
with this lesson.

140
00:11:27,840 --> 00:11:29,220
Good luck! We'll see you later.

141
00:11:29,220 --> 00:11:30,400
Bye!