﻿1
00:00:16,250 --> 00:00:19,950
Hi! Welcome back to Data Mining with Weka.

2
00:00:19,950 --> 00:00:21,540
This is Class 2.

3
00:00:21,540 --> 00:00:28,150
In the first class, we downloaded Weka and
we looked around the Explorer and a few datasets;

4
00:00:28,150 --> 00:00:31,380
we used a classifier, the J48 classifier;

5
00:00:31,380 --> 00:00:37,870
we used a filter to remove attributes and
to remove some instances;

6
00:00:37,870 --> 00:00:44,870
we visualized some data—we visualized classification
errors on a dataset;

7
00:00:45,030 --> 00:00:50,040
and along the way we looked at a few datasets,
the weather data, both the nominal and numeric

8
00:00:50,040 --> 00:00:55,899
version, the glass data, and the iris dataset.

9
00:00:55,899 --> 00:00:58,670
This class is all about evaluation.

10
00:00:58,670 --> 00:01:03,840
In Lesson 1.4, we built a classifier using J48.

11
00:01:03,840 --> 00:01:09,289
In this first lesson of the second class,
we're going to see what it's like to actually

12
00:01:09,289 --> 00:01:11,420
be a classifier ourselves.

13
00:01:11,420 --> 00:01:17,300
Then, in subsequent lessons in this
class, we're going to look more at evaluation:

14
00:01:17,300 --> 00:01:23,000
training and testing, baseline accuracy and
cross-validation.

15
00:01:23,000 --> 00:01:27,480
First of all, we're going to see what it's
like to be a classifier.

16
00:01:27,480 --> 00:01:30,800
We're going to construct a decision tree ourselves,
interactively.

17
00:01:30,800 --> 00:01:32,840
I'm going to just open up Weka here.

18
00:01:32,840 --> 00:01:35,140
The Weka Explorer.

19
00:01:35,140 --> 00:01:43,170
I'm going to load the segment-challenge dataset.

20
00:01:43,170 --> 00:01:50,040
segment-challenge.arff -- that's the one I
want.

21
00:01:50,040 --> 00:01:52,700
We're going to look at this dataset.

22
00:01:52,700 --> 00:01:56,100
Let's first of all look at the class.

23
00:01:56,100 --> 00:02:04,540
The class values are brickface, sky, foliage,
cement, window, path, and grass.

24
00:02:04,540 --> 00:02:07,880
It looks like this is kind of an image analysis
dataset.

25
00:02:07,880 --> 00:02:15,230
When we look at the attributes, we see things
like the centroid of columns and rows, pixel

26
00:02:15,230 --> 00:02:22,110
counts, line densities, means of intensities,
and various other things.

27
00:02:23,700 --> 00:02:29,110
There's saturation and hue. The class, as I said
before, is different kinds of texture: bricks,

28
00:02:30,780 --> 00:02:33,650
sky, foliage, and so on.

29
00:02:33,650 --> 00:02:37,500
That's the segment-challenge dataset.

30
00:02:37,500 --> 00:02:42,740
I'm going to select the user classifier.

31
00:02:42,740 --> 00:02:45,490
The user classifier is a tree classifier.

32
00:02:45,490 --> 00:02:48,680
We'll see what it does in just a minute.

33
00:02:48,680 --> 00:02:53,250
That's the user classifier.

34
00:02:53,250 --> 00:02:58,290
Before I start, this is really quite important.

35
00:02:58,290 --> 00:03:00,890
I'm going to use a supplied test set.

36
00:03:00,890 --> 00:03:13,180
I'm going to set the test set, which is used
to evaluate the classifier, to be segment-test.

37
00:03:13,180 --> 00:03:18,510
The training set is segment-challenge, the
test set is segment-test.

38
00:03:20,610 --> 00:03:22,340
Now we're all set.

39
00:03:23,160 --> 00:03:29,340
I'm going to start the classifier.

40
00:03:29,860 --> 00:03:36,620
What we see is a window with two panels: the
Tree Visualizer and the Data Visualizer.

41
00:03:36,620 --> 00:03:40,460
Let's start with the Data Visualizer.

42
00:03:40,460 --> 00:03:46,140
We looked at visualization in the last class,
how you can select different attributes for

43
00:03:46,140 --> 00:03:48,430
the x and y.

44
00:03:48,430 --> 00:03:55,430
I'm going to plot the region-centroid-row
against the intensity-mean.

45
00:04:09,630 --> 00:04:11,460
That's the plot I get.

46
00:04:21,200 --> 00:04:26,520
Now, we're going to select a class.

47
00:04:26,520 --> 00:04:31,090
I'm going to select Rectangle.

48
00:04:33,890 --> 00:04:43,150
If I draw out with my mouse a rectangle here,
I'm going to have a rectangle that's pretty

49
00:04:43,150 --> 00:04:48,270
well pure reds, as far as I can see.

50
00:04:48,270 --> 00:04:54,020
I'm going to submit this rectangle.

51
00:04:54,020 --> 00:04:59,520
You can see that that area has gone and the
picture has been rescaled.

52
00:04:59,520 --> 00:05:00,750
I'm building up a tree here.

53
00:05:00,750 --> 00:05:07,750
If I look at the Tree Visualizer, I've got
a tree.

54
00:05:09,460 --> 00:05:15,860
We've split on these two attributes, region-centroid-row
and intensity-mean.

55
00:05:15,860 --> 00:05:19,000
Here we've got sky; these are all sky instances.

56
00:05:19,000 --> 00:05:23,610
Here we've got a mixture of brickface, foliage,
cement, window, path, and grass.

57
00:05:23,610 --> 00:05:26,110
We're kind of going to build up this tree.

58
00:05:26,110 --> 00:05:30,390
What I want to do is to take this node and
refine it a bit more.

59
00:05:30,390 --> 00:05:32,780
Here is the Data Visualizer again.

60
00:05:32,780 --> 00:05:39,780
I'm going to select a rectangle containing
these items here, and submit that.

61
00:05:41,470 --> 00:05:44,500
They've gone from this picture.

62
00:05:44,500 --> 00:05:53,240
You can see that here, I've created this split,
another split on region-centroid-row and

63
00:05:53,240 --> 00:05:55,520
intensity-mean, and here, this is almost all
path.

64
00:05:55,520 --> 00:06:01,710
233 path instances, and then a mixture here.

65
00:06:01,710 --> 00:06:03,750
This is a pure node we've got over there.

66
00:06:03,750 --> 00:06:05,920
This is almost a pure node.

67
00:06:05,920 --> 00:06:07,120
This is the one I want to work on.

68
00:06:07,120 --> 00:06:11,500
I'm going to cover some of those instances
now.

69
00:06:11,500 --> 00:06:15,210
Let's take this lot here and submit that.

70
00:06:15,210 --> 00:06:22,210
Then I'm going to take this lot here and submit
that.

71
00:06:23,250 --> 00:06:30,120
Maybe I'll take those ones there and submit
that.

72
00:06:30,120 --> 00:06:33,900
This little cluster here seems pretty uniform.

73
00:06:33,900 --> 00:06:34,410
Submit that.

74
00:06:34,410 --> 00:06:38,110
I haven't actually changed the axes, but,
of course, at any time, I could change these

75
00:06:38,110 --> 00:06:43,930
axes to better separate the remaining classes.

76
00:06:43,930 --> 00:06:45,800
I could kind of mess around with these.

77
00:06:45,800 --> 00:06:51,180
Actually, a quick way to do it is to click
here on these bars.

78
00:06:51,180 --> 00:06:55,750
Left click for x and right click for y.

79
00:06:55,750 --> 00:07:02,750
I can quickly explore different pairs of axes
to see if I can get a better split.

80
00:07:07,370 --> 00:07:08,300
Here's the tree I've created.

81
00:07:08,300 --> 00:07:11,680
I'm going to fit it to the screen.

82
00:07:11,680 --> 00:07:12,300
It looks like this.

83
00:07:12,300 --> 00:07:18,650
You can see that we have successively elaborated
down this branch here.

84
00:07:18,650 --> 00:07:25,650
When I finish with this, I can accept the
tree.

85
00:07:26,250 --> 00:07:33,250
Actually, before I do that, let me just show
you that we were selecting rectangles here,

86
00:07:33,500 --> 00:07:37,650
but I've got other things I can select: a
polygon or a polyline.

87
00:07:37,650 --> 00:07:43,520
If I don't want to use rectangles, I can use
polygons or polylines.

88
00:07:43,520 --> 00:07:47,170
If you like, you can experiment with those
to select different shaped areas.

89
00:07:51,000 --> 00:07:58,900
There's an area I've got selected; I just can't
quite finish it off.

90
00:07:58,900 --> 00:08:03,150
Alright, I right clicked to finish it off.

91
00:08:03,150 --> 00:08:04,900
I could submit that.

92
00:08:04,900 --> 00:08:06,590
I'm not confined to rectangles;

93
00:08:06,590 --> 00:08:08,940
I can use different shapes.

94
00:08:08,940 --> 00:08:10,430
I'm not going to do that.

95
00:08:10,430 --> 00:08:12,000
I'm satisfied with this tree for the moment.

96
00:08:12,000 --> 00:08:13,920
I'm going to accept the tree.

97
00:08:13,920 --> 00:08:18,420
Once I do this, there is no going back, so
you want to be sure.

98
00:08:18,420 --> 00:08:21,840
If I accept the tree, "Are you sure?" -- yes.

99
00:08:21,840 --> 00:08:26,110
Here, I've got a confusion matrix, and I can
look at the errors.

100
00:08:26,110 --> 00:08:35,320
My tree classifies 78% of the instances correctly
(nearly 79%), and 21% incorrectly.
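
As a rough sketch of how a percentage like that comes out of a confusion matrix: the correct predictions sit on the diagonal, and everything else is an error. The matrix below is invented for illustration (three classes, made-up counts); it is not the one shown in the lesson, and `accuracy` is just a helper name chosen here.

```python
# Hypothetical 3-class confusion matrix (rows = actual class, columns = predicted).
# The counts are invented for illustration; they are not the lesson's results.
confusion = [
    [50, 3, 2],
    [4, 40, 6],
    [1, 5, 39],
]


def accuracy(matrix):
    """Fraction of instances on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


acc = accuracy(confusion)
print(f"correct: {acc:.1%}, incorrect: {1 - acc:.1%}")
```

The off-diagonal cells show which classes get confused with which, and that is what you look at when you inspect the errors.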

101
00:08:35,320 --> 00:08:40,500
That's not too bad, especially considering
how quickly I built that tree.

102
00:08:42,480 --> 00:08:44,870
It's over to you now.

103
00:08:44,870 --> 00:08:49,480
I'd like you to play around and see if you
can do better than this by spending a little

104
00:08:49,480 --> 00:08:52,780
bit longer on getting a nice tree.

105
00:08:52,780 --> 00:08:56,010
I'd like you to reflect on a couple of things.

106
00:08:56,010 --> 00:08:59,300
First of all, what strategy you're using to
build this tree.

107
00:08:59,300 --> 00:09:04,220
Basically, we're covering different regions
of the instance space, trying to get pure

108
00:09:04,220 --> 00:09:07,430
regions to create pure branches.

109
00:09:07,430 --> 00:09:10,670
This is kind of like a bottom-up covering
strategy.

110
00:09:10,670 --> 00:09:15,890
We cover this area and this area and this
area.
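
One way to picture this covering strategy is as an ordered list of rectangles, each tagged with a class: an instance takes the label of the first rectangle that contains it, and anything left over falls through to a default. This is only a simplified sketch, not how Weka's user classifier is implemented; the `Rect` type, the coordinates, and the `"unknown"` default are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """An axis-aligned region of the plot, tagged with the class it covers."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    label: str

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def classify(rects, x, y, default="unknown"):
    """Return the label of the first rectangle covering (x, y), else the default."""
    for rect in rects:
        if rect.contains(x, y):
            return rect.label
    return default


# Hypothetical regions, in the order they were "submitted".
regions = [
    Rect(0, 10, 100, 150, "sky"),
    Rect(0, 20, 0, 30, "path"),
]
print(classify(regions, 5, 120))
print(classify(regions, 50, 50))
```

Each rectangle submitted corresponds to one more split in the tree: inside the region is a leaf, outside is whatever remains to be covered.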

111
00:09:15,890 --> 00:09:17,510
That's not how J48 works.

112
00:09:17,510 --> 00:09:23,110
When it builds its trees, it tries to do a
judicious split through the whole dataset.

113
00:09:23,110 --> 00:09:30,110
At the very top level, it'll split the entire
dataset into two in a way that doesn't necessarily

114
00:09:30,220 --> 00:09:34,760
split out particular classes, but makes it
easier when it starts working on each half

115
00:09:34,760 --> 00:09:40,920
of the dataset, splitting further in a top-down
manner in order to try and produce an optimal tree.
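
To give a feel for a top-down split, here is a minimal sketch that picks one threshold on one numeric attribute so that the two halves are as pure as possible, measured by weighted entropy. It is a simplification and not J48's actual procedure (J48 is based on C4.5, which uses a gain-ratio criterion and considers every attribute); `entropy`, `best_threshold`, and the sample values are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_entropy) for the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical attribute values and classes.
values = [1, 2, 3, 10, 11, 12]
labels = ["sky", "sky", "sky", "grass", "grass", "grass"]
print(best_threshold(values, labels))
```

A tree learner would apply the same search recursively to each half, which is the top-down part.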

116
00:09:40,920 --> 00:09:46,370
It will produce trees much better than the
one that I just produced with the user classifier.

117
00:09:46,370 --> 00:09:52,350
I'd also like you to reflect on what it is
we're trying to do here.

118
00:09:52,350 --> 00:09:57,940
Given enough time, you could produce a 'perfect'
tree for the dataset, but don't forget that

119
00:09:57,940 --> 00:10:01,500
the dataset that we've loaded is the training
dataset.

120
00:10:01,500 --> 00:10:07,690
We're going to evaluate this tree on a different
dataset, the test dataset, which hopefully

121
00:10:07,690 --> 00:10:11,530
comes from the same source, but is not identical
to the training dataset.

122
00:10:11,530 --> 00:10:15,240
We're not trying to precisely fit the training
dataset;

123
00:10:15,240 --> 00:10:21,290
we're trying to fit it in a way that generalizes
the kinds of patterns exhibited in the dataset.

124
00:10:21,290 --> 00:10:26,550
We're looking for something that will perform
well on the test data.
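
A small sketch of why a "perfect" fit to the training data is not the goal: the model below simply memorizes every training instance, so it scores 100% on the training set, but on unseen instances it can only fall back to the majority class. The data, `fit_memorizer`, and `accuracy_on` are all made up for this illustration.

```python
from collections import Counter


def fit_memorizer(train):
    """'Perfect' on training data: remember every (features -> label) pair."""
    table = {features: label for features, label in train}
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    # Unseen feature vectors fall back to the most common training class.
    return lambda features: table.get(features, majority)


def accuracy_on(model, data):
    """Fraction of instances in data that the model labels correctly."""
    return sum(model(features) == label for features, label in data) / len(data)


# Hypothetical (features, class) pairs; the test set is similar but not identical.
train = [((1, 120), "sky"), ((2, 118), "sky"), ((3, 119), "sky"),
         ((9, 30), "grass"), ((8, 35), "path")]
test = [((1, 121), "sky"), ((9, 31), "grass"), ((8, 34), "path")]

model = fit_memorizer(train)
print("training accuracy:", accuracy_on(model, train))
print("test accuracy:", accuracy_on(model, test))
```

The gap between the two numbers is the reason the classifier is evaluated on a separate test set.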

125
00:10:26,550 --> 00:10:32,800
That highlights the importance of evaluation
in machine learning.

126
00:10:32,800 --> 00:10:37,700
That's what this class is going to be about,
different ways of evaluating your classifier.

127
00:10:37,700 --> 00:10:40,230
That's it.

128
00:10:40,230 --> 00:10:45,260
There's some information in the course text
about the user classifier, which you can read

129
00:10:45,260 --> 00:10:47,050
if you like.

130
00:10:47,050 --> 00:10:52,950
Please go on and do the activity associated
with this lesson and produce your own classifier.

131
00:10:52,950 --> 00:10:58,850
Hopefully, you'll be able to do much better
than me given 5-10 minutes.

132
00:10:58,850 --> 00:11:05,850
Good luck!

