1
00:00:18,740 --> 00:00:22,890
Hello! In the last lesson, we looked at
using a classifier in Weka, J48.

2
00:00:22,890 --> 00:00:27,020
In this lesson, we're going to look at
another of Weka's principal features:

3
00:00:27,020 --> 00:00:28,550
filters.

4
00:00:28,550 --> 00:00:31,650
One of the main messages of this course
is that it's really important when you're

5
00:00:31,650 --> 00:00:34,390
data mining to get close to your data,

6
00:00:34,390 --> 00:00:38,339
and to think about preprocessing it,
or filtering it in some way

7
00:00:38,339 --> 00:00:41,230
before applying a classifier.

8
00:00:42,500 --> 00:00:47,450
I'm going to start by using a filter
to remove an attribute from the weather data.

9
00:00:48,540 --> 00:00:53,600
Let me start up the Weka Explorer and open

10
00:00:53,600 --> 00:00:56,630
the weather data.

11
00:00:56,630 --> 00:00:57,970
That's the one.

12
00:00:57,970 --> 00:01:00,650
I'm going to remove

13
00:01:00,650 --> 00:01:02,180
the humidity attribute.

14
00:01:05,040 --> 00:01:07,010
That's attribute number 3.

15
00:01:07,010 --> 00:01:11,310
I can look at filters. Just as we chose a
classifier using the Choose button

16
00:01:11,310 --> 00:01:13,110
on the Classify panel,

17
00:01:13,110 --> 00:01:16,370
we choose filters by using
the Choose button here.

18
00:01:16,370 --> 00:01:19,190
There are a lot of different filters.

19
00:01:19,190 --> 00:01:23,040
AllFilter and MultiFilter are ways of
combining filters. We have supervised and

20
00:01:23,040 --> 00:01:27,470
unsupervised filters. Supervised filters
are ones that use a class value for

21
00:01:27,470 --> 00:01:28,810
their operation.

22
00:01:28,810 --> 00:01:32,450
They aren't as common as unsupervised
filters, which don't use the

23
00:01:32,450 --> 00:01:36,400
class value. There are attribute filters and
instance filters. We want to remove an attribute,

24
00:01:36,400 --> 00:01:39,520
so we're looking for an attribute filter.

25
00:01:39,520 --> 00:01:42,500
There are so many filters in Weka that
you just have to learn to kind of look around

26
00:01:42,500 --> 00:01:44,470
and find what you want.

27
00:01:44,470 --> 00:01:47,030
I'm going to look for removing an attribute.

28
00:01:49,560 --> 00:01:52,270
Here we go, Remove.

29
00:01:52,270 --> 00:01:53,170
Now, as before

30
00:01:53,170 --> 00:01:57,160
when we configured the J48
classifier, we clicked here.

31
00:01:57,160 --> 00:01:59,070
I'm going to click here, and we can

32
00:01:59,070 --> 00:01:59,880
configure the filter.

33
00:01:59,880 --> 00:02:04,670
This is "A filter that removes a
range of attributes from the dataset".

34
00:02:04,670 --> 00:02:08,060
I can specify a range of attributes here.

35
00:02:08,060 --> 00:02:11,460
I just want to remove one. I think it was
attribute number 3 we were going to remove.

36
00:02:13,490 --> 00:02:18,270
I can invert the selection and remove
all the other attributes and leave 3,

37
00:02:18,270 --> 00:02:21,000
but I'm just going to leave
it like that. Click OK,

38
00:02:21,000 --> 00:02:23,050
and watch humidity go

39
00:02:23,050 --> 00:02:27,500
when we apply the filter. Nothing
happens until you apply the filter.

40
00:02:27,500 --> 00:02:28,950
I've just applied it,

41
00:02:28,950 --> 00:02:33,990
and here we are, the humidity
attribute has been removed.

42
00:02:33,990 --> 00:02:39,530
Luckily, I can undo the effect of that
and put it back by pressing the Undo button.

43
00:02:39,530 --> 00:02:41,300
That's how to remove an attribute.
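
The effect of the Remove filter can be sketched outside Weka. This is a minimal Python illustration, not Weka's implementation: the helper `remove_attributes` and its `invert` flag are hypothetical names that mimic the attribute range and invert-selection options described above, and the three rows are an illustrative sample rather than the full weather data.

```python
# Toy stand-in for the weather data: attribute names plus a few sample rows.
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]
ROWS = [
    ["sunny", "hot", "high", "FALSE", "no"],
    ["overcast", "hot", "high", "FALSE", "yes"],
    ["rainy", "cool", "normal", "TRUE", "no"],
]


def remove_attributes(attributes, rows, indices, invert=False):
    """Drop the 1-based attribute positions listed in `indices`.

    With invert=True, keep only those positions instead.
    """
    selected = {i - 1 for i in indices}
    keep = [
        col for col in range(len(attributes))
        if (col in selected) == invert
    ]
    new_attributes = [attributes[c] for c in keep]
    new_rows = [[row[c] for c in keep] for row in rows]
    return new_attributes, new_rows


# Remove attribute number 3 (humidity), as in the lesson.
attrs, rows = remove_attributes(ATTRIBUTES, ROWS, [3])
print(attrs)  # ['outlook', 'temperature', 'windy', 'play']
```

The function returns new lists rather than changing its inputs, so the original data stays available, loosely mirroring the Undo button.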

44
00:02:41,300 --> 00:02:44,370
Actually, the bad news is there is a
much easier way to remove an attribute.

45
00:02:44,370 --> 00:02:45,870
You don't need to use a filter at all.

46
00:02:45,870 --> 00:02:48,210
If you just want to remove an attribute,

47
00:02:48,210 --> 00:02:50,900
you can select it here and click
the Remove button at the bottom.

48
00:02:50,900 --> 00:02:53,580
It does the same job.

49
00:02:53,580 --> 00:02:54,740
Sorry about that.

50
00:02:54,740 --> 00:02:57,890
But filters are really
useful and can do much more

51
00:02:57,890 --> 00:02:59,860
complex things than that.

52
00:02:59,860 --> 00:03:02,540
Let's, for example, imagine removing,

53
00:03:02,540 --> 00:03:08,910
not an attribute, but let's remove
all instances where humidity has the value 'high'.

54
00:03:08,910 --> 00:03:13,520
That is, attribute number 3 has this first
value. That's going to remove seven

55
00:03:13,520 --> 00:03:15,840
instances from the dataset. There are

56
00:03:15,840 --> 00:03:17,950
fourteen instances
altogether, so we're going to get

57
00:03:17,950 --> 00:03:19,800
left with a reduced dataset of

58
00:03:19,800 --> 00:03:20,580
seven [instances].

59
00:03:23,340 --> 00:03:27,890
Let's look for a filter to
do that. We want to

60
00:03:27,890 --> 00:03:31,890
remove instances, so it's
going to be an instance filter.

61
00:03:31,890 --> 00:03:34,650
I just have to look down here and

62
00:03:34,650 --> 00:03:37,740
see if there is anything suitable.

63
00:03:37,740 --> 00:03:40,500
How about RemoveWithValues?

64
00:03:40,500 --> 00:03:44,270
The RemoveWithValues filter.

65
00:03:44,270 --> 00:03:47,200
I can click that to configure it,

66
00:03:47,200 --> 00:03:51,890
and I can click More to see
what it does. Here it says it

67
00:03:51,890 --> 00:03:55,740
"Filters instances according to
the value of an attribute",

68
00:03:55,740 --> 00:03:57,920
which is exactly what we want.

69
00:03:57,920 --> 00:04:01,890
We're going to set the attributeIndex;
we want the third

70
00:04:01,890 --> 00:04:03,710
attribute, humidity,

71
00:04:03,710 --> 00:04:07,590
and the first value. We can remove
a number of different values; we'll just remove

72
00:04:07,590 --> 00:04:09,040
the first value.

73
00:04:09,040 --> 00:04:10,490
Now we've configured that.

74
00:04:10,490 --> 00:04:13,160
Nothing happens until we apply the filter.

75
00:04:13,160 --> 00:04:17,340
Watch what happens when we apply it.

76
00:04:17,340 --> 00:04:20,730
We still have the humidity attribute
there, but we have zero

77
00:04:20,730 --> 00:04:24,290
elements with high humidity. In fact,
the dataset has been reduced to only

78
00:04:24,290 --> 00:04:27,940
seven instances.
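
The same instance filtering can be sketched in plain Python. This is an illustration under stated assumptions, not Weka's code: `remove_with_values` is a hypothetical helper, its parameters only mimic the attribute index and value index set in the dialog, and the fourteen rows are the nominal weather data as commonly distributed, reproduced here for illustration.

```python
# Nominal weather data: outlook, temperature, humidity, windy, play.
WEATHER_CSV = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""

# Values of humidity in declaration order; 'high' is the first value.
HUMIDITY_VALUES = ["high", "normal"]


def remove_with_values(rows, attribute_index, nominal_indices, labels):
    """Drop rows whose attribute (1-based) takes one of the
    1-based value positions listed in `nominal_indices`."""
    unwanted = {labels[i - 1] for i in nominal_indices}
    return [r for r in rows if r[attribute_index - 1] not in unwanted]


rows = [line.split(",") for line in WEATHER_CSV.splitlines()]
# Third attribute (humidity), first value ('high').
reduced = remove_with_values(rows, 3, [1], HUMIDITY_VALUES)
print(len(rows), "->", len(reduced))  # 14 -> 7
```

The humidity column is still present in `reduced`; only the rows with the unwanted value are gone, which matches what the Preprocess panel shows after applying the filter.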

79
00:04:27,940 --> 00:04:31,430
Recall that when you do anything here,
you can save the results. So, we could save that

80
00:04:31,430 --> 00:04:33,390
reduced dataset if we wanted, but

81
00:04:33,390 --> 00:04:34,980
I don't want to do that now.

82
00:04:34,980 --> 00:04:38,480
I'm going to undo this.

83
00:04:40,100 --> 00:04:43,320
We removed the instances where humidity is high.

84
00:04:43,320 --> 00:04:45,680
We have to think about,

85
00:04:45,680 --> 00:04:49,600
when we're looking for filters, whether we want
a supervised or an unsupervised filter,

86
00:04:49,600 --> 00:04:52,640
whether we want an attribute
filter or instance filter,

87
00:04:52,640 --> 00:04:54,080
and then just kind of use your

88
00:04:54,080 --> 00:05:00,340
common sense to look down the list
of filters to see which one you want.
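
The two questions (supervised or unsupervised, attribute or instance) are reflected in how Weka names its filter classes, for example `weka.filters.unsupervised.attribute.Remove`. The sketch below reads those two properties off a class name; `describe_filter` is a hypothetical helper written for this note, not part of Weka, and it assumes the five-part naming pattern shown in the comment.

```python
def describe_filter(class_name):
    """Split a Weka filter class name into its category parts."""
    parts = class_name.split(".")
    # Assumed shape: weka.filters.<supervision>.<target>.<Name>
    if len(parts) == 5 and parts[:2] == ["weka", "filters"]:
        return {
            "supervision": parts[2],
            "target": parts[3],
            "name": parts[4],
        }
    # Top-level filters such as MultiFilter have no category parts.
    return {"supervision": None, "target": None, "name": parts[-1]}


for name in [
    "weka.filters.unsupervised.attribute.Remove",
    "weka.filters.unsupervised.instance.RemoveWithValues",
    "weka.filters.MultiFilter",
]:
    print(describe_filter(name))
```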

89
00:05:00,340 --> 00:05:03,450
Sometimes when you filter data
you get much better classification.

90
00:05:03,450 --> 00:05:04,970
Here's a really simple example.

91
00:05:04,970 --> 00:05:07,730
I'm going to open the glass dataset

92
00:05:07,730 --> 00:05:10,080
that we saw before.

93
00:05:10,080 --> 00:05:14,610
Here's the glass dataset. I'm
going to use J48, which we did before.

94
00:05:15,580 --> 00:05:18,650
It's a tree classifier.

95
00:05:21,370 --> 00:05:24,010
I'm going to start that,

96
00:05:24,010 --> 00:05:27,190
and I get an accuracy of 66.8%.

97
00:05:30,230 --> 00:05:33,360
Let's remove Fe,

98
00:05:35,150 --> 00:05:38,700
that is, Iron. Remove this attribute,

99
00:05:38,700 --> 00:05:40,570
and we get a smaller dataset.

100
00:05:40,570 --> 00:05:45,670
Go and run J48 again.

101
00:05:45,670 --> 00:05:47,730
Now we get an accuracy of

102
00:05:47,730 --> 00:05:51,480
67.3%. So, we've improved the
accuracy a little bit

103
00:05:51,480 --> 00:05:54,370
by removing that attribute.

104
00:05:54,370 --> 00:05:58,210
Sometimes the effect is pretty dramatic.
Actually, in this dataset, I'm going to remove

105
00:05:58,210 --> 00:06:01,430
everything except the refractive index

106
00:06:01,430 --> 00:06:08,430
and Magnesium (Mg). I'm going
to remove all of these attributes.

107
00:06:09,290 --> 00:06:13,140
I'm left with a much smaller
dataset with two attributes.

108
00:06:13,140 --> 00:06:19,020
Apply J48 again.

109
00:06:19,020 --> 00:06:21,130
Now, I've got an even better result,

110
00:06:21,130 --> 00:06:24,210
68.7% accuracy.

111
00:06:24,210 --> 00:06:28,480
I can visualize that tree,
of course, remember, by

112
00:06:28,480 --> 00:06:31,990
right-clicking here and visualizing the tree,

113
00:06:31,990 --> 00:06:35,790
and have a look and see what it means.
It is much easier to visualize the trees

114
00:06:35,790 --> 00:06:36,640
when they are smaller.

115
00:06:37,960 --> 00:06:39,310
This is a good one to

116
00:06:39,310 --> 00:06:44,580
look at and consider what the
structure of this decision is.

117
00:06:44,580 --> 00:06:46,300
That's it for now.

118
00:06:46,300 --> 00:06:48,510
We've looked at: filters in Weka;

119
00:06:48,510 --> 00:06:51,850
supervised versus unsupervised,
attribute versus instance filters;

120
00:06:53,230 --> 00:06:55,450
to find the right filter you need to look;

121
00:06:55,450 --> 00:06:58,840
they can be very powerful and
judiciously removing attributes can

122
00:06:58,840 --> 00:07:02,820
both improve performance and
increase comprehensibility.

123
00:07:02,820 --> 00:07:05,680
If you'd like some background reading
on this, go to the textbook and

124
00:07:05,680 --> 00:07:10,040
have a look at Section 11.2 on
Loading and filtering files.

125
00:07:10,040 --> 00:07:14,080
Then, go and do the activity
associated with this lesson.

126
00:07:14,080 --> 00:07:16,600
Bye for now!

