Hello! In the last lesson, we looked at using a classifier in Weka, J48. In this lesson, we're going to look at another of Weka's principal features: filters. 

One of the main messages of this course is that it's really important when you're data mining to get close to your data, and to think about preprocessing it, or filtering it in some way, before applying a classifier. 

I'm going to start by using a filter to remove an attribute from the weather data. Let me start up the Weka Explorer and open the weather data. I'm going to remove the humidity attribute: that's attribute number 3. I can look at filters; just as we chose a classifier using the Choose button on the Classify panel, we choose filters by using the Choose button here.

There are a lot of different filters. AllFilter and MultiFilter are ways of combining filters. We have supervised and unsupervised filters. Supervised filters are ones that use a class value for their operation. They are less common than unsupervised filters, which don't use the class value. There are attribute filters and instance filters. We want to remove an attribute, so we're looking for an attribute filter. There are so many filters in Weka that you just have to learn to look around and find what you want.

I'm going to look for a filter that removes an attribute. Here we go, Remove. Earlier, when we configured the J48 classifier, we clicked here. I'm going to click here again, and we can configure the filter. This is "A filter that removes a range of attributes from the dataset". I can specify a range of attributes here. I just want to remove one: attribute number 3. I could invert the selection, which would remove all the other attributes and keep only 3, but I'm going to leave it as it is. Click OK, and watch humidity go when we apply the filter. Nothing happens until you apply the filter. I've just applied it, and here we are, the humidity attribute has been removed. Luckily I can undo the effect of that and put it back by pressing the Undo button. That's how to remove an attribute.
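To make the effect of this filter concrete, here is a minimal Python sketch of what removing an attribute by index does to a table. It is an illustration only, not Weka's code: the function name `remove_attributes` and its `invert` parameter are hypothetical, and the three rows are assumed to be the first instances of the nominal weather data.

```python
# Illustrative sketch only: mimics the effect of an attribute-removal filter
# on a small table. It is not Weka's implementation.

def remove_attributes(header, rows, indices, invert=False):
    """Drop the attributes at the given 1-based indices.

    With invert=True, keep only those attributes and drop all the others.
    """
    selected = set(indices)
    keep = [i for i in range(len(header))
            if ((i + 1) in selected) == invert]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


# Assumed sample: the first three instances of the nominal weather data.
header = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [
    ["sunny", "hot", "high", "FALSE", "no"],
    ["sunny", "hot", "high", "TRUE", "no"],
    ["overcast", "hot", "high", "FALSE", "yes"],
]

# Remove attribute 3 (humidity).
without_humidity, reduced_rows = remove_attributes(header, rows, [3])

# Inverted selection: keep attribute 3 and remove everything else.
only_humidity, humidity_rows = remove_attributes(header, rows, [3], invert=True)

print(without_humidity)
print(only_humidity)
```

Like the filter in the Explorer, the sketch returns a new table and leaves the original untouched, which is why undoing the change is just a matter of going back to the original data.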

Actually, the bad news is there is a much easier way to remove an attribute: you don't need to use a filter at all. If you just want to remove an attribute, you can select it here and click the "Remove" button at the bottom. It does the same job. Sorry about that.

But filters are really useful, and can do much more complex things than that. For example, instead of removing an attribute, let's remove all instances where humidity has the value "high". That is, attribute number 3 has its first value. That's going to remove 7 instances from the dataset. There are 14 instances altogether, so we'll be left with a reduced dataset of 7 instances. Let's look for a filter to do that. We want to remove instances, so it's going to be an instance filter. I just have to look down here and see if there is anything suitable.

How about RemoveWithValues? -- the RemoveWithValues filter. I can click that to configure it, and I can click More to see what it does. Here it says it "Filters instances according to the value of an attribute", which is exactly what we want. We're going to set the attributeIndex; we want the third attribute (humidity), and the first value. We can remove a number of different values; we'll just remove the first value. Now we've configured that. Nothing happens until we apply the filter. Watch what happens when we apply it.

We still have the humidity attribute there, but we have zero elements with high humidity. In fact, the dataset has been reduced to only 7 instances. Recall that when you do anything here, you can save the results. So, we could save that reduced dataset if we wanted, but I don't want to do that now. I'm going to undo this.
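The same step can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not Weka's implementation: the function `remove_with_values` and its parameters are hypothetical names, and the 14 rows are assumed to be the standard nominal weather data, with humidity's values declared in the order "high", "normal".

```python
# Illustrative sketch only: mimics removing instances by the value of a
# nominal attribute. It is not Weka's implementation.

# Assumed copy of the nominal weather data:
# outlook, temperature, humidity, windy, play
weather = [
    ["sunny", "hot", "high", "FALSE", "no"],
    ["sunny", "hot", "high", "TRUE", "no"],
    ["overcast", "hot", "high", "FALSE", "yes"],
    ["rainy", "mild", "high", "FALSE", "yes"],
    ["rainy", "cool", "normal", "FALSE", "yes"],
    ["rainy", "cool", "normal", "TRUE", "no"],
    ["overcast", "cool", "normal", "TRUE", "yes"],
    ["sunny", "mild", "high", "FALSE", "no"],
    ["sunny", "cool", "normal", "FALSE", "yes"],
    ["rainy", "mild", "normal", "FALSE", "yes"],
    ["sunny", "mild", "normal", "TRUE", "yes"],
    ["overcast", "mild", "high", "TRUE", "yes"],
    ["overcast", "hot", "normal", "FALSE", "yes"],
    ["rainy", "mild", "high", "TRUE", "no"],
]

# Assumed declaration order of humidity's values: index 1 is "high".
HUMIDITY_LABELS = ["high", "normal"]


def remove_with_values(rows, attribute_index, labels, value_indices):
    """Drop rows whose attribute (1-based index) takes one of the listed
    values, where values are given as 1-based positions in `labels`."""
    unwanted = {labels[i - 1] for i in value_indices}
    return [row for row in rows if row[attribute_index - 1] not in unwanted]


# Third attribute (humidity), first value ("high").
reduced = remove_with_values(weather, 3, HUMIDITY_LABELS, [1])

print(len(weather), "instances before,", len(reduced), "after")
```

Note that the humidity column is still present in every remaining row; only the instances with the unwanted value have gone.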

We removed the instances where humidity is high. When we're looking for a filter, we have to think about whether we want a supervised or an unsupervised filter, and whether we want an attribute filter or an instance filter, and then just use common sense to look down the list of filters to see which one we want.

Sometimes when you filter data you get much better classification. Here's a really simple example. I'm going to open the glass dataset that we saw before. Here's the glass dataset. I'm going to use J48, which we did before. It's a tree classifier. I'm going to start that, and I get an accuracy of 66.8%. Let's remove Fe, that is, Iron. Remove this attribute, and we get a smaller dataset. Go and run J48 again. Now we get an accuracy of 67.3%. So we've improved the accuracy a little bit by removing that attribute.

Sometimes the effect is pretty dramatic. Actually, in this dataset, I'm going to remove everything except the refractive index and Magnesium (Mg). I'm going to remove all of these attributes, and am left with a much smaller dataset with two attributes. Apply J48 again. Now, I've got an even better result, 68.7% accuracy.

I can visualize that tree, of course -- remember? -- by right-clicking here and visualizing the tree, and have a look and see what it means. It is much easier to visualize trees when they are smaller. This is a good one to look at and consider what the structure of this decision is.

That's it for now. We've looked at filters in Weka: supervised versus unsupervised, attribute versus instance filters. To find the right filter you need to look through the list. Filters can be very powerful, and judiciously removing attributes can both improve performance and increase comprehensibility.

If you'd like some background reading on this, go to the textbook and have a look at Section 11.2 on Loading and Filtering Files. Then go and do the activity associated with this lesson. Bye for now!