Hi! Welcome back to Data Mining with Weka. This is Class 2. In the first class, we downloaded Weka and looked around the Explorer and a few datasets; we used a classifier, the J48 classifier; we used a filter to remove attributes and to remove some instances; we visualized some data -- we visualized classification errors on a dataset; and along the way we looked at a few datasets: the weather data, in both the nominal and numeric versions, the glass data, and the iris dataset. This class is all about evaluation. In Lesson 1.4, we built a classifier using J48. In this first lesson of the second class, we're going to see what it's like to actually be a classifier ourselves. Then, in subsequent lessons in this class, we'll look more at evaluation: training and testing, baseline accuracy, and cross-validation.

First of all, we're going to see what it's like to be a classifier. We're going to construct a decision tree ourselves, interactively. I'm going to open up Weka here -- the Weka Explorer -- and load the segment-challenge dataset: segment-challenge.arff, that's the one I want. Let's look at this dataset, starting with the class. The class values are brickface, sky, foliage, cement, window, path, and grass, so it looks like this is an image analysis dataset. When we look at the attributes, we see things like the centroid of columns and rows, pixel counts, line densities, means of intensities, and various other things -- saturation, hue. And the class, as I said before, is different kinds of texture: brickface, sky, foliage, and so on. That's the segment-challenge dataset.

Now I'm going to select the User Classifier. The User Classifier is a tree classifier; we'll see what it does in just a minute. Before I start -- this is really quite important -- I'm going to use a supplied test set. I'm going to set the test set, which is used to evaluate the classifier, to be segment-test. The training set is segment-challenge, the test set is segment-test. Now we're all set, and I'm going to start the classifier.

What we see is a window with two panels: the Tree Visualizer and the Data Visualizer. Let's start with the Data Visualizer. We looked at visualization in the last class -- how you can select different attributes for x and y. I'm going to plot region-centroid-row against intensity-mean; that's the plot I get. Now we're going to select some instances. I'm going to select Rectangle. If I draw out a rectangle with my mouse here, I get a rectangle that's pretty well pure reds, as far as I can see. I'm going to "submit" this rectangle. You can see that that area has gone and the picture has been rescaled. I'm building up a tree here: if I look at the Tree Visualizer, I've got a tree. We've split on these two attributes, region-centroid-row and intensity-mean. Here we've got sky -- these are all sky instances -- and here we've got a mixture of brickface, foliage, cement, window, path, and grass.

We're going to keep building up this tree: what I want to do is take this node and refine it a bit more. Here is the Data Visualizer again. I'm going to select a rectangle containing these items here and submit that; they've gone from this picture. You can see that I've created another split on region-centroid-row and intensity-mean, and here, this is almost all "path" -- 233 "path" instances -- and then a mixture here. That's a pure node we've got over there, and this is almost a pure node.
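(An aside before we carry on refining the tree: if you'd like to check the attributes and class values of this dataset from code rather than the Explorer, a minimal sketch with the Weka Java API looks like the following. The file name and the assumption that the class is the last attribute are mine, though that is how the segment data is laid out.)

import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectSegmentChallenge {
    public static void main(String[] args) throws Exception {
        // Load segment-challenge.arff (path assumed to be the working directory)
        Instances train = DataSource.read("segment-challenge.arff");
        // The class is the last attribute in this dataset
        train.setClassIndex(train.numAttributes() - 1);

        // Attribute names, e.g. region-centroid-row, intensity-mean, hue-mean, ...
        for (int i = 0; i < train.numAttributes(); i++) {
            System.out.println(train.attribute(i).name());
        }

        // Class values: brickface, sky, foliage, cement, window, path, grass
        Attribute cls = train.classAttribute();
        for (int i = 0; i < cls.numValues(); i++) {
            System.out.println(cls.value(i));
        }
    }
}

Running it simply prints the attribute names and the seven class values you saw in the Explorer.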
That almost-pure node is the one I want to work on. I'm going to cover some of those instances now. Let's take this lot here and submit that. Then I'll take this lot here and submit that. Maybe I'll take those ones there and submit that. This little cluster here seems pretty uniform: submit that. I haven't actually changed the axes, but, of course, at any time I could change these axes to better separate the remaining classes. I could mess around with these; actually, a quick way to do it is to click on these bars -- left-click for x and right-click for y -- so I can quickly explore different pairs of axes to see if I can get a better split.

Here's the tree I've created. I'm going to fit it to the screen; it looks like this. You can see that we have successively elaborated down this branch here. When I finish with this, I can "accept" the tree. Actually, before I do that, let me just show you that we were selecting rectangles here, but there are other things I can select: a polygon or a polyline. If I don't want to use rectangles, I can use polygons or polylines; if you like, you can experiment with those to select different shaped areas. There's an area I've got selected -- I just can't quite finish it off. Alright, I right-clicked to finish it off. I could submit that: I'm not confined to rectangles; I can use different shapes. But I'm not going to do that.

I'm satisfied with this tree for the moment, so I'm going to accept the tree. Once I do this, there's no going back, so you want to be sure. If I accept the tree -- "Are you sure?" -- yes. Here I've got a confusion matrix, and I can look at the errors. My tree classifies 78% of the instances correctly -- nearly 79% -- and 21% incorrectly. That's not too bad, especially considering how quickly I built that tree.

It's over to you now. I'd like you to play around and see if you can do better than this by spending a little bit longer on getting a nice tree. I'd like you to reflect on a couple of things. First, think about the strategy you're using to build this tree. Basically, we're covering different regions of the instance space, trying to get pure regions in order to create pure branches. This is a kind of bottom-up covering strategy: we cover this area, and this area, and this area. That's not how J48 works. When it builds trees, it tries to make a judicious split through the whole dataset. At the very top level, it splits the entire dataset into two in a way that doesn't necessarily separate out particular classes, but makes things easier when it starts working on each half of the dataset, splitting further in a top-down manner to try to produce an optimal tree. It will produce trees much better than the one I just produced with the User Classifier.

I'd also like you to reflect on what it is we're trying to do here. Given enough time, you could produce a "perfect" tree for this dataset. But don't forget that the dataset we've loaded is the training dataset. We're going to evaluate this tree on a different dataset, the test dataset, which hopefully comes from the same source but is not identical to the training dataset. We're not trying to fit the training dataset precisely; we're trying to fit it in a way that captures the kinds of patterns in the data -- something that will perform well on the test data. That highlights the importance of evaluation in machine learning, and that's what this class is going to be about: different ways of evaluating your classifier. That's it.
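(For comparison, the same train-on-segment-challenge, test-on-segment-test experiment can be run programmatically with J48. This is a minimal sketch using the Weka Java API with default options; the file paths are my assumption, and the exact percentage you get will depend on your Weka version, but it should comfortably beat a quickly hand-built tree.)

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SegmentJ48 {
    public static void main(String[] args) throws Exception {
        // Training set and supplied test set (file names assumed)
        Instances train = DataSource.read("segment-challenge.arff");
        Instances test = DataSource.read("segment-test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Build a J48 tree on the training data only
        J48 tree = new J48();
        tree.buildClassifier(train);

        // Evaluate on the supplied test set, like "Supplied test set" in the Explorer
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());  // percent correct / incorrect
        System.out.println(eval.toMatrixString());   // confusion matrix
    }
}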
There's some information in the course text about the User Classifier, which you can read if you like. Please go on and do the activity associated with this lesson and produce your own classifier. Hopefully, you'll be able to do much better than I did, given 5-10 minutes. Good luck!