Hi! You probably learned a bit about flowers if you did the activity associated with the last lesson. Now, we're going to actually build a classifier: Lesson 1.4 Building a classifier. We're going to use a system called J48 -- I'll tell you why it's called J48 in a minute -- to analyze the glass dataset that we looked at in the last lesson. I've got the glass dataset open here. I'm going to go to the Classify panel. I choose a classifier here. There are different kinds of classifiers. Weka has Bayes classifiers, functions classifiers, lazy classifiers, meta classifiers, and so on. We're going to use a tree classifier: J48 is a tree classifier. I'm going to open "trees" and click J48. Here is the J48 classifier. Let's run it. If we just press "Start", we've got the dataset, we've got the classifier, and lo and behold, it's done it. It's a bit of an anticlimax, really. Weka makes things very easy for you to do. The problem is understanding what it is that you have done. Let's take a look. Here is some information about the dataset, the glass dataset: the number of instances and attributes. Then it's printed out a representation of a tree here. We'll look at these trees later on, but just note that this tree has 30 leaves and 59 nodes altogether. The overall accuracy is 66.8%. So it's done pretty well. Down at the bottom, we've got a confusion matrix. Remember there were about seven different kinds of glass. Class 'a' is building window, made of float glass. You can see that 50 of these have been classified as 'a', which is correctly classified. 15 of them have been classified as 'b', which is building window, non-float glass, so those are errors, and 3 have been classified as 'c', and so on. This is a confusion matrix. Most of the weight is down the main diagonal, which we like to see because that indicates correct classifications. Everything off the main diagonal indicates a misclassification. That's the confusion matrix. Let's investigate this a bit further.
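To make the link between the confusion matrix and the accuracy figure concrete, here is a minimal sketch in Python. It is not Weka code, and the 3-class matrix is made up for illustration (it is not the glass result); the function name `accuracy` is likewise just an assumption for this example. Rows are the actual classes and columns are the predicted classes, so the correct classifications sit on the main diagonal.

```python
def accuracy(confusion):
    """Fraction of instances that lie on the main diagonal of a square matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Hypothetical 3-class confusion matrix, NOT taken from the glass dataset.
# Row = actual class, column = predicted class.
example = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 3, 7],
]

# 22 of the 30 instances are on the diagonal.
print(f"{accuracy(example):.1%}")  # prints 73.3%
```

The same calculation applied to the full glass matrix is what sits behind the accuracy figure that Weka reports.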
We're going to open a configuration panel for J48. Remember, I chose it by clicking the Choose button. Now, if I click it here, I get a configuration panel. I clicked J48 in this menu, and I get a configuration panel, which gives a bunch of parameters. I'm not really going to talk about these parameters. Let's just look at one of them, the "unpruned" parameter, which by default is "false". What we've just done is to build a pruned tree, because "unpruned" is "false". We can change this to make it "true", and build an unpruned tree. We've changed the configuration. We can run it again. It just ran again, and now we have a potentially different result. Let's just have a look. We have 67% correct classification. What did we have before? These are the runs. This is the previous run, and there we had 66.8%. Now, in this run that we've just done with the unpruned tree, we've got 67% accuracy, and the tree is the same size. That's one option. I'm just going to look at another option, and then we'll look at some trees. I'm going to click the configuration panel again, and I'm going to change the "minNumObj" parameter. What is that? It's the minimum number of instances per leaf. I'm going to change that from 2 up to 15, to have larger leaves. These are the leaves of the tree here, and these numbers in brackets are the number of instances that get to the leaf. When there are two numbers, as here, it means that one incorrectly classified instance got to this leaf and five correctly classified instances got there. You can see that all of these leaves are pretty small, with sometimes just two or three -- here is one with 31 instances. Now we've constrained this number: when the tree is generated, it is always going to be 15 or more. Let's run it again. Now we've got a worse result, 61% correct classification, but a much smaller tree, with only eight leaves. Now, we can visualize this tree.
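The numbers in brackets at the leaves can be pulled apart with a few lines of Python. This is only a sketch: it assumes the text form of a leaf ends in `(N)` or `(N/E)`, where N is the number of instances reaching the leaf and E is the number of those that are misclassified, which matches the "one incorrect, five correct" leaf described above. The sample lines, attribute names, and the helper `leaf_counts` are all made up for illustration.

```python
import re

# Matches a trailing "(N)" or "(N/E)" at the end of a line of tree output.
LEAF = re.compile(r"\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def leaf_counts(line):
    """Return (reached, misclassified, correct) for a leaf line, else None."""
    match = LEAF.search(line)
    if match is None:
        return None  # an internal node, not a leaf
    reached = float(match.group(1))
    wrong = float(match.group(2)) if match.group(2) else 0.0
    return reached, wrong, reached - wrong


# Hypothetical lines in the assumed format (not copied from a real run).
lines = [
    "attrA <= 1.0",
    "|   attrB <= 2.0: classX (6.0/1.0)",
    "|   attrB > 2.0: classY (31.0)",
]

for line in lines:
    print(leaf_counts(line))
```

With a minimum of 15 instances per leaf, the first number in every bracket would have to be at least 15, so a leaf like the hypothetical `(6.0/1.0)` one could no longer appear.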
If I right click on the line -- these are the lines that describe each of the runs that we've done, and this is the third run -- if I right click on that, I get a little menu, and I can visualize the tree. There it is. If I right click on empty space, I can fit this to the screen. This is the decision tree. This says first look at the barium (Ba) content. If it's large, then it must be headlamps. If it's small, then look at magnesium (Mg). If that's small, then let's look at potassium (K), and if that's small, then we've got tableware. That sounds like a pretty good thing to me; I don't want too much potassium in my tableware. This is a visualization of the tree, and it's the same tree that you can see by looking here. This is a different representation of the same tree. I'll just show you one more thing about this configuration panel, the "More" button. This gives you more information about the classifier, about J48. It's always useful to look at that to see where these classifiers have come from. In this case, let me explain why it's called J48. It's based on a famous system called C4.5, which was described in a book. The book is referenced here. In fact, I think I've got it on my shelf. This book, "C4.5: Programs for Machine Learning", is by an Australian computer scientist called Ross Quinlan. He started out with a system called ID3 -- I think that might have been in his PhD thesis -- and this kind of morphed through various versions into C4.5. It became quite famous; the book came out, and so on. He continued to work on this system. It went up to C4.8, and then he went commercial. Up until then, these were all open source systems. When we built Weka, we took the latest version of C4.5, which was C4.8, and we rewrote it. Weka's written in Java, so we called it J48. Maybe it's not a very good name, but that's the name that stuck. There's a little bit of history for you. We've talked about classifiers in Weka.
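Going back to the tree we visualized: a decision tree is really just a set of nested tests, which a short Python sketch can show. Only the path read out in the lesson is encoded here (barium, then magnesium, then potassium). The threshold values are placeholders, because the real split points appear in the Weka output and are not stated in the lesson, and the branches that were not described return `None` rather than a guessed class. The function name `classify` is an assumption for this example.

```python
# Placeholder thresholds for illustration only; the real split points
# come from the Weka output and are not given in the lesson.
BA_THRESHOLD = 1.0
MG_THRESHOLD = 1.0
K_THRESHOLD = 1.0


def classify(ba, mg, k):
    """Follow the one path through the tree that the lesson describes."""
    if ba > BA_THRESHOLD:
        return "headlamps"  # large barium content
    if mg > MG_THRESHOLD:
        return None  # branch not described in the lesson
    if k > K_THRESHOLD:
        return None  # branch not described in the lesson
    return "tableware"  # small barium, magnesium and potassium


print(classify(ba=2.0, mg=0.0, k=0.0))
print(classify(ba=0.0, mg=0.5, k=0.5))
```

Each test corresponds to one internal node of the tree, and each `return` with a class name corresponds to a leaf.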
I've shown you where you find the classifiers. We classified the glass dataset. We looked at how to interpret the output from J48, in particular the confusion matrix. We looked at the configuration panel for J48. We looked at a couple of options: pruned versus unpruned trees, and the option to avoid small leaves. I told you how J48 corresponds to the machine learning system that most people know as C4.5. C4.5 and C4.8 were really pretty similar, so we just talk about J48 as if it's synonymous with C4.5. You can read about this in the book -- Section 11.1 about Building a Decision Tree and Examining the Output. Now, off you go, and do the activity associated with this lesson. See you again soon!