Hi! In the last class, we looked at a bare-bones algorithm for constructing decision trees. To get an industrial-strength decision-tree induction algorithm, we need to add some more complicated stuff, notably pruning. In this [lesson] we're going to talk about pruning decision trees. Here's a guy pruning a tree, and that's a good image to have in your mind when we're talking about decision trees. We're looking at those little twigs and little branches around the edge of the tree, seeing if they're worthwhile, and snipping them off if they're not contributing. That way we'll get a decision tree that might perform worse on the training data, but perhaps generalizes better to independent test data. Of course, that's what we want.

Here's the weather data again. I'm sorry to keep harking back to the weather data, but it's just a nice simple example that we all know now. I've added a new attribute here -- I call it an "ID code" attribute -- which is different for each instance. I've just given each instance an identification code: a, b, c, and so on. Let's think back to the last lesson about what's going to happen when we consider which attribute is best to split on at the root, the first decision. We're going to look at the information gain from each of our attributes separately, and we're going to gain a lot of information by choosing the ID code. Actually, if you split on the ID code, that tells you *everything* about the instance we're looking at. So that's going to be a maximal amount of information gain, and clearly we're going to split on that attribute at the root node of the decision tree. But that's not going to generalize very well -- it's not going to generalize at all -- to new weather instances.

To get around this problem, having constructed a decision tree, decision tree algorithms then automatically prune it back. You don't see any of this; it just happens when you start the algorithm in Weka. How do we prune? There are some simple techniques for pruning, and some more complicated techniques. A very simple technique is not to continue splitting if the nodes get very small. I said in the last lesson that we're going to keep splitting until each node has just one class associated with it. Well, perhaps that's not such a good idea -- if we have a very small node with a couple of instances, it's probably not worth splitting that node. That's actually a parameter in J48. I've got Weka going here. I'm going to choose J48 and look at the parameters. There's a parameter called "minNumObj". If I mouse over that parameter, it says "The minimum number of instances per leaf". The default value for that is 2.

The second thing we do is to build a full tree and then work back from the leaves. It turns out to be better to build a full tree and prune it back rather than trying to do forward pruning as you're building the tree. We apply a statistical test at each stage. That's the "confidenceFactor" parameter; it's here, and the default value is 0.25 -- "the confidence factor used for pruning (smaller values incur more pruning)." Sometimes it's good to prune an interior node and raise the subtree beneath that interior node up one level. That's called "subtreeRaising"; it's this parameter here, and we can switch it on or off: "whether to consider the subtree raising operation during pruning." Subtree raising increases the complexity of the algorithm, so the algorithm would run faster on a large problem if you turned subtree raising off. I'm not going to talk about the details of these methods.
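By the way, if you ever want to set these same options outside the Explorer, here's a minimal sketch using Weka's Java API. The setters below correspond to the parameters we just looked at, and the comments note the equivalent command-line flags; the dataset path is just an illustrative assumption, so point it at your own copy of the weather data.

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48PruningOptions {
    public static void main(String[] args) throws Exception {
        // The path is an assumption -- point it at your own copy of the weather data.
        Instances data = DataSource.read("data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        // These setters mirror the parameters shown in the Explorer:
        tree.setMinNumObj(2);            // minNumObj: minimum instances per leaf (-M 2)
        tree.setConfidenceFactor(0.25f); // confidenceFactor: smaller values incur more pruning (-C 0.25)
        tree.setSubtreeRaising(true);    // subtreeRaising: consider subtree raising (-S switches it off)
        tree.setUnpruned(false);         // unpruned: false means the tree is pruned (-U switches pruning off)

        tree.buildClassifier(data);
        System.out.println(tree);        // textual description of the pruned tree
    }
}
```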
Pruning is a messy and complicated subject, and it's not particularly illuminating. Actually, I don't really recommend playing around with these parameters: the default values in J48 tend to do a pretty good job. I guess it's become apparent to you by now that the need to prune is really a result of the original unpruned tree overfitting the training dataset. This is another instance of overfitting. Sometimes simplifying a decision tree gives better results -- not just a smaller, more manageable tree, but actually better results.

I'm going to open the diabetes data. I'm going to choose J48, and I'm just going to run it with the default parameters. I get an accuracy of 73.8%, evaluated using cross-validation. The size of the tree is 20 leaves and a total of 39 nodes -- that's 19 interior nodes and 20 leaf nodes. Let's switch off pruning. J48 prunes by default, so we're going to switch off pruning: we've got an "unpruned" option here, which is "false" -- which means it's pruning. I'm going to change that to "true" -- which means it's not pruning any more -- and run it again. Now we get a slightly worse result, 72.7%, probably not significantly worse, and a slightly larger tree -- 22 leaves and 43 nodes. That's a double whammy, really. We've got a bigger tree, which is harder to understand, and we've got a slightly worse prediction result. We would prefer the pruned tree, I think, on this dataset.

I'm going to show you a more extreme example with the "breast cancer" data. I don't think we've looked at the breast cancer data before. The class is "no-recurrence-events" versus "recurrence-events", and there are attributes like age, menopause, tumor size, and so on. I'm going to classify this with J48 in the default configuration. I need to switch pruning back on -- that is, make "unpruned" false -- and then run it. I get an accuracy of 75.5%, and a fairly small tree with 4 leaves and 2 internal nodes. I can look at that tree here, or I can visualize the tree. We get this nice, simple little decision structure, which is quite comprehensible and performs pretty well, with 75% accuracy. Now I'm going to switch off pruning -- make "unpruned" true -- and run it again. First of all, I get a much worse result, 69.6% -- probably significantly worse than the 75.5% I had before. More importantly, I get a huge tree, with 152 leaves and 179 total nodes. It's massive. If I try to visualize that, I probably won't be able to see very much. I can try to fit it to my screen, and it's still impossible to see what's going on. In fact, if I look at the textual description of the tree, it's just extremely complicated. That's a bad thing. Here, an unpruned tree is a very bad idea: we get a huge tree which does quite a bit worse than a much simpler decision structure. J48 does pruning by default and, in general, you should let it do pruning according to the default parameters. That would be my recommendation.

We've talked about J48 -- in other words, C4.5. Remember, in Lesson 1.4 we talked about the progression from C4.5, by Ross Quinlan -- here's a picture of Ross, an Australian computer scientist, at the bottom of the screen -- to J48, the Java implementation that's essentially equivalent to C4.5. It's a very popular method. It's simple and easy to use. Decision trees are very attractive because you can look at them and see what the structure of the decision is, and see what's important about your data.
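If you'd like to reproduce that pruned-versus-unpruned comparison programmatically, here's a minimal sketch along the same lines: it cross-validates J48 on the breast cancer data with pruning on and then off, and prints the accuracy and the tree size. The file path and the random seed are assumptions, so your exact numbers may differ a little from the ones I got in the Explorer.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PrunedVsUnpruned {
    public static void main(String[] args) throws Exception {
        // Path is illustrative; breast-cancer.arff ships with Weka's sample datasets.
        Instances data = DataSource.read("data/breast-cancer.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (boolean unpruned : new boolean[] {false, true}) {
            J48 tree = new J48();
            tree.setUnpruned(unpruned);  // same switch as the "unpruned" option in the Explorer

            // 10-fold cross-validation for the accuracy estimate ...
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));

            // ... and a model built on all the data to measure the tree itself.
            tree.buildClassifier(data);

            System.out.printf("unpruned=%b  accuracy=%.1f%%  leaves=%.0f  nodes=%.0f%n",
                    unpruned, eval.pctCorrect(),
                    tree.measureNumLeaves(), tree.measureTreeSize());
        }
    }
}
```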
There are many different pruning methods, and their main effect is to change the size of the tree. They have a small effect on the accuracy, and [sometimes] make the accuracy worse. They often have a huge effect on the size of the tree, as we just saw with the breast cancer data. Pruning is actually a general technique to guard against overfitting, and it can be applied to structures other than trees -- like decision rules.

There's a lot more we could say about decision trees. For example, we've been talking about univariate decision trees -- that is, ones that have a single test at each node. You can imagine a multivariate tree, where there is a compound test: the test at the node might be "if this attribute is that AND that attribute is something else". You can imagine more complex decision trees produced by more complex decision tree algorithms. In general, C4.5/J48 is a popular and useful workhorse algorithm for data mining. You can read a lot more about decision trees if you go to the course text. Section 6.1 tells you about pruning and gives you the mathematical details of the pruning methods that I've just sketched here. It's time for you to do the activity, and I'll see you in the next lesson. Bye for now!