Hi! Before we go on to talk about some more simple classifier methods, we need to talk about overfitting. Any machine learning method may "overfit" the training data. That's when it produces a classifier that fits the training data too tightly and doesn't generalize well to independent test data. Remember the user classifier that you built at the beginning of Class 2, when you built a classifier yourself? Imagine tediously putting a tiny circle around every single training data point. You could build a classifier (very laboriously) that would be 100% correct on the training data, but it probably wouldn't generalize very well to independent test data. That's overfitting. It's a general problem.

We're going to illustrate it with OneR. We're going to look at the numeric version of the weather problem, where temperature and humidity are numbers, not nominal values. If you think about how OneR works, when it comes to making a rule on the "temperature" attribute, it's going to make a complex rule that branches perhaps 14 different ways, one for each of the 14 instances in the dataset. Each branch is going to have zero errors; it gets its one instance exactly right. So it's going to look as though branching on "temperature" gives a perfect rule, with a total error count of zero. In fact, OneR has a parameter that limits the complexity of rules. I'm not going to talk about how it works. It's pretty simple, but it's just a bit distracting, and it's not very important how it works. The point is that the parameter allows you to limit the complexity of the rules that OneR produces.

Let's open the numeric weather data. We go to OneR and choose it. There's OneR. Let's just create a rule. Here the rule is based on the "outlook" attribute. This is exactly what happened in the last lesson with the nominal version of the weather data. Let's just remove the "outlook" attribute and try again. Now let's see what happens when we classify with OneR. Now it branches on "humidity": if humidity is less than 82.5, it's a "yes" day; if it's greater than 82.5, it's a "no" day; and that gets 10 out of 14 instances correct. So far so good. That's using the default setting of OneR's parameter that controls the complexity of the rules it generates.

We can go and look at OneR. Remember, you can configure a classifier by clicking on it. We see that there's a parameter called minBucketSize, and it's set to 6 by default, which is a good compromise value. I'm going to change that value to 1 and see what happens. Run OneR again, and now I get a different kind of rule. It's branching many different ways on the "temperature" attribute. This rule is overfitted to the dataset. It's a very accurate rule on the training data, but it won't generalize well to independent test data.

Now let's see what happens with a more realistic dataset. I'll open diabetes, which is a numeric dataset. All the attributes are numeric, and the class is either tested_negative or tested_positive. Let's run ZeroR to get a baseline figure for this dataset. Here I get 65% for the baseline, so we really ought to be able to do better than that. Now let's run OneR with default parameter settings -- that is, a value of 6 for minBucketSize, the parameter that controls rule complexity. We get 71.5%. That's pretty good. We're evaluating using cross-validation, and OneR outperforms the baseline accuracy by quite a bit -- 71.5% versus 65%. If we look at the rule, it branches on "plas", the plasma glucose concentration. Depending on which of these regions the plasma glucose concentration falls into, we predict a negative or a positive outcome. That seems like quite a sensible rule.
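Just for reference, here's a minimal sketch of how you might reproduce this experiment with Weka's Java API instead of the Explorer. The file path is illustrative -- point it at the diabetes.arff file in your own Weka data directory. It also shows the setMinBucketSize() call that corresponds to the parameter change we're about to make.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneROverfittingDemo {
    public static void main(String[] args) throws Exception {
        // Load the diabetes data; the path is illustrative -- use your own copy.
        Instances data = DataSource.read("data/diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute

        // ZeroR baseline, evaluated by 10-fold cross-validation.
        Evaluation baseline = new Evaluation(data);
        baseline.crossValidateModel(new ZeroR(), data, 10, new Random(1));
        System.out.printf("ZeroR baseline:          %.1f%%%n", baseline.pctCorrect());

        // OneR with its default minBucketSize of 6.
        OneR oneR = new OneR();
        Evaluation cv = new Evaluation(data);
        cv.crossValidateModel(oneR, data, 10, new Random(1));
        System.out.printf("OneR (minBucketSize=6):  %.1f%%%n", cv.pctCorrect());

        // The overfitting experiment: allow rule segments ("buckets") of size 1.
        OneR overfit = new OneR();
        overfit.setMinBucketSize(1);
        Evaluation cvOverfit = new Evaluation(data);
        cvOverfit.crossValidateModel(overfit, data, 10, new Random(1));
        System.out.printf("OneR (minBucketSize=1):  %.1f%%%n", cvOverfit.pctCorrect());
    }
}
```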
Now let's change OneR's parameter to make it overfit. We'll configure OneR, find the minBucketSize parameter, and change it to 1. When we run OneR again, we get 57% accuracy, quite a bit lower than the ZeroR baseline of 65%. And if you look at the rule -- here it is -- it's testing a different attribute, "pedi", which, if you look at the comments in the ARFF file, happens to be the diabetes pedigree function, whatever that is. You can see that this attribute has a lot of different values, and it looks like we're branching on pretty well every single one. That gives us lousy performance when evaluated by cross-validation, which is what we're doing now. But if you were to evaluate it on the training set, you would expect to see very good performance. Yes, here we get 87.5% accuracy on the training set, which is very good for this dataset. Of course, that figure is completely misleading: the rule is strongly overfitted to the training dataset and doesn't generalize well to independent test sets.

That's a good example of overfitting. Overfitting is a general phenomenon that plagues all machine learning methods. We've illustrated it by playing around with a parameter of the OneR method, but it happens with all machine learning methods. It's one reason why you should never evaluate on the training set.

Overfitting can also occur in more general contexts. Suppose you've got a dataset and you try a very large number of machine learning methods -- say a million different methods -- and choose the best one for your dataset using cross-validation. Because you've tried so many methods, you can't expect to get the same performance on new test data: the one you've ended up with is going to be overfitted to the dataset you're using. It's not sufficient just to use cross-validation and believe the results. In this case, you might divide the data three ways, into a training set, a test set, and a validation set. Choose the method using the training and test sets: by all means try your million machine learning methods and pick the best on the training and test sets, or the best using cross-validation on the training set. But leave the separate validation set aside until the end, once you've chosen your machine learning method, and evaluate it on that to get a much more realistic assessment of how it would perform on independent test data. (There's a sketch of this three-way protocol at the end of this lesson.) Overfitting is a really big problem in machine learning.

You can read a bit more about OneR and what the minBucketSize parameter actually does in Section 4.1 of the course text. Off you go now and do the activity associated with this class. Bye for now.
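For reference, here is a minimal sketch of the three-way protocol described above, using Weka's Java API. The candidate list, the split proportions, and the file path are illustrative assumptions -- in practice you might compare far more methods -- but the key point stands: the held-out validation set is never touched until the very end.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ThreeWaySplitDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/diabetes.arff");   // illustrative path
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        // Hold out roughly a third of the data as the final validation set;
        // the rest ("development" data) is used for choosing among methods.
        int valSize = data.numInstances() / 3;
        Instances validation  = new Instances(data, 0, valSize);
        Instances development = new Instances(data, valSize, data.numInstances() - valSize);

        // Candidate methods -- just two here as a stand-in for the "million methods".
        Classifier[] candidates = { new ZeroR(), new OneR() };

        Classifier best = null;
        double bestAcc = -1;
        for (Classifier c : candidates) {
            // Choose the best candidate by cross-validation on the development data only.
            Evaluation eval = new Evaluation(development);
            eval.crossValidateModel(c, development, 10, new Random(1));
            if (eval.pctCorrect() > bestAcc) {
                bestAcc = eval.pctCorrect();
                best = c;
            }
        }

        // Only now touch the validation set: train the chosen method on all
        // development data and evaluate it once on the held-out validation set.
        best.buildClassifier(development);
        Evaluation finalEval = new Evaluation(development);
        finalEval.evaluateModel(best, validation);
        System.out.printf("Chosen method: %s%n", best.getClass().getSimpleName());
        System.out.printf("CV accuracy on development data : %.1f%%%n", bestAcc);
        System.out.printf("Accuracy on held-out validation : %.1f%%%n", finalEval.pctCorrect());
    }
}
```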