Hi! This is Lesson 2.2 in Data Mining with Weka, and here we're going to look at training and testing in a little bit more detail. 

Here's the situation. We've got a machine learning algorithm, and we feed training data into it, and it produces a classifier: the basic machine learning situation. We can test that classifier with some independent test data: we put the test data into the classifier and get some evaluation results. Separately, we can deploy the classifier in some real situation to make predictions on fresh data coming from the environment.

It's really important in classification that you only get reliable evaluation results if the test data is different from the training data. That's what we're going to look at in this lesson.

What if you've only got one dataset? Then you should divide it into two parts, one for training and one for testing: perhaps two-thirds of it for training and one-third of it for testing. Again, it's really important that the training data is different from the test data.
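To make that division concrete, here is a minimal sketch in plain Python of holding out one-third of a single dataset for testing. It is only an illustration, not how Weka does it internally: the function name `holdout_split`, the `train_fraction` and `seed` parameters, and the toy list of 30 numbered instances are all assumptions made up for this example.

```python
import random

def holdout_split(instances, train_fraction=2 / 3, seed=1):
    """Shuffle a copy of the data, then cut it into a training part and a test part."""
    rng = random.Random(seed)      # private generator, so the split is repeatable
    shuffled = list(instances)     # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy "dataset": 30 instances, identified only by a number (hypothetical data).
data = list(range(30))
train, test = holdout_split(data)
print(len(train), "training instances,", len(test), "test instances")
```

Every instance ends up in exactly one of the two parts, so the classifier is never tested on something it was trained on.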

In the basic scenario, both the training and test sets are produced by independent sampling from the same infinite population, but they are different samples. It's not the same data. If it is the same data, then your evaluation results are misleading. They don't reflect what you should actually expect on new data when you deploy your classifier.

Here we're going to look at the segment dataset, which we used in the last lesson. I'm going to open segment-challenge. First of all, I'm going to choose the J48 tree learner. I'm going to use a supplied test set, and I will set it to the appropriate file, segment-test.arff. I'm going to open that. Now we've got a test set, so let's see how it does. In the last lesson, on the same data with the User Classifier, I think I got 79% accuracy. J48 does much better; it gets 96% accuracy on the same test set.

Suppose I were to evaluate it on the training set. I can do that by just specifying "Use Training Set" under Test Options. Now it will train again and evaluate on the training set, which is not what you're supposed to do, because you get misleading results. Here, it's saying the accuracy is 99% on the training set. That is not representative of what we would get using this classifier on independent data.
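Why is the training-set figure too optimistic? A small sketch can show the effect. Note that this is not J48 and not the segment data: it swaps in a toy one-nearest-neighbour classifier, which simply memorizes its training data, and a made-up one-attribute dataset with 20% label noise. All the names here (`make_instance`, `nearest_neighbour_predict`, `accuracy`) are hypothetical.

```python
import random

rng = random.Random(1)

def make_instance():
    """One synthetic instance: a numeric attribute and a class label with 20% noise."""
    x = rng.random()
    label = "a" if x < 0.5 else "b"
    if rng.random() < 0.2:                      # flip the label 20% of the time
        label = "b" if label == "a" else "a"
    return (x, label)

def nearest_neighbour_predict(train, x):
    """Predict the label of the training instance whose attribute is closest to x."""
    return min(train, key=lambda inst: abs(inst[0] - x))[1]

def accuracy(train, evaluation):
    """Fraction of evaluation instances whose predicted label matches the true one."""
    correct = sum(1 for x, y in evaluation
                  if nearest_neighbour_predict(train, x) == y)
    return correct / len(evaluation)

# Two independent samples from the same (synthetic) population.
train = [make_instance() for _ in range(200)]
test = [make_instance() for _ in range(200)]

train_acc = accuracy(train, train)   # evaluated on the data it has already seen
test_acc = accuracy(train, test)     # evaluated on independent data
print("accuracy on training set:", train_acc)
print("accuracy on test set:    ", test_acc)
```

On the training set, every instance finds itself as its own nearest neighbour, so the classifier looks perfect. On the independent test set it cannot do that, and the noise shows up as errors. The gap is much larger here than the one we saw with J48, but the direction is the same.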

If we had just one dataset, if we didn't have a test set, we could do a percentage split. Here's a percentage split: this is going to be 66% training data and 34% test data. It's going to make a random split of the dataset. If I run that, I get 95%. That's just about the same as what we got when we had an independent test set, just slightly worse. 

If I were to run it again with a different split, we'd expect a slightly different result, but actually, I get exactly the same result, 95.098% accuracy. That's because Weka re-initializes the random number generator before each run, so it makes the same random split every time. The reason is to make sure that you can get repeatable results. If it didn't do that, then the results that you got would not be repeatable. However, if you wanted to have a look at the differences that you might get on different runs, then there is a way of resetting the random number generator differently for each run. We're going to look at that in the next lesson.
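The idea of re-initializing the random number generator can be sketched in a few lines of plain Python. This is an assumption-laden illustration of the general principle, not Weka's own code: the function `percentage_split`, its `seed` parameter, and the toy data are invented for the example.

```python
import random

def percentage_split(instances, train_percent=66, seed=1):
    """Randomly split the data; the generator is re-created from `seed` on every call."""
    rng = random.Random(seed)      # re-initialized each time, like starting a fresh run
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))            # hypothetical dataset of 100 numbered instances

first_run = percentage_split(data, seed=1)
second_run = percentage_split(data, seed=1)    # same seed: the identical split again
other_seed = percentage_split(data, seed=2)    # different seed: a different split

print("same seed, same split:     ", first_run == second_run)
print("different seed, same split:", first_run == other_seed)
```

Because the same seed always produces the same shuffle, the same instances land in the test set each time, and a deterministic learner then gives exactly the same accuracy. Changing the seed changes which instances are held out, which is what lets you see the run-to-run variation.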

That's this lesson. The basic assumption of machine learning is that the training and test sets are independently sampled from the same infinite population. If you have just one dataset, you should hold part of it out for testing, maybe about a third as we just did, or perhaps 10%. We would expect a slight variation in results each time if we held out a different set, but Weka produces the same results each time by design, because it re-initializes the random number generator before each run. We ran J48 on the segment-challenge dataset.

If you'd like, you can go and look at the course text on Training and Testing, Section 5.1, and please go and do the activity associated with this lesson.

Bye for now!