Hi! Welcome back for another few minutes in New Zealand. In the last lesson, Lesson 5.1, we learned that Weka only helps you with a small part of the overall data mining process, the technical part, which is perhaps the easy part. In this lesson, we're going to learn that there are many pitfalls and pratfalls even in that part. Let me just define these for you. A "pitfall" is a hidden or unsuspected danger or difficulty, and there are plenty of those in the field of machine learning. A "pratfall" is a stupid and humiliating action, which is very easy to do when you're working with data.

The first lesson is that you should be skeptical. In data mining it's very easy to cheat. Whether you're cheating consciously or unconsciously, it's easy to mislead yourself or mislead others about the significance of your results. For a reliable test, you should use a completely fresh sample of data that has never been seen before. Save some data for the very end, data that you don't use until you've selected your algorithm, decided how you're going to apply it, chosen the filters, and so on. At the very, very end, having done all that, run it on that fresh data to get an estimate of how it will perform. Don't be tempted to then change things to improve the results on that data. Always do your final run on fresh data. We've talked a lot about overfitting, and this is basically the same kind of problem. Of course, you know not to test on the training set -- we've talked about that endlessly throughout this course. Data that's been used for development in any way is tainted. Any time you use some data to help you make a choice of the filter, or the classifier, or how you're going to treat your problem, that data is tainted. You should be using completely fresh data to get evaluation results. Leave some evaluation data aside for the very end of the process. That's the first piece of advice.

Another thing I haven't told you about in this course so far is missing values. In real datasets, it's very common that some of the data values are missing: they haven't been recorded. They might be unknown; we might have forgotten to record them; they might be irrelevant. There are two basic strategies for dealing with missing values in a dataset. You can omit instances where the attribute value is missing, or somehow find a way of omitting that particular attribute in that instance. Or you can treat "missing" as a separate possible value. You need to ask yourself: is there significance in the fact that a value is missing? They say that if you've got something wrong with you and go to the doctor, who does some tests on you, and you record just the tests he does -- not the results of the tests, just the ones he chooses to do -- there's a very good chance that you can work out what's wrong with you from the existence of the tests alone, not from their results. That's because the doctor chooses tests intelligently. When he doesn't choose a test, that value isn't just accidentally absent: there's huge significance in the fact that he's chosen not to do certain tests. This is a situation where "missing" should be treated as a separate possible value, because there's significance in the fact that a value is missing. But in other situations, a value might be missing simply because a piece of equipment malfunctioned, or for some other reason -- maybe someone forgot something. Then there's no significance in the fact that it's missing.
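Coming back to that first piece of advice for a moment, here's a minimal sketch of the hold-out workflow using Weka's Java API. It's not part of the lesson: the file name, the 80/20 split, and the choice of J48 are placeholders of my own for illustration. The shape of the workflow is the point: all experimentation happens by cross-validation on the development portion, and the held-out portion is touched exactly once, at the very end.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FreshDataEvaluation {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; substitute your own dataset.
        Instances data = DataSource.read("mydata.arff");
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        // Set aside 20% as a final hold-out set and never look at it during development.
        int holdoutSize = (int) Math.round(data.numInstances() * 0.2);
        int devSize = data.numInstances() - holdoutSize;
        Instances dev = new Instances(data, 0, devSize);
        Instances holdout = new Instances(data, devSize, holdoutSize);

        // All experimentation (choosing classifiers, filters, parameters)
        // uses cross-validation on the development data only.
        Classifier chosen = new J48();
        Evaluation cv = new Evaluation(dev);
        cv.crossValidateModel(chosen, dev, 10, new Random(1));
        System.out.println("Development CV accuracy: " + cv.pctCorrect() + "%");

        // The very last step: train on all the development data and run once
        // on the fresh hold-out data.
        chosen.buildClassifier(dev);
        Evaluation finalEval = new Evaluation(dev);
        finalEval.evaluateModel(chosen, holdout);
        System.out.println("Final hold-out accuracy: " + finalEval.pctCorrect() + "%");
    }
}
```

If the final hold-out result disappoints you, resist the urge to go back and tune against it; that would taint the hold-out data too.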
Pretty well all machine learning algorithms deal with missing values. In an ARFF file, if you put a question mark as a data value, that's treated as a missing value. All methods in Weka can deal with missing values, but they make different assumptions about them, and if you don't appreciate this, it's easy to get misled. Let me take two simple and (to us) well known examples, OneR and J48. They deal with missing values in different ways. I'm going to load the nominal weather data and run OneR on it: I get 43%. If I run J48 on it, I get 50%. Now I'm going to edit this dataset by changing the value of "outlook" to missing for the first four "no" instances; that's done here in Weka's dataset editor. If we were to write this file out in ARFF format, we'd find that these values are written into the file as question marks. Now, if we look at "outlook", you can see that it says there are 4 missing values. If you count up these labels -- 2, 4, and 4 -- that's 10 labels, plus another 4 that are missing, to make the 14 instances. Let's go back to J48 and run it again. We still get 50%, the same result. Of course, this is a tiny dataset, but the point is that the results are not affected by the fact that a few of the values are missing. However, if we run OneR, we get a much higher accuracy: 93%. The rule it produces is "branch on outlook", which is what we had before, I think. But now there are 4 branches: if it's sunny, it's a yes; if it's overcast, it's a yes; if it's rainy, it's a yes; and if it's missing, it's a no. Here, OneR is using the fact that a value is missing as significant, as something you can branch on. Whereas if you were to look at a J48 tree, it would never have a branch that corresponded to a missing value. The two methods treat missing values differently, and that's very important to know and remember.

The final thing I want to tell you about in this lesson is the "no free lunch" theorem. There's no free lunch in data mining. Here's a way to illustrate it. Suppose you've got a 2-class problem with 100 binary attributes, and a huge training set with a million instances and their classifications. The number of possible instances is 2 to the 100 (2^100), because there are 100 binary attributes, and you know the classes of only 10^6 of them. So you don't know the classes of 2^100 - 10^6 examples. Let me tell you that 2^100 - 10^6 is 99.999...% of 2^100. There's this huge number of examples that you just don't know the classes of. How could you possibly figure them out? If you apply a data mining scheme to this problem, it will predict them, but how could it possibly figure out all of those classes from the tiny amount of data it's been given? In order to generalize, every learner must embody some knowledge or assumptions beyond the data it's given. Each learning algorithm implicitly embodies a set of assumptions. The best way to think about those assumptions is to think back to the Boundary Visualizer we looked at in Lesson 4.1. You saw that different machine learning schemes are capable of drawing different kinds of boundaries in instance space, and those boundaries correspond to a set of assumptions about the sort of decisions we can make. There's no universal best algorithm; there's no free lunch. Data mining is an experimental science, and that's why we've been teaching you how to experiment with data mining yourself.
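Just to back up that "99.999...%" figure, here's a quick arithmetic check of my own (not from the lesson), using the numbers given above:

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

public class NoFreeLunchArithmetic {
    public static void main(String[] args) {
        // Fraction of the 2^100 possible instances whose class we actually know.
        BigDecimal total = new BigDecimal(BigInteger.valueOf(2).pow(100));  // 2^100
        BigDecimal known = new BigDecimal(BigInteger.TEN.pow(6));           // 10^6
        BigDecimal knownFraction = known.divide(total, new MathContext(5));
        System.out.println("Known fraction: " + knownFraction);  // about 7.9E-25
        // So the unknown part, (2^100 - 10^6) / 2^100, falls short of 100%
        // by only about 8e-23 percentage points.
    }
}
```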
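And if you'd like to reproduce the missing-values experiment from earlier in this lesson outside the Explorer, here's a rough sketch using Weka's Java API. It assumes weather.nominal.arff is in the working directory, and the exact percentages may differ from the lesson depending on the evaluation setup; the point is the comparison between OneR and J48 on the same data, where setMissing() plays the role of the question mark in an ARFF file.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MissingValuesDemo {
    public static void main(String[] args) throws Exception {
        // Assumes the nominal weather data that ships with Weka is available here.
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        int outlook = data.attribute("outlook").index();
        int noClass = data.classAttribute().indexOfValue("no");

        // Set "outlook" to missing for the first four "no" instances,
        // just as the lesson does in the dataset editor.
        int changed = 0;
        for (int i = 0; i < data.numInstances() && changed < 4; i++) {
            if ((int) data.instance(i).classValue() == noClass) {
                data.instance(i).setMissing(outlook);
                changed++;
            }
        }

        // Compare how OneR and J48 behave on the same data with missing values.
        Evaluation oneR = new Evaluation(data);
        oneR.crossValidateModel(new OneR(), data, 10, new Random(1));
        System.out.println("OneR accuracy: " + oneR.pctCorrect() + "%");

        Evaluation j48 = new Evaluation(data);
        j48.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println("J48 accuracy: " + j48.pctCorrect() + "%");
    }
}
```

Printing the OneR rule and the J48 tree as well will show the difference described above: the rule can branch on the missing value itself, whereas the tree never will.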
Here's a summary. Be skeptical: when people tell you about data mining results and quote a certain accuracy, to be sure of it you'd want them to test their classifier on new, fresh data that they've never seen before. Overfitting has many faces. Different learning schemes make different assumptions about missing values, which can really change the results. There is no universal best learning algorithm. Data mining is an experimental science, and it's very easy to be misled by people quoting the results of data mining experiments.

That's it for now. Off you go and do the activity. We'll see you in the next lesson. Bye for now!