Hi! This is Lesson 4.2 on Linear Regression. Back in Lesson 1.3, we mentioned the difference between a classification problem and a regression problem. A classification problem is when what you're trying to predict is a nominal value, whereas in a regression problem what you're trying to predict is a numeric value. We've seen examples of datasets with nominal and numeric attributes before, but we've never looked at the problem of regression -- of trying to predict a numeric value as the output of a machine learning scheme. That's what we're doing in this lesson: linear regression. We've only had nominal classes so far, so now we're going to look at numeric classes.

This is a classical statistical method, dating back more than 2 centuries. This is the kind of picture you see: a cloud of data points in 2 dimensions, and we're trying to fit a straight line to that cloud -- looking for the best straight-line fit. In our case there might be more than 2 dimensions -- multiple attributes -- but it's still a standard problem. Let's just look at the 2-dimensional case here. You can write a straight-line equation in this form, with weights: w0 plus w1a1 plus w2a2, and so on. Just think about this in one dimension, where there's only one "a". Forget about all the terms at the end and just consider w0 plus w1a1. That's the equation of a straight line, where w0 and w1 are two constants to be determined from the data. This is going to work most naturally with numeric attributes, of course, because we're multiplying the attribute values by weights. We'll worry about nominal attributes in just a minute.

We're going to calculate these weights -- w0, w1, w2, and so on -- from the training data. Then, once we've calculated the weights, we're going to predict the value for the first training instance, a1. The notation gets pretty horrendous here, and it looks scary, but it's actually quite simple: we use the linear sum of the weights we've calculated with the attribute values of the first training instance to get the predicted value for that instance. The w's are just numbers we've calculated from the training data, and the other terms are the attribute values of the first training instance, a1 -- the 1 at the top means it's the first training instance, and the 1, 2, 3 below mean the first, second, and third attribute. We can write this in a neat little sum form, which looks a bit better. Notice, by the way, that we define a0 -- the zeroth attribute value -- to be 1; that just makes the formula work. For the first training instance, that gives us the number x, the predicted value for that instance.

Then we choose the weights to minimize the squared error on the training data. For the i'th training instance we have the actual x value and the predicted value. We take the difference between the actual and predicted values, square these differences, and add them all together over the training instances. That's what we're trying to minimize: we get the weights by minimizing this sum of squared errors.
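If it helps to see that in concrete terms, here is a minimal sketch in Python with NumPy -- not the mechanics any particular tool uses, just the same least-squares idea. The attribute values and class values below are made up purely for illustration.

```python
import numpy as np

# Toy training data: 5 instances, 3 numeric attributes each,
# plus the numeric class value x we want to predict.
A = np.array([[2.0, 1.0, 0.5],
              [3.0, 0.5, 1.5],
              [1.0, 2.0, 1.0],
              [4.0, 1.5, 2.0],
              [2.5, 1.0, 1.0]])
x = np.array([3.1, 4.0, 2.8, 5.9, 3.6])   # actual class values

# Define a0 = 1 for every instance so that w0 becomes an ordinary weight.
A1 = np.hstack([np.ones((A.shape[0], 1)), A])

# Choose the weights that minimize the sum of squared errors
# (the standard least-squares matrix computation).
w, *_ = np.linalg.lstsq(A1, x, rcond=None)

predicted = A1 @ w                          # predicted class values
sum_squared_error = np.sum((x - predicted) ** 2)
print(w, sum_squared_error)
```

The lstsq call is doing the standard matrix computation that finds the weights minimizing the sum of squared errors over the training instances.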
That's a mathematical job; we don't need to worry about the mechanics of doing it. It's a standard matrix problem, and it works fine provided there are more instances than attributes. You couldn't expect it to work with a huge number of attributes and not very many instances, but usually, of course, there are more instances than attributes, and then it works ok. If we did have nominal attributes, a 2-valued (binary) one could simply be converted to 0 and 1, and we could use those numbers. You'll look at how multi-valued nominal attributes are handled in the activity at the end of this lesson.

We're going to open a regression dataset and see what happens: cpu.arff. This is a regular kind of dataset. It's got numeric attributes, and -- the most important thing here -- it's got a numeric class: we're trying to predict a numeric value. We can run LinearRegression; it's in the functions category. We just run it, and this is the output. We've got the model here: the class is predicted as a linear sum. These are the weights I was talking about -- this weight times this attribute value, plus this weight times this attribute value, and so on, minus a constant at the end. That constant is w0, the weight that isn't multiplied by any attribute value. This is a formula for computing the class.

Once you have that formula, you can look at how successful it is in terms of the training data. The correlation coefficient, which is a standard statistical measure, is 0.9. That's pretty good. Then there are various other error figures printed here, and on the slide you can see their interpretation (there's also a small numerical sketch of how they're computed at the end of this lesson). It's really hard to know which one to use; they all tend to paint the same sort of picture, and the exact one you should use depends on the application. There's the mean absolute error and the root mean squared error, which is the standard metric to use. That's linear regression.

Now I'm going to look at nonlinear regression. A "model tree" is a tree where each leaf has one of these linear regression models. We create a tree like this, and at each leaf we have a linear model with its own coefficients. It's like a patchwork of linear models, and this set of 6 linear patches approximates a continuous function. There's a method under "trees" with the rather mysterious name of M5P. If we just run that, it produces a model tree. Maybe I should just visualize the tree. Now I can see the model tree, which is similar to the one on the slide. You can see that each of these -- in this case 5 -- leaves has a linear model -- LM1, LM2, LM3, and so on. Looking back at the output, the linear models are defined like this: LM1 has this linear formula, LM2 has this one, and so on. So we chose trees > M5P, we ran it, and we looked at the output. We could compare these performance figures -- a correlation coefficient of 0.92-0.93, a mean absolute error of about 30, and so on -- with the ones for regular linear regression, which got a slightly lower correlation and a slightly higher mean absolute error; in fact, I think all of its error figures are slightly higher. That's something we'll be asking you to do in the activity associated with this lesson.

Linear regression is a well-founded, venerable mathematical technique. Practical problems often require nonlinear solutions. The M5P method builds trees of regression models, with a linear model at each leaf of the tree. You can read about this in the course text in Section 4.6. Off you go now and do the activity associated with this lesson.
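Here is that promised sketch of the error figures. It just applies the textbook definitions of the correlation coefficient, mean absolute error, and root mean squared error to a pair of actual and predicted class values -- the numbers at the bottom are invented for illustration, and this is not the tool's own output code.

```python
import numpy as np

def regression_metrics(actual, predicted):
    """Textbook definitions of the main numeric-class evaluation figures."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    errors = actual - predicted
    mae  = np.mean(np.abs(errors))               # mean absolute error
    rmse = np.sqrt(np.mean(errors ** 2))         # root mean squared error
    corr = np.corrcoef(actual, predicted)[0, 1]  # correlation coefficient
    return corr, mae, rmse

# Hypothetical actual vs. predicted class values, just to show the calls.
print(regression_metrics([100, 150, 300], [110, 140, 320]))
```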
See you soon. Bye!