1
00:00:18,230 --> 00:00:23,180
Hi! This is Lesson 4.2 on Linear Regression.

2
00:00:23,180 --> 00:00:28,840
Back in Lesson 1.3, we actually mentioned
the difference between a classification problem

3
00:00:28,840 --> 00:00:31,300
and a regression problem.

4
00:00:31,300 --> 00:00:35,809
A classification problem is when what you're
trying to predict is a nominal value, whereas

5
00:00:35,809 --> 00:00:41,400
in a regression problem what you're trying
to predict is a numeric value.

6
00:00:41,400 --> 00:00:46,400
We've seen examples of datasets with nominal
and numeric attributes before, but we've never

7
00:00:46,400 --> 00:00:51,230
looked at the problem of regression, of trying
to predict a numeric value as the output of

8
00:00:51,230 --> 00:00:53,110
a machine learning scheme.

9
00:00:53,110 --> 00:00:56,690
That's what we're doing in this [lesson],
linear regression.

10
00:00:56,690 --> 00:01:02,859
We've only had nominal classes so far, so
now we're going to look at numeric classes.

11
00:01:02,859 --> 00:01:08,340
This is a classical statistical method, dating
back more than 2 centuries.

12
00:01:08,340 --> 00:01:11,190
This is the kind of picture you see.

13
00:01:11,190 --> 00:01:15,450
You have a cloud of data points in 2 dimensions,
and we're trying to fit a straight line to

14
00:01:15,450 --> 00:01:21,560
this cloud of data points and looking for
the best straight-line fit.

15
00:01:21,560 --> 00:01:25,590
Only in our case we might have more than 2
dimensions; there might be multiple dimensions.

16
00:01:25,590 --> 00:01:28,509
It's still a standard problem.

17
00:01:28,509 --> 00:01:31,649
Let's just look at the 2-dimensional case
here.

18
00:01:31,649 --> 00:01:39,560
You can write a straight line equation in
this form, with weights w0 plus w1a1 plus

19
00:01:39,560 --> 00:01:41,179
w2a2, and so on.

20
00:01:41,179 --> 00:01:44,209
Just think about this in one dimension where
there's only one "a".

21
00:01:44,209 --> 00:01:51,209
Forget about all the things at the end here,
just consider w0 plus w1a1.

22
00:01:51,770 --> 00:01:55,560
That's the equation of this line -- it's the
equation of a straight line -- where w0 and

23
00:01:55,560 --> 00:01:59,920
w1 are two constants to be determined from
the data.

24
00:01:59,920 --> 00:02:06,289
This, of course, is going to work most naturally
with numeric attributes, because we're multiplying

25
00:02:06,289 --> 00:02:08,849
these attribute values by weights.

26
00:02:08,849 --> 00:02:13,129
We'll worry about nominal attributes in just
a minute.

27
00:02:13,129 --> 00:02:19,260
We're going to calculate these weights from
the training data -- w0, w1, and w2.

28
00:02:19,260 --> 00:02:22,239
Those are what we're going to calculate from
the training data.

29
00:02:22,239 --> 00:02:27,930
Then, once we've calculated the weights, we're
going to predict the value for the first training

30
00:02:27,930 --> 00:02:29,010
instance, a1.

31
00:02:29,010 --> 00:02:31,670
The notation gets really horrendous here.

32
00:02:31,670 --> 00:02:33,599
I know it looks pretty scary, but it's pretty
simple.

33
00:02:33,599 --> 00:02:38,049
We're using this linear sum with these weights
that we've calculated, using the attribute

34
00:02:38,049 --> 00:02:45,049
values of the first [training] instance in order
to get the predicted value for that instance.

35
00:02:48,239 --> 00:02:54,450
We're going to get predicted values for the
training instances using this rather horrendous

36
00:02:54,450 --> 00:02:55,749
formula here.

37
00:02:55,749 --> 00:02:58,810
I know it looks pretty scary, but it's actually
not so scary.

38
00:02:58,810 --> 00:03:04,549
These w's are just numbers that we've calculated
from the training data, and then these things

39
00:03:04,549 --> 00:03:09,680
here are the attribute values of the first
training instance a1 -- that 1 at the top

40
00:03:09,680 --> 00:03:12,409
here means it's the first training instance.

41
00:03:12,409 --> 00:03:16,840
This 1, 2, 3 means it's the first, second,
and third attribute.

42
00:03:16,840 --> 00:03:21,170
We can write this in this neat little sum
form here, which looks a little bit better.

43
00:03:21,170 --> 00:03:28,040
Notice, by the way, that we're defining a0
-- the zeroth attribute value -- to be 1.

44
00:03:28,040 --> 00:03:31,260
That just makes this formula work.
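
Written out, the formulas described so far might look like this. This is a reconstruction from the spoken description rather than a copy of the slide; the symbol k for the number of attributes is an assumed name.

```latex
% The linear model: a weighted sum of the attribute values
x = w_0 + w_1 a_1 + w_2 a_2 + \dots + w_k a_k

% Predicted value for the first training instance, with a_0^{(1)} defined to be 1
w_0 a_0^{(1)} + w_1 a_1^{(1)} + \dots + w_k a_k^{(1)} = \sum_{j=0}^{k} w_j a_j^{(1)}
```

The superscript (1) marks the first training instance, and the subscript j runs over the attributes.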

45
00:03:31,260 --> 00:03:38,510
For the first training instance, that gives
us this number x, the predicted value for

46
00:03:38,519 --> 00:03:45,519
the first training instance and this particular
value of a1.

47
00:03:47,889 --> 00:03:54,139
Then we're choosing the weights to minimize
the squared error on the training data.

48
00:03:54,139 --> 00:03:58,639
This is the actual x value for this i'th training
instance.

49
00:03:58,639 --> 00:04:02,249
This is the predicted value for the i'th training
instance.

50
00:04:02,249 --> 00:04:05,579
We're going to take the difference between
the actual and the predicted value, square

51
00:04:05,579 --> 00:04:07,410
them up, and add them all together.

52
00:04:07,410 --> 00:04:09,680
And that's what we're trying to minimize.

53
00:04:09,680 --> 00:04:15,370
We get the weights by minimizing this sum
of squared errors.
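
The quantity being minimized can be written as a single expression. This is a reconstruction from the description above, assuming n training instances and k attributes (both assumed names), with x^{(i)} the actual class value of the i'th instance.

```latex
% Sum over all training instances of (actual - predicted) squared
\sum_{i=1}^{n} \left( x^{(i)} - \sum_{j=0}^{k} w_j a_j^{(i)} \right)^{2}
```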

54
00:04:15,370 --> 00:04:20,190
That's a mathematical job; we don't need to
worry about the mechanics of doing that.

55
00:04:20,190 --> 00:04:23,639
It's a standard matrix problem.

56
00:04:23,639 --> 00:04:26,750
It works fine if there are more instances
than attributes.

57
00:04:26,750 --> 00:04:31,660
You couldn't expect this to work if you had
a huge number of attributes and not very many instances.

58
00:04:31,669 --> 00:04:35,530
But providing there are more instances than
attributes -- and usually there are, of course

59
00:04:35,530 --> 00:04:38,110
-- that's going to work ok.
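
As a rough illustration of the "standard matrix problem", here is a minimal Python sketch using NumPy's least-squares solver. The data is made up, and this is not necessarily how Weka's LinearRegression computes its weights; it only shows the idea of choosing weights that minimize the sum of squared errors when there are more instances than attributes.

```python
import numpy as np

# Made-up training data: 5 instances, 2 numeric attributes (a1, a2).
A = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 2.0]])

# Made-up class values, generated exactly from x = 1 + 2*a1 + 3*a2.
x = 1.0 + 2.0 * A[:, 0] + 3.0 * A[:, 1]

# Prepend the constant attribute a0 = 1 so that w0 fits into the same sum.
A_with_a0 = np.column_stack([np.ones(len(A)), A])

# Solve for the weights that minimize the sum of squared errors.
weights, _, rank, _ = np.linalg.lstsq(A_with_a0, x, rcond=None)

# Predicted values and the sum of squared errors on the training data.
predicted = A_with_a0 @ weights
sum_squared_error = float(np.sum((x - predicted) ** 2))

print("weights (w0, w1, w2):", weights)
print("sum of squared errors:", sum_squared_error)
```

Because the class values here were generated from a linear formula with no noise, the fitted weights should come out very close to 1, 2, and 3.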

60
00:04:38,110 --> 00:04:44,170
If we did have nominal values, if we just
have a 2-valued/binary-valued attribute, we could just

61
00:04:44,170 --> 00:04:47,170
convert it to 0 and 1 and use those numbers.

62
00:04:47,170 --> 00:04:52,210
If we have multi-valued nominal attributes,
you'll have a look at that in the activity

63
00:04:52,210 --> 00:04:58,250
at the end of this lesson.
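
The two-valued case can be sketched in a few lines of Python. The attribute values "no" and "yes" are hypothetical, and which value gets 0 and which gets 1 is an arbitrary choice made here by sorting the labels.

```python
# Hypothetical two-valued nominal attribute from five instances.
values = ["no", "yes", "yes", "no", "yes"]

# Assign 0 and 1 to the two distinct labels (alphabetical order, by assumption).
labels = sorted(set(values))
code = {label: index for index, label in enumerate(labels)}

# Replace each nominal value by its number so it can be multiplied by a weight.
encoded = [code[v] for v in values]
print(encoded)
```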

64
00:04:58,250 --> 00:05:05,250
We're going to open a regression dataset and
see what it does: cpu.arff.

65
00:05:06,100 --> 00:05:07,400
This is a regular kind of dataset.

66
00:05:07,400 --> 00:05:11,750
It's got numeric attributes, and the most
important thing here is that it's got a numeric

67
00:05:11,750 --> 00:05:15,690
class -- we're trying to predict a numeric
value.

68
00:05:15,690 --> 00:05:22,690
We can run LinearRegression; it's in the functions
category.

69
00:05:24,060 --> 00:05:28,030
We just run it, and this is the output.

70
00:05:28,030 --> 00:05:29,530
We've got the model here.

71
00:05:29,530 --> 00:05:32,580
The class has been predicted as a linear sum.

72
00:05:32,580 --> 00:05:34,320
These are the weights I was talking about.

73
00:05:34,320 --> 00:05:39,060
It's this weight times this attribute value
plus this weight times this attribute value,

74
00:05:39,060 --> 00:05:39,960
and so on.

75
00:05:39,960 --> 00:05:46,960
Minus -- and this is w0, the constant weight,
not modified by an attribute.

76
00:05:48,490 --> 00:05:51,170
This is a formula for computing the class.

77
00:05:51,170 --> 00:05:55,940
When you use that formula, you can look at
the success of it in terms of the training data.

78
00:05:55,940 --> 00:06:01,710
The correlation coefficient, which is a standard
statistical measure, is 0.9.

79
00:06:01,710 --> 00:06:02,700
That's pretty good.

80
00:06:02,700 --> 00:06:06,720
Then there are various other error figures
here that are printed.

81
00:06:06,720 --> 00:06:11,300
On the slide, you can see the interpretation
of these error figures.

82
00:06:11,300 --> 00:06:14,630
It's really hard to know which one to use.

83
00:06:14,630 --> 00:06:19,050
They all tend to produce the same sort of
picture, but I guess the exact one you should

84
00:06:19,050 --> 00:06:21,700
use depends on the application.

85
00:06:23,420 --> 00:06:27,900
There's the mean absolute error and the root
mean squared error, which is the standard

86
00:06:27,900 --> 00:06:33,270
metric to use.
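
To make these figures concrete, here is a small Python sketch that computes the correlation coefficient, the mean absolute error, and the root mean squared error using their standard textbook definitions. The actual and predicted values are made up, not taken from cpu.arff, and Weka's output includes further figures that are not reproduced here.

```python
import math

# Made-up actual and predicted class values for four instances.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
n = len(actual)

# Per-instance errors: predicted minus actual.
errors = [p - a for p, a in zip(predicted, actual)]

# Mean absolute error: average size of the errors, ignoring sign.
mae = sum(abs(e) for e in errors) / n

# Root mean squared error: square the errors, average, take the square root.
rmse = math.sqrt(sum(e * e for e in errors) / n)

# Pearson correlation coefficient between actual and predicted values.
mean_a = sum(actual) / n
mean_p = sum(predicted) / n
cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
var_a = sum((a - mean_a) ** 2 for a in actual)
var_p = sum((p - mean_p) ** 2 for p in predicted)
correlation = cov / math.sqrt(var_a * var_p)

print("correlation:", correlation)
print("mean absolute error:", mae)
print("root mean squared error:", rmse)
```

Squaring makes the root mean squared error weight large errors more heavily than the mean absolute error does, which is one reason the choice between them depends on the application.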

87
00:06:33,270 --> 00:06:33,920
That's linear regression.

88
00:06:33,920 --> 00:06:38,700
I'm actually going to look at nonlinear regression
here.

89
00:06:38,700 --> 00:06:45,080
A "model tree" is a tree where each leaf has
one of these linear regression models.

90
00:06:45,080 --> 00:06:50,040
We create a tree like this, and then at each
leaf we have a linear model, which has got

91
00:06:50,040 --> 00:06:51,100
those coefficients.

92
00:06:51,100 --> 00:07:00,220
It's like a patchwork of linear models, and
this set of 6 linear patches approximates

93
00:07:00,220 --> 00:07:02,290
a continuous function.
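
The patchwork idea can be sketched with a tiny hand-written model tree in Python. The split point and coefficients below are invented for illustration; this is not the tree M5P builds and says nothing about how M5P chooses its splits.

```python
def predict(a1):
    """Hypothetical model tree: one split on a1, one linear model per leaf."""
    if a1 <= 5.0:
        # Leaf LM1: a linear model used for small values of a1.
        return 1.0 + 2.0 * a1
    # Leaf LM2: a different linear model used for larger values of a1.
    return 6.0 + 1.0 * a1


# Each instance is routed to a leaf, and that leaf's linear formula is applied.
for value in (2.0, 5.0, 8.0):
    print(value, predict(value))
```

In this made-up example the two linear patches happen to meet at the split point, so together they trace a bent line rather than a single straight one.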

94
00:07:02,290 --> 00:07:12,990
There's a method under "trees" with the rather
mysterious name of M5P.

95
00:07:12,990 --> 00:07:18,440
If we just run that, that produces a model
tree.

96
00:07:19,520 --> 00:07:23,370
Maybe I should just visualize the tree.

97
00:07:25,100 --> 00:07:32,920
Now I can see the model tree, which is similar
to the one on the slide.

98
00:07:32,920 --> 00:07:38,280
You can see that each of these -- in this
case 5 -- leaves has a linear model -- LM1,

99
00:07:38,280 --> 00:07:45,730
LM2, LM3, ... And if we look back here, the
linear models are defined like this: LM1 has

100
00:07:45,730 --> 00:07:52,730
this linear formula; this linear formula for
LM2; and so on.

101
00:07:58,510 --> 00:08:03,150
We chose trees > M5P, we ran it, and we looked
at the output.

102
00:08:03,150 --> 00:08:13,070
We could compare these performance figures
-- 92-93% correlation, mean absolute error

103
00:08:13,070 --> 00:08:20,360
of 30, and so on -- with the ones for regular
linear regression, which got a slightly lower

104
00:08:20,360 --> 00:08:24,960
correlation, and a slightly higher absolute
error -- in fact, I think all these error

105
00:08:24,960 --> 00:08:26,930
figures are slightly higher.

106
00:08:26,930 --> 00:08:33,930
That's something we'll be asking you to do
in the activity associated with this lesson.

107
00:08:34,220 --> 00:08:40,270
Linear regression is a well-founded, venerable
mathematical technique.

108
00:08:40,270 --> 00:08:45,540
Practical problems often require non-linear
solutions.

109
00:08:45,540 --> 00:08:50,640
The M5P method builds trees of regression
models, with linear models at each leaf of

110
00:08:50,640 --> 00:08:51,320
the tree.

111
00:08:51,320 --> 00:08:56,210
You can read about this in the course text
in Section 4.6.

112
00:08:56,210 --> 00:08:59,990
Off you go now and do the activity associated
with this lesson.

113
00:08:59,990 --> 00:09:01,160
See you soon.

114
00:09:01,160 --> 00:09:02,300
Bye!

