1
00:00:16,770 --> 00:00:19,360
Hi! I'm sitting here in New Zealand.

2
00:00:19,360 --> 00:00:21,440
It's on the globe behind me.

3
00:00:21,440 --> 00:00:24,700
That's New Zealand, at the top of the world,
surrounded by water.

4
00:00:24,700 --> 00:00:27,540
But that's not where I'm from originally.

5
00:00:27,540 --> 00:00:30,790
I moved here about 20 years ago.

6
00:00:30,790 --> 00:00:35,790
Here on this map, of course, this is New Zealand
-- Google puts things with the north at the

7
00:00:35,790 --> 00:00:37,480
top, which is probably what you're used to.

8
00:00:37,480 --> 00:00:42,430
I came here from the University of Calgary
in Canada, where I was for many years.

9
00:00:42,430 --> 00:00:45,540
I used to be head of computer science for
the University of Calgary.

10
00:00:45,540 --> 00:00:50,860
But, originally, I'm from Belfast, Northern
Ireland, which is here in the United Kingdom.

11
00:00:50,860 --> 00:00:55,410
So, my accent actually is Northern Irish,
not New Zealand.

12
00:00:55,410 --> 00:00:59,010
This is not a New Zealand accent.

13
00:00:59,010 --> 00:01:03,440
We're going to talk here in the last lesson
of Class 3 about another machine learning

14
00:01:03,440 --> 00:01:08,330
method called the nearest neighbor, or instance-based,
machine learning method.

15
00:01:08,330 --> 00:01:15,330
When people talk about rote learning, they
just talk about remembering stuff without really

16
00:01:15,360 --> 00:01:17,550
thinking about it.

17
00:01:17,550 --> 00:01:20,110
It's the simplest kind of learning.

18
00:01:20,110 --> 00:01:23,030
Nearest neighbor implements rote learning.

19
00:01:23,030 --> 00:01:28,750
It just remembers the training instances,
and then, to classify a new instance, it searches

20
00:01:28,750 --> 00:01:33,520
the training set for one that is most like
the new instance.

21
00:01:33,520 --> 00:01:36,530
The representation of the knowledge here is
just the set of instances.

22
00:01:36,530 --> 00:01:38,680
It's a kind of lazy learning.

23
00:01:38,680 --> 00:01:42,610
The learner does nothing until it has to do
some predictions.

24
00:01:42,610 --> 00:01:46,970
Confusingly, it's also called instance-based
learning.

25
00:01:46,970 --> 00:01:52,100
Nearest neighbor learning and instance-based
learning are the same thing.

26
00:01:52,100 --> 00:01:56,390
Here is just a little picture of a 2-dimensional
instance space.

27
00:01:56,390 --> 00:02:02,330
The blue points and the white points are two
different classes -- yes and no, for example.

28
00:02:02,330 --> 00:02:06,330
Then we've got an unknown instance, the red
one.

29
00:02:06,330 --> 00:02:07,800
We want to know which class it's in.

30
00:02:07,800 --> 00:02:12,820
So, we simply find the closest instance in
each of the classes and see which is closest.

31
00:02:12,820 --> 00:02:15,050
In this case, it's the blue class.

32
00:02:15,050 --> 00:02:19,280
So, we would classify that red point as though
it belonged to the blue class.
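The picture can be sketched in a few lines of Python. This is only an illustration: the points, the labels, and the function names `squared_distance` and `nearest_neighbor` are made up for the example, not taken from Weka.

```python
def squared_distance(a, b):
    """Sum of squared attribute differences between two numeric instances."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor(training, query):
    """Return the class label of the stored instance closest to `query`.

    `training` is a list of (attributes, label) pairs. Nothing is learned
    up front: the stored instances are the whole "model".
    """
    _, label = min(training, key=lambda inst: squared_distance(inst[0], query))
    return label


# Made-up points standing in for the blue and white clouds in the picture.
training = [
    ((1.0, 1.0), "blue"),
    ((1.5, 2.0), "blue"),
    ((5.0, 5.0), "white"),
    ((6.0, 4.5), "white"),
]
red_point = (2.0, 2.5)
print(nearest_neighbor(training, red_point))  # prints "blue"
```

The red point lands nearest to a blue instance, so it is classified as blue.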

33
00:02:19,280 --> 00:02:25,450
If you think about this, that's implicitly
drawing a line between the two clouds of points.

34
00:02:25,450 --> 00:02:32,410
It's a straight line here, the perpendicular
bisector of the line that joins the two closest points.

35
00:02:32,410 --> 00:02:36,570
The nearest neighbor method produces a linear
decision boundary.

36
00:02:36,570 --> 00:02:39,470
Actually, it's a little bit more complicated
than that.

37
00:02:39,470 --> 00:02:46,700
It produces a piece-wise linear decision boundary
with sometimes a bunch of little linear pieces

38
00:02:46,700 --> 00:02:48,690
of the decision boundary.

39
00:02:50,840 --> 00:02:54,860
Of course, the trick is what do we mean by
"most like".

40
00:02:54,860 --> 00:03:00,480
We need a similarity function, and conventionally,
people use the regular distance function,

41
00:03:00,480 --> 00:03:07,920
the Euclidean distance, which is the sum of
the squares of the differences between the attributes.

42
00:03:07,920 --> 00:03:11,480
Actually, it's the square root of the sum
of the squares, but since we're just comparing

43
00:03:11,480 --> 00:03:14,220
two instances, we don't need to take the square root.

44
00:03:15,130 --> 00:03:20,510
Or, you might use the Manhattan or city block
distance, which is the sum of the absolute

45
00:03:20,510 --> 00:03:22,290
differences between the attribute values.
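As a rough sketch, the two distance functions for numeric attributes might look like this in Python. The function names are chosen just for this example.

```python
import math


def squared_euclidean(a, b):
    """Sum of the squares of the attribute differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared attribute differences."""
    return math.sqrt(squared_euclidean(a, b))


def manhattan(a, b):
    """Sum of the absolute attribute differences (city block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25.0
print(manhattan(p, q))          # 7.0
```

Taking the square root never changes which of two distances is smaller, which is why the squared form is enough when all we want is the nearest instance.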

46
00:03:23,410 --> 00:03:26,830
Of course, I've been talking about numeric
attributes here.

47
00:03:26,830 --> 00:03:33,290
If attributes are nominal, we need the difference
between different attribute values.

48
00:03:34,050 --> 00:03:38,210
Conventionally, people just say the distance
is 1 if the attribute values are different

49
00:03:38,210 --> 00:03:39,960
and 0 if they are the same.

50
00:03:39,960 --> 00:03:44,820
It might be a good idea with nearest neighbor
learning to normalize the attributes so that

51
00:03:44,820 --> 00:03:49,780
they all lie between 0 and 1, so the distance
isn't skewed by some attribute that happens

52
00:03:49,780 --> 00:03:54,270
to be on some gigantic scale.
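Here is a minimal sketch of both ideas: the 0/1 difference for nominal values, and rescaling numeric attributes to lie between 0 and 1. It assumes min-max scaling, and the helper names and data are invented for the example.

```python
def nominal_difference(a, b):
    """Distance between two nominal values: 0 if the same, 1 if different."""
    return 0 if a == b else 1


def min_max_normalize(rows):
    """Rescale each numeric column to [0, 1] using its own min and max."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple(
            # A constant column has no spread, so map it to 0.0.
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


# The second attribute is on a much larger scale than the first.
rows = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
print(min_max_normalize(rows))
# [(0.0, 0.0), (0.5, 1.0), (1.0, 0.5)]
print(nominal_difference("sunny", "rainy"))  # 1
```

After rescaling, both attributes contribute on the same 0-to-1 footing, so the large-scale one no longer dominates the distance.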

53
00:03:54,270 --> 00:03:57,070
What about noisy instances?

54
00:03:57,070 --> 00:04:04,000
If we have a noisy dataset, then by accident
we might find an incorrectly classified training

55
00:04:04,000 --> 00:04:06,850
instance as the nearest one to our test instance.

56
00:04:06,850 --> 00:04:12,360
You can guard against that by using the k-nearest-neighbors.

57
00:04:12,360 --> 00:04:17,280
k might be 3 or 5, and you look for the 3
or the 5 nearest neighbors and choose the

58
00:04:17,280 --> 00:04:22,030
majority class amongst those when classifying
an unknown point.

59
00:04:22,030 --> 00:04:24,540
That's the k-nearest-neighbor method.
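A small Python sketch of the majority vote follows. It is not how Weka implements it; the data is made up, with one deliberately mislabelled "white" instance sitting among the blue ones, and the function name `knn_classify` is hypothetical.

```python
from collections import Counter


def squared_distance(a, b):
    """Sum of squared attribute differences between two numeric instances."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(training, query, k):
    """Majority class among the k stored instances closest to `query`.

    On a tied vote, the class seen first among the neighbors (that is,
    the class of the closest tied neighbor) is returned.
    """
    neighbors = sorted(training, key=lambda inst: squared_distance(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


training = [
    ((1.0, 1.0), "blue"),
    ((1.5, 2.0), "blue"),
    ((2.5, 1.5), "blue"),
    ((2.0, 2.0), "white"),  # noisy: mislabelled point inside the blue cloud
    ((5.0, 5.0), "white"),
    ((6.0, 4.5), "white"),
]
query = (2.1, 2.1)
print(knn_classify(training, query, k=1))  # prints "white" (fooled by the noise)
print(knn_classify(training, query, k=3))  # prints "blue"
```

With k as 1 the noisy instance decides the answer; with k as 3 its two blue neighbors outvote it.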

60
00:04:24,540 --> 00:04:31,650
In Weka, it's called IBk (instance-based learning
with parameter k), and it's in the lazy class.

61
00:04:31,650 --> 00:04:38,000
Let's open the glass dataset.

62
00:04:40,230 --> 00:04:46,490
Go to Classify and choose the lazy classifier
IBk.

63
00:04:48,430 --> 00:04:49,960
Let's just run it.

64
00:04:49,960 --> 00:04:56,960
We get an accuracy of 70.6%.

65
00:04:57,150 --> 00:04:59,650
The model is not really printed here, because
there is no model.

66
00:04:59,650 --> 00:05:02,520
It's just the set of training instances.

67
00:05:03,440 --> 00:05:05,960
We're using 10-fold cross-validation, of course.

68
00:05:05,960 --> 00:05:12,960
Let's change the value of k. This kNN is the
k value.

69
00:05:15,070 --> 00:05:16,470
It's set by default to 1.

70
00:05:16,470 --> 00:05:23,470
(The number of neighbors to use.) We'll change
that to, say, 5 and run that.

71
00:05:24,840 --> 00:05:31,380
In this case, we get a slightly worse result,
67.8% with k as 5.

72
00:05:32,120 --> 00:05:35,010
This is not such a noisy dataset, I guess.

73
00:05:35,010 --> 00:05:41,810
If we change it to 20 and run it again,

74
00:05:41,810 --> 00:05:44,840
we get 65% accuracy, slightly worse again.

75
00:05:45,440 --> 00:05:52,280
If we had a noisy dataset, we might find that
the accuracy figures improved as k got a little

76
00:05:52,280 --> 00:05:53,490
bit larger.

77
00:05:53,490 --> 00:05:55,960
Then, it would always start to decrease again.

78
00:05:55,960 --> 00:06:00,940
If we set k to be an extreme value, close
to the size of the whole dataset, then we're

79
00:06:00,940 --> 00:06:06,550
taking the distance of the test instance
to all of the points in the dataset and averaging

80
00:06:06,550 --> 00:06:10,090
those, which will probably give us something
close to the baseline accuracy.

81
00:06:10,090 --> 00:06:15,440
Here, if I set k to be a ridiculous value
like 100,

82
00:06:15,440 --> 00:06:22,150
I'm going to take the 100 nearest instances
and average their classes.

83
00:06:22,150 --> 00:06:29,070
We get an accuracy of 35%, which, I think
is pretty close to the baseline accuracy for

84
00:06:29,070 --> 00:06:30,090
this dataset.

85
00:06:30,090 --> 00:06:36,710
Let me just find that out with ZeroR. The
baseline accuracy is indeed 35%.

86
00:06:40,280 --> 00:06:42,000
Nearest neighbor is a really good method.

87
00:06:42,000 --> 00:06:44,090
It's often very accurate.

88
00:06:44,090 --> 00:06:45,430
It can be slow.

89
00:06:45,430 --> 00:06:52,050
A simple implementation would involve scanning
the entire training dataset to make each prediction,

90
00:06:52,050 --> 00:06:56,690
because we've got to calculate the distance
of the unknown test instance from all of the

91
00:06:56,690 --> 00:06:59,320
training instances to see which is closest.

92
00:06:59,320 --> 00:07:03,190
There are more sophisticated data structures
that can make this faster, so you don't need

93
00:07:03,190 --> 00:07:06,300
to scan the whole dataset every time.

94
00:07:06,300 --> 00:07:09,900
It assumes all attributes are equally important.

95
00:07:09,900 --> 00:07:14,810
If that wasn't the case, you might want to
look at schemes for selecting or weighting

96
00:07:14,810 --> 00:07:18,240
attributes depending on their importance.

97
00:07:18,240 --> 00:07:23,220
If we've got noisy instances, then we can
use a majority vote over the k nearest neighbors,

98
00:07:23,220 --> 00:07:27,370
or we might weight instances according to
their prediction accuracy.

99
00:07:27,370 --> 00:07:32,850
Or, we might try to identify reliable prototypes,
one for each of the classes.

100
00:07:32,850 --> 00:07:34,270
This is a very old method.

101
00:07:34,270 --> 00:07:37,650
Statisticians have used k-nearest-neighbor
since the 1950s.

102
00:07:37,650 --> 00:07:41,210
There's an interesting theoretical result.

103
00:07:41,210 --> 00:07:49,090
If the number (n) of training instances approaches
infinity, and k also gets larger in such a

104
00:07:49,090 --> 00:07:58,550
way that k/n approaches 0, but k also approaches
infinity, the error of the k-nearest-neighbor

105
00:07:58,550 --> 00:08:03,190
method approaches the theoretical minimum
error for that dataset.

106
00:08:03,190 --> 00:08:08,800
There is a theoretical guarantee that with
a huge dataset and large values of k, you're

107
00:08:08,800 --> 00:08:12,800
going to get good results from nearest neighbor
learning.
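Written out as a formula, the result reads roughly as follows. The symbols are notation assumed for this sketch: $\operatorname{err}_{k,n}$ is the error of k-nearest-neighbor with n training instances, and $\operatorname{err}^{*}$ is the theoretical minimum error.

```latex
\[
  n \to \infty, \qquad k \to \infty, \qquad \frac{k}{n} \to 0
  \quad \Longrightarrow \quad
  \operatorname{err}_{k,n} \to \operatorname{err}^{*}
\]
```

So k has to grow without bound, but more slowly than n does.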

108
00:08:12,800 --> 00:08:16,800
There's a section in the text, Section 4.7
on Instance-based learning.

109
00:08:16,800 --> 00:08:19,830
This is the last lesson of Class 3.

110
00:08:19,830 --> 00:08:23,300
Off you go and do the activity, and I'll see
you in Class 4.

111
00:08:23,300 --> 00:08:25,030
Bye for now!