1
00:00:16,510 --> 00:00:23,510
Hi! In this lesson, Lesson 2.5, I want to
introduce you to the standard way of evaluating

2
00:00:23,680 --> 00:00:28,740
the performance of a machine learning algorithm,
which is called cross-validation.

3
00:00:28,740 --> 00:00:35,740
A couple of lessons back, we looked at evaluating
on an independent test set, and we also talked

4
00:00:36,640 --> 00:00:41,160
about evaluating on the training set (don't
do that).

5
00:00:41,160 --> 00:00:47,190
We also talked about evaluating using the
holdout method by taking the one dataset and

6
00:00:47,190 --> 00:00:51,370
holding out a little bit for testing and using
the rest for training.

7
00:00:51,370 --> 00:00:57,129
There is a fourth option on Weka's Classify
panel, which is called cross-validation, and

8
00:00:57,129 --> 00:01:02,049
that's what we're going to talk about here.

9
00:01:02,649 --> 00:01:07,680
Cross-validation is a way of improving upon
repeated holdout.

10
00:01:07,680 --> 00:01:14,310
We tried using the holdout method with different
random-number seeds each time.

11
00:01:14,310 --> 00:01:17,659
That's called repeated holdout.

12
00:01:17,659 --> 00:01:21,860
Cross-validation is a systematic way of doing
repeated holdout that actually improves upon

13
00:01:21,860 --> 00:01:26,080
it by reducing the variance of the estimate.

14
00:01:26,080 --> 00:01:30,680
We take a training set and we create a classifier.

15
00:01:30,680 --> 00:01:34,370
Then we're looking to evaluate the performance
of that classifier, and there is a certain

16
00:01:34,370 --> 00:01:39,020
amount of variance in that evaluation, because
it's all statistical underneath.

17
00:01:39,020 --> 00:01:42,480
We want to keep the variance in the estimation
as low as possible.

18
00:01:42,480 --> 00:01:48,330
Cross-validation is a way of reducing the
variance, and a variant on cross-validation

19
00:01:48,330 --> 00:01:52,610
called stratified cross-validation reduces
it even further.

20
00:01:52,610 --> 00:01:58,580
I'm going to explain that in this class.

21
00:01:58,580 --> 00:02:03,440
In a previous lesson, we held out 10% for
testing and repeated that 10 times.

22
00:02:03,440 --> 00:02:06,310
That's the repeated holdout method.

23
00:02:06,310 --> 00:02:13,310
We've got one dataset, and we divided it independently
10 separate times into a training set and

24
00:02:14,170 --> 00:02:16,450
a test set.

25
00:02:16,450 --> 00:02:23,450
With cross-validation, we divide it just once,
but we divide into, say, 10 pieces.

26
00:02:23,820 --> 00:02:28,690
Then, we take 9 of the pieces and use them
for training,

27
00:02:28,690 --> 00:02:30,920
and the last piece we use for testing.

28
00:02:31,320 --> 00:02:37,630
Then, with the same division, we take another
9 pieces and use them for training and the

29
00:02:37,630 --> 00:02:39,960
held out piece for testing.

30
00:02:39,960 --> 00:02:44,610
We do the whole thing 10 times, using a different
segment for testing each time.

31
00:02:44,610 --> 00:02:50,100
In other words, we divide the dataset into
10 pieces, and then we hold out each of these

32
00:02:50,100 --> 00:02:57,100
pieces in turn for testing, train on the rest,
do the testing and average the 10 results.

33
00:02:57,160 --> 00:03:00,860
That would be 10-fold cross-validation.

34
00:03:00,860 --> 00:03:07,040
Divide the dataset into 10 parts (these are
called folds), hold out each part in turn

35
00:03:07,040 --> 00:03:07,960
and average the results.

36
00:03:07,960 --> 00:03:14,270
So, each data point in the dataset is used
once for testing and 9 times for training.

37
00:03:14,270 --> 00:03:17,000
That's 10-fold cross-validation.
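
The procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not Weka's implementation; the names `k_fold_splits`, `test_counts` and `train_counts` are made up for this example, and the dataset is represented only by index numbers.

```python
import random

def k_fold_splits(n_items, k=10, seed=1):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)        # divide the data just once
    folds = [indices[i::k] for i in range(k)]   # k roughly equal pieces
    for held_out in range(k):
        test = folds[held_out]                  # hold out one piece for testing
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

# Count how often each of 50 data points lands in a test set or a training set.
test_counts = [0] * 50
train_counts = [0] * 50
for train, test in k_fold_splits(50, k=10):
    for i in test:
        test_counts[i] += 1
    for i in train:
        train_counts[i] += 1
```

Under these assumptions, every entry of `test_counts` ends up as 1 and every entry of `train_counts` as 9, matching "once for testing and 9 times for training".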

38
00:03:17,000 --> 00:03:22,320
Stratified cross-validation is a simple variant
where, when we do the initial division into

39
00:03:22,320 --> 00:03:28,110
10 parts, we ensure that each fold has got
approximately the correct proportion of each

40
00:03:28,110 --> 00:03:29,150
of the class values.

41
00:03:29,150 --> 00:03:36,150
Of course, there are many different ways
of dividing a dataset into 10 equal parts;

42
00:03:36,150 --> 00:03:40,600
we just make sure we choose a division that
has approximately the right representation

43
00:03:40,600 --> 00:03:42,880
of class values in each of the folds.

44
00:03:42,880 --> 00:03:44,790
That's stratified cross-validation.

45
00:03:44,790 --> 00:03:50,880
It helps reduce the variance in the estimate
a little bit more.
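
One simple way to get folds with roughly the right class proportions is to group the data points by class and deal each group out to the folds in turn. The sketch below assumes that approach; it is not necessarily how Weka stratifies, and `stratified_folds` and the 70/30 "yes"/"no" labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=1):
    """Assign each index to one of k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)                    # random order within each class
        for index in members:
            folds[position % k].append(index)   # deal out round-robin
            position += 1
    return folds

# Hypothetical two-class dataset: 70 "yes" and 30 "no" instances.
labels = ["yes"] * 70 + ["no"] * 30
folds = stratified_folds(labels, k=10)
per_fold_yes = [sum(labels[i] == "yes" for i in fold) for fold in folds]
```

With this dataset, each of the 10 folds holds 7 "yes" and 3 "no" instances, the same 70/30 proportion as the whole dataset.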

46
00:03:50,880 --> 00:03:59,540
Then, once we've done the cross-validation,
what Weka does is run the algorithm an eleventh

47
00:03:59,540 --> 00:04:01,750
time on the whole dataset.

48
00:04:01,750 --> 00:04:05,580
That will then produce a classifier that we
might deploy in practice.

49
00:04:05,580 --> 00:04:11,790
We use 10-fold cross-validation in order to
get an evaluation result and an estimate of the error,

50
00:04:11,790 --> 00:04:17,180
and then, finally, we run the learning algorithm
one more time to get an actual classifier

51
00:04:17,180 --> 00:04:20,000
to use in practice.
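
To make the count of 11 runs concrete, here is a rough sketch of "evaluate with 10 folds, then build the final classifier on everything". It is not Weka code: `evaluate_then_build`, `train_majority` and `accuracy` are hypothetical helpers, the learner simply predicts the most common class, and the folds are left unshuffled to keep the example short.

```python
from collections import Counter

def evaluate_then_build(train_fn, score_fn, data, k=10):
    """Return (cross-validated score, classifier trained on all the data)."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for held_out in range(k):
        training = [row for f, fold in enumerate(folds) if f != held_out for row in fold]
        model = train_fn(training)                    # runs 1 to k
        scores.append(score_fn(model, folds[held_out]))
    final_model = train_fn(data)                      # run k + 1, on the whole dataset
    return sum(scores) / k, final_model

call_log = []  # records the training-set size of every learner invocation

def train_majority(rows):
    call_log.append(len(rows))
    return Counter(label for _, label in rows).most_common(1)[0][0]

def accuracy(model, rows):
    return sum(label == model for _, label in rows) / len(rows)

# Hypothetical dataset: 60 (id, class) pairs, 40 of class "a" and 20 of class "b".
data = [(i, "a" if i % 3 else "b") for i in range(60)]
estimate, final_model = evaluate_then_build(train_majority, accuracy, data)
```

Here `call_log` ends up with 11 entries: ten training runs on 54 instances each, then one on all 60. The averaged score is the evaluation result, and `final_model` is the classifier that would actually be deployed.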

52
00:04:22,550 --> 00:04:24,050
That's what I wanted to tell you.

53
00:04:24,050 --> 00:04:28,150
Cross-validation is better than repeated holdout,
and we'll look at that in the next lesson.

54
00:04:28,150 --> 00:04:31,120
Stratified cross-validation is even better.

55
00:04:31,120 --> 00:04:37,760
Weka does stratified cross-validation by default.

56
00:04:37,960 --> 00:04:42,650
With 10-fold cross-validation, Weka invokes
the learning algorithm 11 times, once for each

57
00:04:42,650 --> 00:04:47,820
fold of the cross-validation and then a final
time on the entire dataset.

58
00:04:47,820 --> 00:04:52,190
The practical rule of thumb is that if you've
got lots of data, you can use a percentage

59
00:04:52,190 --> 00:04:54,740
split and evaluate it just once.

60
00:04:54,740 --> 00:05:01,670
Otherwise, if you don't have too much data,
you should use stratified 10-fold cross-validation.

61
00:05:01,670 --> 00:05:03,830
How big is lots? Well, this is what everyone
asks.

62
00:05:03,830 --> 00:05:10,830
How long is a piece of string, you know? It's
hard to say, but it depends on a few things.

63
00:05:11,150 --> 00:05:14,000
It depends on the number of classes in your
dataset.

64
00:05:14,000 --> 00:05:24,220
If you've got a two-class dataset, then if
you had, say, 100-1000 data points, that would

65
00:05:24,220 --> 00:05:29,490
probably be good enough for a pretty reliable
evaluation

66
00:05:29,490 --> 00:05:33,750
if you did a 90%/10% split into training
and test sets.

67
00:05:33,750 --> 00:05:39,560
If you had, say, 10,000 data points in a two-class
problem, then I think you'd have lots and

68
00:05:39,560 --> 00:05:43,360
lots of data; you wouldn't need to go to cross-validation.

69
00:05:43,360 --> 00:05:50,130
If, on the other hand, you had 100 different
classes, then that's different, right?

70
00:05:50,130 --> 00:05:54,720
You would need a larger dataset, because you want
a fair representation of each class when you

71
00:05:54,920 --> 00:05:57,790
do the evaluation.

72
00:05:57,790 --> 00:06:00,780
It's really hard to say exactly; it depends
on the circumstances.

73
00:06:00,780 --> 00:06:05,790
If you've got thousands and thousands of data
points, you might just do things once with

74
00:06:05,790 --> 00:06:07,200
a holdout.

75
00:06:07,200 --> 00:06:14,100
If you've got fewer than a thousand data points,
even with a two-class problem, then you might

76
00:06:14,100 --> 00:06:15,930
as well do 10-fold cross-validation.

77
00:06:15,930 --> 00:06:18,440
It really doesn't take much longer.

78
00:06:18,440 --> 00:06:23,340
Well, it takes 10 times as long, but the times
are generally pretty short.

79
00:06:23,340 --> 00:06:29,770
You can read more about this in Section 5.3
of the course text on cross-validation.

80
00:06:29,770 --> 00:06:35,030
Now it's time for you to go and do the activity
associated with this [lesson].

81
00:06:35,030 --> 00:06:42,030
See you soon!

