1
00:00:16,320 --> 00:00:21,939
Hello again! In the last lesson, we looked
at training and testing.

2
00:00:21,939 --> 00:00:29,820
We saw that we can evaluate a classifier on
an independent test set, or using a percentage split,

3
00:00:29,820 --> 00:00:35,730
with a certain percentage of the dataset
used to train and the rest used for testing,

4
00:00:35,730 --> 00:00:41,230
or -- and this is generally a very bad idea -- we
can evaluate it on the training set itself,

5
00:00:41,230 --> 00:00:45,550
which gives misleadingly optimistic performance
figures.

6
00:00:45,550 --> 00:00:51,820
In this lesson, we're going to look a little
bit more at training and testing.

7
00:00:51,820 --> 00:01:03,640
In fact, what we're going to do is repeatedly
train and test using percentage split.

8
00:01:03,640 --> 00:01:08,420
Now, in the last lesson, we saw that if you
simply repeat the training and testing, you

9
00:01:08,420 --> 00:01:13,610
get the same result each time because Weka
initializes the random number generator before

10
00:01:13,610 --> 00:01:18,500
it does each run to make sure that you know
what's going on when you do the same experiment

11
00:01:18,500 --> 00:01:19,220
again tomorrow.

12
00:01:19,220 --> 00:01:22,090
But, there is a way of overriding that.

13
00:01:22,090 --> 00:01:28,820
So, we will be using independent random numbers
on different occasions to produce a percentage

14
00:01:28,820 --> 00:01:34,210
split of the dataset into a training and test
set.

15
00:01:34,210 --> 00:01:37,780
I'm going to open the segment-challenge data
again.

16
00:01:37,780 --> 00:01:40,130
That's what we used before.

17
00:01:40,130 --> 00:01:44,700
Notice there are 1500 instances here;

18
00:01:44,700 --> 00:01:45,729
that's quite a lot.

19
00:01:45,729 --> 00:01:48,649
I'm going to go to Classify.

20
00:01:48,649 --> 00:01:54,549
I'm going to choose J48, our standard method,
I guess.

21
00:01:54,549 --> 00:02:00,710
I'm going to use a percentage split, and because
we've got 1500 instances, I'm going to choose

22
00:02:00,710 --> 00:02:05,329
90% for training and just 10% for testing.

23
00:02:05,329 --> 00:02:12,070
I reckon that 10% -- that's 150 instances -- for
testing is going to give us a reasonable estimate,

24
00:02:12,070 --> 00:02:16,720
and we might as well train on as many as we
can to get the most accurate classifier.
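[Editor's sketch] A seeded percentage split can be pictured roughly as below. This assumes a simple shuffle-then-cut scheme and is not Weka's actual implementation; the function name percentage_split is hypothetical.

```python
import random


def percentage_split(n_instances, train_percent, seed):
    """Shuffle instance indices with a seeded generator, then cut into train/test.

    Illustrative only: an assumed shuffle-then-cut scheme, not Weka's code.
    """
    rng = random.Random(seed)  # same seed -> same shuffle -> same split
    indices = list(range(n_instances))
    rng.shuffle(indices)
    n_train = round(n_instances * train_percent / 100)
    return indices[:n_train], indices[n_train:]


# 1500 instances, 90% for training, as in the lesson.
train, test = percentage_split(1500, 90, seed=1)
print(len(train), len(test))  # 1350 150
```

Calling it again with the same seed reproduces the same split; a different seed yields a different 150-instance test set, which is why the accuracy figure can move from run to run.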

25
00:02:16,720 --> 00:02:25,520
I'm going to run this, and the accuracy figure
I get -- this is what I got in the last lesson --

26
00:02:25,520 --> 00:02:27,740
is 96.6667%.

27
00:02:29,340 --> 00:02:34,949
Now, this is a misleadingly high accuracy here.

28
00:02:34,949 --> 00:02:41,000
I'm going to call that 96.7%, or 0.967.

29
00:02:41,000 --> 00:02:45,560
And then, I'm going to do it again and just
see how much variation we get of that figure

30
00:02:45,560 --> 00:02:49,500
initializing the random number generator
to a different value each time.

31
00:02:50,460 --> 00:02:57,460
If I go to the More options menu, I get a
number of options here which are quite useful:

32
00:02:57,770 --> 00:03:00,150
outputting the model, we're doing that;

33
00:03:00,150 --> 00:03:01,680
outputting statistics;

34
00:03:01,680 --> 00:03:03,890
we can output different evaluation measures;

35
00:03:03,890 --> 00:03:05,770
we're doing the confusion matrix;

36
00:03:05,770 --> 00:03:08,060
we're storing the predictions for visualization;

37
00:03:08,060 --> 00:03:10,860
we can output the predictions if we want;

38
00:03:10,860 --> 00:03:14,370
we can do a cost-sensitive evaluation;

39
00:03:14,370 --> 00:03:20,980
and we can set the random seed for cross-validation
or percentage split.

40
00:03:20,980 --> 00:03:22,300
That's set by default to 1.

41
00:03:22,300 --> 00:03:26,170
I'm going to change that to 2, a different
random seed.

42
00:03:26,170 --> 00:03:31,490
We could also output the source code for the
classifier if we wanted, but I just want to

43
00:03:31,490 --> 00:03:32,950
change the random seed.

44
00:03:32,950 --> 00:03:35,450
Then I want to run it again.

45
00:03:35,450 --> 00:03:42,450
Before we got 0.967, and this time we get 0.94,
94%.

46
00:03:43,180 --> 00:03:45,310
Quite different, you see.

47
00:03:45,310 --> 00:03:52,090
If I then change this again to, say,
3, and run it again,

48
00:03:52,090 --> 00:03:53,900
again I get 94%.

49
00:03:53,900 --> 00:04:03,830
If I change it again to 4 and run it again,
I get 96.7%.

50
00:04:03,830 --> 00:04:05,200
Let's do one more.

51
00:04:05,200 --> 00:04:12,200
Change it to 5, run it again, and now I get
95.3%.

52
00:04:14,330 --> 00:04:15,710
Here's a table with these figures in.

53
00:04:15,710 --> 00:04:21,480
If we run it 10 times, we get this set of
results.

54
00:04:21,480 --> 00:04:26,330
Given this set of experimental results, we
can calculate the mean and standard deviation.

55
00:04:26,330 --> 00:04:33,770
The sample mean is the sum of all of these
error figures -- or these success rates, I should say -- 

56
00:04:33,770 --> 00:04:37,200
divided by the number, 10 of
them.

57
00:04:37,200 --> 00:04:41,760
That's 0.949, about 95%.

58
00:04:41,760 --> 00:04:43,290
That's really what we would expect to get.

59
00:04:43,290 --> 00:04:46,910
That's a better estimate than the 96.7% that
we started out with.

60
00:04:46,910 --> 00:04:49,460
A more reliable estimate.

61
00:04:49,460 --> 00:04:51,420
We can calculate the sample variance.

62
00:04:51,420 --> 00:04:57,200
We take the deviation from the mean, we subtract
the mean from each of these numbers, we square that,

63
00:04:57,200 --> 00:05:02,560
add them up, and we divide, not by n,
but by n - 1.

64
00:05:02,560 --> 00:05:04,730
That might surprise you, perhaps.

65
00:05:04,730 --> 00:05:11,730
The reason it is n - 1 is that we've
actually calculated the mean from this sample.

66
00:05:12,650 --> 00:05:19,060
When the mean is calculated from the sample,
you need to divide by n - 1, leading to a slightly larger

67
00:05:19,060 --> 00:05:22,090
variance estimate than if you were to divide
by n.

68
00:05:22,090 --> 00:05:32,740
We take the square root of that, and in this
case, we get a standard deviation of 1.8%.
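[Editor's sketch] The calculation just described, as a minimal example. It uses only the five success rates quoted aloud for seeds 1 to 5, not the full ten in the table, so its output differs from the 0.949 and 1.8% given in the lesson.

```python
import math

# Success rates quoted in the lesson for random seeds 1 to 5.
# The lesson's table has ten runs; the other five are not in the transcript.
rates = [0.967, 0.940, 0.940, 0.967, 0.953]

n = len(rates)
mean = sum(rates) / n

# Sample variance: divide by n - 1, because the mean was itself
# estimated from this same sample.
variance = sum((x - mean) ** 2 for x in rates) / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean:.3f}, standard deviation = {std_dev:.3f}")
```

For these five runs the mean comes out at about 0.953 and the standard deviation at about 0.014.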

69
00:05:32,740 --> 00:05:39,190
Now you can see that the real performance
of J48 on the segment-challenge dataset is

70
00:05:39,190 --> 00:05:44,460
approximately 95% accuracy, plus or minus
approximately 2%.

71
00:05:44,460 --> 00:05:50,550
Anywhere, let's say, between 93% and 97% accuracy.

72
00:05:50,550 --> 00:05:55,470
These figures that you get, that Weka puts
out for you, are misleading.

73
00:05:55,470 --> 00:06:04,720
You need to be careful how you interpret them,
because the result is certainly not 95.333%.

74
00:06:04,720 --> 00:06:08,550
There's a lot of variation on a lot of these
figures.

75
00:06:09,900 --> 00:06:13,870
Remember, the basic assumption is that the training
and test sets are sampled independently from

76
00:06:13,870 --> 00:06:18,940
an infinite population, and you should expect
a slight variation in results -- perhaps more

77
00:06:18,940 --> 00:06:21,660
than just a slight variation in results.

78
00:06:21,660 --> 00:06:27,680
You can estimate the variation in results
by setting the random-number seed and repeating

79
00:06:27,680 --> 00:06:29,520
the experiment.

80
00:06:29,520 --> 00:06:33,520
You can calculate the mean and the standard
deviation experimentally, which is what we

81
00:06:33,520 --> 00:06:34,240
just did.

82
00:06:35,270 --> 00:06:38,740
Off you go now, and do the activity associated
with this lesson.

83
00:06:39,140 --> 00:06:40,240
I'll see you in the next lesson.

84
00:06:40,540 --> 00:06:42,090
Bye!

