1
00:00:16,309 --> 00:00:22,070
Hi! This is Lesson 2.2 in Data Mining with
Weka, and here we're going to look at training

2
00:00:22,070 --> 00:00:27,710
and testing in a little bit more detail.

3
00:00:27,710 --> 00:00:29,369
Here's a situation.

4
00:00:29,369 --> 00:00:34,519
We've got a machine learning algorithm, and
we feed into it training data, and it produces

5
00:00:34,519 --> 00:00:38,379
a classifier -- a basic machine learning situation.

6
00:00:38,379 --> 00:00:43,620
For that classifier, we can test it with some
independent test data.

7
00:00:43,620 --> 00:00:49,820
We can put that into the classifier and get
some evaluation results, and, separately,

8
00:00:49,820 --> 00:00:55,159
we can deploy the classifier in some real
situation to make predictions on fresh data

9
00:00:55,159 --> 00:00:58,589
coming from the environment.

10
00:00:58,589 --> 00:01:03,530
It's really important in classification: when
you're looking at your evaluation results,

11
00:01:03,530 --> 00:01:09,029
you only get reliable evaluation results if
the test data is different from the training data.

12
00:01:10,080 --> 00:01:14,250
That's what we're going to look at in this
lesson.

13
00:01:14,250 --> 00:01:19,090
What if you only have one dataset? If you
just have one dataset, you should divide it

14
00:01:19,090 --> 00:01:20,689
into two parts.

15
00:01:20,689 --> 00:01:24,189
Maybe use some of it for training and some
of it for testing.

16
00:01:24,189 --> 00:01:27,549
Perhaps 2/3 of it for training and 1/3
of it for testing.

17
00:01:27,549 --> 00:01:32,369
It's really important that the training data
is different from the test data.

18
00:01:32,369 --> 00:01:38,759
Both training and test sets are produced by
independent sampling from an infinite population.

19
00:01:38,759 --> 00:01:43,479
That's the basic scenario here, but they're
different independent samples.

20
00:01:43,479 --> 00:01:44,950
It's not the same data.

21
00:01:44,950 --> 00:01:49,479
If it is the same data, then your evaluation
results are misleading.

22
00:01:49,479 --> 00:01:56,479
They don't reflect what you should actually
expect on new data when you deploy your classifier.

23
00:01:57,060 --> 00:02:02,600
Here we're going to look at the segment dataset, which we used in the last lesson.

24
00:02:02,600 --> 00:02:09,600
I'm going to open the segment-challenge dataset.

25
00:02:09,759 --> 00:02:12,640
I'm going to use a supplied test set.

26
00:02:12,640 --> 00:02:19,110
First of all, I'm going to use the J48 tree
learner.

27
00:02:19,110 --> 00:02:21,530
I'm going to use a supplied test set,

28
00:02:21,530 --> 00:02:25,579
and I will set it to the appropriate segment-test file, segment-test.arff.

29
00:02:32,879 --> 00:02:38,579
I'm going to open that. Now we've got
a test set, and let's see how it does.

30
00:02:38,879 --> 00:02:45,510
In the last lesson, on the same data with
the user classifier, I think I got 79% accuracy.

31
00:02:45,510 --> 00:02:49,140
J48 does much better;

32
00:02:49,140 --> 00:02:55,989
it gets 96% accuracy on the same test set.

33
00:02:55,989 --> 00:03:00,670
Suppose I were to evaluate it on the training
set? I can do that by just specifying under

34
00:03:00,670 --> 00:03:03,049
Test options, "Use training set".

35
00:03:03,049 --> 00:03:08,069
Now it will train it again and evaluate it
on the training set, which is not what you're

36
00:03:08,069 --> 00:03:12,319
supposed to do, because you get misleading
results.

37
00:03:12,319 --> 00:03:17,739
Here, it's saying the accuracy is 99% on the
training set.

38
00:03:17,739 --> 00:03:24,640
That is not representative of what we would
get using this on independent data.

39
00:03:24,640 --> 00:03:30,540
If we had just one dataset, if we didn't have
a test dataset, we could do a percentage split.

40
00:03:30,540 --> 00:03:31,900
Here's a percentage split.

41
00:03:31,900 --> 00:03:37,219
This is going to be 66% training data and
34% test data.

42
00:03:37,219 --> 00:03:40,200
That's going to make a random split of the
dataset.

43
00:03:40,200 --> 00:03:47,019
If I run that, I get 95%.

44
00:03:47,019 --> 00:03:50,160
That's just about the same as what we got
when we had an independent test set,

45
00:03:50,160 --> 00:03:52,009
just slightly worse.

46
00:03:54,109 --> 00:04:01,109
If I were to run it again, if we had a different
split, we'd expect a slightly different result,

47
00:04:01,819 --> 00:04:08,640
but actually, I get exactly the same result,
95.098%.

48
00:04:08,640 --> 00:04:14,719
That's because Weka, before it does a run,
reinitializes the random number generator.

49
00:04:14,719 --> 00:04:18,220
The reason is to make sure that you can get
repeatable results.

50
00:04:18,220 --> 00:04:22,120
If it didn't do that, then the results that
you got would not be repeatable.

51
00:04:22,120 --> 00:04:27,940
However, if you wanted to have a look at the
differences that you might get on different

52
00:04:27,940 --> 00:04:32,560
runs, then there is a way of resetting the
random number seed between runs.

53
00:04:32,560 --> 00:04:37,880
We're going to look at that in the next lesson.

54
00:04:37,880 --> 00:04:38,630
That's this lesson.

55
00:04:38,630 --> 00:04:42,440
The basic assumption of machine learning is
that the training and test sets are independently

56
00:04:42,440 --> 00:04:46,729
sampled from an infinite population, the same
population.

57
00:04:46,729 --> 00:04:52,750
If you have just one dataset, you should hold
part of it out for testing, maybe 34% as we

58
00:04:52,750 --> 00:04:56,009
just did or perhaps 10%.

59
00:04:56,009 --> 00:05:00,550
We would expect a slight variation in results
each time if we hold out a different set,

60
00:05:00,550 --> 00:05:05,669
but Weka produces the same results each time
by design, by making sure it reinitializes

61
00:05:05,669 --> 00:05:09,449
the random number generator each time.

62
00:05:09,449 --> 00:05:12,389
We ran J48 on the segment-challenge dataset.

63
00:05:12,389 --> 00:05:16,080
If you'd like, you can go and look at the
course text on

64
00:05:16,080 --> 00:05:18,180
Training and testing, Section 5.1,

65
00:05:18,180 --> 00:05:21,380
and please go and do the activity associated with this lesson.

66
00:05:21,580 --> 00:05:23,180
Bye for now!