1
00:00:17,449 --> 00:00:27,570
Hello again! We're up to the last lesson in
the fourth class, Lesson 4.6 on Ensemble Learning.

2
00:00:27,570 --> 00:00:34,250
In real life, when we have important decisions
to make, we often choose to make them using

3
00:00:34,250 --> 00:00:35,670
a committee.

4
00:00:35,670 --> 00:00:41,000
Having different experts sitting down together,
with different perspectives on the problem,

5
00:00:41,000 --> 00:00:47,870
and letting them vote, is often a very effective
and robust way of making good decisions.

6
00:00:48,580 --> 00:00:51,320
The same is true in machine learning.

7
00:00:51,320 --> 00:00:56,629
We can often improve predictive performance
by having a bunch of different machine learning

8
00:00:56,629 --> 00:01:01,629
methods, all producing classifiers for the
same problem, and then letting them vote when

9
00:01:01,629 --> 00:01:06,840
it comes to classifying an unknown test instance.
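
To make the voting idea concrete, here is a minimal Python sketch. The function name `majority_vote` and the toy labels are assumptions made for illustration; this is not how Weka is implemented.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted most often.

    If two labels tie, the one that appeared first in the list wins
    (Counter.most_common keeps first-seen order for equal counts).
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

# Three hypothetical classifiers each predict a class for one test instance.
votes = ["yes", "no", "yes"]
print(majority_vote(votes))
```

Each classifier contributes one prediction, and the committee's answer is simply the most common one.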

10
00:01:06,840 --> 00:01:10,360
One of the disadvantages is that this produces
output that is hard to analyze.

11
00:01:10,360 --> 00:01:15,990
There are actually approaches that try and
produce a single comprehensible structure,

12
00:01:15,990 --> 00:01:19,560
but we're not going to be looking at any of
those.

13
00:01:19,560 --> 00:01:23,219
So the output will be hard to analyze, but
you often get very good performance.

14
00:01:24,500 --> 00:01:28,289
It's a fairly recent technique in machine
learning.

15
00:01:29,200 --> 00:01:34,950
We're going to look at four methods, called
"bagging", "randomization", "boosting", and

16
00:01:34,950 --> 00:01:36,509
"stacking".

17
00:01:36,509 --> 00:01:40,469
They're all implemented in Weka, of course.

18
00:01:40,469 --> 00:01:45,450
With bagging, we want to produce several different
decision structures.

19
00:01:45,450 --> 00:01:52,110
Let's say we use J48 to produce decision trees,
then we want to produce slightly different decision trees.

20
00:01:52,119 --> 00:01:57,109
We can do that by having several different
training sets of the same size.

21
00:01:57,109 --> 00:02:02,429
We can get those by sampling the original
training set.

22
00:02:02,429 --> 00:02:08,039
In fact, in bagging, you sample the set "with
replacement", which means that sometimes you

23
00:02:08,039 --> 00:02:15,039
might get two of the same [instances] chosen
in your sample.
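
Sampling "with replacement" can be sketched in a few lines of Python. The helper name `bootstrap_sample` and the six-letter training set are hypothetical, chosen only to show the idea.

```python
import random

def bootstrap_sample(instances, rng):
    """Draw len(instances) items from instances, with replacement."""
    return [rng.choice(instances) for _ in instances]

rng = random.Random(1)  # fixed seed so the sample is repeatable
training_set = ["a", "b", "c", "d", "e", "f"]
sample = bootstrap_sample(training_set, rng)

print(sample)            # same size as the original training set
print(len(set(sample)))  # number of distinct instances that were drawn
```

Because every draw is made from the full set, the same instance can be chosen more than once, and some instances may not be chosen at all.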

24
00:02:15,400 --> 00:02:20,040
We produce several different training sets,
and then we build a model for each one -- let's

25
00:02:20,040 --> 00:02:24,120
say a decision tree -- using the same machine
learning scheme each time, whether J48 or some other

26
00:02:24,120 --> 00:02:25,590
learning scheme.

27
00:02:25,590 --> 00:02:32,590
Then we combine the predictions of the different
models by voting, or if it's a regression

28
00:02:32,650 --> 00:02:38,439
situation you would average the numeric result
rather than voting on it.
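
Putting those steps together, here is a rough Python sketch of bagging for classification. The toy learner `train_stump`, the function `bagging`, and the six-instance dataset are all assumptions made for illustration; Weka's Bagging classifier is not written this way. For a numeric prediction you would average the models' outputs instead of counting votes.

```python
import random
from collections import Counter

def train_stump(data):
    """Toy learner: choose the threshold on x that misclassifies the fewest
    (x, label) pairs. Predicts `left` when x <= threshold, else `right`."""
    best = None
    labels = sorted({y for _, y in data})
    for t, _ in data:
        for left in labels:
            for right in labels:
                errors = sum((left if x <= t else right) != y for x, y in data)
                if best is None or errors < best[0]:
                    best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def bagging(data, train, iterations, seed):
    """Train one model per bootstrap sample and combine them by voting."""
    rng = random.Random(seed)
    models = []
    for _ in range(iterations):
        sample = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(train(sample))

    def predict(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]

    return predict

data = [(1, "low"), (2, "low"), (3, "low"), (7, "high"), (8, "high"), (9, "high")]
classify = bagging(data, train_stump, iterations=10, seed=1)
print(classify(0), classify(10))
```

Each bootstrap sample yields a slightly different stump, and the final prediction is whatever most of the stumps agree on.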

29
00:02:38,439 --> 00:02:43,540
This is very suitable for learning schemes
that are called "unstable".

30
00:02:43,540 --> 00:02:48,620
Unstable learning schemes are ones where a
small change in the training data can make

31
00:02:48,620 --> 00:02:51,540
a big change in the model.

32
00:02:51,540 --> 00:02:53,670
Decision trees are a really good example of
this.

33
00:02:53,670 --> 00:02:57,040
You can get a decision tree and just make
a tiny little change in the training data

34
00:02:57,040 --> 00:03:00,799
and get a completely different kind of decision
tree.

35
00:03:00,799 --> 00:03:06,969
Whereas with NaiveBayes, if you think about
how NaiveBayes works, little changes in the

36
00:03:06,969 --> 00:03:11,409
training set aren't going to make much difference
to the result of NaiveBayes, so that's a "stable"

37
00:03:11,409 --> 00:03:13,799
machine learning method.

38
00:03:13,799 --> 00:03:18,450
In Weka we have a "Bagging" classifier in
the meta set.

39
00:03:18,450 --> 00:03:25,450
I'm going to choose meta > Bagging: here it
is.

40
00:03:27,739 --> 00:03:32,689
We can choose here the bag size -- this is
saying a bag size of 100%, which is going

41
00:03:32,689 --> 00:03:37,579
to sample the training set to get another
set the same size, but it's going to sample

42
00:03:37,579 --> 00:03:39,269
"with replacement".

43
00:03:39,269 --> 00:03:45,019
That means we're going to get different sets
of the same size every time we sample, but

44
00:03:45,019 --> 00:03:50,159
each set might contain repeats of the original
training [instances].

45
00:03:50,159 --> 00:03:54,890
Here we choose which classifier we want to
bag, and we can choose the number of bagging

46
00:03:54,890 --> 00:03:58,730
iterations here, and a random-number seed.

47
00:03:58,730 --> 00:04:02,140
That's the bagging method.

48
00:04:02,140 --> 00:04:05,140
The next one I want to talk about is "random
forests".

49
00:04:05,140 --> 00:04:09,239
Here, instead of randomizing the training
data, we randomize the algorithm.

50
00:04:09,239 --> 00:04:13,519
How you randomize the algorithm depends on
what the algorithm is.

51
00:04:13,519 --> 00:04:17,049
Random forests are when you're using decision
tree algorithms.

52
00:04:17,049 --> 00:04:23,220
Remember when we talked about how J48 works?
It selects the best attribute to split

53
00:04:23,220 --> 00:04:24,490
on each time.

54
00:04:24,490 --> 00:04:30,160
You can randomize this procedure by not necessarily
selecting the very best, but choosing a few

55
00:04:30,160 --> 00:04:33,180
of the best options, and randomly picking
amongst them.

56
00:04:33,180 --> 00:04:35,480
That gives you different trees every time.
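
Here is a small Python sketch of that randomized choice, following the description just given. The function `pick_split` and the attribute scores are hypothetical. Note that typical random forest implementations randomize slightly differently: at each node they consider a random subset of the attributes and take the best one within that subset.

```python
import random

def pick_split(scores, k, rng):
    """scores maps attribute name -> merit (higher is better).
    Returns one of the k best attributes, chosen uniformly at random."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return rng.choice(ranked[:k])

# Illustrative merit values for four attributes (assumed, not computed here).
gains = {"outlook": 0.247, "humidity": 0.152, "windy": 0.048, "temperature": 0.029}

rng = random.Random(42)
print([pick_split(gains, 2, rng) for _ in range(5)])
```

With `k=1` this reduces to always taking the single best attribute; a larger `k` lets different trees split on different attributes.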

57
00:04:35,480 --> 00:04:44,060
Generally, if you randomize decision trees
and bag the result, you get

58
00:04:44,060 --> 00:04:47,110
better performance.

59
00:04:47,110 --> 00:04:54,110
In Weka, we can look under "tree" classifiers
for RandomForest.

60
00:04:58,500 --> 00:05:01,080
Again, that's got a bunch of parameters.

61
00:05:01,080 --> 00:05:06,110
The maximum depth of the trees produced -- I
think 0 would be unlimited depth.

62
00:05:06,110 --> 00:05:07,970
The number of features we're going to use.

63
00:05:07,970 --> 00:05:15,810
We might select, say, 4 features; we would
select from the top 4 features -- every time

64
00:05:15,810 --> 00:05:23,560
we decide on the decision to put in the tree,
we select that from among the top 4 candidates.

65
00:05:23,560 --> 00:05:29,170
The number of trees we're going to produce,
and so on.

66
00:05:29,170 --> 00:05:33,190
That's random forests.

67
00:05:33,190 --> 00:05:36,590
Here's another kind of algorithm: it's called
"boosting".

68
00:05:36,590 --> 00:05:42,960
It's iterative: new models are influenced
by the performance of previously built models.

69
00:05:42,960 --> 00:05:48,220
Basically, the idea is that you create a model,
and then you look at the instances that are

70
00:05:48,220 --> 00:05:49,880
misclassified by that model.

71
00:05:49,880 --> 00:05:53,960
These are the hard instances to classify,
the ones it gets wrong.

72
00:05:53,960 --> 00:06:01,060
You put extra weight on those instances to
make a training set for producing the next

73
00:06:01,060 --> 00:06:04,220
model in the iteration.

74
00:06:04,220 --> 00:06:10,100
This encourages the new model to become an
"expert" for instances that were misclassified

75
00:06:10,100 --> 00:06:13,650
by all the earlier models.

76
00:06:13,650 --> 00:06:17,960
The intuitive justification for this is that
in a real life committee, committee members

77
00:06:17,960 --> 00:06:24,960
should complement each other's expertise by
focusing on different aspects of the problem.

78
00:06:25,490 --> 00:06:30,900
In the end, to combine them we use voting,
but we actually weight models according to

79
00:06:30,900 --> 00:06:33,280
their performance.
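
As a rough illustration, here is a Python sketch of one reweighting step and of the weighted vote, in the style of AdaBoost.M1. The function names `boost_round` and `weighted_vote` are assumptions, and the sketch leaves out everything else a real implementation needs (building the models, resampling, stopping rules beyond the basic error check).

```python
import math

def boost_round(weights, correct):
    """One reweighting step in the style of AdaBoost.M1.

    weights: current instance weights.
    correct: parallel list of booleans (True if the model got it right).
    Returns (new_weights, model_weight), or None when the model makes no
    errors or its weighted error is 0.5 or more.
    """
    total = sum(weights)
    error = sum(w for w, ok in zip(weights, correct) if not ok) / total
    if error == 0 or error >= 0.5:
        return None
    factor = error / (1 - error)  # below 1: shrinks correctly classified instances
    new = [w * factor if ok else w for w, ok in zip(weights, correct)]
    scale = total / sum(new)      # renormalise so the total weight is unchanged
    return [w * scale for w in new], -math.log(factor)

def weighted_vote(predictions, model_weights):
    """Each model's vote counts in proportion to its weight."""
    totals = {}
    for label, w in zip(predictions, model_weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

# Four equally weighted instances; the model misclassifies the last one.
new_weights, model_weight = boost_round([1, 1, 1, 1], [True, True, True, False])
print(new_weights, model_weight)
print(weighted_vote(["a", "b", "b"], [3.0, 1.0, 1.0]))
```

After the update, the misclassified instance carries much more weight than the others, so the next model is pushed to get it right; a model with a low error gets a large say in the final vote.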

80
00:06:33,280 --> 00:06:41,440
There's a very good scheme called AdaBoostM1,
which is in Weka and is a standard and very

81
00:06:41,440 --> 00:06:47,260
good boosting implementation -- it often produces
excellent results.

82
00:06:47,980 --> 00:06:54,260
There are a few parameters to this as well;
particularly the number of iterations.

83
00:06:55,800 --> 00:07:00,060
The final ensemble learning method is called
"stacking".

84
00:07:00,060 --> 00:07:04,800
Here we're going to have base learners, just
like the learners we talked about previously.

85
00:07:04,800 --> 00:07:10,050
We're going to combine them not with voting,
but by using a meta-learner, another learning

86
00:07:10,050 --> 00:07:13,020
scheme that combines the output of the base
learners.

87
00:07:14,140 --> 00:07:19,500
We're going to call the base learners level-0
models, and the meta-learner is a level-1 model.

88
00:07:19,500 --> 00:07:24,780
The predictions of the base learners are input
to the meta-learner.

89
00:07:24,780 --> 00:07:28,990
Typically you use different machine learning
schemes as the base learners to get different

90
00:07:28,990 --> 00:07:32,300
experts that are good at different things.

91
00:07:32,300 --> 00:07:37,430
You need to be a little bit careful in the
way you generate data to train the level-1

92
00:07:37,430 --> 00:07:43,080
model: this involves quite a lot of cross-validation;
I won't go into that.
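
For the curious, here is one way the idea can be sketched in Python: the level-1 training data is built from predictions that the level-0 models make on instances they were not trained on. The names `level1_data`, `train_majority` and `train_1nn`, the interleaved fold assignment, and the toy data are all assumptions for illustration; Weka's Stacking uses its own cross-validation procedure.

```python
from collections import Counter

def train_majority(train):
    """Toy level-0 learner: always predicts the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def train_1nn(train):
    """Toy level-0 learner: predicts the label of the nearest training x."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def level1_data(data, base_trainers, folds):
    """Build (base_predictions, true_label) rows for the level-1 model.

    Each instance's base predictions come from models trained on the
    other folds, so no model predicts an instance it has already seen.
    """
    rows = [None] * len(data)
    for f in range(folds):
        held_out = [i for i in range(len(data)) if i % folds == f]
        train = [data[i] for i in range(len(data)) if i % folds != f]
        models = [t(train) for t in base_trainers]
        for i in held_out:
            x, label = data[i]
            rows[i] = ([m(x) for m in models], label)
    return rows

data = [(1, "low"), (2, "low"), (3, "low"), (7, "high"), (8, "high"), (9, "high")]
rows = level1_data(data, [train_majority, train_1nn], folds=3)
for row in rows:
    print(row)
```

The rows printed here are what a level-1 learner would be trained on: the base learners' predictions become the attributes, and the true class stays as the class.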

93
00:07:43,080 --> 00:07:52,090
In Weka, there's a meta classifier called
"Stacking", as well as "StackingC" -- which

94
00:07:52,090 --> 00:07:55,950
is a more efficient version of Stacking.

95
00:07:55,950 --> 00:08:06,400
Here is Stacking; you can choose different
meta-classifiers here, and the number of stacking folds.

96
00:08:07,920 --> 00:08:13,400
We can choose different classifiers: different
level-0 classifiers, and a different meta-classifier.

97
00:08:20,270 --> 00:08:25,990
In order to create multiple level-0 models,
you need to specify a meta-classifier as the

98
00:08:25,990 --> 00:08:28,190
level-0 model.

99
00:08:28,190 --> 00:08:34,579
It gets a little bit complicated; you need
to fiddle around with Weka to get that working.

100
00:08:34,579 --> 00:08:35,190
That's it then.

101
00:08:35,190 --> 00:08:40,430
We've been talking about combining multiple
models into an ensemble

102
00:08:40,430 --> 00:08:43,940
for learning, and the analogy is with committees
of humans.

103
00:08:44,540 --> 00:08:48,420
Diversity helps, especially when learners
are unstable.

104
00:08:49,980 --> 00:08:52,100
And we can create diversity in different ways.

105
00:08:52,100 --> 00:08:56,630
In bagging, we create diversity by resampling
the training set.

106
00:08:56,630 --> 00:09:02,130
In random forests, we create diversity by
choosing alternative branches to put in our

107
00:09:02,130 --> 00:09:03,570
decision trees.

108
00:09:03,570 --> 00:09:09,310
In boosting, we create diversity by focusing
on where the existing model makes errors;

109
00:09:09,310 --> 00:09:14,130
and in stacking, we combine results from a
bunch of different kinds of learner using

110
00:09:14,130 --> 00:09:18,150
another learner, instead of just voting.

111
00:09:18,150 --> 00:09:24,280
There's a chapter in the course text on Ensemble
learning -- it's quite a large topic, really.

112
00:09:24,280 --> 00:09:29,560
There's an activity that you should go and
do before we proceed to the next class, the

113
00:09:29,560 --> 00:09:31,940
last class in this course.

114
00:09:31,940 --> 00:09:36,680
We'll learn about putting it all together,
taking a more global view of the machine learning

115
00:09:36,680 --> 00:09:39,180
process.

116
00:09:39,180 --> 00:09:40,370
We'll see you then.

117
00:09:40,370 --> 00:09:41,820
Bye for now!