1
00:00:19,000 --> 00:00:24,460
Hi! Well, Class 2 of Data Mining with Weka
has just started.

2
00:00:24,469 --> 00:00:27,369
Class 1 has gone by without too many hitches.

3
00:00:27,369 --> 00:00:31,710
I've enjoyed looking at your comments on the
mailing list.

4
00:00:31,710 --> 00:00:35,910
Thank you for answering each other's questions,
and thank you, Peter, for answering so many

5
00:00:35,910 --> 00:00:36,879
technical questions.

6
00:00:36,879 --> 00:00:43,879
I just thought I'd talk briefly about some
of the common issues that have arisen.

7
00:00:45,190 --> 00:00:48,930
First of all, there is the website and how
to get to the course.

8
00:00:48,930 --> 00:00:52,190
Some people were going straight to the YouTube
videos, and if you just go straight to the

9
00:00:52,190 --> 00:00:54,160
YouTube videos you don't see the activities.

10
00:00:54,160 --> 00:00:57,829
You should be seeing this picture when you
go to the course.

11
00:00:57,829 --> 00:01:04,290
This is the website, and you need to go and
look at the course from here.

12
00:01:04,290 --> 00:01:07,280
Go to Class 1 and Class 2 and so on from here.

13
00:01:07,280 --> 00:01:11,479
This is the entry point to the course.

14
00:01:11,479 --> 00:01:16,450
Another problem for some has been installing
Weka on your computer.

15
00:01:16,450 --> 00:01:22,040
I guess I should have said that since Weka
is written in Java you need Java on your computer,

16
00:01:22,040 --> 00:01:26,820
or you need to install Java first, or install
Java as part of Weka.

17
00:01:26,820 --> 00:01:31,150
One of the problems people were having is
that they didn't have Java installed.

18
00:01:31,150 --> 00:01:35,369
Let me just show you how to test for whether
you have Java installed or not.

19
00:01:35,369 --> 00:01:41,810
If you go to your Windows Start Menu -- this
is just on Windows -- if you type 'cmd' you

20
00:01:41,810 --> 00:01:47,350
get a command line, or command window.

21
00:01:47,350 --> 00:01:52,119
I call this the "black screen of death," actually:
we often don't like to see this.

22
00:01:52,119 --> 00:02:01,990
But if you just simply type 'java' and it
comes back with this, then Java is installed

23
00:02:01,990 --> 00:02:03,409
on your computer.

24
00:02:03,409 --> 00:02:08,289
If it comes back with 'cannot find Java' or
something like that, then you need to first

25
00:02:08,289 --> 00:02:13,360
of all figure out how to get Java going on
your computer, and then get Weka going.
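
The same check can be scripted rather than typed by hand. This is only a minimal sketch, not part of the course: the function name `java_is_installed` is made up for illustration, and it runs `java -version` (an assumption; in the video just `java` is typed) because that command prints the version and exits cleanly.

```python
import shutil
import subprocess


def java_is_installed() -> bool:
    """Return True if a `java` executable is on the PATH and runs."""
    java_path = shutil.which("java")
    if java_path is None:
        # Same situation as the command window saying it cannot find 'java'.
        return False
    try:
        # `java -version` prints version details and exits with status 0.
        result = subprocess.run(
            [java_path, "-version"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if java_is_installed():
        print("Java is installed.")
    else:
        print("Java was not found; install Java before running Weka.")
```

If this reports that Java was not found, sort out Java first and then come back to Weka.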

26
00:02:13,360 --> 00:02:16,950
So, I just thought I'd mention that.

27
00:02:16,950 --> 00:02:22,060
A number of people have made comments about
the book.

28
00:02:22,060 --> 00:02:26,290
Someone asked if it was really necessary to
do the readings; they were finding the course

29
00:02:26,290 --> 00:02:28,010
quite easy.

30
00:02:28,010 --> 00:02:31,319
The answer is that it's not really necessary
to do the readings, and you're *supposed*

31
00:02:31,319 --> 00:02:32,950
to find the course easy.

32
00:02:32,950 --> 00:02:39,900
It might get a little tougher in the weeks
to come, but still, it's a pretty easy course.

33
00:02:39,900 --> 00:02:44,120
The readings are there for additional background,
and you certainly shouldn't feel that you

34
00:02:44,120 --> 00:02:45,510
have to do them at all.

35
00:02:45,510 --> 00:02:47,920
You can do the whole course without looking
at the book.

36
00:02:47,920 --> 00:02:52,599
We're interested in ensuring that people at
all sorts of different levels who start out

37
00:02:52,599 --> 00:02:56,090
this course can succeed.

38
00:02:56,090 --> 00:02:58,080
You don't have to read the book.

39
00:02:58,080 --> 00:03:01,870
Someone else asked if the second edition of
the book is ok.

40
00:03:01,870 --> 00:03:06,620
This is the second edition of Data Mining
with Weka: you can see the cover here.

41
00:03:06,620 --> 00:03:13,620
I kind of like this one -- it's got a chameleon
hidden here amongst New Zealand fern leaves.

42
00:03:14,750 --> 00:03:23,330
The third edition, this one here, is the latest
edition -- this has got a tiger hidden in

43
00:03:23,330 --> 00:03:25,870
the grass.

44
00:03:25,870 --> 00:03:28,150
The answer is: the second edition is fine.

45
00:03:28,150 --> 00:03:34,430
Either of those editions is just fine if
you're looking at the readings.

46
00:03:34,430 --> 00:03:38,849
Someone else said "I hate having to read it
online".

47
00:03:38,849 --> 00:03:44,000
I completely agree with you! I would love
to be able to provide you with a free physical

48
00:03:44,000 --> 00:03:50,159
copy of the whole book, but unfortunately
I'm not able to do that.

49
00:03:50,159 --> 00:03:51,860
The realities of publishing.

50
00:03:51,860 --> 00:03:56,769
I guess the publisher is trying to increase
sales, and they're hoping that you will be

51
00:03:56,769 --> 00:04:00,750
tempted to go out and buy a copy and recommend
it to your friends.

52
00:04:00,750 --> 00:04:04,549
This book makes a great Christmas present,
by the way -- so you can give a copy to all

53
00:04:04,549 --> 00:04:10,420
of your friends at Christmas! We can't provide
you with a complete PDF file that you can take

54
00:04:10,420 --> 00:04:12,790
away, and we can't provide you with a physical
copy.

55
00:04:12,790 --> 00:04:18,280
I'm really sorry about that, but it's just
the way it is.

56
00:04:18,280 --> 00:04:21,599
The next thing that I wanted to mention was
the irises.

57
00:04:21,599 --> 00:04:23,199
This is kind of funny, I thought.

58
00:04:23,199 --> 00:04:30,199
Let me just go to Class 1 here.

59
00:04:30,939 --> 00:04:34,550
This, of course, is how you're supposed to
be looking at the course.

60
00:04:34,550 --> 00:04:40,169
This is Lesson 1.3, and this is where you
can watch the video from.

61
00:04:40,169 --> 00:04:46,339
Then, if you go to the activity -- that's
this menu item here -- this gives you the

62
00:04:46,339 --> 00:04:47,009
activity.

63
00:04:47,009 --> 00:04:53,559
We had a little bit of a problem with this
question, with these iris pictures.

64
00:04:53,559 --> 00:05:02,690
Originally, we had these a's, b's and c's
permuted in a different order for each of the

65
00:05:02,699 --> 00:05:05,580
possible answers.

66
00:05:05,580 --> 00:05:10,110
We were trying to make sure that you really
concentrate on reading these answers and don't

67
00:05:10,110 --> 00:05:12,639
just quickly scan through them.

68
00:05:12,639 --> 00:05:20,130
We were hoist by our own petard! -- that's
an English phrase that means you're injured

69
00:05:20,139 --> 00:05:25,119
by the device that you intended to use to
injure others, or, in our case, confused by

70
00:05:25,119 --> 00:05:29,349
the device that we intended to confuse you.

71
00:05:29,349 --> 00:05:34,180
We screwed up, and a couple of our answers
were permuted versions of the same thing.

72
00:05:34,180 --> 00:05:35,529
We've fixed that now.

73
00:05:35,529 --> 00:05:40,210
This is the current page, so they are all
a, b, c, in the right order.

74
00:05:40,210 --> 00:05:46,309
We thought we should make it simpler, because
it seems like even we couldn't understand

75
00:05:46,309 --> 00:05:47,330
the way we had it originally.

76
00:05:47,330 --> 00:05:51,639
I thought that was quite funny actually.

77
00:05:51,639 --> 00:05:53,930
The next thing is about the algorithms.

78
00:05:53,930 --> 00:05:57,909
People want to learn about the details of
the algorithms and how they work.

79
00:05:57,909 --> 00:06:02,749
Are you going to learn about those? Is there
a MOOC class that goes into the algorithms

80
00:06:02,749 --> 00:06:07,939
provided by Weka, rather than the mechanics
of running it?

81
00:06:07,939 --> 00:06:12,939
The answer is "yes": you will be learning
something about these algorithms.

82
00:06:12,939 --> 00:06:17,509
I've put the syllabus here; it's on the course
webpage.

83
00:06:17,509 --> 00:06:20,050
You can see from this syllabus what we're
going to be doing.

84
00:06:20,050 --> 00:06:26,559
We're going to be looking at, for example,
the J48 algorithm for decision trees and pruning

85
00:06:26,559 --> 00:06:32,529
decision trees; the nearest neighbor algorithm
for instance-based learning; and linear regression;

86
00:06:32,529 --> 00:06:33,710
classification by regression.

87
00:06:33,710 --> 00:06:40,360
We'll look at quite a few algorithms in Classes
3 and 4.

88
00:06:40,360 --> 00:06:46,330
I'm not going to tell you about the algorithms
in gory detail, however: they can get quite

89
00:06:46,330 --> 00:06:47,020
tricky inside.

90
00:06:47,020 --> 00:06:52,789
What I want to do is to communicate the overall
way that they work -- the idea behind the

91
00:06:52,789 --> 00:06:55,899
algorithms -- rather than the details.

92
00:06:55,899 --> 00:07:00,159
The book does give you full details of exactly
how these algorithms work inside.

93
00:07:00,159 --> 00:07:03,499
We're not going to be able to cover them in
that much detail in the course, but we will

94
00:07:03,499 --> 00:07:08,999
be talking about how the algorithms work and
what they do.

95
00:07:08,999 --> 00:07:14,709
What I forgot to say when I was talking about
those irises is that someone pointed out that the

96
00:07:14,709 --> 00:07:18,580
Iris Versicolor is Quebec's floral emblem.

97
00:07:18,580 --> 00:07:21,499
Thank you very much for pointing that out!
I didn't know that.

98
00:07:21,499 --> 00:07:27,600
I lived in Canada for 11 years, and I didn't
know that the Iris Versicolor was Quebec's flower.

99
00:07:27,600 --> 00:07:32,249
That was very nice to learn; thank you.

100
00:07:32,249 --> 00:07:36,529
The next thing I want to talk about: someone
asked about using Naive Bayes.

101
00:07:36,529 --> 00:07:42,559
How can we use the NaiveBayes classifier algorithm
on a dataset, and how can we test for particular

102
00:07:42,559 --> 00:07:46,039
data, whether it fits into particular classes?

103
00:07:46,039 --> 00:07:50,439
Let me go to Weka here.

104
00:07:50,439 --> 00:07:56,189
We're going to be covering this in future
lessons -- Lesson 3.3 on Naive Bayes and so

105
00:07:56,189 --> 00:07:57,689
on -- but I'll just show you.

106
00:07:57,689 --> 00:07:58,989
All of this is very easy.

107
00:07:58,989 --> 00:08:03,349
If I go to Classify, and I want to run Naive
Bayes, I just need to find NaiveBayes.

108
00:08:03,349 --> 00:08:09,899
I happen to know it's in the bayes section,
and I can run it here.

109
00:08:09,899 --> 00:08:10,520
Just like that.

110
00:08:10,520 --> 00:08:12,009
We've just run NaiveBayes.

111
00:08:12,009 --> 00:08:19,009
I'll be doing this more slowly and looking
more at the output in Lesson 3.3.

112
00:08:19,699 --> 00:08:24,839
A natural thing to ask is if you had a particular
test instance, which way would Naive Bayes

113
00:08:24,839 --> 00:08:28,699
classify it, or any other kind of classifier?

114
00:08:28,699 --> 00:08:35,699
This is the weather data we're using here,
and I've created a file and called it weather.one.day.arff.

115
00:08:37,190 --> 00:08:43,120
It's a standard ARFF file, and I got it by
editing the weather.nominal.arff file.

116
00:08:43,120 --> 00:08:45,300
You can see that I've just got one day here.

117
00:08:45,300 --> 00:08:49,199
I've got the same header as for the regular
weather file and just one day -- but I could

118
00:08:49,199 --> 00:08:51,069
have several days if I wanted.

119
00:08:51,069 --> 00:08:56,800
I've put a question mark for the class, because
I want to know what class is predicted for that.
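
A one-day file like this can also be generated instead of edited by hand. The sketch below is an illustration under assumptions: the function name `build_one_day_arff` is hypothetical, the relation name `weather.symbolic` and the attribute values are taken to match the nominal weather file, and the day shown (sunny, cool, high, TRUE) is an example, not necessarily the day in the video.

```python
# Attribute names and allowed values, assumed to match weather.nominal.arff.
ATTRIBUTES = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": ["TRUE", "FALSE"],
    "play": ["yes", "no"],
}


def build_one_day_arff(outlook, temperature, humidity, windy):
    """Return ARFF text for one day whose class value is unknown ('?')."""
    day = {
        "outlook": outlook,
        "temperature": temperature,
        "humidity": humidity,
        "windy": windy,
    }
    for name, value in day.items():
        if value not in ATTRIBUTES[name]:
            raise ValueError(f"{value!r} is not a valid value for {name}")

    # Same header as the training data, so the test set is compatible.
    lines = ["@relation weather.symbolic", ""]
    for name, values in ATTRIBUTES.items():
        lines.append(f"@attribute {name} {{{', '.join(values)}}}")
    # The question mark marks the class as missing, to be predicted.
    lines += ["", "@data", ",".join(day.values()) + ",?"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("weather.one.day.arff", "w") as f:
        f.write(build_one_day_arff("sunny", "cool", "high", "TRUE"))
```

More days can be added as extra rows under `@data`, each ending in a question mark.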

120
00:08:56,800 --> 00:09:02,680
We'll be talking about this in Lesson 2.1
-- you're probably doing it right now -- but

121
00:09:02,680 --> 00:09:05,579
we can use a "supplied test set".

122
00:09:05,579 --> 00:09:11,700
I'm going to set that one that I created,
which I called weather.one.day.arff,

123
00:09:11,700 --> 00:09:16,649
as my test set.

124
00:09:16,649 --> 00:09:20,899
I can run this and it will evaluate it on
the test set.

125
00:09:20,899 --> 00:09:25,970
On the "More options..." menu -- you'll be
learning about this in Lesson 4.3 -- there's

126
00:09:25,970 --> 00:09:29,160
an "Output predictions" option, here.

127
00:09:29,160 --> 00:09:40,490
If I now run it and look up here, I will find
instance number 1, the actual class was "?" -- I

128
00:09:40,490 --> 00:09:46,290
showed you that, that was what was in the
ARFF file -- and the predicted class is "no".

129
00:09:46,290 --> 00:09:48,860
There's some other information.

130
00:09:48,860 --> 00:09:54,329
This is how I can find out what predictions
would be on new test data.
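
To give a rough feel for what is happening behind that prediction, here is a small Naive Bayes sketch on the 14-instance nominal weather data. It is a simplified illustration and not Weka's code: it assumes add-one (Laplace) smoothing of the counts, the names `train` and `predict` are made up, and the test day (sunny, cool, high, TRUE) is a hypothetical example rather than the day in the video.

```python
from collections import Counter, defaultdict

# Number of possible values per attribute: outlook, temperature, humidity, windy.
NUM_VALUES = [3, 3, 2, 2]

# The standard nominal weather data; the last field is the class (play).
DATA = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def train(rows):
    """Count classes, and attribute values within each class."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> counts
    for row in rows:
        label = row[-1]
        for i, value in enumerate(row[:-1]):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict(model, features):
    """Return the most probable class and the normalized class probabilities."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        # Smoothed prior times smoothed per-attribute likelihoods.
        score = (n + 1) / (total + len(class_counts))
        for i, value in enumerate(features):
            score *= (value_counts[(i, label)][value] + 1) / (n + NUM_VALUES[i])
        scores[label] = score
    norm = sum(scores.values())
    probabilities = {label: s / norm for label, s in scores.items()}
    return max(scores, key=scores.get), probabilities


if __name__ == "__main__":
    label, probabilities = predict(train(DATA), ("sunny", "cool", "high", "TRUE"))
    print(label, probabilities)
```

For this example day the sketch picks "no", because the smoothed score for "no" is larger than the one for "yes".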

131
00:09:54,329 --> 00:09:59,830
Actually, there's nothing stopping me from
setting as my test file

132
00:09:59,830 --> 00:10:02,010
the same as the training file.

133
00:10:02,019 --> 00:10:08,560
I can use weather.nominal.arff as my test
file, and run it again.

134
00:10:08,560 --> 00:10:15,490
Now, I can see these are the 14 instances
in the standard weather data.

135
00:10:15,490 --> 00:10:22,490
This is their actual class, this is the predicted
class, predicted by, in this case, Naive Bayes.

136
00:10:23,990 --> 00:10:28,300
There's a mark in this column whenever there's
an error, whenever the actual class differs

137
00:10:28,300 --> 00:10:30,269
from the predicted class.
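
The idea of that error column can be sketched in a few lines. This is a hypothetical formatter, not Weka's output code: the name `format_predictions`, the column layout, and the use of "+" as the error mark are assumptions for illustration, and an unknown actual class ("?") is assumed never to count as an error.

```python
def format_predictions(actual, predicted):
    """List actual and predicted classes, marking each disagreement with '+'."""
    lines = [f"{'inst#':>5}  {'actual':<8}{'predicted':<10}error"]
    for number, (a, p) in enumerate(zip(actual, predicted), start=1):
        # An unknown actual class ('?') cannot be an error.
        mark = "+" if a != "?" and a != p else ""
        lines.append(f"{number:>5}  {a:<8}{p:<10}{mark}".rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical labels, only to show the layout.
    print(format_predictions(["yes", "no", "?"], ["yes", "yes", "no"]))
```

Here only the second instance would get a mark, since that is the only row where a known actual class differs from the prediction.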

138
00:10:30,269 --> 00:10:36,490
Again, we get that by, in the "More options..."
menu, checking "Output predictions".

139
00:10:36,490 --> 00:10:40,129
We're going to talk about that in other lessons.

140
00:10:40,129 --> 00:10:44,790
I just wanted to show you that it's very easy
to do these things in Weka.

141
00:10:44,790 --> 00:10:49,790
The final thing I just wanted to mention is,
if you're configuring a classifier -- any

142
00:10:49,790 --> 00:10:53,529
classifier, or indeed any filter -- there
are these buttons at the bottom.

143
00:10:53,529 --> 00:11:00,110
There's an "Open" and "Save" button, as well
as the OK button that we normally use.

144
00:11:00,110 --> 00:11:05,990
These buttons are not about opening files
in the Explorer, they're about saving configured

145
00:11:05,990 --> 00:11:07,470
classifiers.

146
00:11:07,470 --> 00:11:13,060
So you could set parameters here and save
that configuration with a name and a file,

147
00:11:13,060 --> 00:11:14,519
and then open it later on.

148
00:11:14,519 --> 00:11:20,339
We don't do that in this course, so we never
use these Open and Save buttons here in the

149
00:11:20,339 --> 00:11:22,680
GenericObjectEditor.

150
00:11:22,680 --> 00:11:29,680
This is the GenericObjectEditor that I get
by clicking a classifier or filter.

151
00:11:29,839 --> 00:11:32,230
Just ignore the Open and Save buttons here.

152
00:11:32,230 --> 00:11:37,439
They do not open ARFF files for you.

153
00:11:37,439 --> 00:11:39,199
That's all I wanted to say.

154
00:11:39,199 --> 00:11:40,339
Carry on with Class 2.

155
00:11:40,339 --> 00:11:42,379
It's great to see so many people doing this
course.

156
00:11:42,379 --> 00:11:46,400
Keep having fun, and we'll talk to you later.

157
00:11:46,400 --> 00:11:47,500
Bye for now!

