1
00:00:02,280 --> 00:00:09,280

2
00:00:17,880 --> 00:00:19,539
Hi! Welcome to the course 

3
00:00:19,539 --> 00:00:24,489
Data Mining with Weka. 
I'm Ian Witten from the University of Waikato

4
00:00:24,489 --> 00:00:28,399
in New Zealand and I'm presenting the
videos for this course which is being prepared

5
00:00:28,399 --> 00:00:30,779
by the Department of Computer Science

6
00:00:30,779 --> 00:00:35,940
at the University of Waikato. 
Data mining is a mature technology that a lot 

7
00:00:35,940 --> 00:00:36,680
of people are

8
00:00:36,680 --> 00:00:40,540
beginning to take very seriously, 
and a lot of other people find it very mysterious. 

9
00:00:40,540 --> 00:00:44,739
The real aim of this course is 
to take the mystery out of data 

10
00:00:44,739 --> 00:00:48,149
mining. This is a practical 
course on how to use the Weka 

11
00:00:48,149 --> 00:00:51,469
workbench, which you will
download as part of the course,

12
00:00:51,469 --> 00:00:55,890
for data mining. We explain
the basic principles of several popular 

13
00:00:55,890 --> 00:00:59,629
algorithms and how to 
use them in practical applications.

14
00:00:59,629 --> 00:01:03,440
In the world today, we're overwhelmed with data:

15
00:01:03,440 --> 00:01:07,310
every time you swipe your credit card, 
every item you check out at the

16
00:01:07,310 --> 00:01:08,229
supermarket,

17
00:01:08,229 --> 00:01:12,429
every time you send a text, 
make a phone call, or send an email, 


18
00:01:12,429 --> 00:01:16,569
or type a key on a computer even, 
every time you walk past a security camera.

19
00:01:16,569 --> 00:01:20,659
It all generates a little bit of
data in a database. Data mining is about 

20
00:01:20,659 --> 00:01:21,290
going

21
00:01:21,290 --> 00:01:25,310
from the raw data to information.
Information that can be used to make

22
00:01:25,310 --> 00:01:26,189
predictions

23
00:01:26,189 --> 00:01:30,970
that are useful in the real world. Let me

24
00:01:30,970 --> 00:01:35,680
give you an example. 
You're at the supermarket checkout. 

25
00:01:35,680 --> 00:01:39,820
The till records every item you bought. 

26
00:01:39,820 --> 00:01:44,299
At the end, you hand them your loyalty 
card, and they give you a couple of percent off, 

27
00:01:44,299 --> 00:01:45,200
and 

28
00:01:45,200 --> 00:01:48,770
you give them your name and address 
and, indirectly, access to all sorts of

29
00:01:48,770 --> 00:01:50,579
demographic information about you

30
00:01:50,579 --> 00:01:54,110
and people like you.
Everybody likes a good bargain. 

31
00:01:54,110 --> 00:01:58,149
It's been a good day today, 
because thanks to those coupons they sent you in

32
00:01:58,149 --> 00:01:59,109
the mail last week,

33
00:01:59,109 --> 00:02:02,149
you've been able to stock up on some 
things you wouldn't normally have bought, 

34
00:02:02,149 --> 00:02:07,310
but that you bought today because
they are such a good deal. Next week, they'll send 

35
00:02:07,310 --> 00:02:08,690
you some more coupons, and 

36
00:02:08,690 --> 00:02:13,160
you'll go shopping again and buy some 
more stuff. They do some experiments on you,

37
00:02:13,160 --> 00:02:16,150
you know, they try to figure out 
how much more you would buy if the

38
00:02:16,150 --> 00:02:17,989
price was just that little bit less.

39
00:02:17,989 --> 00:02:22,250
These coupons are kind of a mechanism
for personalized pricing.

40
00:02:22,250 --> 00:02:25,540
They have access to all sorts of data 

41
00:02:25,540 --> 00:02:29,940
from you and people like you, in order 
to do these experiments and figure these things 

42
00:02:29,940 --> 00:02:30,519
out.

43
00:02:30,519 --> 00:02:34,970
Everybody wins: you get your bargains, they sell more stuff. 

44
00:02:34,970 --> 00:02:38,139
It sounds like a good deal 
to me. Here's another application. 

45
00:02:38,139 --> 00:02:41,810
Suppose you and your partner 
want a child, but you can't have one. 

46
00:02:41,810 --> 00:02:45,480
It's fun trying, but it can get 

47
00:02:45,480 --> 00:02:49,510
a little bit frustrating, and, ultimately, 
very frustrating, perhaps even tragic. 

48
00:02:49,510 --> 00:02:52,590
In artificial insemination, they 

49
00:02:52,590 --> 00:02:58,100
take some eggs from the woman's ovaries, 
and then they fertilize them with partner or donor 

50
00:02:58,100 --> 00:03:03,669
sperm, and then, they select
from amongst the embryos produced 

51
00:03:03,669 --> 00:03:06,680
some to implant back into the womb. 

52
00:03:06,680 --> 00:03:10,090
You want to select the ones 
with the best chance of success 

53
00:03:10,090 --> 00:03:13,310
of producing a live birth,
but you don't want too many 

54
00:03:13,310 --> 00:03:18,260
live births. The embryologist 
has access to all sorts of data 

55
00:03:18,260 --> 00:03:22,150
on these embryos. I think there are 50-100 

56
00:03:22,150 --> 00:03:26,430
pieces of information that they record 
about individual embryos, and they have

57
00:03:26,430 --> 00:03:27,190
historical

58
00:03:27,190 --> 00:03:31,120
data on which ones produced a live birth,

59
00:03:31,120 --> 00:03:35,079
success. So here's an ideal
situation for data mining. We have lots of 

60
00:03:35,079 --> 00:03:36,290
historical data,

61
00:03:36,290 --> 00:03:40,209
we have data on the present
situation, and we want to select

62
00:03:40,209 --> 00:03:44,540
those embryos that have the best 
chance of success. Now, that's a good application 

63
00:03:44,540 --> 00:03:45,419
of data mining, 

64
00:03:45,419 --> 00:03:49,310
bringing a live child to a couple who
want one.

65
00:03:49,310 --> 00:03:55,859
I talk about data mining 
and machine learning. Data mining is the 

66
00:03:55,859 --> 00:03:56,989
application,

67
00:03:56,989 --> 00:04:00,970
and machine learning
is the algorithms we use. We're talking about using 

68
00:04:00,970 --> 00:04:02,690
machine learning algorithms

69
00:04:02,690 --> 00:04:06,430
for the purposes of data mining.

70
00:04:06,430 --> 00:04:09,579
This is Data Mining with Weka, 
so the next question is “What's Weka?” 

71
00:04:09,579 --> 00:04:13,130
This is a weka here, this little bird. 

72
00:04:13,130 --> 00:04:18,030
It's a flightless bird, kind 
of like its better-known cousin

73
00:04:18,030 --> 00:04:21,470
the kiwi, found only in the islands of New Zealand. 

74
00:04:21,470 --> 00:04:27,090
This is what it sounds like, 

75
00:04:27,090 --> 00:04:30,180
coming to you from New Zealand.

76
00:04:30,180 --> 00:04:34,780
However, in our context, Weka is a 
data mining workbench. It's an acronym for the

77
00:04:34,780 --> 00:04:35,610

78
00:04:35,610 --> 00:04:40,150
Waikato Environment for Knowledge Analysis. 
We just call it Weka. 

79
00:04:40,150 --> 00:04:43,719
It contains a large number 
of algorithms for classification, 

80
00:04:43,719 --> 00:04:47,590
and a lot of algorithms for 
data preprocessing, feature selection, 

81
00:04:47,590 --> 00:04:48,689
clustering, 

82
00:04:48,689 --> 00:04:51,969
finding association rules, 
things like that. It's a very 

83
00:04:51,969 --> 00:04:55,800
comprehensive workbench,
and it's free, open source software 

84
00:04:55,800 --> 00:04:59,110
that you will download as part 
of this course in the next lesson. 

85
00:04:59,110 --> 00:05:02,289
It runs on

86
00:05:02,289 --> 00:05:06,819
any computer. It's written in Java, 
runs on Linux, Windows, Mac.

87
00:05:06,819 --> 00:05:10,110
You'll be able to download it 
and run it on your workstation and use it 

88
00:05:10,110 --> 00:05:15,650
during the course. You're going to learn 
how to load data into Weka and look at it; 

89
00:05:15,650 --> 00:05:19,419
you're going to learn about preprocessing, 
cleaning up data using filters;

90
00:05:19,419 --> 00:05:24,069
exploring it using visualizations, 
applying classification algorithms,

91
00:05:24,069 --> 00:05:27,620
interpreting the output,
understanding evaluation methods, 

92
00:05:27,620 --> 00:05:32,300
evaluation is very important in this area, 
understand various representations for 

93
00:05:32,300 --> 00:05:33,879
models, and how popular 

94
00:05:33,879 --> 00:05:37,810
machine learning algorithms work, 
and be aware of common pitfalls

95
00:05:37,810 --> 00:05:42,550
with data mining. The ultimate goal 
really is to empower you to use Weka on your

96
00:05:42,550 --> 00:05:46,610
own data, and, most importantly,
to understand what it is you are doing.

97
00:05:46,610 --> 00:05:51,210
This is the first class. In this class, 

98
00:05:51,210 --> 00:05:54,860
you're going to get started with Weka. 
You're going to install it;

99
00:05:54,860 --> 00:05:58,400
you're going to explore the Weka Explorer interface 

100
00:05:58,400 --> 00:06:02,449
and explore some data sets; build a classifier; 

101
00:06:02,449 --> 00:06:05,669
interpret the output of the classifier; use filters; 

102
00:06:05,669 --> 00:06:08,729
and visualize your data set. 
There's lots of things to do 

103
00:06:08,729 --> 00:06:11,879
in this class. Here's the structure of the course. 

104
00:06:11,879 --> 00:06:16,090
There are five classes altogether. Each class 

105
00:06:16,090 --> 00:06:20,360
consists of about six lessons.

106
00:06:20,360 --> 00:06:24,340
Class 1 is Getting started with Weka. 
Then, we're going to look at Evaluation 

107
00:06:24,340 --> 00:06:26,159
in Class 2, 

108
00:06:26,159 --> 00:06:30,249
Simple classifiers in Class 3, More 
classifiers in Class 4, and Putting it all 

109
00:06:30,249 --> 00:06:30,840
together

110
00:06:30,840 --> 00:06:34,370
in Class 5. These are the six lessons 

111
00:06:34,370 --> 00:06:39,189
in Class 1. Each lesson comprises a short video,

112
00:06:39,189 --> 00:06:42,879
5-10 minutes, like this one, followed by an activity.

113
00:06:42,879 --> 00:06:46,300
An activity that involves you doing something yourself. 

114
00:06:46,300 --> 00:06:49,800
You don't learn by me talking to you.
You learn by actually doing things.

115
00:06:49,800 --> 00:06:51,680
So, we have lots of activities for you 

116
00:06:51,680 --> 00:06:55,800
that involve using the Weka workbench.
In the middle of the class is a mid-class

117
00:06:55,800 --> 00:06:57,680
assessment, and at the end there is 

118
00:06:57,680 --> 00:07:02,150
a post-class assessment. The marks 
for these are combined, and if you get 

119
00:07:02,150 --> 00:07:05,830
more than 70%, you will get 
a signed certificate from the 

120
00:07:05,830 --> 00:07:07,199
University of Waikato 

121
00:07:07,199 --> 00:07:10,309
certifying that you have completed this course.

122
00:07:10,309 --> 00:07:13,330
The activities are an important part 
of the course, but they are not part of the 

123
00:07:13,330 --> 00:07:14,159
assessment. 

124
00:07:14,159 --> 00:07:17,409
We really think you should do the 
activities, but you don't have to do them 

125
00:07:17,409 --> 00:07:18,800
for assessment purposes.

126
00:07:18,800 --> 00:07:22,809
It's up to you. As well as that,
associated with the course is a textbook

127
00:07:22,809 --> 00:07:26,759
called Data Mining. It discusses 
data mining and Weka 

128
00:07:26,759 --> 00:07:30,620
in depth. It's a great book; I know 
it's a great book because I wrote it 

129
00:07:30,620 --> 00:07:31,400
myself

130
00:07:31,400 --> 00:07:35,460
with a couple of friends. The publisher 
has kindly agreed to make available 

131
00:07:35,460 --> 00:07:37,330
a large chunk of this textbook to you

132
00:07:37,330 --> 00:07:41,080
online so that you can use 
it for background reading. 

133
00:07:41,080 --> 00:07:45,069
It's only background reading; 
you don't have to read the textbook. 

134
00:07:45,069 --> 00:07:48,860
It's just if you want to delve into 
some of the ideas and algorithms in 

135
00:07:48,860 --> 00:07:49,559
more depth. 

136
00:07:49,559 --> 00:07:52,610
That's what it's there for. What you 

137
00:07:52,610 --> 00:07:55,659
need to do are the activities and the assessments, 

138
00:07:55,659 --> 00:07:59,939
and watch the videos, of course. That's it. 
I just thought I'd show you where I am.

139
00:07:59,939 --> 00:08:03,629
I'm in New Zealand, that's where Weka is from. 

140
00:08:03,629 --> 00:08:07,699
That's where I'm sitting right now. 
This is the world as we see it in New Zealand. 

141
00:08:07,699 --> 00:08:10,729
We're at the top, you're probably 
down at the bottom somewhere.

142
00:08:10,729 --> 00:08:14,749
We're at the top, in the center, and that
arrow to the North Island of New Zealand 

143
00:08:14,749 --> 00:08:18,529
is where the University of Waikato is.

144
00:08:18,529 --> 00:08:22,139
That's it for now. There is an 
activity associated with this lesson, 

145
00:08:22,139 --> 00:08:26,710
so go ahead and do it. Of course, you 
haven't learned very much in this lesson, so 

146
00:08:26,710 --> 00:08:28,589
it's not a very important activity.

147
00:08:28,589 --> 00:08:31,599
Don't worry about it too much. 
You're not expected to do a lot of reading to do 

148
00:08:31,599 --> 00:08:32,539
this activity.

149
00:08:32,539 --> 00:08:35,669
Just have a go and see how you get on, 

150
00:08:35,669 --> 00:08:40,099
and I'll see you again in the next 
lesson. I'm looking forward to that. 

151
00:08:40,099 --> 00:08:44,460
Goodbye for now.
