1
00:00:18,419 --> 00:00:22,310
Hi! Welcome back for another
five minutes in New Zealand

2
00:00:22,310 --> 00:00:24,330
with Data Mining with Weka.

3
00:00:24,330 --> 00:00:28,499
This is Lesson 1.3, and we're going
to look at exploring datasets

4
00:00:28,499 --> 00:00:32,230
in this lesson.

5
00:00:32,230 --> 00:00:36,130
We looked at this data file in the 
last lesson. It's the

6
00:00:36,130 --> 00:00:38,280
weather data

7
00:00:38,280 --> 00:00:42,580
toy dataset, of course. It has
fourteen days, or

8
00:00:42,580 --> 00:00:46,149
instances, and each instance, 
each day, is described by

9
00:00:46,149 --> 00:00:47,750
five attributes,

10
00:00:47,750 --> 00:00:49,070
four to do with the weather, and

11
00:00:49,070 --> 00:00:51,570
the last attribute,

12
00:00:51,570 --> 00:00:53,809
which we called the class value,

13
00:00:53,809 --> 00:00:57,490
the thing that we're trying to 
predict, whether or not to play this

14
00:00:57,490 --> 00:00:59,860
unspecified game.

15
00:00:59,860 --> 00:01:03,300
This is called a classification problem.

16
00:01:03,300 --> 00:01:05,330
We're trying to predict the class value.

17
00:01:05,330 --> 00:01:07,340
Let's open up Weka.

18
00:01:07,340 --> 00:01:09,320
It's here on my desktop.

19
00:01:09,320 --> 00:01:11,340
I'm going to go into the Explorer.

20
00:01:11,340 --> 00:01:13,150
We always use the Explorer.

21
00:01:13,150 --> 00:01:15,560
I'm going to open the file.

22
00:01:15,560 --> 00:01:20,460
I put the datasets in the My Documents folder,
so I can see them here.

23
00:01:20,460 --> 00:01:21,400
Just open

24
00:01:21,400 --> 00:01:26,100
the Weka datasets and 
the nominal weather data.

25
00:01:26,100 --> 00:01:30,070
There's the weather data in Weka.

26
00:01:30,070 --> 00:01:31,430
As we saw last time,

27
00:01:31,430 --> 00:01:33,190


28
00:01:33,190 --> 00:01:37,140
you can see the size of the dataset, 
the number of instances—fourteen—

29
00:01:37,140 --> 00:01:39,270
you can see the attributes,

30
00:01:39,270 --> 00:01:41,900
you can click any of these attributes

31
00:01:41,900 --> 00:01:44,370
and get the values for those attributes

32
00:01:44,370 --> 00:01:46,700
up here in this panel.

33
00:01:46,700 --> 00:01:52,280
You also get at the bottom
a histogram of the attribute values

34
00:01:52,280 --> 00:01:55,390
with respect to the different
class values. The different class

35
00:01:55,390 --> 00:01:56,790
values are

36
00:01:56,790 --> 00:02:00,490
blue for yes, play and

37
00:02:00,490 --> 00:02:03,480
red for no, don't play.

38
00:02:03,480 --> 00:02:04,420
By default,

39
00:02:04,420 --> 00:02:07,370
the last attribute in Weka is taken to be the class value.

40
00:02:07,370 --> 00:02:10,940
You can change this if you like. If you
change it here you can decide to

41
00:02:10,940 --> 00:02:17,549
predict an attribute other than the last
one.

42
00:02:17,549 --> 00:02:23,549
That's the weather dataset, and
we've already explored that.

43
00:02:23,549 --> 00:02:27,540
As I said, it's a classification problem,
sometimes called a supervised learning

44
00:02:27,540 --> 00:02:29,310
problem. Supervised

45
00:02:29,310 --> 00:02:30,770
because you get to know the

46
00:02:30,770 --> 00:02:34,379
class values of the training instances.

47
00:02:34,379 --> 00:02:38,639
We take as input a dataset
of classified examples;

48
00:02:38,639 --> 00:02:42,199
these examples are independent 
examples with a class value attached.

49
00:02:42,199 --> 00:02:43,339


50
00:02:43,339 --> 00:02:47,089
The idea is to produce automatically 

51
00:02:47,089 --> 00:02:48,409
some kind of model 

52
00:02:48,409 --> 00:02:50,629
that can classify new examples. 

53
00:02:50,629 --> 00:02:52,959
That's the classification problem. 

54
00:02:52,959 --> 00:02:57,259
Here is what the examples 
look like. This is an instance, with 

55
00:02:57,259 --> 00:02:59,389
the different attribute values,

56
00:02:59,389 --> 00:03:01,019
a fixed set of features,

57
00:03:01,019 --> 00:03:02,290
and then we add to that

58
00:03:02,290 --> 00:03:05,589
the class to get the classified example. 

59
00:03:05,589 --> 00:03:10,499
That's what we have to 
have in our training dataset. 

60
00:03:10,499 --> 00:03:11,360


61
00:03:11,360 --> 00:03:14,920
These attributes or features 
can be discrete or continuous. 

62
00:03:14,920 --> 00:03:15,879
What we 

63
00:03:15,879 --> 00:03:18,659
looked at in the weather data were 

64
00:03:18,659 --> 00:03:20,560
discrete, or we call them nominal, 

65
00:03:20,560 --> 00:03:23,870
attribute values where they 
belong to a certain fixed set, 

66
00:03:23,870 --> 00:03:25,499
or they can be numeric 

67
00:03:25,499 --> 00:03:27,949
or continuous values. 

68
00:03:27,949 --> 00:03:32,339
Also, the class can be discrete or 
continuous. We're looking at a discrete class, 

69
00:03:32,339 --> 00:03:36,169
yes or no, in the case of the weather 
data. Another kind of machine 

70
00:03:36,169 --> 00:03:37,800
learning problem would involve 

71
00:03:37,800 --> 00:03:41,010
continuous classes, where 
you're trying to predict a number. 

72
00:03:41,010 --> 00:03:43,470
That's called a regression problem

73
00:03:43,470 --> 00:03:45,439
in the trade.

74
00:03:45,439 --> 00:03:48,859
I'm going to have a look at a similar 

75
00:03:48,859 --> 00:03:52,509
dataset to the weather dataset. 

76
00:03:52,509 --> 00:03:53,209
The numeric weather

77
00:03:53,209 --> 00:03:54,829
dataset. 

78
00:03:54,829 --> 00:03:57,979
Let me just open that in Weka, 

79
00:03:57,979 --> 00:04:00,739
weather.numeric.arff. 

80
00:04:00,739 --> 00:04:02,840
Here it is. It's very similar, 

81
00:04:02,840 --> 00:04:05,389
almost identical in fact, 

82
00:04:05,389 --> 00:04:09,329
with 14 instances, 5 attributes, the same attributes.

83
00:04:09,329 --> 00:04:12,229
Maybe I should just look at this dataset 

84
00:04:12,229 --> 00:04:13,769
in the edit panel. 

85
00:04:13,769 --> 00:04:17,600
You can see here that two of the 
attributes—temperature and humidity—

86
00:04:17,600 --> 00:04:21,739
are numeric attributes, whereas 
previously they were nominal

87
00:04:21,739 --> 00:04:25,660
attributes. So here there are numbers.

88
00:04:25,660 --> 00:04:29,830
When we look at
the attribute values for outlook, just as

89
00:04:29,830 --> 00:04:30,719
before, we have 

90
00:04:30,719 --> 00:04:32,729
sunny, overcast and rainy. 

91
00:04:32,729 --> 00:04:36,159
For temperature, though, 
we can't enumerate the values, 

92
00:04:36,159 --> 00:04:38,189
there are too many numbers to enumerate. 

93
00:04:38,189 --> 00:04:42,910
We have the minimum and maximum
value, mean, and standard deviation. 

94
00:04:42,910 --> 00:04:44,740
That's what Weka gives you 

95
00:04:44,740 --> 00:04:46,039
for the numeric values.

96
00:04:46,039 --> 00:04:49,939


97
00:04:49,939 --> 00:04:53,099
I'm going to look at a different dataset. 

98
00:04:53,099 --> 00:04:57,360
I'm going to look at the glass dataset, 
which is a rather more extensive dataset. 

99
00:04:57,360 --> 00:04:59,639
It's a real-world dataset,

100
00:04:59,639 --> 00:05:02,610
not a terribly big one. 

101
00:05:02,610 --> 00:05:04,189
Let's open it. 

102
00:05:04,189 --> 00:05:07,150
Here we've got 214 instances 

103
00:05:07,150 --> 00:05:09,520
and 10 attributes. 

104
00:05:09,520 --> 00:05:13,229
Here are the 10 attributes;
it's not clear what they are. 

105
00:05:13,229 --> 00:05:15,529
Let's look at the class, 

106
00:05:15,529 --> 00:05:17,650
by default the last 

107
00:05:17,650 --> 00:05:20,120
attribute shown. 

108
00:05:20,120 --> 00:05:24,400
There are seven values for the class, 
and the labels of these values give 

109
00:05:24,400 --> 00:05:26,819
you some indication of what 
this dataset is about. 

110
00:05:26,819 --> 00:05:31,469
We have headlamps, tableware, and containers. 

111
00:05:31,469 --> 00:05:34,250
Then we have building and vehicle windows, 

112
00:05:34,250 --> 00:05:36,080
both float and non-float. 

113
00:05:36,080 --> 00:05:37,560
You may not know this, but there are 

114
00:05:37,560 --> 00:05:40,349
different ways of making glass, and 

115
00:05:40,349 --> 00:05:43,439
the floating process is a way of making glass. 

116
00:05:43,439 --> 00:05:47,050
These are seven different kinds of glass. 

117
00:05:47,050 --> 00:05:50,209
What are the attribute values? 

118
00:05:50,209 --> 00:05:52,570
I don't know what you remember about physics, 

119
00:05:52,570 --> 00:05:53,679


120
00:05:53,679 --> 00:05:55,850
and I guess it doesn't 
matter if you don't remember. 

121
00:05:55,850 --> 00:05:59,369
RI stands for the refractive index. 

122
00:05:59,369 --> 00:06:02,399
It's always a good idea to check for 
reasonableness when you're looking at 

123
00:06:02,399 --> 00:06:04,739
datasets. It's really important to 

124
00:06:04,739 --> 00:06:06,830
get down and dirty with your data. 

125
00:06:06,830 --> 00:06:10,620
Here we're looking at the values of the 
refractive index—a minimum of 1.511,

126
00:06:10,620 --> 00:06:12,199


127
00:06:12,199 --> 00:06:14,650
a maximum of 1.534. 

128
00:06:14,650 --> 00:06:16,580
It's good to think about whether these are 

129
00:06:16,580 --> 00:06:20,310
reasonable values for refractive index. If you 
go to the web and have a look around, 

130
00:06:20,310 --> 00:06:21,710
you'll find that these are 

131
00:06:21,710 --> 00:06:22,699
good values for 

132
00:06:22,699 --> 00:06:24,539
the refractive index.

133
00:06:24,539 --> 00:06:25,940
Na. 

134
00:06:25,940 --> 00:06:29,429
If you did chemistry, you'll recognize Na as sodium. 

135
00:06:29,429 --> 00:06:33,350
Here, it looks like these are percentages, 

136
00:06:33,350 --> 00:06:36,370
the different percentages of sodium. 

137
00:06:36,370 --> 00:06:38,610
Magnesium, Mg, 

138
00:06:38,610 --> 00:06:43,159
and so on. We would expect silicon (Si)

139
00:06:43,159 --> 00:06:47,669
to make up the majority of glass. 
It varies between 69.81% 

140
00:06:47,669 --> 00:06:49,169


141
00:06:49,169 --> 00:06:51,229
and 75.41%.  

142
00:06:51,229 --> 00:06:57,289
These are percentages of 
different elements in the glass. 

143
00:06:57,289 --> 00:07:02,240
We can confirm our guesses 
here by looking at the data file itself. 

144
00:07:02,240 --> 00:07:04,569
Let me just find the glass data. 

145
00:07:04,569 --> 00:07:07,379
It's in Weka datasets, 

146
00:07:07,379 --> 00:07:08,030


147
00:07:08,030 --> 00:07:09,599


148
00:07:09,599 --> 00:07:12,419
and it's glass.arff. 

149
00:07:12,419 --> 00:07:14,580


150
00:07:14,580 --> 00:07:15,619
This is the ARFF 

151
00:07:15,619 --> 00:07:17,419
file format. 

152
00:07:17,419 --> 00:07:20,479
It starts with a bunch of comments about 

153
00:07:20,479 --> 00:07:24,969
the glass database. These lines beginning 
with percent signs (%) are comments.

154
00:07:24,969 --> 00:07:27,689
You can read about this. 
We don't have time to read it now.

155
00:07:27,689 --> 00:07:31,209
You can read about the
attributes and it does say that 

156
00:07:31,209 --> 00:07:32,570
the attributes are 

157
00:07:32,570 --> 00:07:36,679
refractive index, sodium, magnesium, and so on. 

158
00:07:36,679 --> 00:07:39,050
And the type of glass, just like I said, is about 

159
00:07:39,050 --> 00:07:45,839
windows, containers, and tableware, and so on.

160
00:07:45,839 --> 00:07:48,999
We can get down to the end of the comments, 

161
00:07:48,999 --> 00:07:53,249
and here we have stuff for Weka. This is 
the ARFF format. The relation has a 

162
00:07:53,249 --> 00:07:54,479
name, 

163
00:07:54,479 --> 00:07:57,219
you'll see it printed in 
the interface when you look. 

164
00:07:57,219 --> 00:08:01,119
The attributes are defined;
they are real-valued attributes,

165
00:08:01,119 --> 00:08:03,269
numeric attributes.

166
00:08:03,269 --> 00:08:04,439
The type 

167
00:08:04,439 --> 00:08:08,440
attribute is nominal, and 
the different values of type are

168
00:08:08,440 --> 00:08:11,599
enumerated here in quotes.

169
00:08:11,599 --> 00:08:14,979
That defines the relation 
and the attributes. Then we have an

170
00:08:14,979 --> 00:08:19,459
'@data' line, and following that, in the
ARFF format, are simply the instances,

171
00:08:19,459 --> 00:08:24,239
one after the other, with the attribute 
values all on one line, ending with 

172
00:08:24,239 --> 00:08:26,430
the class by default. This is the

173
00:08:26,430 --> 00:08:29,219
class value for the first instance. 

174
00:08:29,219 --> 00:08:31,889
I think there are 214 

175
00:08:31,889 --> 00:08:33,670
instances here. 

176
00:08:33,670 --> 00:08:37,030
There's the last one. 

177
00:08:37,030 --> 00:08:39,829
That's the ARFF format. It is a very simple, 

178
00:08:39,829 --> 00:08:43,040
textual file format. 

179
00:08:43,040 --> 00:08:46,870
Now we've confirmed our guesses 
about these numbers being percentages 

180
00:08:46,870 --> 00:08:49,460
and different elements. 

181
00:08:49,460 --> 00:08:52,420
We can think about 

182
00:08:52,420 --> 00:08:56,310
this some more. It's important,
then, that these numbers are

183
00:08:56,310 --> 00:09:00,520
reasonable. If they went negative, for example, 

184
00:09:00,520 --> 00:09:03,670
that would indicate some kind of corrupted 
value. You can't have a negative 

185
00:09:03,670 --> 00:09:04,820
percentage. 

186
00:09:04,820 --> 00:09:08,560
We're expecting silicon to
be the majority component; 

187
00:09:08,560 --> 00:09:12,290
we're expecting the refractive index to be 
in this kind of range. It's always a good 

188
00:09:12,290 --> 00:09:14,749
idea when you get a dataset to just 

189
00:09:14,749 --> 00:09:16,870
click around in the Weka interface 

190
00:09:16,870 --> 00:09:20,090
and make sure things look real. 
There are rather small amounts

191
00:09:20,090 --> 00:09:24,220
of aluminum in glass. I guess that's not surprising; 

192
00:09:24,220 --> 00:09:27,260
I don't know very much about glass myself. 

193
00:09:27,260 --> 00:09:29,839
We're just kind of checking for reasonableness here—

194
00:09:29,839 --> 00:09:36,440
a very good thing to do.

195
00:09:36,440 --> 00:09:37,180
That's it then. 

196
00:09:37,180 --> 00:09:40,670
In this lesson, we've 
looked at the classification problem. 

197
00:09:40,670 --> 00:09:44,199
We've looked at the nominal weather 
data and the numeric weather data. 

198
00:09:44,199 --> 00:09:47,400
We've talked about 
nominal versus numeric attributes, 

199
00:09:47,400 --> 00:09:48,090
and we've 

200
00:09:48,090 --> 00:09:50,820
talked about the ARFF file format. 

201
00:09:50,820 --> 00:09:52,680
We've looked at the glass.arff 

202
00:09:52,680 --> 00:09:54,030
dataset, 

203
00:09:54,030 --> 00:09:57,970
and I've talked about sanity checking 
of attributes, and the importance of 

204
00:09:57,970 --> 00:10:00,850
getting down and dirty with your data. 

205
00:10:00,850 --> 00:10:04,410
If you'd like some further background 
on this, you can read Section 11.1 

206
00:10:04,410 --> 00:10:08,130
of the text and read about 
Preparing the data and Loading the data 

207
00:10:08,130 --> 00:10:10,080
into the Explorer.

208
00:10:10,080 --> 00:10:11,429
Whether or not you do that, 

209
00:10:11,429 --> 00:10:16,640
please go and look at the activity 
associated with this lesson.

210
00:10:16,640 --> 00:10:23,000
We'll see you soon. Bye!