1
00:00:18,500 --> 00:00:23,340
Hi! I went to see that movie The Great Gatsby
last night. I thought that was really good.

2
00:00:23,340 --> 00:00:30,340
I hope you don't mind if I finish off my martini.
Anyway, one of the constantly recurring themes

3
00:00:32,710 --> 00:00:39,280
in this course is the necessity to get close
to your data, look at it in every possible

4
00:00:39,280 --> 00:00:46,280
way. In this last lesson of the first class,
we're going to look at visualizing your data.

5
00:00:46,410 --> 00:00:50,449
This is what we're going to do. We're going
to use the Visualize panel. I'm going to open

6
00:00:50,449 --> 00:00:57,449
the iris dataset. You came across the iris
dataset in one of the activities, I think.

7
00:00:57,739 --> 00:01:04,739
I'm using it because it has numeric attributes,
four numeric attributes: sepallength, sepalwidth,

8
00:01:06,210 --> 00:01:13,210
petallength, petalwidth. The classes are the
three kinds of iris flower: Iris-setosa, Iris-versicolor,

9
00:01:15,090 --> 00:01:21,390
and Iris-virginica.
Let's go to the Visualize panel and visualize

10
00:01:21,390 --> 00:01:28,390
this data. There is a matrix of two-dimensional
plots, a five-by-five matrix of plots.

11
00:01:31,470 --> 00:01:40,600
If I select one of these plots, I'm going
to be looking at a plot of sepalwidth on
the x-axis and petalwidth on the y-axis.

12
00:01:42,150 --> 00:01:49,150
That's a plot of the data. The colors correspond
to the three classes. I can actually change

13
00:01:49,420 --> 00:01:52,920
the colors. If I don't like those, I could
select another color, but I'm going to leave

14
00:01:52,920 --> 00:01:59,040
them the way they are. I can look at individual
data points by clicking on them. This is talking

15
00:01:59,040 --> 00:02:06,040
about instance number 86 with a sepallength
of 6, sepalwidth of 3.4, and so on.

16
00:02:07,450 --> 00:02:14,450
That's a versicolor, which is why this spot is colored
red. We can look at individual instances.

17
00:02:15,790 --> 00:02:20,620
We can change the x- and y-axis by using
the menus here. Better still, if we click

18
00:02:20,620 --> 00:02:25,840
on this little set of bars here, these represent
the attributes. I'm going to click on this

19
00:02:25,840 --> 00:02:32,840
and the x-axis will change to sepallength.
Here the x-axis is sepalwidth. Here the x-axis

20
00:02:33,610 --> 00:02:40,610
is petallength, and so on. If I right click,
then it will change the y-axis to sepallength.

21
00:02:41,590 --> 00:02:55,890
So, I can quickly browse around these different
plots. There is a Jitter slider.

22
00:02:56,700 --> 00:03:01,390
Sometimes, points sit right on top of each other, and
jitter just adds a little bit of randomness

23
00:03:01,390 --> 00:03:06,900
to the x- and the y-axis. With a little bit
of jitter on here, the darker spots represent

24
00:03:06,900 --> 00:03:14,320
multiple instances. If I click on one of those,
I can see that that point represents three

25
00:03:14,320 --> 00:03:20,020
separate instances, all of class Iris-setosa,
and they all have the same value of petallength

26
00:03:20,020 --> 00:03:21,990
and sepalwidth.

27
00:03:21,990 --> 00:03:29,930
Both of which are being plotted on this graph. The
sepalwidth and petallength are 3.0 and 1.4

28
00:03:29,930 --> 00:03:34,210
for each of the three instances.

29
00:03:36,220 --> 00:03:43,220
If I click another one here, these are two
with very similar [sepalwidths] and petallengths,

30
00:03:47,560 --> 00:03:49,590
both of the class versicolor.

31
00:03:49,590 --> 00:03:54,190
The jitter slider helps you distinguish between
points that are in fact very close together.

32
00:03:54,190 --> 00:04:01,190
Another thing we can do is select bits of
this dataset. I'm going to choose select rectangle

33
00:04:01,690 --> 00:04:08,690
here. If I draw a rectangle now, I can select
these points. If I were to submit this rectangle,

34
00:04:09,450 --> 00:04:14,110
then all other points would be excluded and
just these points would appear on the graph,

35
00:04:14,110 --> 00:04:21,110
with the axes rescaled appropriately. Here
we go. I've submitted that rectangle, and

36
00:04:21,260 --> 00:04:26,450
you can see that there's just the red points
and green points there. I could save that

37
00:04:26,450 --> 00:04:33,050
if I wanted as a different dataset, or I could
reset it and maybe try another kind of selection

38
00:04:33,050 --> 00:04:37,550
like this, where I'm going to have some blue
points, some red and some green points and

39
00:04:37,550 --> 00:04:43,360
see what that looks like. This might be a
way of cleaning up outliers in your data,

40
00:04:43,360 --> 00:04:50,360
by selecting rectangles and saving the new dataset.

41
00:04:50,480 --> 00:04:57,480
That's visualizing the dataset itself. What
about visualizing the result of a classifier?

42
00:04:58,820 --> 00:05:05,820
Let's get rid of this Visualize panel and
go back to the Preprocess panel. I'm going to

43
00:05:07,010 --> 00:05:14,010
use a classifier. I'm going to use, guess
what, J48. Let's find it under trees. I'm

44
00:05:14,430 --> 00:05:21,430
going to run it. Then if I right click on
this entry here in the log area, I can view

45
00:05:25,920 --> 00:05:32,920
classifier errors. Here we've got the class
plotted against the predicted class. The square

46
00:05:33,770 --> 00:05:39,300
boxes represent errors. If I click on one
of these, I can, of course, change the different

47
00:05:39,300 --> 00:05:45,610
axes if I want. I can change the x-axis and
the y-axis, but I'm going to go back to class

48
00:05:45,610 --> 00:05:55,710
and predictedclass. If I click on one of these
boxes, I can see where the errors are.

49
00:05:57,210 --> 00:06:04,210
There are two instances where the predicted class
is versicolor and the actual class is virginica.

50
00:06:04,820 --> 00:06:10,550
We can see these in the confusion matrix.
The actual class is virginica, and the predicted

51
00:06:10,550 --> 00:06:17,550
class is versicolor, that's 'b'. This 2 entry
in the confusion matrix is represented by these

52
00:06:17,550 --> 00:06:28,680
2 instances here. If I look at another point,
say this one, I've got one instance which

53
00:06:28,680 --> 00:06:41,290
is in fact a setosa predicted to be a versicolor.
I can look at this plot and find out where

54
00:06:41,290 --> 00:06:48,290
the misclassifications are actually occurring,
the errors in the confusion matrix.

55
00:06:51,150 --> 00:06:56,710
Get down and dirty with your data and visualize
it. You can do all sorts of things. You can

56
00:06:56,710 --> 00:07:01,150
clean it up, detect outliers. You can look
at the classification errors. For example,

57
00:07:01,150 --> 00:07:06,390
there's a filter that allows you to add the
classifications as a new attribute.

58
00:07:06,390 --> 00:07:12,150
Let's just go and have a look at that. I'm going
to go and find a filter. We're going to add

59
00:07:12,150 --> 00:07:19,150
an attribute. It's supervised because it uses
a class. Add an attribute, and AddClassification.

60
00:07:20,110 --> 00:07:25,890
Here I get to choose in the configuration
panel, the machine learning scheme. I'm going

61
00:07:25,890 --> 00:07:34,290
to choose J48, of course, and I'm going to
set outputClassification to True.

62
00:07:34,290 --> 00:07:39,300
That's configured it, and I'm going to apply it.
It will add a new attribute. It's done it,

63
00:07:39,460 --> 00:07:46,010
and this attribute is the classification according
to J48. Weka is very powerful. You can do

64
00:07:46,010 --> 00:07:52,510
all sorts of things with classifiers and filters.
That's the end of the first class.

65
00:07:52,850 --> 00:07:58,930
There's a section of the book on Visualization. Please
go and do the activity associated with this

66
00:07:58,930 --> 00:08:05,930
lesson, and I'll see you in the next class. Bye!

