1
00:00:17,630 --> 00:00:24,750
Hello again! This is the last class of Data
Mining with Weka, and we're going to step

2
00:00:24,750 --> 00:00:29,070
back a little bit and take a look at some
more global issues with regard to the data

3
00:00:29,070 --> 00:00:29,859
mining process.

4
00:00:29,859 --> 00:00:38,400
It's a short class with just four lessons:
the data mining process, pitfalls and pratfalls,

5
00:00:38,400 --> 00:00:41,730
data mining and ethics, and finally, a quick
summary.

6
00:00:42,760 --> 00:00:45,760
Let's get on with Lesson 5.1.

7
00:00:45,760 --> 00:00:50,570
This might be your vision of the data mining
process.

8
00:00:50,570 --> 00:00:53,100
You've got some data or someone gives you
some data.

9
00:00:53,100 --> 00:00:54,860
You've got Weka.

10
00:00:54,860 --> 00:01:00,720
You apply Weka to the data, you get some kind
of cool result from that, and everyone's happy.

11
00:01:02,820 --> 00:01:05,509
If so, I've got bad news for you.

12
00:01:05,509 --> 00:01:08,500
It's not going to be like that at all.

13
00:01:08,500 --> 00:01:11,579
Really, this would be a better way to think
about it.

14
00:01:11,579 --> 00:01:15,650
You're going to have a circle; you're going
to go round and round the circle.

15
00:01:15,650 --> 00:01:19,770
It's true that Weka is important -- it's in
the very middle of the circle here.

16
00:01:19,770 --> 00:01:26,069
It's going to be crucial, but it's only a
small part of what you have to do.

17
00:01:26,069 --> 00:01:30,590
Perhaps the biggest problem is going to be
to ask the right kind of question.

18
00:01:30,590 --> 00:01:37,380
You need to be answering a question, not just
vaguely exploring a collection of data.

19
00:01:38,420 --> 00:01:44,680
Then, you need to gather whatever data
you can get hold of that gives you a chance

20
00:01:44,689 --> 00:01:49,329
of answering this question using data mining
techniques.

21
00:01:49,329 --> 00:01:50,950
It's hard to collect the data.

22
00:01:50,950 --> 00:01:56,670
You're probably going to have an initial dataset,
but you might need to add some demographic

23
00:01:56,670 --> 00:02:00,319
data, or some weather data, or some data about
other stuff.

24
00:02:00,319 --> 00:02:05,079
You're going to have to go to the web and
find more information to augment your dataset.

25
00:02:05,079 --> 00:02:11,819
Then you'll merge all that together: do some
database hacking to get a dataset that contains

26
00:02:11,819 --> 00:02:17,410
all the attributes that you think you might
need -- or that you think Weka might need.

27
00:02:17,410 --> 00:02:19,069
Then you're going to have to clean the data.

28
00:02:19,069 --> 00:02:24,890
The bad news is that real-world data is always
very messy.

29
00:02:24,890 --> 00:02:29,610
That's a long and painstaking process of looking
around, looking at the data, trying to understand it,

30
00:02:29,610 --> 00:02:35,390
trying to figure out what the anomalies
are and whether it's good to delete them or not.

31
00:02:35,390 --> 00:02:37,260
That's going to take a while.

32
00:02:37,260 --> 00:02:40,550
Then you're going to need to define some new
features, probably.

33
00:02:40,550 --> 00:02:44,810
This is the feature engineering process, and
it's the key to successful data mining.

34
00:02:44,810 --> 00:02:49,030
Then, finally, you're going to use Weka, of
course.

35
00:02:49,030 --> 00:02:54,860
You might go around this circle a few times
to get a nice algorithm for classification,

36
00:02:54,860 --> 00:03:00,420
and then you're going to need to deploy the
algorithm in the real world.

37
00:03:00,420 --> 00:03:03,340
Each of these processes is difficult.

38
00:03:04,340 --> 00:03:08,340
You need to think about the question that
you want to answer.

39
00:03:08,440 --> 00:03:13,330
"Tell me something cool about this data" is
not a good enough question.

40
00:03:13,330 --> 00:03:17,890
You need to know what you want to know from
the data.

41
00:03:17,890 --> 00:03:19,660
Then you need to gather it.

42
00:03:19,660 --> 00:03:23,110
There's a lot of data around, like I said
at the very beginning, but the trouble is

43
00:03:23,110 --> 00:03:30,110
that we need classified data to use classification
techniques in data mining.

44
00:03:30,290 --> 00:03:36,080
We need expert judgements on the data, expert
classifications, and there's not so much data

45
00:03:36,080 --> 00:03:42,810
around that includes expert classifications,
or correct results.

46
00:03:42,810 --> 00:03:45,680
They say that more data beats a clever algorithm.

47
00:03:45,680 --> 00:03:49,910
So rather than spending time trying to optimize
the exact algorithm you're going to use in

48
00:03:49,910 --> 00:03:53,670
Weka, you might be better off spending it
getting more and more data.

49
00:03:53,670 --> 00:04:00,570
Then you've got to clean it, and like I said
before, real data is very mucky.

50
00:04:00,570 --> 00:04:04,650
That's going to be a painstaking matter of
looking through it and looking for anomalies.

51
00:04:04,650 --> 00:04:08,000
Feature engineering, the next step, is the
key to data mining.

52
00:04:08,000 --> 00:04:12,930
We'll talk about how Weka can help you a little
bit in a minute.

53
00:04:12,930 --> 00:04:16,340
Then you've got to deploy the result.

54
00:04:16,340 --> 00:04:18,490
Implementing it -- well, that's the easy part.

55
00:04:18,490 --> 00:04:24,430
The difficult part is to convince your boss
to use this result from this data mining process

56
00:04:24,430 --> 00:04:29,620
that he probably finds very mysterious and
perhaps doesn't trust very much.

57
00:04:29,620 --> 00:04:36,620
Getting anything actually deployed in the
real world is a pretty tough call.

58
00:04:37,060 --> 00:04:43,370
The key technical part of all this is feature
engineering, and Weka has a lot of filters

59
00:04:43,370 --> 00:04:44,200
that will help with this.

60
00:04:44,200 --> 00:04:46,150
Here are just a few of them.

61
00:04:46,150 --> 00:04:53,150
It might be worthwhile defining a new feature,
a new attribute that's a mathematical expression

62
00:04:54,530 --> 00:04:56,120
involving existing attributes.

63
00:04:56,120 --> 00:04:59,890
Or you might want to modify an existing attribute.

64
00:04:59,890 --> 00:05:05,240
With AddExpression, you can use any kind of
mathematical formula to create a new attribute

65
00:05:05,240 --> 00:05:08,050
from existing ones.

66
00:05:08,050 --> 00:05:13,730
You might want to normalize or center your
data, or standardize it statistically.

67
00:05:13,730 --> 00:05:18,210
Transform a numeric attribute to have a zero
mean -- that's "center".

68
00:05:18,210 --> 00:05:21,830
Or transform it to a given numeric range -- that's
"normalize".

69
00:05:21,830 --> 00:05:28,830
Or give it a zero mean and unit variance -- that's
a statistical operation called "standardization".

70
00:05:30,530 --> 00:05:37,500
You might want to take those numeric attributes
and discretize them into nominal values.

71
00:05:37,500 --> 00:05:43,440
Weka has both supervised and unsupervised
attribute discretization filters.

72
00:05:44,790 --> 00:05:46,000
There are a lot of other transformations.

73
00:05:46,000 --> 00:05:51,480
For example, the PrincipalComponents transformation
involves a matrix analysis of the data to

74
00:05:51,480 --> 00:05:54,150
select the principal components in a linear space.

75
00:05:54,150 --> 00:05:58,920
That's mathematical, and Weka contains a good
implementation.

76
00:05:58,920 --> 00:06:04,220
RemoveUseless will remove attributes that
don't vary at all, or vary too much.

77
00:06:04,220 --> 00:06:07,800
Actually, I think we encountered that in one
of our activities.

78
00:06:07,800 --> 00:06:14,800
Then, there are a couple of filters that help
you deal with time series, when your instances

79
00:06:14,830 --> 00:06:17,300
represent a series over time.

80
00:06:17,300 --> 00:06:21,080
You probably want to take the difference between
one instance and the next, or a difference

81
00:06:21,080 --> 00:06:27,680
with some kind of lag -- one instance and
the one 5 before it, or 10 before it.
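
The idea of differencing with a lag can be sketched in a few lines of plain Python. This is an illustration only, not one of Weka's filters: `lagged_difference` is a hypothetical helper, and it simply drops the first `lag` positions, which have no earlier instance to compare against (a real filter may handle those differently).

```python
def lagged_difference(series, lag=1):
    """Return series[i] - series[i - lag] for every i that has a predecessor."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Hypothetical series: one numeric attribute observed at successive times.
readings = [1, 4, 9, 16, 25]
step_changes = lagged_difference(readings)          # each value minus the previous one
two_back_changes = lagged_difference(readings, 2)   # each value minus the one 2 before it
```

The differences become a new attribute that describes how the series is changing, rather than its raw level.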

82
00:06:27,680 --> 00:06:33,650
These are just a few of the filters that Weka
contains to help you with your feature engineering.

83
00:06:33,650 --> 00:06:39,250
The message of this lesson is that Weka is
only a small part of the entire data mining

84
00:06:39,250 --> 00:06:41,810
process, and it's the easiest part.

85
00:06:41,810 --> 00:06:46,310
In this course, we've chosen to tell you about
the easiest part of the process! I'm sorry

86
00:06:46,310 --> 00:06:46,780
about that.

87
00:06:46,780 --> 00:06:50,230
The other bits are, in practice, much more
difficult.

88
00:06:50,230 --> 00:06:56,270
There's an old programmer's blessing: "May
all your problems be technical ones".

89
00:06:56,270 --> 00:07:01,170
It's the other problems -- the political problems
in getting hold of the data, and deploying

90
00:07:01,170 --> 00:07:06,610
the result -- those are the ones that tend
to be much more onerous in the overall data

91
00:07:06,610 --> 00:07:07,330
mining process.

92
00:07:07,330 --> 00:07:09,920
So good luck!

93
00:07:09,920 --> 00:07:12,400
There's some stuff about this in the course
text.

94
00:07:12,400 --> 00:07:17,810
Section 1.3 contains information on Fielded
Applications, all of which have gone through

95
00:07:17,810 --> 00:07:24,480
this kind of process in order to get them
out there and used in the field.

96
00:07:24,480 --> 00:07:26,200
There's an activity associated with this lesson.

97
00:07:26,200 --> 00:07:29,180
Off you go and do it, and we'll see you in
the next lesson.

98
00:07:29,180 --> 00:07:36,180
Bye for now!

