﻿1
00:00:16,160 --> 00:00:19,950
Hi! Welcome to Lesson 5.3 of Data Mining with
Weka.

2
00:00:19,950 --> 00:00:23,369
Before we start, I thought I'd show you where
I live.

3
00:00:23,369 --> 00:00:28,669
I told you before that I moved to New Zealand
many years ago.

4
00:00:28,669 --> 00:00:29,939
I live in a place called Hamilton.

5
00:00:29,939 --> 00:00:35,220
Let me just zoom in and see if we can find
Hamilton in the North Island of New Zealand,

6
00:00:35,220 --> 00:00:37,670
around the center of the North Island.

7
00:00:37,670 --> 00:00:44,030
This is where the University of Waikato is.

8
00:00:44,030 --> 00:00:47,660
Here is the university; this is where I live.

9
00:00:47,660 --> 00:00:52,160
This is my journey to work: I cycle every
morning through the countryside.

10
00:00:52,160 --> 00:00:53,930
As you can see, it's really nice.

11
00:00:53,930 --> 00:00:55,390
I live out here in the country.

12
00:00:55,390 --> 00:01:02,390
I'm a sheep farmer! I've got four sheep, three
in the paddock and one in the freezer.

13
00:01:02,500 --> 00:01:05,780
I cycle in -- it takes about half an hour
-- and I get to the university.

14
00:01:05,780 --> 00:01:11,970
I have the distinction of being able to go
from one week to the next without ever seeing

15
00:01:11,970 --> 00:01:16,090
a traffic light, because I live out on the
same edge of town as the university.

16
00:01:16,090 --> 00:01:21,500
When I get to the campus of the University
of Waikato, it's a very beautiful campus.

17
00:01:21,500 --> 00:01:23,060
We've got three lakes.

18
00:01:23,060 --> 00:01:27,349
There are two of the lakes, and another lake
down here.

19
00:01:27,349 --> 00:01:32,330
It's a really nice place to work! So I'm very
happy here.

20
00:01:32,330 --> 00:01:39,330
Let's move on to talk about data mining and
ethics.

21
00:01:39,530 --> 00:01:46,530
In Europe, they have a lot of pretty stringent
laws about information privacy.

22
00:01:47,000 --> 00:01:51,450
For example, if you're going to collect any
personal information about anyone, a purpose

23
00:01:51,450 --> 00:01:52,860
must be stated.

24
00:01:52,860 --> 00:01:57,750
The information should not be disclosed to
others without consent.

25
00:01:57,750 --> 00:02:01,390
Records kept on individuals must be accurate
and up to date.

26
00:02:01,390 --> 00:02:03,920
People should be able to review data about
themselves.

27
00:02:03,920 --> 00:02:08,110
Data should be deleted when it's no longer
needed.

28
00:02:08,110 --> 00:02:12,690
Personal information must not be transmitted
to other locations.

29
00:02:12,690 --> 00:02:17,390
Some data is too sensitive to be collected,
except in extreme circumstances.

30
00:02:17,390 --> 00:02:20,489
This is true in some countries in Europe,
particularly Scandinavia.

31
00:02:20,489 --> 00:02:24,230
It's not true, of course, in the United States.

32
00:02:24,230 --> 00:02:29,750
Data mining is about collecting and utilizing
recorded information, and it's good to be

33
00:02:29,750 --> 00:02:32,600
aware of some of these ethical issues.

34
00:02:32,600 --> 00:02:39,000
People often try to anonymize data so that
it's safe to distribute for other people to

35
00:02:39,000 --> 00:02:42,790
work on, but anonymization is much harder
than you think.

36
00:02:42,790 --> 00:02:44,760
Here's a little story for you.

37
00:02:44,760 --> 00:02:49,500
When Massachusetts released medical records
summarizing every state employee's hospital

38
00:02:49,500 --> 00:02:54,780
record in the mid-1990s, the Governor gave
a public assurance that it had been anonymized

39
00:02:54,780 --> 00:02:59,950
by removing all identifying information -- name,
address, and social security number.

40
00:02:59,950 --> 00:03:06,040
He was surprised to receive his own health
records (which included a lot of private information)

41
00:03:06,040 --> 00:03:11,040
in the mail shortly afterwards! People could
be re-identified from the information that

42
00:03:11,040 --> 00:03:13,490
was left there.

43
00:03:13,490 --> 00:03:18,220
There's been quite a bit of research done
on re-identification techniques.

44
00:03:18,220 --> 00:03:24,370
For example, using publicly available records
on the internet, 50% of Americans can be identified

45
00:03:24,370 --> 00:03:28,010
from their city, birth date, and sex.

46
00:03:28,010 --> 00:03:34,470
85% can be identified if you include their
zip code as well.
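
To make the idea concrete, here is a minimal sketch in Python. The records are made up for illustration (they are not from any real dataset), and `unique_fraction` is a hypothetical helper: it measures what fraction of rows are pinned down uniquely by a given combination of fields.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, other fields kept.
records = [
    {"city": "Boston", "birth_date": "1950-03-14", "sex": "M", "zip": "02138"},
    {"city": "Boston", "birth_date": "1950-03-14", "sex": "M", "zip": "02139"},
    {"city": "Boston", "birth_date": "1962-11-02", "sex": "F", "zip": "02138"},
    {"city": "Boston", "birth_date": "1962-11-02", "sex": "F", "zip": "02138"},
    {"city": "Salem", "birth_date": "1975-07-30", "sex": "F", "zip": "01970"},
]


def unique_fraction(rows, fields):
    """Fraction of rows whose combination of `fields` appears exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)


without_zip = unique_fraction(records, ["city", "birth_date", "sex"])
with_zip = unique_fraction(records, ["city", "birth_date", "sex", "zip"])
print(without_zip, with_zip)
```

In this toy data, adding the zip code makes more rows unique, which is the same effect as in the figures above, only on a much smaller scale.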

47
00:03:34,470 --> 00:03:40,140
There was some interesting work done on a
movie database.

48
00:03:40,140 --> 00:03:47,140
Netflix released a database of 100 million
records of movie ratings.

49
00:03:47,290 --> 00:03:51,810
They got individuals to rate movies [on the
scale] 1-5, and they had a whole bunch of

50
00:03:51,810 --> 00:03:56,100
people doing this -- a total of 100 million
records.

51
00:03:56,100 --> 00:04:02,060
It turned out that you could identify 99%
of people in the database if you knew their

52
00:04:02,060 --> 00:04:06,420
ratings for 6 movies and approximately when
they saw them.

53
00:04:06,420 --> 00:04:11,650
Even if you only know their ratings for 2
movies, you can identify 70% of people.

54
00:04:11,650 --> 00:04:16,349
This means you can use the database to find
out the other movies that these people watched.

55
00:04:16,349 --> 00:04:19,300
They might not want you to know that.

56
00:04:19,300 --> 00:04:25,500
Re-identification is remarkably powerful,
and it is incredibly hard to anonymize data

57
00:04:25,500 --> 00:04:30,660
effectively in a way that doesn't destroy
the value of the entire dataset for data mining

58
00:04:30,660 --> 00:04:33,310
purposes.

59
00:04:33,310 --> 00:04:37,540
Of course, the purpose of data mining is to
discriminate: that's what we're trying to do! 

60
00:04:37,540 --> 00:04:42,070
We're trying to learn rules that discriminate
one class from another in the data -- who

61
00:04:42,070 --> 00:04:48,000
gets the loan? -- who gets a special offer?
But, of course, certain kinds of discrimination

62
00:04:48,000 --> 00:04:50,720
are unethical, not to mention illegal.

63
00:04:50,720 --> 00:04:56,570
For example, racial, sexual, and religious
discrimination is certainly unethical, and

64
00:04:56,570 --> 00:04:59,550
in most places illegal.

65
00:04:59,550 --> 00:05:01,910
But it depends on the context.

66
00:05:01,910 --> 00:05:06,500
Sexual discrimination is usually illegal ... except for doctors.

67
00:05:06,500 --> 00:05:11,350
Doctors are expected to take gender into account
when they make their diagnoses.

68
00:05:11,350 --> 00:05:16,400
They don't want to tell a man that he is pregnant,
for example.

69
00:05:16,400 --> 00:05:20,010
Also, information that appears innocuous may
not be.

70
00:05:20,010 --> 00:05:26,880
For example, area codes -- zip codes in the
US -- correlate strongly with race; membership

71
00:05:26,880 --> 00:05:29,100
of certain organizations correlates with gender.

72
00:05:29,100 --> 00:05:34,260
So although you might have removed the explicit
racial and gender information from your database,

73
00:05:34,260 --> 00:05:37,880
it might still be possible to infer it from
other information that's there.
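
Here is a small sketch in Python of how that can happen. The rows are invented, and "group" is just a stand-in for a sensitive attribute that has supposedly been removed. The sketch guesses each person's group from the most common group in their zip code, and checks the guess against the same toy rows.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (zip_code, group). All values are made up.
rows = [
    ("10001", "A"), ("10001", "A"), ("10001", "A"), ("10001", "B"),
    ("20002", "B"), ("20002", "B"), ("20002", "B"), ("20002", "A"),
]

# Count how often each group occurs within each zip code.
by_zip = defaultdict(Counter)
for zip_code, group in rows:
    by_zip[zip_code][group] += 1

# Guess each person's group as the most common group in their zip code.
guess = {z: counts.most_common(1)[0][0] for z, counts in by_zip.items()}

# Fraction of these same rows for which the zip-code guess is right.
accuracy = sum(guess[z] == g for z, g in rows) / len(rows)
print(guess, accuracy)
```

With these invented numbers, the zip code alone recovers the hidden attribute for most people, so a model that uses zip code can end up discriminating on it indirectly.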

74
00:05:37,880 --> 00:05:48,550
It's very hard to deal with data: it has a way of revealing secrets about itself in unintended ways.

75
00:05:48,550 --> 00:05:55,550
Another ethical issue concerning data mining
is that correlation does not imply causation.

76
00:05:56,610 --> 00:06:02,169
Here's a classic example: as ice cream sales
increase, so does the rate of drownings.

77
00:06:02,169 --> 00:06:06,970
Therefore, ice cream consumption causes drowning?
Probably not.

78
00:06:06,970 --> 00:06:12,320
They're probably both caused by warmer temperatures
-- people going to beaches.
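
A minimal sketch in Python shows the effect. The numbers are made up, and both quantities are built from temperature plus a little unrelated noise, so by construction neither one influences the other. Because we built the data ourselves, we know exactly how much temperature contributes and can subtract it; with real data you would have to estimate that.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up daily data: temperature is the hidden common cause.
temperature = [10, 12, 15, 18, 20, 22, 25, 28, 30, 32]
noise_a = [1, -1, 2, -2, 0, 1, -1, 2, -2, 0]
noise_b = [0, 1, -1, 0, 1, -1, 0, 1, -1, 0]

ice_cream_sales = [5 * t + a for t, a in zip(temperature, noise_a)]
drownings = [2 * t + b for t, b in zip(temperature, noise_b)]

# Strong correlation, even though neither series depends on the other.
r = pearson(ice_cream_sales, drownings)

# Subtract the known temperature effect and correlate what is left.
ice_left = [s - 5 * t for s, t in zip(ice_cream_sales, temperature)]
drown_left = [d - 2 * t for d, t in zip(drownings, temperature)]
r_after = pearson(ice_left, drown_left)

print(r, r_after)
```

The raw correlation is close to 1, but once the temperature effect is taken out nothing is left, which is what you would expect when a third factor drives both.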

79
00:06:12,320 --> 00:06:17,800
What data mining reveals is simply correlations,
not causation.

80
00:06:17,800 --> 00:06:20,010
Really, we want causation.

81
00:06:20,010 --> 00:06:25,550
We want to be able to predict the effects
of our actions, but all we can look at using

82
00:06:25,550 --> 00:06:27,919
data mining techniques is correlation.

83
00:06:27,919 --> 00:06:34,919
To understand causation, you need a
deeper model of what's going on.

84
00:06:36,340 --> 00:06:40,150
I just wanted to alert you to some of the
issues, some of the ethical issues, in data

85
00:06:40,150 --> 00:06:46,790
mining, before you go away and use what you've
learned in this course on your own datasets:

86
00:06:46,790 --> 00:06:51,270
issues about the privacy of personal information;
the fact that anonymization is harder than

87
00:06:51,270 --> 00:06:57,650
you think; re-identification of individuals
from supposedly anonymized data is easier

88
00:06:57,650 --> 00:07:03,699
than you think; data mining and discrimination
-- it is, after all, about discrimination;

89
00:07:03,699 --> 00:07:08,250
and the fact that correlation does not imply
causation.

90
00:07:08,250 --> 00:07:13,729
There's a section in the textbook, Data mining
and ethics, which you can read for more background

91
00:07:13,729 --> 00:07:18,030
information, and there's a little activity
associated with this lesson, which you should

92
00:07:18,030 --> 00:07:20,190
go and do now.

93
00:07:20,190 --> 00:07:23,900
I'll see you in the next lesson, which is
the last lesson of the course.

94
00:07:23,900 --> 00:07:26,500
Bye for now!

