So far, whatever we've done in machine learning has assumed that data is labeled in the training set with some classes: positive or negative, buy or browse, or types of animals, for example. But in the real world, there's nobody always telling you what classes there are. In some cases you can measure it, by figuring out which users actually bought something and which users didn't. But in the case of sentiment, for example, it's difficult to imagine how one would get that input, whether a sentence is positive or negative, without some human actually doing the labeling. So the question does arise: how do classes emerge from the data without any external human input? Or, in the real world, how do we figure out that an animal is an animal, a table is an object, a chair is a chair, and so on, without someone explicitly telling us what is an object and what is not? Well, the answer is clustering.

In clustering, groups of similar users or user queries are clumped together, or clustered together, based on the terms that they contain. So if a large number of queries contained "red flowers", "yellow flowers", "cheap flowers", etcetera, then all of these queries would get clumped into a cluster which essentially talks about flowers and their colors. Similarly, comments which often use words like "great", "love", and "excellent" would naturally go together and form a cluster of positive sentiment, whereas others with words like "hate", "uncomfortable", and "very difficult" would end up in another cluster, which would hopefully be the negative ones. Now, those are ideal situations; in practice, the clusters that emerge directly from the data need not be so nicely aligned with the classes that we actually expect.

In the real world, for example, we might be looking at observations of animals having features like the number of legs, the size of the head, and things like that. We might group them based on which animals appear to be similar in terms of their head size, the number of legs they have, or the noises that they make, and perhaps we'd get clusters which look like the groups representing the animal classes that we would naturally assign. That's probably how we actually assign classes to objects in the real world in the first place.
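To make the query-clustering idea concrete, here is a minimal sketch in Python, assuming scikit-learn is available; the example queries, the TF-IDF term representation, and the choice of two clusters are illustrative assumptions, not something prescribed in the lecture.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical user queries: some about flowers, some sentiment-like comments.
queries = [
    "red flowers",
    "yellow flowers",
    "cheap flowers delivery",
    "great product, love it, excellent",
    "love the excellent service",
]

# Represent each query by the terms it contains (TF-IDF weights).
X = TfidfVectorizer().fit_transform(queries)

# Group similar queries; two clusters is an arbitrary choice for this sketch.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for query, label in zip(queries, labels):
    print(f"cluster {label}: {query}")
```

In practice one would cluster far more queries and tune the number of clusters; the point is only that grouping happens from the terms alone, with no labels provided.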
An important part of all of this, of figuring out what classes there are from the data, is the fact that we assume the features, on the basis of which classes emerge, are given. This is very important to note, because clustering is often highly dependent on which features one chooses. Later on this week, we will see how we might be able to find the features themselves from the data itself. But for the moment, let's assume that the features are given.

So the goal of clustering is to find regions of our space X, remember our space X with features being the elements X1 to XN, that are more populated than random data. Now let's look at this statement a little bit carefully. "Regions" means that things which are in the same region are close together, or similar to each other, because their values of the features, that is, the X values, are close in some respects. Some may agree and some may not, but even where they don't agree in every respect they are still close. For example, large and extra large are closer to each other than either is, perhaps, to small.

Secondly, these regions are not just regions which have a few instances in them, but are more populated than one would otherwise expect if the data were totally random, in the sense that the X values for each of the features were chosen at random from all the possible values that they could have.

So what we're trying to find are regions where the probability of X falling in that particular region, given the data that we have, is larger than the probability that the data would take that value if it were generated uniformly, that is, if it were completely random data, with every element of X chosen at random.

Another way of looking at this is the following. Suppose we have all our data, which is the black dots in the space here. In clustering we don't have the output variable, since we don't have the class, but we want to find these classes, so let's set y equal to one for all the dots which are actually there in our data. That means the actual observations that we experience in the real world
are all set to one. Then imagine that we add to this data some other data which we choose at random: we just pick X values at random and throw them into the data set. Those are the lighter-colored dots; we throw a lot of them in and assign them values y equal to zero.

Now, with this definition of our data, which is no longer just the old data but the old data plus this random data, we want to figure out what the clusters are. If we now define our f of X just like before, as the expected value of Y given X, well, Y is one for the data that we actually observed and zero for the data that we added at random. Then the expected value of Y given X is essentially the probability of X under the real data, divided by the probability of X under the real data plus the probability of X under the random data. If you just work this out, it's the ratio R over one plus R, where R is the ratio of the real-data probability to the random-data probability P0, because you can divide the numerator and the denominator by P0.

This particular function will have extreme values. There will be some areas where it's very large, because a lot of the dots in that region have a one associated with them. In other areas it will not be large: there are no black dots in that area, but there are a lot of random dots. So finding the regions where this function is large gives us clusters.

Now, this is quite important when we have big data, because with big data we can actually afford to do this kind of clustering, which can sometimes even be done efficiently. But that's not normally done, because big data is quite new. The traditional means of clustering are k-means clustering, agglomerative or hierarchical clustering, and even the locality sensitive hashing that we discussed in week one is a form of clustering, because it groups similar items together in an unsupervised manner. Of course, it doesn't care whether the clusters are big or small, so often it doesn't give us great clusters. But it can definitely be used as a first step towards more careful clustering, where we try to find regions which actually have large values of f, that is, which contain a large number of points from our data, all of which are close together.
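To make the random-data construction concrete, here is a minimal sketch of the idea, assuming a two-dimensional feature space, made-up "real" observations, and scikit-learn's RandomForestClassifier as one possible way to estimate f of X; the lecture does not prescribe any particular classifier or data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical "real" observations: two dense blobs in a 2-D feature space (y = 1).
real = np.vstack([
    rng.normal(loc=[0.2, 0.2], scale=0.05, size=(200, 2)),
    rng.normal(loc=[0.8, 0.7], scale=0.05, size=(200, 2)),
])

# Background points drawn uniformly over the same range (y = 0).
background = rng.uniform(low=0.0, high=1.0, size=(400, 2))

X = np.vstack([real, background])
y = np.concatenate([np.ones(len(real)), np.zeros(len(background))])

# A probabilistic classifier estimates f(x) = E[Y | X = x] = R / (1 + R).
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Points where f(x) is large lie in regions denser than uniform data: clusters.
f = clf.predict_proba(real)[:, 1]
print("fraction of real points with f(x) > 0.8:", np.mean(f > 0.8))
```

Real observations sitting in dense regions get predicted values of f close to one, while isolated points look no better than the uniform background; a traditional algorithm like k-means would instead find such groups directly from distances between points.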
Once you have a cluster, one can assign a class label to each cluster and say, okay, cluster one is class A, cluster two is class B. We can't really give them names, because we don't really know how to name them. As human beings we figure out how to name them through language and collective agreement, but, you know, the computer doesn't know how to do that.

So clustering is one way of having classes emerge from data in an unsupervised manner, and we've seen one means of clustering, locality sensitive hashing, even if it doesn't give a great set of clusters. We won't really study the other means in this course; there are other courses where you can get into all kinds of clustering algorithms, as well as supervised classification algorithms. But we're going to move on to look at other kinds of learning. So the message here is that clustering allows us to get classes from data without having to do any explicit labeling, but an important aspect is that one needs to have the features, because otherwise you don't have any basis on which to cluster, and if you choose your features wrong, then you'll get the wrong clusters.

Another nice point that I'd like you to note is that we used the same formulation of f of X equal to the expected value of Y given X, with an appropriate definition of our space as including not only the original data points but also some random data. So we can use the same mechanism of defining the problem in terms of the function f of X. While this doesn't normally yield any practical benefit, it certainly allows us to understand both classification and clustering, and some of the other techniques that we will study very soon, with the same formalism.

It also allows us to imagine a situation where, if one were able to find decision boundaries of functions like f of X, or efficiently find regions where f of X is large or small, one could solve classification, clustering, and a whole bunch of other machine learning problems with one set of methods. However, this remains a research area, even though much work has been done in this direction in the past, and it assumes even more importance now with big data, where using the data directly often yields much better results than if one actually had only very small amounts of data.