1
00:00:00,164 --> 00:00:04,551
[MUSIC]

2
00:00:04,551 --> 00:00:07,186
So far in this module,
we've discussed clustering in

3
00:00:07,186 --> 00:00:11,440
the context of document analysis, but we
see applications of clustering everywhere.

4
00:00:12,840 --> 00:00:18,305
As an example, maybe you want to cluster
images, so in this case that we're

5
00:00:18,305 --> 00:00:23,512
showing here you might want to be able
to group all images associated with

6
00:00:23,512 --> 00:00:29,020
pictures of the ocean or a pink flower,
dog, sunset, clouds, and so on.

7
00:00:29,020 --> 00:00:33,220
And this can be useful for
things like Google Image search,

8
00:00:33,220 --> 00:00:38,512
where a person types a word into
Google Images, performs some query on that

9
00:00:38,512 --> 00:00:43,810
images dataset, and Google wants to
display a set of images for that word.

10
00:00:43,810 --> 00:00:48,730
But maybe not all of the images in
the dataset have been labeled by a human.

11
00:00:48,730 --> 00:00:51,370
So if we can discover
groups of related images,

12
00:00:51,370 --> 00:00:55,230
this can really help in image search, and

13
00:00:55,230 --> 00:00:59,230
there are other reasons you might want to
perform clustering on images, too.

14
00:00:59,230 --> 00:01:05,490
So for example, when, again, using
Google Images somebody types in a term.

15
00:01:05,490 --> 00:01:10,638
Like the term cardinal, well,
that term might have multiple meanings and

16
00:01:10,638 --> 00:01:15,048
so if we take all images that
have already been labeled, again,

17
00:01:15,048 --> 00:01:17,677
maybe this was by a human as cardinal.

18
00:01:17,677 --> 00:01:23,277
Then, maybe we want to perform
a clustering on the images associated

19
00:01:23,277 --> 00:01:28,975
with this word because maybe some
of the images correspond to a bird,

20
00:01:28,975 --> 00:01:34,200
but others to a baseball team and
others to religious figures.

21
00:01:35,480 --> 00:01:37,980
So when a person is going and
searching for the word cardinal,

22
00:01:37,980 --> 00:01:42,510
we don't actually know which of
these meanings that person meant, so

23
00:01:42,510 --> 00:01:45,690
when we're displaying images
instead of having all the images

24
00:01:45,690 --> 00:01:49,800
be images of birds but
just different variants of that image.

25
00:01:49,800 --> 00:01:52,090
Maybe we want to have
a diverse set of images,

26
00:01:52,090 --> 00:01:55,320
that cover this set of possible meanings.

27
00:01:55,320 --> 00:01:59,670
So if we can cluster images all
labeled with the same label,

28
00:01:59,670 --> 00:02:04,620
we can present results that might
better explore what the person is

29
00:02:04,620 --> 00:02:07,710
actually intending when they
do a Google Image search.

30
00:02:07,710 --> 00:02:12,070
So in this case we're using
clustering to structure our output.

31
00:02:12,070 --> 00:02:15,240
We can also think about applying
clustering in applications that are very

32
00:02:15,240 --> 00:02:17,220
different than just for
the sake of retrieval.

33
00:02:18,330 --> 00:02:19,202
For example,

34
00:02:19,202 --> 00:02:24,377
maybe we can think about grouping patients
based on exhibited medical conditions.

35
00:02:24,377 --> 00:02:28,489
If we can form this grouping then
we can do things like better

36
00:02:28,489 --> 00:02:32,190
study subpopulations as well as diseases.

37
00:02:32,190 --> 00:02:33,770
So as an example of this,

38
00:02:33,770 --> 00:02:38,680
imagine we have intracranial EEG
recordings from a set of patients.

39
00:02:38,680 --> 00:02:41,320
So here,
we're showing three different brains and

40
00:02:41,320 --> 00:02:44,020
some electrodes placed on these brains.

41
00:02:44,020 --> 00:02:46,810
And from each one of these patients,

42
00:02:47,990 --> 00:02:52,250
we get a set of recordings
of different seizure events.

43
00:02:52,250 --> 00:02:55,360
Then, we can take each one
of these seizure events.

44
00:02:55,360 --> 00:03:00,230
And think about clustering based on
similarities between the events.

45
00:03:00,230 --> 00:03:02,580
And then,
based on these types of clusterings,

46
00:03:02,580 --> 00:03:05,940
we can learn more about
different types of seizures.

47
00:03:05,940 --> 00:03:10,431
We can characterize how many different
types of seizures do we think there are as

48
00:03:10,431 --> 00:03:14,195
well as what are the properties of
each one of these seizure types.

49
00:03:14,195 --> 00:03:18,005
We can also think about clustering
patients based on the types of seizures

50
00:03:18,005 --> 00:03:19,530
that they exhibit.

51
00:03:19,530 --> 00:03:23,820
And better understand different
subpopulations within the class of people

52
00:03:23,820 --> 00:03:25,220
who have seizure disorders.

53
00:03:26,730 --> 00:03:30,600
Another very different application
is looking at products on Amazon.

54
00:03:30,600 --> 00:03:35,520
And maybe here we would like to group
products based on having similar

55
00:03:35,520 --> 00:03:36,380
purchase histories.

56
00:03:36,380 --> 00:03:39,740
And when I talk about purchase
history of a product what I mean is,

57
00:03:39,740 --> 00:03:43,560
if we look at all users that go and
purchase a given item and

58
00:03:43,560 --> 00:03:47,770
look at other items that they purchase
either in that same shopping or

59
00:03:47,770 --> 00:03:53,180
in the recent history and we look at
that as a description of a product.

60
00:03:53,180 --> 00:03:57,140
So this would be aggregated over all
users that purchased this product

61
00:03:57,140 --> 00:04:01,290
then we can think about
clustering similar products.

62
00:04:01,290 --> 00:04:06,796
So for example,
maybe when people go and buy this crib,

63
00:04:06,796 --> 00:04:13,023
we also see that they are buying
a car seat and other baby items.

64
00:04:13,023 --> 00:04:16,968
And so, from this,
maybe we can lump this crib

65
00:04:16,968 --> 00:04:21,451
in with the set of baby items
that are available on Amazon and

66
00:04:21,451 --> 00:04:27,010
even though let's say a seller labeled
this under a category furniture,

67
00:04:27,010 --> 00:04:32,211
one thing we can do with this clustering
is actually either correct or

68
00:04:32,211 --> 00:04:36,360
add a label which is baby,
that this is a baby product.

69
00:04:38,250 --> 00:04:43,330
And of course, we could also use this type
of structure to do product recommendation,

70
00:04:43,330 --> 00:04:47,226
where somebody has certain
things in their shopping cart and

71
00:04:47,226 --> 00:04:51,076
you want to recommend other products
they might be interested in.

72
00:04:51,076 --> 00:04:53,896
And we can also think about
discovering groups of

73
00:04:53,896 --> 00:04:58,519
users that have similar purchase
behaviors, purchase or viewing behaviors.

74
00:04:59,620 --> 00:05:03,950
So there are lots of interesting things we
can do through this clustering structure

75
00:05:03,950 --> 00:05:06,650
extracted both on users and products.

76
00:05:07,750 --> 00:05:12,440
A less obvious example where clustering
can be useful is in the following task.

77
00:05:12,440 --> 00:05:17,149
So imagine that we want to estimate
housing value at very small

78
00:05:17,149 --> 00:05:18,849
spatial locations.

79
00:05:18,849 --> 00:05:23,130
So here, in this image, we're showing the
City of Seattle broken into census tracts.

80
00:05:23,130 --> 00:05:26,180
So these are geographically
small regions and

81
00:05:26,180 --> 00:05:29,880
we want to assess value within
each one of these neighborhoods.

82
00:05:31,560 --> 00:05:36,780
But an issue here is the fact that
there are very few house sales

83
00:05:36,780 --> 00:05:39,090
per region at any given point in time.

84
00:05:39,090 --> 00:05:42,420
And we want to be able to assess
the value in that region.

85
00:05:42,420 --> 00:05:44,750
So how do we do it if
there isn't much data?

86
00:05:44,750 --> 00:05:49,170
Well, one thing we can think about
doing is discover clusters of regions

87
00:05:49,170 --> 00:05:51,990
that historically have behaved similarly.

88
00:05:51,990 --> 00:05:54,520
And if we discover this
cluster of regions,

89
00:05:54,520 --> 00:05:57,740
we can think about sharing
information between these regions,

90
00:05:57,740 --> 00:06:02,280
pooling our observations to form
better estimates of value locally.

91
00:06:03,420 --> 00:06:05,270
A structurally related task but

92
00:06:05,270 --> 00:06:09,460
with a very different application
is in forecasting crimes.

93
00:06:09,460 --> 00:06:13,620
So now what this image is
showing is Washington, DC.

94
00:06:13,620 --> 00:06:15,600
Again, broken down into census tracts and

95
00:06:15,600 --> 00:06:20,450
the goal is to be able to forecast
how many crimes are going to

96
00:06:20,450 --> 00:06:24,100
occur in each one of these census
tracts at the next point in time.

97
00:06:25,120 --> 00:06:30,100
And again, here, we can think about
clustering regions to share information

98
00:06:30,100 --> 00:06:35,750
and you can show that forming
predictions based on a discovered

99
00:06:35,750 --> 00:06:41,460
group of regions that behave similarly,
leads to better forecasts of crime rates,

100
00:06:41,460 --> 00:06:44,060
than if we were to treat
each region independently.

101
00:06:46,130 --> 00:06:49,920
So as we see,
there's a wide range of applications for

102
00:06:49,920 --> 00:06:54,720
clustering and the methods that
we described in this module

103
00:06:54,720 --> 00:06:58,220
extend to any one of these applications.

104
00:06:58,220 --> 00:07:03,216
Obviously with some application
specific tweaks to capture what

105
00:07:03,216 --> 00:07:06,754
it means to be a data
object within a cluster and

106
00:07:06,754 --> 00:07:10,949
what are notions of distances
between these objects.

107
00:07:10,949 --> 00:07:15,229
[MUSIC]