1
00:00:00,000 --> 00:00:04,145
[MUSIC]

2
00:00:04,145 --> 00:00:08,822
Okay, well we've talked quite exhaustively
about this notion of clustering for

3
00:00:08,822 --> 00:00:12,208
the sake of doing document retrieval,
but there are lots

4
00:00:12,208 --> 00:00:15,595
and lots of other examples
where clustering is useful, and

5
00:00:15,595 --> 00:00:19,110
I wanna take some time just
to describe a few of them.

6
00:00:19,110 --> 00:00:23,212
So one application is for image search.

7
00:00:23,212 --> 00:00:27,068
So imagine you're going and
you're searching,

8
00:00:27,068 --> 00:00:31,666
you go on Google Image search and
you type in the word ocean.

9
00:00:31,666 --> 00:00:37,116
Well it would be really helpful if we
could structure all the images we have by

10
00:00:37,116 --> 00:00:42,500
some set of categories like ocean,
pink flower, dog, sunset, clouds.

11
00:00:43,760 --> 00:00:47,740
So clustering is very helpful for
doing structured search.

12
00:00:49,280 --> 00:00:53,580
Another very different application
is maybe we wanna group patients by

13
00:00:53,580 --> 00:00:55,670
their medical condition.

14
00:00:55,670 --> 00:00:58,610
So here a goal might be to better

15
00:00:58,610 --> 00:01:03,350
characterize subpopulations as
well as different diseases.

16
00:01:03,350 --> 00:01:10,550
So as an example, we can look at a whole
bunch of patients that have seizures.

17
00:01:10,550 --> 00:01:14,840
So these three brains represent
three different patients, and

18
00:01:14,840 --> 00:01:19,830
they have different recording setups that
are measuring their seizure activity.

19
00:01:19,830 --> 00:01:24,500
And so for each of these patients
we get a collection of recordings

20
00:01:24,500 --> 00:01:27,250
of different seizures that
they exhibit over time.

21
00:01:27,250 --> 00:01:30,070
So each one of these
colored squares represents

22
00:01:30,070 --> 00:01:31,650
a different recording of a seizure.

23
00:01:33,500 --> 00:01:36,750
And between these different
patients there might be

24
00:01:36,750 --> 00:01:40,780
similar types of seizures that
appear in these different patients.

25
00:01:42,310 --> 00:01:46,090
And so what we can do is we can take all
of these seizure recordings from these

26
00:01:46,090 --> 00:01:49,730
three different patients and
think about clustering them.

27
00:01:49,730 --> 00:01:54,070
And if we identify different
types of seizures in this way,

28
00:01:54,070 --> 00:01:58,430
this can allow us to better
treat the types of patients that

29
00:01:58,430 --> 00:02:03,620
we're observing based on understanding
what types of seizures they exhibit.

30
00:02:03,620 --> 00:02:08,860
Well another application is thinking about
doing product recommendation on Amazon.

31
00:02:08,860 --> 00:02:14,030
So, for example, on Amazon there
are a lot of third parties that come and

32
00:02:14,030 --> 00:02:17,480
they post some product to be sold.

33
00:02:17,480 --> 00:02:20,310
And they provide a label
of what that product is.

34
00:02:20,310 --> 00:02:25,050
So, for example,
maybe a person wants to sell a crib and

35
00:02:25,050 --> 00:02:30,180
they label the crib, fairly reasonably,
as being a furniture item.

36
00:02:30,180 --> 00:02:33,975
So maybe it gets posted under
the furniture category.

37
00:02:33,975 --> 00:02:38,813
But if, instead,
we look at who purchases this item,

38
00:02:38,813 --> 00:02:41,945
and we look at their purchase history, and

39
00:02:41,945 --> 00:02:46,382
we look at other people with
similar purchase histories, so

40
00:02:46,382 --> 00:02:51,950
maybe the person who purchased this
item also purchased a baby car seat, well

41
00:02:51,950 --> 00:02:57,692
then maybe what we can do is
infer that a better label for this crib,

42
00:02:57,692 --> 00:03:03,528
which had been labeled furniture, is really
"baby product."

43
00:03:03,528 --> 00:03:09,340
So in addition to discovering groups of
products that are related,

44
00:03:09,340 --> 00:03:13,350
based on purchase histories
of these items, we can also

45
00:03:13,350 --> 00:03:16,990
use that to discover groups
of related users on Amazon.

46
00:03:16,990 --> 00:03:20,019
And that can be used for
targeting products to those users.

47
00:03:22,385 --> 00:03:26,150
And finally we can think about
structuring web search results.

48
00:03:26,150 --> 00:03:27,480
So, for example,

49
00:03:27,480 --> 00:03:31,760
search terms can have multiple
meanings like the word "cardinal".

50
00:03:31,760 --> 00:03:38,440
If I type this in to Google, maybe I
mean I want an article about a cardinal,

51
00:03:38,440 --> 00:03:45,060
the bird, maybe about the baseball team,
or about a cardinal, a religious figure.

52
00:03:46,420 --> 00:03:51,450
So if we can structure our
articles based on their content,

53
00:03:51,450 --> 00:03:54,820
using the same types of ideas
we've talked about in this module,

54
00:03:54,820 --> 00:04:00,680
then I can improve the search
results that I provide to people.

55
00:04:00,680 --> 00:04:03,080
And the list of applications goes on and
on.

56
00:04:03,080 --> 00:04:06,629
Another one that's quite
interesting is thinking about

57
00:04:07,880 --> 00:04:10,020
collections of neighborhoods and

58
00:04:10,020 --> 00:04:15,480
there are a few applications where you
want to discover similar neighborhoods.

59
00:04:15,480 --> 00:04:20,020
One is if we wanna estimate the price
of a house at a very small local

60
00:04:20,020 --> 00:04:21,730
regional level.

61
00:04:21,730 --> 00:04:26,390
So in this case, the challenge is
the fact that we only have a few, or

62
00:04:26,390 --> 00:04:30,820
very often, no house sale observations
within a very small neighborhood.

63
00:04:31,860 --> 00:04:35,100
So if we wanna estimate the value
of the house in that neighborhood

64
00:04:35,100 --> 00:04:37,790
at a point in time,
it's very hard to do that

65
00:04:37,790 --> 00:04:42,540
because we have no other houses to base
our estimate off of in that neighborhood.

66
00:04:42,540 --> 00:04:46,310
However, if we can discover
other neighborhoods that have

67
00:04:47,360 --> 00:04:52,590
similar types of house dynamics,
house price dynamics, then

68
00:04:53,630 --> 00:04:58,630
we can come up with a good estimate of
the house price in the neighborhood with few or

69
00:04:58,630 --> 00:05:03,290
no sales by leveraging information
from this other neighborhood

70
00:05:03,290 --> 00:05:09,060
that was discovered to be related
to the current neighborhood.

71
00:05:09,060 --> 00:05:13,480
So, the idea is to discover
clusters of neighborhoods, and

72
00:05:13,480 --> 00:05:17,595
then within those clusters we can share
information like this house sales

73
00:05:17,595 --> 00:05:20,030
information to form better estimates.

74
00:05:21,280 --> 00:05:25,090
So the solution that I'm
describing here is to cluster regions

75
00:05:25,090 --> 00:05:27,980
with similar trends, and
then share information within a cluster.

76
00:05:29,380 --> 00:05:35,580
And the same idea of discovering
related regions can be used for helping

77
00:05:35,580 --> 00:05:40,930
to forecast violent crimes, to better
task police forces to different regions.

78
00:05:40,930 --> 00:05:45,665
So again once we discover different
neighborhoods that have very

79
00:05:45,665 --> 00:05:50,660
similar crime dynamics, we can form
better predictions of the rates

80
00:05:50,660 --> 00:05:53,933
of violent crimes in
those neighborhoods and

81
00:05:53,933 --> 00:05:58,257
then use that information to
task police to those regions.

82
00:05:58,257 --> 00:06:01,929
[MUSIC]