We see clustering arising in lots of different areas. As an example, let's look at some time series data and something called the hidden Markov model.

Remember that in the clustering methods we've looked at so far, the index of a given data point plays no role in the resulting clustering of that data point. That is, we could permute all our data indices, cluster the permuted data, and get out exactly the same results. But what if the order of the data points actually mattered, as in a time series, where the label, in particular the time stamp associated with each data point, is really critical to the analysis of that data?

For example, here we're looking at a univariate time series: the value of the observation is along the y-axis, and the time stamp of the observation is along the x-axis. Here our goal might be to parse this time series into what I'll call different dynamic states. We can think of these as different clusters appearing in the data set over time. Maybe there's a green cluster, a blue cluster, and a red cluster, and this time series is switching between each of these different states.

If we ignored the time stamp associated with these data points, so think about taking this plot and scrunching it along the x-axis, ignoring the time stamp, just looking at points along a one-dimensional y-axis, and trying to cluster those observations, we'd have a really hard time. That's because the range of values the time series takes when it's in the green state overlaps quite a lot with the set of values it takes when it's in the blue and the red states. All these values would be very overlapped, and without tons and tons of data it would be really hard to tell that there are three different clusters here.

Instead, the structure of this data across time can actually help us uncover this pattern, which appears immediately obvious from looking at the plot. In particular, in this data set, if we're currently in a green state it seems like we're very likely to keep being in a green state.
And then maybe there are certain other transitions between states, between green and blue or blue and red, that are more likely than other transitions. This type of information can help us extract the clustering we're after from this time series data. So in the context of time series, we can think of this as a segmentation task.

To make this more concrete, let's look at the dance of a honeybee. When a honeybee is in the beehive, it switches between three different dances in order to communicate the location of food sources to the other bees in the hive: a waggle dance, a turn-right dance, and a turn-left dance. And it keeps repeating specific patterns of these dances to communicate with the other bees.

Here, what we're showing are three different dance sequences that were segmented into the waggle dance, in red, turn right, in green, and turn left, in blue. But the question is, can we extract this information automatically, just from observations of the honeybee's body position and head angle as it moves around the hive?

In particular, when we're looking at these dances we can make a plot of which dance the honeybee is in at every time step. That indicates the persistence in certain dances and the switches between them, between the waggle, turn-right, and turn-left dances. And what we see from this data is that there are indeed patterns. We see persistence in the different dances: if you're currently doing one type of dance, you're likely to continue doing that dance. And we also see that certain transitions are more likely than others. For example, if I'm currently in a red dance, it's very likely I'll go to a green dance, or from a green dance to a red dance, though obviously other transitions are possible as well.

This notion of switching between different dynamic states appears in lots of different applications, not just the study of honeybees. For example, maybe we have audio of people taking turns speaking in a conference meeting, and we want to be able to segment that audio into who is speaking when during the meeting.
Or maybe we have observations of a person doing a set of exercise routines, and we want to be able to segment out that this person is doing jumping jacks, or running, or squats, and so on. Typically, when people do different exercise routines, they switch between a set of behaviors again, and again, and again.

And likewise, this is a really common way of thinking about stock data, where maybe the stock indices are switching between regimes of high volatility, low volatility, medium volatility, or other levels of volatility.

So, given the broad set of applications where we see this type of structure, let's spend a little bit of time talking about a model that allows us to extract this type of information from data. This model is called a hidden Markov model, or an HMM for short.

An HMM is very, very similar to the type of mixture models we described earlier in this course. Just like in a mixture model, every observation is associated with a cluster indicator. Here we're referring to these as clusters; when you talk about HMMs, that cluster is often described as a state. We've been talking about dynamic states, so we'll use the two terms interchangeably here, just to draw connections with the mixture models we described before. The point is that every observation has a cluster assignment associated with it, and that assignment is unknown to us; all we get is the observed value.

And then, also just like in our mixture model, every cluster or state has a distribution over observed values. Remember that in our mixture model, we said each component might be a Gaussian defining a distribution over the blue intensity in an image, and that distribution would be different for images of clouds versus sunsets versus forests; those would be the different clusters present. Here we have different dynamic states, but given a specific assignment to one of these states or clusters, we have the same type of distribution over the observed value within that state.

But the really critical difference is the fact that the probability of a given cluster assignment depends on the value of the cluster assignment for the previous observation.
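To make that last point concrete, here is a minimal generative sketch of an HMM with three states and Gaussian emissions. The state labels, means, and transition probabilities below are illustrative assumptions, not values from the lecture; the point is simply that each state is drawn conditioned on the previous one.

```python
import numpy as np

# Minimal sketch of the HMM generative process (illustrative numbers only).
# Three dynamic states ("green", "blue", "red"), each with its own Gaussian
# emission distribution, and a transition matrix where self-transitions are
# most likely, so states persist over time.

rng = np.random.default_rng(0)

means = np.array([0.0, 3.0, 6.0])   # per-state emission means (overlapping ranges)
stds = np.array([1.5, 1.5, 1.5])    # per-state emission standard deviations

# transition[i, j] = P(next state = j | current state = i)
transition = np.array([
    [0.90, 0.08, 0.02],
    [0.05, 0.90, 0.05],
    [0.02, 0.08, 0.90],
])
initial = np.array([1 / 3, 1 / 3, 1 / 3])  # distribution over the first state

T = 200
states = np.zeros(T, dtype=int)
obs = np.zeros(T)

states[0] = rng.choice(3, p=initial)
obs[0] = rng.normal(means[states[0]], stds[states[0]])
for t in range(1, T):
    # The key HMM property: the state at time t depends on the state at time t-1.
    states[t] = rng.choice(3, p=transition[states[t - 1]])
    obs[t] = rng.normal(means[states[t]], stds[states[t]])
```

If you discard the states array and look only at obs, you get exactly the overlapping one-dimensional values described above; it is the transition structure across time that makes the three states recoverable.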
And this is how we're going to capture the time dependency. This is how we're going to capture things like the fact that if we're currently in a state, like the waggle dance, we're more likely to still be in that state at the next time step. Or maybe, if we're currently in the red state, we're more likely to move to the green state at the next time step than to transition to the blue state. And so it's through this structure of dependence in the cluster assignment variables across time that we're able to extract this type of time-dependent clustering.

And finally, HMMs have really clever inference algorithms, building on the types of methods we described for mixture models, but with the added challenge of the time dependence in the cluster assignment variables. So for example, you can compute the maximum likelihood estimate of your HMM parameters using an EM algorithm very similar to what we talked about for mixture models. But in this context it's called the Baum-Welch algorithm, for historical reasons: it was developed separately for HMMs, and the connection with expectation maximization was only drawn later.

Another thing you can do with an HMM is compute the most likely state sequence given your model parameters. So you fix your model parameters and look for the most likely sequence of states, or cluster assignments. For that you can use something called the Viterbi algorithm, which is an example of dynamic programming, and which allows you to do this search over state sequences efficiently.

And you can also use dynamic programming to form soft assignments of the cluster variables, using something called the forward-backward algorithm. These soft assignments play a role in the Baum-Welch algorithm, just like the soft assignments played a role in the standard EM algorithm for mixture models we described earlier in the course.

So what we've seen in this section, just using HMMs to motivate this idea, is that the methods we learned in this course lay a very strong foundation for learning other very important algorithms and methods out there in machine learning.
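As a companion to the discussion of the Viterbi algorithm above, here is a minimal sketch of the dynamic-programming recursion for a Gaussian-emission HMM, assuming fixed model parameters like those in the toy example earlier. It works in log space for numerical stability and is meant only to illustrate the recursion, not to reproduce any particular implementation from the lecture.

```python
import numpy as np

def viterbi(obs, initial, transition, means, stds):
    """Most likely state sequence for a Gaussian-emission HMM (illustrative sketch).

    obs: 1-D array of observations; initial: (K,) initial state probabilities;
    transition: (K, K) transition matrix; means, stds: (K,) emission parameters.
    """
    K, T = len(initial), len(obs)

    # Log emission probabilities: log N(obs[t] | means[k], stds[k]), shape (K, T).
    log_emit = (-0.5 * ((obs[None, :] - means[:, None]) / stds[:, None]) ** 2
                - np.log(stds[:, None] * np.sqrt(2 * np.pi)))

    log_trans = np.log(transition)
    log_delta = np.zeros((K, T))          # best log prob of any path ending in state k at time t
    backptr = np.zeros((K, T), dtype=int)  # which previous state achieved that best score

    log_delta[:, 0] = np.log(initial) + log_emit[:, 0]
    for t in range(1, T):
        # Dynamic programming step: extend the best path into each next state.
        scores = log_delta[:, t - 1][:, None] + log_trans  # (prev state, next state)
        backptr[:, t] = np.argmax(scores, axis=0)
        log_delta[:, t] = np.max(scores, axis=0) + log_emit[:, t]

    # Backtrack to recover the most likely state sequence.
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(log_delta[:, -1])
    for t in range(T - 2, -1, -1):
        path[t] = backptr[path[t + 1], t + 1]
    return path
```

Calling viterbi(obs, initial, transition, means, stds) on the toy sequence generated earlier returns an integer state path that can be compared against the true states. The forward-backward algorithm uses the same kind of recursion, but with sums in place of the max and an additional backward pass, producing the soft assignments that Baum-Welch uses rather than a single best path.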