We see clustering arising in lots of different areas. As an example, let's look at some time series data and something called the hidden Markov model.

Remember that in the clustering methods we've looked at so far, the index of a given data point plays no role in the resulting clustering of that data point. That is, we could permute all our data indices, cluster the permuted data, and get out exactly the same results. But what if the order of the data points actually mattered, as in a time series, where the label, in particular the time stamp associated with each data point, is really critical to the analysis of that data?

For example, here we're looking at a univariate time series: the value of the observation is along the y-axis, and the time stamp of the observation is along the x-axis. Here our goal might be to parse this time series into what I'll call different dynamic states. We can think of these as different clusters appearing in the data set over time. Maybe there's a green cluster, a blue cluster, and a red cluster, and this time series is switching between each of these different states.

If we ignored the time stamp associated with these data points, so think about taking this plot and scrunching it along the x-axis, ignoring the time stamp, just looking at points along a one-dimensional y-axis, and trying to cluster those observations, we'd have a really hard time. That's because the range of values the time series takes when it's in the green state overlaps quite a lot with the set of values it takes when it's in the blue and the red states. All these values would be very overlapped, and without tons and tons of data it would be really hard to tell that there are three different clusters here.

Instead, the structure of this data across time can actually help us uncover this pattern, which appears immediately obvious from looking at the plot. In particular, in this data set, if we're currently in a green state it seems like we're very likely to keep being in a green state.
And then maybe there are certain other transitions between states, between green and blue or blue and red, that are more likely than other transitions. This type of information can help us extract the clustering we're after from this time series data. So in the context of time series, we can think of this as a segmentation task.

To make this more concrete, let's look at the dance of a honeybee. When a honeybee is in the beehive, it switches between three different dances in order to communicate the location of food sources to the other bees in the hive: a waggle dance, a turn-right dance, and a turn-left dance. And it keeps repeating specific patterns of these dances to communicate with the other bees.

Here, what we're showing are three different dance sequences that were segmented into the waggle dance, in red, turn right, in green, and turn left, in blue. But the question is, can we extract this information automatically, just from observations of the honeybee's body position and head angle as it moves around the hive?

In particular, when we're looking at these dances we can make a plot of which dance the honeybee is in at every time step. That indicates the persistence in certain dances and the switches between them, between the waggle, turn-right, and turn-left dances. And what we see from this data is that there are indeed patterns. We see persistence in the different dances: if you're currently doing one type of dance, you're likely to continue doing that dance. And we also see that certain transitions are more likely than others. For example, if I'm currently in a red dance, it's very likely I'll go to a green dance, or from a green dance to a red dance, though obviously other transitions are possible as well.

This notion of switching between different dynamic states appears in lots of different applications, not just the study of honeybees. For example, maybe we have audio of people taking turns speaking in a conference meeting, and we want to be able to segment that audio into who is speaking when during the meeting.
Or maybe we have observations of a person doing a set of exercise routines, and we want to be able to segment out that this person is doing jumping jacks, or running, or squats, and so on. Typically, when people do different exercise routines, they switch between a set of behaviors again, and again, and again.

And likewise, this is a really common way of thinking about stock data, where maybe the stock indices are switching between regimes of high volatility, low volatility, medium volatility, or other levels of volatility.

So, given the broad set of applications where we see this type of structure, let's spend a little bit of time talking about a model that allows us to extract this type of information from data. This model is called a hidden Markov model, or an HMM for short.

An HMM is very, very similar to the type of mixture models we described earlier in this course. Just like in a mixture model, every observation is associated with a cluster indicator. Here we're referring to these as clusters; when you talk about HMMs, that cluster is often described as a state. We've been talking about dynamic states, so we'll use the two terms interchangeably here, just to draw connections with the mixture models we described before. The point is that every observation has a cluster assignment associated with it, and that assignment is unknown to us; all we get is the observed value.

And then, also just like in our mixture model, every cluster or state has a distribution over observed values. Remember that in our mixture model, we said each component might be a Gaussian defining a distribution over the blue intensity in an image, and that distribution would be different for images of clouds versus sunsets versus forests; those would be the different clusters present. Here we have different dynamic states, but given a specific assignment to one of these states or clusters, we have the same type of distribution over the observed value within that state.

But the really critical difference is the fact that the probability of a given cluster assignment depends on the value of the cluster assignment for the previous observation.
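To make that last point concrete, here is a minimal generative sketch of an HMM with three states and Gaussian emissions. The state labels, means, and transition probabilities below are illustrative assumptions, not values from the lecture; the point is simply that each state is drawn conditioned on the previous one.

```python
import numpy as np

# Minimal sketch of the HMM generative process (illustrative numbers only).
# Three dynamic states ("green", "blue", "red"), each with its own Gaussian
# emission distribution, and a transition matrix where self-transitions are
# most likely, so states persist over time.

rng = np.random.default_rng(0)

means = np.array([0.0, 3.0, 6.0])   # per-state emission means (overlapping ranges)
stds = np.array([1.5, 1.5, 1.5])    # per-state emission standard deviations

# transition[i, j] = P(next state = j | current state = i)
transition = np.array([
    [0.90, 0.08, 0.02],
    [0.05, 0.90, 0.05],
    [0.02, 0.08, 0.90],
])
initial = np.array([1 / 3, 1 / 3, 1 / 3])  # distribution over the first state

T = 200
states = np.zeros(T, dtype=int)
obs = np.zeros(T)

states[0] = rng.choice(3, p=initial)
obs[0] = rng.normal(means[states[0]], stds[states[0]])
for t in range(1, T):
    # The key HMM property: the state at time t depends on the state at time t-1.
    states[t] = rng.choice(3, p=transition[states[t - 1]])
    obs[t] = rng.normal(means[states[t]], stds[states[t]])
```

If you discard the states array and look only at obs, you get exactly the overlapping one-dimensional values described above; it is the transition structure across time that makes the three states recoverable.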
And this is how we're going to capture the time dependency. This is how we're going to capture things like the fact that if we're currently in a state, like the waggle dance, we're more likely to still be in that state at the next time step. Or maybe, if we're currently in the red state, we're more likely to move to the green state at the next time step than to transition to the blue state. And so it's through this structure of dependence in the cluster assignment variables across time that we're able to extract this type of time-dependent clustering.

And finally, HMMs have really clever inference algorithms, building on the types of methods we described for mixture models, but with the added challenge of the time dependence in the cluster assignment variables. So for example, you can compute the maximum likelihood estimate of your HMM parameters using an EM algorithm very similar to what we talked about for mixture models. But in this context it's called the Baum-Welch algorithm, for historical reasons: it was developed separately for HMMs, and the connection with expectation maximization was only drawn later.

Another thing you can do with an HMM is compute the most likely state sequence given your model parameters. So you fix your model parameters and look for the most likely sequence of states, or cluster assignments. For that you can use something called the Viterbi algorithm, which is an example of dynamic programming, and which allows you to do this search over state sequences efficiently.

And you can also use dynamic programming to form soft assignments of the cluster variables, using something called the forward-backward algorithm. These soft assignments play a role in the Baum-Welch algorithm, just like the soft assignments played a role in the standard EM algorithm for mixture models we described earlier in the course.

So what we've seen in this section, just using HMMs to motivate this idea, is that the methods we learned in this course lay a very strong foundation for learning other very important algorithms and methods out there in machine learning.
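As a companion to the discussion of the Viterbi algorithm above, here is a minimal sketch of the dynamic-programming recursion for a Gaussian-emission HMM, assuming fixed model parameters like those in the toy example earlier. It works in log space for numerical stability and is meant only to illustrate the recursion, not to reproduce any particular implementation from the lecture.

```python
import numpy as np

def viterbi(obs, initial, transition, means, stds):
    """Most likely state sequence for a Gaussian-emission HMM (illustrative sketch).

    obs: 1-D array of observations; initial: (K,) initial state probabilities;
    transition: (K, K) transition matrix; means, stds: (K,) emission parameters.
    """
    K, T = len(initial), len(obs)

    # Log emission probabilities: log N(obs[t] | means[k], stds[k]), shape (K, T).
    log_emit = (-0.5 * ((obs[None, :] - means[:, None]) / stds[:, None]) ** 2
                - np.log(stds[:, None] * np.sqrt(2 * np.pi)))

    log_trans = np.log(transition)
    log_delta = np.zeros((K, T))          # best log prob of any path ending in state k at time t
    backptr = np.zeros((K, T), dtype=int)  # which previous state achieved that best score

    log_delta[:, 0] = np.log(initial) + log_emit[:, 0]
    for t in range(1, T):
        # Dynamic programming step: extend the best path into each next state.
        scores = log_delta[:, t - 1][:, None] + log_trans  # (prev state, next state)
        backptr[:, t] = np.argmax(scores, axis=0)
        log_delta[:, t] = np.max(scores, axis=0) + log_emit[:, t]

    # Backtrack to recover the most likely state sequence.
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(log_delta[:, -1])
    for t in range(T - 2, -1, -1):
        path[t] = backptr[path[t + 1], t + 1]
    return path
```

Calling viterbi(obs, initial, transition, means, stds) on the toy sequence generated earlier returns an integer state path that can be compared against the true states. The forward-backward algorithm uses the same kind of recursion, but with sums in place of the max and an additional backward pass, producing the soft assignments that Baum-Welch uses rather than a single best path.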