[MUSIC] We see clustering arising in lots of different areas. So as an example of this, let's look at some time series data and something called the hidden Markov model. Just remember that in the clustering methods we've looked at so far, the index of a given data point plays no role in the resulting clustering of that data point. That is, we could simply permute all our data indices, cluster that permuted data, and we would get out exactly the same results. But what if the order of the data points actually mattered, like in time series, where the label, in particular the time stamp associated with each data point, is really critical to the analysis of that data? So for example, here we're looking at a univariate time series, so the value of the observation is along the y-axis and the time stamp of the observation is along the x-axis. And here our goal might be to parse this time series into what I'll call different dynamic states. We can think of these as just different clusters appearing in the data set over time. So maybe there's a green cluster, a blue cluster, and a red cluster, and this time series is switching between each of these different states. Well, if we ignored the time stamp associated with these data points, so think about taking this plot and scrunching it along the x-axis, ignoring the time stamp and just looking at points along a one-dimensional y-axis, and tried to cluster those observations, we'd have a really hard time. Because the range of values the time series takes when it's in this green state overlaps quite a lot with the set of values the time series takes when it's in the blue and red states. So all these values would be very overlapped, and without tons and tons of data, it would be really hard to distinguish the fact that there are three different clusters here.
But instead, the structure of this data across time can actually help us uncover this pattern, which appears rather immediately obvious from looking at this plot. In particular, in this data set, if we're currently in a green state, it seems like we're very likely to keep being in a green state. And then maybe certain other transitions between states, between green and blue or blue and red, are more likely than other transitions. And this type of information can help us extract the clustering that we're desiring from this time series data. So in the context of time series, we can think of this as a segmentation task. To make this more concrete, let's look at the dance of a honeybee. When a honeybee is in the beehive, it switches between three different dances in order to communicate the location of food sources to other bees in the hive: a waggle dance, a turn-right dance, and a turn-left dance. And it keeps repeating specific patterns of these dances to communicate with the other bees. So here, what we're showing are three different dance sequences that were segmented into the waggle dance, which is the red color, turn right, which is the green color, and turn left, which is the blue color. But the question is, can we extract this information automatically, just from observations of the honeybee's body position and head angle as it's moving around in the hive? In particular, when we're looking at these dances, we can make a plot of which dance the honeybee is in at every time step. And this will indicate the persistence in certain dances and the switches between them, between the waggle, turn-right, and turn-left dances. And what we see from this data is that there are indeed patterns. We see persistence in the different dances: if you're currently doing one type of dance, you're likely to continue doing that dance.
And then we also see that certain transitions are more likely than others. So, for example, if I'm currently in a red dance, it's very likely I'll go to a green dance, or from a green dance to a red dance, though obviously other transitions are possible as well. And this notion of switching between different dynamic states appears in lots of different applications, luckily not just the study of honeybees. So for example, maybe we have audio of people taking turns speaking in a conference meeting, and we want to be able to segment that audio into who is speaking when during the meeting. Or maybe we have observations of a person doing a set of exercise routines, and we want to be able to segment out that this person is doing jumping jacks, or running, or squats, and so on. And typically, when people do different exercise routines, they switch between a set of behaviors again, and again, and again. And likewise, when we're looking at stock data, this is a really common type of approach for thinking about this data, where maybe the stock indices are switching between regimes of high volatility, low volatility, medium volatility, or obviously other levels of volatility. So, given the broad set of applications where we see this type of structure, let's spend a little bit of time talking about a model that can allow us to extract this type of information from data. And this model is called a hidden Markov model, or an HMM for short. An HMM is very, very, very similar to the type of mixture models we described earlier in this course. Just like in a mixture model, every observation is associated with a cluster indicator. So here we're referring to things as clusters, but when you talk about HMMs, that cluster is often described as a state. We talked about these dynamic states, so we'll use the two terms interchangeably here, just to draw connections with the mixture models that we described before.
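To make the idea of persistence and likely transitions concrete, here is a small sketch (my own illustration, not code from the lecture) of how you could estimate a transition matrix from an already-labeled dance sequence, simply by counting how often each state follows each other state; the sequence and state labels below are made up:

```python
# Sketch (not from the lecture): estimate a transition matrix from a
# labeled state sequence by counting transitions and normalizing rows.
# The sequence below is a hypothetical labeled honeybee dance.
import numpy as np

states = ["waggle", "turn_right", "turn_left"]
idx = {s: i for i, s in enumerate(states)}

sequence = ["waggle", "waggle", "waggle", "turn_right", "turn_right",
            "waggle", "waggle", "turn_left", "turn_left", "turn_left"]

# Count how often state `curr` follows state `prev`.
counts = np.zeros((3, 3))
for prev, curr in zip(sequence[:-1], sequence[1:]):
    counts[idx[prev], idx[curr]] += 1

# Normalize each row to get P(next state | current state);
# np.maximum guards against dividing an all-zero row by zero.
row_sums = counts.sum(axis=1, keepdims=True)
transition_matrix = counts / np.maximum(row_sums, 1)
print(transition_matrix)
```

The diagonal entries come out large, reflecting exactly the persistence we see in the honeybee data: whichever dance you're in, you tend to stay in it.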
Okay, but the point is, every observation has a cluster assignment associated with it, and that's unknown to us; all we get is the observed value. And then, also just like in our mixture model, every cluster or state has a distribution over observed values. Remember, in our mixture model we said that each component might be a Gaussian defining a distribution over the blue intensity in an image, and that distribution would be different for images of clouds versus sunsets versus forests, those being the different clusters present. Well, here we have different dynamic states, but given a specific assignment to one of these states or clusters, we have the same type of distribution over the observed value within that state. But the really critical difference is the fact that the probability of a given cluster assignment depends on the value of the cluster assignment for the previous observation. And this is how we're going to capture the time dependency. This is how we're going to capture things like the fact that if we're currently in a state, like the waggle dance, we're more likely to be in that state at the next time step. Or maybe if we're currently in this red state, we're more likely to move to the green state at the next time step than to transition to the blue state. And so it's through this structure of dependency in the cluster assignment variables across time that we're able to extract this type of time-dependent clustering. And finally, HMMs have really clever inference algorithms, building on the types of methods we described for mixture models, but with the added challenge of the time dependence in the cluster assignment variables. So for example, you can compute the maximum likelihood estimate of your HMM parameters using an EM algorithm very similar to what we talked about for mixture models, but in this context it's called the Baum-Welch algorithm for historical reasons.
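This generative structure, where the next state depends only on the current state and each state has its own emission distribution, can be sketched in a few lines. This is my own toy example, not code from the course, and all parameter values here are assumptions chosen for illustration:

```python
# Toy sketch (not from the lecture) of the HMM generative process:
# the next state depends only on the current state, and each state
# has its own Gaussian over the observed value.
import numpy as np

rng = np.random.default_rng(0)

# Row i gives P(next state | current state i); the heavy diagonal
# encodes persistence in each dynamic state.
A = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

# Per-state Gaussian emission parameters; the means are close enough
# that clustering the values alone (ignoring time) would be hard.
means = np.array([0.0, 1.0, 2.0])
std = 0.8

def sample_hmm(T):
    """Sample a state sequence and observations of length T."""
    states = np.zeros(T, dtype=int)
    states[0] = rng.integers(len(means))
    for t in range(1, T):
        # Next state drawn from the row of A for the current state.
        states[t] = rng.choice(len(means), p=A[states[t - 1]])
    obs = rng.normal(means[states], std)  # one emission per time step
    return states, obs

states, obs = sample_hmm(200)
```

Plotting `obs` against time would give exactly the kind of picture described earlier: long runs within one state, with overlapping value ranges across states.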
It was developed separately for HMMs, and then later the connection was drawn with expectation maximization. Another thing you can do in an HMM is compute the most likely state sequence given your model parameters. So you fix your model parameters and you look at the most likely sequence of states, or cluster assignments. For that you can use something called the Viterbi algorithm, which is an example of dynamic programming, which allows you to do this search over the set of state sequences efficiently. And you can also use dynamic programming to form soft assignments of the cluster variables using something called the forward-backward algorithm. These soft assignments play a role in the Baum-Welch algorithm, just like the soft assignments played a role in the standard EM algorithm for mixture models we described earlier in the course. So what we've seen in this section, using HMMs to motivate this idea, is that the methods we learned in this course really lay a very strong foundation for learning other very important algorithms and methods out there in machine learning. [MUSIC]
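As a rough sketch of the Viterbi algorithm mentioned above (my own illustration, not the course's code), here is the dynamic program in log space; the toy parameters at the bottom are assumptions made up for the example:

```python
# Sketch (not from the lecture): Viterbi in log space, computing the
# most likely state sequence under fixed HMM parameters.
import numpy as np

def viterbi(log_pi, log_A, log_lik):
    """log_pi: (K,) initial-state log-probs; log_A: (K, K) transition
    log-probs; log_lik: (T, K) log-likelihood of each observation under
    each state. Returns the most likely length-T state sequence."""
    T, K = log_lik.shape
    delta = np.zeros((T, K))      # best log-prob of a path ending in each state
    back = np.zeros((T, K), int)  # backpointers to recover the path
    delta[0] = log_pi + log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # (K, K): from -> to
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_lik[t]
    path = np.zeros(T, int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):  # trace backpointers from the end
        path[t] = back[t + 1, path[t + 1]]
    return path

# Tiny made-up example: two sticky states, clear emissions.
log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])
log_lik = np.log([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
print(viterbi(log_pi, log_A, log_lik))  # -> [0 0 1 1]
```

Working in log space avoids the numerical underflow you'd get from multiplying many small probabilities over a long sequence, which is why practical implementations are usually written this way.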