So in this sequence of videos, we're going to apply the greedy algorithm design paradigm to a fundamental graph problem, the problem of computing minimum spanning trees. The MST problem is a really fun playground for greedy algorithm design, because it's the singular problem in which pretty much any greedy algorithm you come up with seems to work. So we'll talk about a couple of the famous ones, show why they're correct, and show how they can be implemented using suitable data structures to be blazingly fast. I'll give you the formal problem definition on the next slide, but first let me just say informally what it is we're trying to accomplish. Essentially, what we want to do is connect a bunch of points together as cheaply as possible. And, as usual with an abstract problem, the objects can mean something very literal, so maybe the points we're trying to connect are servers in some computer network, or they could represent something more abstract. Like maybe we have a model of documents, like web pages, where we represent them as points in space, and we want to somehow connect those together. Now, the main reason I'm going to spend time on the minimum spanning tree problem is pedagogical. It's just a great problem for sharpening your skills with greedy algorithm design and proofs of correctness. It'll also give us another opportunity to see the beautiful interplay between data structures and fast implementations of graph algorithms. That said, the minimum spanning tree problem does have applications. One very cool one is in clustering, which I'll talk about in detail in a later video. It also comes up in networking, so if you do a web search on "spanning tree protocol" you'll find some information about that. So, as I said at the beginning, the minimum spanning tree problem is remarkable in that it doesn't just admit one greedy algorithm that's correct; in fact, it admits multiple greedy algorithms that are correct.
We're going to talk about two of them, the two best-known ones, but there are even some others, believe it or not. The first one we're going to discuss, beginning in the next video, is Prim's MST algorithm. This dates back over 50 years, to 1957. In fact, as you'll see, Prim's algorithm shows a remarkable number of similarities with Dijkstra's shortest-path algorithm, so you might not be surprised to know that Dijkstra also independently discovered this algorithm a couple of years later. But it was only noticed much later that this exact same algorithm had first been discovered over 25 years earlier by a mathematician named Jarník. For that reason you'll sometimes hear this called Jarník's algorithm or the Prim-Jarník algorithm. For brevity, and to be consistent with some of the main textbooks in the area, I'm just going to call this Prim's algorithm throughout the lectures. The other algorithm we're going to cover, which is also rightfully famous, is Kruskal's MST algorithm. As far as I know, this was indeed first discovered by Kruskal, at roughly the same time as Prim was doing his algorithm, in the mid-1950s. And in what sense do I say these algorithms are blazingly fast? Well, they run in almost linear time, linear in the number of edges of the graph. Specifically, we'll see how using appropriate data structures gets each of them to run in time O(m log n), where m is the number of edges in the graph and n is the number of vertices. We'll employ data structures to speed up Prim's algorithm in exactly the same way we did for Dijkstra's algorithm, that is, we'll be using the heap data structure. One thing that's cool about Kruskal's algorithm is that it'll give us an opportunity to study a new data structure, namely the union-find data structure, and that's a lot of fun to think about in its own right, as you'll see.
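To make the O(m log n) claim a bit more concrete, here is a rough preview sketch of Kruskal's algorithm in Python with a simple union-find structure. Treat it as an illustration under assumptions, not as the implementation developed in the later videos: the function name `kruskal_mst`, the `(cost, u, v)` edge format, the integer vertex labels, and the toy input are all hypothetical choices made for this sketch. Sorting the edges takes O(m log m) time, which is O(m log n) for graphs without parallel edges, and with union by size each union-find operation costs at most O(log n).

```python
def kruskal_mst(num_vertices, edges):
    """Return (total_cost, tree_edges) for a connected undirected graph.

    `edges` is a list of (cost, u, v) tuples with vertices 0..num_vertices-1.
    Hypothetical helper for illustration only.
    """
    parent = list(range(num_vertices))  # union-find: parent pointers
    size = [1] * num_vertices           # union-find: component sizes

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        # Merge the components of a and b; return False if already merged.
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False
        if size[root_a] < size[root_b]:
            root_a, root_b = root_b, root_a
        parent[root_b] = root_a          # attach smaller tree under larger
        size[root_a] += size[root_b]
        return True

    total = 0
    tree = []
    for cost, u, v in sorted(edges):     # greedy: cheapest edge first
        if union(u, v):                  # keep the edge only if it creates no cycle
            total += cost
            tree.append((u, v))
    return total, tree


# Toy input (an assumption for this sketch): 4 vertices, 5 edges.
toy_edges = [(1, 0, 1), (2, 1, 3), (3, 0, 3), (4, 0, 2), (5, 2, 3)]
total, tree = kruskal_mst(4, toy_edges)
print(total, tree)  # prints: 7 [(0, 1), (1, 3), (0, 2)]
```

The sort dominates the running time here; the cycle check for each edge is just two `find` calls.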
So to put this amazing running time in perspective, I want to emphasize that it's awesome not only in the sense that it's almost linear. It takes barely more time to compute the spanning tree than it does to read the input graph; reading the input graph alone, remember, would take linear time, O(m) time. But moreover, graphs can have an enormous number of spanning trees, an exponential number. So somehow these algorithms are homing in really quickly on a needle in a haystack. There's no way they have time to look at all these spanning trees, and yet they find the one that's the best, the one that's optimal amongst all of them. How do these seemingly magical algorithms do it? Well, to discuss the details, let's start by formalizing the minimum spanning tree, or MST, problem on the next slide. The MST problem is a graph problem, so the main part of the input is a graph comprising vertices and edges. I do want to emphasize that for the MST problem we'll be considering only undirected graphs. This is different, notice, from when we discussed shortest-path problems in part one of the course. There we worked with directed graphs. There is an analogous problem to the minimum spanning tree problem for directed graphs. It's often called the optimal branching problem. And there are fast algorithms for it, but those algorithms are just slightly beyond the scope of this course, so we're not going to cover it. We're going to discuss only undirected graphs, and minimum spanning trees for them. Now, whenever you talk about graph problems, you need to talk about how the graph is actually represented. That's something we discussed at length in part one. If you don't remember, I suggest going back and reviewing the video on graph representations. For the MST problem, we're going to assume that the graph is given as an adjacency list. That means we're given an array of vertices and an array of edges.
And we have pointers wiring vertices to their incident edges and wiring edges back to their two endpoints. In addition to the graph itself, the input includes a cost for each of the edges. We're going to use the notation c_e to denote the cost of an edge e. And in another contrast to our discussion of shortest-path problems, we're actually not going to care whether the edge costs are positive or negative; they can be any numbers whatsoever. So no prizes for guessing what the output's supposed to be, it's right there in the problem definition: the output is supposed to be a minimum-cost spanning tree of the graph. But let's drill down and explain exactly what we mean by that. So first of all, what do we mean by the cost of a tree, or more generally the cost of a subgraph, viewed as a subset of the edges? Well, we're just going to sum up the costs of the edges in the tree that we output. Now the other question is, what do I mean by a tree that spans all vertices? So let me tell you exactly what this means. The subgraph T should have two properties. First of all, there cannot be any cycles, there cannot be any loops in this tree. Second, by spanning all vertices, what I mean is that this subgraph is what's called connected. That is, there's a path, using the edges in T, from any vertex of the graph to any other vertex. That's what it means to span all of the vertices. So for example, consider the following graph with four vertices and five edges. I've labeled each of the five edges with a cost, which in this case is just an integer between one and five. So let's look at some example subgraphs. Let's start with the three edges AB, BD, and CD. This subgraph satisfies properties one and two. That is, it has no cycles, there are no loops, and it spans all of the vertices. If you start at any one of these four vertices, you can get to any of the other three vertices by using only red edges. So in that sense, this red subgraph is a spanning tree.
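The two properties translate fairly directly into code. What follows is a minimal Python sketch, not anything taken from the lecture slides: the helper names `has_cycle`, `spans_all_vertices`, `is_spanning_tree`, and `cost_of` are hypothetical, and the individual edge costs are an assumption. The transcript only says the costs are integers between one and five, so the values below were picked to agree with the totals quoted in the lecture for this example.

```python
# Assumed edge costs for the four-vertex example (not stated explicitly in the
# transcript; chosen to agree with the totals quoted in the lecture).
VERTICES = ["A", "B", "C", "D"]
COSTS = {
    ("A", "B"): 1,
    ("B", "D"): 2,
    ("A", "D"): 3,
    ("A", "C"): 4,
    ("C", "D"): 5,
}


def has_cycle(vertices, edges):
    """Property one fails if some edge joins two already-connected vertices."""
    leader = {v: v for v in vertices}  # naive union-find, fine for tiny inputs

    def find(x):
        while leader[x] != x:
            x = leader[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return True  # u and v were already connected, so this edge closes a cycle
        leader[root_u] = root_v
    return False


def spans_all_vertices(vertices, edges):
    """Property two: every vertex is reachable from the first one via `edges`."""
    neighbors = {v: [] for v in vertices}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    seen = {vertices[0]}
    stack = [vertices[0]]
    while stack:  # depth-first search over the chosen edges only
        x = stack.pop()
        for y in neighbors[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(vertices)


def is_spanning_tree(vertices, edges):
    return not has_cycle(vertices, edges) and spans_all_vertices(vertices, edges)


def cost_of(edges):
    """Cost of a subgraph: the sum of the costs of its edges."""
    return sum(COSTS[e] for e in edges)


red = [("A", "B"), ("B", "D"), ("C", "D")]
print(is_spanning_tree(VERTICES, red), cost_of(red))  # prints: True 8
```

Checking the two properties separately mirrors the definition; for larger inputs one would probably use a faster union-find, but that is beside the point for a four-vertex example.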
However, it is not the minimum-cost spanning tree. There is another spanning tree which is even cheaper, which has a smaller sum of edge costs, namely the edges AC, AB, and BD. This also has no cycles and it's also connected, but the sum of the edge costs is only seven, smaller than the eight of the previous spanning tree. In fact, this pink subgraph is the unique minimum spanning tree of this graph. There is a subgraph with three edges that has an even smaller sum of edge costs, namely the triangle AB, BD, and AD. But this light blue subgraph, this triangle, is not a spanning tree. In fact, it fails on both counts. It does obviously have a cycle, it has a loop; that's what a triangle is by definition, so it fails property one. It's also not connected: there's no way to get from the vertex C to any of the other three vertices by following only light blue edges. It's disconnected, and so it fails property two as well. So the MST problem in general is, you're given an undirected graph, like, for example, this four-node, five-edge graph, or presumably something much larger in an interesting problem, and you're supposed to quickly identify the minimum spanning tree, like in this example the pink subgraph. So what I want to do next is something you're probably quite accustomed to me doing by this point: I want to make a couple of mild simplifying assumptions, just among friends. These assumptions are not important, in the sense that all of the conclusions of these lectures will remain true, will remain valid, even if these assumptions are violated, but they'll make the lectures a little bit easier. They'll allow us to focus on the main points and not get distracted by less relevant details. So here are the two assumptions that we're going to make throughout all of the lectures on minimum spanning trees. The first assumption we're going to make is that the input graph G is itself connected. That is, G contains a path from any vertex to any other vertex. So why am I making this assumption?
Well, if this assumption's violated then the problem isn't even well defined. If the graph isn't connected then certainly none of its subgraphs can connect all of the vertices, so it has no spanning trees and it's not clear what we're trying to do. So those of you who still remember the stuff we covered in part one, in particular graph search, should recognize that this condition's easy to check in a preprocessing step. Just run something like breadth-first search or depth-first search. Remember, we know how to implement those in linear time, and they will, in particular, tell you whether or not the input graph is connected. Now, another thing you might be wondering is, suppose it was disconnected. Then what? Should we really just sort of throw up our hands and give up? Well, you can define a version of the minimum spanning tree problem, a more general one, called minimum spanning forest, where basically you want the minimum-cost subgraph that spans as much stuff as possible. Essentially, you're responsible for computing a spanning tree within each of the connected components of the original graph. And the algorithms I'll show you here, Prim's algorithm and Kruskal's algorithm, are easily modified to solve the more general problem with disconnected input graphs as well. But again, for simplicity, among friends, let's just focus on the connected graph case, which contains all of the main ideas. Our second standing assumption throughout all of the minimum spanning tree lectures will be that in the input graph the edge costs are distinct. You're already used to this sort of no-ties assumption from our foray into scheduling algorithms, and we're going to do something similar here. Now again, this assumption is not important, in the sense that the algorithms that we cover, Prim's algorithm and Kruskal's algorithm, remain correct even if the input has equal-cost edges, irrespective of how ties are broken. So the algorithms are correct as widely as you would want.
I'm not, however, going to prove for you that they are correct with ties. Remember, in our scheduling application it was a little bit easier to get a proof of correctness without ties. I gave you that, and then optionally there was a slightly more complicated argument that handled ties. You can do the same thing here, but I'm just not going to give it to you. I'll leave that for the keen viewer to work out for themselves.
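Going back to the first assumption, the connectivity check mentioned earlier can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `is_connected_bfs` and the dictionary-of-neighbor-lists input format are assumptions made here, a simplification of the adjacency-list representation with explicit edge objects described in the lecture. It runs breadth-first search from an arbitrary vertex and checks that every vertex was reached, which takes time linear in the size of the graph.

```python
from collections import deque


def is_connected_bfs(adj):
    """Return True if the undirected graph `adj` is connected.

    `adj` maps each vertex to a list of its neighbors (hypothetical format).
    """
    if not adj:
        return True  # treat the empty graph as trivially connected
    start = next(iter(adj))       # any starting vertex works
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:     # first time we reach v
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adj)  # connected iff BFS reached every vertex


# The four-vertex example graph with edges AB, BD, AD, AC, CD.
example = {
    "A": ["B", "D", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "A", "C"],
}
print(is_connected_bfs(example))  # prints: True
```

If the check fails, the same search repeated from each unreached vertex would identify the connected components, which is the starting point for the minimum spanning forest variant.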