In this module, we will see a really useful model called latent Dirichlet allocation. It is used for topic modeling. In this video, we will see what topic modeling is.

For example, suppose you want to build a recommender system for books. If I read the Sherlock Holmes books, the system would recommend to me, for example, the equatorial books. Let's see how we can do this. We would like to extract features from a book, and we would like these features to correspond to topics: we could have a topic about detectives, a topic about adventures, and a topic about horror. We then try to decompose the book into topics. For example, a Sherlock Holmes book would be 60% detective, 30% adventure, and maybe 10% horror if you read the darker mysteries.

So we can define a document as a distribution over topics: for each document, we assign a probability of encountering each topic. For example, here we encounter the detective topic with probability 0.6 and the adventure topic with probability 0.3.

Now, let's see what topics are. If we have a topic related to sports, we would expect words like football, hockey, golf, score, and so on to appear most often in it. If we have a topic about the economy, we expect money, dollar, euro, bank, and so on to be the most popular words. Finally, for politics, we will have president, USA, union, law, you name it.

Let's see how we can use these topics to generate a text. Say we want to generate the sentence "Football player from USA has salary in dollars." The word football clearly comes from the sports topic, the word USA comes from the politics topic, and the word dollars comes from the economy topic.

So we will define a topic as a distribution over words: for each word in the vocabulary, we assign a probability of encountering this exact word. For example, here the word football occurs in the sports topic in 20 percent of cases, and all other words have lower probability. This definition is actually useful for interpreting topics.
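To make these two definitions concrete, here is a minimal sketch in Python; the topic names, vocabulary, and all the numbers are illustrative stand-ins taken from the example above, not the output of a fitted model.

```python
import numpy as np

# Hypothetical topics and vocabulary, purely for illustration.
topics = ["detective", "adventure", "horror"]
vocabulary = ["football", "hockey", "golf", "score", "money", "dollar"]

# A document is a distribution over topics: non-negative entries
# that sum to one. Sherlock Holmes from the example: 60/30/10.
sherlock = np.array([0.6, 0.3, 0.1])
assert np.isclose(sherlock.sum(), 1.0)

# A topic is a distribution over the whole vocabulary. Here the
# sports topic puts 20% of its mass on "football", and every other
# word gets a lower probability, as in the example.
sports_topic = np.array([0.20, 0.19, 0.18, 0.17, 0.14, 0.12])
assert np.isclose(sports_topic.sum(), 1.0)

# Interpreting a topic: look at its most probable words.
top = np.argsort(sports_topic)[::-1][:3]
print([vocabulary[i] for i in top])  # ['football', 'hockey', 'golf']
```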
The method that we will see does not generate labels for the topics; it only generates the distributions over words. But if we look at the most frequent words, we can assign labels to the topics ourselves. For example, even if I had not told you that the first list of words comes from the sports topic, you could clearly say that, since the words football and hockey are very frequent, the topic is probably about sports.

After we find the topics in the documents, we compute some similarity to say whether a book is similar to the one that you read or not. We assign a vector to each book. For example, the Sherlock Holmes book will have the vector (0.6, 0.3, 0.1), corresponding to the probabilities of the topics in this document; we will call this vector a. In a similar way, we can compute the vector for the equatorial book. Now that we have two vectors, we can compute the distance or the similarity between them, using, for example, the Euclidean distance or the cosine similarity. What we do next is rank the books according to this similarity, for example, and recommend the most similar ones. In our case, to the people who read the Sherlock Holmes book, we will recommend an equatorial book, for example.

So we have two goals. The first is to construct topics: from a collection of documents, we want to find which topics are present in them. We want to do this automatically, in a fully unsupervised way: just from the collection of texts, we want to find the topics and the probabilities of the words in them. The second goal is to assign topics to texts: we would like to decompose an arbitrary book into a distribution over topics. For example, here we decompose the Sherlock Holmes book into three topics with the probabilities above. This is exactly what we will do throughout this module.
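As a quick sketch of the similarity step: assuming the (0.6, 0.3, 0.1) vector for Sherlock Holmes and a second, made-up topic vector for another book, both the Euclidean distance and the cosine similarity are one-liners with NumPy.

```python
import numpy as np

a = np.array([0.6, 0.3, 0.1])  # Sherlock Holmes, from the example
b = np.array([0.5, 0.4, 0.1])  # another book; numbers invented for illustration

euclidean = np.linalg.norm(a - b)                         # smaller = more similar
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # closer to 1 = more similar
print(f"Euclidean distance: {euclidean:.3f}, cosine similarity: {cosine:.3f}")
```

Ranking all candidate books by such a score and returning the top ones is exactly the recommendation step described above.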
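As a preview of how both goals can be met in code, here is a minimal sketch using scikit-learn's LatentDirichletAllocation; the three-document corpus is made up and far too small for a real model, so treat it as an API illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus, purely illustrative; real topic modeling needs far more text.
docs = [
    "football hockey score golf score football",
    "money dollar euro bank money dollar",
    "president usa union law president",
]

# Goal 1: construct topics automatically, in a fully unsupervised way.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

# Each row of lda.components_ is an (unnormalized) distribution over
# words; its largest entries are the topic's most probable words.
words = vectorizer.get_feature_names_out()
for k, row in enumerate(lda.components_):
    print(f"topic {k}:", [words[i] for i in row.argsort()[::-1][:3]])

# Goal 2: decompose an arbitrary document into a distribution over topics.
new_doc = ["football player from usa has salary in dollars"]
print(lda.transform(vectorizer.transform(new_doc)))  # each row sums to 1
```

Note that the model only returns word distributions; attaching human-readable labels like "sports" is still up to us, by inspecting the top words.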