1
00:00:00,000 --> 00:00:05,087
So what we're trying to discern is intent.

2
00:00:06,062 --> 00:00:16,032
What meaning is trying to be conveyed by the writer, the speaker or the actor?

3
00:00:17,055 --> 00:00:23,884
Well, we'll return to the business of trying to figure out what meaning is

4
00:00:23,884 --> 00:00:28,904
trying to be conveyed next week, when we look at extracting

5
00:00:28,904 --> 00:00:35,140
this information from documents. But for the moment, let's turn to trying

6
00:00:35,140 --> 00:00:42,024
to find out other predictors of intent, which might not have anything to do with

7
00:00:42,024 --> 00:00:45,099
the content of a document which is being read,

8
00:00:45,099 --> 00:00:51,058
but with the actions taken to retrieve the document by searching for it.

9
00:00:51,058 --> 00:00:57,082
For example, suppose we are searching for flower, red, gift, cheap. I mean, you are

10
00:00:57,082 --> 00:01:04,021
searching using some combination of these keywords, and the web property, which is

11
00:01:04,021 --> 00:01:09,089
Google, for example, needs to decide whether or not to show you some ads.

12
00:01:09,089 --> 00:01:13,056
In other words, are you a surfer or a shopper?

13
00:01:13,056 --> 00:01:18,080
Are you interested in buying something, or are you just browsing?

14
00:01:18,080 --> 00:01:23,013
Let's see what I get when I search for cheap flowers.

15
00:01:23,013 --> 00:01:27,470
Clearly, it thinks that I'm trying to find some flowers to buy.

16
00:01:27,470 --> 00:01:34,671
On the other hand, if I just search for flowers and red, it shows me a whole bunch

17
00:01:34,671 --> 00:01:40,025
of red flowers, and no ads. Somehow, Google has figured out that if

18
00:01:40,025 --> 00:01:46,374
I'm searching for red flowers, I'm most likely trying to find out some information

19
00:01:46,374 --> 00:01:49,819
about flowers, rather than buy some flowers.

20
00:01:49,819 --> 00:01:56,617
How might it have figured this out?
One way is by looking at past history.

21
00:01:56,617 --> 00:02:01,549
What do people searching for red flowers normally do?

22
00:02:01,549 --> 00:02:07,587
Do they buy stuff or do they not? On the other hand, people searching for

23
00:02:07,587 --> 00:02:13,109
cheap flowers or gift using flowers, do they buy or don't they?

24
00:02:13,109 --> 00:02:19,648
Learning from past experiences is something that we do all our lives.

25
00:02:19,648 --> 00:02:26,894
The field of machine learning is all about teaching computers how to learn from

26
00:02:26,894 --> 00:02:33,136
their past experience or from past data, and as you all know, this is the heart of

27
00:02:33,136 --> 00:02:38,675
web intelligence and is what big data analytics is really all about.

28
00:02:38,675 --> 00:02:44,594
We'll introduce machine learning very simply, in a very basic manner,

29
00:02:44,594 --> 00:02:51,643
by looking at the simple example of how one might guess whether or not to put an

30
00:02:51,643 --> 00:02:57,488
ad, based on the past behavior of many, many searchers using just these four

31
00:02:57,488 --> 00:03:01,407
keywords. So let's suppose we have all the

32
00:03:01,407 --> 00:03:05,820
historical data. For example, somebody who used the

33
00:03:05,820 --> 00:03:11,326
keywords red, flower, gift, and not cheap, and did not buy something.

34
00:03:11,326 --> 00:03:18,037
On the other hand, somebody used the terms red and cheap but did buy something.

35
00:03:18,096 --> 00:03:25,044
And at the same time there may be people using exactly the same combination, such as

36
00:03:25,044 --> 00:03:30,258
red, flower, gift. But these, they bought something, whereas

37
00:03:30,258 --> 00:03:35,581
in this case they didn't buy anything. So you have all this data from which you

38
00:03:35,581 --> 00:03:41,125
want to learn whether or not you should put an ad, given the fact that somebody is

39
00:03:41,125 --> 00:03:45,680
searching with some combination of these four keywords.
40
00:03:45,680 --> 00:03:54,052
In the language of probability theory, what we really want to figure out is the

41
00:03:54,052 --> 00:04:01,538
probability of a buy action, given values yes or no for whether or not the words

42
00:04:01,538 --> 00:04:07,241
red, flowers, gift and cheap are present or absent in the query.

43
00:04:07,241 --> 00:04:13,710
In other words, we're trying to find the conditional probability of a buy given the

44
00:04:13,710 --> 00:04:19,914
values of these four random variables. Let's see what we really need to do.

45
00:04:19,914 --> 00:04:26,369
For each combination of keywords being present or absent,

46
00:04:26,369 --> 00:04:32,998
for example, all yes, three yes and a no, two no and two yes,

47
00:04:32,998 --> 00:04:40,662
we have the probability that there is a buy, and we would also like to figure out

48
00:04:40,662 --> 00:04:46,093
the probability for the same combination that there is not a buy.

49
00:04:48,074 --> 00:04:55,090
Let's see how we might do this, using the data that we have from past history.

50
00:04:55,090 --> 00:05:03,034
Notice that this is a summary table with one entry for each combination, whereas

51
00:05:03,034 --> 00:05:10,022
our historical data probably has millions of entries for each combination.

52
00:05:10,022 --> 00:05:17,038
So this probability is computed by adding up, appropriately, the data from our

53
00:05:17,038 --> 00:05:23,086
historical transactions. Let's look at this pictorially.

54
00:05:23,086 --> 00:05:30,031
Suppose we have N instances or N historical queries.

55
00:05:30,031 --> 00:05:42,007
In R cases, they had the keyword red. In F cases, they had the keyword flower.

56
00:05:42,007 --> 00:05:52,014
Similarly, G for gift and C for cheap. And for K cases there was actually a buy

57
00:05:52,014 --> 00:05:57,035
action. And for N - K cases there was not a buy

58
00:05:57,035 --> 00:06:02,240
action.
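The counting described here, building one summary entry per keyword combination from many historical queries, can be sketched roughly as follows. This is only an illustrative sketch: the `history` records, the helper names `combination` and `buy_probability_table`, and the tiny data set are assumptions made up for the example, not data from the lecture. The estimate for each combination is the number of past queries with that combination that led to a buy, divided by the number of all past queries with that same combination.

```python
from collections import Counter

# The four keywords from the lecture example.
KEYWORDS = ("red", "flower", "gift", "cheap")

# Hypothetical history (made up for illustration): each record is
# (set of keywords used in the query, whether a buy followed).
history = [
    ({"red", "flower", "gift"}, False),
    ({"red", "cheap"}, True),
    ({"red", "flower", "gift"}, True),
    ({"flower", "cheap"}, True),
    ({"red", "flower"}, False),
    ({"red", "flower"}, False),
]


def combination(query_words):
    """Map a query to a tuple of yes/no flags, one per keyword."""
    return tuple(word in query_words for word in KEYWORDS)


def buy_probability_table(records):
    """Estimate P(buy = yes | combination) for every combination seen."""
    totals = Counter()  # queries seen per combination
    buys = Counter()    # queries per combination that led to a buy
    for words, bought in records:
        combo = combination(words)
        totals[combo] += 1
        if bought:
            buys[combo] += 1
    return {combo: buys[combo] / totals[combo] for combo in totals}


table = buy_probability_table(history)
for combo, p_buy in sorted(table.items()):
    flags = dict(zip(KEYWORDS, combo))
    print(flags, "P(buy) =", p_buy, "P(no buy) =", 1 - p_buy)
```

Note that the table only has entries for combinations that actually occur in the history; combinations that were never searched simply have no estimate.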
Well, let's try to find out what the

59
00:06:02,240 --> 00:06:12,072
conditional probability of a buy action is, given that the query had all the keywords

60
00:06:12,072 --> 00:06:17,870
present. So, the query lies in this piece of this

61
00:06:17,870 --> 00:06:27,045
diagram, because all these ovals are overlapping for I cases.

62
00:06:27,045 --> 00:06:39,265
The denominator is the set of all transactions which actually had a yes for

63
00:06:39,265 --> 00:06:43,857
R, F, G, and C. So, it's R + F + G + C.

64
00:06:43,857 --> 00:06:53,440
So, this becomes the conditional probability of a buy given R, F, G, and C

65
00:06:53,440 --> 00:06:59,054
are all yes. Similarly, this part of the diagram, J,

66
00:06:59,054 --> 00:07:09,004
indicates the number of transactions which had a no for

67
00:07:09,004 --> 00:07:14,424
exactly the same combination, R, F, G, and C being yes.

68
00:07:14,424 --> 00:07:23,420
So J / (R + F + G + C) is the conditional probability that buy = no, given that

69
00:07:23,420 --> 00:07:29,841
combination, all yes. So it appears all we need to figure out is

70
00:07:29,841 --> 00:07:35,934
all these values for every possible combination, and we should be able to

71
00:07:35,934 --> 00:07:41,883
decide whether or not to put an ad in front of a particular query sequence.

72
00:07:41,883 --> 00:07:48,058
The trouble is the number of such combinations can be quite large.

73
00:07:48,058 --> 00:07:53,081
How many do you think there are for just four keywords?

74
00:07:54,025 --> 00:08:02,055
Obviously, just sixteen. But suppose we had 1000 keywords; that

75
00:08:02,055 --> 00:08:07,025
suddenly becomes a very, very large number.

76
00:08:07,081 --> 00:08:15,013
Even for a few hundred keywords it becomes extremely difficult to compute all these

77
00:08:15,013 --> 00:08:21,069
conditional probabilities.
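To make the growth concrete: each keyword is either present or absent, so n keywords give 2^n combinations. The short sketch below (the function name `num_combinations` is just a label chosen for this example) prints how many digits that count has for a few keyword-set sizes.

```python
def num_combinations(num_keywords):
    """Each keyword is present or absent, so there are 2**n combinations."""
    return 2 ** num_keywords


print(num_combinations(4))  # 16, as in the four-keyword example

for n in (4, 100, 1000):
    count = num_combinations(n)
    print(n, "keywords:", len(str(count)), "digits in the number of combinations")
```

Python integers have arbitrary precision, so the exact count for 1000 keywords can be computed, even though a table with that many entries could never be stored or filled in.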
More importantly, when the number of

78
00:08:21,069 --> 00:08:28,047
keywords gets really large, there will be many combinations for which you never

79
00:08:28,047 --> 00:08:34,015
actually have any history. For example, you're quite unlikely to find

80
00:08:34,015 --> 00:08:40,655
a query which has red, flowers and, say, MapReduce, web intelligence, all in one

81
00:08:40,655 --> 00:08:47,015
query. You'll just not have such history. Nobody would be searching for this

82
00:08:47,015 --> 00:08:51,155
combination. So even if you had infinite computing

83
00:08:51,155 --> 00:08:56,895
power, you simply don't have enough history to compute all entries in this

84
00:08:56,895 --> 00:08:59,098
conditional probability table.
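One rough way to see why history runs out: every past query contributes exactly one keyword combination, so a history of N queries can cover at most N of the 2^n table entries. The sketch below computes that upper bound; the function name `max_coverage` and the figure of a trillion past queries are illustrative assumptions, not numbers from the lecture.

```python
def max_coverage(num_keywords, num_queries):
    """Upper bound on the fraction of keyword combinations seen in history.

    Each query yields exactly one combination, so at most num_queries
    distinct combinations out of 2**num_keywords can ever be observed.
    """
    total = 2 ** num_keywords
    return min(num_queries, total) / total


# Four keywords: a modest history can cover all 16 combinations.
print(max_coverage(4, 1000))

# 1000 keywords with a hypothetical trillion past queries:
# the covered fraction is vanishingly small.
print(max_coverage(1000, 10 ** 12))
```

The real coverage would be lower still than this bound, since many queries repeat the same popular combinations.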