1 00:00:00,000 --> 00:00:07,849 Let's return to technology now and ask about search in a different, private 2 00:00:07,849 --> 00:00:15,176 context, such as searching one's own desktop, one's own email, and other 3 00:00:15,176 --> 00:00:21,174 private data sets. Clearly, indexing the way we described it 4 00:00:21,174 --> 00:00:25,480 at the very beginning of this course will work fine. 5 00:00:26,680 --> 00:00:33,752 But what about relevance? Normally we don't have links between 6 00:00:33,752 --> 00:00:41,039 different documents on our desktop, or hyperlinked emails, so we can't directly 7 00:00:41,039 --> 00:00:46,911 use PageRank, and we need to use other associations. 8 00:00:46,911 --> 00:00:53,620 For example, we need to link documents that talk about the same people or the 9 00:00:53,620 --> 00:01:00,504 same places, or we might use relevance feedback by tracking our own behavior to 10 00:01:00,504 --> 00:01:07,214 see which documents we actually use in response to a bunch of search results, 11 00:01:07,214 --> 00:01:13,575 very similar to how PageRank is being improved by our own use of search 12 00:01:13,575 --> 00:01:19,268 every day. But there are even more problems with 13 00:01:19,268 --> 00:01:24,000 private data. Most of the time, each document has 14 00:01:24,000 --> 00:01:30,954 multiple versions, or different formats for the same document, like PowerPoint 15 00:01:30,954 --> 00:01:35,531 and PDF, and many versions of the same document as 16 00:01:35,531 --> 00:01:40,748 it undergoes editing. So detecting duplicates and handling them 17 00:01:40,748 --> 00:01:48,820 appropriately is very important. Lastly, is search the only paradigm for 18 00:01:48,820 --> 00:01:53,033 finding stuff? And this takes us to 19 00:01:53,033 --> 00:01:58,291 areas such as topic mining, activity mining, and contextual 20 00:01:58,291 --> 00:02:02,914 suggestions. We'll return to some of these advanced 21 00:02:02,914 --> 00:02:07,991 topics very soon. 
But before that, let's make things even 22 00:02:07,991 --> 00:02:10,710 more difficult and talk about 23 00:02:10,710 --> 00:02:14,971 databases, which are used in large enterprises, 24 00:02:14,971 --> 00:02:19,867 and enterprise search using such databases, as well as 25 00:02:19,867 --> 00:02:28,111 a lot of unstructured textual data. Enterprise search poses all the challenges 26 00:02:28,111 --> 00:02:33,226 of private search that we discussed on the previous chart, 27 00:02:33,226 --> 00:02:38,659 and more. For example, the results of a search could 28 00:02:38,659 --> 00:02:44,597 depend on the context in which somebody is performing that search, 29 00:02:44,597 --> 00:02:49,310 and people play multiple roles in an organization. 30 00:02:49,310 --> 00:02:54,966 Sometimes I'm acting as a researcher, sometimes as a teacher, 31 00:02:54,966 --> 00:03:00,115 sometimes as an executive, and so on. Next, 32 00:03:00,845 --> 00:03:04,857 how do you classify large sets of documents? 33 00:03:04,857 --> 00:03:09,599 Each one of us faces challenges classifying our own 34 00:03:09,599 --> 00:03:11,970 documents on our desktops. 35 00:03:11,970 --> 00:03:16,976 The problem becomes even more complicated when you have to classify 36 00:03:16,976 --> 00:03:20,216 documents used by hundreds or thousands of people. 37 00:03:20,216 --> 00:03:25,664 What kind of classification works? Should it be done manually by a central 38 00:03:25,664 --> 00:03:28,389 team? Or can it be done automatically? 39 00:03:28,389 --> 00:03:32,585 Can you have many different classifications depending on 40 00:03:32,585 --> 00:03:36,120 how you want to view a whole bunch of documents? 41 00:03:37,260 --> 00:03:41,395 What about security? Not everybody is allowed to access every 42 00:03:41,395 --> 00:03:44,909 document, or every piece of data, in an organization. 43 00:03:44,909 --> 00:03:48,080 Some things are secret, and some highly secret. 
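To make the security point concrete, here is a minimal sketch in Python of trimming search results by clearance level. The level names (`public`, `secret`, `highly_secret`), the `Document` class, the `search` function, and the sample documents are all hypothetical, made up for illustration; the lecture does not prescribe any particular mechanism.

```python
from dataclasses import dataclass

# Hypothetical clearance levels, ordered from least to most restricted.
LEVELS = {"public": 0, "secret": 1, "highly_secret": 2}


@dataclass(frozen=True)
class Document:
    title: str
    text: str
    level: str  # one of the keys in LEVELS


def search(docs, query, user_level):
    """Return titles of documents that contain every query word and that
    the user is cleared to see, in the order the documents were given."""
    words = query.lower().split()
    allowed = LEVELS[user_level]
    hits = []
    for doc in docs:
        if LEVELS[doc.level] > allowed:
            continue  # skip documents above the user's clearance
        tokens = set(doc.text.lower().split())
        if all(word in tokens for word in words):
            hits.append(doc.title)
    return hits


# Made-up sample data for illustration.
docs = [
    Document("Cafeteria menu", "lunch budget for the cafeteria", "public"),
    Document("Merger plan", "budget for the planned merger", "highly_secret"),
    Document("Salary review", "annual salary budget review", "secret"),
]
```

In this sketch the same query, say "budget", gives a different result list depending on who is asking, because restricted documents are filtered out before anything is returned.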
44 00:03:49,580 --> 00:03:55,340 And lastly, what about structured data, the kind that's found in databases? 45 00:03:56,020 --> 00:04:05,104 Unfortunately, SQL is not the answer. For example, text inside structured 46 00:04:05,104 --> 00:04:11,421 records is not easily searched using SQL, as we'll explain shortly. 47 00:04:11,421 --> 00:04:18,562 Next, linking unstructured documents to structured documents is also important, 48 00:04:18,562 --> 00:04:25,547 and not easily possible. Finally, just searching structured records 49 00:04:25,547 --> 00:04:31,496 and getting a list of related records grouped together as objects is a huge 50 00:04:31,496 --> 00:04:37,523 challenge, which has simply not been satisfactorily resolved yet, and that's what 51 00:04:37,523 --> 00:04:40,420 we'll talk about in our next example.
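One way to see why plain SQL is awkward for keyword search over text fields is the sketch below, which uses Python's built-in `sqlite3` module and a made-up `customers` table. The table, its columns, the data, and the `keyword_search` helper are assumptions for illustration only, and this is not necessarily the explanation the lecture goes on to give. It shows that with `LIKE` the query has to name every text column to look in, that it matches substrings rather than whole words, and that the rows come back unranked.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, city TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO customers (id, name, city, notes) VALUES (?, ?, ?, ?)",
    [
        (1, "Paris Martin", "New York", "prefers email"),
        (2, "John Smith", "Paris", "asked about shipping"),
        (3, "Mary Jones", "London", "wants a price comparison"),
    ],
)


def keyword_search(conn, word):
    """Return ids of rows where any text column contains `word`.

    Every searchable column must be listed by hand, and LIKE matches
    substrings (case-insensitively for ASCII in SQLite's default setup).
    """
    pattern = f"%{word}%"
    rows = conn.execute(
        "SELECT id FROM customers "
        "WHERE name LIKE ? OR city LIKE ? OR notes LIKE ? "
        "ORDER BY id",
        (pattern, pattern, pattern),
    )
    return [row[0] for row in rows]
```

Searching for "Paris" here returns a person named Paris, a customer in the city of Paris, and a note that merely contains the word "comparison", all mixed together with nothing to say which is the better match.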