1
00:00:00,000 --> 00:00:07,849
Let's return to technology now and ask
about search in a different, private

2
00:00:07,849 --> 00:00:15,176
context, such as searching one's own
desktop, one's own email, and other

3
00:00:15,176 --> 00:00:21,174
private data sets.
Clearly, indexing the way we described it

4
00:00:21,174 --> 00:00:25,480
in the very beginning of this course will
work fine.

5
00:00:26,680 --> 00:00:33,752
But what about relevance?
Normally we don't have links between

6
00:00:33,752 --> 00:00:41,039
different documents on our desktop, or
hyperlinked emails, so we can't directly

7
00:00:41,039 --> 00:00:46,911
use PageRank.
And we need to use other associations.

8
00:00:46,911 --> 00:00:53,620
For example, we need to link documents
that talk about the same people or the

9
00:00:53,620 --> 00:01:00,504
same places or we might use relevance
feedback by tracking our own behavior to

10
00:01:00,504 --> 00:01:07,214
see which documents we actually use in
response to a bunch of search results,

11
00:01:07,214 --> 00:01:13,575
very similar to how PageRank is being
improved by our own use of search

12
00:01:13,575 --> 00:01:19,268
every day.
But there are even more problems with

13
00:01:19,268 --> 00:01:24,000
private data.
Most of the time, each document has

14
00:01:24,000 --> 00:01:30,954
multiple versions, or different formats
for the same document, like PowerPoint

15
00:01:30,954 --> 00:01:35,531
and PDF.
And many versions of the same document as

16
00:01:35,531 --> 00:01:40,748
it undergoes editing.
So detecting duplicates and handling them

17
00:01:40,748 --> 00:01:48,820
appropriately is very important.
Lastly, is search the only paradigm for

18
00:01:48,820 --> 00:01:53,033
finding stuff?
And this takes us to

19
00:01:53,033 --> 00:01:58,291
areas such as topic mining,
activity mining, and contextual

20
00:01:58,291 --> 00:02:02,914
suggestions.
We'll return to some of these advanced

21
00:02:02,914 --> 00:02:07,991
topics very soon.
But before that, let's make things even

22
00:02:07,991 --> 00:02:10,710
more difficult
and talk about

23
00:02:10,710 --> 00:02:14,971
databases, which are used in large
enterprises,

24
00:02:14,971 --> 00:02:19,867
and enterprise search
using such databases, as well as

25
00:02:19,867 --> 00:02:28,111
a lot of unstructured textual data.
Enterprise search poses all the challenges

26
00:02:28,111 --> 00:02:33,226
of private search that we discussed on the
previous chart,

27
00:02:33,226 --> 00:02:38,659
and more.
For example, the results of a search could

28
00:02:38,659 --> 00:02:44,597
depend on the context in which somebody is
performing that search.

29
00:02:44,597 --> 00:02:49,310
And people play multiple roles in an
organization.

30
00:02:49,310 --> 00:02:54,966
Sometimes, I'm acting as a researcher.
Sometimes as a teacher.

31
00:02:54,966 --> 00:03:00,115
Sometimes as an executive, and so on.
Next.

32
00:03:00,845 --> 00:03:04,857
How do you classify
large sets of documents?

33
00:03:04,857 --> 00:03:09,599
Each one of us
faces challenges classifying our own

34
00:03:09,599 --> 00:03:11,970
documents
on our desktops.

35
00:03:11,970 --> 00:03:16,976
The problem becomes even more
complicated when you have to classify

36
00:03:16,976 --> 00:03:20,216
documents
used by hundreds or thousands of people.

37
00:03:20,216 --> 00:03:25,664
What kind of classification works?
Should it be done manually by a central

38
00:03:25,664 --> 00:03:28,389
team?
Or can it be done automatically?

39
00:03:28,389 --> 00:03:32,585
Can you have many different
classifications depending on

40
00:03:32,585 --> 00:03:36,120
how you want to view
a whole bunch of documents?

41
00:03:37,260 --> 00:03:41,395
What about security?
Not everybody's allowed to access every

42
00:03:41,395 --> 00:03:44,909
document, or every piece of data in an
organization.

43
00:03:44,909 --> 00:03:48,080
Some things are secret, and some highly
secret.

44
00:03:49,580 --> 00:03:55,340
And lastly, what about structured data?
The kind that's found in databases.

45
00:03:56,020 --> 00:04:05,104
Unfortunately, SQL is not the answer.
For example, text inside structured

46
00:04:05,104 --> 00:04:11,421
records is not easily searched using
SQL, as we'll explain shortly.

47
00:04:11,421 --> 00:04:18,562
Next, linking unstructured documents to
structured documents is also important,

48
00:04:18,562 --> 00:04:25,547
and not easily possible.
Finally, just searching structured records

49
00:04:25,547 --> 00:04:31,496
and getting a list of related records
grouped together as objects is a huge

50
00:04:31,496 --> 00:04:37,523
challenge, which has simply not been
satisfactorily resolved yet, and that's what

51
00:04:37,523 --> 00:04:40,420
we'll talk about in our next example.