Today, we will cover one main idea of statistical machine translation. Imagine you have a sentence, let's say in French or in some other foreign language, and you want its translation into English. How do you do this? Well, you can try to compute the probability of an English sentence given your French sentence, and then take the sentence that maximizes this probability. Sounds very intuitive, right?

Now, let us apply Bayes' rule here. Instead of computing the probability of E given F directly, we will compute the probability of F given E, multiply it by the probability of the English sentence, and normalize by some denominator. Now, do you have any idea how we can further simplify this formula? Well, actually, we can: the denominator does not depend on the English sentence, which means that we can simply drop it.

Now we have this formula, and the question is: why is this easier? Why do we like it more than the original formula? This slide is going to explain why. We have decoupled our complicated problem into two simpler problems, so we now have two models. One problem is language modeling, and you actually know a lot about it already: it is how to produce a meaningful probability for a sentence of words. The other problem is the translation model. This model does not think about producing coherent sentences; it only thinks about whether E is a good translation of F, so that you do not end up with something unrelated to your source sentence. So, you have two models, one about the language and one about the adequacy of the translation. And then you have the argmax, which performs the search over your space and finds the English sentence that gives you the best probability.
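Written out as formulas, the whole derivation fits in a few lines; here E is the English sentence and F is the foreign one, as in the lecture:

```latex
\hat{E} = \arg\max_{E} P(E \mid F)
        = \arg\max_{E} \frac{P(F \mid E)\,P(E)}{P(F)}
        = \arg\max_{E} \underbrace{P(F \mid E)}_{\text{translation model}}\,
                       \underbrace{P(E)}_{\text{language model}}
```

The last step works because P(F) is constant with respect to E, so it cannot change which English sentence wins the argmax.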
Now, I have one more interpretation for you. The noisy channel is a super popular idea, so you definitely need to know about it. And it is actually super simple. You have your source sentence, with some probability of this source sentence, and then it goes through the noisy channel. The noisy channel is represented by the conditional probability of what you get as the output, given your input to the channel. As the output, you obtain your French sentence. So, let's say that your source sentence was corrupted by the channel, and now you have obtained it in French. Translation, then, is just recovering the most probable original sentence.

Now, the rest of the video is about how to model these two probabilities: the probability of the sentence, and the probability of the translation given some sentence.

Okay. First, about the language model. You know a lot about it, since we covered this in week two, so I will have just one slide as a recap. We need to compute the probability of a sentence of words. We apply the chain rule, which factorizes it into the probabilities of the next word given the previous history. You can then use the Markov assumption and end up with n-gram language models, or you can use a neural language model, such as an LSTM, which produces the next word from the previous words.
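As a quick refresher, here is a minimal sketch of what this looks like for n = 2, that is, a bigram model; the toy corpus, the start and end tokens, and the unsmoothed counting are illustrative assumptions (a real model would need smoothing):

```python
from collections import defaultdict

def train_bigram_lm(sentences):
    """Count-based bigram LM: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    return unigram, bigram

def sentence_prob(words, unigram, bigram):
    """Chain rule with a first-order Markov assumption (no smoothing)."""
    prob = 1.0
    words = ["<s>"] + words + ["</s>"]
    for prev, cur in zip(words, words[1:]):
        if unigram[prev] == 0:
            return 0.0  # unseen history: probability collapses without smoothing
        prob *= bigram[(prev, cur)] / unigram[prev]
    return prob

# Toy usage on a hypothetical two-sentence corpus:
corpus = [["i", "like", "tea"], ["i", "like", "coffee"]]
uni, bi = train_bigram_lm(corpus)
print(sentence_prob(["i", "like", "tea"], uni, bi))  # 0.5 for this toy corpus
```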
Now, the translation model. Well, this is not so easy. Imagine you have a sequence of words in one language, and you need to produce the probability of a sequence of words in some other language. For example, take a foreign language, like Russian, and English, and these two sentences. How do you produce this probability? Well, it is not obvious, at least for me. So, let us start at the level of words; we can understand something at the level of the separate words in these sentences.

Okay, what can we do? We can have a translation table. Here, I have the probabilities of Russian words given English words, and they are normalized: each row of this matrix sums to one. These are just word translations that I have learned, or looked up in a dictionary, or built somehow. Okay, this is doable. Now, how do I build the probability of the whole sentence out of these separate word probabilities?

We need some word alignments. The problem is that we can have some reorderings between the languages, like here, or even worse, we can have one-to-many or many-to-one correspondences. For example, the word "appetit" here corresponds to "the appetite," and the English word "with" corresponds to two Russian words. It means that we need some model to build those alignments. Another example would be words that appear or disappear: some articles or some auxiliary words occur in one language and then just vanish in the other. This is exactly what word alignment models are for, and they are the topic of the next video.
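To make the translation table concrete before we finish, here is a toy sketch; every word and probability below is an illustrative assumption, not a value from a real model:

```python
# Toy translation table: rows are English words, entries are
# P(russian_word | english_word), with Russian transliterated for readability.
# All words and numbers here are made up for illustration.
translation_table = {
    "appetite": {"appetit": 0.9, "golod": 0.1},
    "eating":   {"eda": 0.6, "edy": 0.4},
}

# Each row is a probability distribution over Russian words,
# so it has to sum to one, exactly as on the slide.
for english_word, row in translation_table.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, english_word

print(translation_table["appetite"]["appetit"])  # 0.9
```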
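And here is one minimal way to represent a word alignment, as a set of index pairs; the sentence pair is my hypothetical reconstruction of the lecture's example (the proverb "Appetite comes with eating"), with the Russian transliterated:

```python
# Hypothetical sentence pair; only the indices matter for the alignment.
english = ["the", "appetite", "comes", "with", "eating"]
russian = ["appetit", "prikhodit", "vo", "vremya", "edy"]

# An alignment is a set of (english_index, russian_index) pairs.
# "with" linking to two Russian words is a one-to-many case,
# and "the" aligns to nothing at all: it simply vanishes in Russian.
alignment = {(1, 0), (2, 1), (3, 2), (3, 3), (4, 4)}

for e_i, r_j in sorted(alignment):
    print(f"{english[e_i]:>8} -> {russian[r_j]}")
```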