One of the most exciting developments with sequence-to-sequence models has been the rise of very accurate speech recognition. As we're nearing the end of the course, we want to take just a couple of videos to give you a sense of how these sequence-to-sequence models are applied to audio data, such as speech.

So, what is the speech recognition problem? You're given an audio clip, x, and your job is to automatically find a text transcript, y. An audio clip, if you plot it, looks like this: the horizontal axis here is time. What a microphone does is measure minuscule changes in air pressure, and the way you're hearing my voice right now is that your ear is detecting little changes in air pressure, probably generated either by your speakers or by a headset. So an audio clip like this plots the air pressure against time. And if this audio clip is of me saying "the quick brown fox", then hopefully a speech recognition algorithm can input that audio clip and output that transcript.

Because even the human ear doesn't process raw waveforms, but instead has physical structures that measure the intensity of different frequencies, a common pre-processing step for audio data is to take your raw audio clip and generate a spectrogram. This is a plot where the horizontal axis is time, the vertical axis is frequency, and the intensity of different colors shows the amount of energy: how loud the sound is at different frequencies, at different times. These spectrograms, or filter bank outputs as you might also hear people call them, are a commonly applied pre-processing step before audio is passed into a learning algorithm. And the human ear does a computation pretty similar to this pre-processing step.
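To make this pre-processing step concrete, here is a minimal Python sketch of computing a spectrogram with scipy. The sampling rate, window length, and overlap below are my own illustrative choices, not values from the video, and the synthetic waveform just stands in for a real recording.

```python
# A minimal sketch of spectrogram pre-processing, assuming scipy is available.
import numpy as np
from scipy.signal import spectrogram

fs = 16000                        # 16 kHz sampling rate, common for speech
t = np.arange(0, 10.0, 1.0 / fs)  # 10 seconds of audio
# Stand-in waveform: a rising tone plus noise instead of a real recording.
audio = np.sin(2 * np.pi * (200 + 50 * t) * t) + 0.01 * np.random.randn(t.size)

# f: frequency bins, times: frame centers, Sxx: energy per (frequency, time) cell.
# nperseg=400 is a 25 ms window; noverlap=240 gives a 10 ms hop between frames.
f, times, Sxx = spectrogram(audio, fs=fs, nperseg=400, noverlap=240)

# Log-compress the energies, which is closer to how loudness is perceived.
log_Sxx = np.log(Sxx + 1e-10)
print(log_Sxx.shape)  # (frequency bins, time frames): the 2-D image fed to a model
```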
So, one of the most exciting trends in speech recognition is that, once upon a time, speech recognition systems used to be built using phonemes, and these were, I want to say, hand-engineered basic units of sound. So "the quick brown fox" would be represented as phonemes. I'm going to simplify a bit: let's say "the" has a "de" and "e" sound, and "quick" has a "ku", "wu", "ik", "k" sound. Linguists used to write out these basic units of sound and try to break language down into them. So, "brown": these aren't the official phonemes, which are written with more complicated notation, but linguists used to hypothesize that writing down audio in terms of these basic units of sound, called phonemes, would be the best way to do speech recognition.

But with end-to-end deep learning, we're finding that phoneme representations are no longer necessary. Instead, you can build systems that input an audio clip and directly output a transcript, without needing to use hand-engineered representations like these. One of the things that made this possible was moving to much larger data sets. Academic data sets on speech recognition might be as small as 300 hours, and in academia, a 3,000-hour data set of transcribed audio would be considered a reasonable size; a lot of research has been done, and a lot of research papers written, on data sets of several thousand hours. But the best commercial systems are now trained on over 10,000 hours, and sometimes over 100,000 hours, of audio. It's really this move to much larger transcribed audio data sets, where you have both x and y, together with deep learning algorithms, that has driven a lot of progress in speech recognition.

So, how do you build a speech recognition system? In the last video, we talked about the attention model. One thing you could do is exactly that: on the horizontal axis, you take in different time frames of the audio input, and then you have an attention model try to output the transcript, like "the quick brown fox", or whatever was said, as in the sketch below.
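Here is a minimal numpy sketch of a single attention step over audio frames. The dot-product scoring is a simplification: the attention model from the last video computes its alignment scores with a small learned network, and the dimensions here are illustrative, not from the video.

```python
# A minimal sketch of one attention step: at each output step, the decoder
# weighs all encoded audio frames and takes a weighted sum as its context.
import numpy as np

def attention_context(encoder_states, decoder_state):
    # encoder_states: (T_x, d), one encoded vector per audio time frame
    # decoder_state:  (d,), the decoder's current hidden state
    scores = encoder_states @ decoder_state   # (T_x,) alignment scores
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()         # softmax over the time frames
    return weights @ encoder_states           # (d,) context vector for this step

# e.g. 10 seconds of audio at 100 frames per second -> T_x = 1,000 frames
T_x, d = 1000, 32
context = attention_context(np.random.randn(T_x, d), np.random.randn(d))
print(context.shape)  # (32,)
```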
One other method that seems to work well is to use the CTC cost for speech recognition. CTC stands for Connectionist Temporal Classification, and is due to Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. So, here's the idea. Let's say the audio clip was someone saying "the quick brown fox". We're going to use a neural network structured like this, with an equal number of input x's and output y's. I've drawn a simple uni-directional RNN for this, but in practice it will usually be a bidirectional LSTM or bidirectional GRU, and usually a deeper model. But notice that the number of time steps here is very large, and in speech recognition, usually the number of input time steps is much bigger than the number of output time steps. For example, if you have 10 seconds of audio and your features come at 100 hertz, so 100 samples per second, then a 10-second audio clip would end up with 1,000 inputs: it's 100 hertz times 10 seconds. But your output might not have 1,000 characters. So, what do you do?

The CTC cost function allows the RNN to generate an output like "ttt", followed by a special character called the blank character, which we're going to write as an underscore here, then "h_eee___", and then maybe a space, written like this, and then "___qqq__", so the whole thing reads "ttt_h_eee___ ___qqq__". This is considered a correct output for the first part of the phrase: "the", then a space, then the "q" of "quick". The basic rule for the CTC cost function is to collapse repeated characters not separated by "blank".

So, to be clear, I'm using this underscore to denote a special blank character, and that's different from the space character. There is a space here between "the" and "quick", so I should still output a space. But by collapsing repeated characters not separated by blank, this actually collapses the sequence into "t", "h", "e", then a space, then "q". This allows your network to have 1,000 outputs, by repeating characters over the time steps and inserting a bunch of blank characters, and still end up with a much shorter output text transcript.
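To make the collapsing rule concrete, here is a minimal Python sketch. This is only the collapsing convention described above, not the full CTC cost function, and the function name is my own; real CTC implementations typically use a reserved index rather than a literal "_" for the blank.

```python
# A minimal sketch of the CTC collapsing rule: first merge runs of repeated
# characters, then drop the blanks that separated any true repeats.
def ctc_collapse(output, blank="_"):
    collapsed = []
    prev = None
    for ch in output:
        if ch != prev:            # keep a character only when the run changes,
            collapsed.append(ch)  # so "ttt" becomes "t" but "t_t" keeps both t's
        prev = ch
    # Remove blanks after merging; they exist only to separate real repeats.
    return "".join(ch for ch in collapsed if ch != blank)

# The example from the video: a long padded output collapses to the transcript.
print(ctc_collapse("ttt_h_eee___ ___qqq__"))  # -> "the q"
```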
So, this phrase here, "the quick brown fox", including spaces, actually has 19 characters. Even if the neural network is forced to output 1,000 characters, by allowing it to insert blanks and repeated characters it can still represent this 19-character transcript with its 1,000 output values of y. The paper by Alex Graves, as well as the Deep Speech recognition system, which I was involved in, used this idea to build effective speech recognition systems.

So, I hope that gives you a rough sense of how speech recognition models work. Attention-like models and CTC models present two different options for how to go about building these systems. Now, today, building an effective, production-scale speech recognition system is a pretty significant effort and requires a very large data set. But what I'd like to do in the next video is show you how you can build a trigger word detection system, or keyword detection system, which is actually much easier and can be done with even a smaller, more reasonable amount of data. So, let's talk about that in the next video.